static_assert('×' == 50071);
GCC - warning: multi-byte
Clang - error: character too large
I think instead of "multi-byte" we need "multi-character" - not bytes.
We decode the UTF-8 first; then we have the character and can decide
whether it is multi-character or not.
Decoding '×' consumes the bytes 195 and 151, and the result is the
decoded Unicode value, 215. It is not multi-byte; the 50071 is just the
two bytes packed together: 256*195 + 151 = 50071.
On the other hand, 'ab' is "multi-character", resulting in
256 * 'a' + 'b' = 256*97 + 98 = 24930.
One consequence is that
'ab' == '𤤰'
but I don't think this is a problem. At least everything is defined.
On 13/08/2024 21:33, Keith Thompson wrote:
Bart<bc@freeuk.com> writes:
[...]
What exactly do you mean by multi-byte characters? Is it a literal
such as 'ABCD'? I've no idea what C makes of that.
It's a character constant of type int with an implementation-defined
value. Read the section on "Character constants" in the C standard
(6.4.4.4 in C17).
(With gcc, its value is 0x41424344, but other compilers can and do
behave differently.)
We discussed this at some length several years ago.
[...]
"An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a single
value in the literal encoding (6.2.9) is the numerical value of the
representation of the mapped character in the literal encoding
interpreted as an integer. The value of an integer character constant
containing more than one character (e.g. 'ab'), or containing a
character or escape sequence that does not map to a single value in the
literal encoding, is implementation-defined. If an integer character
constant contains a single character or escape sequence, its value is
the one that results when an object with type char whose value is that
of the single character or escape sequence is converted to type int."
I am suggesting we define this part:
"The value of an integer character constant containing more than one character (e.g. 'ab'), or containing a character or escape sequence that does not map to a single value in the literal encoding, is implementation-defined."
How?
First, all source code should be UTF-8.
Then I am suggesting we decode the bytes first.
For instance, '×' is encoded as 195 and 151. We consume these 2 bytes
and the UTF-8 decoded value is 215.
Then this would be the defined behavior:
static_assert('×' == 215);
On 14/08/2024 10:05, Bart wrote:
On 14/08/2024 12:41, Thiago Adams wrote:
By that you mean the Unicode index. But you say elsewhere that
everything in your source code is UTF8.
215 is the unicode number of the character '×'.
Where then does the 215 appear? Do your char* strings use 215 for ×,
or do they use 195 and 151?
215 is the result of decoding two UTF-8-encoded bytes (195 and 151).
I think this is why C requires those prefixes like u8'...'.
Then this is the defined behavior
static_assert('×' == 215)
This is where you need to decide whether the integer value within
'...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.
Why runtime? It is compile time. This is why source code must be
universally encoded (UTF-8).
On 14/08/2024 12:34, Bart wrote:
In that case I don't understand what you are testing for here. Is it
an error for '×' to be 215, or an error for it not to be?
GCC handles this as multibyte, without decoding.
GCC's result is 50071:
static_assert('×' == 50071);
The explanation is that GCC is doing:
256*195 + 151 = 50071
(Remember, the UTF-8 bytes were 195 and 151.)
On GCC, 'ab' is handled the same way as '×'.
And what is the test for, to ensure encoding is UTF8 in this ...
source file? ... compiler?
MSVC has some checks; I don't know what the logic is.
Where would the 'decoded 215' come into it?
215 is the value after decoding the UTF-8 and producing the Unicode value.
So my suggestion is: decode first.
The bad part of my suggestion is that we may have two different ways of
producing the same value. For instance, the number generated by 'ab' is
the same as the one for '𤤰':
'ab' == '𤤰'
On 14/08/2024 14:07, Bart wrote:
That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
clash with some other Unicode character.
My suggestion again. I am using a string here, but imagine this working
with bytes from a file.
#include <stdio.h>

/* Minimal UTF-8 decoder: reads one code point into *uc and returns the
   position just after it, or NULL at the end of the string.
   (For brevity it does not validate ill-formed input.) */
static const char *utf8_decode(const char *s, int *uc)
{
    const unsigned char *p = (const unsigned char *)s;
    if (p[0] == '\0')
        return NULL;
    if (p[0] < 0x80) {              /* 1-byte sequence (ASCII) */
        *uc = p[0];
        return s + 1;
    }
    if (p[0] < 0xE0) {              /* 2-byte sequence */
        *uc = ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        return s + 2;
    }
    if (p[0] < 0xF0) {              /* 3-byte sequence */
        *uc = ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        return s + 3;
    }
    /* 4-byte sequence */
    *uc = ((p[0] & 0x07) << 18) | ((p[1] & 0x3F) << 12) |
          ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    return s + 4;
}

int get_value(const char *s0)
{
    const char *s = s0;
    int value = 0;
    int uc;
    s = utf8_decode(s, &uc);
    while (s)
    {
        if (uc < 0x7F)
        {
            /* ASCII: apply the multi-character formula */
            value = value * 256 + uc;
        }
        else
        {
            /* non-ASCII: single character, take the code point */
            value = uc;
            break; /* if there is more input after this, it is an error */
        }
        s = utf8_decode(s, &uc);
    }
    return value;
}

int main(void)
{
    /* casts for C23, where u8 literals are unsigned */
    printf("%d\n", get_value((const char *)u8"×"));  /* 215 */
    printf("%d\n", get_value((const char *)u8"ab")); /* 24930 */
}
On 14/08/2024 15:12, Bart wrote:
On 14/08/2024 18:40, Thiago Adams wrote:
I see your problem. You're mixing things up.
The objectives are:
- make single characters have the Unicode value without having to use U''
- allow more than one char, like 'ab', in some cases where each
character is less than 0x7F. This can break code, for instance '¼¼',
but I suspect people are not using it that way (I hope).
In any case, my suggestion looks dangerous. But meanwhile, this is not
well specified in the standard.
I get the impression that C's wide characters are intended for those
Unicode indices, but that's not going to work well on Windows with its
16-bit wide character type.