• Re: multi bytes character - how to make it defined behavior?

    From Bart@21:1/5 to Thiago Adams on Wed Aug 14 00:52:13 2024
    On 13/08/2024 15:45, Thiago Adams wrote:
    static_assert('×' == 50071);

    GCC -  warning multi byte
    CLANG - error character too large

    I think instead of "multi bytes" we need "multi characters" - not bytes.

    We decode utf8 then we have the character to decide if it is multi char
    or not.

    decoding '×' would consume bytes 195 and 151; the result is the decoded
    Unicode value 215.

    It is not multi-byte: 256*195 + 151 = 50071

    On the other hand, 'ab' is "multi character", resulting in

    256*'a' + 'b' = 256*97 + 98 = 24930

    One consequence is that

    'ab' == '𤤰'

    But I don't think this is a problem. At least everything is defined.

    What exactly do you mean by multi-byte characters? Is it a literal such
    as 'ABCD'?

    I've no idea what C makes of that, so you will first have to specify
    what it might represent:

    * Is it a single character represented by multiple bytes?

    * If so, do those multiple bytes specify a Unicode number (2-3 bytes),
    or a UTF8 sequence (up to 4 bytes, maybe more)?

    * If such multi-byte sequences are allowed, could you have more than
    one of them, mixing ASCII/Unicode/UTF8 characters?

    One problem with UTF8 in C character literals is that I believe those
    are limited to an 'int' type, so 32 bits. You can't fit much in there.
    And once you have such a value, how do you print it?
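    To illustrate the printing problem, here is a minimal sketch. It
    assumes gcc's packing, where the last character of the literal lands
    in the least significant byte (other compilers may differ, as noted);
    put_packed is my own name, not a standard function:

    #include <stdio.h>

    /* Unpack a multi-character constant's bytes, most significant
       first, and write them out; for UTF8 content this happens to
       reproduce the original byte sequence. */
    static void put_packed(unsigned v) {
        char buf[5];
        int n = 0;
        for (int shift = 24; shift >= 0; shift -= 8) {
            unsigned b = (v >> shift) & 0xFF;
            if (b || n) buf[n++] = (char)b;  /* skip leading zero bytes */
        }
        fwrite(buf, 1, n, stdout);
        putchar('\n');
    }

    int main(void) {
        put_packed('×');    /* with gcc: bytes C3 97, prints × */
        put_packed('ab');   /* prints ab */
    }

    On a UTF8 terminal that prints the × back, but only because gcc's
    packing happens to preserve the byte order.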

    Some of this you can take care of in your 'cake' product, and
    superimpose a particular spec on top of C (maybe character constants
    can be extended to 64 bits), but you probably can't do much about
    'printf'.

    (In my language, I overhauled this part of it earlier this year. There
    it works like this:

    * Character literals can be 64 bits

    * They can represent up to 8 ASCII characters: 'ABCDEFGH'

    * They can include escape codes for both Unicode and UTF8, and multiple
    such characters can be specified:

    'A\u20ACB' # All represent A€B; this is Unicode
    'A\h E2 82 AC\B' # This is UTF8
    'A\xE2\x82\xACB' # C-style escape

    Internally they are stored as UTF8, so the 20AC is converted to UTF8

    * The ordering of the characters matches that of the equivalent
    "A\e20ACB" string when stored in memory; but this applies only on
    little-endian machines

    * Print routines have options to print the first character (which can be
    a Unicode one), or the whole sequence)

    Another aspect is when typing Unicode text directly via your text
    editor instead of using escape codes; will the C source be UTF8, or
    some other encoding? This will affect how the text is represented, and
    how much you can fit into one 32/64-bit literal.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Thiago Adams on Wed Aug 14 01:32:14 2024
    Thiago Adams <thiago.adams@gmail.com> writes:

    static_assert('×' == 50071);

    static_assert(U'×' == 215);

    works, but then I don't know what you were trying to do.

    GCC - warning multi byte
    CLANG - error character too large

    I think instead of "multi bytes" we need "multi characters" - not
    bytes.

    We decode utf8 then we have the character to decide if it is multi char or not.

    These terms can be confusing and I don't know exactly how you are using
    them. Basically I simply don't know what that second sentence is
    saying.

    decoding '×' would consume bytes 195 and 151; the result is the decoded Unicode value 215.

    Yes, Unicode 215 is UTF-8 encoded as two bytes with values 195 and 151.
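    A quick way to check that is to hand-encode the code point: for values
    below U+0800, UTF-8 uses the two-byte form 110xxxxx 10xxxxxx (a
    minimal sketch of my own):

    #include <stdio.h>

    int main(void) {
        unsigned cp = 215;                 /* U+00D7, '×' */
        unsigned b1 = 0xC0 | (cp >> 6);    /* 110xxxxx: top 5 bits */
        unsigned b2 = 0x80 | (cp & 0x3F);  /* 10xxxxxx: low 6 bits */
        printf("%u %u\n", b1, b2);         /* prints: 195 151 */
    }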

    It is not multi-byte: 256*195 + 151 = 50071

    If that × is UTF-8 encoded then it might look, to the compiler, just
    like an old-fashioned multi-character character constant, as 'ab'
    does. Then again, it might not. gcc and clang take different views on
    the matter.

    You can get clang to take the same view as gcc by writing

    static_assert('\xC3\x97' == 50071);

    instead. Now both gcc and clang see it for what it is: an old-fashioned multi-character character constant.

    On the other hand, 'ab' is "multi character", resulting in

    The term for these things used to be "multi-byte character constant"
    and they were highly non-portable. The trouble is that the term
    "multi-byte character" now refers to highly portable encodings like
    UTF-8. Maybe that's why gcc seems to have changed its warning from
    what you gave to:

    warning: multi-character character constant [-Wmultichar]

    256*'a' + 'b' = 256*97 + 98 = 24930

    One consequence is that

    'ab' == '𤤰'

    But I don't think this is a problem. At least everything is defined.


    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to Thiago Adams on Tue Aug 13 23:44:24 2024
    On 8/13/24 10:45 AM, Thiago Adams wrote:
    static_assert('×' == 50071);

    GCC -  warning multi byte
    CLANG - error character too large

    I think instead of "multi bytes" we need "multi characters" - not bytes.

    We decode utf8 then we have the character to decide if it is multi char
    or not.

    decoding '×' would consume bytes 195 and 151; the result is the decoded
    Unicode value 215.

    It is not multi-byte: 256*195 + 151 = 50071

    On the other hand, 'ab' is "multi character", resulting in

    256*'a' + 'b' = 256*97 + 98 = 24930

    One consequence is that

    'ab' == '𤤰'

    But I don't think this is a problem. At least everything is defined.

    When you use the single quotes by themselves ('), you are specifying
    characters in the narrow character set, typically ASCII, but possibly
    some other 8-bit character encoding. It cannot specify extended
    characters beyond those.

    You can (if the implementation allows it) place multiple characters in
    the constant to get an integer value with those characters packed.

    When you use the double quotes by themselves ("), you are specifying a
    string of these narrow characters, although this form might allow for
    multi-byte encodings of some characters, as is done with UTF-8.

    You can specify wide character constants with the syntax L'x', u'x',
    or U'x'.

    L'x' will give you whatever the implementation calls its "wide
    character set". This MIGHT be UCS-2/UTF-16 or UCS-4/UTF-32 encoded, but
    doesn't need to be.

    The u'x' form will always be UCS-2/UTF-16, and U'x' will always be
    UCS-4/UTF-32.

    Like the plain 'x' form, the result from a single character cannot be
    a multi-unit value, so u'x' can't generate a surrogate pair for a
    single source character.

    Change the ' to a " and you get wide strings, just like the characters,
    but now u"xx" and L"xx" can generate characters that use surrogate
    pairs (or other multi-part encodings for L"xxx").
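    A small illustration of those forms (my own sketch; as described
    above, what L'' gives you is implementation-defined):

    #include <stdio.h>
    #include <uchar.h>
    #include <wchar.h>

    int main(void) {
        char     n = 'x';    /* narrow character set */
        wchar_t  w = L'×';   /* whatever the implementation's wide set is */
        char16_t s = u'×';   /* one UTF-16 code unit: 0x00D7 */
        char32_t d = U'𤤰';  /* one UTF-32 unit: 0x24930 */
        /* u'𤤰' would be an error: it needs a surrogate pair, and a
           u'' constant must be a single 16-bit unit. In a string it
           is fine: */
        char16_t pair[] = u"𤤰"; /* 0xD852 0xDD30 plus terminator */
        printf("%zu\n", sizeof pair / sizeof pair[0]); /* prints 3 */
        (void)n; (void)w; (void)s; (void)d;
    }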

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to Thiago Adams on Wed Aug 14 14:05:22 2024
    On 14/08/2024 12:41, Thiago Adams wrote:
    On 13/08/2024 21:33, Keith Thompson wrote:
    Bart<bc@freeuk.com>  writes:
    [...]
    What exactly do you mean by multi-byte characters? Is it a literal
    such as 'ABCD'?

    I've no idea what C makes of that,
    It's a character constant of type int with an implementation-defined
    value.  Read the section on "Character constants" in the C standard
    (6.4.4.4 in C17).

    (With gcc, its value is 0x41424344, but other compilers can and do
    behave differently.)

    We discussed this at some length several years ago.

    [...]


    "An integer character constant has type int. The value of an integer character constant containing
    a single character that maps to a single value in the literal encoding (6.2.9) is the numerical value
    of the representation of the mapped character in the literal encoding interpreted as an integer.
    The value of an integer character constant containing more than one
    character (e.g. ’ab’), or
    containing a character or escape sequence that does not map to a single
    value in the literal encoding,
    is implementation-defined. If an integer character constant contains a
    single character or escape
    sequence, its value is the one that results when an object with type
    char whose value is that of the
    single character or escape sequence is converted to type int."


    I am suggesting we define this:

    "The value of an integer character constant containing more than one character (e.g. ’ab’), or containing a character or escape sequence that does not map to a single value in the literal encoding, is implementation-defined."

    How?

    First, all source code should be utf8.

    Then I am suggesting we first decode the bytes.

    For instance, '×' is encoded as bytes 195 and 151. We consume these 2
    bytes and the utf8 decoded value is 215.

    By that you mean the Unicode index. But you say elsewhere that
    everything in your source code is UTF8.

    Where then does the 215 appear? Do your char* strings use 215 for ×, or
    do they use 195 and 151?

    I think this is why C requires those prefixes like u8'...'.


    Then this is the defined behavior

    static_assert('×' == 215)

    This is where you need to decide whether the integer value within '...',
    AT RUNTIME, represents the Unicode index or the UTF8 sequence.

    (In my language, though I do very little with Unicode ATM, I decided
    that everything is UTF8 both at compile time and runtime. Unless I
    explicitly expand a UTF8 u8[] string to a u32[] or i32[] array (either
    will work), which contains 21-bit Unicode index values.)
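    A C analogue of that expansion can be sketched with C11's mbrtoc32;
    this is my own sketch, and it assumes a UTF-8 locale is available
    under the name used here:

    #include <stdio.h>
    #include <string.h>
    #include <locale.h>
    #include <uchar.h>

    int main(void) {
        setlocale(LC_ALL, "C.UTF-8");  /* assumption: a UTF-8 locale */
        const char *s = "A×B";
        char32_t out[8];
        mbstate_t st = {0};
        size_t n = 0, r;
        /* decode the narrow UTF-8 string into 32-bit code points */
        while ((r = mbrtoc32(&out[n], s, strlen(s) + 1, &st)) != 0) {
            if (r == (size_t)-1 || r == (size_t)-2) break; /* malformed */
            s += r;
            n++;
        }
        for (size_t i = 0; i < n; i++)
            printf("U+%04X\n", (unsigned)out[i]); /* U+0041 U+00D7 U+0042 */
    }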

    I get the impression that C's wide characters are intended for those
    Unicode indices, but that's not going to work well on Windows with its
    16-bit wide character type.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to Thiago Adams on Wed Aug 14 16:34:07 2024
    On 14/08/2024 14:31, Thiago Adams wrote:
    On 14/08/2024 10:05, Bart wrote:
    On 14/08/2024 12:41, Thiago Adams wrote:
    On 13/08/2024 21:33, Keith Thompson wrote:
    Bart<bc@freeuk.com>  writes:
    [...]
    What exactly do you mean by multi-byte characters? Is it a literal
    such as 'ABCD'?

    I've no idea what C makes of that,
    It's a character constant of type int with an implementation-defined
    value.  Read the section on "Character constants" in the C standard
    (6.4.4.4 in C17).

    (With gcc, its value is 0x41424344, but other compilers can and do
    behave differently.)

    We discussed this at some length several years ago.

    [...]


    "An integer character constant has type int. The value of an integer
    character constant containing
    a single character that maps to a single value in the literal
    encoding (6.2.9) is the numerical value
    of the representation of the mapped character in the literal encoding
    interpreted as an integer.
    The value of an integer character constant containing more than one
    character (e.g. ’ab’), or
    containing a character or escape sequence that does not map to a
    single value in the literal encoding,
    is implementation-defined. If an integer character constant contains
    a single character or escape
    sequence, its value is the one that results when an object with type
    char whose value is that of the
    single character or escape sequence is converted to type int."


    I am suggesting we define this:

    "The value of an integer character constant containing more than one
    character (e.g. ’ab’), or containing a character or escape sequence
    that does not map to a single value in the literal encoding, is
    implementation-defined."

    How?

    First, all source code should be utf8.

    Then I am suggesting we first decode the bytes.

    For instance, '×' is encoded as bytes 195 and 151. We consume these 2
    bytes and the utf8 decoded value is 215.

    By that you mean the Unicode index. But you say elsewhere that
    everything in your source code is UTF8.


    215 is the unicode number of the character '×'.

    Where then does the 215 appear? Do your char* strings use 215 for ×,
    or do they use 195 and 151?

    215 is the result of decoding two utf8 encoded bytes. (195 and 151)

    I think this is why C requires those prefixes like u8'...'.


    Then this is the defined behavior

    static_assert('×' == 215)

    This is where you need to decide whether the integer value within
    '...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.

    Why runtime? It is compile time. This is why source code must be
    universally encoded (utf8).


    In that case I don't understand what you are testing for here. Is it an
    error for '×' to be 215, or an error for it not to be?

    And what is the test for, to ensure encoding is UTF8 in this ... source
    file? ... compiler?

    Where would the 'decoded 215' come into it?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to Thiago Adams on Wed Aug 14 18:07:26 2024
    On 14/08/2024 17:10, Thiago Adams wrote:
    On 14/08/2024 12:34, Bart wrote:

    In that case I don't understand what you are testing for here. Is it
    an error for '×' to be 215, or an error for it not to be?


    GCC handles this as multi-byte, without decoding.

    The result from GCC is 50071:
    static_assert('×' == 50071);

    The explanation is that GCC is doing:

    256*195 + 151 = 50071

    So the 50071 is the 2-byte UTF8 sequence.



    (Remember the utf8 bytes were 195 151)

    The way 'ab' is handled is the same as '×' on GCC.

    I don't understand. 'a' and 'b' each occupy one byte. Together they need
    two bytes.

    Where's the problem? Are you perhaps confused as to what UTF8 is?

    The 50071 above is much better expressed as hex: C397, which is two
    bytes. Since both values are in 128..255, they are UTF8 codes, here
    expressing a single Unicode character.

    Given any two bytes in UTF8, it is easy to see whether they are two
    ASCII characters, or one (or part of) a Unicode character, or one ASCII
    character followed by the first byte of a UTF8 sequence, or whether
    they are malformed (eg. the middle of a UTF8 sequence).
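    That check is mechanical. A sketch of the byte classification (my own,
    for illustration):

    #include <stdio.h>

    static const char *kind(unsigned char b) {
        if (b < 0x80) return "ASCII";
        if (b < 0xC0) return "continuation byte";
        if (b < 0xE0) return "lead byte of a 2-byte sequence";
        if (b < 0xF0) return "lead byte of a 3-byte sequence";
        if (b < 0xF8) return "lead byte of a 4-byte sequence";
        return "not valid in UTF8";
    }

    int main(void) {
        unsigned char bytes[] = { 'a', 0xC3, 0x97, 0x97 };
        for (size_t i = 0; i < sizeof bytes; i++)
            printf("%02X: %s\n", bytes[i], kind(bytes[i]));
        /* the final 0x97 is a continuation byte with no lead byte
           before it, which is exactly how malformed input shows up */
    }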

    There is no confusion.



    And what is the test for, to ensure encoding is UTF8 in this ...
    source file? ... compiler?

    MSVC has some checks; I don't know what the logic is.


    Where would the 'decoded 215' come into it?

    215 is the value after decoding utf8 and producing the unicode value.

    Who or what does that, and for what purpose? From what I've seen, only
    you have introduced it.

    So my suggestion is decode first.

    Why? What are you comparing? Both sides of == must use UTF8 or Unicode,
    but why introduce Unicode at all if apparently everything in source code
    and at compile time, as you yourself have stated, is UTF8?

    The bad part of my suggestion is that we may have two different ways
    of producing the same value.

    For instance the number generated by 'ab' is the same as the one for
    '𤤰':

    'ab' == '𤤰'

    I don't think so. If I run this program:

    #include <stdio.h>

    int main(void) {
        printf("%d\n",   '×');
        printf("%04X\n", (unsigned)'×');
        printf("%d\n",   'ab');
        printf("%04X\n", (unsigned)'ab');
        printf("%d\n",   '𤤰');
        printf("%04X\n", (unsigned)'𤤰');
    }


    I get this output (I've left out the decimal versions for clarity):

    C397 ×

    6162 ab

    F0A4A4B0 𤤰

    That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
    clash with some other Unicode character.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to Thiago Adams on Wed Aug 14 19:12:34 2024
    On 14/08/2024 18:40, Thiago Adams wrote:
    On 14/08/2024 14:07, Bart wrote:

    That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
    clash with some other Unicode character.



    My suggestion again. I am using a string, but imagine this working
    with bytes from a file.


    #include <stdio.h>
    #include <assert.h>

    ...
    int get_value(const char* s0)
    {
        const char* s = s0;
        int value = 0;
        int uc;
        s = utf8_decode(s, &uc);
        while (s)
        {
            if (uc < 0x80)
            {
                // ASCII: apply the multi-char packing formula
                value = value*256 + uc;
            }
            else
            {
                // non-ASCII: the decoded code point is the value
                value = uc;
                break; // if more input follows this, it should be an error
            }
            s = utf8_decode(s, &uc);
        }
        return value;
    }

    int main() {
        printf("%d\n", get_value(u8"×"));
        printf("%d\n", get_value(u8"ab"));
    }
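    For the "..." above, a utf8_decode along these lines is presumably
    intended. This is my own hypothetical version matching the signature
    used, with no validation of malformed input:

    /* hypothetical decoder, not the poster's actual one */
    const char* utf8_decode(const char* s, int* out)
    {
        unsigned char b = (unsigned char)*s;
        if (b == 0) return NULL;                   /* end of string */
        if (b < 0x80) { *out = b; return s + 1; }  /* ASCII */
        if (b < 0xE0) {                            /* 2-byte form */
            *out = (b & 0x1F) << 6 | (s[1] & 0x3F);
            return s + 2;
        }
        if (b < 0xF0) {                            /* 3-byte form */
            *out = (b & 0x0F) << 12 | (s[1] & 0x3F) << 6 | (s[2] & 0x3F);
            return s + 3;
        }
        *out = (b & 0x07) << 18 | (s[1] & 0x3F) << 12  /* 4-byte form */
             | (s[2] & 0x3F) << 6 | (s[3] & 0x3F);
        return s + 4;
    }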

    I see your problem. You're mixing things up.

    gcc will combine BYTE values together (by shifting by 8 bits or
    multiplying by 256), including the individual bytes that represent UTF8.

    You are combining ONLY ASCII bytes, and comparing the results with
    21-bit Unicode values.

    That is meaningless. I'm not surprised you get a clash between A*256+B,
    and some arbitrary Unicode index.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to Thiago Adams on Wed Aug 14 20:32:31 2024
    On 14/08/2024 19:28, Thiago Adams wrote:
    On 14/08/2024 15:12, Bart wrote:
    On 14/08/2024 18:40, Thiago Adams wrote:
    On 14/08/2024 14:07, Bart wrote:

    That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
    clash with some other Unicode character.



    My suggestion again. I am using a string, but imagine this working
    with bytes from a file.


    #include <stdio.h>
    #include <assert.h>

    ...
    int get_value(const char* s0)
    {
        const char* s = s0;
        int value = 0;
        int uc;
        s = utf8_decode(s, &uc);
        while (s)
        {
            if (uc < 0x80)
            {
                // ASCII: apply the multi-char packing formula
                value = value*256 + uc;
            }
            else
            {
                // non-ASCII: the decoded code point is the value
                value = uc;
                break; // if more input follows this, it should be an error
            }
            s = utf8_decode(s, &uc);
        }
        return value;
    }

    int main() {
        printf("%d\n", get_value(u8"×"));
        printf("%d\n", get_value(u8"ab"));
    }

    I see your problem. You're mixing things up.


    The objective is:
    - make single characters have the Unicode value without having to use U''
    - allow more than one char, like 'ab', in cases where each character
      is less than 0x80. This can break code, for instance '¼¼', but I
      suspect people are not using it that way (I hope)

    Obviously that can't work, for example because two printable ASCII
    characters with codes 32 to 96 will have combined values from 8224
    (32*256+32) to 24672 (96*256+96) in a character literal. Those are
    going to clash with Unicode characters with those values.

    It won't work either at compile-time or runtime.
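    A concrete clash at the bottom of that range: two spaces pack to
    32*256+32 = 8224, and 8224 is U+2020, the dagger. A sketch,
    hand-computing both sides of the proposed scheme rather than calling
    the get_value from the earlier post:

    #include <stdio.h>

    int main(void) {
        int two_spaces = ' ' * 256 + ' ';  /* the multi-char formula: 8224 */
        int dagger     = 0x2020;           /* code point of '†': also 8224 */
        printf("%d %d\n", two_spaces, dagger); /* prints: 8224 8224 */
    }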

    You need to choose between Unicode representation and UTF8. Either that
    or use some prefix to disambiguate in source code, but you still need
    to decide whether '€' in source code is represented as the Unicode
    bytes 20 AC (or maybe 00 20 AC) or the UTF8 sequence E2 82 AC, and
    further decide which end of those sequences will be the least
    significant byte.


    In any case, my suggestion looks dangerous. But meanwhile, this is not
    well specified in the standard.

    It wasn't well-specified even when dealing with 100% ASCII. For example,
    'AB' might have the hex value 0x4142 on one compiler, 0x4241 on another,
    maybe just 0x41 or 0x42 on a third, or even 0x41420000.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Bart on Thu Aug 15 01:39:27 2024
    On Wed, 14 Aug 2024 14:05:22 +0100, Bart wrote:

    I get the impression that C's wide characters are intended for those
    Unicode indices, but that's not going to work well on Windows with its
    16-bit wide character type.

    Unfortunately, Windows (like Java) is shackled to the UTF-16 Albatross,
    owing to embracing Unicode at exactly the wrong time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thiago Adams on Thu Aug 15 02:41:48 2024
    On Wed, 14 Aug 2024 10:31:59 -0300, Thiago Adams wrote:

    215 is the unicode number of the character '×'.

    Be careful about the use of the term “character” in Unicode.

    Unicode defines “code points”. A “grapheme” (which I think is their term
    for “character”) can be made up of one or more “code points”, with no upper limit on their number.
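    An example of that distinction (my own sketch): one on-screen
    "character", two code points, three bytes of UTF-8.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* 'e' (U+0065) followed by COMBINING ACUTE ACCENT (U+0301),
           spelled here as raw UTF-8 bytes */
        const char *g = "e\xCC\x81";
        printf("%zu bytes\n", strlen(g)); /* prints: 3 bytes */
        printf("%s\n", g);                /* renders as a single é */
    }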

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thiago Adams on Thu Aug 15 02:43:03 2024
    On Wed, 14 Aug 2024 13:10:01 -0300, Thiago Adams wrote:

    The result from GCC is 50071:
    static_assert('×' == 50071);

    The explanation is that GCC is doing:

    256*195 + 151 = 50071

    (Remember the utf8 bytes were 195 151)

    That would be an endian-dependent interpretation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)