• Re: multi bytes character - how to make it defined behavior?

    From Bart@21:1/5 to Thiago Adams on Wed Aug 14 00:52:13 2024
    On 13/08/2024 15:45, Thiago Adams wrote:
    static_assert('×' == 50071);

    GCC -  warning multi byte
    CLANG - error character too large

    I think instead of "multi bytes" we need "multi characters" - not bytes.

    We decode utf8 then we have the character to decide if it is multi char
    or not.

    decoding '×' would consume bytes 195 and 151; the result is the decoded
    Unicode value 215.

    It is not multi-byte: 256*195 + 151 = 50071

    On the other hand, 'ab' is "multi character", resulting in

    256*'a' + 'b' = 256*97 + 98 = 24930

    One consequence is that

    'ab' == '𤤰'

    But I don't think this is a problem. At least everything is defined.

    What exactly do you mean by multi-byte characters? Is it a literal such
    as 'ABCD'?

    I've no idea what C makes of that, so you will first have to specify
    what it might represent:

    * Is it a single character represented by multiple bytes?

    * If so, do those multiple bytes specify a Unicode number (2-3 bytes),
    or a UTF8 sequence (up to 4 bytes, maybe more)?

    * If such multi-byte sequences are allowed, could you have more than
    one of them, mixing ASCII/Unicode/UTF8 characters?

    One problem with UTF8 in C character literals is that I believe those
    are limited to an 'int' type, so 32 bits. You can't fit much in there.
    And once you have such a value, how do you print it?
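    To illustrate the printing problem, here is a minimal sketch. It
    assumes gcc's packing, where the last character of the literal lands
    in the least significant byte (other compilers may differ, as noted);
    put_packed is my own name, not a standard function:

    #include <stdio.h>

    /* Unpack a multi-character constant's bytes, most significant
       first, and write them out; for UTF8 content this happens to
       reproduce the original byte sequence. */
    static void put_packed(unsigned v) {
        char buf[5];
        int n = 0;
        for (int shift = 24; shift >= 0; shift -= 8) {
            unsigned b = (v >> shift) & 0xFF;
            if (b || n) buf[n++] = (char)b;  /* skip leading zero bytes */
        }
        fwrite(buf, 1, n, stdout);
        putchar('\n');
    }

    int main(void) {
        put_packed('×');    /* with gcc: bytes C3 97, prints × */
        put_packed('ab');   /* prints ab */
    }

    On a UTF8 terminal that prints the × back, but only because gcc's
    packing happens to preserve the byte order.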

    Some of this you can take care of in your 'cake' product, and
    superimpose a particular spec on top of C (maybe character constants
    can be extended to 64 bits), but you probably can't do much about
    'printf'.

    (In my language, I overhauled this part of it earlier this year. There
    it works like this:

    * Character literals can be 64 bits

    * They can represent up to 8 ASCII characters: 'ABCDEFGH'

    * They can include escape codes for both Unicode and UTF8, and multiple
    such characters can be specified:

    'A\u20ACB' # All represent A€B; this is Unicode
    'A\h E2 82 AC\B' # This is UTF8
    'A\xE2\x82\xACB' # C-style escape

    Internally they are stored as UTF8, so the 20AC is converted to UTF8

    * The ordering of the characters matches that of the equivalent
    "A\e20ACB" string when stored in memory; but this applies only on
    little-endian machines

    * Print routines have options to print the first character (which can be
    a Unicode one), or the whole sequence)

    Another aspect is when typing Unicode text directly via your text
    editor instead of using escape codes; will the C source be UTF8, or
    some other encoding? This will affect how the text is represented, and
    how much you can fit into one 32/64-bit literal.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Thiago Adams on Wed Aug 14 01:32:14 2024
    Thiago Adams <thiago.adams@gmail.com> writes:

    static_assert('×' == 50071);

    static_assert(U'×' == 215);

    works, but then I don't know what you were trying to do.

    GCC - warning multi byte
    CLANG - error character too large

    I think instead of "multi bytes" we need "multi characters" - not
    bytes.

    We decode utf8 then we have the character to decide if it is multi char or not.

    These terms can be confusing and I don't know exactly how you are using
    them. Basically I simply don't know what that second sentence is
    saying.

    decoding '×' would consume bytes 195 and 151; the result is the decoded Unicode value 215.

    Yes, Unicode 215 is UTF-8 encoded as two bytes with values 195 and 151.
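    A quick way to check that is to hand-encode the code point: for values
    below U+0800, UTF-8 uses the two-byte form 110xxxxx 10xxxxxx (a
    minimal sketch of my own):

    #include <stdio.h>

    int main(void) {
        unsigned cp = 215;                 /* U+00D7, '×' */
        unsigned b1 = 0xC0 | (cp >> 6);    /* 110xxxxx: top 5 bits */
        unsigned b2 = 0x80 | (cp & 0x3F);  /* 10xxxxxx: low 6 bits */
        printf("%u %u\n", b1, b2);         /* prints: 195 151 */
    }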

    It is not multi-byte: 256*195 + 151 = 50071

    If that × is UTF-8 encoded then it might look, to the compiler, just
    like an old-fashioned multi-character character constant, as 'ab'
    does. Then again, it might not. gcc and clang take different views on
    the matter.

    You can get clang to take the same view as gcc by writing

    static_assert('\xC3\x97' == 50071);

    instead. Now both gcc and clang see it for what it is: an old-fashioned multi-character character constant.

    On the other hand, 'ab' is "multi character", resulting in

    The term for these things used to be "multi-byte character constant"
    and they were highly non-portable. The trouble is that the term
    "multi-byte character" now refers to highly portable encodings like
    UTF-8. Maybe that's why gcc seems to have changed its warning from
    what you gave to:

    warning: multi-character character constant [-Wmultichar]

    256*'a' + 'b' = 256*97 + 98 = 24930

    One consequence is that

    'ab' == '𤤰'

    But I don't think this is a problem. At least everything is defined.


    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to Thiago Adams on Tue Aug 13 23:44:24 2024
    On 8/13/24 10:45 AM, Thiago Adams wrote:
    static_assert('×' == 50071);

    GCC -  warning multi byte
    CLANG - error character too large

    I think instead of "multi bytes" we need "multi characters" - not bytes.

    We decode utf8 then we have the character to decide if it is multi char
    or not.

    decoding '×' would consume bytes 195 and 151; the result is the decoded
    Unicode value 215.

    It is not multi-byte: 256*195 + 151 = 50071

    On the other hand, 'ab' is "multi character", resulting in

    256*'a' + 'b' = 256*97 + 98 = 24930

    One consequence is that

    'ab' == '𤤰'

    But I don't think this is a problem. At least everything is defined.

    When you use the single quotes by themselves ('), you are specifying
    characters in the narrow character set, typically ASCII, but possibly
    some other 8-bit character encoding. It cannot specify extended
    characters beyond those.

    You can (if the implementation allows it) place multiple characters in
    the constant to get an integer value with those characters packed.

    When you use the double quotes by themselves ("), you are specifying a
    string of these narrow characters, although this form might allow for
    multi-byte encodings of some characters, as is done with UTF-8.

    You can specify wide character constants with the syntax L'x', u'x',
    or U'x'.

    L'x' will give you whatever the implementation calls its "wide
    character set". This MIGHT be UCS-2/UTF-16 or UCS-4/UTF-32 encoded, but
    doesn't need to be.

    The u'x' form will always be UCS-2/UTF-16, and U'x' will always be
    UCS-4/UTF-32.

    Like the plain 'x' form, the result from a single character cannot be
    a multi-unit value, so u'x' can't generate a surrogate pair for a
    single source character.

    Change the ' to a " and you get wide strings, just like the characters,
    but now u"xx" and L"xx" can generate characters that use surrogate
    pairs (or other multi-part encodings for L"xxx").
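    A small illustration of those forms (my own sketch; as described
    above, what L'' gives you is implementation-defined):

    #include <stdio.h>
    #include <uchar.h>
    #include <wchar.h>

    int main(void) {
        char     n = 'x';    /* narrow character set */
        wchar_t  w = L'×';   /* whatever the implementation's wide set is */
        char16_t s = u'×';   /* one UTF-16 code unit: 0x00D7 */
        char32_t d = U'𤤰';  /* one UTF-32 unit: 0x24930 */
        /* u'𤤰' would be an error: it needs a surrogate pair, and a
           u'' constant must be a single 16-bit unit. In a string it
           is fine: */
        char16_t pair[] = u"𤤰"; /* 0xD852 0xDD30 plus terminator */
        printf("%zu\n", sizeof pair / sizeof pair[0]); /* prints 3 */
        (void)n; (void)w; (void)s; (void)d;
    }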

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to Thiago Adams on Wed Aug 14 14:05:22 2024
    On 14/08/2024 12:41, Thiago Adams wrote:
    On 13/08/2024 21:33, Keith Thompson wrote:
    Bart<bc@freeuk.com>  writes:
    [...]
    What exactly do you mean by multi-byte characters? Is it a literal
    such as 'ABCD'?

    I've no idea what C makes of that,
    It's a character constant of type int with an implementation-defined
    value.  Read the section on "Character constants" in the C standard
    (6.4.4.4 in C17).

    (With gcc, its value is 0x41424344, but other compilers can and do
    behave differently.)

    We discussed this at some length several years ago.

    [...]


    "An integer character constant has type int. The value of an integer character constant containing
    a single character that maps to a single value in the literal encoding (6.2.9) is the numerical value
    of the representation of the mapped character in the literal encoding interpreted as an integer.
    The value of an integer character constant containing more than one
    character (e.g. ’ab’), or
    containing a character or escape sequence that does not map to a single
    value in the literal encoding,
    is implementation-defined. If an integer character constant contains a
    single character or escape
    sequence, its value is the one that results when an object with type
    char whose value is that of the
    single character or escape sequence is converted to type int."


    I am suggesting we define this:

    "The value of an integer character constant containing more than one character (e.g. ’ab’), or containing a character or escape sequence that does not map to a single value in the literal encoding, is implementation-defined."

    How?

    First, all source code should be utf8.

    Then I am suggesting we first decode the bytes.

    For instance, '×' is encoded as bytes 195 and 151. We consume these 2
    bytes and the utf8 decoded value is 215.

    By that you mean the Unicode index. But you say elsewhere that
    everything in your source code is UTF8.

    Where then does the 215 appear? Do your char* strings use 215 for ×, or
    do they use 195 and 151?

    I think this is why C requires those prefixes like u8'...'.


    Then this is the defined behavior

    static_assert('×' == 215)

    This is where you need to decide whether the integer value within '...',
    AT RUNTIME, represents the Unicode index or the UTF8 sequence.

    (In my language, though I do very little with Unicode ATM, I decided
    that everything is UTF8 both at compile time and runtime. Unless I
    explicitly expand a UTF8 u8[] string to a u32[] or i32[] array (either
    will work), which contains 21-bit Unicode index values.)
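    A C analogue of that expansion can be sketched with C11's mbrtoc32;
    this is my own sketch, and it assumes a UTF-8 locale is available
    under the name used here:

    #include <stdio.h>
    #include <string.h>
    #include <locale.h>
    #include <uchar.h>

    int main(void) {
        setlocale(LC_ALL, "C.UTF-8");  /* assumption: a UTF-8 locale */
        const char *s = "A×B";
        char32_t out[8];
        mbstate_t st = {0};
        size_t n = 0, r;
        /* decode the narrow UTF-8 string into 32-bit code points */
        while ((r = mbrtoc32(&out[n], s, strlen(s) + 1, &st)) != 0) {
            if (r == (size_t)-1 || r == (size_t)-2) break; /* malformed */
            s += r;
            n++;
        }
        for (size_t i = 0; i < n; i++)
            printf("U+%04X\n", (unsigned)out[i]); /* U+0041 U+00D7 U+0042 */
    }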

    I get the impression that C's wide characters are intended for those
    Unicode indices, but that's not going to work well on Windows with its
    16-bit wide character type.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to Thiago Adams on Wed Aug 14 16:34:07 2024
    On 14/08/2024 14:31, Thiago Adams wrote:
    On 14/08/2024 10:05, Bart wrote:
    On 14/08/2024 12:41, Thiago Adams wrote:
    On 13/08/2024 21:33, Keith Thompson wrote:
    Bart<bc@freeuk.com>  writes:
    [...]
    What exactly do you mean by multi-byte characters? Is it a literal
    such as 'ABCD'?

    I've no idea what C makes of that,
    It's a character constant of type int with an implementation-defined
    value.  Read the section on "Character constants" in the C standard
    (6.4.4.4 in C17).

    (With gcc, its value is 0x41424344, but other compilers can and do
    behave differently.)

    We discussed this at some length several years ago.

    [...]


    "An integer character constant has type int. The value of an integer
    character constant containing
    a single character that maps to a single value in the literal
    encoding (6.2.9) is the numerical value
    of the representation of the mapped character in the literal encoding
    interpreted as an integer.
    The value of an integer character constant containing more than one
    character (e.g. ’ab’), or
    containing a character or escape sequence that does not map to a
    single value in the literal encoding,
    is implementation-defined. If an integer character constant contains
    a single character or escape
    sequence, its value is the one that results when an object with type
    char whose value is that of the
    single character or escape sequence is converted to type int."


    I am suggesting we define this:

    "The value of an integer character constant containing more than one
    character (e.g. ’ab’), or containing a character or escape sequence
    that does not map to a single value in the literal encoding, is
    implementation-defined."

    How?

    First, all source code should be utf8.

    Then I am suggesting we first decode the bytes.

    For instance, '×' is encoded as bytes 195 and 151. We consume these 2
    bytes and the utf8 decoded value is 215.

    By that you mean the Unicode index. But you say elsewhere that
    everything in your source code is UTF8.


    215 is the unicode number of the character '×'.

    Where then does the 215 appear? Do your char* strings use 215 for ×,
    or do they use 195 and 151?

    215 is the result of decoding two utf8 encoded bytes. (195 and 151)

    I think this is why C requires those prefixes like u8'...'.


    Then this is the defined behavior

    static_assert('×' == 215)

    This is where you need to decide whether the integer value within
    '...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.

    Why runtime? It is compile time. This is why source code must be
    universally encoded (utf8).


    In that case I don't understand what you are testing for here. Is it an
    error for '×' to be 215, or an error for it not to be?

    And what is the test for, to ensure encoding is UTF8 in this ... source
    file? ... compiler?

    Where would the 'decoded 215' come into it?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to Thiago Adams on Wed Aug 14 18:07:26 2024
    On 14/08/2024 17:10, Thiago Adams wrote:
    On 14/08/2024 12:34, Bart wrote:

    In that case I don't understand what you are testing for here. Is it
    an error for '×' to be 215, or an error for it not to be?


    GCC handles this as multi-byte, without decoding.

    The result from GCC is 50071:
    static_assert('×' == 50071);

    The explanation is that GCC is doing:

    256*195 + 151 = 50071

    So the 50071 is the 2-byte UTF8 sequence.



    (Remember the utf8 bytes were 195 151)

    The way 'ab' is handled is the same as '×' on GCC.

    I don't understand. 'a' and 'b' each occupy one byte. Together they need
    two bytes.

    Where's the problem? Are you perhaps confused as to what UTF8 is?

    The 50071 above is much better expressed as hex: C397, which is two
    bytes. Since both values are in 128..255, they are UTF8 codes, here
    expressing a single Unicode character.

    Given any two bytes in UTF8, it is easy to see whether they are two
    ASCII characters, or one (or part of) a Unicode character, or one ASCII
    character followed by the first byte of a UTF8 sequence, or whether
    they are malformed (eg. the middle of a UTF8 sequence).
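    That check is mechanical. A sketch of the byte classification (my own,
    for illustration):

    #include <stdio.h>

    static const char *kind(unsigned char b) {
        if (b < 0x80) return "ASCII";
        if (b < 0xC0) return "continuation byte";
        if (b < 0xE0) return "lead byte of a 2-byte sequence";
        if (b < 0xF0) return "lead byte of a 3-byte sequence";
        if (b < 0xF8) return "lead byte of a 4-byte sequence";
        return "not valid in UTF8";
    }

    int main(void) {
        unsigned char bytes[] = { 'a', 0xC3, 0x97, 0x97 };
        for (size_t i = 0; i < sizeof bytes; i++)
            printf("%02X: %s\n", bytes[i], kind(bytes[i]));
        /* the final 0x97 is a continuation byte with no lead byte
           before it, which is exactly how malformed input shows up */
    }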

    There is no confusion.



    And what is the test for, to ensure encoding is UTF8 in this ...
    source file? ... compiler?

    MSVC has some checks; I don't know what the logic is.


    Where would the 'decoded 215' come into it?

    215 is the value after decoding utf8 and producing the unicode value.

    Who or what does that, and for what purpose? From what I've seen, only
    you have introduced it.

    So my suggestion is decode first.

    Why? What are you comparing? Both sides of == must use UTF8 or Unicode,
    but why introduce Unicode at all if apparently everything in source code
    and at compile time, as you yourself have stated, is UTF8?

    The bad part of my suggestion is that we may have two different ways
    of producing the same value.

    For instance the number generated by 'ab' is the same as the one for
    '𤤰':

    'ab' == '𤤰'

    I don't think so. If I run this program:

    #include <stdio.h>

    int main(void) {
        printf("%d\n",   '×');
        printf("%04X\n", (unsigned)'×');
        printf("%d\n",   'ab');
        printf("%04X\n", (unsigned)'ab');
        printf("%d\n",   '𤤰');
        printf("%04X\n", (unsigned)'𤤰');
    }


    I get this output (I've left out the decimal versions for clarity):

    C397 ×

    6162 ab

    F0A4A4B0 𤤰

    That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
    clash with some other Unicode character.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to Thiago Adams on Wed Aug 14 19:12:34 2024
    On 14/08/2024 18:40, Thiago Adams wrote:
    On 14/08/2024 14:07, Bart wrote:

    That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
    clash with some other Unicode character.



    My suggestion again. I am using a string, but imagine this working
    with bytes from a file.


    #include <stdio.h>
    #include <assert.h>

    ...
    int get_value(const char* s0)
    {
        const char* s = s0;
        int value = 0;
        int uc;
        s = utf8_decode(s, &uc);
        while (s)
        {
            if (uc < 0x80)
            {
                // ASCII: apply the multi-char packing formula
                value = value*256 + uc;
            }
            else
            {
                // non-ASCII: the decoded code point is the value
                value = uc;
                break; // if more input follows this, it should be an error
            }
            s = utf8_decode(s, &uc);
        }
        return value;
    }

    int main() {
        printf("%d\n", get_value(u8"×"));
        printf("%d\n", get_value(u8"ab"));
    }
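    For the "..." above, a utf8_decode along these lines is presumably
    intended. This is my own hypothetical version matching the signature
    used, with no validation of malformed input:

    /* hypothetical decoder, not the poster's actual one */
    const char* utf8_decode(const char* s, int* out)
    {
        unsigned char b = (unsigned char)*s;
        if (b == 0) return NULL;                   /* end of string */
        if (b < 0x80) { *out = b; return s + 1; }  /* ASCII */
        if (b < 0xE0) {                            /* 2-byte form */
            *out = (b & 0x1F) << 6 | (s[1] & 0x3F);
            return s + 2;
        }
        if (b < 0xF0) {                            /* 3-byte form */
            *out = (b & 0x0F) << 12 | (s[1] & 0x3F) << 6 | (s[2] & 0x3F);
            return s + 3;
        }
        *out = (b & 0x07) << 18 | (s[1] & 0x3F) << 12  /* 4-byte form */
             | (s[2] & 0x3F) << 6 | (s[3] & 0x3F);
        return s + 4;
    }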

    I see your problem. You're mixing things up.

    gcc will combine BYTE values together (by shifting by 8 bits or
    multiplying by 256), including the individual bytes that represent UTF8.

    You are combining ONLY ASCII bytes, and comparing the results with
    21-bit Unicode values.

    That is meaningless. I'm not surprised you get a clash between A*256+B,
    and some arbitrary Unicode index.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to Thiago Adams on Wed Aug 14 20:32:31 2024
    On 14/08/2024 19:28, Thiago Adams wrote:
    On 14/08/2024 15:12, Bart wrote:
    On 14/08/2024 18:40, Thiago Adams wrote:
    On 14/08/2024 14:07, Bart wrote:

    That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
    clash with some other Unicode character.



    My suggestion again. I am using a string, but imagine this working
    with bytes from a file.


    #include <stdio.h>
    #include <assert.h>

    ...
    int get_value(const char* s0)
    {
        const char* s = s0;
        int value = 0;
        int uc;
        s = utf8_decode(s, &uc);
        while (s)
        {
            if (uc < 0x80)
            {
                // ASCII: apply the multi-char packing formula
                value = value*256 + uc;
            }
            else
            {
                // non-ASCII: the decoded code point is the value
                value = uc;
                break; // if more input follows this, it should be an error
            }
            s = utf8_decode(s, &uc);
        }
        return value;
    }

    int main() {
        printf("%d\n", get_value(u8"×"));
        printf("%d\n", get_value(u8"ab"));
    }

    I see your problem. You're mixing things up.


    The objective is:
    - make single characters have the Unicode value without having to use U''
    - allow more than one char, like 'ab', in cases where each character
      is less than 0x80. This can break code, for instance '¼¼', but I
      suspect people are not using it that way (I hope)

    Obviously that can't work, for example because two printable ASCII
    characters with codes 32 to 96 will have combined values from 8224
    (32*256+32) to 24672 (96*256+96) in a character literal. Those are
    going to clash with Unicode characters with those values.

    It won't work either at compile-time or runtime.
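    A concrete clash at the bottom of that range: two spaces pack to
    32*256+32 = 8224, and 8224 is U+2020, the dagger. A sketch,
    hand-computing both sides of the proposed scheme rather than calling
    the get_value from the earlier post:

    #include <stdio.h>

    int main(void) {
        int two_spaces = ' ' * 256 + ' ';  /* the multi-char formula: 8224 */
        int dagger     = 0x2020;           /* code point of '†': also 8224 */
        printf("%d %d\n", two_spaces, dagger); /* prints: 8224 8224 */
    }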

    You need to choose between Unicode representation and UTF8. Either that
    or use some prefix to disambiguate in source code, but you still need
    to decide whether '€' in source code is represented as the Unicode
    bytes 20 AC (or maybe 00 20 AC) or the UTF8 sequence E2 82 AC, and
    further decide which end of those sequences will be the least
    significant byte.


    In any case, my suggestion looks dangerous. But meanwhile, this is not
    well specified in the standard.

    It wasn't well-specified even when dealing with 100% ASCII. For example,
    'AB' might have the hex value 0x4142 on one compiler, 0x4241 on another,
    maybe just 0x41 or 0x42 on a third, or even 0x41420000.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Bart on Thu Aug 15 01:39:27 2024
    On Wed, 14 Aug 2024 14:05:22 +0100, Bart wrote:

    I get the impression that C's wide characters are intended for those
    Unicode indices, but that's not going to work well on Windows with its
    16-bit wide character type.

    Unfortunately, Windows (like Java) is shackled to the UTF-16 Albatross,
    owing to embracing Unicode at exactly the wrong time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thiago Adams on Thu Aug 15 02:41:48 2024
    On Wed, 14 Aug 2024 10:31:59 -0300, Thiago Adams wrote:

    215 is the unicode number of the character '×'.

    Be careful about the use of the term “character” in Unicode.

    Unicode defines “code points”. A “grapheme” (which I think is their term
    for “character”) can be made up of one or more “code points”, with no upper limit on their number.
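    An example of that distinction (my own sketch): one on-screen
    "character", two code points, three bytes of UTF-8.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* 'e' (U+0065) followed by COMBINING ACUTE ACCENT (U+0301),
           spelled here as raw UTF-8 bytes */
        const char *g = "e\xCC\x81";
        printf("%zu bytes\n", strlen(g)); /* prints: 3 bytes */
        printf("%s\n", g);                /* renders as a single é */
    }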

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thiago Adams on Thu Aug 15 02:43:03 2024
    On Wed, 14 Aug 2024 13:10:01 -0300, Thiago Adams wrote:

    The result from GCC is 50071:
    static_assert('×' == 50071);

    The explanation is that GCC is doing:

    256*195 + 151 = 50071

    (Remember the utf8 bytes were 195 151)

    That would be an endian-dependent interpretation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)