• UTF-8 overlong encodings

    From Rainer Weikusat@21:1/5 to All on Fri Dec 17 22:55:51 2021
    As usual with technical terms "everyone understands", it gets thrown
    around everywhere but is never defined. The definition I derived is
    below.

    The non-ASCII part of UTF-8 is composed of 5 ranges each of which
    starts with a number which has only one bit set. The starting numbers
    are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
    'overlong' when this start bit isn't set.

    Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
    and 26.

    Each range is composed of a number of six bit blocks plus a
    remainder which gets put into the byte starting the encoded
    sequence. Again expressed as (left) shift arguments, the highest bits of
    the left-most six bit blocks are 5, 11, 17, 23, 29.

    Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yields the
    position of this starting bit relative to the highest bit in the first
    six bit block. The corresponding values are 2, 0, -1, -2 and -3.

    The first case is special because the starting bit is the bit
    corresponding with 1 in the first byte. All other start bits are in the
    second byte, at positions 5, 4, 3 and 2.

An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
    ignoring the initial special case, the shift value relative to the start
    of the first six bit block for each encoded sequence is 8 -
    its length:

    3 -> 5
    4 -> 4
    5 -> 3
    6 -> 2
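
A quick way to sanity-check this arithmetic is to compute the
differences directly. A small, self-contained sketch (the two arrays
are just the shift values listed above):

#include <stdio.h>

int main(void)
{
        /* start bit shifts and highest bits of the left-most six bit
           blocks, for sequence lengths 2 to 6, as listed above */
        static const int start_bit[] = { 7, 11, 16, 21, 26 };
        static const int block_top[] = { 5, 11, 17, 23, 29 };
        int len;

        /* prints 2 -> 2, 3 -> 0, 4 -> -1, 5 -> -2, 6 -> -3; for
           lengths 3 to 6 the start bit thus sits at 5 + difference,
           ie, at position 8 - length in the second byte */
        for (len = 2; len <= 6; len++)
                printf("%d -> %d\n", len,
                       start_bit[len - 2] - block_top[len - 2]);
        return 0;
}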

    Any corrections or other comments very much welcome.

  • From Ben Bacarisse@21:1/5 to Rainer Weikusat on Sat Dec 18 00:04:05 2021
    Rainer Weikusat <rweikusat@talktalk.net> writes:

    As usual with technical terms "everyone understands", it gets thrown
    around everywhere but is never defined. The definition I derived is
    below.

    The non-ASCII part of UTF-8 is composed of 5 ranges each of which
    starts with a number which has only one bit set. The starting numbers
    are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
    'overlong' when this start bit isn't set.

    I'd express it in terms of magnitude. An overlong 2-byte sequence will
decode to a value less than 0x80. An overlong encoded 3-byte value will be
    less than 0x800 and so on. Or going the other way, you need at least
two bytes if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
    so on.
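
The magnitude rule translates directly into code. An untested sketch
(the thresholds are the starting numbers quoted above, keeping the old
5- and 6-byte forms):

#include <stdint.h>

/* minimum number of bytes needed to encode v; a sequence is
   overlong when it is longer than this for the value it
   decodes to */
static int utf8_min_bytes(uint32_t v)
{
        if (v < 0x80)      return 1;
        if (v < 0x800)     return 2;
        if (v < 0x10000)   return 3;
        if (v < 0x200000)  return 4;
        if (v < 0x4000000) return 5;
        return 6;
}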

    When looking at the encoding itself, an overlong sequence is one that
    starts with one of the bytes C0, C1, E0, F0, F8 or FC.

    Unicode has said it won't use more than the 21 bits available in the
    four-byte encoding, so all sequences of length 5 or 6 are "overlong",
    although in an entirely different sense.

    Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
    and 26.

    Each range is composed of a number of six bit blocks plus a
    remainder which gets put into the byte starting the encoded
    sequence. Again expressed as (left) shift arguments, the highest bits of
    the left-most six bit blocks are 5, 11, 17, 23, 29.

    Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yields the
position of this starting bit relative to the highest bit in the first
    six bit block. The corresponding values are 2, 0, -1, -2 and -3.

    The first case is special because the starting bit is the bit
corresponding with 1 in the first byte. All other start bits are in the
second byte, at positions 5, 4, 3 and 2.

An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
    ignoring the initial special case, the shift value relative to the start
    of the first six bit block for each encoded sequence is 8 -
    its length:

    3 -> 5
    4 -> 4
    5 -> 3
    6 -> 2

    Any corrections or other comments very much welcome.

    I was not sure what this part of the description was supposed to add to
    the initial definition.

    --
    Ben.

  • From Keith Thompson@21:1/5 to Rainer Weikusat on Fri Dec 17 16:37:00 2021
    Rainer Weikusat <rweikusat@talktalk.net> writes:
    As usual with technical terms "everyone understands", it gets thrown
    around everywhere but is never defined. The definition I derived is
    below.

    The non-ASCII part of UTF-8 is composed of 5 ranges each of which
    starts with a number which has only one bit set. The starting numbers
    are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
    'overlong' when this start bit isn't set.

    Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
    and 26.

    Each range is composed of a number of six bit blocks plus a
    remainder which gets put into the byte starting the encoded
    sequence. Again expressed as (left) shift arguments, the highest bits of
    the left-most six bit blocks are 5, 11, 17, 23, 29.

    Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yields the
position of this starting bit relative to the highest bit in the first
    six bit block. The corresponding values are 2, 0, -1, -2 and -3.

    The first case is special because the starting bit is the bit
corresponding with 1 in the first byte. All other start bits are in the
second byte, at positions 5, 4, 3 and 2.

An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
    ignoring the initial special case, the shift value relative to the start
    of the first six bit block for each encoded sequence is 8 -
    its length:

    3 -> 5
    4 -> 4
    5 -> 3
    6 -> 2

    Any corrections or other comments very much welcome.

    Unicode only defines character values up to 0x10fffd, so there are no
    valid encodings longer than 4 octets.

    Here's a table I came up with a while ago:

00-7F          (7 bits) 0xxxxxxx
0080-07FF     (11 bits) 110xxxxx 10xxxxxx
0800-FFFF     (16 bits) 1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF (21 bits) 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    The character code is determined by concatenating the 'x's together.
    A 1-octet encoding has 7 value bits.
    A 2-octet encoding has 11 value bits.
    And so on.

An octet starting with 0 is always a single-octet character (ASCII
compatible). An octet starting with 11 is always the first octet of a
multi-octet encoding. An octet starting with 10 is always a continuation
octet.
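
That classification needs only the top two bits of each octet. A
minimal sketch (the names are illustrative):

/* classify an octet by its two leading bits */
enum octet_kind { OCT_ASCII, OCT_LEAD, OCT_CONT };

static enum octet_kind classify(unsigned char b)
{
        if ((b & 0x80) == 0)    return OCT_ASCII;       /* 0xxxxxxx */
        if ((b & 0xc0) == 0xc0) return OCT_LEAD;        /* 11xxxxxx */
        return OCT_CONT;                                /* 10xxxxxx */
}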

Overlong encodings that use more octets than necessary are invalid. For
example, the letter 'k' is 0x6b or 1101011 and is encoded in a single
octet:

01101011
 -------

A two-octet encoding of the same character code is invalid:

11000001 10101011
   -----   ------
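
Decoding the two-octet form shows why a decoder has to reject it; a
small demo (the masks follow the table above):

#include <stdio.h>

int main(void)
{
        /* the overlong two-octet encoding of 'k' from above */
        unsigned char b1 = 0xc1, b2 = 0xab;
        unsigned v = ((b1 & 0x1f) << 6) | (b2 & 0x3f);

        /* prints 6b: the value fits in 7 bits, so two octets
           is one more than necessary */
        printf("%x\n", v);
        return 0;
}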

    You could extrapolate UTF-8 to up to 8-octet encodings, representing up to
    42 bits, but that's also not valid UTF-8 (though I can imagine it being
    useful for some purposes).

110000-3FFFFFF         (26 bits) 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
4000000-7FFFFFFF       (31 bits) 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
80000000-FFFFFFFFF     (36 bits) 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1000000000-3FFFFFFFFFF (42 bits) 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    https://en.wikipedia.org/wiki/UTF-8

    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Working, but not speaking, for Philips
    void Void(void) { Void(); } /* The recursive call of the void */

  • From Rainer Weikusat@21:1/5 to Ben Bacarisse on Sat Dec 18 17:15:12 2021
    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
    Rainer Weikusat <rweikusat@talktalk.net> writes:

    As usual with technical terms "everyone understands", it gets thrown
    around everywhere but is never defined. The definition I derived is
    below.

    The non-ASCII part of UTF-8 is composed of 5 ranges each of which
    starts with a number which has only one bit set. The starting numbers
    are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
    'overlong' when this start bit isn't set.

I'd express it in terms of magnitude. An overlong 2-byte sequence will
decode to a value less than 0x80. An overlong encoded 3-byte value will be
    less than 0x800 and so on. Or going the other way, you need at least
two bytes if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
    so on.

    Yes. That's an error I made: An overlong sequence is one where none of
    the bits between the end of the prefix and the start bit (inclusive) are
    set.

    [...]

    Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
    and 26.

    Each range is composed of a number of six bit blocks plus a
    remainder which gets put into the byte starting the encoded
    sequence. Again expressed as (left) shift arguments, the highest bits of
    the left-most six bit blocks are 5, 11, 17, 23, 29.

    Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yields the
    position of this starting bit relative to the highest bit in the first
    six bit block. The corresponding values are 2, 0, -1, -2 and -3.

    The first case is special because the starting bit is the bit
    corresponding with 1 in the first byte. All other start bits are in the
    second byte, at positions 5, 4, 3 and 2.

An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
    ignoring the initial special case, the shift value relative to the start
    of the first six bit block for each encoded sequence is 8 -
    its length:

    3 -> 5
    4 -> 4
    5 -> 3
    6 -> 2

    Any corrections or other comments very much welcome.

    I was not sure what this part of the description was supposed to add to
    the initial definition.

    I want to calculate that with a general algorithm.

  • From Rainer Weikusat@21:1/5 to Keith Thompson on Sat Dec 18 17:19:17 2021
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    Rainer Weikusat <rweikusat@talktalk.net> writes:
    As usual with technical terms "everyone understands", it gets thrown
    around everywhere but is never defined. The definition I derived is
    below.

    [...]

    Unicode only defines character values up to 0x10fffd, so there are no
    valid encodings longer than 4 octets.

    Here's a table I came up with a while ago:

    00-7F (7 bits) 0xxxxxxx
    0080-07FF (11 bits) 110xxxxx 10xxxxxx
    0800-FFFF (16 bits) 1110xxxx 10xxxxxx 10xxxxxx
    010000-10FFFF (21 bits) 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    The Linux UTF-8 man page also has 5 and 6 byte sequences.

  • From Ben Bacarisse@21:1/5 to Rainer Weikusat on Sat Dec 18 23:14:15 2021
    Rainer Weikusat <rweikusat@talktalk.net> writes:

    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
    Rainer Weikusat <rweikusat@talktalk.net> writes:

    As usual with technical terms "everyone understands", it gets thrown
    around everywhere but is never defined. The definition I derived is
    below.

    The non-ASCII part of UTF-8 is composed of 5 ranges each of which
    starts with a number which has only one bit set. The starting numbers
    are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
    'overlong' when this start bit isn't set.

    I'd express it in terms of magnitude. An overlong 2-byte sequence will
decode to a value less than 0x80. An overlong encoded 3-byte value will be
    less than 0x800 and so on. Or going the other way, you need at least
two bytes if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
    so on.

    Yes. That's an error I made: An overlong sequence is one where none of
    the bits between the end of the prefix and the start bit (inclusive) are
    set.

    [...]

Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
and 26.

    Each range is composed of a number of six bit blocks plus a
    remainder which gets put into the byte starting the encoded
sequence. Again expressed as (left) shift arguments, the highest bits of
the left-most six bit blocks are 5, 11, 17, 23, 29.

    Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yields the
    position of this starting bit relative to the highest bit in the first
    six bit block. The corresponding values are 2, 0, -1, -2 and -3.

    The first case is special because the starting bit is the bit
    corresponding with 1 in the first byte. All other start bits are in the
    second byte, at positions 5, 4, 3 and 2.

An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start
of the first six bit block for each encoded sequence is 8 -
    its length:

    3 -> 5
    4 -> 4
    5 -> 3
    6 -> 2

    Any corrections or other comments very much welcome.

    I was not sure what this part of the description was supposed to add to
    the initial definition.

    I want to calculate that with a general algorithm.

I don't know what "that" refers to. Do you want to calculate the UTF-8
sequence length from the code point? It seems not. Do you want to
determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what is its result?

    --
    Ben.

  • From John McCue@21:1/5 to Rainer Weikusat on Sun Dec 19 00:49:58 2021
    Rainer Weikusat <rweikusat@talktalk.net> wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    Rainer Weikusat <rweikusat@talktalk.net> writes:
    <snip>

    Unicode only defines character values up to 0x10fffd, so there are no
    valid encodings longer than 4 octets.

    This is my understanding also.

    <snip>

    The Linux UTF-8 man page also has 5 and 6 byte sequences.

I saw somewhere that 5 and 6 byte sequences were originally
defined, or were thought to be needed, but encodings are now
limited to 4 bytes.

    --
    csh(1) - "An elegant shell, for a more... civilized age."
    - Paraphrasing Star Wars

  • From Rainer Weikusat@21:1/5 to Rainer Weikusat on Sun Dec 19 21:53:25 2021
    Rainer Weikusat <rweikusat@talktalk.net> writes:
    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
    Rainer Weikusat <rweikusat@talktalk.net> writes:

    [...]

An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start
of the first six bit block for each encoded sequence is 8 -
    its length:

    3 -> 5
    4 -> 4
    5 -> 3
    6 -> 2

    Any corrections or other comments very much welcome.

I was not sure what this part of the description was supposed to add to
the initial definition.

    I want to calculate that with a general algorithm.

    I don't know what "that" refers to. Do you want to calculate the UTF-8
    sequence length from the code point? It seems not. Do you want to
    determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what is its result?

    I want to determine if a sequence is overlong using a generalized
    algorithm for that, ie, not by special-casing start byte values. So far,
the untested (and very likely buggy) code for this looks as follows:

    u_len is the length of the sequence in bytes, p a pointer to the first
    byte. Some unrelated consistency checks removed.

    At the moment, I'm convinced that this algorithm is complete
    nonsense. :-)

  • From Rainer Weikusat@21:1/5 to Ben Bacarisse on Sun Dec 19 21:39:02 2021
    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
    Rainer Weikusat <rweikusat@talktalk.net> writes:

    [...]

An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start
of the first six bit block for each encoded sequence is 8 -
    its length:

    3 -> 5
    4 -> 4
    5 -> 3
    6 -> 2

    Any corrections or other comments very much welcome.

    I was not sure what this part of the description was supposed to add to
    the initial definition.

    I want to calculate that with a general algorithm.

I don't know what "that" refers to. Do you want to calculate the UTF-8
sequence length from the code point? It seems not. Do you want to
determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what is its result?

    I want to determine if a sequence is overlong using a generalized
    algorithm for that, ie, not by special-casing start byte values. So far,
the untested (and very likely buggy) code for this looks as follows:

    u_len is the length of the sequence in bytes, p a pointer to the first
    byte. Some unrelated consistency checks removed.

mask = (1 << (8 - u_len)) - 1;  /* all value bits in the first byte set */
x = *p & mask;
if (u_len == 2)
        if (x < 2)
                return U_BIN;   /* 2 byte sequence overlong if only the lowest bit set */

y = *++p;

if (!x) {       /* x == 0 implies u_len > 2 */
        mask = ~((1 << (8 - u_len)) - 1);       /* all bits down to start bit in 2nd byte set */
        if ((y & mask) == 0x80)
                return U_BIN;   /* overlong if continuation pattern only */
}

  • From Ben Bacarisse@21:1/5 to Rainer Weikusat on Mon Dec 20 03:57:24 2021
    Rainer Weikusat <rweikusat@talktalk.net> writes:

    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
    Rainer Weikusat <rweikusat@talktalk.net> writes:

    [...]

An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start
of the first six bit block for each encoded sequence is 8 -
    its length:

    3 -> 5
    4 -> 4
    5 -> 3
    6 -> 2

    Any corrections or other comments very much welcome.

    I was not sure what this part of the description was supposed to add to >>>> the initial definition.

    I want to calculate that with a general algorithm.

    I don't know what "that" refers to. Do you want to calculate the UTF-8
    sequence length from the code point? It seems not. Do you want to
    determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what is its result?

    I want to determine if a sequence is overlong using a generalized
    algorithm for that, ie, not by special-casing start byte values.

I don't think I follow what you mean. Overlong sequences are a special
case, so you have to special-case something. Why not the first byte? It
seems such a simple method.

    So far,
the untested (and very likely buggy) code for this looks as follows:

    u_len is the length of the sequence in bytes,

How have you calculated u_len? You can detect an overlong sequence
    without knowing it, so there is some risk in using it when it's not
    needed.

    p a pointer to the first
    byte. Some unrelated consistency checks removed.

    mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */

    That includes one more bit than you want. In a proper UTF-8 sequence,
    that bit will be zero, so it's harmless, but have you already checked
that the sequence is valid (other than possibly being overlong)?

    By the way, I'd use 0xff >> u_len to get the mask. It seems more
    natural.

    x = *p & mask;
    if (u_len == 2) if (x < 2) return U_BIN; /* 2 byte sequence overlong if only the lowest bit set */

    (or if no bits are set, but you include that in your test)

    y = *++p;

    I don't see why you need to look at the next byte.

    if (!x) { /* x == 0 implies u_len > 2 */

    x == 0 implies an overlong sequence now that you have dealt with the
length 2 case, which can have one bit in x set and still be overlong.

    mask = ~((1 << (8 - u_len)) - 1); /* all bits down to start bit in 2nd byte set */
    if ((y & mask) == 0x80) return U_BIN; /* overlong if continuation pattern only */
    }

    --
    Ben.

  • From Nicolas George@21:1/5 to All on Mon Dec 20 09:42:06 2021
John McCue, in message <splvjm$msm$1@dont-email.me>, wrote:
    I saw somewhere 5 and 6 byte sequences were originally
    defined or thought it would be needed, but now limited
    to 4 bytes.

Unicode was limited to 20-21 bits because Microsoft and Sun decided to
use UTF-16 to go beyond 16 bits instead of making their ABI evolve with
regard to sizeof(wchar_t) or equivalent.

  • From Nicolas George@21:1/5 to All on Mon Dec 20 13:34:08 2021
Rainer Weikusat, in message
<87tuf446mh.fsf@doppelsaurus.mobileactivedefense.com>, wrote:
    I want to determine if a sequence is overlong using a generalized
    algorithm for that

    Just decode and re-encode and see if the length is the same.
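
An untested sketch of that approach (names illustrative; it assumes the
sequence already passed the basic prefix checks and allows the old 5-
and 6-byte forms):

#include <stddef.h>
#include <stdint.h>

/* decode a len byte sequence (2 to 6) starting at p */
static uint32_t utf8_decode(const unsigned char *p, size_t len)
{
        uint32_t v = p[0] & (0x7f >> len); /* value bits of the start byte */
        size_t i;

        for (i = 1; i < len; i++)
                v = (v << 6) | (p[i] & 0x3f); /* append each six bit block */
        return v;
}

/* shortest possible encoding length for v */
static size_t utf8_min_len(uint32_t v)
{
        if (v < 0x80)      return 1;
        if (v < 0x800)     return 2;
        if (v < 0x10000)   return 3;
        if (v < 0x200000)  return 4;
        if (v < 0x4000000) return 5;
        return 6;
}

/* overlong iff re-encoding the value would need fewer bytes */
static int is_overlong(const unsigned char *p, size_t len)
{
        return utf8_min_len(utf8_decode(p, len)) < len;
}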

  • From Keith Thompson@21:1/5 to Nicolas George on Mon Dec 20 10:56:19 2021
    Nicolas George <nicolas$george@salle-s.org> writes:
John McCue, in message <splvjm$msm$1@dont-email.me>, wrote:
    I saw somewhere 5 and 6 byte sequences were originally
    defined or thought it would be needed, but now limited
    to 4 bytes.

Unicode was limited to 20-21 bits because Microsoft and Sun decided to
use UTF-16 to go beyond 16 bits instead of making their ABI evolve with
regard to sizeof(wchar_t) or equivalent.

I don't see how that follows. UTF-16, at least in its current form, can
represent all 1,112,064 valid Unicode code points.

    Sun used UTF-16 for Java. UTF-16, as far as I know, was rare on Solaris.

    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Working, but not speaking, for Philips
    void Void(void) { Void(); } /* The recursive call of the void */

  • From Rainer Weikusat@21:1/5 to Ben Bacarisse on Mon Dec 20 18:28:47 2021
    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
    Rainer Weikusat <rweikusat@talktalk.net> writes:
    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

    [...]


    u_len is the length of the sequence in bytes,

How have you calculated u_len? You can detect an overlong sequence
    without knowing it, so there is some risk in using it when it's not
    needed.

Assuming x is the start byte of a UTF-8 sequence stored as an unsigned
    32-bit integer, the length of the sequence is (using a gcc extension)

    __builtin_clz(x ^ 0xff) - 24

    All bits left of 0x80 will already be clear. x ^ 0xff will clear all
    bits up to the trailing 0 bit of the prefix and set that to 1.
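
Wrapped up as a function (only meaningful for start bytes in the 0xc0
to 0xfd range):

/* length of the sequence from its start byte, eg,
   0xc0 -> 2, 0xe0 -> 3, 0xf0 -> 4, 0xfc -> 6 */
static unsigned utf8_seq_len(unsigned char start)
{
        unsigned x = start; /* zero-extended to 32 bits */
        return __builtin_clz(x ^ 0xff) - 24;
}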


    p a pointer to the first
    byte. Some unrelated consistency checks removed.

    mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */

    That includes one more bit than you want.

    Indeed. 8 - u_len is the shift index of the last non-zero prefix bit. It
    should have been 7 - u_len or (8 - (u_len + 1)).


    [...]

    I don't see why you need to look at the next byte.

    if (!x) { /* x == 0 implies u_len > 2 */

    x == 0 implies an overlong sequence now that you have dealt with the
    length 2 case which can have one bit on x set and still be overlong.

According to the Linux man page, a number in the range 0x800 - 0xffff is
encoded as three bytes:

    1110xxxx 10xxxxxx 10xxxxxx

    Program encoding 0x800 in this way:

    --------
    #include <stdio.h>

int main(void)
{
        unsigned u = 0x800;
        unsigned char utf[3], *p;

        p = utf;
        *p++ = 0xe0 | (u >> 12);
        *p++ = 0x80 | ((u >> 6) & 63);
        *p = 0x80 | (u & 63);

        printf("%x %x %x\n", *utf, utf[1], utf[2]);

        return 0;
}
    -------

And the output is e0 a0 80. The situation is similar for all other
ranges, including the 4-byte sequence which is actually supposed to be
used: the first number of the range corresponds with a set bit in the
second byte.

I may again have gotten something wrong here. But I've done all the
calculations twice and got the same result.

  • From Rainer Weikusat@21:1/5 to Rainer Weikusat on Mon Dec 20 19:30:33 2021
    Rainer Weikusat <rweikusat@talktalk.net> writes:
    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

    [...]


    I don't see why you need to look at the next byte.

    if (!x) { /* x == 0 implies u_len > 2 */

    x == 0 implies an overlong sequence now that you have dealt with the
    length 2 case which can have one bit on x set and still be overlong.

According to the Linux man page, a number in the range 0x800 - 0xffff is
encoded as three bytes:

    1110xxxx 10xxxxxx 10xxxxxx

    Program encoding 0x800 in this way:

    [...]

    And the output is e0 a0 80.

Addition: Not technically a proof of correctness, but the ActionCable
(rot in hell) UTF-8 checker I have to placate accepts 0xe0 0xa0 0x80 as
a valid UTF-8 sequence.

  • From Ben Bacarisse@21:1/5 to Rainer Weikusat on Mon Dec 20 20:46:21 2021
    Rainer Weikusat <rweikusat@talktalk.net> writes:

    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
    <cut>
    I don't see why you need to look at the next byte.

    if (!x) { /* x == 0 implies u_len > 2 */

    x == 0 implies an overlong sequence now that you have dealt with the
    length 2 case which can have one bit on x set and still be overlong.

According to the Linux man page, a number in the range 0x800 - 0xffff is
encoded as three bytes:

    Yup. I was not thinking. You need to look at the first two bytes when
    the sequence length is > 2.

    If b1 is 0xE0 then b2 must be >= 0xA0.
    If b1 is 0xF0 then b2 must be >= 0x90.
    If b1 is 0xF8 then b2 must be >= 0x88.
    If b1 is 0xFC then b2 must be >= 0x84.

In this diagram, the ^ marks the least significant bit that must be set
for the encoding not to be overlong:

110x xxxx 10xx xxxx
       ^
1110 xxxx 10xx xxxx 10xx xxxx
            ^
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
             ^
1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
               ^
1111 110x 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
                ^
In terms of bit masks,

(b2 & 0x3f) >> (8 - ulen)

must be non-zero for ulen > 2.
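
Putting the two cases together, an untested sketch (b1 and b2 are the
first two bytes, ulen the length implied by the start byte):

/* overlong check from the first two bytes alone */
static int is_overlong(unsigned char b1, unsigned char b2, unsigned ulen)
{
        if (ulen == 2)
                return (b1 & 0x1f) < 2;                 /* b1 is C0 or C1 */
        return (b1 & (0xff >> (ulen + 1))) == 0         /* value bits of b1 clear */
               && ((b2 & 0x3f) >> (8 - ulen)) == 0;     /* high value bits of b2 clear */
}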

    Sorry for the noise.

    --
    Ben.

  • From Rainer Weikusat@21:1/5 to Ben Bacarisse on Mon Dec 20 21:24:19 2021
    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
    Rainer Weikusat <rweikusat@talktalk.net> writes:
    Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
    <cut>
    I don't see why you need to look at the next byte.

    if (!x) { /* x == 0 implies u_len > 2 */

    x == 0 implies an overlong sequence now that you have dealt with the
    length 2 case which can have one bit on x set and still be overlong.

    According to the Linux man page, a number in the range 0x800 - 0xffff is
    encoded as three bytes:

    [...]

    In terms of bit masks,

(b2 & 0x3f) >> (8 - ulen)

An idea I meanwhile had myself, too: the expressions become simpler
when shifting out the unwanted bits instead of selecting the wanted
ones via &-masking. The one above is for the second byte, the one for
the first would be

b1 << u_len

with the special case that a result of 4 means it's overlong when the
length of the sequence is 2.
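
As an untested sketch (the & 0xff is needed because C promotes the
operands to int before shifting):

/* shift-based overlong check from the first two bytes */
static int is_overlong2(unsigned char b1, unsigned char b2, unsigned u_len)
{
        unsigned v1 = (b1 << u_len) & 0xff; /* prefix of b1 shifted out */

        if (u_len == 2)
                return v1 <= 4; /* 0 or 4: at most the lowest value bit set */
        return v1 == 0 && ((b2 & 0x3f) >> (8 - u_len)) == 0;
}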
