• Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings hand

    From Nicolas Paul Colin de Glocester@21:1/5 to All on Sun Aug 31 23:27:49 2025
    XPost: fr.comp.lang.ada

    Dear Mister Chadwick,

    Thanks for this contribution.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kevin Chadwick@21:1/5 to All on Sun Aug 31 21:23:27 2025
    Most languages only support working in one encoding: Go uses UTF-8 and
    Dart UTF-16. Perhaps Ada was too ambitious, but Wide_Wide worked for me
    when I needed it. Finalising support is a potential aim of the next standard.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alex // nytpu@21:1/5 to All on Tue Sep 2 10:01:34 2025
    XPost: fr.comp.lang.ada

    I've written about this at length before because it's a major pain
    point; but I can't find any of my old writing on it so I've rewritten it
    here lol. I go into extremely verbose detail on all the recommendations
    and the issues at play below, but to summarize:
    - You really should use Unicode both in storage/interchange and internally
    - Use Wide_Wide_<> internally everywhere in your program
    - Use Ada's Streams facility to read/write external text as binary,
    transcoding it manually using UTF_Encoding (or custom implemented
    routines if you need non-Unicode encodings)
    - You can use Text_Streams to get a binary stream even from stdin/stdout/stderr, although with some annoying caveats regarding
    Text_IO adding spurious end-of-file newlines when writing
    - Be careful with string functions that inspect the contents of strings,
    even for Wide_Wide_Strings, because Unicode can have tricky issues
    (basically, only ever look for/split on/etc. hardcoded, known-valid
    sequences/characters, due to issues with multi-codepoint graphemes)

    ***

    Right off the bat, in modern code either on its own or interfacing with
    other modern code, you really should use Unicode, and really really
    should use UTF-8. If you use Latin-1 or Windows-1252 or some weird
    regional encoding everyone will hate you, and if you restrict inputs to
    7-bit ASCII everyone will hate you too lol. And people will get annoyed
    if you use UTF-16 or UTF-32 instead of UTF-8 as the interchange/storage
    format in a new program.

    But first, looking at how you deal with text internally within your
    program, you *really* have two options (technically there are more, but
    the others are not good): storing UTF-8 in Strings (you have to use a
    String even for individual characters), or storing UTF-32 in
    Wide_Wide_String/Wide_Wide_Character.

    When storing UTF-8 in a String (for good practice, use the
    Ada.Strings.UTF_Encoding.UTF_8_String subtype just to indicate that it
    is UTF-8 and not Latin-1), the main thing is that you have to be very
    cautious with (and really should just avoid where possible) any of
    the built-in String/Unbounded_String utilities that inspect or
    manipulate the contents of the text.
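
    As a sketch of that convention (the subtype is purely documentary, it
    doesn't add any validity check; the names are mine, and I build the
    literal with 'Val so the source file's encoding doesn't matter):

    with Ada.Strings.UTF_Encoding;
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

    procedure Subtype_Demo is
       package UTF renames Ada.Strings.UTF_Encoding;
       package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

       --  U+00F6, built with 'Val to avoid depending on the source encoding.
       O_Umlaut : constant Wide_Wide_Character := Wide_Wide_Character'Val (16#F6#);

       --  UTF_8_String is just a subtype of String; the name only documents
       --  the intended encoding.
       Greeting : constant UTF.UTF_8_String := WWS.Encode ("sch" & O_Umlaut & "n");
    begin
       null;  --  Greeting now holds the UTF-8 bytes for "schön"
    end Subtype_Demo;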

    With Wide_Wide_<>, you're technically wasting 11 out of every 32 bits of
    memory for alignment reasons---or 24 out of 32 bits with text that's
    mostly ASCII with only the occasional higher character---but eh, not
    that big a deal *on modern systems capable of running a modern hosted environment*. Note that there is zero chance in hell that UTF-32 will
    ever be adopted as an interchange or storage encoding (except in
    isolated singular corporate apps *maybe*), so UTF-32 being used should
    purely be an internal implementation detail: incoming text in whatever
    encoding gets converted to it and outgoing text will always get
    converted from it. And you should only convert at the I/O "boundary",
    don't have half of your program dealing with native string encoding and
    half dealing with Wide_Wide_<> (with the only exception being that if
    you don't need to look at the string's contents and are just passing it through, then you can and should avoid transcoding at all).
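
    To make that concrete, here's a rough sketch of the kind of boundary I
    mean (untested, and the package/function names are mine): everything
    inside the program stays Wide_Wide_String, and these two helpers are the
    only places a byte encoding appears.

    with Ada.Strings.UTF_Encoding;
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

    package Text_Boundary is
       package UTF renames Ada.Strings.UTF_Encoding;
       package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

       --  Incoming bytes (from a stream, a socket, ...) -> internal UTF-32.
       function From_External (Raw : UTF.UTF_8_String) return Wide_Wide_String
         is (WWS.Decode (Raw));

       --  Internal UTF-32 -> outgoing UTF-8 bytes, ready to be written.
       function To_External (Text : Wide_Wide_String) return UTF.UTF_8_String
         is (WWS.Encode (Text));
    end Text_Boundary;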

    I personally use Wide_Wide_<> for everything just because it's more
    convenient to have more useful built-in string functions, and it makes
    dealing with input/output encoding much easier later (detailed below).

    I would never use Wide_<> unless exclusively targeting Windows or
    something, because UTF-16 is just inconvenient: it has none of the
    benefits of UTF-8, none of the benefits of UTF-32, and most of the
    downsides of both. Plus, since Ada standardized wide characters so early,
    there are additional fuckups relating to UCS-2/UTF-16 incompatibilities
    like Windows has[1], and you absolutely do not want to deal with that.

    I'm unfortunate enough to know most of the nuances of Unicode, and I
    won't subject you to all of it, but a lot of the statements in your
    collection are a bit oversimplified (UCS-4 has a number of additional
    differences from UTF-32 regarding "valid encodings", namely that all
    Unicode codepoints (0x0--0x10FFFF inclusive) are allowed in UCS-4 but only
    Unicode scalar values (0x0--0xD7FF and 0xE000--0x10FFFF inclusive) are
    valid in UTF-32), and are missing some additional information. A key
    detail is that even with UTF-32, where each Unicode scalar value is held
    in one array element rather than being variable-width like UTF-8/UTF-16,
    you still can't treat strings as arbitrary arrays the way you can with
    7-bit ASCII, because a grapheme can be made up of multiple Unicode scalar
    values. Even with ASCII characters there's the possibility of combining
    diacritics or such that would break if you split the string between them.
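
    To put numbers on it, the scalar-value check is just this (the name is
    mine, in Ada terms):

    --  True for Unicode scalar values (valid in UTF-32); surrogate code
    --  points and anything above U+10FFFF are excluded, even though UCS-4
    --  would happily carry them.
    function Is_Scalar_Value (C : Wide_Wide_Character) return Boolean is
       Code : constant Natural := Wide_Wide_Character'Pos (C);
    begin
       return Code <= 16#D7FF#
         or else (Code >= 16#E000# and then Code <= 16#10_FFFF#);
    end Is_Scalar_Value;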

    Also, I just stumbled across Ada.Strings.Text_Buffers, which seems to be
    new to Ada 2022. It makes "string builder" stuff much more convenient
    because you can write text using any of Ada's string types and then get
    a string in whatever encoding you want out of it (and with the correct
    system-specific line endings, which is a whole 'nother issue with Ada
    strings) instead of needing to fiddle with all that manually; maybe
    that'll be useful if you can use Ada 2022.
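
    Something like this (Ada 2022 only; I'm going off the standardized
    A.4.12 API, so double-check that your compiler actually supports it):

    with Ada.Strings.Text_Buffers.Unbounded;

    procedure Builder_Demo is
       Buf : Ada.Strings.Text_Buffers.Unbounded.Buffer_Type;
    begin
       Buf.Put ("a plain String piece, ");
       Buf.Wide_Wide_Put ("a Wide_Wide_String piece");
       Buf.New_Line;

       declare
          --  Extract in whichever encoding the output side wants.
          As_UTF_8 : constant String := Buf.Get_UTF_8;
       begin
          null;  --  hand As_UTF_8 to the output routine
       end;
    end Builder_Demo;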

    ***

    Okay, so I've discussed the internal representation and the issues with
    that, but now we get into input/output transcoding... this is just a
    nightmare in Ada; there's one almost-decent solution, but even it has
    caveats and bugs, uggh.

    In general, the Text_IO packages will always transcode the input file
    into whatever format you end up getting, and transcode the output you
    give them into some other format; it's annoying to configure which
    encodings are used at compile time[2] and impossible to change them at
    runtime, which makes the Text_IO packages just useless for anything
    beyond Latin-1/ASCII IMO. Even if you get GNAT whipped into shape for
    your codebase's needs, you're abandoning all portability should a
    hypothetical second Ada implementation that you might want to use arise.

    The only way to get full control of the input and output encodings is to
    use one of Ada's ways of performing binary I/O and then manually convert strings to binary yourself. I personally prefer using Streams over Sequential_IO/Direct_IO, using UTF_Encoding (or the new Text_Buffers) to convert to/from the specific format I want before reading or writing
    from the stream.
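
    Roughly like this on the read side (untested sketch, no error handling;
    it assumes Character is an octet mapping one-to-one onto stream
    elements, which holds on typical GNAT targets):

    with Ada.Streams.Stream_IO;
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

    function Read_UTF_8_File (Name : String) return Wide_Wide_String is
       use Ada.Streams.Stream_IO;
       File : File_Type;
    begin
       Open (File, In_File, Name);
       declare
          Raw : String (1 .. Natural (Size (File)));
       begin
          String'Read (Stream (File), Raw);   --  raw bytes, no transcoding
          Close (File);
          --  The one explicit transcoding step, at the boundary:
          return Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (Raw);
       end;
    end Read_UTF_8_File;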

    There is one singular bug though: if you use Ada.Text_IO.Text_Streams to
    get a byte stream from a Text_IO output file (the only way to
    read/write binary data from stdin, stdout, and stderr at all), then
    after writing, when the file is closed, an extra newline will always be
    added. The Ada standard requires that Text_IO always output a newline
    if the output didn't end with one, and the stream from Text_Streams
    completely bypasses all of the Text_IO package's bookkeeping, so from
    its perspective nothing was written to the file (let alone a newline)
    and it has to add one.[3] So you either just have to live with output
    files having an empty trailing line, or make sure to strip the final
    newline from the text you're outputting.
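
    For the stdout case it looks something like this (untested sketch; the
    final comment is the caveat I just described):

    with Ada.Text_IO;
    with Ada.Text_IO.Text_Streams;
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

    procedure Stdout_Demo is
       Output : constant Ada.Text_IO.Text_Streams.Stream_Access :=
         Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Standard_Output);

       Pi : constant Wide_Wide_Character := Wide_Wide_Character'Val (16#03C0#);

       Text : constant String :=
         Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
           ("pi is written " & Pi & Wide_Wide_Character'Val (10));
    begin
       --  Bytes go out untouched, bypassing Text_IO's transcoding ...
       String'Write (Output, Text);
       --  ... but Text_IO's own bookkeeping saw nothing written, so it will
       --  still append a newline of its own when the program finishes.
    end Stdout_Demo;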

    ***

    Sorry for it being so long, but that's the horror of working with text
    XD, particularly in older things like Ada that didn't have the benefit of
    modern hindsight about how text encoding would end up and had to bolt on
    solutions afterwards that just don't work right. Although at least
    Ada is better than the unfixable, un-work-aroundable C/C++ nightmare[4]
    or Windows or really any software created prior to Unicode 1.1 (1993).

    ~nytpu

    [1]: https://wtf-8.codeberg.page/#ill-formed-utf-16
    [2]: The problem is that GNAT completely changes how the Text_IO packages
    behave with regards to text encoding through opaque methods. The
    encodings used by Text_IO are mostly (but not entirely) based off of the
    `-gnatW` flag, which configures the encoding of THE PROGRAM'S SOURCE
    CODE. Absolutely batshit that they abused the source file encoding flag
    as the only way for the programmer to configure what encoding the program
    reads and writes, which is completely orthogonal to the source code.
    [3]: When I was more active on IRC, either Lucretia or Shark8 (who you
    both quoted) would whine about this every chance possible lol. It is
    extremely annoying even when you use Text_IO directly rather than
    through streams, because it's messing with my damn file even when I
    didn't ask it to.
    [4]: https://github.com/mpv-player/mpv/commit/1e70e82baa91

    --
    Alex // nytpu
    https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dmitry A. Kazakov@21:1/5 to All on Tue Sep 2 20:08:36 2025
    XPost: fr.comp.lang.ada

    On 2025-09-02 18:01, Alex // nytpu wrote:
    > I've written about this at length before because it's a major pain
    > point; but I can't find any of my old writing on it so I've rewritten it
    > here lol.
    The matter is quite straightforward:

    1. Never ever use Wide and Wide_Wide. There is a marginal case of
    Windows API where you need Wide_String for UTF-16 encoding. Otherwise,
    use cases are absent. No text processing algorithms require code point
    access.

    2. Use Character as octet. String as UTF-8 encoded.

    That is all.

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Nicolas Paul Colin de Glocester@21:1/5 to All on Tue Sep 2 19:40:07 2025
    XPost: fr.comp.lang.ada

    Alex // nytpu wrote during this decade, specifically today:
    |---------------------------------------------------------------------------
    | "I can't find any of my old writing on it so I've rewritten it
    | here lol."
    |---------------------------------------------------------------------------

    Dear Alex:

    A teammate had once solved a problem but had forgotten how he solved
    it. So he queried a search engine, and it showed him a webpage with a
    perfect solution --- a webpage written by him!

    I recommend searching for that old writing about Unicode: perhaps it
    has more details than this comp.lang.ada thread, or perhaps a
    perspective has been changed in an interesting way. Even if there is
    no difference, perhaps it is in a directory with other missing files
    which need to be backed up!

    |---------------------------------------------------------------------------
    | "If you use Latin-1 or Windows-1252 or some weird
    | regional encoding everyone will hate you, and if you restrict inputs to
    | 7-bit ASCII everyone will hate you too lol. And people will get annoyed
    | if you use UTF-16 or UTF-32 instead of UTF-8 as the interchange/storage
    | format in a new program."
    |---------------------------------------------------------------------------

    I quote Usenet articles in a way which does not endear me to
    persons. Not everyone reacts in the same way. OC Systems asked me how
    I draw those boxes.

    I advocate Ada which also does not endear me to persons.

    |---------------------------------------------------------------------------
    | "[. . .]
    |
    | I personally use Wide_Wide_<> for everything just because it's more
    | convenient to have more useful built-in string functions, and it makes
    | dealing with input/output encoding much easier later (detailed below).
    |
    | [. . .]
    |
    | I'm unfortunate enough to know most of the nuances of Unicode but I
    | won't subject you to it, but a lot of the statements in your collection
    | are a bit oversimplified (UCS-4 has a number of additional differences
    | from UTF-32 regarding "valid encodings", [. . .]
    | [. . .]"
    |---------------------------------------------------------------------------

    Thanks for this feedback, and more will be as welcome as can be. I
    quoted examples of what I found in this newsgroup. This newsgroup did
    not have many statements with explicit references to "UTF-32" or
    "UTF32" or "UCS-4" which differ overwhelmingly from what I quoted
    during the previous week.

    |---------------------------------------------------------------------------
    | "Also, I just stumbled across Ada.Strings.Text_Buffers which seems to be
    | new to Ada 2022, makes "string builder" stuff much more convenient
    | because you can write text using any of Ada's string types and then get
    | a string in whatever encoding you want [. . .]
    | [. . .]"
    |---------------------------------------------------------------------------

    Package Ada.Strings.Text_Buffers does not support UCS-4.

    |---------------------------------------------------------------------------
    | "Note that there is zero chance in hell that UTF-32 will ever be adopted as
    | an interchange or storage encoding (except in isolated singular corporate
    | apps *maybe*), so UTF-32 being used should purely be an internal
    | implementation detail: incoming text in whatever encoding gets converted to
    | it and outgoing text will always get converted from it."
    |---------------------------------------------------------------------------

    One can know, but what one too optimistically knows can turn out to be
    false. Character sets and encodings have long been subjects of
    unfulfilled expectations.

    I can say that for now, UTF-8 is enough for a particular application.

    Deadly Head did not have the same luck.

    |---------------------------------------------------------------------------
    | "The encodings used by
    | Text_IO are mostly (but not entirely) based off of the `-gnatW` flag, which
    | is configuring the encoding of THE PROGRAM'S SOURCE CODE."
    |---------------------------------------------------------------------------

    GNAT has many switches. It could easily gain more switches.

    Kind regards.



    Nicolas Paul Colin de Glocester

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Nicolas Paul Colin de Glocester@21:1/5 to All on Tue Sep 2 19:42:50 2025
    XPost: fr.comp.lang.ada

    The first endnote (i.e.
    [1]: https://wtf-8.codeberg.page/#ill-formed-utf-16
    ) in news:10974d1$jn0e$1@dont-email.me
    is not reproduced in
    HTTPS://nytpu.com/gemlog/2025-09-02
    I do not know if that is intentional. Thanks for saying "It has an amusing large collection of quotes" on that webpage.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alex // nytpu@21:1/5 to Dmitry A. Kazakov on Tue Sep 2 13:13:12 2025
    XPost: fr.comp.lang.ada

    On 9/2/25 12:08 PM, Dmitry A. Kazakov wrote:
    > The matter is quite straightforward:
    Objectively false, "text" is never actually straightforward despite what
    it seems like on a surface level :P

    > 1. Never ever use Wide and Wide_Wide. There is a marginal case of
    > Windows API where you need Wide_String for UTF-16 encoding. Otherwise,
    > use cases are absent. No text processing algorithms require code point
    > access.
    Somewhat inclined to agree about Wide_<>, but I don't see strong
    justification to *never* use Wide_Wide_<>; there are pretty substantial
    tradeoffs to both UTF-32 and UTF-8 (in any programming language that
    supports both, but particularly with Ada's string situation), so
    unfortunately it ultimately falls on the programmer to understand and
    choose.

    > 2. Use Character as octet. String as UTF-8 encoded.
    Perfectly valid, and explicitly mentioned as an option in my post. Maybe
    it actually would be better for most applications because they wouldn't
    need to transcode at all; I should've noted that more clearly in my
    original response. The only two issues: make sure to avoid the Latin-1
    String routines unless you know what you're doing is sound; and in older
    Ada versions I remember reading long debates about whether the String
    type could safely store UTF-8 on many compilers (of the era), but IIRC
    that issue was clarified by Ada 95, if not earlier.
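
    To make the Latin-1-routine caveat concrete: searching/splitting on a
    hardcoded ASCII delimiter is fine with UTF-8 in a String, since UTF-8's
    multi-byte sequences never contain bytes that collide with ASCII;
    per-Character stuff like To_Upper or Translate is what breaks. A rough
    sketch (names are mine):

    with Ada.Strings.Fixed;
    with Ada.Strings.UTF_Encoding;

    procedure Split_Demo is
       subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

       --  "key=schön", with the ö spelled out as its UTF-8 bytes 16#C3# 16#B6#.
       Line : constant UTF_8_String :=
         "key=sch" & Character'Val (16#C3#) & Character'Val (16#B6#) & "n";

       Eq : constant Natural := Ada.Strings.Fixed.Index (Line, "=");
    begin
       if Eq > 0 then
          declare
             Key   : constant UTF_8_String := Line (Line'First .. Eq - 1);
             Value : constant UTF_8_String := Line (Eq + 1 .. Line'Last);
          begin
             null;  --  both slices are still well-formed UTF-8
          end;
       end if;
    end Split_Demo;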

    I just personally prefer Wide_Wide_<> to get its slightly more
    Unicode-aware string routines, but it's not the only (or even inherently
    the best) option.

    ~nytpu

    --
    Alex // nytpu
    https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alex // nytpu@21:1/5 to Nicolas Paul Colin de Glocester on Tue Sep 2 13:15:07 2025
    XPost: fr.comp.lang.ada

    On 9/2/25 11:42 AM, Nicolas Paul Colin de Glocester wrote:
    > The first endnote (i.e.
    > [1]: https://wtf-8.codeberg.page/#ill-formed-utf-16
    > ) in news:10974d1$jn0e$1@dont-email.me
    > is not reproduced in
    > HTTPS://nytpu.com/gemlog/2025-09-02
    > I do not know if that is intentional.
    I just converted the footnote to an inline link since HTML supports it
    while plaintext posts don't.

    > Thanks for saying "It has an amusing large collection of quotes" on
    > that webpage.
    It is a very thorough collection, I liked it.

    (Also I didn't think to ask before posting your original message or my
    reply to my website, sorry. I'll take it down if you don't want it
    rehosted like that)

    ~nytpu

    --
    Alex // nytpu
    https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Nicolas Paul Colin de Glocester@21:1/5 to Keith Thompson on Tue Sep 2 21:27:39 2025
    XPost: fr.comp.lang.ada

    On Tue, 2 Sep 2025, Keith Thompson wrote:
    "> I quote Usenet articles in a way which does not endear me to
    persons. Not everyone reacts in the same way. OC Systems asked me how
    do I draw those boxes.

    Why do you do that?"

    Such a quoting style is correlated with a possibly misguided perception
    that a language does not have a quotation mark at the beginning of each intermediate line. Indications that this perception is misguided are
    English documents which are supposedly from decades before Ada 83 which do indeed show a "“" (i.e. an English opening quotation mark) at the
    beginning of each intermediate line.

    However I am not interested enough in English and I do not have enough
    time to investigate whether or not that is the real way to quote in
    English. If one could show me an authoritative document older than the
    20th century on how to write in English which declares so, then it might
    nudge me.

    I had not originally believed that drawing rectangles for embedded
    quotations is annoying, as others used to draw so before me. However, unfortunately these rectangles clearly annoy Mister Thompson. Sorry!

    " It seems like a lot of effort to produce an
    annoying result."

    No effort! As I wrote to OC Systems on
    Date: Wed, 2 Jul 2008 16:34:41 -0400 (EDT)
    long after I wrote an Emacs-Lisp code for these quotations:
    "Thank you for asking. At least so far as I have noticed, you are the
    first person to have asked me that even though I have been using them
    since last year. They are largely created by an Emacs Lisp function
    which I wrote (see far below) to save me labor, [. . .]
    [. . .]
    [. . .] (Emacs Lisp is terrible, but it is commonly available on
    email servers and I was using a buggy Common Lisp program at the time
    so I thought that drawing the boxes in Emacs Lisp might serve as some
    practice for bug fixing in Common Lisp.)

    [. . .]"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Nicolas Paul Colin de Glocester@21:1/5 to All on Tue Sep 2 21:50:44 2025
    XPost: fr.comp.lang.ada

    On Tue, 2 Sep 2025, Alex // nytpu wrote:
    "It is a very thorough collection,"

    Dear Alex,

    False. This collection does not quote all the comp.lang.ada articles
    referring to "UTF-32" or "UTF32" or "UCS-4" etc. that I read during the
    previous week, but there is largely no difference in substance in the
    ones that I read during the previous week and decided not to quote. So
    as to have a good Subject: header, I had quite a job deciding which
    article to press the reply button on. I wanted to reply to
    Subject: Re: Supporting full Unicode
    but I did not because I did not actually quote anything from that thread.

    On Tue, 2 Sep 2025, Alex // nytpu wrote:
    "I liked it."

    Thanks and welcome.

    On Tue, 2 Sep 2025, Alex // nytpu wrote:
    "(Also I didn't think to ask before posting your original message or my
    reply to
    my website,"

    No need to ask so.

    On Tue, 2 Sep 2025, Alex // nytpu wrote:
    "I'll take it down if you don't want it rehosted like
    that)"

    I do not oppose rehosting it.

    Actually, though
    HTTPS://Usenet.Ada-Lang.IO/comp.lang.ada
    is excellent and better than all other comp.lang.ada archives, it unfortunately lacks a few non-spam posts.

    Kind regards.



    Nicolas Paul Colin de Glocester

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D’Oliveiro@21:1/5 to All on Tue Sep 2 22:56:12 2025
    XPost: fr.comp.lang.ada

    On Tue, 2 Sep 2025 10:01:34 -0600, Alex // nytpu wrote:

    > ... (UCS-4 has a number of additional differences from UTF-32
    > regarding "valid encodings", namely that all valid Unicode
    > codepoints (0x0--0x10FFFF inclusive) are allowed in UCS-4 but only
    > Unicode scalar values (0x0--0xD7FF and 0xE000--0x10FFFF inclusive)
    > are valid in UTF-32) ...

    So what do those codes mean in UCS-4?

    > ... and are missing some additional information: a key detail is
    > that even with UTF-32 where each Unicode scalar value is held in one
    > array element rather than being variable-width like UTF-8/UTF-16,
    > you still can't treat them as arbitrary arrays like 7-bit ASCII
    > because a grapheme can be made up of multiple Unicode scalar values.
    > Even with ASCII characters there's the possibility of combining
    > diacritics or such that would break if you split the string between
    > them.

    This is why you have “normalization”. <https://www.unicode.org/faq/char_combmark.html>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alex // nytpu@21:1/5 to All on Tue Sep 2 18:20:09 2025
    XPost: fr.comp.lang.ada

    On 9/2/25 4:56 PM, Lawrence D’Oliveiro wrote:
    > On Tue, 2 Sep 2025 10:01:34 -0600, Alex // nytpu wrote:
    >> ... (UCS-4 has a number of additional differences from UTF-32
    >> regarding "valid encodings", namely that all valid Unicode
    >> codepoints (0x0--0x10FFFF inclusive) are allowed in UCS-4 but only
    >> Unicode scalar values (0x0--0xD7FF and 0xE000--0x10FFFF inclusive)
    >> are valid in UTF-32) ...
    >
    > So what do those codes mean in UCS-4?
    Unfortunately, here's where you get more complexity. So there's a
    difference between a valid codepoint/scalar value and an assigned scalar
    value. The vast majority of valid scalar values are unassigned
    (currently 154,998 characters are standardized out of 1,114,112 possible characters), but everything other than text renderers and normalizers
    should handle them like any other character to allow for at least some
    level of forwards compatibility when new characters are added.

    So in a UCS-4 (or any UCS-<>) implementation, they're just treated like
    unassigned codepoints (ones that will never be assigned, not that the
    implementation would know); in UTF-32 they're completely invalid and
    should not be represented at all. Implementations should either error
    out or replace them with the replacement character U+FFFD in order to
    ensure they're always working with valid UTF-32. (This is what makes the
    Windows character set and Ada's Wide_Strings messy: because they were
    originally standardized before UTF-16 existed, they still support
    unpaired surrogates for backwards compatibility, so you have to sanitize
    the text yourself to avoid making your UTF-8 encoder, or the other
    software reading your text, declare the encoding invalid.)
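
    The sanitizing step is basically just this (the name is mine, untested
    sketch):

    function Sanitize (Input : Wide_Wide_String) return Wide_Wide_String is
       Replacement : constant Wide_Wide_Character :=
         Wide_Wide_Character'Val (16#FFFD#);
       Result : Wide_Wide_String := Input;
    begin
       for C of Result loop
          declare
             Code : constant Natural := Wide_Wide_Character'Pos (C);
          begin
             if Code in 16#D800# .. 16#DFFF# or else Code > 16#10_FFFF# then
                C := Replacement;   --  U+FFFD, the replacement character
             end if;
          end;
       end loop;
       return Result;
    end Sanitize;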

    (This whole mess is because UCS-2 and the UCS-2->UTF-16 transition was a hellish mess caused by extreme lack of foresight and it's horrible they
    saddled everyone, including people not using UTF-16, with this crap.
    UTF-16 and its surrogate pairs is also what's responsible for the
    maximum scalar value being 0x0010_FFFF instead of 0xFFFF_FFFF even
    though UTF-32 and UTF-8 and even goddamn UTF-EBCDIC and the weird
    encoding the Chinese government came up with can all trivially encode
    full 32-bit values)

    > This is why you have “normalization”.
    > <https://www.unicode.org/faq/char_combmark.html>
    Still can't just arbitrarily split strings without being careful; there
    are characters that are inherently multi-codepoint (e.g. most emoji,
    among others) with no possibility of being reduced to a single
    codepoint like some can be. Really, unfortunately, with Unicode you just
    shouldn't try to make use of an "array" of any fixed-size quantity,
    because with multi-codepoint graphemes and combining characters and such
    it's just not possible.

    Plus conveniently Ada doesn't have routines for normalization, but can't
    hold that against it since neither does any other programming language
    because the lookup tables required are like 20 MiB even when optimized
    for space. (Everyone says to just link to libicu, which also lets you
    get out of needing to keep your program's Unicode tables up-to-date when
    a new Unicode version releases)

    Plus you shouldn't normalize text other than when performing actions like
    substring matching, equality tests, or sorting---and even if you
    normalize when performing those, *when possible* you should store the
    unnormalized original for display/output afterwards. Normalization
    causes lots of semantic information loss because many distinct
    characters are mapped onto one (e.g. non-breaking spaces and zero-width
    spaces are mapped to plain space, mathematical font variants and
    superscripts are mapped to the plain Latin/Greek versions, many
    different languages' characters are mapped to one if the characters
    happen to be visually similar, etc. etc.).

    ~nytpu

    --
    Alex // nytpu
    https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D’Oliveiro@21:1/5 to All on Wed Sep 3 04:10:47 2025
    XPost: fr.comp.lang.ada

    On Tue, 2 Sep 2025 18:20:09 -0600, Alex // nytpu wrote:

    > (This whole mess is because UCS-2 and the UCS-2->UTF-16 transition was a
    > hellish mess caused by extreme lack of foresight and it's horrible they
    > saddled everyone, including people not using UTF-16, with this crap.

    I gather the basic problem was that Unicode was originally going to be a
    fixed-length 16-bit code, and that was that. And so early adopters
    (Windows NT and Java among them) built UCS-2 right into their DNA.

    Until Unicode 2.0, I believe it was, where they went “on second thought,
    let’s go beyond our original brief and start including all kinds of other
    things as well” ... and UCS-2 had to become UTF-16 ...

    > UTF-16 and its surrogate pairs is also what's responsible for the
    > maximum scalar value being 0x0010_FFFF instead of 0xFFFF_FFFF even
    > though UTF-32 and UTF-8 and even goddamn UTF-EBCDIC and the weird
    > encoding the Chinese government came up with can all trivially encode
    > full 32-bit values)

    I wondered about that limit ...

    > Plus conveniently Ada doesn't have routines for normalization, but can't
    > hold that against it since neither does any other programming language
    > because the lookup tables required are like 20 MiB even when optimized
    > for space.

    I think Python has them
    <https://docs.python.org/3/library/unicodedata.html>. But then, on
    platforms with decent package management, that data can be shared with
    other installed packages that require it as well.

    > Plus you shouldn't normalize text other than performing actions like
    > substring matching, equality tests, or sorting---and even if you
    > normalize when performing those, *when possible* you should store the
    > unnormalized original for display/output afterwards.

    I thought it was always safe to store decomposed versions of everything.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)