• Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling

    From Nicolas Paul Colin de Glocester@21:1/5 to Ludovic Brenta on Sun Aug 31 19:39:56 2025
    XPost: fr.comp.lang.ada

    Dear Adaists,

    Björn Persson wrote during 2006:
    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
    $"Gnat's approach to character encodings is$
    $amazingly faulty."                        $
    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

    Björn Persson wrote during 2006:
    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
    $"> System.WCh_Cnv confound JIS character code with Unicode, it makes   $
    $> troubles. Wide_Text_IO (and -gnatWs, -gantWe) are useless in fact,   $
    $> because there is no what uses JIS character code as it is, conversion$
    $> is needed after all.                                                 $
    $                                                                       $
    $I haven't used that package myself so I don't know how it works, but I $
    $won't be surprised if it's buggy. In my experience, Adacore's handling $
    $of character encodings is rather unimpressive."                        $
    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

    Deadly Head wrote during 2010:
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %"This is a pretty big deal to me. For a long time I've been a bit...  %
    %frustrated? ... by the fact that the Ada standard specifically gives  %
    %us Wide_ and Wide_Wide_Characters and their associated strings, but   %
    %actually _using_ them seemed pretty much worthless. I mean, if you    %
    %can't actually _talk_ with them to a modern system (UTF-8 or UTF-16   %
    %encoding seems to be pretty much the way it goes), what's the point in%
    %using them?                                                           %
    %                                                                      %
    %So I'm pretty happy with using either the WCEM=8 or -gnatW8 methods of%
    %setting the encoding to get UTF-8 input and output. What I'm          %
    %wondering now is can I get other UTF outputs to work?                 %
    %                                                                      %
    %I actually have the peculiar case of dealing with UTF-32 encoded      %
    %files, which need to be translated to UTF-8 for editing, and back to  %
    %UTF-32 for machine-use again. It seems that it would be pretty        %
    %straight-forward to just pull the file in with a straight             %
    %Wide_Wide_Text_IO.Open/Get_Line system, then output via               %
    %Wide_Wide_Text_IO.Put on a file where Form => "WCEM=8". So far,       %
    %though, I'm having trouble [. . .]"                                   %
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    Ludovic Brenta wrote during 2014:
    |-------------------------------------------------------------------------|
    |"As for the text that your program must process, that's really up to you.|
    |Ada 95 added the Wide_Character and Wide_String to help you use 16-bit   |
    |characters (not exactly UTF-16, rather supporting only the first plane   |
    |of the Unicode character set); Ada 2005 added Wide_Wide_Character for    |
    |32-bit characters (i.e. UTF-32 encoding) The String Encoding package is  |
    |there to help you transcode text between 8-bit Latin_1, UTF-8, proper    |
    |UTF-16 and UTF-32. The new packages are there to help you but they       |
    |don't do anything that wasn't possible in previous versions of Ada       |
    |(i.e. you could reimplement them in Ada 95 if you so wished)."           |
    |-------------------------------------------------------------------------|

    Yannick Duchêne (Hibou57) wrote during 2010:
    ##############################################################################
    #"Extract from the thread “S-expression I/O in Ada”. Subtopic moved in a     #
    #separate thread for clarity.                                                #
    #                                                                            #
    #On Wed, 18 Aug 2010 15:16:50 +0200, J-P. Rosen <rosen@adalog.fr> wrote:     #
    #> Slightly OT, but you (and others) might be interested to know that Ada    #
    #> 2012 will include string encoding packages to the various UTF-X           #
    #> encodings. These will be (are?) provided very soon by GNAT.               #
    #>                                                                           #
    #> See AI05-137-2                                                            #
    #> (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0137-2.txt?rev=1.2)#
    #                                                                            #
    #Time for my stupid question of the day :)                                   #
    #                                                                            #
    #I've noticed this introduction in the last amendment, because Unicode has   #
    #always been an issue/matter for me (actually use my own).                   #
    #                                                                            #
    #I could not avoid two questions: why no UTF-32 ? (this would not be an      #
    #implementation nightmare) and why BOM handled for each string while BOM is  #
    #to be used at stream/file level ? (see XML or HTML files for example). Or   #
    #are these strings supposed to hold the whole content of a file/stream ?     #
    #                                                                            #
    #Quote:                                                                      #
    #http://www.unicode.org/faq/utf_bom.html                                     #
    #> A: A byte order mark (BOM) consists of the character code U+FEFF at the   #
    #> beginning of a data stream                                                #
    #                                                                            #
    #This is a FAQ at Unicode.org; but all references (Unicode PDF files, XML    #
    #reference, HTTML reference) all says the same.                              #
    #                                                                            #
    #This matter, because the code point U+FEFF can stands for two different     #
    #things: Zero Width No Break Space or encoding Byte Order Mark. The only     #
    #way to distinguish both usage, is where-it-appears.                         #
    #                                                                            #
    #If it appears as the first code point of a stream, this is a BOM            #
    #(heuristics may be applied to automatically switch encoding with an         #
    #analysis of the first byte of a stream, this is what I do) ; if this        #
    #appears any where else in a stream, this is a character code point."        #
    ##############################################################################

    Contrary to “Ada 2012 will include string encoding packages to the
    various UTF-X encodings”, the standard Ada package
    Ada.Strings.UTF_Encoding does not offer UTF-32 as an encoding scheme!
    Even Ada 2022 lacks it!

    "Table 23-6. Unicode Encoding Scheme Signatures
    Encoding Scheme        Signature
    UTF-8                  EF BB BF
    UTF-16 Big-endian      FE FF
    UTF-16 Little-endian   FF FE
    UTF-32 Big-endian      00 00 FE FF
    UTF-32 Little-endian   FF FE 00 00"
    says HTTPS://WWW.Unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G19635
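
    The signatures in Table 23-6 can be tested in a few lines of Ada. What
    follows is only a minimal sketch: the type Signature, the function
    Detect and the Sig_* constants are hypothetical names of mine, not
    declarations from any standard package, and the octets are assumed to
    have been read into a String already. The UTF-32 signatures are tested
    first because FF FE 00 00 begins with the UTF-16 Little-endian
    signature FF FE.

```ada
with Ada.Text_IO;

procedure BOM_Sniff_Demo is

   --  All names below are hypothetical; none comes from a standard package.
   type Signature is (None, UTF_8, UTF_16BE, UTF_16LE, UTF_32BE, UTF_32LE);

   function Oct (Value : Natural) return Character is (Character'Val (Value));

   --  Signatures from Table 23-6, stored as raw octets in Strings.
   Sig_UTF_8    : constant String := Oct (16#EF#) & Oct (16#BB#) & Oct (16#BF#);
   Sig_UTF_16BE : constant String := Oct (16#FE#) & Oct (16#FF#);
   Sig_UTF_16LE : constant String := Oct (16#FF#) & Oct (16#FE#);
   Sig_UTF_32BE : constant String :=
     Oct (0) & Oct (0) & Oct (16#FE#) & Oct (16#FF#);
   Sig_UTF_32LE : constant String :=
     Oct (16#FF#) & Oct (16#FE#) & Oct (0) & Oct (0);

   --  Classify the first octets of a stream.
   function Detect (Head : String) return Signature is
      function Starts_With (Prefix : String) return Boolean is
        (Head'Length >= Prefix'Length
         and then
         Head (Head'First .. Head'First + Prefix'Length - 1) = Prefix);
   begin
      --  UTF-32 first: FF FE 00 00 also starts with FF FE. (A UTF-16LE
      --  stream whose first character after the BOM is U+0000 would be
      --  misread here; this sketch accepts that ambiguity.)
      if Starts_With (Sig_UTF_32BE) then
         return UTF_32BE;
      elsif Starts_With (Sig_UTF_32LE) then
         return UTF_32LE;
      elsif Starts_With (Sig_UTF_8) then
         return UTF_8;
      elsif Starts_With (Sig_UTF_16BE) then
         return UTF_16BE;
      elsif Starts_With (Sig_UTF_16LE) then
         return UTF_16LE;
      else
         return None;
      end if;
   end Detect;

begin
   Ada.Text_IO.Put_Line (Signature'Image (Detect (Sig_UTF_32LE & "<?xml")));
   Ada.Text_IO.Put_Line (Signature'Image (Detect (Sig_UTF_16LE & "<")));
   Ada.Text_IO.Put_Line (Signature'Image (Detect ("<?xml version='1.0'?>")));
end BOM_Sniff_Demo;
```

    A stream without any signature yields None, in which case the encoding
    has to come from elsewhere (e.g. an XML declaration or prior knowledge).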

    iconv --list
    reports many kinds: "UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE,
    UCS2, UCS4," and "UNICODE, UNICODEBIG, UNICODELITTLE," and "UTF-7-IMAP,
    UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE,
    UTF7, UTF8, UTF16, UTF16BE, UTF16LE, UTF32, UTF32BE, UTF32LE".

    "package Ada.Strings.UTF_Encoding
       with Pure is
       -- Declarations common to the string encoding packages
       type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
       subtype UTF_String is String;
       subtype UTF_8_String is String;
       subtype UTF_16_Wide_String is Wide_String;
       Encoding_Error : exception;
       BOM_8 : constant UTF_8_String :=
                    Character'Val(16#EF#) &
                    Character'Val(16#BB#) &
                    Character'Val(16#BF#);
       BOM_16BE : constant UTF_String :=
                    Character'Val(16#FE#) &
                    Character'Val(16#FF#);
       BOM_16LE : constant UTF_String :=
                    Character'Val(16#FF#) &
                    Character'Val(16#FE#);
       BOM_16 : constant UTF_16_Wide_String :=
                   (1 => Wide_Character'Val(16#FEFF#));"
    says
    HTTPS://AdaIC.org/resources/add_content/standards/22rm/html/RM-A-4-11.html without UTF-32.
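
    Since Encoding_Scheme has no UTF-32 literal, one workaround is to
    reassemble UTF-32 octets by hand into a Wide_Wide_String and then let
    the standard child package Ada.Strings.UTF_Encoding.Wide_Wide_Strings
    produce UTF-8. The following is a sketch under assumptions, not a
    definitive implementation: From_UTF_32BE is a hypothetical helper of
    mine, the input is assumed to be big-endian with any BOM already
    removed, and the three sample code points are the ones used as
    examples later in this post.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;
with Interfaces;

procedure UTF_32_To_UTF_8_Demo is

   use type Interfaces.Unsigned_32;

   package UTF renames Ada.Strings.UTF_Encoding;
   package WW  renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Hypothetical helper (not in the standard library): reassemble
   --  big-endian UTF-32 octets, held in a String, into a Wide_Wide_String.
   --  Any leading BOM is assumed to have been stripped by the caller.
   function From_UTF_32BE (Octets : String) return Wide_Wide_String is
      Result : Wide_Wide_String (1 .. Octets'Length / 4);
      Code   : Interfaces.Unsigned_32;
      Pos    : Integer := Octets'First;
   begin
      if Octets'Length mod 4 /= 0 then
         raise UTF.Encoding_Error with "octet count is not a multiple of 4";
      end if;

      for I in Result'Range loop
         Code := 0;
         for K in 0 .. 3 loop
            Code := Code * 256 + Character'Pos (Octets (Pos + K));
         end loop;

         --  Reject values beyond U+10FFFF and the surrogate range.
         if Code > 16#10_FFFF# or else Code in 16#D800# .. 16#DFFF# then
            raise UTF.Encoding_Error with "not a Unicode scalar value";
         end if;

         Result (I) := Wide_Wide_Character'Val (Code);
         Pos := Pos + 4;
      end loop;

      return Result;
   end From_UTF_32BE;

   function Oct (Value : Natural) return Character is (Character'Val (Value));

   --  U+0041, U+00FA and U+1E9B as UTF-32BE octets.
   Input : constant String :=
     Oct (0) & Oct (0) & Oct (16#00#) & Oct (16#41#)
     & Oct (0) & Oct (0) & Oct (16#00#) & Oct (16#FA#)
     & Oct (0) & Oct (0) & Oct (16#1E#) & Oct (16#9B#);

   --  The expected type selects the Encode overload that returns UTF-8.
   Output : constant UTF.UTF_8_String := WW.Encode (From_UTF_32BE (Input));

begin
   Ada.Text_IO.Put_Line ("UTF-8 octets:" & Natural'Image (Output'Length));
end UTF_32_To_UTF_8_Demo;
```

    The opposite direction would use WW.Decode on the UTF-8 String and
    then split each Wide_Wide_Character'Pos value into four octets.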

    John or Erich Rast wrote during 2014:
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^"there are plenty of converters between different Unicode versions^
    ^(UTF-8, UTF-16, UTF-32)."                                         ^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Contrast with
    "package Ada.Strings.UTF_Encoding
       with Pure is
       -- Declarations common to the string encoding packages
       type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
       [. . .]
    end Ada.Strings.UTF_Encoding;
    package Ada.Strings.UTF_Encoding.Conversions
       with Pure is
       -- Conversions between various encoding schemes
       function Convert (Item          : UTF_String;
                         Input_Scheme  : Encoding_Scheme;
                         Output_Scheme : Encoding_Scheme;
                         Output_BOM    : Boolean := False) return UTF_String;"
    says
    HTTPS://AdaIC.org/resources/add_content/standards/22rm/html/RM-A-4-11.html
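
    For the schemes which the standard does cover, the quoted Convert
    function can be called as in the following minimal sketch. The
    procedure name and the sample text are my own choices for
    illustration; only Convert and the names from
    Ada.Strings.UTF_Encoding come from the Reference Manual excerpt above.

```ada
with Ada.Strings.UTF_Encoding.Conversions;
with Ada.Text_IO;

procedure Convert_Demo is

   use Ada.Strings.UTF_Encoding;

   --  "A" followed by U+00FA, as UTF-8 octets (16#41#, 16#C3#, 16#BA#).
   Source : constant UTF_8_String :=
     'A' & Character'Val (16#C3#) & Character'Val (16#BA#);

   --  Re-encode as big-endian UTF-16 octets, without a BOM.
   Target : constant UTF_String :=
     Conversions.Convert
       (Item          => Source,
        Input_Scheme  => UTF_8,
        Output_Scheme => UTF_16BE,
        Output_BOM    => False);

begin
   Ada.Text_IO.Put_Line
     ("UTF-16BE octets:" & Natural'Image (Target'Length));
end Convert_Demo;
```

    There is no way to pass UTF-32 as Input_Scheme or Output_Scheme here,
    because Encoding_Scheme has only the three literals quoted above.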

    "A full featured character encoding converter will have to provide the following 13 encoding variants of Unicode and UCS:

    UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE"
    says
    HTTPS://WWW.CL.Cam.ac.UK/~mgk25/unicode.html

    (The same webpage says:
    "The term UTF-32 was introduced in Unicode to describe a 4-byte encoding
    of the extended “21-bit” Unicode. UTF-32 is the exact same thing as UCS-4, except that by definition UTF-32 is never used to represent characters
    above U-0010FFFF, while UCS-4 can cover all 2[**]31 code positions up to U-7FFFFFFF."

    Contrast with:
    "UCS-4 stands for “Universal Character Set coded in 4 octets.” It is now treated simply as a synonym for UTF-32, and is considered the canonical
    form for representation of characters in 10646."
    says
    HTTPS://WWW.Unicode.org/versions/Unicode16.0.0/core-spec/appendix-c
    So much for standardisation!)

    Randy L. Brukardt wrote during 2017:
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    >"In Ada, type Character = Latin-1 = first 255 code positions, 8-bit        >
    >representation. Text_IO and type String are for Latin-1 strings.           >
    >                                                                           >
    >type Wide_Charater = BMP (Basic Multilingual Plane) = first 65535 code     >
    >positions = UCS-2 = 16-bit representation.                                 >
    >                                                                           >
    >type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation. >
    >                                                                           >
    >There is no native support in Ada for UTF-8 or UTF-16 strings. There is a  >
    >conversion package (Ada.Strings.Encoding) [which is nasty because it breaks>
    >strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and>
    >Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1  >
    >(there is no good way to tell between them in the general case).           >
    >                                                                           >
    >Windows uses a BOM character at the start of UTF-8 files to differentiate  >
    >(at least in programs like Notepad and the built-in edit control), but that>
    >is not recommended by Unicode. I think they would prefer a world where     >
    >Latin-1 had disappeared completely, but that of course is not the real     >
    >world."                                                                    >
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

    Luke A. Guest wrote during 2021:
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    !"And this is there the Ada standard gets it wrong, in the encodings!
    !package re utf-8."                                                 !
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    Vadim Godunko wrote during 2021:
    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    <"Ada doesn't have good Unicode support. :( So, you need to find suitable<
    <set of "workarounds"."                                                  <
    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

    Randy L. Brukardt wrote during 2013:
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    >"Right. The proper thing to do (for Ada 2012) is to use            >
    >Ada.Characters.Wide_Handling (or Wide_Wide_Handling) to do the case>
    >conversion, after converting the UTF-8 into a Wide_String (or      >
    >Wide_Wide_String)."                                                >
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

    However, Dmitry A. Kazakov wrote during 2021: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    !"Never ever use                      !
    !Wide or Wide_Wide, they are useless."!
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    Vadim Godunko wrote during 2022:
    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    <"I think ((Wide_)Wide_)(Character|String) is obsolete for modern    <
    <systems and programming languages; more cleaner types and API is a  <
    <requirement now. The only case when old character/string types is   <
    <really makes value is low resources embedded systems; in other cases<
    <their use generates a lot of hidden issues, which is very hard to   <
    <detect."                                                            <
    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

    Maxim Reznik wrote during 2021:
    \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
    \"You can use Wide_Wide_String and Unbounded_Wide_Wide_String type to\
    \process Unicode strings. But this is not very handy. I use the      \
    \Matreshka library for Unicode strings."                             \
    \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

    I do not find Matreshka to be handy. Cf. an ALIRE failure shown below.

    Dmitry A. Kazakov wrote during 2021:
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    !"On 2021-06-21 00:50, Jeffrey R. Carter wrote:                             !
    !> On 6/20/21 8:47 PM, Dmitry A. Kazakov wrote:                             !
    !>> On 2021-06-20 20:21, Jeffrey R. Carter wrote:                           !
    !>>                                                                         !
    !>> That ship has sailed. I would say that any use of String as Latin-1 is  !
    !>> a mistake now because most of the libraries would use UTF-8 encoding    !
    !>> instead of Latin-1.                                                     !
    !>                                                                          !
    !> I have never subscribed to the illogic that if enough people make the    !
    !> same mistake, it ceases to be a mistake.                                 !
    !                                                                           !
    !The mistake is on the Ada type system design side. People repurposed       !
    !Latin-1 strings for UTF-8 strings because there was no other feasible way."!
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    Cf.
    "Why do people do this?!
    Honestly, I don't really know. This is one of those mysteries that might
    never get solved. Oh, there is one lead: it seems to be generated mostly (exclusively?) by Windows systems. Really, who would have thought?"
    says
    HTTPS://WWW.ueber.net/who/mjl/projects/bomstrip

    Cf.
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ~"For a long time, it was believed that Unicode could get by with 16 ~
    ~bits to represent the characters for all languages of the           ~
    ~world. Originally, “Unicode” was defined as “16 bit                 ~
    ~characters”. History showed this was a bad idea, but it was believed~
    ~to be true for long enough that many systems are stuck with 16 bit  ~
    ~characters; both Java and Windows, for example, deal in 16 bit      ~
    ~characters."                                                        ~
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    says
    HTTPS://EntropicThoughts.com/unicode-strings-in-ada-2012
    by Christoffer Stjernlöf.

    Cf.
    "One hundred repetitions three nights a week for four years, thought
    Bernard Marx, who was a specialist on hypnopædia. Sixty-two thousand four hundred repetitions make one truth. Idiots!"
    says
    @book{Sixty-two-thousand-four-hundred-repetitions-make-one-truth-Idiots,
      author={Aldous Huxley},
      title={{Brave New World}},
      publisher={Chatto \& Windus with T. and A. Constable with the University
                 Press Edinburgh},
      address={London and Edinburgh},
      year={1932}
    }

    Cf. publications by psychologists. E.g. Kimberlee Weaver; Stephen M.
    Garcia; Norbert Schwarz; and Dale T. Miller, "Inferring the Popularity of
    an Opinion From Its Familiarity: A Repetitive Voice Can Sound Like a
    Chorus", "Journal of Personality and Social Psychology", 92(5):821-833,
    2007.

    Cf. "majority opinion turns out to be wrong with a fairly high frequency
    in science"
    says
    James Woodward and David Goodstein, “Conduct, Misconduct and the
    Structure of Science,” September–October, "American Scientist", 1996, 479–490.

    Shark8 wrote during 2013:
    ////////////////////////////////////////////////////////////////////////
    /"UTF-16 is perhaps the worst possible encoding you can have for       /
    /Unicode. With UTF-8 you don't need to worry about byte-order          /
    /(everything's sequential) and with UTF-32 you don't need to decode the/
    /information (each element *IS* a code-point)... but UTF-16 offers     /
    /neither of these."                                                    /
    ////////////////////////////////////////////////////////////////////////

    Randy Brukardt wrote during 2023:
    ******************************************************************************
    *"But my opinion is that Ada got strings completely wrong, and the best thing*
    *to do with them is to completely nuke them and start over. [. . .]"         *
    ******************************************************************************

    I have been given a dataset. These files are supposedly homogeneous UTF-8
    XML files. Actually
    for data_file in *.xml ; do file "$data_file" | sed -e 's/^.*: //' ; done |
    sort | uniq
    reports:
    "ASCII text, with CRLF line terminators
    Unicode text, UTF-8 text, with CRLF line terminators
    XML 1.0 document, Unicode text, UTF-8 (with BOM) text, with CRLF line terminators".
    (If file does not call an example "XML 1.0 document, Unicode [. . .]"
    then such an example lacks a line with
    <?xml version='1.0' encoding='utf-8'?>
    but does consist of XML parts.)

    A valid letter in this language expressed in UTF-8 octets can have:
    1 octet (e.g. 16#41#);
    2 octets (e.g. 16#C3_BA#);
    or
    3 octets (e.g. 16#E1_BA_9B#).
    I do not believe that I am overlooking a 4-octet example . . . but what
    if?
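
    The "what if?" can be checked mechanically: in well-formed UTF-8 a
    4-octet sequence always begins with a lead octet in 16#F0# .. 16#F4#,
    so a scan for such octets suffices. Below is a minimal sketch which
    assumes that the file content has already been read into a String and
    is well-formed UTF-8; Sequence_Length and Has_Four_Octet_Sequence are
    hypothetical names, and the U+1F600 sample is merely an illustrative
    code point outside the Basic Multilingual Plane.

```ada
with Ada.Text_IO;

procedure Four_Octet_Check is

   --  Hypothetical helpers; not taken from any library.

   --  Length of the UTF-8 sequence announced by a lead octet, or 0 when
   --  the octet is a continuation octet or can never start a sequence.
   function Sequence_Length (Lead : Character) return Natural is
      Code : constant Natural := Character'Pos (Lead);
   begin
      case Code is
         when 16#00# .. 16#7F# => return 1;
         when 16#C2# .. 16#DF# => return 2;
         when 16#E0# .. 16#EF# => return 3;
         when 16#F0# .. 16#F4# => return 4;
         when others           => return 0;
      end case;
   end Sequence_Length;

   --  True if any octet of Text announces a 4-octet sequence.
   function Has_Four_Octet_Sequence (Text : String) return Boolean is
     (for some C of Text => Sequence_Length (C) = 4);

   function Oct (Value : Natural) return Character is (Character'Val (Value));

   --  The three examples from the text: 16#41#, 16#C3_BA#, 16#E1_BA_9B#.
   Sample : constant String :=
     Oct (16#41#) & Oct (16#C3#) & Oct (16#BA#)
     & Oct (16#E1#) & Oct (16#BA#) & Oct (16#9B#);

   --  U+1F600 encoded as UTF-8: an illustrative 4-octet sequence.
   Beyond_BMP : constant String :=
     Oct (16#F0#) & Oct (16#9F#) & Oct (16#98#) & Oct (16#80#);

begin
   Ada.Text_IO.Put_Line
     (Boolean'Image (Has_Four_Octet_Sequence (Sample)));
   Ada.Text_IO.Put_Line
     (Boolean'Image (Has_Four_Octet_Sequence (Sample & Beyond_BMP)));
end Four_Octet_Check;
```

    The first line printed is FALSE and the second is TRUE. Running such a
    scan over every file of the dataset would show whether any 4-octet
    letter is being overlooked.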

    This is not a constrained computer. It will not run out of memory. It is
    not slow. Deadly Head needs UTF-32. I do not need UTF-32 or UCS-4 for this
    application, but elegance might promote a uniform quantity of octets for
    all letters; and a polyglot user might try to insert some weird
    punctuation or whatever which I do not know, or might copy and paste some
    multilingual table from Unicode.org. I do not want
    "a lot of hidden issues, which is very hard to
    detect"
    as Vadim Godunko said. I do not want a crash, especially with some
    exception which is less informative than a Java exception. Granted, all
    these already existing files are in UTF-8. But what if some future
    application needs general UCS-4?

    Sincères salutations.



    Nicolas Paul Colin de Glocester

    cd Matreshka_league__ALIRE_failed_to_build_this

    /home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this
    $ alr get matreshka_league
    ⓘ Running post_fetch actions for matreshka_league=21.0.0...
    [. . .]
    configure: creating source/league/matreshka-config.ads

    matreshka_league=21.0.0 successfully retrieved.
    Dependencies were solved as follows:

    + make 4.3.0 (new)


    /home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this
    $ cd matreshka_league_21.0.0_0c8f4d47

    /home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this/matreshka_league_21.0.0_0c8f4d47
    $ alr run
    ⓘ Building matreshka_league/gnat/matreshka_league.gpr...
    Compile
    [Ada] xml-sax-simple_readers-scanner.adb
    [. . .]
    league-iris.adb:1476:36: warning: Is_Valid unimplemented [enabled by default]
    [. . .]
    [Ada] matreshka-cldr-collation_rules_parser.adb
    matreshka-internals-utf16.ads:100:04: warning: pragma Pack for "Utf16_String" ignored [-gnatwr]
    [. . .]
    [Ada] league-calendars-iso_8601.adb
    matreshka-cldr-collation_rules_parser.adb:186:30: warning: assignment to pass-by-copy formal may have no effect [enabled by default]
    matreshka-cldr-collation_rules_parser.adb:186:30: warning: "raise" statement may result in abnormal return (RM 6.4.1(17)) [enabled by default]
    [. . .]
    [Ada] matreshka-atomics-generic_test_and_set__gcc__64.adb
    matreshka-atomics-counters__gcc.adb:50:14: warning: intrinsic binding type mismatch on parameter 2 [enabled by default]
    matreshka-atomics-counters__gcc.adb:50:14: warning: profile of "Sync_Add_And_Fetch_32" doesn't match the builtin it binds [enabled by default]
    matreshka-atomics-counters__gcc.adb:54:13: warning: intrinsic binding type mismatch on result [enabled by default]
    matreshka-atomics-counters__gcc.adb:54:13: warning: intrinsic binding type mismatch on parameter 2 [enabled by default]
    matreshka-atomics-counters__gcc.adb:54:13: warning: profile of "Sync_Sub_And_Fetch_32" doesn't match the builtin it binds [enabled by default]
    matreshka-atomics-counters__gcc.adb:57:14: warning: intrinsic binding type mismatch on parameter 2 [enabled by default]
    matreshka-atomics-counters__gcc.adb:57:14: warning: profile of "Sync_Sub_And_Fetch_32" doesn't match the builtin it binds [enabled by default]
    [. . .]
    league-locales.ads:46:12: warning: unit "League.Strings" is not referenced [-gnatwu]

    compilation of matreshka-internals-unicode-ucd-properties.adb failed
    compilation of league-strings-cursors-grapheme_clusters.adb failed
    compilation of matreshka-internals-code_point_sets.adb failed
    compilation of league-character_sets.adb failed
    compilation of matreshka-internals-unicode-ucd-norms.ads failed
    compilation of matreshka-internals-unicode-ucd-core.ads failed

    gprbuild: *** compilation phase failed
    error: Command ["gprbuild", "-s", "-j0", "-p", "-P", "/coldstorage/gloucester/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this/matreshka_league_21.0.0_0c8f4d47/gnat/matreshka_league.gpr"]
    exited with code 4
    error: Build failed

    /home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this/matreshka_league_21.0.0_0c8f4d47
    $ date
    Tue Aug 26 12:03:12 CEST 2025
