• Re: Simple string conversion from UCS2 to ISO8859-1

    From Richard Damon@21:1/5 to pozz on Fri Feb 21 07:05:57 2025
    On 2/21/25 6:40 AM, pozz wrote:
    I want to write a simple function that converts UCS2 string into ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string like "00480065006C006C006F" for "Hello". I'm passing
    size because ucs2 isn't null terminated.

    Typically UCS2 strings ARE null terminated; it's just that the null is
    two bytes long.
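    A hypothetical helper illustrating that point (not from the original
    post): scanning for the 16-bit terminator, analogous to strlen() but
    counting 16-bit code units.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Length of a null-terminated UCS-2 string in 16-bit code units.
       The terminator is one 16-bit zero, not a single zero byte. */
    size_t ucs2len(const uint16_t *s)
    {
        size_t n = 0;
        while (s[n] != 0)
            n++;
        return n;
    }
    ```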


    I know I can use iconv() feature, but I'm on an embedded platform
    without an OS and without iconv() function.

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from unsigned int to char.

    Note, I think you will find that it is 0000-00FF that match (as I remember, ISO8859-1 was the basis for starting Unicode).


    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial again. But I saw the code "2019" (apostrophe) that can be rendered as
    0x27 in ISO8859-1.

    To be correct, u2019 isn't 0x27; it's just a character that looks a lot
    like it.


    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
      *dst++ = (char)code;
    } else {
      switch (code) {
        case 0x2019: *dst++ = 0x27; break;  // Apostrophe
        case 0x...: *dst++ = ...; break;
        default: *dst++ = ' ';
      }
    }

    I'm not looking for a very detailed and correct mapping, just a "sufficient" implementation.



    Then you have to decide which are sufficient mappings. No character
    above FF *IS* the character below, but some have a close approximation,
    so you will need to decide what to map.
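    A minimal sketch of one such decision, assuming the input has already
    been unpacked into 16-bit code points (the hex-digit parsing is left
    out) and using a deliberately tiny, subjective replacement table; the
    function name and chosen fallbacks are illustrative, not standard:

    ```c
    #include <stdint.h>

    /* Map one UCS-2 code point to ISO 8859-1, substituting '?' when
       there is no reasonable approximation.  The replacement table is
       a subjective choice, not a standard mapping. */
    static char ucs2_to_latin1_char(uint16_t code)
    {
        if (code <= 0x00FF)          /* U+0000..U+00FF match Latin-1 directly */
            return (char)code;

        switch (code) {
        case 0x2018:                 /* left single quotation mark  */
        case 0x2019: return '\'';    /* right single quotation mark */
        case 0x201C:                 /* left double quotation mark  */
        case 0x201D: return '"';     /* right double quotation mark */
        case 0x2013:                 /* en dash */
        case 0x2014: return '-';     /* em dash */
        default:     return '?';     /* no sensible approximation   */
        }
    }
    ```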

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to pozz on Fri Feb 21 14:06:03 2025
    On 21.02.2025 13:42, pozz wrote:
    Il 21/02/2025 13:05, Richard Damon ha scritto:
    On 2/21/25 6:40 AM, pozz wrote:
    I want to write a simple function that converts UCS2 string into
    ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string like "00480065006C006C006F" for "Hello". I'm
    passing size because ucs2 isn't null terminated.

    [...]

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from
    unsigned int to char.

    Note, I think you will find that it is 0000-00FF that match (as
    I remember, ISO8859-1 was the basis for starting Unicode).

    I second that.


    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
    trivial again. But I saw the code "2019" (apostrophe) that can be
    rendered as 0x27 in ISO8859-1.

    To be correct, u2019 isn't 0x27; it's just a character that looks a lot
    like it.

    Yes, but as a first approximation, 0x27 is much better than '?' for u2019.

    Note that there are _standard names_ assigned to the characters.
    These are normative for what the characters represent. - I strongly
    suggest not twisting these standards by assigning different
    characters; you will do no one a favor but only inflict confusion
    and harm.


    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
    *dst++ = (char)code;
    } else {
    switch (code) {
    case 0x2019: *dst++ = 0x27; break; // Apostrophe
    case 0x...: *dst++ = ...; break;
    default: *dst++ = ' ';
    }
    }

    I'm not looking for a very detailed and correct mapping, just a
    "sufficient" implementation.

    Then you have to decide which are sufficient mappings. No character
    above FF *IS* the character below, but some have a close
    approximation, so you will need to decide what to map.

    Yes, I have to decide, but it is a very big problem (there are thousands
    of Unicode symbols that can be approximated by an ISO8859-1 code).
    I'm wondering if such an approximation is already implemented somewhere.

    I've just made a pass across the names of UCS-2 and ISO 8859-1, based
    on their normative names and, as already mentioned above, they match
    one-to-one in the ranges 0000-00FF and 00-FF respectively.

    BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
    of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
    contains a few other characters like the € (Euro Sign). If that is
    possible in your context, you only have to map a handful of characters.
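    That handful can be sketched as follows: the eight positions where
    ISO 8859-15 (Latin-9) differs from ISO 8859-1 (Latin-1). The helper
    name is hypothetical; the code-point/byte pairs come from the
    published ISO 8859-15 table.

    ```c
    #include <stdint.h>

    /* Returns the Latin-9 byte for the eight Unicode characters where
       ISO 8859-15 differs from ISO 8859-1, or 0 for everything else
       (sketch; the caller handles the common Latin-1 range). */
    static unsigned char latin9_special(uint16_t code)
    {
        switch (code) {
        case 0x20AC: return 0xA4;  /* EURO SIGN                             */
        case 0x0160: return 0xA6;  /* LATIN CAPITAL LETTER S WITH CARON     */
        case 0x0161: return 0xA8;  /* LATIN SMALL LETTER S WITH CARON       */
        case 0x017D: return 0xB4;  /* LATIN CAPITAL LETTER Z WITH CARON     */
        case 0x017E: return 0xB8;  /* LATIN SMALL LETTER Z WITH CARON       */
        case 0x0152: return 0xBC;  /* LATIN CAPITAL LIGATURE OE             */
        case 0x0153: return 0xBD;  /* LATIN SMALL LIGATURE OE               */
        case 0x0178: return 0xBE;  /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
        default:     return 0;
        }
    }
    ```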

    Janis

    For example, what does iconv() do in this case?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Fri Feb 21 14:17:32 2025
    On 21.02.2025 14:06, Janis Papanagnou wrote:
    On 21.02.2025 13:42, pozz wrote:
    Il 21/02/2025 13:05, Richard Damon ha scritto:
    On 2/21/25 6:40 AM, pozz wrote:

    [...] But I saw the code "2019" (apostrophe) that can be
    rendered as 0x27 in ISO8859-1.

    To be correct, u2019 isn't 0x27; it's just a character that looks a lot
    like it.

    Yes, but as a first approximation, 0x27 is much better than '?' for u2019.

    Note that there are _standard names_ assigned to the characters.
    These are normative for what the characters represent. - I strongly
    suggest not twisting these standards by assigning different
    characters; you will do no one a favor but only inflict confusion
    and harm.

    I want to add the standard names to make it clear...

    0027 APOSTROPHE
    2019 RIGHT SINGLE QUOTATION MARK

    27 APOSTROPHE

    Hope that helps to understand the standard names.


    (You should also be aware that a glyph for a character may be
    depicted differently depending on the source of the respective
    documents.)

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to pozz on Fri Feb 21 15:23:51 2025
    On 21/02/2025 12:40, pozz wrote:
    I want to write a simple function that converts UCS2 string into ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string like "00480065006C006C006F" for "Hello". I'm passing
    size because ucs2 isn't null terminated.

    I know I can use iconv() feature, but I'm on an embedded platform
    without an OS and without iconv() function.

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from unsigned int to char.

    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial again. But I saw the code "2019" (apostrophe) that can be rendered as
    0x27 in ISO8859-1.

    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
      *dst++ = (char)code;
    } else {
      switch (code) {
        case 0x2019: *dst++ = 0x27; break;  // Apostrophe
        case 0x...: *dst++ = ...; break;
        default: *dst++ = ' ';
      }
    }

    I'm not looking for a very detailed and correct mapping, just a "sufficient" implementation.



    <https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>

    As has been mentioned by others, 0 - 0xff should be a direct translation
    (with the possible exception of Latin-9 differences).

    <https://en.wikipedia.org/wiki/ISO/IEC_8859-15>


    When you look at the BMP blocks above the first two blocks (0 - 0x7f, 0x80
    - 0xff), you will quickly see that virtually none of them make any sense
    to support in the way you are thinking. Just because a couple of the characters in the Thaana block look a bit like quotation marks, does not
    mean it makes any sense to try to transliterate them. Realistically,
    you can at most make use of a few punctuation symbols (like 0x2019
    above), and maybe approximate forms for some extended Latin alphabet
    characters that you will never see in practice. Oh, and you might be
    able to support those spam emails that use Greek and other letters that
    look like Latin letters such as "ՏΡ𐊠Ꮇ" to fool filters. And that's assuming you have output support for the full Latin-1 or Latin-9 range.


    Unicode is rarely much use unless you want and can provide good support
    for non-Latin alphabets. Otherwise your translations are going to be so limited and simple that they are barely worth the effort and won't cover anything useful.


    So here I would say that whoever provides the text, provides it in
    Latin-9 encoding. There's no point in allowing external translators to
    use whatever characters they feel are best in their language, and then
    your code makes some kind of odd approximation giving results that look different. If someone really wants to use the letter "ā" that is found
    in the Latin Extended A block, how do /you/ know whether the best
    Latin-9 match is "a", "ã", "ä", or something different like "aa" or an alternative spelling of the word? Maybe the rules are different for
    Latvian and Anglicised Mandarin.


    When we have worked with multiple languages on small embedded systems
    (too small for big fonts and UTF-8), we have used one of three techniques:

    1. Insist that the external translators provide strings in Latin-9 only
    (or even just ASCII when the system was more restricted).

    2. Use primarily ASCII, with a few user-defined characters per language
    (that's useful for old-style character displays with space for perhaps 8 user-defined characters).

    3. Use a PC program to figure out the characters actually used in the
    strings, and put them into a single table indexing a generated list of
    bitmap glyphs, also generated by the program (from freely available
    fonts). The source is, naturally, UTF-8 - the strings stored in the
    embedded system are not in any standard encoding representing
    characters, but now hold glyph table indices.


    Your idea here sounds to me like a lot of work for virtually no benefit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Keith Thompson on Fri Feb 21 23:35:57 2025
    On 21.02.2025 20:40, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    [...]
    BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
    of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
    contains a few other characters like the € (Euro Sign). If that is
    possible for your context you have to map a handful of characters.

    Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
    not, which would make the translation more difficult.

    Yes, that had already been pointed out upthread.

    The (open) question is whether it makes sense to convert to "Latin 1"
    only because it has a one-to-one mapping concerning the first UCS-2
    characters, or if the underlying application of the OP wants support
    of contemporary information by e.g. providing the € (Euro) sign with
    "Latin 9".


    <https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
    the 8 characters that differ between Latin-1 and Latin-9.

    If at all possible, it would be better to convert to UTF-8. The
    conversion is exact and reversible, and UTF-8 has largely superseded the various Latin-* character encodings.

    Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
    while the ISO 8859-X family uses single-octet representations.

    I'm curious why the OP needs ISO8859-1 and can't use UTF-8.

    I think this, or why he can't use "Latin 9", are essential questions.

    It became clear after a subsequent post of the OP; some
    message/data source seems to provide characters from the upper planes
    of Unicode, and the OP has to (or wants to) somehow map them to some
    constant-octet character set. - Yet there's no information provided on
    what Unicode characters - characters that don't have a representation
    in Latin 1 or Latin 9 - the OP will or will not encounter from that source.

    As it sounds, it all seems to make little sense.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to pozz on Fri Feb 21 20:05:22 2025
    On 2/21/25 7:42 AM, pozz wrote:
    Il 21/02/2025 13:05, Richard Damon ha scritto:
    On 2/21/25 6:40 AM, pozz wrote:
    I want to write a simple function that converts UCS2 string into
    ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string like "00480065006C006C006F" for "Hello". I'm
    passing size because ucs2 isn't null terminated.

    Typically UCS2 strings ARE null terminated; it's just that the null is
    two bytes long.

    Sure, but this isn't an issue here.


    I know I can use iconv() feature, but I'm on an embedded platform
    without an OS and without iconv() function.

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from
    unsigned int to char.

    Note, I think you will find that it is 0000-00FF that match (as
    I remember, ISO8859-1 was the basis for starting Unicode).


    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
    trivial again. But I saw the code "2019" (apostrophe) that can be
    rendered as 0x27 in ISO8859-1.

    To be correct, u2019 isn't 0x27; it's just a character that looks a lot
    like it.

    Yes, but as a first approximation, 0x27 is much better than '?' for u2019.

    And, as such is a subjective decision that you need to make.



    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
       *dst++ = (char)code;
    } else {
       switch (code) {
         case 0x2019: *dst++ = 0x27; break;  // Apostrophe
         case 0x...: *dst++ = ...; break;
         default: *dst++ = ' ';
       }
    }

    I'm not looking for a very detailed and correct mapping, just a
    "sufficient" implementation.

    Then you have to decide which are sufficient mappings. No character
    above FF *IS* the character below, but some have a close
    approximation, so you will need to decide what to map.

    Yes, I have to decide, but it is a very big problem (there are thousands
    of Unicode symbols that can be approximated by an ISO8859-1 code).
    I'm wondering if such an approximation is already implemented somewhere.

    For example, what does iconv() do in this case?

    Just look at its code, there will be open source versions of it.

    The two real options are to just reject anything above 0xFF, or to have a big table/switch to handle some determined list of things that are "close enough".
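    The big-table option could be sketched as a sorted array plus
    bsearch(), so lookup stays logarithmic even with a few hundred
    entries. The table below is deliberately tiny and illustrative, and
    the helper name is made up:

    ```c
    #include <stdlib.h>
    #include <stdint.h>

    struct map { uint16_t ucs2; unsigned char latin1; };

    /* Illustrative mapping table, sorted ascending by code point. */
    static const struct map approx[] = {
        { 0x2013, '-'  },  /* en dash            */
        { 0x2014, '-'  },  /* em dash            */
        { 0x2018, '\'' },  /* left single quote  */
        { 0x2019, '\'' },  /* right single quote */
        { 0x201C, '"'  },  /* left double quote  */
        { 0x201D, '"'  },  /* right double quote */
    };

    static int cmp(const void *k, const void *e)
    {
        uint16_t key = *(const uint16_t *)k;
        uint16_t cur = ((const struct map *)e)->ucs2;
        return (key > cur) - (key < cur);
    }

    /* Map a code point above 0xFF to an approximation, ' ' if unmapped. */
    static unsigned char approx_latin1(uint16_t code)
    {
        const struct map *m = bsearch(&code, approx,
                                      sizeof approx / sizeof approx[0],
                                      sizeof approx[0], cmp);
        return m ? m->latin1 : ' ';
    }
    ```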

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to pozz on Sat Feb 22 01:20:20 2025
    On 2025-02-21, pozz <pozzugno@gmail.com> wrote:
    I want to write a simple function that converts UCS2 string into ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing
    size because ucs2 isn't null terminated.

    This kind of normalizing is a good way of introducing injection
    exploits.

    Suppose the input is some syntax that has been validated; the decision
    is trusted after that. The normalization to the 8-bit character set can
    produce characters which are special in the syntax, changing its
    meaning.

    In Microsoft Windows, there is an example of such a problem. Programs
    which use GetCommandLineA to get the argument string before parsing it
    into arguments are vulnerable to argument injection. The attacker
    supplies a datum to be used by program A as an argument in
    calling program B such that, when the datum is decimated to the 8-bit
    character set, quotes appear in it, creating additional arguments to
    program B.

    again. But I saw the code "2019" (apostrophe) that can be rendered as
    0x27 in ISO8859-1.

    ... and that's a common quoting character in various data syntaxes, oops!
    What could go wrong?

    I think in 2025 we shouldn't have to be crippling Unicode data to fit
    some ISO Latin (or any other 8 bit) character set; we should be rooting
    out technologies and situations which do that.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to pozz on Sat Feb 22 03:00:31 2025
    On Fri, 21 Feb 2025 13:42:13 +0100, pozz wrote:

    Yes, I have to decide, but it is a very big problem (there are thousands
    of Unicode symbols that can be approximated to another ISO8859-1 code).
    I'm wondering if such an approximation is just implemented somewhere.

    If you look at NamesList.txt, you will see, next to each character,
    references to others that might be similar or related in some way.

    They say not to try to parse that file automatically, but I’ve had some success doing exactly that ... so far ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Sat Feb 22 05:29:14 2025
    On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:
    On Fri, 21 Feb 2025 13:42:13 +0100, pozz wrote:

    Yes, I have to decide, but it is a very big problem (there are thousands
    of Unicode symbols that can be approximated to another ISO8859-1 code).
    I'm wondering if such an approximation is just implemented somewhere.

    If you look at NamesList.txt, you will see, next to each character, references to others that might be similar or related in some way.

    They say not to try to parse that file automatically, but I’ve had some success doing exactly that ... so far ...

    I wonder why they say so, given that there's a syntax description
    available on their pages (see the respective HTML file[*]).


    BTW; curious about that [informal] part of the syntax description

    LF: <any sequence of a single ASCII 0A or 0D, or both>

    It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
    Is the latter of any practical relevance?

    Janis

    [*] https://www.unicode.org/Public/UCD/latest/ucd/NamesList.html

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Sat Feb 22 06:13:27 2025
    On Sat, 22 Feb 2025 05:29:14 +0100, Janis Papanagnou wrote:

    On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:

    They say not to try to parse that file automatically, but I’ve had some
    success doing exactly that ... so far ...

    I wonder why they say so, given that there's a syntax description
    available on their pages (see the respective HTML file[*]).

    The file itself says differently <https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt>:

    This file is semi-automatically derived from UnicodeData.txt and a
    set of manually created annotations using a script to select or
    suppress information from the data file. The rules used for this
    process are aimed at readability for the human reader, at the
    expense of some details; therefore, this file should not be parsed
    for machine-readable information.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Sat Feb 22 09:11:02 2025
    On 22.02.2025 07:13, Lawrence D'Oliveiro wrote:
    On Sat, 22 Feb 2025 05:29:14 +0100, Janis Papanagnou wrote:

    On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:

    They say not to try to parse that file automatically, but I’ve had some
    success doing exactly that ... so far ...

    I wonder why they say so, given that there's a syntax description
    available on their pages (see the respective HTML file[*]).

    The file itself says differently <https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt>:

    This file is semi-automatically derived from UnicodeData.txt and a
    set of manually created annotations using a script to select or
    suppress information from the data file. The rules used for this
    process are aimed at readability for the human reader, at the
    expense of some details; therefore, this file should not be parsed
    for machine-readable information.

    I see, but I certainly wouldn't refrain from parsing it. (In the past
    I had parsed much worse data; irregular HTML stuff and the like.)
    OTOH, there's also the CSV data file available, which is even simpler
    to parse with standard tools and no effort.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Janis Papanagnou on Sat Feb 22 12:29:59 2025
    On 21/02/2025 23:35, Janis Papanagnou wrote:
    On 21.02.2025 20:40, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    [...]
    BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
    of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
    contains a few other characters like the € (Euro Sign). If that is
    possible for your context you have to map a handful of characters.

    Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
    not, which would make the translation more difficult.

    Yes, that had already been pointed out upthread.

    The (open) question is whether it makes sense to convert to "Latin 1"
    only because it has a one-to-one mapping concerning the first UCS-2 characters, or if the underlying application of the OP wants support
    of contemporary information by e.g. providing the € (Euro) sign with
    "Latin 9".


    <https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
    the 8 characters that differ between Latin-1 and Latin-9.

    If at all possible, it would be better to convert to UTF-8. The
    conversion is exact and reversible, and UTF-8 has largely superseded the
    various Latin-* character encodings.

    Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
    while the ISO 8859-X family uses single-octet representations.

    I'm curious why the OP needs ISO8859-1 and can't use UTF-8.

    I think this, or why he can't use "Latin 9", are essential questions.

    It became clear after a subsequent post of the OP; some
    message/data source seems to provide characters from the upper planes
    of Unicode, and the OP has to (or wants to) somehow map them to some
    constant-octet character set. - Yet there's no information provided on
    what Unicode characters - characters that don't have a representation
    in Latin 1 or Latin 9 - the OP will or will not encounter from that source.

    As it sounds, it all seems to make little sense.

    Janis


    As the OP explained in a reply to one of my posts, he is getting data
    in UCS-2 format from SMS's from a modem.  Somewhere along the line,
    either the firmware in the modem or in the code sending the SMS's,
    characters beyond the BMP are being used needlessly. So it looks like
    his first idea of manually handling a few cases (like code 0x2019) seems
    like the right approach.

    Whether Latin-1 or Latin-9 is better will depend on his application.
    The additional characters in Latin-9, with the exception of the Euro
    symbol, are pretty obscure - it's unlikely that you'd need them and not
    need a good deal more other characters (i.e., supporting much more of
    Unicode).

    As for why not use UTF-8, the answer is clearly simplicity. The OP is
    working with a resource-constrained embedded system. I don't know what
    he is doing with the characters after converting them from UCS-2, but it
    is massively simpler to use an 8-bit character set if they are going to
    be used for display on a small system. It also keeps memory management simpler, and that is essential on such systems - one UCS-2 character
    maps to one code unit with Latin-9 here. The space needed for UTF-8 is
    much harder to predict, and the OP will want to avoid any kind of
    malloc() or dynamic allocation where possible.

    If the incoming SMS's are just being logged, or passed out in some other
    way, then UTF-8 may be a convenient alternative.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to David Brown on Sat Feb 22 13:11:34 2025
    On 22.02.2025 12:29, David Brown wrote:

    As the OP explained in a reply to one of my posts, he is getting data
    in UCS-2 format from SMS's from a modem. [...]

    (Yes. I wrote: "have got clear after a subsequent post".)


    Whether Latin-1 or Latin-9 is better will depend on his application.

    (Was also my stance upthread; "If that is possible for your context")

    The
    additional characters in Latin-9, with the exception of the Euro symbol,
    are pretty obscure

    ISTR they are some language specific symbols, so probably less obscure
    to someone from those countries.

    - it's unlikely that you'd need them and not need a
    good deal more other characters (i.e., supporting much more of Unicode).

    As for why not use UTF-8, the answer is clearly simplicity.

    This was not my point (someone else suggested that). To me that was
    clear; UTF-8 is an _encoding_ (as I wrote), as opposed to a direct representation of a fixed-width character (either 8-bit-wide ISO
    8859-X or 16-bit UCS-2). Conversions to/from UTF-8 are not as straightforward as fixed-width character representations are.
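    For scale, the per-character conversion itself is short; a sketch of
    encoding one UCS-2 (BMP) code point as UTF-8, a hypothetical helper
    not taken from the thread, assuming plain UCS-2 input with no
    surrogates:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Encode one UCS-2 (BMP) code point as UTF-8; writes 1-3 bytes
       into buf and returns the count.  Assumes buf has room for 3. */
    static size_t ucs2_to_utf8(uint16_t c, unsigned char *buf)
    {
        if (c < 0x80) {                        /* 1 byte: 0xxxxxxx */
            buf[0] = (unsigned char)c;
            return 1;
        }
        if (c < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
            buf[0] = 0xC0 | (c >> 6);
            buf[1] = 0x80 | (c & 0x3F);
            return 2;
        }
        buf[0] = 0xE0 | (c >> 12);             /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        buf[1] = 0x80 | ((c >> 6) & 0x3F);
        buf[2] = 0x80 | (c & 0x3F);
        return 3;
    }
    ```

    The variable output length (one to three bytes per input code unit)
    is exactly what makes buffer sizing harder to predict than with a
    fixed-width 8-bit target.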

    The OP is
    working with a resource-constrained embedded system. I don't know what
    he is doing with the characters after converting them from UCS-2, but it
    is massively simpler to use an 8-bit character set if they are going to
    be used for display on a small system. It also keeps memory management simpler, and that is essential on such systems - one UCS-2 character
    maps to one code unit with Latin-9 here. The space needed for UTF-8 is
    much harder to predict, and the OP will want to avoid any kind of
    malloc() or dynamic allocation where possible.

    You should address that to the other poster. :-)

    Janis


    If the incoming SMS's are just being logged, or passed out in some other
    way, then UTF-8 may be a convenient alternative.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to David Brown on Sat Feb 22 07:15:09 2025
    On 2/22/25 6:29 AM, David Brown wrote:
    On 21/02/2025 23:35, Janis Papanagnou wrote:
    On 21.02.2025 20:40, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    [...]
    BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
    of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
    contains a few other characters like the € (Euro Sign). If that is
    possible for your context you have to map a handful of characters.

    Latin-1 maps exactly to Unicode for the first 256 values.  Latin-9 does
    not, which would make the translation more difficult.

    Yes, that had already been pointed out upthread.

    The (open) question is whether it makes sense to convert to "Latin 1"
    only because it has a one-to-one mapping concerning the first UCS-2
    characters, or if the underlying application of the OP wants support
    of contemporary information by e.g. providing the € (Euro) sign with
    "Latin 9".


    <https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
    the 8 characters that differ between Latin-1 and Latin-9.

    If at all possible, it would be better to convert to UTF-8.  The
    conversion is exact and reversible, and UTF-8 has largely superseded
    the various Latin-* character encodings.

    Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
    while the ISO 8859-X family uses single-octet representations.

    I'm curious why the OP needs ISO8859-1 and can't use UTF-8.

    I think this, or why he can't use "Latin 9", are essential questions.

    It became clear after a subsequent post of the OP; some
    message/data source seems to provide characters from the upper planes
    of Unicode, and the OP has to (or wants to) somehow map them to some
    constant-octet character set. - Yet there's no information provided on
    what Unicode characters - characters that don't have a representation
    in Latin 1 or Latin 9 - the OP will or will not encounter from that source.

    As it sounds, it all seems to make little sense.

    Janis


    As the OP explained in a reply to one of my posts, he is getting data
    in UCS-2 format from SMS's from a modem.  Somewhere along the line,
    either the firmware in the modem or in the code sending the SMS's,
    characters beyond the BMP are being used needlessly.  So it looks like
    his first idea of manually handling a few cases (like code 0x2019) seems
    like the right approach.

    Small nit: not outside the BMP, just outside the ASCII/Latin-1 set. Like
    so many other programs, it did a "pretty" transformation of a simple
    single quotation mark into a fancy version.


    Whether Latin-1 or Latin-9 is better will depend on his application. The additional characters in Latin-9, with the exception of the Euro symbol,
    are pretty obscure - it's unlikely that you'd need them and not need a
    good deal more other characters (i.e., supporting much more of Unicode).

    As for why not use UTF-8, the answer is clearly simplicity.  The OP is working with a resource-constrained embedded system.  I don't know what
    he is doing with the characters after converting them from UCS-2, but it
    is massively simpler to use an 8-bit character set if they are going to
    be used for display on a small system.  It also keeps memory management simpler, and that is essential on such systems - one UCS-2 character
    maps to one code unit with Latin-9 here.  The space needed for UTF-8 is
    much harder to predict, and the OP will want to avoid any kind of
    malloc() or dynamic allocation where possible.

    If the incoming SMS's are just being logged, or passed out in some other
    way, then UTF-8 may be a convenient alternative.




    I would say the big difference is that an 8-bit character set needs to
    store only 256 glyphs for its font. Converting to UTF-8 would still
    require storing some massive font, and deciding exactly how
    massive it needs to be.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Richard Damon on Sat Feb 22 14:12:28 2025
    On 22/02/2025 13:15, Richard Damon wrote:
    On 2/22/25 6:29 AM, David Brown wrote:

    As for why not use UTF-8, the answer is clearly simplicity.  The OP is
    working with a resource-constrained embedded system.  I don't know
    what he is doing with the characters after converting them from UCS-2,
    but it is massively simpler to use an 8-bit character set if they are
    going to be used for display on a small system.  It also keeps memory
    management simpler, and that is essential on such systems - one UCS-2
    character maps to one code unit with Latin-9 here.  The space needed
    for UTF-8 is much harder to predict, and the OP will want to avoid any
    kind of malloc() or dynamic allocation where possible.

    If the incoming SMS's are just being logged, or passed out in some
    other way, then UTF-8 may be a convenient alternative.




    I would say the big difference is that an 8-bit character set needs to
    store only 256 glyphs for its font. Converting to UTF-8 would still
    require storing some massive font, and deciding exactly how
    massive it needs to be.

    Yes, exactly. A key point is what the OP is going to do with the text.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Janis Papanagnou on Sat Feb 22 14:11:11 2025
    On 22/02/2025 13:11, Janis Papanagnou wrote:
    On 22.02.2025 12:29, David Brown wrote:

    As the OP explained in a reply to one of my posts, he is getting data in
    in UCS-2 format from SMS's from a modem. [...]

    (Yes. I wrote: "have got clear after a subsequent post".)


    Whether Latin-1 or Latin-9 is better will depend on his application.

    (Was also my stance upthread; "If that is possible for your context")

    The
    additional characters in Latin-9, with the exception of the Euro symbol,
    are pretty obscure

    ISTR they are some language specific symbols, so probably less obscure
    to someone from those countries.


    The point (as I said below) is that adding these letters (š, ž, œ)
    makes very little difference to anyone because they are not enough to
    let them write their language properly. Sure, someone writing Czech
    might have regular use of the letter ž - but with Latin-9 they can't
    write the letters ť, ř, ď or several other Czech letters. So it
    provides little benefit to most people who have those letters in their
    alphabet. If you want to let people write their languages properly
    (something I strongly support), you need much fuller Unicode support -
    unless you are working specifically with Sami, Finnish or Estonian,
    the only benefit of moving from Latin-1 to Latin-9 is for the Euro
    symbol.


    - it's unlikely that you'd need them and not need a
    good deal more other characters (i.e., supporting much more of Unicode).





    As for why not use UTF-8, the answer is clearly simplicity.

    This was not my point (someone else suggested that).

    <snip>

    You should address that to the other poster. :-)


    I was making a single reply that covered both parts - I know you didn't
    write the bits you quoted from further up-thread.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Keith Thompson on Sat Feb 22 14:18:03 2025
    On 21/02/2025 20:45, Keith Thompson wrote:
    pozz <pozzugno@gmail.com> writes:
    I want to write a simple function that converts UCS2 string into ISO8859-1:
    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
    passing size because ucs2 isn't null terminated.

    Is the UCS-2 really represented as a sequence of ASCII hex digits?

    In actual UCS-2, each character is 2 bytes. The representation for
    "Hello" would be 10 bytes, either "\0H\0e\0l\0l\0o" or
    "H\0e\0l\0l\0o\0", depending on endianness. (UCS-2 is a subset of
    UTF-16; the latter uses longer sequences to represent characters
    outside the Basic Multilingual Plane.)


    My understanding here is that the OP is getting the UCS-2 encoded string
    in from a modem, almost certainly on a serial line. The UCS-2 encoded
    data is itself a binary sequence of 16-bit code units, and the modem
    firmware is sending those as four hex digits. This is a very common way
    to handle transmission of binary data in such systems - there is no need
    for escapes or other complications to delimit the binary data. I would
    expect that the entire incoming message will be comma-separated fields
    with the time and date, sender's telephone number, and so on, as well as
    the text itself as this long hex string.
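    As a sketch of that first decoding step (the helper names here are
    illustrative, not from the thread), the four-hex-digit groups coming
    off the serial line can be turned back into 16-bit code units:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Decode one ASCII hex digit; returns -1 on a non-hex character. */
    static int hexval(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /* Turn a string like "00480065006C006C006F" into an array of UCS-2
       code units.  Returns the number of code units written, or 0 on
       malformed input (odd length or a non-hex character). */
    static size_t hex_to_ucs2(const char *hex, size_t hexlen,
                              uint16_t *out, size_t outmax)
    {
        size_t n = 0;
        if (hexlen % 4 != 0)
            return 0;
        for (size_t i = 0; i < hexlen && n < outmax; i += 4) {
            uint16_t cu = 0;
            for (size_t j = 0; j < 4; j++) {
                int v = hexval(hex[i + j]);
                if (v < 0)
                    return 0;
                cu = (uint16_t)((cu << 4) | (uint16_t)v);
            }
            out[n++] = cu;
        }
        return n;
    }
    ```

    No dynamic allocation is needed: the caller supplies a fixed buffer,
    which fits the constrained-embedded context discussed above.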

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Richard Damon on Sat Feb 22 16:43:55 2025
    On 22.02.2025 13:15, Richard Damon wrote:

    [...] It did like so many other programs a "pretty" transformation of
    a simple single quotation mark, to a fancy version.

    Good to put the "pretty" in quotes; I've seen so many "fancy versions",
    one worse than the other. They are culture specific and on a terminal
    they often look bad even in their native target form. For example “--”
    in a man page (say, 'man awk') has a left and right slant respectively
    and they are linear, but my newsreader shows them both in the same
    direction but the one thicker at the bottom the other at the top. It's
    similar with single quotes; here we often see accents used at one side
    and a regular single quote at the other side. In 'man man' for example
    we find even a comment on that in the description of option '--ascii'.
    There's *tons* of such quoting characters for the various languages,
    in my mother tongue there's even _more than one_ type used in printed
    media. Single or double and left or right and bottom or top or mixed
    or double or single angle brackets in opening and closing form, plus
    the *misused* accent characters (which look worst, IMO, especially if
    combined inconsistently with other forms).

    I'm glad that in programming there's a bias on symmetric use of the
    neutral forms " and ' (for strings and characters and other quoting)
    and that things like accents ` and ´ *seem* to gradually vanish for
    quoting purposes; e.g. shell `...` long superseded by $(...). Only
    document contents occasionally still adhere to trashy use.

    One thing I'd really like to understand is why folks have been mixing
    accents with quotes, as in ``standard'' (also taken from 'man awk').

    They may look acceptable in one display or printing device but become
    a typographical catastrophe when viewed on another device type.

    </rant>

    Janis
    --

    0022;QUOTATION MARK
    00AB;LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    00BB;RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    2018;LEFT SINGLE QUOTATION MARK
    2019;RIGHT SINGLE QUOTATION MARK
    201A;SINGLE LOW-9 QUOTATION MARK
    201B;SINGLE HIGH-REVERSED-9 QUOTATION MARK
    201C;LEFT DOUBLE QUOTATION MARK
    201D;RIGHT DOUBLE QUOTATION MARK
    201E;DOUBLE LOW-9 QUOTATION MARK
    201F;DOUBLE HIGH-REVERSED-9 QUOTATION MARK
    2039;SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    203A;SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    275B;HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
    275C;HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
    275D;HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
    275E;HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
    275F;HEAVY LOW SINGLE COMMA QUOTATION MARK ORNAMENT
    2760;HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
    276E;HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT
    276F;HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT
    2E42;DOUBLE LOW-REVERSED-9 QUOTATION MARK
    301D;REVERSED DOUBLE PRIME QUOTATION MARK
    301E;DOUBLE PRIME QUOTATION MARK
    301F;LOW DOUBLE PRIME QUOTATION MARK
    FF02;FULLWIDTH QUOTATION MARK
    1F676;SANS-SERIF HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
    1F677;SANS-SERIF HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
    1F678;SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
    E0022;TAG QUOTATION MARK

    0027;APOSTROPHE
    02BC;MODIFIER LETTER APOSTROPHE
    02EE;MODIFIER LETTER DOUBLE APOSTROPHE
    055A;ARMENIAN APOSTROPHE
    FF07;FULLWIDTH APOSTROPHE
    E0027;TAG APOSTROPHE

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Sat Feb 22 21:23:42 2025
    On Sat, 22 Feb 2025 13:11:34 +0100, Janis Papanagnou wrote:

    UTF-8 is an _encoding_ (as I wrote), as opposed to a direct
    representation of a fixed width character (either 8 bit width ISO 8859-X
    or 16 bit with UCS-2). Conversions to/from UTF-8 are not as
    straightforward as fixed width character representations are.

    Unicode is not, and never has been, a fixed-width character set.

    UCS-2 was a fixed-width set of code points. Even that idea has been
    abandoned.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Sat Feb 22 21:22:20 2025
    On Sat, 22 Feb 2025 09:11:02 +0100, Janis Papanagnou wrote:

    On 22.02.2025 07:13, Lawrence D'Oliveiro wrote:

    On Sat, 22 Feb 2025 05:29:14 +0100, Janis Papanagnou wrote:

    On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:

    They say not to try to parse that file automatically, but I’ve had
    some success doing exactly that ... so far ...

    I wonder why they say so, given that there's a syntax description
    available on their pages (see the respective HTML file[*]).

    The file itself says different
    <https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt>:

    This file is semi-automatically derived from UnicodeData.txt and a
    set of manually created annotations using a script to select or
    suppress information from the data file. The rules used for this
    process are aimed at readability for the human reader, at the
    expense of some details; therefore, this file should not be parsed
    for machine-readable information.

    I see, but I certainly wouldn't refrain from parsing it.

    Particularly since the information on related code points doesn’t seem to
    be available anywhere else.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Richard Damon on Sat Feb 22 21:24:28 2025
    On Sat, 22 Feb 2025 07:15:09 -0500, Richard Damon wrote:

    I would say the big difference is that an 8-bit character set needs to
    only store 256 glyphs for its font.

    Note that glyphs are not characters.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Lawrence D'Oliveiro on Sun Feb 23 00:02:32 2025
    On 2025-02-22, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sat, 22 Feb 2025 07:15:09 -0500, Richard Damon wrote:

    I would say the big difference is that an 8-bit character set needs to
    only store 256 glyphs for its font.

    Note that glyphs are not characters.

    Unemployable shithead, note that Richard said "font". A font does assign
    glyphs to abstract characters. The sentence is not the most precise we
    can imagine, since character sets are not containers that store, but
    it's not important here.

    Gee, what are the odds you would fuck up an attempt to nit-pick someone
    ten times your brain size?

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Richard Damon on Sat Feb 22 23:44:49 2025
    Richard Damon <richard@damon-family.org> wrote:
    On 2/22/25 6:29 AM, David Brown wrote:
    On 21/02/2025 23:35, Janis Papanagnou wrote:
    On 21.02.2025 20:40, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    [...]
    BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
    of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
    contains a few other characters like the € (Euro Sign). If that is
    possible for your context you have to map a handful of characters.

    Latin-1 maps exactly to Unicode for the first 256 values.  Latin-9 does
    not, which would make the translation more difficult.

    Yes, that had already been pointed out upthread.

    The (open) question is whether it makes sense to convert to "Latin 1"
    only because it has a one-to-one mapping concerning the first UCS-2
    characters, or if the underlying application of the OP wants support
    of contemporary information by e.g. providing the € (Euro) sign with
    "Latin 9".


    <https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
    the 8 characters that differ between Latin-1 and Latin-9.

    If at all possible, it would be better to convert to UTF-8.  The
    conversion is exact and reversible, and UTF-8 has largely superseded the
    various Latin-* character encodings.

    Well, UTF-8 is an multi-octet _encoding_ for all Unicode characters,
    while the ISO 8859-X family represents single octet representations.

    I'm curious why the OP needs ISO8859-1 and can't use UTF-8.

    I think this, or why he can't use "Latin 9", are essential questions.

    It seems to have got clear after a subsequent post of the OP; some
    message/data source seems to provide characters from the upper planes
    of Unicode and the OP has to (or wants to) somehow map them to some
    constant octet character set. - Yet there's no information provided
    what Unicode characters - characters that don't have a representation
    in Latin 1 or Latin 9 - the OP will encounter or not from that source.

    As it sounds it all seems to make little sense.

    Janis


    As the OP explained in a reply to one of my posts, he is getting data in
    in UCS-2 format from SMS's from a modem.  Somewhere along the line,
    either the firmware in the modem or in the code sending the SMS's,
    characters beyond the BMP are being used needlessly.  So it looks like
    his first idea of manually handling a few cases (like code 0x2019) seems
    like the right approach.

    Small nit, not outside the BMP, just outside the ASCII/LATIN-1 set. It
    did like so many other programs a "pretty" transformation of a simple
    single quotation mark, to a fancy version.


    Whether Latin-1 or Latin-9 is better will depend on his application. The
    additional characters in Latin-9, with the exception of the Euro symbol,
    are pretty obscure - it's unlikely that you'd need them and not need a
    good deal more other characters (i.e., supporting much more of Unicode).

    As for why not use UTF-8, the answer is clearly simplicity.  The OP is
    working with a resource-constrained embedded system.  I don't know what
    he is doing with the characters after converting them from UCS-2, but it
    is massively simpler to use an 8-bit character set if they are going to
    be used for display on a small system.  It also keeps memory management
    simpler, and that is essential on such systems - one UCS-2 character
    maps to one code unit with Latin-9 here.  The space needed for UTF-8 is
    much harder to predict, and the OP will want to avoid any kind of
    malloc() or dynamic allocation where possible.

    If the incoming SMS's are just being logged, or passed out in some other
    way, then UTF-8 may be a convenient alternative.




    I would say the big difference is that an 8-bit character set needs to
    only store 256 glyphs for its font. Converting to UTF-8 would still
    require storing some massive font, and deciding exactly how massive it
    will be.

    Most European characters are an ASCII letter plus accents, which can be
    stored quite efficiently. Korean requires a handful of basic characters;
    the rest can be synthesised from them.

    Full Unicode certainly requires massive font, but selected subset
    may be possible with modest resources (but probably more than 256
    positions).

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to Janis Papanagnou on Sat Feb 22 22:38:45 2025
    On 2/22/25 10:43 AM, Janis Papanagnou wrote:
    On 22.02.2025 13:15, Richard Damon wrote:

    [...] It did like so many other programs a "pretty" transformation of
    a simple single quotation mark, to a fancy version.

    Good to put the "pretty" in quotes; I've seen so many "fancy versions",
    one worse than the other. They are culture specific and on a terminal
    they often look bad even in their native target form. For example “--”
    in a man page (say, 'man awk') has a left and right slant respectively
    and they are linear, but my newsreader shows them both in the same
    direction but the one thicker at the bottom the other at the top. It's
    similar with single quotes; here we often see accents used at one side
    and a regular single quote at the other side. In 'man man' for example
    we find even a comment on that in the description of option '--ascii'.
    There's *tons* of such quoting characters for the various languages,
    in my mother tongue there's even _more than one_ type used in printed
    media. Single or double and left or right and bottom or top or mixed
    or double or single angle brackets in opening and closing form, plus
    the *misused* accent characters (which look worst, IMO, especially if
    combined inconsistently with other forms).

    I'm glad that in programming there's a bias on symmetric use of the
    neutral forms " and ' (for strings and characters and other quoting)
    and that things like accents ` and ´ *seem* to gradually vanish for
    quoting purposes; e.g. shell `...` long superseded by $(...). Only
    document contents occasionally still adhere to trashy use.

    One thing I'd really like to understand is why folks have been mixing
    accents with quotes, as in ``standard'' (also taken from 'man awk').

    They may look acceptable in one display or printing device but become
    a typographical catastrophe when viewed on another device type.

    </rant>

    Janis

    I have more often seen not the "accents" but the curly quotes (one and
    closed) that look more like elevated commas flipped around.

    When used to "escape" a character (or quote a string as an extended
    escape), people come up with all sorts of ideas, and there sometimes the
    strange characters were chosen to minimize the need for ways to escape
    the escape character.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Janis Papanagnou on Sun Feb 23 00:01:37 2025
    On 2/21/25 23:29, Janis Papanagnou wrote:
    ...
    BTW; curious about that [informal] part of the syntax description

    LF: <any sequence of a single ASCII 0A or 0D, or both>

    It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
    Is the latter of any practical relevance?
    According to <https://en.wikipedia.org/wiki/Newline#Representation>,
    LF-CR is used by "Acorn BBC and RISC OS spooled text output". I presume
    you would not consider that to be of any practical importance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Sun Feb 23 05:53:59 2025
    On Sat, 22 Feb 2025 05:29:14 +0100, Janis Papanagnou wrote:

    It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
    Is the latter of any practical relevance?

    Not to answer the question, but just to add to it; from the XML 1.1 spec
    <https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-xml11>:

    In addition, XML 1.0 attempts to adapt to the line-end conventions
    of various modern operating systems, but discriminates against the
    conventions used on IBM and IBM-compatible mainframes. As a
    result, XML documents on mainframes are not plain text files
    according to the local conventions. XML 1.0 documents generated on
    mainframes must either violate the local line-end conventions, or
    employ otherwise unnecessary translation phases before parsing and
    after generation. Allowing straightforward interoperability is
    particularly important when data stores are shared between
    mainframe and non-mainframe systems (as opposed to being copied
    from one to the other). Therefore XML 1.1 adds NEL (#x85) to the
    list of line-end characters. For completeness, the Unicode line
    separator character, #x2028, is also supported.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Janis Papanagnou on Sun Feb 23 07:03:04 2025
    On 2025-02-22, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    LF: <any sequence of a single ASCII 0A or 0D, or both>

    It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
    Is the latter of any practical relevance?

    Because if Unicode people spot the slightest opportunity to add
    pointless complexity to anything, they tend to pounce on it.

    Why just specify one line ending convention, when you can require the
    processor of the file to watch out for four different tokens denoting
    the line break?

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Mon Feb 24 08:31:21 2025
    On 22.02.2025 22:23, Lawrence D'Oliveiro wrote:
    On Sat, 22 Feb 2025 13:11:34 +0100, Janis Papanagnou wrote:

    UTF-8 is an _encoding_ (as I wrote), as opposed to a direct
    representation of a fixed width character (either 8 bit width ISO 8859-X
    or 16 bit with UCS-2). Conversions to/from UTF-8 are not as
    straightforward as fixed width character representations are.

    Unicode is not, and never has been, a fixed-width character set.

    I was speaking about the "UTF-8 _encoding_" of Unicode.

    (Not sure what you consider the term "Unicode" implies.)

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Mon Feb 24 08:27:19 2025
    On 23.02.2025 08:03, Kaz Kylheku wrote:
    On 2025-02-22, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    LF: <any sequence of a single ASCII 0A or 0D, or both>

    It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
    Is the latter of any practical relevance?

    Because if Unicode people spot the slightest opportunity to add
    pointless complexity to anything, they tend to pounce on it.

    Given what's all collected in Unicode they've long passed the line
    where one more or less character would matter. ;-)

    That said; I anyway think it's good to have one standard instead of
    hundreds of individual specific character sets and "codepage" variants.


    Why just specify one line ending convention, when you can require the
    processor of the file to watch out for four different tokens denoting
    the line break?

    Well, the history is (partly) understandable. Doesn't that stem from
    early IT days where printers and their components got controlled by
    atomic commands; CR, LF, BS [*]. Sending such a text file with CR LF
    to the printer would perform the necessary printer raw commands.[**]

    I think at some early point in history they should have differentiated
    and standardized the file ending to use a single character.

    Is it now too late given that even some RFC protocol standards specify
    CR LF as ending sequence?

    Janis

    [*] I recall a mainframe terminal that echoed the password by sequences
    of <PW-char> BS 'X' BS 'O' etc. to keep it "unreadable". Of course
    you could see the PW by manually turning the "drum" during the print
    process.

    [**] OTOH I recall there was another control method, where the first
    character of a line determined the printer control; sending a file to
    the raw printer produces quite a mess.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Mon Feb 24 20:10:17 2025
    On Mon, 24 Feb 2025 08:27:19 +0100, Janis Papanagnou wrote:

    Given what's all collected in Unicode they've long passed the line where
    one more or less character would matter. ;-)

    Unicode isn’t a collection of characters as such, it’s a collection of
    code points. And yes, every code point is there for a reason.

    Remember the proposal to add code points for country icons like 🇳🇿, 🇺🇳 and
    🇪🇺 got rejected? Remember what they did instead?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to pozz on Mon Feb 24 20:13:00 2025
    On Mon, 24 Feb 2025 16:57:24 +0100, pozz wrote:

    Il 22/02/2025 14:18, David Brown ha scritto:

    My understanding here is that the OP is getting the UCS-2 encoded
    string in from a modem, almost certainly on a serial line.  The UCS-2
    encoded data is itself a binary sequence of 16-bit code units, and the
    modem firmware is sending those as four hex digits.

    Exactly. This is the reply to AT+CMGR command that is standardized in
    3GPP TS 27.005.

    Anything that is specifying the use of UCS-2 encoding automatically
    dates itself to about the early-to-mid 1990s.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Mon Feb 24 20:11:29 2025
    On Mon, 24 Feb 2025 08:31:21 +0100, Janis Papanagnou wrote:

    On 22.02.2025 22:23, Lawrence D'Oliveiro wrote:

    On Sat, 22 Feb 2025 13:11:34 +0100, Janis Papanagnou wrote:

    UTF-8 is an _encoding_ (as I wrote), as opposed to a direct
    representation of a fixed width character (either 8 bit width ISO
    8859-X or 16 bit with UCS-2). Conversions to/from UTF-8 are not as
    straightforward as fixed width character representations are.

    Unicode is not, and never has been, a fixed-width character set.

    I was speaking about the "UTF-8 _encoding_" of Unicode.

    You *did* say that UCS-2 was “a direct representation of a fixed width
    character”, did you not? It’s in your posting quoted above.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Harnden@21:1/5 to pozz on Tue Feb 25 10:24:34 2025
    On 21/02/2025 11:40, pozz wrote:
    I want to write a simple function that converts UCS2 string into ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing
    size because ucs2 isn't null terminated.

    I know I can use iconv() feature, but I'm on an embedded platform
    without an OS and without iconv() function.

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from
    unsigned int to char.

    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial
    again. But I saw the code "2019" (apostrophe) that can be rendered as
    0x27 in ISO8859-1.

    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
      *dst++ = (char)code;
    } else {
      switch (code) {
        case 0x2019: *dst++ = 0x27; break;  // Apostrophe
        case 0x...: *dst++ = ...; break;
        default: *dst++ = ' ';
      }
    }

    I'm not searching a very detailed and correct mapping, but just a
    "sufficient" implementation.
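    One hedged way to flesh out that sketch (the particular set of cases
    below is illustrative only; it is not a complete or authoritative
    mapping, just the kind of "sufficient" table being asked about):

    ```c
    #include <stdint.h>

    /* Map one UCS-2 code unit to an ISO8859-1 byte.  Code points below
       0x100 map 1:1, since Latin-1 matches the first 256 Unicode code
       points; a short switch approximates a few common "fancy"
       punctuation characters; everything else becomes '?'. */
    static char ucs2_to_latin1(uint16_t code)
    {
        if (code < 0x100)
            return (char)code;
        switch (code) {
        case 0x2018:                  /* left single quotation mark   */
        case 0x2019: return '\'';     /* right single quotation mark  */
        case 0x201C:                  /* left double quotation mark   */
        case 0x201D: return '"';      /* right double quotation mark  */
        case 0x2013:                  /* en dash                      */
        case 0x2014: return '-';      /* em dash                      */
        case 0x2026: return '.';      /* horizontal ellipsis (rough)  */
        default:     return '?';      /* no sensible approximation    */
        }
    }
    ```

    The replacement character and the choice of approximations are design
    decisions, as Richard Damon pointed out earlier in the thread.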



    Can you use iconv to help build your switch statement?

    For each UCS2 char, if 'iconv -f ucs2 -t iso8859-1//translit' doesn't
    complain, then you can build up your case statements.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to pozz on Tue Feb 25 07:18:28 2025
    On 2/25/25 2:35 AM, pozz wrote:
    Il 24/02/2025 21:13, Lawrence D'Oliveiro ha scritto:
    On Mon, 24 Feb 2025 16:57:24 +0100, pozz wrote:

    Il 22/02/2025 14:18, David Brown ha scritto:

    My understanding here is that the OP is getting the UCS-2 encoded
    string in from a modem, almost certainly on a serial line.  The UCS-2
    encoded data is itself a binary sequence of 16-bit code units, and the
    modem firmware is sending those as four hex digits.

    Exactly. This is the reply to AT+CMGR command that is standardized in
    3GPP TS 27.005.

    Anything that is specifying the use of UCS-2 encoding automatically dates
    itself to about the early-to-mid 1990s.

    Sincerely I don't know why and when, but the LTE modem I'm using
    (Simcom A7672E) replies to AT+CMGR in two different formats:

    - what is described as GSM 7-bit alphabet (but it's really UTF-8 when
    non-ASCII chars are present)

    - UCS2

    Of course, in the header, it specifies the <dcs> (data coding scheme) so
    the receiver on the UART can interpret correctly all the data.


    Are you sure it is UCS2 and not UTF-16?

    Can it not handle characters not in the BMP?

    The difference between UCS2 and UTF-16 is that UCS2 is the character set
    that predates the surrogate-pairs added to extend it. It is very much
    the equivalent relationship of ASCII to UTF-8.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to pozz on Tue Feb 25 17:16:23 2025
    On 25/02/2025 15:53, pozz wrote:
    Il 25/02/2025 13:18, Richard Damon ha scritto:
    On 2/25/25 2:35 AM, pozz wrote:
    Il 24/02/2025 21:13, Lawrence D'Oliveiro ha scritto:
    On Mon, 24 Feb 2025 16:57:24 +0100, pozz wrote:

    Il 22/02/2025 14:18, David Brown ha scritto:

    My understanding here is that the OP is getting the UCS-2 encoded
    string in from a modem, almost certainly on a serial line.  The UCS-2
    encoded data is itself a binary sequence of 16-bit code units, and the
    modem firmware is sending those as four hex digits.

    Exactly. This is the reply to AT+CMGR command that is standardized in
    3GPP TS 27.005.

    Anything that is specifying the use of UCS-2 encoding automatically
    dates itself to about the early-to-mid 1990s.

    Sincerely I don't know why and when, but the LTE modem I'm using
    (Simcom A7672E) replies to AT+CMGR in two different formats:

    - what is described as GSM 7-bit alphabet (but it's really UTF-8 when
    non-ASCII chars are present)

    - UCS2

    Of course, in the header, it specifies the <dcs> (data coding scheme)
    so the receiver on the UART can interpret correctly all the data.


    Are you sure it is UCS2 and not UTF-16?

    Can it not handle characters not in the BMP?

    The difference between UCS2 and UTF-16 is that UCS2 is the character
    set that predates the surrogate-pairs added to extend it. It is very
    much the equivalent relationship of ASCII to UTF-8.

    Sincerely I don't know, the standard says UCS2

    The standard used by modems here is UCS2, not UTF-16. As you point out,
    this was all standardised in the early 1990's (before UTF-16) - as a
    standardisation of things that had already been used before that. And
    once a telecom standard is made, it is set in stone and never changed.
    Unlike for some things that adopted Unicode early using UCS2 (like
    Windows NT, Java, Qt, Python) the UCS2 use in established modem standard
    commands (like AT+CMGR) could not be, and was not, extended to UTF-16.
    There might be other AT commands supported by some modems that /do/
    support UTF-8 or UTF-16, but existing standardised commands don't change.

    For all Unicode code points supported by UCS2, the coding is the same as
    for UTF-16 (as Richard says, it's like the ASCII subset of UTF-8). So
    you can always treat UCS2 as UTF-16. Unicode characters outside this
    set simply have no representation in UCS2.
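    A minimal sketch of that "treat UCS2 as UTF-16" advice (function name
    and the '?' replacement policy are this sketch's assumptions, not from
    the thread): a surrogate pair, which strict UCS-2 never produces, is
    consumed as one unit rather than misread as two characters.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Walk 16-bit code units as UTF-16 and emit Latin-1 bytes.  For BMP
       code units this behaves identically to UCS-2; anything without a
       Latin-1 representation becomes '?'.  Returns bytes written. */
    static size_t utf16_units_to_latin1(const uint16_t *cu, size_t n,
                                        char *dst, size_t dstmax)
    {
        size_t out = 0;
        for (size_t i = 0; i < n && out < dstmax; i++) {
            uint16_t c = cu[i];
            if (c >= 0xD800 && c <= 0xDBFF) {          /* high surrogate */
                if (i + 1 < n &&
                    cu[i + 1] >= 0xDC00 && cu[i + 1] <= 0xDFFF)
                    i++;                               /* skip low half  */
                dst[out++] = '?';        /* outside the BMP: no mapping  */
            } else if (c < 0x100) {
                dst[out++] = (char)c;    /* Latin-1 matches these 1:1    */
            } else {
                dst[out++] = '?';
            }
        }
        return out;
    }
    ```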

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to pozz on Tue Feb 25 20:31:27 2025
    On Tue, 25 Feb 2025 15:53:23 +0100, pozz wrote:

    ... the standard says UCS2

    Does it mention anything about the surrogate ranges (0xD800 .. 0xDFFF)?
    In order for it to be strict UCS-2, they would have to be forbidden. If
    they are allowed, then that makes it UTF-16.
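    The check for those ranges is cheap; a sketch (the function name is
    made up for illustration):

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Code units in 0xD800..0xDFFF are reserved for UTF-16 surrogates;
       a stream that is strictly UCS-2 should never contain them.  The
       whole range shares the top five bits 11011, so one mask suffices. */
    static bool is_surrogate(uint16_t cu)
    {
        return (cu & 0xF800) == 0xD800;
    }
    ```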

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to pozz on Tue Feb 25 20:29:40 2025
    On Tue, 25 Feb 2025 08:32:46 +0100, pozz wrote:

    Il 24/02/2025 21:13, Lawrence D'Oliveiro ha scritto:

    Anything that is specifying the use of UCS-2 encoding automatically
    dates itself to about the early-to-mid 1990s.

    Here[1] the first version of this document is dated back to 1999, but
    UCS2 remains and is implemented in currently modem on the market.

    [1] https://www.3gpp.org/ftp/Specs/archive/27_series/27.005/

    Why would they still be using UCS-2, not UTF-16?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to pozz on Wed Feb 26 03:16:50 2025
    On Tue, 25 Feb 2025 15:53:23 +0100, pozz wrote:

    Sincerely I don't know, the standard says UCS2

    Remembered that the specs are online
    <https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1514>.
    They don’t actually say what “data coding schemes” are allowed; the
    only mention of UCS-2 is

    if <dcs> indicates that 8-bit or UCS2 data coding scheme is used,
    or <fo> indicates that 3GPP TS 23.040 [3]
    TP-User-Data-Header-Indication is set: ME/TA converts each 8-bit
    octet into two IRA character long hexadecimal number (e.g. octet
    with integer value 42 is presented to TE as two characters 2A (IRA
    50 and 65))

    So it doesn’t say that UTF-16 is or isn’t allowed.
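    The ME/TA presentation the spec describes (each octet becomes two IRA,
    i.e. ASCII, hexadecimal characters, so octet 42 appears as "2A") is
    straightforward to sketch; the function name is invented here:

    ```c
    #include <stdint.h>

    /* Present one octet as two IRA hex characters, e.g. 42 -> "2A",
       matching the spec's example (IRA 50 is '2', IRA 65 is 'A'). */
    static void octet_to_ira_hex(uint8_t octet, char out[2]) {
        static const char digits[] = "0123456789ABCDEF";
        out[0] = digits[octet >> 4];
        out[1] = digits[octet & 0x0F];
    }
    ```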

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to Lawrence D'Oliveiro on Tue Feb 25 23:21:21 2025
    On 2/25/25 3:31 PM, Lawrence D'Oliveiro wrote:
    On Tue, 25 Feb 2025 15:53:23 +0100, pozz wrote:

    ... the standard says UCS2

    Does it mention anything about the surrogates ranges (0xD800 .. 0xDFFF)?
    In order for it to be strict UCS-2, they would have to be forbidden. If
    they are allowed, then that makes it UTF-16.

    To my knowledge, UCS-2 doesn't say those codes are "forbidden", just
    that they are not defined codes.

    UCS-2 basically became a legacy code when they needed to expand unicode
    to more than 16 bits. Systems defined to use it basically just treat
    UTF-16 surrogate pairs as two characters they don't know what they mean,
    just like a lot of programs can treat UTF-8 as "ASCII" with some codes
    it doesn't know what they mean.

    The ignorance is bliss method works well for a number of tasks, you just
    need to only alter strings at points you "understand", and not need to
    actualy count characters (which actualy becomes hard to do totally right
    in Unicode anyway).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Keith Thompson on Wed Feb 26 09:57:21 2025
    On 26/02/2025 04:37, Keith Thompson wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 25 Feb 2025 15:53:23 +0100, pozz wrote:

    Sincerely I don't know, the standard says UCS2

    Remembered that the specs are online
    <https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1514>.
    They don’t actually say what “data coding schemes” are allowed; the
    only mention of UCS-2 is

    if <dcs> indicates that 8-bit or UCS2 data coding scheme is used,
    or <fo> indicates that 3GPP TS 23.040 [3]
    TP-User-Data-Header-Indication is set: ME/TA converts each 8-bit
    octet into two IRA character long hexadecimal number (e.g. octet
    with integer value 42 is presented to TE as two characters 2A (IRA
    50 and 65))

    So it doesn’t say that UTF-16 is or isn’t allowed.

    It doesn't say that EBCDIC or UTF-7 is or isn't allowed.


    There are two specifications at work here. One is the 3G standards
    about coding schemes used for SMS data, and the other is the common AT
    command set. The latter is, I think, mostly a de-facto standard -
    modem manufacturers have tried to keep a common subset (along with
    their own device-specific commands). 3G may allow for 8-bit encoding
    sets without specifying them in detail, but the modem commands are
    more specific - the ones used by the OP are strictly UCS-2.

    (It is a /long/ time since I have read details of these things, so I
    might be misremembering.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Geoff@21:1/5 to pozz on Sat Mar 1 09:31:55 2025
    On Fri, 21 Feb 2025 12:40:06 +0100, pozz <pozzugno@gmail.com> wrote:

    I want to write a simple function that converts UCS2 string into ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing
    size because ucs2 isn't null terminated.

    I know I can use iconv() feature, but I'm on an embedded platform
    without an OS and without iconv() function.

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from
    unsigned int to char.

    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial
    again. But I saw the code "2019" (apostrophe) that can be rendered as
    0x27 in ISO8859-1.

    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
      *dst++ = (char)code;
    } else {
      switch (code) {
        case 0x2019: *dst++ = 0x27; break;  // Apostrophe
        case 0x...: *dst++ = ...; break;
        default: *dst++ = ' ';
      }
    }

    I'm not searching a very detailed and correct mapping, but just a
    "sufficient" implementation.


    #include <stdint.h>
    #include <stddef.h>

    // Function to convert UCS2 to ISO8859-1
    // (iso88591 must point to a buffer of at least length + 1 bytes)
    void UCS2ToISO88591(const uint16_t* ucs2, size_t length, char* iso88591) {
        for (size_t i = 0; i < length; ++i) {
            uint16_t ucs2_char = ucs2[i];
            if (ucs2_char <= 0x00FF) {
                iso88591[i] = (char)ucs2_char;
            } else {
                // Handle characters that cannot be represented in ISO8859-1
                iso88591[i] = '?'; // Replace with a placeholder character
            }
        }
        // Null-terminate the ISO8859-1 string
        iso88591[length] = '\0';
    }
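    The OP also asked for a "sufficient" mapping of a few common code points
    above U+00FF rather than a blanket '?'. A minimal variant combining the
    conversion above with a small fallback table; the table entries are an
    editorial suggestion, lossy by design, and can be extended as needed:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Fallback approximations for a few code points above U+00FF. */
    static char ucs2_fallback(uint16_t u) {
        switch (u) {
        case 0x2018: case 0x2019: return '\'';  /* curly single quotes */
        case 0x201C: case 0x201D: return '"';   /* curly double quotes */
        case 0x2013: case 0x2014: return '-';   /* en/em dash */
        default:                  return '?';   /* no approximation */
        }
    }

    /* Convert UCS2 units to ISO8859-1 with approximate fallbacks.
       'out' must have room for n + 1 bytes (terminating null). */
    static void ucs2_to_iso88591(const uint16_t *ucs2, size_t n, char *out) {
        for (size_t i = 0; i < n; ++i)
            out[i] = (ucs2[i] <= 0x00FF) ? (char)ucs2[i]
                                         : ucs2_fallback(ucs2[i]);
        out[n] = '\0';
    }
    ```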

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)