• Re: Simple string conversion from UCS2 to ISO8859-1

    From Richard Damon@21:1/5 to pozz on Fri Feb 21 07:05:57 2025
    On 2/21/25 6:40 AM, pozz wrote:
    I want to write a simple function that converts UCS2 string into ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string like "00480065006C006C006F" for "Hello". I'm passing
    size because ucs2 isn't null terminated.

    Typically UCS2 strings ARE null terminated; it's just that the null is
    two bytes long.
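    A hypothetical helper illustrating that point (not from the original
    post): scanning for the 16-bit terminator, analogous to strlen() but
    counting 16-bit code units.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Length of a null-terminated UCS-2 string in 16-bit code units.
       The terminator is one 16-bit zero, not a single zero byte. */
    size_t ucs2len(const uint16_t *s)
    {
        size_t n = 0;
        while (s[n] != 0)
            n++;
        return n;
    }
    ```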


    I know I can use iconv() feature, but I'm on an embedded platform
    without an OS and without iconv() function.

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from unsigned int to char.

    Note, I think you will find that it is 0000-00FF that match (as I remember, ISO8859-1 was the basis for starting Unicode).


    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial again. But I saw the code "2019" (apostrophe) that can be rendered as
    0x27 in ISO8859-1.

    To be correct, u2019 isn't 0x27; it's just a character that looks a lot
    like it.


    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
      *dst++ = (char)code;
    } else {
      switch (code) {
        case 0x2019: *dst++ = 0x27; break;  // Apostrophe
        case 0x...: *dst++ = ...; break;
        default: *dst++ = ' ';
      }
    }

    I'm not looking for a very detailed and correct mapping, just a "sufficient" implementation.



    Then you have to decide which are sufficient mappings. No character
    above FF *IS* the character below, but some have a close approximation,
    so you will need to decide what to map.
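    A minimal sketch of one such decision, assuming the input has already
    been unpacked into 16-bit code points (the hex-digit parsing is left
    out) and using a deliberately tiny, subjective replacement table; the
    function name and chosen fallbacks are illustrative, not standard:

    ```c
    #include <stdint.h>

    /* Map one UCS-2 code point to ISO 8859-1, substituting '?' when
       there is no reasonable approximation.  The replacement table is
       a subjective choice, not a standard mapping. */
    static char ucs2_to_latin1_char(uint16_t code)
    {
        if (code <= 0x00FF)          /* U+0000..U+00FF match Latin-1 directly */
            return (char)code;

        switch (code) {
        case 0x2018:                 /* left single quotation mark  */
        case 0x2019: return '\'';    /* right single quotation mark */
        case 0x201C:                 /* left double quotation mark  */
        case 0x201D: return '"';     /* right double quotation mark */
        case 0x2013:                 /* en dash */
        case 0x2014: return '-';     /* em dash */
        default:     return '?';     /* no sensible approximation   */
        }
    }
    ```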

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to pozz on Fri Feb 21 14:06:03 2025
    On 21.02.2025 13:42, pozz wrote:
    Il 21/02/2025 13:05, Richard Damon ha scritto:
    On 2/21/25 6:40 AM, pozz wrote:
    I want to write a simple function that converts UCS2 string into
    ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string like "00480065006C006C006F" for "Hello". I'm
    passing size because ucs2 isn't null terminated.

    [...]

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from
    unsigned int to char.

    Note, I think you will find that it is 0000-00FF that match (as
    I remember, ISO8859-1 was the basis for starting Unicode).

    I second that.


    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
    trivial again. But I saw the code "2019" (apostrophe) that can be
    rendered as 0x27 in ISO8859-1.

    To be correct, u2019 isn't 0x27; it's just a character that looks a lot
    like it.

    Yes, but as a first approximation, 0x27 is much better than '?' for u2019.

    Note that there are _standard names_ assigned to the characters.
    These are normative for what the characters represent. - I strongly
    suggest not twisting these standards by assigning different
    characters; you will do no one a favor but only inflict confusion
    and harm.


    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
    *dst++ = (char)code;
    } else {
    switch (code) {
    case 0x2019: *dst++ = 0x27; break; // Apostrophe
    case 0x...: *dst++ = ...; break;
    default: *dst++ = ' ';
    }
    }

    I'm not looking for a very detailed and correct mapping, just a
    "sufficient" implementation.

    Then you have to decide which are sufficient mappings. No character
    above FF *IS* the character below, but some have a close
    approximation, so you will need to decide what to map.

    Yes, I have to decide, but it is a very big problem (there are thousands
    of Unicode symbols that can be approximated by an ISO8859-1 code).
    I'm wondering if such an approximation is already implemented somewhere.

    I've just made a pass across the names of UCS-2 and ISO 8859-1, based
    on their normative names and, as already mentioned above, they match
    one-to-one in the ranges 0000-00FF and 00-FF respectively.

    BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
    of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
    contains a few other characters like the € (Euro Sign). If that is
    possible in your context, you only have to map a handful of characters.
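    That handful can be sketched as follows: the eight positions where
    ISO 8859-15 (Latin-9) differs from ISO 8859-1 (Latin-1). The helper
    name is hypothetical; the code-point/byte pairs come from the
    published ISO 8859-15 table.

    ```c
    #include <stdint.h>

    /* Returns the Latin-9 byte for the eight Unicode characters where
       ISO 8859-15 differs from ISO 8859-1, or 0 for everything else
       (sketch; the caller handles the common Latin-1 range). */
    static unsigned char latin9_special(uint16_t code)
    {
        switch (code) {
        case 0x20AC: return 0xA4;  /* EURO SIGN                             */
        case 0x0160: return 0xA6;  /* LATIN CAPITAL LETTER S WITH CARON     */
        case 0x0161: return 0xA8;  /* LATIN SMALL LETTER S WITH CARON       */
        case 0x017D: return 0xB4;  /* LATIN CAPITAL LETTER Z WITH CARON     */
        case 0x017E: return 0xB8;  /* LATIN SMALL LETTER Z WITH CARON       */
        case 0x0152: return 0xBC;  /* LATIN CAPITAL LIGATURE OE             */
        case 0x0153: return 0xBD;  /* LATIN SMALL LIGATURE OE               */
        case 0x0178: return 0xBE;  /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
        default:     return 0;
        }
    }
    ```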

    Janis

    For example, what does iconv() do in this case?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Fri Feb 21 14:17:32 2025
    On 21.02.2025 14:06, Janis Papanagnou wrote:
    On 21.02.2025 13:42, pozz wrote:
    Il 21/02/2025 13:05, Richard Damon ha scritto:
    On 2/21/25 6:40 AM, pozz wrote:

    [...] But I saw the code "2019" (apostrophe) that can be
    rendered as 0x27 in ISO8859-1.

    To be correct, u2019 isn't 0x27; it's just a character that looks a lot
    like it.

    Yes, but as a first approximation, 0x27 is much better than '?' for u2019.

    Note that there are _standard names_ assigned to the characters.
    These are normative for what the characters represent. - I strongly
    suggest not twisting these standards by assigning different
    characters; you will do no one a favor but only inflict confusion
    and harm.

    I want to add the standard names to make it clear...

    0027 APOSTROPHE
    2019 RIGHT SINGLE QUOTATION MARK

    27 APOSTROPHE

    Hope that helps to understand the standard names.


    (You should also be aware that a glyph for a character may be
    depicted differently depending on the source of the respective
    documents.)

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to pozz on Fri Feb 21 15:23:51 2025
    On 21/02/2025 12:40, pozz wrote:
    I want to write a simple function that converts UCS2 string into ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string like "00480065006C006C006F" for "Hello". I'm passing
    size because ucs2 isn't null terminated.

    I know I can use iconv() feature, but I'm on an embedded platform
    without an OS and without iconv() function.

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from unsigned int to char.

    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial again. But I saw the code "2019" (apostrophe) that can be rendered as
    0x27 in ISO8859-1.

    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
      *dst++ = (char)code;
    } else {
      switch (code) {
        case 0x2019: *dst++ = 0x27; break;  // Apostrophe
        case 0x...: *dst++ = ...; break;
        default: *dst++ = ' ';
      }
    }

    I'm not looking for a very detailed and correct mapping, just a "sufficient" implementation.



    <https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>

    As has been mentioned by others, 0 - 0xff should be a direct translation
    (with the possible exception of Latin-9 differences).

    <https://en.wikipedia.org/wiki/ISO/IEC_8859-15>


    When you look at the BMP blocks above the first two blocks (0 - 0x7f, 0x80
    - 0xff), you will quickly see that virtually none of them make any sense
    to support in the way you are thinking. Just because a couple of the characters in the Thaana block look a bit like quotation marks, does not
    mean it makes any sense to try to transliterate them. Realistically,
    you can at most make use of a few punctuation symbols (like 0x2019
    above), and maybe approximate forms for some extended Latin alphabet
    characters that you will never see in practice. Oh, and you might be
    able to support those spam emails that use Greek and other letters that
    look like Latin letters such as "ՏΡ𐊠Ꮇ" to fool filters. And that's assuming you have output support for the full Latin-1 or Latin-9 range.


    Unicode is rarely much use unless you want and can provide good support
    for non-Latin alphabets. Otherwise your translations are going to be so limited and simple that they are barely worth the effort and won't cover anything useful.


    So here I would say that whoever provides the text, provides it in
    Latin-9 encoding. There's no point in allowing external translators to
    use whatever characters they feel are best in their language, and then
    your code makes some kind of odd approximation giving results that look different. If someone really wants to use the letter "ā" that is found
    in the Latin Extended A block, how do /you/ know whether the best
    Latin-9 match is "a", "ã", "ä", or something different like "aa" or an alternative spelling of the word? Maybe the rules are different for
    Latvian and Anglicised Mandarin.


    When we have worked with multiple languages on small embedded systems
    (too small for big fonts and UTF-8), we have used one of three techniques:

    1. Insist that the external translators provide strings in Latin-9 only
    (or even just ASCII when the system was more restricted).

    2. Use primarily ASCII, with a few user-defined characters per language
    (that's useful for old-style character displays with space for perhaps 8 user-defined characters).

    3. Use a PC program to figure out the characters actually used in the
    strings, and put them into a single table indexing a generated list of
    bitmap glyphs, also generated by the program (from freely available
    fonts). The source is, naturally, UTF-8 - the strings stored in the
    embedded system are not in any standard encoding representing
    characters, but now hold glyph table indices.


    Your idea here sounds to me like a lot of work for virtually no benefit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Keith Thompson on Fri Feb 21 23:35:57 2025
    On 21.02.2025 20:40, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    [...]
    BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
    of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
    contains a few other characters like the € (Euro Sign). If that is
    possible for your context you have to map a handful of characters.

    Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
    not, which would make the translation more difficult.

    Yes, that had already been pointed out upthread.

    The (open) question is whether it makes sense to convert to "Latin 1"
    only because it has a one-to-one mapping concerning the first UCS-2
    characters, or if the underlying application of the OP wants support
    of contemporary information by e.g. providing the € (Euro) sign with
    "Latin 9".


    <https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
    the 8 characters that differ between Latin-1 and Latin-9.

    If at all possible, it would be better to convert to UTF-8. The
    conversion is exact and reversible, and UTF-8 has largely superseded the various Latin-* character encodings.

    Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
    while the ISO 8859-X family uses single-octet representations.

    I'm curious why the OP needs ISO8859-1 and can't use UTF-8.

    I think this, or why he can't use "Latin 9", are essential questions.

    It became clear after a subsequent post of the OP; some
    message/data source seems to provide characters from the upper planes
    of Unicode, and the OP has to (or wants to) somehow map them to some
    constant-octet character set. - Yet there's no information provided on
    what Unicode characters - characters that don't have a representation
    in Latin 1 or Latin 9 - the OP will or will not encounter from that source.

    As it sounds, it all seems to make little sense.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to pozz on Fri Feb 21 20:05:22 2025
    On 2/21/25 7:42 AM, pozz wrote:
    Il 21/02/2025 13:05, Richard Damon ha scritto:
    On 2/21/25 6:40 AM, pozz wrote:
    I want to write a simple function that converts UCS2 string into
    ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string like "00480065006C006C006F" for "Hello". I'm
    passing size because ucs2 isn't null terminated.

    Typically UCS2 strings ARE null terminated; it's just that the null is
    two bytes long.

    Sure, but this isn't an issue here.


    I know I can use iconv() feature, but I'm on an embedded platform
    without an OS and without iconv() function.

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from
    unsigned int to char.

    Note, I think you will find that it is 0000-00FF that match (as
    I remember, ISO8859-1 was the basis for starting Unicode).


    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
    trivial again. But I saw the code "2019" (apostrophe) that can be
    rendered as 0x27 in ISO8859-1.

    To be correct, u2019 isn't 0x27; it's just a character that looks a lot
    like it.

    Yes, but as a first approximation, 0x27 is much better than '?' for u2019.

    And, as such is a subjective decision that you need to make.



    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
       *dst++ = (char)code;
    } else {
       switch (code) {
         case 0x2019: *dst++ = 0x27; break;  // Apostrophe
         case 0x...: *dst++ = ...; break;
         default: *dst++ = ' ';
       }
    }

    I'm not looking for a very detailed and correct mapping, just a
    "sufficient" implementation.

    Then you have to decide which are sufficient mappings. No character
    above FF *IS* the character below, but some have a close
    approximation, so you will need to decide what to map.

    Yes, I have to decide, but it is a very big problem (there are thousands
    of Unicode symbols that can be approximated by an ISO8859-1 code).
    I'm wondering if such an approximation is already implemented somewhere.

    For example, what does iconv() do in this case?

    Just look at its code, there will be open source versions of it.

    The two real options are to just reject anything above 0xFF, or to have a big table/switch to handle some determined list of things that are "close enough".
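    The big-table option could be sketched as a sorted array plus
    bsearch(), so lookup stays logarithmic even with a few hundred
    entries. The table below is deliberately tiny and illustrative, and
    the helper name is made up:

    ```c
    #include <stdlib.h>
    #include <stdint.h>

    struct map { uint16_t ucs2; unsigned char latin1; };

    /* Illustrative mapping table, sorted ascending by code point. */
    static const struct map approx[] = {
        { 0x2013, '-'  },  /* en dash            */
        { 0x2014, '-'  },  /* em dash            */
        { 0x2018, '\'' },  /* left single quote  */
        { 0x2019, '\'' },  /* right single quote */
        { 0x201C, '"'  },  /* left double quote  */
        { 0x201D, '"'  },  /* right double quote */
    };

    static int cmp(const void *k, const void *e)
    {
        uint16_t key = *(const uint16_t *)k;
        uint16_t cur = ((const struct map *)e)->ucs2;
        return (key > cur) - (key < cur);
    }

    /* Map a code point above 0xFF to an approximation, ' ' if unmapped. */
    static unsigned char approx_latin1(uint16_t code)
    {
        const struct map *m = bsearch(&code, approx,
                                      sizeof approx / sizeof approx[0],
                                      sizeof approx[0], cmp);
        return m ? m->latin1 : ' ';
    }
    ```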

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to pozz on Sat Feb 22 01:20:20 2025
    On 2025-02-21, pozz <pozzugno@gmail.com> wrote:
    I want to write a simple function that converts UCS2 string into ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing
    size because ucs2 isn't null terminated.

    This kind of normalizing is a good way of introducing injection
    exploits.

    Suppose the input is some syntax that has been validated; the decision
    is trusted after that. The normalization to the 8-bit character set can
    produce characters which are special in the syntax, changing its
    meaning.

    In Microsoft Windows, there is an example of such a problem. Programs
    which use GetCommandLineA to get the argument string before parsing it
    into arguments are vulnerable to argument injection. The attacker
    supplies a datum to be used by program A as an argument in
    calling program B such that, when the datum is decimated to the 8-bit
    character set, quotes appear in it, creating additional arguments to
    program B.

    again. But I saw the code "2019" (apostrophe) that can be rendered as
    0x27 in ISO8859-1.

    ... and that's a common quoting character in various data syntaxes, oops!
    What could go wrong?

    I think in 2025 we shouldn't have to be crippling Unicode data to fit
    some ISO Latin (or any other 8 bit) character set; we should be rooting
    out technologies and situations which do that.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to pozz on Sat Feb 22 03:00:31 2025
    On Fri, 21 Feb 2025 13:42:13 +0100, pozz wrote:

    Yes, I have to decide, but it is a very big problem (there are thousands
    of Unicode symbols that can be approximated to another ISO8859-1 code).
    I'm wondering if such an approximation is just implemented somewhere.

    If you look at NamesList.txt, you will see, next to each character,
    references to others that might be similar or related in some way.

    They say not to try to parse that file automatically, but I’ve had some success doing exactly that ... so far ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Sat Feb 22 05:29:14 2025
    On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:
    On Fri, 21 Feb 2025 13:42:13 +0100, pozz wrote:

    Yes, I have to decide, but it is a very big problem (there are thousands
    of Unicode symbols that can be approximated to another ISO8859-1 code).
    I'm wondering if such an approximation is just implemented somewhere.

    If you look at NamesList.txt, you will see, next to each character, references to others that might be similar or related in some way.

    They say not to try to parse that file automatically, but I’ve had some success doing exactly that ... so far ...

    I wonder why they say so, given that there's a syntax description
    available on their pages (see the respective HTML file[*]).


    BTW; curious about that [informal] part of the syntax description

    LF: <any sequence of a single ASCII 0A or 0D, or both>

    It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
    Is the latter of any practical relevance?

    Janis

    [*] https://www.unicode.org/Public/UCD/latest/ucd/NamesList.html

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Sat Feb 22 06:13:27 2025
    On Sat, 22 Feb 2025 05:29:14 +0100, Janis Papanagnou wrote:

    On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:

    They say not to try to parse that file automatically, but I’ve had some
    success doing exactly that ... so far ...

    I wonder why they say so, given that there's a syntax description
    available on their pages (see the respective HTML file[*]).

    The file itself says differently <https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt>:

    This file is semi-automatically derived from UnicodeData.txt and a
    set of manually created annotations using a script to select or
    suppress information from the data file. The rules used for this
    process are aimed at readability for the human reader, at the
    expense of some details; therefore, this file should not be parsed
    for machine-readable information.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Sat Feb 22 09:11:02 2025
    On 22.02.2025 07:13, Lawrence D'Oliveiro wrote:
    On Sat, 22 Feb 2025 05:29:14 +0100, Janis Papanagnou wrote:

    On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:

    They say not to try to parse that file automatically, but I’ve had some
    success doing exactly that ... so far ...

    I wonder why they say so, given that there's a syntax description
    available on their pages (see the respective HTML file[*]).

    The file itself says differently <https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt>:

    This file is semi-automatically derived from UnicodeData.txt and a
    set of manually created annotations using a script to select or
    suppress information from the data file. The rules used for this
    process are aimed at readability for the human reader, at the
    expense of some details; therefore, this file should not be parsed
    for machine-readable information.

    I see, but I certainly wouldn't refrain from parsing it. (In the past
    I had parsed much worse data; irregular HTML stuff and the like.)
    OTOH, there's also the CSV data file available, which is even simpler
    to parse with standard tools and no effort.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Janis Papanagnou on Sat Feb 22 12:29:59 2025
    On 21/02/2025 23:35, Janis Papanagnou wrote:
    On 21.02.2025 20:40, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    [...]
    BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
    of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
    contains a few other characters like the € (Euro Sign). If that is
    possible for your context you have to map a handful of characters.

    Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
    not, which would make the translation more difficult.

    Yes, that had already been pointed out upthread.

    The (open) question is whether it makes sense to convert to "Latin 1"
    only because it has a one-to-one mapping concerning the first UCS-2 characters, or if the underlying application of the OP wants support
    of contemporary information by e.g. providing the € (Euro) sign with
    "Latin 9".


    <https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
    the 8 characters that differ between Latin-1 and Latin-9.

    If at all possible, it would be better to convert to UTF-8. The
    conversion is exact and reversible, and UTF-8 has largely superseded the
    various Latin-* character encodings.

    Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
    while the ISO 8859-X family uses single-octet representations.

    I'm curious why the OP needs ISO8859-1 and can't use UTF-8.

    I think this, or why he can't use "Latin 9", are essential questions.

    It became clear after a subsequent post of the OP; some
    message/data source seems to provide characters from the upper planes
    of Unicode, and the OP has to (or wants to) somehow map them to some
    constant-octet character set. - Yet there's no information provided on
    what Unicode characters - characters that don't have a representation
    in Latin 1 or Latin 9 - the OP will or will not encounter from that source.

    As it sounds, it all seems to make little sense.

    Janis


    As the OP explained in a reply to one of my posts, he is getting data
    in UCS-2 format from SMS's from a modem.  Somewhere along the line,
    either the firmware in the modem or in the code sending the SMS's,
    characters beyond the BMP are being used needlessly. So it looks like
    his first idea of manually handling a few cases (like code 0x2019) seems
    like the right approach.

    Whether Latin-1 or Latin-9 is better will depend on his application.
    The additional characters in Latin-9, with the exception of the Euro
    symbol, are pretty obscure - it's unlikely that you'd need them and not
    need a good deal more other characters (i.e., supporting much more of
    Unicode).

    As for why not use UTF-8, the answer is clearly simplicity. The OP is
    working with a resource-constrained embedded system. I don't know what
    he is doing with the characters after converting them from UCS-2, but it
    is massively simpler to use an 8-bit character set if they are going to
    be used for display on a small system. It also keeps memory management simpler, and that is essential on such systems - one UCS-2 character
    maps to one code unit with Latin-9 here. The space needed for UTF-8 is
    much harder to predict, and the OP will want to avoid any kind of
    malloc() or dynamic allocation where possible.

    If the incoming SMS's are just being logged, or passed out in some other
    way, then UTF-8 may be a convenient alternative.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to David Brown on Sat Feb 22 13:11:34 2025
    On 22.02.2025 12:29, David Brown wrote:

    As the OP explained in a reply to one of my posts, he is getting data
    in UCS-2 format from SMS's from a modem. [...]

    (Yes. I wrote: "have got clear after a subsequent post".)


    Whether Latin-1 or Latin-9 is better will depend on his application.

    (Was also my stance upthread; "If that is possible for your context")

    The
    additional characters in Latin-9, with the exception of the Euro symbol,
    are pretty obscure

    ISTR they are some language specific symbols, so probably less obscure
    to someone from those countries.

    - it's unlikely that you'd need them and not need a
    good deal more other characters (i.e., supporting much more of Unicode).

    As for why not use UTF-8, the answer is clearly simplicity.

    This was not my point (someone else suggested that). To me that was
    clear; UTF-8 is an _encoding_ (as I wrote), as opposed to a direct representation of a fixed-width character (either 8-bit-wide ISO
    8859-X or 16-bit UCS-2). Conversions to/from UTF-8 are not as straightforward as fixed-width character representations are.
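    For scale, the per-character conversion itself is short; a sketch of
    encoding one UCS-2 (BMP) code point as UTF-8, a hypothetical helper
    not taken from the thread, assuming plain UCS-2 input with no
    surrogates:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Encode one UCS-2 (BMP) code point as UTF-8; writes 1-3 bytes
       into buf and returns the count.  Assumes buf has room for 3. */
    static size_t ucs2_to_utf8(uint16_t c, unsigned char *buf)
    {
        if (c < 0x80) {                        /* 1 byte: 0xxxxxxx */
            buf[0] = (unsigned char)c;
            return 1;
        }
        if (c < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
            buf[0] = 0xC0 | (c >> 6);
            buf[1] = 0x80 | (c & 0x3F);
            return 2;
        }
        buf[0] = 0xE0 | (c >> 12);             /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        buf[1] = 0x80 | ((c >> 6) & 0x3F);
        buf[2] = 0x80 | (c & 0x3F);
        return 3;
    }
    ```

    The variable output length (one to three bytes per input code unit)
    is exactly what makes buffer sizing harder to predict than with a
    fixed-width 8-bit target.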

    The OP is
    working with a resource-constrained embedded system. I don't know what
    he is doing with the characters after converting them from UCS-2, but it
    is massively simpler to use an 8-bit character set if they are going to
    be used for display on a small system. It also keeps memory management simpler, and that is essential on such systems - one UCS-2 character
    maps to one code unit with Latin-9 here. The space needed for UTF-8 is
    much harder to predict, and the OP will want to avoid any kind of
    malloc() or dynamic allocation where possible.

    You should address that to the other poster. :-)

    Janis


    If the incoming SMS's are just being logged, or passed out in some other
    way, then UTF-8 may be a convenient alternative.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to David Brown on Sat Feb 22 07:15:09 2025
    On 2/22/25 6:29 AM, David Brown wrote:
    On 21/02/2025 23:35, Janis Papanagnou wrote:
    On 21.02.2025 20:40, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    [...]
    BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
    of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
    contains a few other characters like the € (Euro Sign). If that is
    possible for your context you have to map a handful of characters.

    Latin-1 maps exactly to Unicode for the first 256 values.  Latin-9 does
    not, which would make the translation more difficult.

    Yes, that had already been pointed out upthread.

    The (open) question is whether it makes sense to convert to "Latin 1"
    only because it has a one-to-one mapping concerning the first UCS-2
    characters, or if the underlying application of the OP wants support
    of contemporary information by e.g. providing the € (Euro) sign with
    "Latin 9".


    <https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
    the 8 characters that differ between Latin-1 and Latin-9.

    If at all possible, it would be better to convert to UTF-8.  The
    conversion is exact and reversible, and UTF-8 has largely superseded
    the various Latin-* character encodings.

    Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
    while the ISO 8859-X family uses single-octet representations.

    I'm curious why the OP needs ISO8859-1 and can't use UTF-8.

    I think this, or why he can't use "Latin 9", are essential questions.

    It became clear after a subsequent post of the OP; some
    message/data source seems to provide characters from the upper planes
    of Unicode, and the OP has to (or wants to) somehow map them to some
    constant-octet character set. - Yet there's no information provided on
    what Unicode characters - characters that don't have a representation
    in Latin 1 or Latin 9 - the OP will or will not encounter from that source.

    As it sounds, it all seems to make little sense.

    Janis


    As the OP explained in a reply to one of my posts, he is getting data
    in UCS-2 format from SMS's from a modem.  Somewhere along the line,
    either the firmware in the modem or in the code sending the SMS's,
    characters beyond the BMP are being used needlessly.  So it looks like
    his first idea of manually handling a few cases (like code 0x2019) seems
    like the right approach.

    Small nit: not outside the BMP, just outside the ASCII/Latin-1 set. Like
    so many other programs, it did a "pretty" transformation of a simple
    single quotation mark into a fancy version.


    Whether Latin-1 or Latin-9 is better will depend on his application. The additional characters in Latin-9, with the exception of the Euro symbol,
    are pretty obscure - it's unlikely that you'd need them and not need a
    good deal more other characters (i.e., supporting much more of Unicode).

    As for why not use UTF-8, the answer is clearly simplicity.  The OP is working with a resource-constrained embedded system.  I don't know what
    he is doing with the characters after converting them from UCS-2, but it
    is massively simpler to use an 8-bit character set if they are going to
    be used for display on a small system.  It also keeps memory management simpler, and that is essential on such systems - one UCS-2 character
    maps to one code unit with Latin-9 here.  The space needed for UTF-8 is
    much harder to predict, and the OP will want to avoid any kind of
    malloc() or dynamic allocation where possible.

    If the incoming SMS's are just being logged, or passed out in some other
    way, then UTF-8 may be a convenient alternative.




    I would say the big difference is that an 8-bit character set needs to
    store only 256 glyphs for its font. Converting to UTF-8 would still
    require storing some massive font, and deciding exactly how
    massive it needs to be.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Richard Damon on Sat Feb 22 14:12:28 2025
    On 22/02/2025 13:15, Richard Damon wrote:
    On 2/22/25 6:29 AM, David Brown wrote:

    As for why not use UTF-8, the answer is clearly simplicity.  The OP is
    working with a resource-constrained embedded system.  I don't know
    what he is doing with the characters after converting them from UCS-2,
    but it is massively simpler to use an 8-bit character set if they are
    going to be used for display on a small system.  It also keeps memory
    management simpler, and that is essential on such systems - one UCS-2
    character maps to one code unit with Latin-9 here.  The space needed
    for UTF-8 is much harder to predict, and the OP will want to avoid any
    kind of malloc() or dynamic allocation where possible.

    If the incoming SMS's are just being logged, or passed out in some
    other way, then UTF-8 may be a convenient alternative.




    I would say the big difference is that an 8-bit character set needs to
    store only 256 glyphs for its font. Converting to UTF-8 would still
    require storing some massive font, and deciding exactly how
    massive it needs to be.

    Yes, exactly. A key point is what the OP is going to do with the text.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Janis Papanagnou on Sat Feb 22 14:11:11 2025
    On 22/02/2025 13:11, Janis Papanagnou wrote:
    On 22.02.2025 12:29, David Brown wrote:

    As the OP explained in a reply to one of my posts, he is getting data in
    in UCS-2 format from SMS's from a modem. [...]

    (Yes. I wrote: "have got clear after a subsequent post".)


    Whether Latin-1 or Latin-9 is better will depend on his application.

    (Was also my stance upthread; "If that is possible for your context")

    The
    additional characters in Latin-9, with the exception of the Euro symbol,
    are pretty obscure

    ISTR they are some language specific symbols, so probably less obscure
    to someone from those countries.


    The point (as I said below) is that adding these letters (š, ž, œ)
    makes very little difference to anyone because they are not enough to
    let them write their language properly. Sure, someone writing Czech
    might have regular use of the letter ž - but with Latin-9 they can't
    write the letters ť, ř, ď or several other Czech letters. So it
    provides little benefit to most people who have those letters in their
    alphabet. If you want to let people write their languages properly
    (something I strongly support), you need much fuller Unicode support -
    unless you are working specifically with Sami, Finnish or Estonian,
    the only benefit of moving from Latin-1 to Latin-9 is for the Euro
    symbol.


    - it's unlikely that you'd need them and not need a
    good deal more other characters (i.e., supporting much more of Unicode).





    As for why not use UTF-8, the answer is clearly simplicity.

    This was not my point (someone else suggested that).

    <snip>

    You should address that to the other poster. :-)


    I was making a single reply that covered both parts - I know you didn't
    write the bits you quoted from further up-thread.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Keith Thompson on Sat Feb 22 14:18:03 2025
    On 21/02/2025 20:45, Keith Thompson wrote:
    pozz <pozzugno@gmail.com> writes:
    I want to write a simple function that converts UCS2 string into ISO8859-1:
    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
    passing size because ucs2 isn't null terminated.

    Is the UCS-2 really represented as a sequence of ASCII hex digits?

    In actual UCS-2, each character is 2 bytes. The representation for
    "Hello" would be 10 bytes, either "\0H\0e\0l\0l\0o" or
    "H\0e\0l\0l\0o\0", depending on endianness. (UCS-2 is a subset of
    UTF-16; the latter uses longer sequences to represent characters
    outside the Basic Multilingual Plane.)


    My understanding here is that the OP is getting the UCS-2 encoded string
    in from a modem, almost certainly on a serial line. The UCS-2 encoded
    data is itself a binary sequence of 16-bit code units, and the modem
    firmware is sending those as four hex digits. This is a very common way
    to handle transmission of binary data in such systems - there is no need
    for escapes or other complications to delimit the binary data. I would
    expect that the entire incoming message will be comma-separated fields
    with the time and date, sender's telephone number, and so on, as well as
    the text itself as this long hex string.
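    As a sketch of that first decoding step (the helper names here are
    illustrative, not from the thread), the four-hex-digit groups coming
    off the serial line can be turned back into 16-bit code units:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Decode one ASCII hex digit; returns -1 on a non-hex character. */
    static int hexval(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /* Turn a string like "00480065006C006C006F" into an array of UCS-2
       code units.  Returns the number of code units written, or 0 on
       malformed input (odd length or a non-hex character). */
    static size_t hex_to_ucs2(const char *hex, size_t hexlen,
                              uint16_t *out, size_t outmax)
    {
        size_t n = 0;
        if (hexlen % 4 != 0)
            return 0;
        for (size_t i = 0; i < hexlen && n < outmax; i += 4) {
            uint16_t cu = 0;
            for (size_t j = 0; j < 4; j++) {
                int v = hexval(hex[i + j]);
                if (v < 0)
                    return 0;
                cu = (uint16_t)((cu << 4) | (uint16_t)v);
            }
            out[n++] = cu;
        }
        return n;
    }
    ```

    No dynamic allocation is needed: the caller supplies a fixed buffer,
    which fits the constrained-embedded context discussed above.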

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Richard Damon on Sat Feb 22 16:43:55 2025
    On 22.02.2025 13:15, Richard Damon wrote:

    [...] It did like so many other programs a "pretty" transformation of
    a simple single quotation mark, to a fancy version.

    Good to put the "pretty" in quotes; I've seen so many "fancy versions",
    one worse than the other. They are culture specific and on a terminal
    they often look bad even in their native target form. For example “--”
    in a man page (say, 'man awk') has a left and right slant respectively
    and they are linear, but my newsreader shows them both in the same
    direction but the one thicker at the bottom the other at the top. It's
    similar with single quotes; here we often see accents used at one side
    and a regular single quote at the other side. In 'man man' for example
    we find even a comment on that in the description of option '--ascii'.
    There's *tons* of such quoting characters for the various languages,
    in my mother tongue there's even _more than one_ type used in printed
    media. Single or double and left or right and bottom or top or mixed
    or double or single angle brackets in opening and closing form, plus
    the *misused* accent characters (which look worst, IMO, especially if
    combined inconsistently with other forms).

    I'm glad that in programming there's a bias on symmetric use of the
    neutral forms " and ' (for strings and characters and other quoting)
    and that things like accents ` and ´ *seem* to gradually vanish for
    quoting purposes; e.g. shell `...` long superseded by $(...). Only
    document contents occasionally still adhere to trashy use.

    One thing I'd really like to understand is why folks have been mixing
    accents with quotes, as in ``standard'' (also taken from 'man awk').

    They may look acceptable in one display or printing device but become
    a typographical catastrophe when viewed on another device type.

    </rant>

    Janis
    --

    0022;QUOTATION MARK
    00AB;LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    00BB;RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    2018;LEFT SINGLE QUOTATION MARK
    2019;RIGHT SINGLE QUOTATION MARK
    201A;SINGLE LOW-9 QUOTATION MARK
    201B;SINGLE HIGH-REVERSED-9 QUOTATION MARK
    201C;LEFT DOUBLE QUOTATION MARK
    201D;RIGHT DOUBLE QUOTATION MARK
    201E;DOUBLE LOW-9 QUOTATION MARK
    201F;DOUBLE HIGH-REVERSED-9 QUOTATION MARK
    2039;SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    203A;SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    275B;HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
    275C;HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
    275D;HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
    275E;HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
    275F;HEAVY LOW SINGLE COMMA QUOTATION MARK ORNAMENT
    2760;HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
    276E;HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT
    276F;HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT
    2E42;DOUBLE LOW-REVERSED-9 QUOTATION MARK
    301D;REVERSED DOUBLE PRIME QUOTATION MARK
    301E;DOUBLE PRIME QUOTATION MARK
    301F;LOW DOUBLE PRIME QUOTATION MARK
    FF02;FULLWIDTH QUOTATION MARK
    1F676;SANS-SERIF HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
    1F677;SANS-SERIF HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
    1F678;SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
    E0022;TAG QUOTATION MARK

    0027;APOSTROPHE
    02BC;MODIFIER LETTER APOSTROPHE
    02EE;MODIFIER LETTER DOUBLE APOSTROPHE
    055A;ARMENIAN APOSTROPHE
    FF07;FULLWIDTH APOSTROPHE
    E0027;TAG APOSTROPHE

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Sat Feb 22 21:23:42 2025
    On Sat, 22 Feb 2025 13:11:34 +0100, Janis Papanagnou wrote:

    UTF-8 is an _encoding_ (as I wrote), as opposed to a direct
    representation of a fixed width character (either 8 bit width ISO 8859-X
    or 16 bit with UCS-2). Conversions to/from UTF-8 are not as
    straightforward as fixed width character representations are.

    Unicode is not, and never has been, a fixed-width character set.

    UCS-2 was a fixed-width set of code points. Even that idea has been
    abandoned.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Sat Feb 22 21:22:20 2025
    On Sat, 22 Feb 2025 09:11:02 +0100, Janis Papanagnou wrote:

    On 22.02.2025 07:13, Lawrence D'Oliveiro wrote:

    On Sat, 22 Feb 2025 05:29:14 +0100, Janis Papanagnou wrote:

    On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:

    They say not to try to parse that file automatically, but I’ve had
    some success doing exactly that ... so far ...

    I wonder why they say so, given that there's a syntax description
    available on their pages (see the respective HTML file[*]).

    The file itself says different
    <https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt>:

    This file is semi-automatically derived from UnicodeData.txt and a
    set of manually created annotations using a script to select or
    suppress information from the data file. The rules used for this
    process are aimed at readability for the human reader, at the
    expense of some details; therefore, this file should not be parsed
    for machine-readable information.

    I see, but I certainly wouldn't refrain from parsing it.

    Particularly since the information on related code points doesn’t seem to
    be available anywhere else.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Richard Damon on Sat Feb 22 21:24:28 2025
    On Sat, 22 Feb 2025 07:15:09 -0500, Richard Damon wrote:

    I would say the big difference is that an 8-bit character set needs to
    only store 256 glyphs for its font.

    Note that glyphs are not characters.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Lawrence D'Oliveiro on Sun Feb 23 00:02:32 2025
    On 2025-02-22, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sat, 22 Feb 2025 07:15:09 -0500, Richard Damon wrote:

    I would say the big difference is that an 8-bit character set needs to
    only store 256 glyphs for its font.

    Note that glyphs are not characters.

    Unemployable shithead, note that Richard said "font". A font does assign
    glyphs to abstract characters. The sentence is not the most precise we
    can imagine, since character sets are not containers that store, but
    it's not important here.

    Gee, what are the odds you would fuck up an attempt to nit-pick someone
    ten times your brain size?

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Richard Damon on Sat Feb 22 23:44:49 2025
    Richard Damon <richard@damon-family.org> wrote:
    On 2/22/25 6:29 AM, David Brown wrote:
    On 21/02/2025 23:35, Janis Papanagnou wrote:
    On 21.02.2025 20:40, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    [...]
    BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
    of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
    contains a few other characters like the € (Euro Sign). If that is
    possible for your context you have to map a handful of characters.

    Latin-1 maps exactly to Unicode for the first 256 values.  Latin-9 does
    not, which would make the translation more difficult.

    Yes, that had already been pointed out upthread.

    The (open) question is whether it makes sense to convert to "Latin 1"
    only because it has a one-to-one mapping concerning the first UCS-2
    characters, or if the underlying application of the OP wants support
    of contemporary information by e.g. providing the € (Euro) sign with
    "Latin 9".


    <https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
    the 8 characters that differ between Latin-1 and Latin-9.

    If at all possible, it would be better to convert to UTF-8.  The
    conversion is exact and reversible, and UTF-8 has largely superseded the
    various Latin-* character encodings.

    Well, UTF-8 is an multi-octet _encoding_ for all Unicode characters,
    while the ISO 8859-X family represents single octet representations.

    I'm curious why the OP needs ISO8859-1 and can't use UTF-8.

    I think this, or why he can't use "Latin 9", are essential questions.

    It seems to have got clear after a subsequent post of the OP; some
    message/data source seems to provide characters from the upper planes
    of Unicode and the OP has to (or wants to) somehow map them to some
    constant octet character set. - Yet there's no information provided
    what Unicode characters - characters that don't have a representation
    in Latin 1 or Latin 9 - the OP will encounter or not from that source.

    As it sounds it all seems to make little sense.

    Janis


    As the OP explained in a reply to one of my posts, he is getting data in
    in UCS-2 format from SMS's from a modem.  Somewhere along the line,
    either the firmware in the modem or in the code sending the SMS's,
    characters beyond the BMP are being used needlessly.  So it looks like
    his first idea of manually handling a few cases (like code 0x2019) seems
    like the right approach.

    Small nit, not outside the BMP, just outside the ASCII/LATIN-1 set. It
    did like so many other programs a "pretty" transformation of a simple
    single quotation mark, to a fancy version.


    Whether Latin-1 or Latin-9 is better will depend on his application. The
    additional characters in Latin-9, with the exception of the Euro symbol,
    are pretty obscure - it's unlikely that you'd need them and not need a
    good deal more other characters (i.e., supporting much more of Unicode).

    As for why not use UTF-8, the answer is clearly simplicity.  The OP is
    working with a resource-constrained embedded system.  I don't know what
    he is doing with the characters after converting them from UCS-2, but it
    is massively simpler to use an 8-bit character set if they are going to
    be used for display on a small system.  It also keeps memory management
    simpler, and that is essential on such systems - one UCS-2 character
    maps to one code unit with Latin-9 here.  The space needed for UTF-8 is
    much harder to predict, and the OP will want to avoid any kind of
    malloc() or dynamic allocation where possible.

    If the incoming SMS's are just being logged, or passed out in some other
    way, then UTF-8 may be a convenient alternative.




    I would say the big difference is that an 8-bit character set needs to
    only store 256 glyphs for its font. Converting to UTF-8 would still
    require storing some massive font, and deciding exactly how massive it
    will be.

    Most European characters are an ASCII letter plus accents, which can be
    stored quite efficiently. Korean requires a handful of basic characters;
    the rest can be synthesised from them.

    Full Unicode certainly requires massive font, but selected subset
    may be possible with modest resources (but probably more than 256
    positions).

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to Janis Papanagnou on Sat Feb 22 22:38:45 2025
    On 2/22/25 10:43 AM, Janis Papanagnou wrote:
    On 22.02.2025 13:15, Richard Damon wrote:

    [...] It did like so many other programs a "pretty" transformation of
    a simple single quotation mark, to a fancy version.

    Good to put the "pretty" in quotes; I've seen so many "fancy versions",
    one worse than the other. They are culture specific and on a terminal
    they often look bad even in their native target form. For example “--”
    in a man page (say, 'man awk') has a left and right slant respectively
    and they are linear, but my newsreader shows them both in the same
    direction but the one thicker at the bottom the other at the top. It's
    similar with single quotes; here we often see accents used at one side
    and a regular single quote at the other side. In 'man man' for example
    we find even a comment on that in the description of option '--ascii'.
    There's *tons* of such quoting characters for the various languages,
    in my mother tongue there's even _more than one_ type used in printed
    media. Single or double and left or right and bottom or top or mixed
    or double or single angle brackets in opening and closing form, plus
    the *misused* accent characters (which look worst, IMO, especially if
    combined inconsistently with other forms).

    I'm glad that in programming there's a bias on symmetric use of the
    neutral forms " and ' (for strings and characters and other quoting)
    and that things like accents ` and ´ *seem* to gradually vanish for
    quoting purposes; e.g. shell `...` long superseded by $(...). Only
    document contents occasionally still adhere to trashy use.

    One thing I'd really like to understand is why folks have been mixing
    accents with quotes, as in ``standard'' (also taken from 'man awk').

    They may look acceptable in one display or printing device but become
    a typographical catastrophe when viewed on another device type.

    </rant>

    Janis

    I have more often seen not the "accents" but the curly quotes (one and
    closed) that look more like elevated commas flipped around.

    When used to "escape" a character (or quote a string as an extended
    escape), people come up with all sorts of ideas, and there sometimes the
    strange characters were chosen to minimize the need for ways to escape
    the escape character.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Janis Papanagnou on Sun Feb 23 00:01:37 2025
    On 2/21/25 23:29, Janis Papanagnou wrote:
    ...
    BTW; curious about that [informal] part of the syntax description

    LF: <any sequence of a single ASCII 0A or 0D, or both>

    It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
    Is the latter of any practical relevance?
    According to <https://en.wikipedia.org/wiki/Newline#Representation>,
    LF-CR is used by "Acorn BBC and RISC OS spooled text output". I presume
    you would not consider that to be of any practical importance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Sun Feb 23 05:53:59 2025
    On Sat, 22 Feb 2025 05:29:14 +0100, Janis Papanagnou wrote:

    It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
    Is the latter of any practical relevance?

    Not to answer the question, but just to add to it; from the XML 1.1 spec
    <https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-xml11>:

    In addition, XML 1.0 attempts to adapt to the line-end conventions
    of various modern operating systems, but discriminates against the
    conventions used on IBM and IBM-compatible mainframes. As a
    result, XML documents on mainframes are not plain text files
    according to the local conventions. XML 1.0 documents generated on
    mainframes must either violate the local line-end conventions, or
    employ otherwise unnecessary translation phases before parsing and
    after generation. Allowing straightforward interoperability is
    particularly important when data stores are shared between
    mainframe and non-mainframe systems (as opposed to being copied
    from one to the other). Therefore XML 1.1 adds NEL (#x85) to the
    list of line-end characters. For completeness, the Unicode line
    separator character, #x2028, is also supported.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Janis Papanagnou on Sun Feb 23 07:03:04 2025
    On 2025-02-22, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    LF: <any sequence of a single ASCII 0A or 0D, or both>

    It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
    Is the latter of any practical relevance?

    Because if Unicode people spot the slightest opportunity to add
    pointless complexity to anything, they tend to pounce on it.

    Why just specify one line ending convention, when you can require the
    processor of the file to watch out for four different tokens denoting
    the line break?

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Mon Feb 24 08:31:21 2025
    On 22.02.2025 22:23, Lawrence D'Oliveiro wrote:
    On Sat, 22 Feb 2025 13:11:34 +0100, Janis Papanagnou wrote:

    UTF-8 is an _encoding_ (as I wrote), as opposed to a direct
    representation of a fixed width character (either 8 bit width ISO 8859-X
    or 16 bit with UCS-2). Conversions to/from UTF-8 are not as
    straightforward as fixed width character representations are.

    Unicode is not, and never has been, a fixed-width character set.

    I was speaking about the "UTF-8 _encoding_" of Unicode.

    (Not sure what you consider the term "Unicode" implies.)

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Mon Feb 24 08:27:19 2025
    On 23.02.2025 08:03, Kaz Kylheku wrote:
    On 2025-02-22, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    LF: <any sequence of a single ASCII 0A or 0D, or both>

    It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
    Is the latter of any practical relevance?

    Because if Unicode people spot the slightest opportunity to add
    pointless complexity to anything, they tend to pounce on it.

    Given what's all collected in Unicode they've long passed the line
    where one more or less character would matter. ;-)

    That said; I anyway think it's good to have one standard instead of
    hundreds of individual specific character sets and "codepage" variants.


    Why just specify one line ending convention, when you can require the
    processor of the file to watch out for four different tokens denoting
    the line break?

    Well, the history is (partly) understandable. Doesn't that stem from
    early IT days where printers and their components got controlled by
    atomic commands; CR, LF, BS [*]. Sending such a text file with CR LF
    to the printer would perform the necessary printer raw commands.[**]

    I think at some early point in history they should have differentiated
    and standardized the file ending to use a single character.

    Is it now too late given that even some RFC protocol standards specify
    CR LF as ending sequence?

    Janis

    [*] I recall a mainframe terminal that echoed the password by sequences
    of <PW-char> BS 'X' BS 'O' etc. to keep it "unreadable". Of course
    you could see the PW by manually turning the "drum" during the print
    process.

    [**] OTOH I recall there was another control method, where the first
    character of a line determined the printer control; sending a file to
    the raw printer produces quite a mess.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Mon Feb 24 20:10:17 2025
    On Mon, 24 Feb 2025 08:27:19 +0100, Janis Papanagnou wrote:

    Given what's all collected in Unicode they've long passed the line where
    one more or less character would matter. ;-)

    Unicode isn’t a collection of characters as such, it’s a collection of
    code points. And yes, every code point is there for a reason.

    Remember the proposal to add code points for country icons like 🇳🇿, 🇺🇳 and
    🇪🇺 got rejected? Remember what they did instead?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to pozz on Mon Feb 24 20:13:00 2025
    On Mon, 24 Feb 2025 16:57:24 +0100, pozz wrote:

    Il 22/02/2025 14:18, David Brown ha scritto:

    My understanding here is that the OP is getting the UCS-2 encoded
    string in from a modem, almost certainly on a serial line.  The UCS-2
    encoded data is itself a binary sequence of 16-bit code units, and the
    modem firmware is sending those as four hex digits.

    Exactly. This is the reply to AT+CMGR command that is standardized in
    3GPP TS 27.005.

    Anything that is specifying the use of UCS-2 encoding automatically
    dates itself to about the early-to-mid 1990s.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Mon Feb 24 20:11:29 2025
    On Mon, 24 Feb 2025 08:31:21 +0100, Janis Papanagnou wrote:

    On 22.02.2025 22:23, Lawrence D'Oliveiro wrote:

    On Sat, 22 Feb 2025 13:11:34 +0100, Janis Papanagnou wrote:

    UTF-8 is an _encoding_ (as I wrote), as opposed to a direct
    representation of a fixed width character (either 8 bit width ISO
    8859-X or 16 bit with UCS-2). Conversions to/from UTF-8 are not as
    straightforward as fixed width character representations are.

    Unicode is not, and never has been, a fixed-width character set.

    I was speaking about the "UTF-8 _encoding_" of Unicode.

    You *did* say that UCS-2 was “a direct representation of a fixed width
    character”, did you not? It’s in your posting quoted above.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Harnden@21:1/5 to pozz on Tue Feb 25 10:24:34 2025
    On 21/02/2025 11:40, pozz wrote:
    I want to write a simple function that converts UCS2 string into ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing
    size because ucs2 isn't null terminated.

    I know I can use iconv() feature, but I'm on an embedded platform
    without an OS and without iconv() function.

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from
    unsigned int to char.

    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial
    again. But I saw the code "2019" (apostrophe) that can be rendered as
    0x27 in ISO8859-1.

    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
      *dst++ = (char)code;
    } else {
      switch (code) {
        case 0x2019: *dst++ = 0x27; break;  // Apostrophe
        case 0x...: *dst++ = ...; break;
        default: *dst++ = ' ';
      }
    }

    I'm not searching a very detailed and correct mapping, but just a
    "sufficient" implementation.
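    One hedged way to flesh out that sketch (the particular set of cases
    below is illustrative only; it is not a complete or authoritative
    mapping, just the kind of "sufficient" table being asked about):

    ```c
    #include <stdint.h>

    /* Map one UCS-2 code unit to an ISO8859-1 byte.  Code points below
       0x100 map 1:1, since Latin-1 matches the first 256 Unicode code
       points; a short switch approximates a few common "fancy"
       punctuation characters; everything else becomes '?'. */
    static char ucs2_to_latin1(uint16_t code)
    {
        if (code < 0x100)
            return (char)code;
        switch (code) {
        case 0x2018:                  /* left single quotation mark   */
        case 0x2019: return '\'';     /* right single quotation mark  */
        case 0x201C:                  /* left double quotation mark   */
        case 0x201D: return '"';      /* right double quotation mark  */
        case 0x2013:                  /* en dash                      */
        case 0x2014: return '-';      /* em dash                      */
        case 0x2026: return '.';      /* horizontal ellipsis (rough)  */
        default:     return '?';      /* no sensible approximation    */
        }
    }
    ```

    The replacement character and the choice of approximations are design
    decisions, as Richard Damon pointed out earlier in the thread.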



    Can you use iconv to help build your switch statement?

    For each UCS2 char, if 'iconv -f ucs2 -t iso8859-1//translit' doesn't
    complain, then you can build up your case statements.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to pozz on Tue Feb 25 07:18:28 2025
    On 2/25/25 2:35 AM, pozz wrote:
    Il 24/02/2025 21:13, Lawrence D'Oliveiro ha scritto:
    On Mon, 24 Feb 2025 16:57:24 +0100, pozz wrote:

    Il 22/02/2025 14:18, David Brown ha scritto:

    My understanding here is that the OP is getting the UCS-2 encoded
    string in from a modem, almost certainly on a serial line.  The UCS-2
    encoded data is itself a binary sequence of 16-bit code units, and the
    modem firmware is sending those as four hex digits.

    Exactly. This is the reply to AT+CMGR command that is standardized in
    3GPP TS 27.005.

    Anything that is specifying the use of UCS-2 encoding automatically dates
    itself to about the early-to-mid 1990s.

    Sincerely I don't know why and when, but the LTE modem I'm using
    (Simcom A7672E) replies to AT+CMGR in two different formats:

    - what is described as GSM 7-bit alphabet (but it's really UTF-8 when
    non-ASCII chars are present)

    - UCS2

    Of course, in the header, it specifies the <dcs> (data coding scheme) so
    the receiver on the UART can interpret correctly all the data.


    Are you sure it is UCS2 and not UTF-16?

    Can it not handle characters not in the BMP?

    The difference between UCS2 and UTF-16 is that UCS2 is the character set
    that predates the surrogate-pairs added to extend it. It is very much
    the equivalent relationship of ASCII to UTF-8.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to pozz on Tue Feb 25 17:16:23 2025
    On 25/02/2025 15:53, pozz wrote:
    Il 25/02/2025 13:18, Richard Damon ha scritto:
    On 2/25/25 2:35 AM, pozz wrote:
    Il 24/02/2025 21:13, Lawrence D'Oliveiro ha scritto:
    On Mon, 24 Feb 2025 16:57:24 +0100, pozz wrote:

    Il 22/02/2025 14:18, David Brown ha scritto:

    My understanding here is that the OP is getting the UCS-2 encoded
    string in from a modem, almost certainly on a serial line.  The UCS-2
    encoded data is itself a binary sequence of 16-bit code units, and the
    modem firmware is sending those as four hex digits.

    Exactly. This is the reply to AT+CMGR command that is standardized in
    3GPP TS 27.005.

    Anything that is specifying the use of UCS-2 encoding automatically
    dates itself to about the early-to-mid 1990s.

    Sincerely I don't know why and when, but the LTE modem I'm using
    (Simcom A7672E) replies to AT+CMGR in two different formats:

    - what is described as GSM 7-bit alphabet (but it's really UTF-8 when
    non-ASCII chars are present)

    - UCS2

    Of course, in the header, it specifies the <dcs> (data coding scheme)
    so the receiver on the UART can interpret correctly all the data.


    Are you sure it is UCS2 and not UTF-16?

    Can it not handle characters not in the BMP?

    The difference between UCS2 and UTF-16 is that UCS2 is the character
    set that predates the surrogate-pairs added to extend it. It is very
    much the equivalent relationship of ASCII to UTF-8.

    Sincerely I don't know, the standard says UCS2

    The standard used by modems here is UCS2, not UTF-16. As you point out,
    this was all standardised in the early 1990's (before UTF-16) - as a
    standardisation of things that had already been used before that. And
    once a telecom standard is made, it is set in stone and never changed.
    Unlike for some things that adopted Unicode early using UCS2 (like
    Windows NT, Java, Qt, Python) the UCS2 use in established modem standard
    commands (like AT+CMGR) could not be, and was not, extended to UTF-16.
    There might be other AT commands supported by some modems that /do/
    support UTF-8 or UTF-16, but existing standardised commands don't change.

    For all Unicode code points supported by UCS2, the coding is the same as
    for UTF-16 (as Richard says, it's like the ASCII subset of UTF-8). So
    you can always treat UCS2 as UTF-16. Unicode characters outside this
    set simply have no representation in UCS2.
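    A minimal sketch of that "treat UCS2 as UTF-16" advice (function name
    and the '?' replacement policy are this sketch's assumptions, not from
    the thread): a surrogate pair, which strict UCS-2 never produces, is
    consumed as one unit rather than misread as two characters.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Walk 16-bit code units as UTF-16 and emit Latin-1 bytes.  For BMP
       code units this behaves identically to UCS-2; anything without a
       Latin-1 representation becomes '?'.  Returns bytes written. */
    static size_t utf16_units_to_latin1(const uint16_t *cu, size_t n,
                                        char *dst, size_t dstmax)
    {
        size_t out = 0;
        for (size_t i = 0; i < n && out < dstmax; i++) {
            uint16_t c = cu[i];
            if (c >= 0xD800 && c <= 0xDBFF) {          /* high surrogate */
                if (i + 1 < n &&
                    cu[i + 1] >= 0xDC00 && cu[i + 1] <= 0xDFFF)
                    i++;                               /* skip low half  */
                dst[out++] = '?';        /* outside the BMP: no mapping  */
            } else if (c < 0x100) {
                dst[out++] = (char)c;    /* Latin-1 matches these 1:1    */
            } else {
                dst[out++] = '?';
            }
        }
        return out;
    }
    ```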

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to pozz on Tue Feb 25 20:31:27 2025
    On Tue, 25 Feb 2025 15:53:23 +0100, pozz wrote:

    ... the standard says UCS2

    Does it mention anything about the surrogate ranges (0xD800 .. 0xDFFF)?
    In order for it to be strict UCS-2, they would have to be forbidden. If
    they are allowed, then that makes it UTF-16.
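    The check for those ranges is cheap; a sketch (the function name is
    made up for illustration):

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Code units in 0xD800..0xDFFF are reserved for UTF-16 surrogates;
       a stream that is strictly UCS-2 should never contain them.  The
       whole range shares the top five bits 11011, so one mask suffices. */
    static bool is_surrogate(uint16_t cu)
    {
        return (cu & 0xF800) == 0xD800;
    }
    ```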

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to pozz on Tue Feb 25 20:29:40 2025
    On Tue, 25 Feb 2025 08:32:46 +0100, pozz wrote:

    Il 24/02/2025 21:13, Lawrence D'Oliveiro ha scritto:

    Anything that is specifying the use of UCS-2 encoding automatically
    dates itself to about the early-to-mid 1990s.

    Here[1] the first version of this document is dated back to 1999, but
    UCS2 remains and is implemented in currently modem on the market.

    [1] https://www.3gpp.org/ftp/Specs/archive/27_series/27.005/

    Why would they still be using UCS-2, not UTF-16?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to pozz on Wed Feb 26 03:16:50 2025
    On Tue, 25 Feb 2025 15:53:23 +0100, pozz wrote:

    Sincerely I don't know, the standard says UCS2

    Remembered that the specs are online
    <https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1514>.
    They don’t actually say what “data coding schemes” are allowed; the
    only mention of UCS-2 is

    if <dcs> indicates that 8-bit or UCS2 data coding scheme is used,
    or <fo> indicates that 3GPP TS 23.040 [3]
    TP-User-Data-Header-Indication is set: ME/TA converts each 8-bit
    octet into two IRA character long hexadecimal number (e.g. octet
    with integer value 42 is presented to TE as two characters 2A (IRA
    50 and 65))

    So it doesn’t say that UTF-16 is or isn’t allowed.
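    The ME/TA presentation the spec describes (each octet becomes two IRA,
    i.e. ASCII, hexadecimal characters, so octet 42 appears as "2A") is
    straightforward to sketch; the function name is invented here:

    ```c
    #include <stdint.h>

    /* Present one octet as two IRA hex characters, e.g. 42 -> "2A",
       matching the spec's example (IRA 50 is '2', IRA 65 is 'A'). */
    static void octet_to_ira_hex(uint8_t octet, char out[2]) {
        static const char digits[] = "0123456789ABCDEF";
        out[0] = digits[octet >> 4];
        out[1] = digits[octet & 0x0F];
    }
    ```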

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to Lawrence D'Oliveiro on Tue Feb 25 23:21:21 2025
    On 2/25/25 3:31 PM, Lawrence D'Oliveiro wrote:
    On Tue, 25 Feb 2025 15:53:23 +0100, pozz wrote:

    ... the standard says UCS2

    Does it mention anything about the surrogates ranges (0xD800 .. 0xDFFF)?
    In order for it to be strict UCS-2, they would have to be forbidden. If
    they are allowed, then that makes it UTF-16.

    To my knowledge, UCS-2 doesn't say those codes are "forbidden", just
    that they are not defined codes.

    UCS-2 basically became a legacy code when they needed to expand unicode
    to more than 16 bits. Systems defined to use it basically just treat
    UTF-16 surrogate pairs as two characters they don't know what they mean,
    just like a lot of programs can treat UTF-8 as "ASCII" with some codes
    it doesn't know what they mean.

    The ignorance is bliss method works well for a number of tasks, you just
    need to only alter strings at points you "understand", and not need to
    actualy count characters (which actualy becomes hard to do totally right
    in Unicode anyway).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Keith Thompson on Wed Feb 26 09:57:21 2025
    On 26/02/2025 04:37, Keith Thompson wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 25 Feb 2025 15:53:23 +0100, pozz wrote:

    Sincerely I don't know, the standard says UCS2

    Remembered that the specs are online
    <https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1514>.
    They don’t actually say what “data coding schemes” are allowed; the
    only mention of UCS-2 is

    if <dcs> indicates that 8-bit or UCS2 data coding scheme is used,
    or <fo> indicates that 3GPP TS 23.040 [3]
    TP-User-Data-Header-Indication is set: ME/TA converts each 8-bit
    octet into two IRA character long hexadecimal number (e.g. octet
    with integer value 42 is presented to TE as two characters 2A (IRA
    50 and 65))

    So it doesn’t say that UTF-16 is or isn’t allowed.

    It doesn't say that EBCDIC or UTF-7 is or isn't allowed.


    There are two specifications at work here. One is the 3G standards
    about coding schemes used for SMS data, and the other is the common AT
    command set. The latter is, I think, mostly a de-facto standard -
    modem manufacturers have tried to keep a common subset (along with
    their own device-specific commands). 3G may allow for 8-bit encoding
    sets without specifying them in detail, but the modem commands are
    more specific - the ones used by the OP are strictly UCS-2.

    (It is a /long/ time since I have read details of these things, so I
    might be misremembering.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Geoff@21:1/5 to pozz on Sat Mar 1 09:31:55 2025
    On Fri, 21 Feb 2025 12:40:06 +0100, pozz <pozzugno@gmail.com> wrote:

    I want to write a simple function that converts UCS2 string into ISO8859-1:

    void ucs2_to_iso8859p1(char *ucs2, size_t size);

    ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing
    size because ucs2 isn't null terminated.

    I know I can use iconv() feature, but I'm on an embedded platform
    without an OS and without iconv() function.

    It is trivial to convert "0000"-"007F" chars: it's a simple cast from
    unsigned int to char.

    It isn't so simple to convert higher codes. For example, the small e
    with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial
    again. But I saw the code "2019" (apostrophe) that can be rendered as
    0x27 in ISO8859-1.

    Is there a simplified mapping table that can be written with if/switch?

    if (code < 0x80) {
      *dst++ = (char)code;
    } else {
      switch (code) {
        case 0x2019: *dst++ = 0x27; break;  // Apostrophe
        case 0x...: *dst++ = ...; break;
        default: *dst++ = ' ';
      }
    }

    I'm not searching a very detailed and correct mapping, but just a
    "sufficient" implementation.


    #include <stdint.h>
    #include <stddef.h>

    // Function to convert UCS2 to ISO8859-1
    // (iso88591 must point to a buffer of at least length + 1 bytes)
    void UCS2ToISO88591(const uint16_t* ucs2, size_t length, char* iso88591) {
        for (size_t i = 0; i < length; ++i) {
            uint16_t ucs2_char = ucs2[i];
            if (ucs2_char <= 0x00FF) {
                iso88591[i] = (char)ucs2_char;
            } else {
                // Handle characters that cannot be represented in ISO8859-1
                iso88591[i] = '?'; // Replace with a placeholder character
            }
        }
        // Null-terminate the ISO8859-1 string
        iso88591[length] = '\0';
    }
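    The OP also asked for a "sufficient" mapping of a few common code points
    above U+00FF rather than a blanket '?'. A minimal variant combining the
    conversion above with a small fallback table; the table entries are an
    editorial suggestion, lossy by design, and can be extended as needed:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Fallback approximations for a few code points above U+00FF. */
    static char ucs2_fallback(uint16_t u) {
        switch (u) {
        case 0x2018: case 0x2019: return '\'';  /* curly single quotes */
        case 0x201C: case 0x201D: return '"';   /* curly double quotes */
        case 0x2013: case 0x2014: return '-';   /* en/em dash */
        default:                  return '?';   /* no approximation */
        }
    }

    /* Convert UCS2 units to ISO8859-1 with approximate fallbacks.
       'out' must have room for n + 1 bytes (terminating null). */
    static void ucs2_to_iso88591(const uint16_t *ucs2, size_t n, char *out) {
        for (size_t i = 0; i < n; ++i)
            out[i] = (ucs2[i] <= 0x00FF) ? (char)ucs2[i]
                                         : ucs2_fallback(ucs2[i]);
        out[n] = '\0';
    }
    ```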

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)