I want to write a simple function that converts a UCS2 string into
ISO8859-1:

void ucs2_to_iso8859p1(char *ucs2, size_t size);

ucs2 is a string of the form "00480065006C006C006F" for "Hello". I'm
passing size because ucs2 isn't null terminated.

I know I could use iconv(), but I'm on an embedded platform without an
OS and without the iconv() function.

It is trivial to convert the "0000"-"007F" characters: it's a simple
cast from unsigned int to char.

It isn't so simple to convert higher codes. For example, the small e
with grave, "00E8", can be converted to 0xE8 in ISO8859-1, so it's
trivial again. But I saw the code "2019" (apostrophe) that can be
rendered as 0x27 in ISO8859-1.

Is there a simplified mapping table that can be written with if/switch?

if (code < 0x80) {
    *dst++ = (char)code;
} else {
    switch (code) {
    case 0x2019: *dst++ = 0x27; break; // Apostrophe
    case 0x...: *dst++ = ...; break;
    default: *dst++ = ' '; break;
    }
}

I'm not looking for a very detailed and correct mapping, just a
"sufficient" implementation.
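[A minimal sketch of such a function, assuming the input really is
ASCII hex quadruplets as shown, that the result may be written back
over the input buffer (four hex characters always shrink to at most
one output byte), and with a deliberately tiny fallback table; the
hexval() helper is made up for the sketch:

#include <stddef.h>

/* Hypothetical helper: value of one ASCII hex digit, -1 if not hex. */
static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

void ucs2_to_iso8859p1(char *ucs2, size_t size)
{
    char *dst = ucs2;  /* output is at most size/4 bytes: in-place is safe */

    for (size_t i = 0; i + 4 <= size; i += 4) {
        unsigned code = 0;
        for (size_t j = 0; j < 4; j++) {
            int v = hexval(ucs2[i + j]);
            if (v < 0) return;                 /* malformed input: give up */
            code = (code << 4) | (unsigned)v;
        }

        if (code < 0x100) {                    /* U+0000..U+00FF map 1:1 */
            *dst++ = (char)code;
        } else {
            switch (code) {
            case 0x2018:                       /* left single quotation mark */
            case 0x2019: *dst++ = '\''; break; /* right single quotation mark */
            case 0x201C:                       /* left double quotation mark */
            case 0x201D: *dst++ = '"';  break; /* right double quotation mark */
            case 0x2013:                       /* en dash */
            case 0x2014: *dst++ = '-';  break; /* em dash */
            default:     *dst++ = ' ';  break; /* no approximation known */
            }
        }
    }
    if (size >= 4)
        *dst = '\0';   /* optional: terminate the (now shorter) result */
}

Whether to NUL-terminate the result, and what to emit for unknown codes
(' ', '?', or nothing), are policy choices.]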
On 21.02.2025 13:42, pozz wrote:
On 21/02/2025 13:05, Richard Damon wrote:
On 2/21/25 6:40 AM, pozz wrote:
[...] But I saw the code "2019" (apostrophe) that can be
rendered as 0x27 in ISO8859-1.
To be correct, u2019 isn't 0x27, it's just a character that looks a
lot like it.
Yes, but as a first approximation, 0x27 is much better than '?' for u2019.
Note that there are _standard names_ assigned to the characters.
These are normative for what the characters represent. - I strongly
suggest not twisting these standards by assigning different
characters; you will do no one a favor but only inflict confusion
and harm.
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
[...]
BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible for your context you have to map a handful of characters.
Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
not, which would make the translation more difficult.
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
the 8 characters that differ between Latin-1 and Latin-9.
If at all possible, it would be better to convert to UTF-8. The
conversion is exact and reversible, and UTF-8 has largely superseded the various Latin-* character encodings.
I'm curious why the OP needs ISO8859-1 and can't use UTF-8.
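[Given those eight differing positions, a Unicode-to-Latin-9 mapper
costs little more than the Latin-1 identity mapping. A sketch; the
code points below are the standard eight, but verify them against the
cited table rather than trusting this sketch. Returning -1 signals
"no mapping":

static int unicode_to_latin9(unsigned code)
{
    switch (code) {
    case 0x20AC: return 0xA4;  /* euro sign             (Latin-1: ¤) */
    case 0x0160: return 0xA6;  /* S with caron          (Latin-1: ¦) */
    case 0x0161: return 0xA8;  /* s with caron          (Latin-1: ¨) */
    case 0x017D: return 0xB4;  /* Z with caron          (Latin-1: ´) */
    case 0x017E: return 0xB8;  /* z with caron          (Latin-1: ¸) */
    case 0x0152: return 0xBC;  /* OE ligature           (Latin-1: ¼) */
    case 0x0153: return 0xBD;  /* oe ligature           (Latin-1: ½) */
    case 0x0178: return 0xBE;  /* Y with diaeresis      (Latin-1: ¾) */
    default: break;
    }
    /* The eight displaced Latin-1 characters no longer exist in Latin-9. */
    if (code == 0xA4 || code == 0xA6 || code == 0xA8 || code == 0xB4 ||
        code == 0xB8 || code == 0xBC || code == 0xBD || code == 0xBE)
        return -1;
    if (code < 0x100) return (int)code;  /* everything else maps 1:1 */
    return -1;
}]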
On 21/02/2025 13:05, Richard Damon wrote:
On 2/21/25 6:40 AM, pozz wrote:
I want to write a simple function that converts UCS2 string into
ISO8859-1:
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
passing size because ucs2 isn't null terminated.
Typically UCS2 strings ARE null terminated; it's just that the null
is two bytes long.
Sure, but this isn't an issue here.
I know I can use iconv() feature, but I'm on an embedded platform
without an OS and without iconv() function.
It is trivial to convert "0000"-"007F" chars: it's a simple cast from
unsigned int to char.
Note, I think you will find that it is 0000-00FF that match. (As I
remember, ISO8859-1 was the basis for starting Unicode.)
It isn't so simple to convert higher codes. For example, the small e
with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
trivial again. But I saw the code "2019" (apostrophe) that can be
rendered as 0x27 in ISO8859-1.
To be correct, u2019 isn't 0x27, it's just a character that looks a
lot like it.
Yes, but as a first approximation, 0x27 is much better than '?' for u2019.
Is there a simplified mapping table that can be written with if/switch?
if (code < 0x80) {
    *dst++ = (char)code;
} else {
    switch (code) {
    case 0x2019: *dst++ = 0x27; break; // Apostrophe
    case 0x...: *dst++ = ...; break;
    default: *dst++ = ' '; break;
    }
}
I'm not searching a very detailed and correct mapping, but just a
"sufficient" implementation.
Then you have to decide which are sufficient mappings. No character
above FF *IS* the character below, but some have a close
approximation, so you will need to decide what to map.
Yes, I have to decide, but it is a very big problem (there are thousands
of Unicode symbols that could be approximated by an ISO8859-1 code).
I'm wondering if such an approximation is already implemented somewhere.
For example, what does iconv() do in this case?
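[For reference: glibc's iconv() does this kind of approximation only
on request, via the "//TRANSLIT" suffix on the target encoding name (a
GNU extension); without it, glibc fails with EILSEQ at the first
unmappable character. A hosted-system sketch, assuming big-endian
UCS-2 input:

#include <iconv.h>
#include <stdio.h>

int main(void)
{
    char in[] = { 0x20, 0x19 };               /* U+2019 in UCS-2BE */
    char out[8];
    char *inp = in, *outp = out;
    size_t inleft = sizeof in, outleft = sizeof out;

    iconv_t cd = iconv_open("ISO-8859-1//TRANSLIT", "UCS-2BE");
    if (cd == (iconv_t)-1) return 1;
    iconv(cd, &inp, &inleft, &outp, &outleft);
    printf("%.*s\n", (int)(outp - out), out); /* expected: a plain ' */
    iconv_close(cd);
    return 0;
}

The transliteration tables are part of glibc's locale data, so results
can vary with implementation and locale.]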
On Fri, 21 Feb 2025 13:42:13 +0100, pozz wrote:
Yes, I have to decide, but it is a very big problem (there are thousands
of Unicode symbols that can be approximated to another ISO8859-1 code).
I'm wondering if such an approximation is just implemented somewhere.
If you look at NamesList.txt, you will see, next to each character, references to others that might be similar or related in some way.
They say not to try to parse that file automatically, but I’ve had some success doing exactly that ... so far ...
On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:
They say not to try to parse that file automatically, but I’ve had some
success doing exactly that ... so far ...
I wonder why they say so, given that there's a syntax description
available on their pages (see the respective HTML file[*]).
On Sat, 22 Feb 2025 05:29:14 +0100, Janis Papanagnou wrote:
On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:
They say not to try to parse that file automatically, but I’ve had some
success doing exactly that ... so far ...
I wonder why they say so, given that there's a syntax description
available on their pages (see the respective HTML file[*]).
The file itself says different <https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt>:
This file is semi-automatically derived from UnicodeData.txt and a
set of manually created annotations using a script to select or
suppress information from the data file. The rules used for this
process are aimed at readability for the human reader, at the
expense of some details; therefore, this file should not be parsed
for machine-readable information.
On 21.02.2025 20:40, Keith Thompson wrote:
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
[...]
BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible for your context you have to map a handful of characters.
Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
not, which would make the translation more difficult.
Yes, that had already been pointed out upthread.
The (open) question is whether it makes sense to convert to "Latin 1"
only because it has a one-to-one mapping for the first UCS-2
characters, or whether the OP's underlying application wants to support
contemporary content by e.g. providing the € (Euro) sign with
"Latin 9".
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
the 8 characters that differ between Latin-1 and Latin-9.
If at all possible, it would be better to convert to UTF-8. The
conversion is exact and reversible, and UTF-8 has largely superseded the
various Latin-* character encodings.
Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
while the ISO 8859-X family represents single-octet representations.
I'm curious why the OP needs ISO8859-1 and can't use UTF-8.
I think this, or why he can't use "Latin 9", are essential questions.
It seems to have become clear after a subsequent post of the OP: some
message/data source seems to provide characters from the upper planes
of Unicode, and the OP has to (or wants to) somehow map them to some
constant octet character set. - Yet there's no information about which
Unicode characters - characters that don't have a representation in
Latin 1 or Latin 9 - the OP will encounter from that source.
As it sounds, it all seems to make little sense.
Janis
On 21/02/2025 23:35, Janis Papanagnou wrote:
On 21.02.2025 20:40, Keith Thompson wrote:
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
[...]
BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible for your context you have to map a handful of characters.
Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
not, which would make the translation more difficult.
Yes, that had already been pointed out upthread.
The (open) question is whether it makes sense to convert to "Latin 1"
only because it has a one-to-one mapping concerning the first UCS-2
characters, or if the underlying application of the OP wants support
of contemporary information by e.g. providing the € (Euro) sign with
"Latin 9".
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
the 8 characters that differ between Latin-1 and Latin-9.
If at all possible, it would be better to convert to UTF-8. The
conversion is exact and reversible, and UTF-8 has largely superseded the
various Latin-* character encodings.
Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
while the ISO 8859-X family represents single-octet representations.
I'm curious why the OP needs ISO8859-1 and can't use UTF-8.
I think this, or why he can't use "Latin 9", are essential questions.
It seems to have become clear after a subsequent post of the OP: some
message/data source seems to provide characters from the upper planes
of Unicode and the OP has to (or wants to) somehow map them to some
constant octet character set. - Yet there's no information provided
what Unicode characters - characters that don't have a representation
in Latin 1 or Latin 9 - the OP will encounter or not from that source.
As it sounds it all seems to make little sense.
Janis
As the OP explained in a reply to one of my posts, he is getting data
in UCS-2 format from SMS's from a modem. Somewhere along the line,
either the firmware in the modem or in the code sending the SMS's,
characters beyond the BMP are being used needlessly. So it looks like
his first idea of manually handling a few cases (like code 0x2019) seems
like the right approach.
Whether Latin-1 or Latin-9 is better will depend on his application. The additional characters in Latin-9, with the exception of the Euro symbol,
are pretty obscure - it's unlikely that you'd need them and not need a
good deal more other characters (i.e., supporting much more of Unicode).
As for why not use UTF-8, the answer is clearly simplicity. The OP is working with a resource-constrained embedded system. I don't know what
he is doing with the characters after converting them from UCS-2, but it
is massively simpler to use an 8-bit character set if they are going to
be used for display on a small system. It also keeps memory management simpler, and that is essential on such systems - one UCS-2 character
maps to one code unit with Latin-9 here. The space needed for UTF-8 is
much harder to predict, and the OP will want to avoid any kind of
malloc() or dynamic allocation where possible.
If the incoming SMS's are just being logged, or passed out in some other
way, then UTF-8 may be a convenient alternative.
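[For what it's worth, the UTF-8 generation itself is mechanical for BMP
code points; the cost described above is the variable output length
(one to three bytes per UCS-2 code unit), not the bit manipulation. A
minimal sketch:

/* Encode one BMP code point as UTF-8; buf needs room for 3 bytes.
   Returns the number of bytes written. */
static int ucs2_to_utf8(unsigned code, unsigned char *buf)
{
    if (code < 0x80) {
        buf[0] = (unsigned char)code;
        return 1;
    }
    if (code < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (code >> 6));
        buf[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    buf[0] = (unsigned char)(0xE0 | (code >> 12));
    buf[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (code & 0x3F));
    return 3;
}

So a worst-case output buffer for n UCS-2 code units is 3n bytes, which
is exactly the sizing problem mentioned above.]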
On 22.02.2025 12:29, David Brown wrote:
As the OP explained in a reply to one of my posts, he is getting data in
in UCS-2 format from SMS's from a modem. [...]
(Yes. I wrote: "have got clear after a subsequent post".)
Whether Latin-1 or Latin-9 is better will depend on his application.
(Was also my stance upthread; "If that is possible for your context")
The
additional characters in Latin-9, with the exception of the Euro symbol,
are pretty obscure
ISTR they are some language specific symbols, so probably less obscure
to someone from those countries.
- it's unlikely that you'd need them and not need a
good deal more other characters (i.e., supporting much more of Unicode).
As for why not use UTF-8, the answer is clearly simplicity.
This was not my point (someone else suggested that).
You should address that to the other poster. :-)
pozz <pozzugno@gmail.com> writes:
I want to write a simple function that converts a UCS2 string into ISO8859-1:
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
passing size because ucs2 isn't null terminated.
Is the UCS-2 really represented as a sequence of ASCII hex digits?
In actual UCS-2, each character is 2 bytes. The representation for
"Hello" would be 10 bytes, either "\0H\0e\0l\0l\0o" or
"H\0e\0l\0l\0o\0", depending on endianness. (UCS-2 is a subset of
UTF-16; the latter uses longer sequences to represent characters
outside the Basic Multilingual Plane.)
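[To make the distinction concrete: if the buffer held raw UCS-2 rather
than hex digits, pulling out a code unit is a two-byte join whose order
depends on the stream's endianness. A sketch for the big-endian layout:

#include <stddef.h>

/* i-th code unit of raw big-endian UCS-2 ("\0H\0e..." above).
   For little-endian data, swap the two bytes. */
static unsigned ucs2be_unit(const unsigned char *buf, size_t i)
{
    return ((unsigned)buf[2 * i] << 8) | buf[2 * i + 1];
}]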
On 22.02.2025 07:13, Lawrence D'Oliveiro wrote:
On Sat, 22 Feb 2025 05:29:14 +0100, Janis Papanagnou wrote:
On 22.02.2025 04:00, Lawrence D'Oliveiro wrote:
They say not to try to parse that file automatically, but I’ve had some
success doing exactly that ... so far ...
I wonder why they say so, given that there's a syntax description
available on their pages (see the respective HTML file[*]).
The file itself says different
<https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt>:
This file is semi-automatically derived from UnicodeData.txt and a
set of manually created annotations using a script to select or
suppress information from the data file. The rules used for this
process are aimed at readability for the human reader, at the
expense of some details; therefore, this file should not be parsed
for machine-readable information.
I see, but I certainly wouldn't refrain from parsing it.
On Sat, 22 Feb 2025 07:15:09 -0500, Richard Damon wrote:
I would say the big difference is that an 8-bit character set needs to
store only 256 glyphs for its font.
Note that glyphs are not characters.
On 2/22/25 6:29 AM, David Brown wrote:
On 21/02/2025 23:35, Janis Papanagnou wrote:
On 21.02.2025 20:40, Keith Thompson wrote:
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
[...]
BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible for your context you have to map a handful of characters.
Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
not, which would make the translation more difficult.
Yes, that had already been pointed out upthread.
The (open) question is whether it makes sense to convert to "Latin 1"
only because it has a one-to-one mapping concerning the first UCS-2
characters, or if the underlying application of the OP wants support
of contemporary information by e.g. providing the € (Euro) sign with
"Latin 9".
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
the 8 characters that differ between Latin-1 and Latin-9.
If at all possible, it would be better to convert to UTF-8. The
conversion is exact and reversible, and UTF-8 has largely superseded the
various Latin-* character encodings.
Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
while the ISO 8859-X family represents single-octet representations.
I'm curious why the OP needs ISO8859-1 and can't use UTF-8.
I think this, or why he can't use "Latin 9", are essential questions.
It seems to have become clear after a subsequent post of the OP: some
message/data source seems to provide characters from the upper planes
of Unicode and the OP has to (or wants to) somehow map them to some
constant octet character set. - Yet there's no information provided
what Unicode characters - characters that don't have a representation
in Latin 1 or Latin 9 - the OP will encounter or not from that source.
As it sounds it all seems to make little sense.
Janis
As the OP explained in a reply to one of my posts, he is getting data
in UCS-2 format from SMS's from a modem. Somewhere along the line,
either the firmware in the modem or in the code sending the SMS's,
characters beyond the BMP are being used needlessly. So it looks like
his first idea of manually handling a few cases (like code 0x2019) seems
like the right approach.
Small nit: not outside the BMP, just outside the ASCII/Latin-1 set.
Like so many other programs, it did a "pretty" transformation of a
simple single quotation mark into a fancy version.
Whether Latin-1 or Latin-9 is better will depend on his application. The
additional characters in Latin-9, with the exception of the Euro symbol,
are pretty obscure - it's unlikely that you'd need them and not need a
good deal more other characters (i.e., supporting much more of Unicode).
As for why not use UTF-8, the answer is clearly simplicity. The OP is
working with a resource-constrained embedded system. I don't know what
he is doing with the characters after converting them from UCS-2, but it
is massively simpler to use an 8-bit character set if they are going to
be used for display on a small system. It also keeps memory management
simpler, and that is essential on such systems - one UCS-2 character
maps to one code unit with Latin-9 here. The space needed for UTF-8 is
much harder to predict, and the OP will want to avoid any kind of
malloc() or dynamic allocation where possible.
If the incoming SMS's are just being logged, or passed out in some other
way, then UTF-8 may be a convenient alternative.
I would say the big difference is that an 8-bit character set needs to
store only 256 glyphs for its font. Converting to UTF-8 would still
require storing some massive font, and deciding exactly how massive it
will be.
On 22.02.2025 13:15, Richard Damon wrote:
[...] Like so many other programs, it did a "pretty" transformation
of a simple single quotation mark into a fancy version.
Good to put the "pretty" in quotes; I've seen so many "fancy versions",
one worse than the other. They are culture specific, and on a terminal
they often look bad even in their native target form. For example, “--”
in a man page (say, 'man awk') has a left and a right slant respectively
and they are linear, but my newsreader shows them both in the same
direction, one thicker at the bottom, the other at the top. It's
similar with single quotes; here we often see an accent used on one
side and a regular single quote on the other side. In 'man man', for
example, we even find a comment on that in the description of the
option '--ascii'. There are *tons* of such quoting characters for the
various languages; in my mother tongue there's even _more than one_
type used in printed media. Single or double, left or right, bottom or
top or mixed, or double or single angle brackets in opening and closing
form, plus the *misused* accent characters (which look worst, IMO,
especially if combined inconsistently with other forms).
I'm glad that in programming there's a bias toward symmetric use of the
neutral forms " and ' (for strings and characters and other quoting)
and that things like the accents ` and ´ *seem* to be gradually
vanishing for quoting purposes; e.g. shell `...` was long ago
superseded by $(...). Only document contents occasionally still adhere
to such trashy use.
One thing I'd really like to understand is why folks have been mixing
accents with quotes, as in ``standard'' (also taken from 'man awk').
They may look acceptable in one display or printing device but become
a typographical catastrophe when viewed on another device type.
</rant>
Janis
BTW; curious about that [informal] part of the syntax description:
LF: <any sequence of a single ASCII 0A or 0D, or both>
It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
Is the latter of any practical relevance?
According to <https://en.wikipedia.org/wiki/Newline#Representation>, [...]
On Sat, 22 Feb 2025 13:11:34 +0100, Janis Papanagnou wrote:
UTF-8 is an _encoding_ (as I wrote), as opposed to a direct
representation of a fixed width character (either 8 bit width ISO 8859-X
or 16 bit with UCS-2). Conversions to/from UTF-8 are not as
straightforward as fixed width character representations are.
Unicode is not, and never has been, a fixed-width character set.
On 2025-02-22, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
LF: <any sequence of a single ASCII 0A or 0D, or both>
It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
Is the latter of any practical relevance?
Because if Unicode people spot the slightest opportunity to add
pointless complexity to anything, they tend to pounce on it.
Why just specify one line ending convention, when you can require the processor of the file to watch out for four different tokens denoting
the line break?
Given all that's collected in Unicode, they've long passed the line
where one character more or less would matter. ;-)
On 22/02/2025 14:18, David Brown wrote:
My understanding here is that the OP is getting the UCS-2 encoded
string in from a modem, almost certainly on a serial line. The UCS-2
encoded data is itself a binary sequence of 16-bit code units, and the
modem firmware is sending those as four hex digits.
Exactly. This is the reply to the AT+CMGR command that is standardized
in 3GPP TS 27.005.
On 22.02.2025 22:23, Lawrence D'Oliveiro wrote:
On Sat, 22 Feb 2025 13:11:34 +0100, Janis Papanagnou wrote:
UTF-8 is an _encoding_ (as I wrote), as opposed to a direct
representation of a fixed width character (either 8 bit width ISO
8859-X or 16 bit with UCS-2). Conversions to/from UTF-8 are not as
straightforward as fixed width character representations are.
Unicode is not, and never has been, a fixed-width character set.
I was speaking about the "UTF-8 _encoding_" of Unicode.
On 24/02/2025 21:13, Lawrence D'Oliveiro wrote:
On Mon, 24 Feb 2025 16:57:24 +0100, pozz wrote:
On 22/02/2025 14:18, David Brown wrote:
My understanding here is that the OP is getting the UCS-2 encoded
string in from a modem, almost certainly on a serial line. The UCS-2
encoded data is itself a binary sequence of 16-bit code units, and the
modem firmware is sending those as four hex digits.
Exactly. This is the reply to the AT+CMGR command that is standardized
in 3GPP TS 27.005.
Anything that is specifying the use of UCS-2 encoding automatically
dates itself to about the early-to-mid 1990s.
Honestly, I don't know why and when, but the LTE modem I'm using
(Simcom A7672E) replies to AT+CMGR in two different formats:
- what is described as the GSM 7-bit alphabet (but it's really UTF-8
when non-ASCII chars are present)
- UCS2
Of course, in the header it specifies the <dcs> (data coding scheme)
so the receiver on the UART can interpret all the data correctly.
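[A sketch of the dispatch this implies. The function name is
hypothetical, and comparing <dcs> against the single value 8 (UCS-2 in
text mode, 0 being the GSM 7-bit default) is a simplification; a real
implementation should decode the full data coding scheme per
3GPP TS 23.038:

#include <stddef.h>

void ucs2_to_iso8859p1(char *ucs2, size_t size);  /* sketch upthread */

void handle_cmgr_text(unsigned dcs, char *payload, size_t size)
{
    if (dcs == 8) {
        /* four-hex-digit UCS-2 code units, as discussed above */
        ucs2_to_iso8859p1(payload, size);
    } else {
        /* 7-bit alphabet (or UTF-8 on this modem): use as-is */
    }
}]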
On 25/02/2025 13:18, Richard Damon wrote:
On 2/25/25 2:35 AM, pozz wrote:
On 24/02/2025 21:13, Lawrence D'Oliveiro wrote:
On Mon, 24 Feb 2025 16:57:24 +0100, pozz wrote:
On 22/02/2025 14:18, David Brown wrote:
My understanding here is that the OP is getting the UCS-2 encoded
string in from a modem, almost certainly on a serial line. The UCS-2
encoded data is itself a binary sequence of 16-bit code units, and the
modem firmware is sending those as four hex digits.
Exactly. This is the reply to the AT+CMGR command that is standardized
in 3GPP TS 27.005.
Anything that is specifying the use of UCS-2 encoding automatically
dates itself to about the early-to-mid 1990s.
Honestly, I don't know why and when, but the LTE modem I'm using
(Simcom A7672E) replies to AT+CMGR in two different formats:
- what is described as the GSM 7-bit alphabet (but it's really UTF-8
when non-ASCII chars are present)
- UCS2
Of course, in the header it specifies the <dcs> (data coding scheme)
so the receiver on the UART can interpret all the data correctly.
Are you sure it is UCS2 and not UTF-16?
Can it not handle characters not in the BMP?
The difference between UCS2 and UTF-16 is that UCS2 is the character
set that predates the surrogate pairs added to extend it. It is very
much equivalent to the relationship of ASCII to UTF-8.
Honestly, I don't know; the standard says UCS2.
On 24/02/2025 21:13, Lawrence D'Oliveiro wrote:
Anything that is specifying the use of UCS-2 encoding automatically
dates itself to about the early-to-mid 1990s.
Here[1] the first version of this document is dated back to 1999, but
UCS2 remains and is implemented in modems currently on the market.
[1] https://www.3gpp.org/ftp/Specs/archive/27_series/27.005/
On Tue, 25 Feb 2025 15:53:23 +0100, pozz wrote:
... the standard says UCS2
Does it mention anything about the surrogate range (0xD800..0xDFFF)?
In order for it to be strict UCS-2, those would have to be forbidden.
If they are allowed, then that makes it UTF-16.
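[A sketch of that check, plus the pair arithmetic that applies if the
stream turns out to be UTF-16 after all:

/* A lone 16-bit unit in the surrogate range is illegal in strict
   UCS-2; seeing one means the data is really UTF-16 (or corrupt). */
static int is_surrogate(unsigned u)      { return u >= 0xD800 && u <= 0xDFFF; }
static int is_high_surrogate(unsigned u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(unsigned u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid high/low pair into a code point (U+10000..U+10FFFF). */
static unsigned long utf16_combine(unsigned hi, unsigned lo)
{
    return 0x10000UL + (((unsigned long)(hi - 0xD800) << 10)
                        | (unsigned long)(lo - 0xDC00));
}]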
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Tue, 25 Feb 2025 15:53:23 +0100, pozz wrote:
Honestly, I don't know; the standard says UCS2.
I remembered that the specs are online
<https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1514>.
They don’t actually say what “data coding schemes” are allowed; the
only mention of UCS-2 is
if <dcs> indicates that 8-bit or UCS2 data coding scheme is used,
or <fo> indicates that 3GPP TS 23.040 [3]
TP-User-Data-Header-Indication is set: ME/TA converts each 8-bit
octet into two IRA character long hexadecimal number (e.g. octet
with integer value 42 is presented to TE as two characters 2A (IRA
50 and 65))
So it doesn’t say that UTF-16 is or isn’t allowed.
It doesn't say that EBCDIC or UTF-7 is or isn't allowed.