Forum: >>> Magnum BBS <<<

Invalid UTF handling behavior between 8.6.10 and 8.6.12/13

From Phillip Brooks@21:1/5 to All on Mon Feb 6 16:38:50 2023

Hi,
We are porting our application forward to Tcl 8.6.12 and have also had a look at Tcl 8.6.13. Both handle invalid UTF-8 byte strings differently from 8.6.10, and that is causing us a problem.

In 8.6.10, there wasn't a problem, at least with the testcase shown below. In 8.6.12 and 8.6.13, the test crashes in Tcl_UtfToUpper (see the stack trace further down).

In searching for answers, I noticed that there is a statement in the background of TIP #345 that says:

"The contract of string representations in Tcl states that the bytes field (the strep) of a Tcl_Obj must be a valid UTF-8 byte sequence. ..."

This seems to imply that handing an invalid byte sequence to Tcl_NewStringObj and manipulating it as a valid UTF-8 byte sequence could result in this sort of crash. Is this "contract" in any of the Tcl API documentation?

Here is the small test program that demonstrates the issue. (Hopefully, the invalid string input will make it through intact. Let me know if it does not, and I will try to find an alternate way of posting it.)

#include <stdio.h>
#include <string.h>
#include <tcl.h>

void test(const char* s1)
{
    int length = strlen(s1);
    Tcl_Obj *valuePtr;
    Tcl_Obj *objResultPtr;

    printf("Ready to determine the length of \"%s\"\n", s1);

    objResultPtr = Tcl_NewStringObj(s1, length);
    length = Tcl_UtfToUpper(Tcl_GetString(objResultPtr));
    Tcl_SetObjLength(objResultPtr, length);

    printf("The length of \"%s\" is %d\n", s1, length);
}

int main (int argc, char *argv[]) {
    Tcl_FindExecutable(NULL);

    Tcl_Interp *myinterp;

    myinterp = Tcl_CreateInterp();

    test("//j The quick brown fox jumps over the lazy dog");
    test("//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣");

    printf( "Completed test.\n" );

    return 0;
}

The crash in Tcl 8.6.13 shows this stack trace:

(gdb) where
#0 0x00007ffff7d3deae in Tcl_UtfToUniChar (src=0x491ffe "\244\261"<error: Cannot access memory at address 0x492000>, chPtr=0x7fffffffcbc6) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:409
#1 0x00007ffff7d3fd0b in TclUtfToUCS4 (src=0x491ffe "\244\261"<error: Cannot access memory at address 0x492000>, ucs4Ptr=0x7fffffffcbf4) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:2417
#2 0x00007ffff7d3e802 in Tcl_UtfToUpper (
str=0x47f330 "//J \244ʤ\353\262\304Ǽ\300\255\244\242\244ꡢ\245\301\245\247\245Å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\355\251\223\355\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\
261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\
244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\
261\244\261"...) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:1068
#3 0x0000000000400910 in test (s1=0x400aa0 "//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣") at
main.c:14
#4 0x0000000000400976 in main (argc=1, argv=0x7fffffffcd58) at main.c:28

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From saitology9@21:1/5 to Phillip Brooks on Mon Feb 6 20:36:02 2023

On 2/6/2023 7:38 PM, Phillip Brooks wrote:

> Hi,
> We are attempting to port our application forward to Tcl 8.6.12 and have also had a look at Tcl 8.6.13 and we have found a problem with the way that invalid UTF character strings are handled between the two.
>
> In 8.6.10, there wasn't a problem at least with the testcase shown below. In 8.6.12 and 8.6.13, the test crashes in
>
> In searching for answers, I noticed that there is a statement in the background of TIP #345 that says:
>
> "The contract of string representations in Tcl states that the bytes field (the strep) of a Tcl_Obj must be a valid UTF-8 byte sequence. ..."
>
> This seems to imply that handing an invalid byte sequence to Tcl_NewStringObj and manipulating it as a valid UTF-8 byte sequence could result in this sort of crash. Is this "contract" in any of the Tcl API documentation?
>
> Here is the small test program that demonstrates the issue. (Hopefully, the invalid string input will make it through intact. Let me know if it does not, and I will try to find an alternate way of posting it.

I ran your test input in tclsh (version 8.6.12) and it ran fine:
Do you get a different result from the tclsh/wish shell?

% encoding system
utf-8

% proc test {s} {puts "$s : [string length $s]"}

% test "//j The quick brown fox jumps over the lazy dog"

% test "//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣"

% puts "Completed test.\n"

This is the output:

//j The quick brown fox jumps over the lazy dog : 47
//j ¤ʤë²ÄǽÀ¤¢¤ꡢ¥Á¥§¥å¯Êýˡ¤Îʣ»¨²½¤âÈò¤±¤ë°١¢º£²ó¤Ͻü³°¤Ȥ¹¤롣 : 58
Completed test.

From Christian Gollwitzer@21:1/5 to All on Tue Feb 7 07:26:48 2023

Hi Phil,

Am 07.02.23 um 01:38 schrieb Phillip Brooks:

> We are attempting to port our application forward to Tcl 8.6.12 and have also had a look at Tcl 8.6.13 and we have found a problem with the way that invalid UTF character strings are handled between the two.
>
> In 8.6.10, there wasn't a problem at least with the testcase shown below. In 8.6.12 and 8.6.13, the test crashes in
>
> In searching for answers, I noticed that there is a statement in the background of TIP #345 that says:
>
> "The contract of string representations in Tcl states that the bytes field (the strep) of a Tcl_Obj must be a valid UTF-8 byte sequence. ..."
>
> This seems to imply that handing an invalid byte sequence to Tcl_NewStringObj and manipulating it as a valid UTF-8 byte sequence could result in this sort of crash. Is this "contract" in any of the Tcl API documentation?

I'm not sure if I understand what the point of your code is. Could you
please describe what the meaning of the long octal encoded byte sequence
is? Is it a UTF-8 string with one invalid char? Also, it is strange that
your source code has non-ASCII chars inside of the C string. The
encoding of these depends on the C compiler (!); it might be encoded
as UTF-8, latin-1, or anything else.

In principle, the sentence you found is correct. The string
representation of a Tcl_Obj is a string in the sense of Tcl; it is usually
stored as UTF-8, with the exception that NUL bytes are encoded as C0 80
so that the string can still be handled as NUL-terminated. If you want
to handle arbitrary binary data, use a ByteArray. If it is a UTF-8
string with errors, do a script-level [encoding convertfrom], or do
the same from the C level. The functions for this are described here:

https://www.tcl.tk/man/tcl/TclLib/Encoding.html

If you bypass the encodings and directly put chars into the string rep
of a Tcl_Obj, you will get undefined behaviour if it is not correct UTF-8.

Christian

From Phillip Brooks@21:1/5 to All on Tue Feb 7 10:44:30 2023

On Monday, February 6, 2023 at 5:36:08 PM UTC-8, saitology9 wrote:

> I ran your test input in tclsh (version 8.6.12) and it ran fine:
> Do you get a different result from the tclsh/wish shell?

Yes, I similarly ran the test through tclsh and that succeeded. A more complete version is:

set str1 "//j The quick brown fox jumps over the lazy dog"
set str2 "//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣"
puts $str1
set str1_uc [ string toupper $str1 ]
puts $str1_uc
puts $str2
set str2_uc [ string toupper $str2 ]
puts $str2_uc

I am not sure what tclsh is doing to prevent the error, though.

From briang@21:1/5 to Phillip Brooks on Tue Feb 7 13:17:13 2023

This code is in violation according to the manual:

objResultPtr = Tcl_NewStringObj(s1, length);
length = Tcl_UtfToUpper(Tcl_GetString(objResultPtr));
Tcl_SetObjLength(objResultPtr, length);

"Except for that limited purpose, the pointer returned by Tcl_GetStringFromObj or Tcl_GetString should be treated as read-only. It is recommended that this pointer be assigned to a (const char *) variable. Even in the limited situations where writing to this pointer is acceptable, one should take care to respect the copy-on-write semantics required by Tcl_Obj's, with appropriate calls to Tcl_IsShared and Tcl_DuplicateObj prior to any in-place modification of the string representation."

-Brian

From Phillip Brooks@21:1/5 to briang on Tue Feb 7 14:10:11 2023

On Tuesday, February 7, 2023 at 1:17:16 PM UTC-8, briang wrote:

> This code is in violation according to the manual:
>
> objResultPtr = Tcl_NewStringObj(s1, length);
> length = Tcl_UtfToUpper(Tcl_GetString(objResultPtr));
> Tcl_SetObjLength(objResultPtr, length);
>
> "Except for that limited purpose, the pointer returned by Tcl_GetStringFromObj or Tcl_GetString should be treated as read-only. It is recommended that this pointer be assigned to a (const char *) variable. Even in the limited situations where writing to this pointer is acceptable, one should take care to respect the copy-on-write semantics required by Tcl_Obj's, with appropriate calls to Tcl_IsShared and Tcl_DuplicateObj prior to any in-place modification of the string representation."
>
> -Brian

Right - that bit was actually copied out from inside of Tcl someplace by the person that isolated it into a standalone problem. Our application was doing something different. This (more correct) code also crashes:

char* s1_upper = (char*) malloc( length+1 );
strcpy( s1_upper, s1 );
length = Tcl_UtfToUpper(s1_upper);

From Phillip Brooks@21:1/5 to Christian Gollwitzer on Tue Feb 7 13:48:23 2023

Thanks for the response, it definitely got me farther.

On Monday, February 6, 2023 at 10:26:53 PM UTC-8, Christian Gollwitzer wrote:

> I'm not sure if I understand what the point of your code is. Could you
> please describe what the meaning of the long octal encoded byte sequence
> is?

The string is basically undefined incorrect data - we know that. It is from a test suite
checking to make sure that we can handle undefined incorrect data, so it could really
be anything (I am not sure of the original source for the data).

> If you want
> to handle arbitrary binary data, then either use a ByteArray, and if it
> is an UTF8 string with errors, do a script level [encoding convertfrom],
> or you can do the same from the C level. The functions for this
> described here:
>
> https://www.tcl.tk/man/tcl/TclLib/Encoding.html

From the C level, something like this?

void test( Tcl_Interp *interp, const char* s1)
{
    Tcl_Encoding utf8_encoding = Tcl_GetEncoding( interp, "utf-8" );
    int length = strlen(s1);
    printf("The length of \"%s\" is %d\n", s1, length);

    char* valid_s1 = (char*) malloc( length*2 );
    int length_read = 0;
    int length_written = 0;
    int rt = Tcl_ExternalToUtf(
        interp,
        utf8_encoding,
        s1,
        length,
        0,
        nullptr,
        valid_s1,
        length*2,
        &length_read,
        &length_written,
        nullptr );
    if ( rt != TCL_OK ) {
        return;
    }
    ...

That does stop it from crashing but, oddly enough, it also converts the invalid data to some other invalid data, though possibly valid UTF-8 nonetheless? It doesn't seem to damage any of the few valid UTF-8 strings I passed through it.

> If you bypass the encodings and directly put chars into the string rep
> of a Tcl_Obj, you will get undefined behaviour if it is not correct UTF8.

OK - is that documented somewhere? I don't see anything to that effect in Tcl_NewStringObj.
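
One possible explanation for the "other invalid data", sketched under an assumption that nobody in this thread has confirmed: the 8.6 utf-8 decoder keeps well-formed sequences and treats every other byte as a single Latin-1 character, which then gets written out as a two-byte UTF-8 sequence. The output would then be different bytes, but well-formed UTF-8. The helpers utf8_seq_len and utf8_sanitize_latin1 below are hypothetical names, not Tcl API; this is plain C that only mimics the assumed behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Length (1..4) of the well-formed UTF-8 sequence starting at p, or 0 if
 * the n available bytes do not start one. Hypothetical helper. */
size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned long cp, min;
    size_t need, k;

    if (n == 0) return 0;
    if (p[0] < 0x80) return 1;                       /* plain ASCII */
    if ((p[0] & 0xE0) == 0xC0)      { need = 1; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 2; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 3; cp = p[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (n - 1 < need) return 0;     /* sequence is cut off */
    for (k = 1; k <= need; k++) {
        if ((p[k] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[k] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return need + 1;
}

/* Copy src to dst. Well-formed sequences are kept; any other byte b is
 * re-encoded as the two-byte UTF-8 form of the Latin-1 character U+00b.
 * ASSUMPTION: this Latin-1 fallback only imitates what Tcl 8.6 may do.
 * dst must have room for 2 * len bytes. Returns the bytes written. */
size_t utf8_sanitize_latin1(const unsigned char *src, size_t len,
                            unsigned char *dst)
{
    size_t i = 0, o = 0;

    assert(len == 0 || (src != NULL && dst != NULL));
    while (i < len) {
        size_t n = utf8_seq_len(src + i, len - i);
        if (n > 0) {
            while (n--) dst[o++] = src[i++];
        } else {
            unsigned char b = src[i++];              /* always >= 0x80 here */
            dst[o++] = (unsigned char)(0xC0 | (b >> 6));
            dst[o++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return o;
}
```

Under that assumption, the input bytes \244\312 would come out as \302\244\303\212: longer, still meaningless, but safe to hand to functions that expect well-formed UTF-8.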

From briang@21:1/5 to Phillip Brooks on Tue Feb 7 13:55:02 2023

On Tuesday, February 7, 2023 at 1:48:26 PM UTC-8, Phillip Brooks wrote:

> OK - is that documented somewhere? I don't see anything to that effect in Tcl_NewStringObj.

https://www.tcl-lang.org/man/tcl8.6/TclLib/StringObj.htm#M4

"Points to the first byte of an array of UTF-8-encoded bytes used to set or append to a string value. This byte array may contain embedded null characters unless numChars is negative. (Applications needing null bytes should represent them as the two-byte sequence \300\200, use Tcl_ExternalToUtf to convert, or Tcl_NewByteArrayObj if the string is a collection of uninterpreted bytes.)"
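
The \300\200 convention in that passage can be illustrated with a few lines of plain C. This is only a sketch: encode_embedded_nuls is a hypothetical helper, not a Tcl function, and it does nothing about other invalid bytes. In real code Tcl_ExternalToUtf or Tcl_NewByteArrayObj, as the manual suggests, are likely the better tools.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a Tcl API): copy src to dst, rewriting every
 * 0x00 byte as the two-byte sequence 0xC0 0x80 (octal \300\200).
 * All other bytes are copied unchanged and are NOT validated.
 * dst must have room for 2 * len bytes. Returns the bytes written.
 */
size_t encode_embedded_nuls(const unsigned char *src, size_t len,
                            unsigned char *dst)
{
    size_t i, o = 0;

    assert(len == 0 || (src != NULL && dst != NULL));
    for (i = 0; i < len; i++) {
        if (src[i] == 0x00) {
            dst[o++] = 0xC0;
            dst[o++] = 0x80;
        } else {
            dst[o++] = src[i];
        }
    }
    return o;
}
```

The result contains no zero bytes, so it can still be terminated with a single NUL and treated as a C string.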

From Christian Gollwitzer@21:1/5 to All on Wed Feb 8 07:58:19 2023

Am 07.02.23 um 22:48 schrieb Phillip Brooks:

> Thanks for the response, it definitely got me farther.
>
> On Monday, February 6, 2023 at 10:26:53 PM UTC-8, Christian Gollwitzer wrote:
>
>> I'm not sure if I understand what the point of your code is. Could you
>> please describe what the meaning of the long octal encoded byte sequence
>> is?
>
> The string is basically undefined incorrect data - we know that. It is from a test suite
> checking to make sure that we can handle undefined incorrect data, so it could really
> be anything (I am not sure of the original source for the data).

OK - so it is not arbitrary binary data (that would be

>> If you want
>> to handle arbitrary binary data, then either use a ByteArray, and if it
>> is an UTF8 string with errors, do a script level [encoding convertfrom],
>> or you can do the same from the C level. The functions for this
>> described here:
>>
>> https://www.tcl.tk/man/tcl/TclLib/Encoding.html
>
> From the C level, something like this?
>
> void test( Tcl_Interp *interp, const char* s1)
> {
> Tcl_Encoding utf8_encoding = Tcl_GetEncoding( interp, "utf-8" );
> int length = strlen(s1);
> printf("The length of \"%s\" is %d\n", s1, length);
>
> char* valid_s1 = (char*) malloc( length*2 );
> int length_read = 0;
> int length_written = 0;
> int rt = Tcl_ExternalToUtf(
> interp,
> utf8_encoding,
> s1,
> length,
> 0,
> nullptr,
> valid_s1,
> length*2,
> &length_read,
> &length_written,
> nullptr );
> if ( rt != TCL_OK ) {
> return;

In principle, yes. Instead of the malloc I would use
Tcl_ExternalToUtfDString, which allocates the correct number of bytes
itself, and you can easily convert the DString into a Tcl_Obj later on
(without copying). Also consider caching the Tcl_Encoding if you call
this function more often, and free the encoding at the end of the program.

> That does stop it from crashing, but, oddly enough, it also converts
> the invalid data to some other invalid data - but possibly a valid
> UTF-8 nonetheless? It doesn't seem to damage any of the few valid
> UTF-8 strings I passed through it.

Are you sure it is invalid data? It should have replaced all invalid
chars by the Unicode encoding for "invalid".

If you want to handle the encoding errors differently, I'm not sure how
to do it in the current Tcl version. There is a discussion going on about
improving Unicode support in Tcl 9, which will bring different failure
modes also from the script level, so that the application can decide
which errors to reject etc.

I had a similar problem in one of my projects, and I decided to check
the data for UTF-8 compatibility manually. If it was incorrect data, I
passed it in as a ByteArray. That was the right way to do it in this
specific context. The code is here:

https://github.com/BessyHDFViewer/HDFpp/blob/c5d384b3970f0bebc47e40224dc218cc6c441cbf/generic/SWObject.hpp#L56

Christian
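
For readers who do not want to follow the link, a manual check of this kind might look roughly like the sketch below. It is not the code from the HDFpp repository, just a generic strict well-formedness test in the sense of RFC 3629, and is_valid_utf8 is a hypothetical name. Note that a strict check like this also rejects the C0 80 form that Tcl uses internally for NUL, so it is meant for external input, not for Tcl string reps.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
 * Hypothetical helper, not part of the Tcl API. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(len == 0 || buf != NULL);
    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;   /* decoded value, smallest legal value */
        size_t need, k;          /* continuation bytes expected */

        if (c < 0x80) { i++; continue; }             /* ASCII */
        if ((c & 0xE0) == 0xC0)      { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;           /* stray continuation or invalid lead byte */

        if (len - i - 1 < need) return 0;            /* truncated sequence */
        for (k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min) return 0;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode */
        i += need + 1;
    }
    return 1;
}
```

A caller could then pass the bytes to Tcl_NewStringObj when the check succeeds and to Tcl_NewByteArrayObj when it fails, which is the split described above.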

From Ralf Fassel@21:1/5 to All on Wed Feb 8 11:10:32 2023

* Phillip Brooks <philbrks@gmail.com>
| Yes, I similarly ran the test through tclsh and that succeeded. A more complete version is:

| set str2 "//j\244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣"

| I am not sure what tclsh is doing to prevent the error, though.

% string length $str2
57
% string bytelength $str2
113

% set str1 \255

% string length $str1
1
% string bytelength $str1
2

I think TCL simply encodes the data you pass in (even with \ooo) as valid UTF-8.

R'
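
The numbers above (string length 1, bytelength 2 for \255) are what standard UTF-8 encoding gives for the code point U+00AD. The sketch below shows that arithmetic in plain C, assuming that the \ooo escape in a Tcl script names a code point and that the string rep stores it as UTF-8. utf8_encode is a hypothetical helper, not Tcl's own routine.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode code point as standard UTF-8 into out (up to 4 bytes).
 * Returns the number of bytes written. Hypothetical helper, not Tcl API. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    /* Only scalar values are accepted: no surrogates, nothing past U+10FFFF. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Calling utf8_encode(0255, buf) yields the two bytes 0xC2 0xAD: one character, two bytes, matching the tclsh output.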

From Phillip Brooks@21:1/5 to Ralf Fassel on Wed Feb 8 17:41:28 2023

On Wednesday, February 8, 2023 at 2:10:37 AM UTC-8, Ralf Fassel wrote:

> * Phillip Brooks
> I think TCL simply encodes the data you pass in (even with \ooo) as valid UTF-8.

OK - probably something similar to the solution we reached using Tcl_ExternalToUtf above. Thanks for the observations.

From Phillip Brooks@21:1/5 to Christian Gollwitzer on Wed Feb 8 17:48:56 2023

On Tuesday, February 7, 2023 at 10:58:23 PM UTC-8, Christian Gollwitzer wrote:

> In principle, yes. Instead of the malloc I would use Tcl_ExternalToUtfDString, which allocates the correct number of bytes itself, and you can easily convert the DString into a Tcl_Obj later on (without copying). Also consider caching the Tcl_Encoding if you call this function more often, and free the encoding at the end of the program.
>
> Are you sure it is invalid data? It should have replaced all invalid
> chars by the Unicode encoding for "invalid".

OK - maybe not invalid, but meaningless nonetheless. Interestingly, I found out that this data was generated by our previous issue with the Korean UTF-8 string that was being interpreted as something else, after which junk was printed into one of our result files. That junk result file has since become part of our test suite. (Devilishly clever, some of these QA folks. Show them a bug, and they know they have spotted weakness. They cleverly dive in anew on the same spot and lay groundworks to find another bug in the process.)

> If you want to handle the encoding errors differently, I'm not sure how
> to do it in current Tcl version. There is a discussion going on about improving Unicode support in Tcl 9, which will bring different failure
> modes also from the script level, so that the application can decide on which errors to reject etc.

Avoiding crashing is the main objective here. We'll leave interpreting whether the UTF-8 means anything for another day (ChatGPTcl?)

> I had a similar problem in one of my projects, and I decided to check
> the data for UTF8 compatibility manually. If it was incorrect data, I
> passed it in as a ByteArray. That was the right way to do in this
> specific context. The code is here:
>
> https://github.com/BessyHDFViewer/HDFpp/blob/c5d384b3970f0bebc47e40224dc218cc6c441cbf/generic/SWObject.hpp#L56

Thanks, I'll take a look!

From Ted Nolan @21:1/5 to philbrks@gmail.com on Thu Feb 9 04:06:26 2023

In article <90e79e83-ca17-4497-8998-e5403fa339b3n@googlegroups.com>,
Phillip Brooks <philbrks@gmail.com> wrote:

>> I had a similar problem in one of my projects, and I decided to check
>> the data for UTF8 compatibility manually. If it was incorrect data, I
>> passed it in as a ByteArray. That was the right way to do in this
>> specific context. The code is here:
>>
>> https://github.com/BessyHDFViewer/HDFpp/blob/c5d384b3970f0bebc47e40224dc218cc6c441cbf/generic/SWObject.hpp#L56
>
> Thanks, I'll take a look!

I discovered once that "encoding convertfrom utf-8" would not
throw an error if you invoked it on invalid (non utf-8) data, which I
had not expected. I'm not sure what it actually does, but it's
happy to hand you garbage.

I wrote a little "is_utf8" extension based on code from:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

It turned out to be pretty easy.
--
columbiaclosings.com
What's not in Columbia anymore..
