Hi,
We are attempting to port our application forward to Tcl 8.6.12 and have also had a look at Tcl 8.6.13 and we have found a problem with the way that invalid UTF character strings are handled between the two.
In 8.6.10, there wasn't a problem at least with the testcase shown below. In 8.6.12 and 8.6.13, the test crashes in
In searching for answers, I noticed that there is a statement in the background of TIP #345 that says:
"The contract of string representations in Tcl states that the bytes field (the strep) of a Tcl_Obj must be a valid UTF-8 byte sequence. ..."
This seems to imply that handing an invalid byte sequence to Tcl_NewStringObj and manipulating it as a valid UTF-8 byte sequence could result in this sort of crash. Is this "contract" in any of the Tcl API documentation?
I ran your test input in tclsh (version 8.6.12) and it ran fine:
Do you get a different result from the tclsh/wish shell?
Here is the small test program that demonstrates the issue. (Hopefully, the invalid string input will make it through intact. Let me know if it does not, and I will try to find an alternate way of posting it.)
#include <stdio.h>
#include <string.h>
#include <tcl.h>
void test(const char* s1)
{
    int length = (int) strlen(s1);
    Tcl_Obj *objResultPtr;

    printf("Ready to determine the length of \"%s\"\n", s1);
    objResultPtr = Tcl_NewStringObj(s1, length);
    /* Upper-case the string rep in place; the crash happens inside this call. */
    length = Tcl_UtfToUpper(Tcl_GetString(objResultPtr));
    Tcl_SetObjLength(objResultPtr, length);
    printf("The length of \"%s\" is %d\n", s1, length);
}

int main (int argc, char *argv[]) {
    Tcl_FindExecutable(NULL);
    Tcl_Interp *myinterp;
    myinterp = Tcl_CreateInterp();
    /* First call: plain ASCII. Second call: the invalid byte sequence. */
    test("//j The quick brown fox jumps over the lazy dog");
    test("//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣");
    printf( "Completed test.\n" );
    return 0;
}
The crash in Tcl 8.6.13 shows this stack trace:
(gdb) where
#0 0x00007ffff7d3deae in Tcl_UtfToUniChar (src=0x491ffe "\244\261"<error: Cannot access memory at address 0x492000>, chPtr=0x7fffffffcbc6) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:409
#1 0x00007ffff7d3fd0b in TclUtfToUCS4 (src=0x491ffe "\244\261"<error: Cannot access memory at address 0x492000>, ucs4Ptr=0x7fffffffcbf4) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:2417
#2 0x00007ffff7d3e802 in Tcl_UtfToUpper (
str=0x47f330 "//J \244ʤ\353\262\304Ǽ\300\255\244\242\244ꡢ\245\301\245\247\245Å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\355\251\223\355\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\
#3 0x0000000000400910 in test (s1=0x400aa0 "//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣") at main.c:14
#4 0x0000000000400976 in main (argc=1, argv=0x7fffffffcd58) at main.c:28
This code is in violation according to the manual:
objResultPtr = Tcl_NewStringObj(s1, length);
length = Tcl_UtfToUpper(Tcl_GetString(objResultPtr)); Tcl_SetObjLength(objResultPtr, length);
"Except for that limited purpose, the pointer returned by Tcl_GetStringFromObj or Tcl_GetString should be treated as read-only. It is recommended that this pointer be assigned to a (const char *) variable. Even in the limited situations where writing to this pointer is acceptable, one should take care to respect the copy-on-write semantics required by Tcl_Obj's, with appropriate calls to Tcl_IsShared and Tcl_DuplicateObj prior to any in-place modification of the string representation."
-Brian
I'm not sure if I understand what the point of your code is. Could you
please describe what the meaning of the long octal encoded byte sequence
is?
If you want
to handle arbitrary binary data, then either use a ByteArray, and if it
is an UTF8 string with errors, do a script level [encoding convertfrom],
or you can do the same from the C level. The functions for this are
described here:
https://www.tcl.tk/man/tcl/TclLib/Encoding.html
If you bypass the encodings and directly put chars into the string rep
of a Tcl_Obj, you will get undefined behaviour if it is not correct UTF8.
Thanks for the response, it definitely got me farther.
On Monday, February 6, 2023 at 10:26:53 PM UTC-8, Christian Gollwitzer wrote:
> I'm not sure if I understand what the point of your code is. Could you
> please describe what the meaning of the long octal encoded byte sequence is?
The string is basically undefined incorrect data - we know that. It is from a test suite
checking to make sure that we can handle undefined incorrect data, so it could really
be anything (I am not sure of the original source for the data).
> If you want
> to handle arbitrary binary data, then either use a ByteArray, and if it
> is an UTF8 string with errors, do a script level [encoding convertfrom],
> or you can do the same from the C level. The functions for this
> described here:
> https://www.tcl.tk/man/tcl/TclLib/Encoding.html
From the C level, something like this?
void test( Tcl_Interp *interp, const char* s1)
{
    Tcl_Encoding utf8_encoding = Tcl_GetEncoding( interp, "utf-8" );
    int length = strlen(s1);
    printf("The length of \"%s\" is %d\n", s1, length);
    /* Output buffer for the converted string; freed on the error path below. */
    char* valid_s1 = (char*) malloc( length*2 );
    int length_read = 0;
    int length_written = 0;
    int rt = Tcl_ExternalToUtf(
        interp,
        utf8_encoding,
        s1,
        length,
        0,              /* flags */
        nullptr,        /* statePtr */
        valid_s1,
        length*2,
        &length_read,
        &length_written,
        nullptr );      /* dstCharsPtr */
    if ( rt != TCL_OK ) {
        free( valid_s1 );
        return;
    }
    ...
That does stop it from crashing, but, oddly enough, it also converts the invalid data to some other invalid data - but possibly valid UTF-8 nonetheless? It doesn't seem to damage any of the few valid UTF-8 strings I passed through it.
> If you bypass the encodings and directly put chars into the string rep
> of a Tcl_Obj, you will get undefined behaviour if it is not correct UTF8.
OK - is that documented somewhere? I don't see anything to that effect in Tcl_NewStringObj.
I think TCL simply encodes the data you pass in (even with \ooo) as valid UTF-8.
In principle, yes. Instead of the malloc I would use Tcl_ExternalToUtfDString, which allocates the correct number of bytes itself, and you can easily convert the DString into a Tcl_Obj later on (without copying). Also consider caching the Tcl_Encoding if you call
this function more often, and free the encoding at the end of the program.
Are you sure it is invalid data? It should have replaced all invalid
chars by the Unicode encoding for "invalid".
If you want to handle the encoding errors differently, I'm not sure how
to do it in the current Tcl version. There is a discussion going on about improving Unicode support in Tcl 9, which will bring different failure
modes, also at the script level, so that the application can decide which errors to reject, etc.
I had a similar problem in one of my projects, and I decided to check
the data for UTF8 compatibility manually. If it was incorrect data, I
passed it in as a ByteArray. That was the right thing to do in this
specific context. The code is here:
https://github.com/BessyHDFViewer/HDFpp/blob/c5d384b3970f0bebc47e40224dc218cc6c441cbf/generic/SWObject.hpp#L56
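A manual check along the lines described above might look like the following. This is not the code from the linked repository; it is a minimal sketch of a strict validator for standard UTF-8, and `is_valid_utf8` is a hypothetical name. Note that it checks standard UTF-8 rather than Tcl's internal variant (which, for instance, stores NUL as the two bytes \xC0\x80), so it is a conservative test for external data only.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed standard UTF-8, otherwise 0. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;        /* continuation bytes expected after the lead byte */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */
        size_t k;

        if (b < 0x80) {
            i++;            /* plain ASCII */
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }
        if (len - i - 1 < need) {
            return 0;       /* sequence is truncated */
        }
        for (k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) {
                return 0;   /* not a continuation byte */
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return 0;       /* overlong form, out of range, or surrogate */
        }
        i += need + 1;
    }
    return 1;
}
```

When the check fails, the bytes could be handed to Tcl_NewByteArrayObj instead of Tcl_NewStringObj, in line with the ByteArray approach mentioned above.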
On Tuesday, February 7, 2023 at 10:58:23 PM UTC-8, Christian Gollwitzer wrote:
> In principle, yes. Instead of the malloc I would use
> Tcl_ExternalToUtfDString, which allocates the correct number of bytes
> itself, and you can easily convert the DString into a Tcl_Obj later on
> (without copying). Also consider caching the Tcl_Encoding if you call
> this function more often, and free the encoding at the end of the program.
> Are you sure it is invalid data? It should have replaced all invalid
> chars by the Unicode encoding for "invalid".
OK - maybe not invalid but meaningless, nonetheless. Interestingly, I found out that this data was generated by our previous issue with the
Korean UTF-8 string that was being interpreted as something else and
then junk was printed into one of our result files. That junk result
file has since become part of our test suite. (Devilishly clever, some
of these QA folks. Show them a bug, and they know they have spotted a weakness. They cleverly dive in anew on the same spot and lay
groundworks to find another bug in the process.)
> If you want to handle the encoding errors differently, I'm not sure how
> to do it in current Tcl version. There is a discussion going on about
> improving Unicode support in Tcl 9, which will bring different failure
> modes also from the script level, so that the application can decide on
> which errors to reject etc.
Avoiding crashing is the main objective here. We'll leave interpreting whether the UTF-8 means anything for another day (ChatGPTcl?)
> I had a similar problem in one of my projects, and I decided to check
> the data for UTF8 compatibility manually. If it was incorrect data, I
> passed it in as a ByteArray. That was the right thing to do in this
> specific context. The code is here:
> https://github.com/BessyHDFViewer/HDFpp/blob/c5d384b3970f0bebc47e40224dc218cc6c441cbf/generic/SWObject.hpp#L56
Thanks, I'll take a look!