• Invalid UTF handling behavior between 8.6.10 and 8.6.12/13

    From Phillip Brooks@21:1/5 to All on Mon Feb 6 16:38:50 2023
    Hi,
    We are attempting to port our application forward to Tcl 8.6.12 and have also had a look at Tcl 8.6.13 and we have found a problem with the way that invalid UTF character strings are handled between the two.

    In 8.6.10, there wasn't a problem at least with the testcase shown below. In 8.6.12 and 8.6.13, the test crashes in

    In searching for answers, I noticed that there is a statement in the background of TIP #345 that says:

    "The contract of string representations in Tcl states that the bytes field (the strep) of a Tcl_Obj must be a valid UTF-8 byte sequence. ..."

    This seems to imply that handing an invalid byte sequence to Tcl_NewStringObj and manipulating it as a valid UTF-8 byte sequence could result in this sort of crash. Is this "contract" in any of the Tcl API documentation?

    Here is the small test program that demonstrates the issue. (Hopefully, the invalid string input will make it through intact. Let me know if it does not, and I will try to find an alternate way of posting it.)

    #include <stdio.h>
    #include <string.h>
    #include <tcl.h>

    void test(const char* s1)
    {
        int length = strlen(s1);
        Tcl_Obj *valuePtr;
        Tcl_Obj *objResultPtr;

        printf("Ready to determine the length of \"%s\"\n", s1);

        objResultPtr = Tcl_NewStringObj(s1, length);
        length = Tcl_UtfToUpper(Tcl_GetString(objResultPtr));
        Tcl_SetObjLength(objResultPtr, length);

        printf("The length of \"%s\" is %d\n", s1, length);
    }

    int main (int argc, char *argv[]) {
        Tcl_FindExecutable(NULL);

        Tcl_Interp *myinterp;

        myinterp = Tcl_CreateInterp();

        test("//j The quick brown fox jumps over the lazy dog");
        test("//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣");

        printf( "Completed test.\n" );

        return 0;
    }

    The crash in Tcl 8.6.13 shows this stack trace:

    (gdb) where
    #0 0x00007ffff7d3deae in Tcl_UtfToUniChar (src=0x491ffe "\244\261"<error: Cannot access memory at address 0x492000>, chPtr=0x7fffffffcbc6) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:409
    #1 0x00007ffff7d3fd0b in TclUtfToUCS4 (src=0x491ffe "\244\261"<error: Cannot access memory at address 0x492000>, ucs4Ptr=0x7fffffffcbf4) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:2417
    #2 0x00007ffff7d3e802 in Tcl_UtfToUpper (
    str=0x47f330 "//J \244ʤ\353\262\304Ǽ\300\255\244\242\244ꡢ\245\301\245\247\245Å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\355\251\223\355\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\
    261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\
    244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\
    261\244\261"...) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:1068
    #3 0x0000000000400910 in test (s1=0x400aa0 "//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣") at
    main.c:14
    #4 0x0000000000400976 in main (argc=1, argv=0x7fffffffcd58) at main.c:28

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From saitology9@21:1/5 to Phillip Brooks on Mon Feb 6 20:36:02 2023
    I ran your test input in tclsh (version 8.6.12) and it ran fine:
    Do you get a different result from the tclsh/wish shell?

    % encoding system
    utf-8

    % proc test {s} {puts "$s : [string length $s]"}

    % test "//j The quick brown fox jumps over the lazy dog"

    % test "//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣"

    % puts "Completed test.\n"


    This is the output:

    //j The quick brown fox jumps over the lazy dog : 47
    //j ¤ʤë²ÄǽÀ­¤¢¤ꡢ¥Á¥§¥å¯Êýˡ¤Îʣ»¨²½¤âÈò¤±¤ë°١¢º£²ó¤Ͻü³°¤Ȥ¹¤롣 : 58
    Completed test.

  • From Christian Gollwitzer@21:1/5 to All on Tue Feb 7 07:26:48 2023
    Hi Phil,

    I'm not sure I understand the point of your code. Could you
    please describe what the long octal-encoded byte sequence
    means? Is it a UTF-8 string with one invalid char? It is also strange
    that your source code has non-ASCII chars inside the C string.
    The encoding of these depends on the C compiler (!); it might be
    UTF-8, Latin-1, or anything else.

    In principle, the sentence you found is correct. The string
    representation of a Tcl_Obj is a string in the sense of Tcl; usually
    stored as UTF-8, with the exception that NUL bytes are encoded as C0 80
    so that the string can still be handled as NUL-terminated. If you want
    to handle arbitrary binary data, use a ByteArray; if it
    is a UTF-8 string with errors, do a script-level [encoding convertfrom],
    or do the same from the C level. The functions for this are
    described here:

    https://www.tcl.tk/man/tcl/TclLib/Encoding.html

    If you bypass the encodings and directly put chars into the string rep
    of a Tcl_Obj, you will get undefined behaviour if it is not correct UTF8.
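
    The C0 80 convention mentioned above can be illustrated with a small standalone sketch. It does not use the Tcl API; escape_nul_bytes is a hypothetical helper written only for this illustration, and it assumes the caller provides a destination buffer of at least 2*len + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of Tcl. Copies len bytes from src to dst,
 * writing every NUL byte as the two-byte sequence C0 80, then appends a
 * single terminating NUL. dst must hold at least 2*len + 1 bytes.
 * Returns the number of bytes written, not counting the terminator.
 */
size_t escape_nul_bytes(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0x00) {
            dst[n++] = 0xC0;   /* embedded NUL becomes C0 80 */
            dst[n++] = 0x80;
        } else {
            dst[n++] = src[i]; /* every other byte is copied unchanged */
        }
    }
    dst[n] = 0x00;             /* the only real NUL marks the end */
    return n;
}
```

    With the three input bytes 'a', NUL, 'b', the helper writes 'a', C0, 80, 'b' and a terminator, so code that looks for the first zero byte sees the whole value. It only rewrites NUL bytes and does nothing about other invalid sequences.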

    Christian

  • From Phillip Brooks@21:1/5 to All on Tue Feb 7 10:44:30 2023
    On Monday, February 6, 2023 at 5:36:08 PM UTC-8, saitology9 wrote:

    I ran your test input in tclsh (version 8.6.12) and it ran fine:
    Do you get a different result from the tclsh/wish shell?

    Yes, I similarly ran the test through tclsh and that succeeded. A more complete version is:

    set str1 "//j The quick brown fox jumps over the lazy dog"
    set str2 "//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣"
    puts $str1
    set str1_uc [ string toupper $str1 ]
    puts $str1_uc
    puts $str2
    set str2_uc [ string toupper $str2 ]
    puts $str2_uc

    I am not sure what tclsh is doing to prevent the error, though.

  • From briang@21:1/5 to Phillip Brooks on Tue Feb 7 13:17:13 2023
    This code is in violation according to the manual:

    objResultPtr = Tcl_NewStringObj(s1, length);
    length = Tcl_UtfToUpper(Tcl_GetString(objResultPtr)); Tcl_SetObjLength(objResultPtr, length);

    "Except for that limited purpose, the pointer returned by Tcl_GetStringFromObj or Tcl_GetString should be treated as read-only. It is recommended that this pointer be assigned to a (const char *) variable. Even in the limited situations where writing to
    this pointer is acceptable, one should take care to respect the copy-on-write semantics required by Tcl_Obj's, with appropriate calls to Tcl_IsShared and Tcl_DuplicateObj prior to any in-place modification of the string representation."

    -Brian

  • From Phillip Brooks@21:1/5 to briang on Tue Feb 7 14:10:11 2023
    Right - that bit was actually copied out from inside of Tcl someplace by the person that isolated it into a standalone problem. Our application was doing something different. This (more correct) code also crashes:

    char* s1_upper = (char*) malloc( length+1 );
    strcpy( s1_upper, s1 );
    length = Tcl_UtfToUpper(s1_upper);

  • From Phillip Brooks@21:1/5 to Christian Gollwitzer on Tue Feb 7 13:48:23 2023
    Thanks for the response, it definitely got me farther.

    On Monday, February 6, 2023 at 10:26:53 PM UTC-8, Christian Gollwitzer wrote:
    I'm not sure if I understand what the point of your code is. Could you
    please describe what the meaning of the long octal encoded byte sequence
    is?
    The string is basically undefined incorrect data - we know that. It is from a test suite
    checking to make sure that we can handle undefined incorrect data, so it could really
    be anything (I am not sure of the original source for the data).

    If you want
    to handle arbitrary binary data, then either use a ByteArray, and if it
    is an UTF8 string with errors, do a script level [encoding convertfrom],
    or you can do the same from the C level. The functions for this
    described here:

    https://www.tcl.tk/man/tcl/TclLib/Encoding.html

    From the C level, something like this?

    void test( Tcl_Interp *interp, const char* s1)
    {
        Tcl_Encoding utf8_encoding = Tcl_GetEncoding( interp, "utf-8" );
        int length = strlen(s1);
        printf("The length of \"%s\" is %d\n", s1, length);

        char* valid_s1 = (char*) malloc( length*2 );
        int length_read = 0;
        int length_written = 0;
        int rt = Tcl_ExternalToUtf(
            interp,
            utf8_encoding,
            s1,
            length,
            0,
            nullptr,
            valid_s1,
            length*2,
            &length_read,
            &length_written,
            nullptr );
        if ( rt != TCL_OK ) {
            return;
        }
        ...

    That does stop it from crashing, but, oddly enough, it also converts the invalid data to some other invalid data - but possibly valid UTF-8 nonetheless? It doesn't seem to damage any of the few valid UTF-8 strings I passed through it.

    If you bypass the encodings and directly put chars into the string rep
    of a Tcl_Obj, you will get undefined behaviour if it is not correct UTF8.

    OK - is that documented somewhere? I don't see anything to that effect in Tcl_NewStringObj.

  • From briang@21:1/5 to Phillip Brooks on Tue Feb 7 13:55:02 2023
    On Tuesday, February 7, 2023 at 1:48:26 PM UTC-8, Phillip Brooks wrote:
    OK - is that documented somewhere? I don't see anything to that effect in Tcl_NewStringObj.

    https://www.tcl-lang.org/man/tcl8.6/TclLib/StringObj.htm#M4

    "Points to the first byte of an array of UTF-8-encoded bytes used to set or append to a string value. This byte array may contain embedded null characters unless numChars is negative. (Applications needing null bytes should represent them as the two-byte
    sequence \300\200, use Tcl_ExternalToUtf to convert, or Tcl_NewByteArrayObj if the string is a collection of uninterpreted bytes.)"

  • From Christian Gollwitzer@21:1/5 to All on Wed Feb 8 07:58:19 2023
    On 07.02.23 at 22:48, Phillip Brooks wrote:
    Thanks for the response, it definitely got me farther.

    On Monday, February 6, 2023 at 10:26:53 PM UTC-8, Christian Gollwitzer wrote:
    I'm not sure if I understand what the point of your code is. Could you
    please describe what the meaning of the long octal encoded byte sequence
    is?
    The string is basically undefined incorrect data - we know that. It is from a test suite
    checking to make sure that we can handle undefined incorrect data, so it could really
    be anything (I am not sure of the original source for the data).

    OK - so it is not arbitrary binary data (that would be


    From the C level, something like this?

    In principle, yes. Instead of the malloc I would use
    Tcl_ExternalToUtfDString, which allocates the correct number of bytes
    itself, and you can easily convert the DString into a Tcl_Obj later on
    (without copying). Also consider caching the Tcl_Encoding if you call
    this function more often, and free the encoding at the end of the program.


    That does stop it from crashing, but, oddly enough, it also converts
    the invalid data to some other invalid data - but possibly valid UTF-8
    nonetheless? It doesn't seem to damage any of the few valid UTF-8
    strings I passed through it.

    Are you sure it is invalid data? It should have replaced all invalid
    chars by the Unicode encoding for "invalid".


    If you want to handle the encoding errors differently, I'm not sure how
    to do it in the current Tcl version. There is a discussion going on about
    improving Unicode support in Tcl 9, which will bring different failure
    modes, also at the script level, so that the application can decide
    which errors to reject etc.

    I had a similar problem in one of my projects, and I decided to check
    the data for UTF-8 compatibility manually. If it was incorrect data, I
    passed it in as a ByteArray. That was the right thing to do in this
    specific context. The code is here:

    https://github.com/BessyHDFViewer/HDFpp/blob/c5d384b3970f0bebc47e40224dc218cc6c441cbf/generic/SWObject.hpp#L56



    Christian

  • From Ralf Fassel@21:1/5 to All on Wed Feb 8 11:10:32 2023
    * Phillip Brooks <philbrks@gmail.com>
    | Yes, I similarly ran the test through tclsh and that succeeded. A more complete version is:

    | set str2 "//j\244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣"

    | I am not sure what tclsh is doing to prevent the error, though.

    % string length $str2
    57
    % string bytelength $str2
    113


    % set str1 \255
    ­
    % string length $str1
    1
    % string bytelength $str1
    2

    I think Tcl simply encodes the data you pass in (even with \ooo) as valid UTF-8.
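
    The \255 result above matches how UTF-8 works in general: in a Tcl script, \255 is an octal escape for the code point U+00AD, and UTF-8 needs two bytes for any code point from U+0080 to U+07FF. A minimal standalone sketch of that arithmetic follows; encode_utf8 is a hypothetical helper written for this illustration (not a Tcl function), and it only handles code points up to U+FFFF with no surrogate check.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of Tcl. Writes the UTF-8 encoding of the
 * code point cp (0 .. 0xFFFF) into out and returns the byte count (1 to 3).
 */
size_t encode_utf8(unsigned int cp, unsigned char out[3])
{
    assert(out != NULL);
    assert(cp <= 0xFFFF);

    if (cp < 0x80) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

    Calling encode_utf8(0255, buf) returns 2 and fills buf with C2 AD, which corresponds to the string length of 1 and byte length of 2 reported above. The C test program is different: there, \255 inside a C string literal puts the single raw byte 0xAD into the buffer, which is not valid UTF-8 on its own.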

    R'

  • From Phillip Brooks@21:1/5 to Ralf Fassel on Wed Feb 8 17:41:28 2023
    On Wednesday, February 8, 2023 at 2:10:37 AM UTC-8, Ralf Fassel wrote:
    I think TCL simply encodes the data you pass in (even with \ooo) as valid UTF-8.

    OK - probably something similar to the solution we reached using Tcl_ExternalToUtf above. Thanks for the observations.

  • From Phillip Brooks@21:1/5 to Christian Gollwitzer on Wed Feb 8 17:48:56 2023
    On Tuesday, February 7, 2023 at 10:58:23 PM UTC-8, Christian Gollwitzer wrote:

    In principle, yes. Instead of the malloc I would use Tcl_ExternalToUtfDString, which allocates the correct number of bytes itself, and you can easily convert the DString into a Tcl_Obj later on (without copying). Also consider caching the Tcl_Encoding if you call
    this function more often, and free the encoding at the end of the program.

    Are you sure it is invalid data? It should have replaced all invalid
    chars by the Unicode encoding for "invalid".

    OK - maybe not invalid, but meaningless nonetheless. Interestingly, I found out that this data was generated by our previous issue with the Korean UTF-8 string that was being interpreted as something else, after which junk was printed into one of our result files. That junk result file has since become part of our test suite. (Devilishly clever, some of these QA folks. Show them a bug, and they know they have spotted weakness. They cleverly dive in anew on the same spot and lay the groundwork to find another bug in the process.)

    If you want to handle the encoding errors differently, I'm not sure how
    to do it in current Tcl version. There is a discussion going on about improving Unicode support in Tcl 9, which will bring different failure
    modes also from the script level, so that the application can decide on which errors to reject etc.

    Avoiding crashing is the main objective here. We'll leave interpreting whether the UTF-8 means anything for another day (ChatGPTcl?)

    I had a similar problem in one of my projects, and I decided to check
    the data for UTF-8 compatibility manually. If it was incorrect data, I
    passed it in as a ByteArray. That was the right thing to do in this
    specific context. The code is here:

    https://github.com/BessyHDFViewer/HDFpp/blob/c5d384b3970f0bebc47e40224dc218cc6c441cbf/generic/SWObject.hpp#L56

    Thanks, I'll take a look!

  • From Ted Nolan @21:1/5 to philbrks@gmail.com on Thu Feb 9 04:06:26 2023
    I discovered once that "encoding convertfrom utf-8" would not
    throw an error if you invoked it on invalid (non utf-8) data, which I
    had not expected. I'm not sure what it actually does, but it's
    happy to hand you garbage.

    I wrote a little "is_utf8" extension based on code from:

    http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

    It turned out to be pretty easy.
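
    For readers who want to see what such a check involves, here is a minimal sketch. It is not the DFA decoder from the page linked above and not the extension described in this post; it is a plain loop, and is_valid_utf8 is a hypothetical name. It assumes strict UTF-8: overlong forms, surrogates and values above U+10FFFF are rejected, which means Tcl's internal C0 80 form for NUL is rejected too, so it is meant for external data before it is handed to Tcl.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of Tcl. Returns 1 if the len bytes at s
 * are well-formed strict UTF-8, otherwise 0.
 */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) {                      /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte, or 0xF8..0xFF */
        }

        if (len - i - 1 < need) {
            return 0;       /* sequence is cut off by the end of the data */
        }
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return 0;   /* expected a 10xxxxxx continuation byte */
            }
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }
        if (cp < min) return 0;                       /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */
        if (cp > 0x10FFFF) return 0;                  /* beyond Unicode */
        i += need + 1;
    }
    return 1;
}
```

    The test data from the start of this thread fails such a check almost immediately: the byte \244 (0xA4) after "//j " is a continuation byte with no lead byte in front of it. An application could use the result to decide between treating the data as text or as a byte array, as suggested earlier in the thread.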
    --
    columbiaclosings.com
    What's not in Columbia anymore..
