• Adventures with UTF-8 performance

From Jens Lideström@21:1/5 to All on Tue Nov 1 02:31:43 2022
    I want to describe our problems and solutions with UTF-8 performance. Maybe this will be useful for someone else in the community.

    ### The problem

    We started experiencing performance problems when large messages were sent to and from our M application. Our application builds and parses strings containing the full JSON data of network messages.

    ### Incorrect theory

At first we thought the problem was the string building itself. Our code concatenates strings in the conventional M manner:

    set result=result_stringForTheCurrentIteration

We thought this resulted in O(n^2) execution time in the length of the string, since we assumed each concatenation copies the whole result built so far.
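
A minimal sketch of the kind of build loop we suspected (the names `buildResult` and `nextPiece` are hypothetical, standing in for our real routines):

    buildResult(n) ; build a string by n repeated concatenations
     new result,i
     set result=""
     for i=1:1:n set result=result_$$nextPiece(i)  ; we assumed each set copies the whole result
     quit result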

    ### Discoveries

However, experiments made us realise that the string building code was in fact not the problem. We noted that our code was only slow when the input messages contained non-ASCII characters!

    We also learned more about M performance from this extremely interesting post by Bhaskar:

    https://groups.google.com/g/comp.lang.mumps/c/MSVKLt0X6R4/m/zqBx52MTAgAJ

    ### Correct diagnosis

    Instead, we figured out that it is the string manipulation routines in M that are slow for large non-ASCII strings.

The following code takes O(n) time, where n is the length of the string (more precisely, the value of `index`):

    s ch=$extract(largeString,index)

This is not surprising. Characters in a UTF-8 encoded string have a variable byte length: ASCII characters are 1 byte, other characters consist of 2-4 bytes. To find the character at a particular index, the implementation of `$extract` has to start from the beginning and traverse all the preceding characters to find the byte position of the one it should return.
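
The character/byte distinction is easy to see by comparing the two length functions on a non-ASCII string (a small illustration; assumes a GT.M process running in UTF-8 mode):

    set s="naïve"
    write $length(s),!   ; 5 characters
    write $zlength(s),!  ; 6 bytes, because ï is encoded as 2 bytes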

    GT.M seems to have an optimisation so that if a string consists of only ASCII characters then `$extract` can fetch characters in O(1) time.

    ### Solution

Our solution is to switch from `$extract` and `$length` to `$zextract` and `$zlength` for string manipulation. The Z variants ignore UTF-8 and treat strings as sequences of bytes. Because of that, `$zextract` can work in O(1) time in all cases.

The complication with this solution is that we have to write code to handle multi-byte characters ourselves. Fortunately, this turned out to be pretty simple in our case.

All bytes of multi-byte UTF-8 characters have values of 128 or larger, while a 1-byte character has a value of 127 or smaller. Because of this it is easy to distinguish them.
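
For example, `$zascii` returns raw byte values, which shows the distinction directly (é is encoded as the two bytes 0xC3 0xA9):

    write $zascii("A",1),!  ; 65: a 1-byte ASCII character
    write $zascii("é",1),!  ; 195: lead byte of a multi-byte character
    write $zascii("é",2),!  ; 169: continuation byte, also above 127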

    Have a look at the Wikipedia page for an explanation: https://en.wikipedia.org/wiki/UTF-8#Encoding

In our case we have to examine the 1-byte characters to generate correct JSON. The multi-byte characters, however, can simply be copied byte by byte to the output.
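
Here is a hedged sketch of such a scan (`jsonEscape` and `hex4` are hypothetical names, not our actual routines; only the escapes JSON requires are shown):

    jsonEscape(in) ; JSON-escape a string, scanning it byte by byte
     new out,i,ch,b
     set out=""
     for i=1:1:$zlength(in) do
     . set ch=$zextract(in,i),b=$zascii(ch)
     . if b>127 set out=out_ch quit  ; byte of a multi-byte character: copy as-is
     . if ch="""" set out=out_"\""" quit  ; escape double quote
     . if ch="\" set out=out_"\\" quit  ; escape backslash
     . if b<32 set out=out_"\u"_$$hex4(b) quit  ; control characters need \uXXXX
     . set out=out_ch  ; any other ASCII byte passes through unchanged
     quit out
     ;
    hex4(n) ; n as a 4-digit uppercase hex string
     new h
     set h=""
     for  set h=$zextract("0123456789ABCDEF",n#16+1)_h,n=n\16 quit:n=0
     quit $zextract("0000",1,4-$zlength(h))_h

Multi-byte sequences pass through untouched, so the output stays valid UTF-8, and every byte is visited exactly once.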

In this way we have obtained O(n) execution time for our JSON generation and parsing routines.

  • From Onix Man@21:1/5 to All on Thu Nov 3 22:43:16 2022
On Tuesday, November 1, 2022 at 12:31:45 UTC+3, jens.li...@vgregion.se wrote:
I want to describe our problems and solutions with UTF-8 performance. [...]

It would be faster if the internal strings used UCS-2 encoding (char16_t); in that case there would be no need to produce duplicate $length and $z* functions. But GT.M goes its own way.

From Jens Lideström@21:1/5 to a1443...@gmail.com on Fri Nov 4 01:00:54 2022
    On Friday, November 4, 2022 at 6:43:17 AM UTC+1, a1443...@gmail.com wrote:
It would be faster if the internal strings used UCS-2 encoding (char16_t); in that case there would be no need to produce duplicate $length and $z* functions. But GT.M goes its own way.

    I don't think it would be faster.

With UTF-16 you still have to deal with surrogate pairs to represent all of Unicode.

    https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

  • From Alex Maslov@21:1/5 to All on Tue Nov 8 08:23:41 2022
On Friday, November 4, 2022 at 11:02:50 UTC+3, jens.li...@vgregion.se wrote:
    I don't think it would be faster.

With UTF-16 you still have to deal with surrogate pairs to represent all of Unicode.

Surrogate pairs are rare guests, in European languages at least.
Having some experience in both worlds (ISC and GT.M), I can confirm that ISC's UCS-2-based approach carries much less overhead for Unicode support than GT.M's.

From Jens Lideström@21:1/5 to All on Wed Nov 16 04:21:19 2022
    It's interesting to hear about your experience, Alex!

I guess it comes down to whether $length and its buddies must give the correct result in the presence of surrogate pairs, or whether it's acceptable to assume that every character is 2 bytes.

Java, for example, uses UTF-16 in its string interface and might give weird results for surrogate pairs.
