I want to describe our problems and solutions with UTF-8 performance. Maybe this will be useful for someone else in the community.
### The problem
We started experiencing performance problems when large messages were sent to and from our M application. Our application builds and parses strings containing the full JSON data of network messages.
### Incorrect theory
At first we thought the problem was the string building itself. Our code concatenates strings in the conventional M manner:
`set result=result_stringForTheCurrentIteration`
We thought this resulted in O(n^2) execution time in the length of the string.
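Roughly, the pattern looks like the following sketch (the routine and the generated pieces are made up for illustration):

```
build(count) ; illustrative sketch of the suspected pattern: append one piece per iteration
 new result,i
 set result=""
 for i=1:1:count set result=result_"{""n"":"_i_"},"
 quit result
```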
### Discoveries
However, experiments made us realise that the string building code was not the problem after all. We noticed that our code was only slow when the input messages contained non-ASCII characters!
We also learned more about M performance from this extremely interesting post by Bhaskar:
https://groups.google.com/g/comp.lang.mumps/c/MSVKLt0X6R4/m/zqBx52MTAgAJ
### Correct diagnosis
Instead, we figured out that it is the string manipulation routines in M that are slow for large non-ASCII strings.
The following code will have O(n) performance in the length of the string (and the value of `index`):
`s ch=$extract(largeString,index)`
This is not surprising. Characters in a UTF-8 encoded string have a variable byte length: ASCII characters are 1 byte, other characters consist of 2-4 bytes. To find the character at a particular index, the implementation of `$extract` has to start from the beginning and traverse all the characters to find the byte position of the one that it should return.
GT.M seems to have an optimisation so that if a string consists of only ASCII characters then `$extract` can fetch characters in O(1) time.
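To make the effect concrete, here is an illustrative scan (not our actual code) where every iteration pays the cost of locating character `i`, so the whole loop becomes O(n^2) for non-ASCII input:

```
scan(str) ; illustrative: per-character scan using $extract
 new i,ch,total
 set total=0
 for i=1:1:$length(str) set ch=$extract(str,i),total=total+$ascii(ch)
 quit total
```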
### Solution
Our solution is to switch from `$extract` and `$length` to `$zextract` and `$zlength` for string manipulation. The Z variants ignore UTF-8 and treat strings as sequences of bytes. Because of that `$zextract` can work in O(1) time in all cases.
The complication with this solution is that we have to write code to handle multi-byte characters ourselves. Fortunately this turned out to be pretty simple in our case.
All bytes of multi-byte UTF-8 characters have a value that is 128 or larger, while a 1-byte character has a value that is 127 or smaller. Because of this it is easy to distinguish them.
Have a look at the Wikipedia page for an explanation: https://en.wikipedia.org/wiki/UTF-8#Encoding
In our case we have to examine the 1-byte characters to generate correct JSON. The multi-byte characters, however, can simply be copied byte-by-byte to the output.
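As a rough illustration (not our exact routine), a byte-wise JSON string escaper along these lines can be written with `$zlength`, `$zextract` and `$zascii`:

```
jsonesc(str) ; illustrative sketch: byte-wise JSON string escaping
 new out,i,b
 set out=""
 for i=1:1:$zlength(str) do
 . set b=$zextract(str,i)
 . ; bytes of multi-byte UTF-8 characters are always 128 or larger: copy them through unchanged
 . if $zascii(b)>127 set out=out_b quit
 . if b="""" set out=out_"\""" quit
 . if b="\" set out=out_"\\" quit
 . ; control characters (byte value below 32) would also need \uXXXX escapes; omitted here
 . set out=out_b
 quit out
```

Because every `$zextract` is O(1), the loop as a whole stays O(n).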
In this way we have obtained O(n) execution time for our JSON generation and parsing routines.
It would be faster if the internal strings used UCS-2 encoding (char16_t); in that case there would be no need to produce duplicate `$length` and `$z*` functions. But gtm/db goes its own way.
I don't think it would be faster.
With UTF-16 you still have to deal with surrogate pairs to represent all of Unicode.
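For example, every code point above U+FFFF needs two 16-bit units, so indexing by 16-bit units still isn't the same as indexing by characters. An illustrative routine for the surrogate arithmetic (names made up):

```
surrpair(cp) ; illustrative: UTF-16 surrogate pair for a supplementary code point (cp>=65536)
 new v,high,low
 set v=cp-65536            ; offset from U+10000
 set high=55296+(v\1024)   ; 0xD800 plus the top 10 bits
 set low=56320+(v#1024)    ; 0xDC00 plus the low 10 bits
 quit high_","_low
```

U+1F600 (code point 128512), for instance, becomes the pair 55357,56832 (0xD83D,0xDE00).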