• Adventures with UTF-8 performance

From Jens Lideström@21:1/5 to All on Tue Nov 1 02:31:43 2022
    I want to describe our problems and solutions with UTF-8 performance. Maybe this will be useful for someone else in the community.

    ### The problem

    We started experiencing performance problems when large messages were sent to and from our M application. Our application builds and parses strings containing the full JSON data of network messages.

    ### Incorrect theory

At first we thought the problem was the string building itself. Our code concatenates strings in the conventional M manner:

    set result=result_stringForTheCurrentIteration

We thought this resulted in O(n^2) execution time in the length of the string, since we assumed each concatenation copies the whole result built so far.
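
A minimal sketch of the kind of build loop we suspected (the names `buildResult` and `nextPiece` are hypothetical, standing in for our real routines):

    buildResult(n) ; build a string by n repeated concatenations
     new result,i
     set result=""
     for i=1:1:n set result=result_$$nextPiece(i)  ; we assumed each set copies the whole result
     quit result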

    ### Discoveries

However, experiments made us realise that the string building code was in fact not the problem. We noted that our code was only slow when the input messages contained non-ASCII characters!

    We also learned more about M performance from this extremely interesting post by Bhaskar:

    https://groups.google.com/g/comp.lang.mumps/c/MSVKLt0X6R4/m/zqBx52MTAgAJ

    ### Correct diagnosis

    Instead, we figured out that it is the string manipulation routines in M that are slow for large non-ASCII strings.

The following code takes O(n) time, where n is the length of the string (more precisely, the value of `index`):

    s ch=$extract(largeString,index)

This is not surprising. Characters in a UTF-8 encoded string have a variable byte length: ASCII characters are 1 byte, other characters consist of 2-4 bytes. To find the character at a particular index, the implementation of `$extract` has to start from the beginning and traverse all the preceding characters to find the byte position of the one it should return.
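
The character/byte distinction is easy to see by comparing the two length functions on a non-ASCII string (a small illustration; assumes a GT.M process running in UTF-8 mode):

    set s="naïve"
    write $length(s),!   ; 5 characters
    write $zlength(s),!  ; 6 bytes, because ï is encoded as 2 bytes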

    GT.M seems to have an optimisation so that if a string consists of only ASCII characters then `$extract` can fetch characters in O(1) time.

    ### Solution

Our solution is to switch from `$extract` and `$length` to `$zextract` and `$zlength` for string manipulation. The Z variants ignore UTF-8 and treat strings as sequences of bytes. Because of that, `$zextract` can work in O(1) time in all cases.

The complication with this solution is that we have to write code to handle multi-byte characters ourselves. Fortunately, this turned out to be pretty simple in our case.

All bytes of multi-byte UTF-8 characters have values of 128 or larger, while a 1-byte character has a value of 127 or smaller. Because of this it is easy to distinguish them.
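
For example, `$zascii` returns raw byte values, which shows the distinction directly (é is encoded as the two bytes 0xC3 0xA9):

    write $zascii("A",1),!  ; 65: a 1-byte ASCII character
    write $zascii("é",1),!  ; 195: lead byte of a multi-byte character
    write $zascii("é",2),!  ; 169: continuation byte, also above 127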

    Have a look at the Wikipedia page for an explanation: https://en.wikipedia.org/wiki/UTF-8#Encoding

In our case we have to examine the 1-byte characters to generate correct JSON. The multi-byte characters, however, can simply be copied byte by byte to the output.
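
Here is a hedged sketch of such a scan (`jsonEscape` and `hex4` are hypothetical names, not our actual routines; only the escapes JSON requires are shown):

    jsonEscape(in) ; JSON-escape a string, scanning it byte by byte
     new out,i,ch,b
     set out=""
     for i=1:1:$zlength(in) do
     . set ch=$zextract(in,i),b=$zascii(ch)
     . if b>127 set out=out_ch quit  ; byte of a multi-byte character: copy as-is
     . if ch="""" set out=out_"\""" quit  ; escape double quote
     . if ch="\" set out=out_"\\" quit  ; escape backslash
     . if b<32 set out=out_"\u"_$$hex4(b) quit  ; control characters need \uXXXX
     . set out=out_ch  ; any other ASCII byte passes through unchanged
     quit out
     ;
    hex4(n) ; n as a 4-digit uppercase hex string
     new h
     set h=""
     for  set h=$zextract("0123456789ABCDEF",n#16+1)_h,n=n\16 quit:n=0
     quit $zextract("0000",1,4-$zlength(h))_h

Multi-byte sequences pass through untouched, so the output stays valid UTF-8, and every byte is visited exactly once.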

In this way we have obtained O(n) execution time for our JSON generation and parsing routines.

  • From Onix Man@21:1/5 to All on Thu Nov 3 22:43:16 2022
On Tuesday, November 1, 2022 at 12:31:45 UTC+3, jens.li...@vgregion.se wrote:
I want to describe our problems and solutions with UTF-8 performance. [...]

It would be faster if the internal strings used UCS-2 encoding (char16_t); in that case there would be no need to produce duplicate $length and $z* functions. But GT.M goes its own way.

From Jens Lideström@21:1/5 to a1443...@gmail.com on Fri Nov 4 01:00:54 2022
    On Friday, November 4, 2022 at 6:43:17 AM UTC+1, a1443...@gmail.com wrote:
It would be faster if the internal strings used UCS-2 encoding (char16_t); in that case there would be no need to produce duplicate $length and $z* functions. But GT.M goes its own way.

    I don't think it would be faster.

With UTF-16 you still have to deal with surrogate pairs to represent all of Unicode.

    https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

  • From Alex Maslov@21:1/5 to All on Tue Nov 8 08:23:41 2022
On Friday, November 4, 2022 at 11:02:50 UTC+3, jens.li...@vgregion.se wrote:
    I don't think it would be faster.

With UTF-16 you still have to deal with surrogate pairs to represent all of Unicode.

Surrogate pairs are rare guests, in European languages at least.
Having some experience in both worlds (ISC and GT.M), I can confirm that ISC's UCS-2-based approach carries much less overhead for Unicode support than GT.M's.

From Jens Lideström@21:1/5 to All on Wed Nov 16 04:21:19 2022
    It's interesting to hear about your experience, Alex!

I guess it comes down to whether $length and its buddies must give the correct result in the presence of surrogate pairs, or whether it's acceptable to assume that every character is 2 bytes.

Java, for example, uses UTF-16 in its string interface and might give weird results for surrogate pairs.
