Forum: >>> Magnum BBS <<<

Re: tcl regexp performance

From pd@21:1/5 to All on Tue Apr 5 02:49:41 2022

now 1649120014 s
ok
exec time: 53.366160 s
now 1649120014 s
dif 0 s
Presione una tecla para continuar . . .

you can ignore this, is just another test with extra checking, you can see exec time is 53 s about the same of the linux shell measure with time p.pl

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From pd@21:1/5 to All on Tue Apr 5 02:47:07 2022

I was reading an article [1] about algorithms implementing regular expressions when I saw this picture [2], tcl regex performance is six orders of magnitude faster than perl regex for certain regular expression, I was really shocked and couldn't believe
it, so I've checked, and that's true, tcl regex works in microseconds where perl regex consume seconds:

% set s [string repeat a 30]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% set r "[string repeat a? 30][string repeat a 30]" a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% time {regex $r $s m}
199 microseconds per iteration

now 1649120014 s
ok
exec time: 53.366160 s
now 1649120014 s
dif 0 s
Presione una tecla para continuar . . .

$ time ./p.pl
ok
real 0m 52.30s
user 0m 52.12s
sys 0m 0.01s

with p.l :

#!/bin/perl
$s = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
if ( $s =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ ){
print "ok\n";
}

I really cannot believe perl regex which is the de facto standard performs so badly, the same applies to pcre lib and so most scripting languages like python, ruby... It's true this is only in certain cases which may be not so common but the article
gives examples of common regex affected.

Another point to tcl bag ;-)

[1] https://swtch.com/~rsc/regexp/regexp1.html
[2] https://swtch.com/~rsc/regexp/grep1p.png

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Harald Oehlmann@21:1/5 to All on Tue Apr 5 12:04:21 2022

Am 05.04.2022 um 11:49 schrieb pd:

now 1649120014 s
ok
exec time: 53.366160 s
now 1649120014 s
dif 0 s
Presione una tecla para continuar . . .

you can ignore this, is just another test with extra checking, you can see exec time is 53 s about the same of the linux shell measure with time p.pl

PD,

thank you. Yes, TCL is sometimes quite fast.
AFAIK, the used regexp lib requires UTF-16 data input.
In consequence, each string must be converted from utf-8 to utf-16
first. As TCL has no utf-16 internal string type, the conversion is done
quite frequently. Also other utf-16 interfaces like the Windows OS API
require this constant conversion. There is the idea to also support
utf-16 as internal data type form many commands, so we do not need to
recode so often. I suppose, this would again boost the regexp performance.

Take care,
Harald

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From pd@21:1/5 to All on Tue Apr 5 03:32:00 2022

El martes, 5 de abril de 2022 a las 11:47:10 UTC+2, pd escribió:

$ time ./p.pl
ok
real 0m 52.30s
user 0m 52.12s
sys 0m 0.01s

with p.l :

#!/bin/perl
$s = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
if ( $s =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ ){
print "ok\n";
}

in case you think the time difference is not due to regex processing itself but the overtime due to if and print here you are the prove it is not the case:

$ time ./p2.pl
real 0m 53.08s
user 0m 52.57s
sys 0m 0.03s

with p2.pl :

#!/bin/perl
$s = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
$s =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ ;

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From pd@21:1/5 to All on Tue Apr 5 03:38:08 2022

El martes, 5 de abril de 2022 a las 12:04:25 UTC+2, Harald Oehlmann escribió:

thank you. Yes, TCL is sometimes quite fast.
AFAIK, the used regexp lib requires UTF-16 data input.
In consequence, each string must be converted from utf-8 to utf-16
first. As TCL has no utf-16 internal string type, the conversion is done quite frequently. Also other utf-16 interfaces like the Windows OS API require this constant conversion. There is the idea to also support
utf-16 as internal data type form many commands, so we do not need to
recode so often. I suppose, this would again boost the regexp performance.

it's really a very good job for tcl developers team, and if performance will be increased even more with internal support of utf-16 then it will be lightning fast, really today there's no reason to choose other scripting langs specially perl for heavy
regex or string processing tasks, tcl should be considered as a real winner.

For me it's very surprissing this perl (and similar) performance because perl is *the* language when you think of regex processing and pcre is *the* algorithm for implementing it, I'm aware poor performance should be corner cases but it's so
significative I really cannot believe a better algorithm has been already chosen.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christian Gollwitzer@21:1/5 to All on Tue Apr 5 20:48:06 2022

Am 05.04.22 um 11:47 schrieb pd:

I was reading an article [1] about algorithms implementing regular expressions when I saw this picture [2], tcl regex performance is six orders of magnitude faster than perl regex for certain regular expression, I was really shocked and couldn't

believe it

Yes, it is true, now here comes the drawback: this speed advtantage
comes for certain regular expressions where there are many backtracking
points in the RE. It only occurs in specially crafted REs. If you
benchmark real-world REs, then Tcl is on the average 2x slower then the
Perl RE engine, and especially compared to the pcre JIT compiler.
Furthermore, the engine is not any longer maintained IIRC. I cannot off
the top of my head find the citations for this, but overall, I'm not
overly convinced, unfortunately.

Christian

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From pd@21:1/5 to All on Wed Apr 6 06:27:03 2022

El martes, 5 de abril de 2022 a las 20:48:11 UTC+2, Christian Gollwitzer escribió:

Furthermore, the engine is not any longer maintained IIRC. I cannot off
the top of my head find the citations for this, but overall, I'm not
overly convinced, unfortunately.

do you mean tcl regex it's not manteined? what a pity!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christian Gollwitzer@21:1/5 to All on Wed Apr 6 22:00:17 2022

Am 06.04.22 um 21:25 schrieb jtyler:

On 4/5/2022 2:47 AM, pd wrote:

% set s [string repeat a 30]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% set r "[string repeat a? 30][string repeat a 30]"
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

% time {regex $r $s m}
199 microseconds per iteration

A lot goes on behind the scenes in tcl. There's caching of the r.e. expression going on somewhere.

That's not the reason: the regexp engine in Tcl transform the NFA for
the regexp into a DFA - those are two standard implementations for
regexp state machines. For some specially crafted REs like the ones
above, with many overlapping matches, the NFA becomes very slow -
exponential runtime in the length of the expression, IIRC - whereas the
DFA still works quickly.

You cand find an overview comparison of these algorithms here:

https://zherczeg.github.io/sljit/regex_compare.html

Christian

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jtyler@21:1/5 to All on Wed Apr 6 12:25:37 2022

On 4/5/2022 2:47 AM, pd wrote:

% set s [string repeat a 30]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% set r "[string repeat a? 30][string repeat a 30]" a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% time {regex $r $s m}
199 microseconds per iteration

A lot goes on behind the scenes in tcl. There's caching of the r.e.
expression going on somewhere. I'm not sure if the r.e. engine caches
them (say after compiling a r.e.). Or, perhaps tcl's object system does
it or both.

The below try with a second variable, and making a copy it's not likely
to know about, suggests to me the r.e. engine is doing the caching.

One should probably also use the repeat factor with the [time] command
to get a better timing.

1 % set s [string repeat a 30]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 % set r "[string repeat a? 30][string repeat a 30]" a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

3 % time {regexp $r $s m}
344 microseconds per iteration

4 % time {regexp $r $s m}
96 microseconds per iteration

5 % set r2 $r a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

6 % time {regexp $r2 $s m}
100 microseconds per iteration

7 % set r2 [string range $r 0 end] a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

8 % time {regexp $r2 $s m}
95 microseconds per iteration

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jtyler@21:1/5 to Christian Gollwitzer on Wed Apr 6 15:13:52 2022

On 4/6/2022 1:00 PM, Christian Gollwitzer wrote:

Am 06.04.22 um 21:25 schrieb jtyler:

On 4/5/2022 2:47 AM, pd wrote:

% set s [string repeat a 30]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% set r "[string repeat a? 30][string repeat a 30]"
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

% time {regex $r $s m}
199 microseconds per iteration

A lot goes on behind the scenes in tcl. There's caching of the r.e.
expression going on somewhere.

That's not the reason: the regexp engine in Tcl transform the NFA for
the regexp into a DFA - those are two standard implementations for
regexp state machines. For some specially crafted REs like the ones
above, with many overlapping matches, the NFA becomes very slow -
exponential runtime in the length of the expression, IIRC - whereas the
DFA still works quickly.

You cand find an overview comparison of these algorithms here:

https://zherczeg.github.io/sljit/regex_compare.html

Christian

Isn't there some caching going on as well?

I see things have changed since I last studied r.e. using Kernighan's
book software tools way back in my youth. But I would imagine that the
division into compiling an r.e. and matching likely still exist and the difference between the first run and the second could be that the r.e.
string had been processed, and apparently doesn't need to be done in the
second run.

Or can you think of some other reason why the second run is so much
faster than the first?

I seem to recall reading somewhere that the tcl_obj has a type of r.e.
that it can reuse in the same way as shimmering between strings and
binary numbers. Could that be the case here?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christian Gollwitzer@21:1/5 to All on Thu Apr 7 07:26:49 2022

Am 07.04.22 um 00:13 schrieb jtyler:

On 4/6/2022 1:00 PM, Christian Gollwitzer wrote:

Am 06.04.22 um 21:25 schrieb jtyler:

On 4/5/2022 2:47 AM, pd wrote:

% set s [string repeat a 30]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% set r "[string repeat a? 30][string repeat a 30]"
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

% time {regex $r $s m}
199 microseconds per iteration

A lot goes on behind the scenes in tcl. There's caching of the r.e.
expression going on somewhere.

That's not the reason: the regexp engine in Tcl transform the NFA for
the regexp into a DFA - those are two standard implementations for
regexp state machines. For some specially crafted REs like the ones
above, with many overlapping matches, the NFA becomes very slow -
exponential runtime in the length of the expression, IIRC - whereas
the DFA still works quickly.

You cand find an overview comparison of these algorithms here:

https://zherczeg.github.io/sljit/regex_compare.html

Christian

Isn't there some caching going on as well?

Yes, indeed, the compilation of the regexp is cached in the internal representation of the Tcl_Obj for the regexp.

Or can you think of some other reason why the second run is so much
faster than the first?

You are right that this is the reason that the second run is faster than
the first. I was answering to the huge speed difference (seconds vs. microseconds) between the Perl RE engine and the Tcl RE engine for this particular (maliciously crafted) RE that the OP mentioned. Sorry for
being unclear.

I seem to recall reading somewhere that the tcl_obj has a type of r.e.
that it can reuse in the same way as shimmering between strings and
binary numbers. Could that be the case here?

You can check this for yourself easily with
tcl::unsupported::representation:

(chris) 50 % set RE {a*b}
a*b
(chris) 51 % ::tcl::unsupported::representation $RE
value is a pure string with a refcount of 5, object pointer at
0x7f823f35a7f0, string representation "a*b"
(chris) 52 % regexp $RE "blaaaaaab"
1
(chris) 53 % ::tcl::unsupported::representation $RE
value is a regexp with a refcount of 4, object pointer at
0x7f823f35a7f0, internal representation 0x7f823f36b910:0x7f823f35b150,
string representation "a*b"

Christian

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jtyler@21:1/5 to Christian Gollwitzer on Thu Apr 7 11:16:51 2022

On 4/6/2022 10:26 PM, Christian Gollwitzer wrote:

You can check this for yourself easily with
tcl::unsupported::representation:

(chris) 50 % set RE {a*b}
a*b
(chris) 51 % ::tcl::unsupported::representation $RE
value is a pure string with a refcount of 5, object pointer at 0x7f823f35a7f0, string representation "a*b"
(chris) 52 % regexp $RE "blaaaaaab"
1
(chris) 53 % ::tcl::unsupported::representation $RE
value is a regexp with a refcount of 4, object pointer at
0x7f823f35a7f0, internal representation 0x7f823f36b910:0x7f823f35b150,
string representation "a*b"

Christian

Thanks, I was confusing tcl objects, I thought the object was the
variable, not the data.

Below I see that it doesn't increment the ref counter for the $RE, but
does when a*b is entered literally, but they are the same object.

Most interesting, time to re-read Ashok's book chapter 10.

% set RE {a*b}
a*b

% ::tcl::unsupported::representation $RE
value is a pure string with a refcount of 4, object pointer at 04E0BA00,
string representation "a*b"

% ::tcl::unsupported::representation $RE
value is a pure string with a refcount of 4, object pointer at 04E0BA00,
string representation "a*b"

% ::tcl::unsupported::representation a*b
value is a pure string with a refcount of 5, object pointer at 04E0BA00,
string representation "a*b"

% ::tcl::unsupported::representation a*b
value is a pure string with a refcount of 6, object pointer at 04E0BA00,
string representation "a*b"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christian Gollwitzer@21:1/5 to All on Thu Apr 7 21:47:56 2022

Am 07.04.22 um 20:16 schrieb jtyler:

On 4/6/2022 10:26 PM, Christian Gollwitzer wrote:
Thanks, I was confusing tcl objects, I thought the object was the
variable, not the data.

Below I see that it doesn't increment the ref counter for the $RE, but
does when a*b is entered literally, but they are the same object.

...this is complicated. If you look at the reference count values ifrom
the the interactive shell, in order to print it, the refcount increases
for the string. Addditionally, for literals there is some kind of
literal caching going on. In a compiled proc, the bytecode compiler also
moves the literal out of the place so that repeated execution of a
command wit a literal value in a loop uses always the cached Tcl_obj and
so on.

Not easy to follow this in detail, in the code you just need to make
sure that every increment is paired with a decrement and then the
lifetime management works correctly.

Christian

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Centurion
  Thu Jun 19 09:28:56 2025
  from Berea, Ohio via Telnet
- Plume
  Thu Jun 19 09:20:41 2025
  from Uk via Telnet
- Centurion
  Thu Jun 19 04:37:29 2025
  from Berea, Ohio via Telnet
- Centurion
  Thu Jun 19 03:56:52 2025
  from Berea, Ohio via Telnet
- Gretchiie
  Thu Jun 19 00:45:29 2025
  from Derry, Nh via Telnet
- Brianm
  Wed Jun 18 22:00:11 2025
  from Glasgow via Raw
- Fred
  Wed Jun 18 21:54:08 2025
  from Northrhine-Westphalia, Germany via SSH
- Fred
  Wed Jun 18 21:51:49 2025
  from Northrhine-Westphalia, Germany via SSH

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	496
Nodes:	16 (3 / 13)
Uptime:	65:03:52
Calls:	9,764
Calls today:	5
Files:	13,745
Messages:	6,185,832

Re: tcl regexp performance

Who's Online

Recent Visitors

System Info