• Re: tcl regexp performance

    From pd@21:1/5 to All on Tue Apr 5 02:49:41 2022
    now 1649120014 s
    ok
    exec time: 53.366160 s
    now 1649120014 s
    dif 0 s
    Presione una tecla para continuar . . .

    you can ignore this, is just another test with extra checking, you can see exec time is 53 s about the same of the linux shell measure with time p.pl

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From pd@21:1/5 to All on Tue Apr 5 02:47:07 2022
    I was reading an article [1] about algorithms implementing regular expressions when I saw this picture [2], tcl regex performance is six orders of magnitude faster than perl regex for certain regular expression, I was really shocked and couldn't believe
    it, so I've checked, and that's true, tcl regex works in microseconds where perl regex consume seconds:

    % set s [string repeat a 30]
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    % set r "[string repeat a? 30][string repeat a 30]" a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    % time {regex $r $s m}
    199 microseconds per iteration

    now 1649120014 s
    ok
    exec time: 53.366160 s
    now 1649120014 s
    dif 0 s
    Presione una tecla para continuar . . .


    $ time ./p.pl
    ok
    real 0m 52.30s
    user 0m 52.12s
    sys 0m 0.01s

    with p.l :

    #!/bin/perl
    $s = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
    if ( $s =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ ){
    print "ok\n";
    }

    I really cannot believe perl regex which is the de facto standard performs so badly, the same applies to pcre lib and so most scripting languages like python, ruby... It's true this is only in certain cases which may be not so common but the article
    gives examples of common regex affected.

    Another point to tcl bag ;-)

    [1] https://swtch.com/~rsc/regexp/regexp1.html
    [2] https://swtch.com/~rsc/regexp/grep1p.png

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Harald Oehlmann@21:1/5 to All on Tue Apr 5 12:04:21 2022
    Am 05.04.2022 um 11:49 schrieb pd:
    now 1649120014 s
    ok
    exec time: 53.366160 s
    now 1649120014 s
    dif 0 s
    Presione una tecla para continuar . . .

    you can ignore this, is just another test with extra checking, you can see exec time is 53 s about the same of the linux shell measure with time p.pl


    PD,

    thank you. Yes, TCL is sometimes quite fast.
    AFAIK, the used regexp lib requires UTF-16 data input.
    In consequence, each string must be converted from utf-8 to utf-16
    first. As TCL has no utf-16 internal string type, the conversion is done
    quite frequently. Also other utf-16 interfaces like the Windows OS API
    require this constant conversion. There is the idea to also support
    utf-16 as internal data type form many commands, so we do not need to
    recode so often. I suppose, this would again boost the regexp performance.

    Take care,
    Harald

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From pd@21:1/5 to All on Tue Apr 5 03:32:00 2022
    El martes, 5 de abril de 2022 a las 11:47:10 UTC+2, pd escribió:
    $ time ./p.pl
    ok
    real 0m 52.30s
    user 0m 52.12s
    sys 0m 0.01s

    with p.l :

    #!/bin/perl
    $s = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
    if ( $s =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ ){
    print "ok\n";
    }

    in case you think the time difference is not due to regex processing itself but the overtime due to if and print here you are the prove it is not the case:

    $ time ./p2.pl
    real 0m 53.08s
    user 0m 52.57s
    sys 0m 0.03s

    with p2.pl :

    #!/bin/perl
    $s = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
    $s =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ ;

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From pd@21:1/5 to All on Tue Apr 5 03:38:08 2022
    El martes, 5 de abril de 2022 a las 12:04:25 UTC+2, Harald Oehlmann escribió:

    thank you. Yes, TCL is sometimes quite fast.
    AFAIK, the used regexp lib requires UTF-16 data input.
    In consequence, each string must be converted from utf-8 to utf-16
    first. As TCL has no utf-16 internal string type, the conversion is done quite frequently. Also other utf-16 interfaces like the Windows OS API require this constant conversion. There is the idea to also support
    utf-16 as internal data type form many commands, so we do not need to
    recode so often. I suppose, this would again boost the regexp performance.

    it's really a very good job for tcl developers team, and if performance will be increased even more with internal support of utf-16 then it will be lightning fast, really today there's no reason to choose other scripting langs specially perl for heavy
    regex or string processing tasks, tcl should be considered as a real winner.

    For me it's very surprissing this perl (and similar) performance because perl is *the* language when you think of regex processing and pcre is *the* algorithm for implementing it, I'm aware poor performance should be corner cases but it's so
    significative I really cannot believe a better algorithm has been already chosen.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Gollwitzer@21:1/5 to All on Tue Apr 5 20:48:06 2022
    Am 05.04.22 um 11:47 schrieb pd:
    I was reading an article [1] about algorithms implementing regular expressions when I saw this picture [2], tcl regex performance is six orders of magnitude faster than perl regex for certain regular expression, I was really shocked and couldn't
    believe it

    Yes, it is true, now here comes the drawback: this speed advtantage
    comes for certain regular expressions where there are many backtracking
    points in the RE. It only occurs in specially crafted REs. If you
    benchmark real-world REs, then Tcl is on the average 2x slower then the
    Perl RE engine, and especially compared to the pcre JIT compiler.
    Furthermore, the engine is not any longer maintained IIRC. I cannot off
    the top of my head find the citations for this, but overall, I'm not
    overly convinced, unfortunately.

    Christian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From pd@21:1/5 to All on Wed Apr 6 06:27:03 2022
    El martes, 5 de abril de 2022 a las 20:48:11 UTC+2, Christian Gollwitzer escribió:

    Furthermore, the engine is not any longer maintained IIRC. I cannot off
    the top of my head find the citations for this, but overall, I'm not
    overly convinced, unfortunately.

    do you mean tcl regex it's not manteined? what a pity!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Gollwitzer@21:1/5 to All on Wed Apr 6 22:00:17 2022
    Am 06.04.22 um 21:25 schrieb jtyler:
    On 4/5/2022 2:47 AM, pd wrote:

    % set s [string repeat a 30]
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    % set r "[string repeat a? 30][string repeat a 30]"
    a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

    % time {regex $r $s m}
    199 microseconds per iteration

    A lot goes on behind the scenes in tcl. There's caching of the r.e. expression going on somewhere.

    That's not the reason: the regexp engine in Tcl transform the NFA for
    the regexp into a DFA - those are two standard implementations for
    regexp state machines. For some specially crafted REs like the ones
    above, with many overlapping matches, the NFA becomes very slow -
    exponential runtime in the length of the expression, IIRC - whereas the
    DFA still works quickly.

    You cand find an overview comparison of these algorithms here:

    https://zherczeg.github.io/sljit/regex_compare.html


    Christian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jtyler@21:1/5 to All on Wed Apr 6 12:25:37 2022
    On 4/5/2022 2:47 AM, pd wrote:

    % set s [string repeat a 30]
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    % set r "[string repeat a? 30][string repeat a 30]" a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    % time {regex $r $s m}
    199 microseconds per iteration

    A lot goes on behind the scenes in tcl. There's caching of the r.e.
    expression going on somewhere. I'm not sure if the r.e. engine caches
    them (say after compiling a r.e.). Or, perhaps tcl's object system does
    it or both.

    The below try with a second variable, and making a copy it's not likely
    to know about, suggests to me the r.e. engine is doing the caching.

    One should probably also use the repeat factor with the [time] command
    to get a better timing.




    1 % set s [string repeat a 30]
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    2 % set r "[string repeat a? 30][string repeat a 30]" a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

    3 % time {regexp $r $s m}
    344 microseconds per iteration

    4 % time {regexp $r $s m}
    96 microseconds per iteration


    5 % set r2 $r a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

    6 % time {regexp $r2 $s m}
    100 microseconds per iteration



    7 % set r2 [string range $r 0 end] a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

    8 % time {regexp $r2 $s m}
    95 microseconds per iteration

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jtyler@21:1/5 to Christian Gollwitzer on Wed Apr 6 15:13:52 2022
    On 4/6/2022 1:00 PM, Christian Gollwitzer wrote:
    Am 06.04.22 um 21:25 schrieb jtyler:
    On 4/5/2022 2:47 AM, pd wrote:

    % set s [string repeat a 30]
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    % set r "[string repeat a? 30][string repeat a 30]"
    a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

    % time {regex $r $s m}
    199 microseconds per iteration

    A lot goes on behind the scenes in tcl. There's caching of the r.e.
    expression going on somewhere.

    That's not the reason: the regexp engine in Tcl transform the NFA for
    the regexp into a DFA - those are two standard implementations for
    regexp state machines. For some specially crafted REs like the ones
    above, with many overlapping matches, the NFA becomes very slow -
    exponential runtime in the length of the expression, IIRC - whereas the
    DFA still works quickly.

    You cand find an overview comparison of these algorithms here:

    https://zherczeg.github.io/sljit/regex_compare.html


        Christian

    Isn't there some caching going on as well?

    I see things have changed since I last studied r.e. using Kernighan's
    book software tools way back in my youth. But I would imagine that the
    division into compiling an r.e. and matching likely still exist and the difference between the first run and the second could be that the r.e.
    string had been processed, and apparently doesn't need to be done in the
    second run.

    Or can you think of some other reason why the second run is so much
    faster than the first?

    I seem to recall reading somewhere that the tcl_obj has a type of r.e.
    that it can reuse in the same way as shimmering between strings and
    binary numbers. Could that be the case here?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Gollwitzer@21:1/5 to All on Thu Apr 7 07:26:49 2022
    Am 07.04.22 um 00:13 schrieb jtyler:
    On 4/6/2022 1:00 PM, Christian Gollwitzer wrote:
    Am 06.04.22 um 21:25 schrieb jtyler:
    On 4/5/2022 2:47 AM, pd wrote:

    % set s [string repeat a 30]
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    % set r "[string repeat a? 30][string repeat a 30]"
    a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

    % time {regex $r $s m}
    199 microseconds per iteration

    A lot goes on behind the scenes in tcl. There's caching of the r.e.
    expression going on somewhere.

    That's not the reason: the regexp engine in Tcl transform the NFA for
    the regexp into a DFA - those are two standard implementations for
    regexp state machines. For some specially crafted REs like the ones
    above, with many overlapping matches, the NFA becomes very slow -
    exponential runtime in the length of the expression, IIRC - whereas
    the DFA still works quickly.

    You cand find an overview comparison of these algorithms here:

    https://zherczeg.github.io/sljit/regex_compare.html


         Christian

    Isn't there some caching going on as well?

    Yes, indeed, the compilation of the regexp is cached in the internal representation of the Tcl_Obj for the regexp.

    Or can you think of some other reason why the second run is so much
    faster than the first?

    You are right that this is the reason that the second run is faster than
    the first. I was answering to the huge speed difference (seconds vs. microseconds) between the Perl RE engine and the Tcl RE engine for this particular (maliciously crafted) RE that the OP mentioned. Sorry for
    being unclear.

    I seem to recall reading somewhere that the tcl_obj has a type of r.e.
    that it can reuse in the same way as shimmering between strings and
    binary numbers. Could that be the case here?

    You can check this for yourself easily with
    tcl::unsupported::representation:

    (chris) 50 % set RE {a*b}
    a*b
    (chris) 51 % ::tcl::unsupported::representation $RE
    value is a pure string with a refcount of 5, object pointer at
    0x7f823f35a7f0, string representation "a*b"
    (chris) 52 % regexp $RE "blaaaaaab"
    1
    (chris) 53 % ::tcl::unsupported::representation $RE
    value is a regexp with a refcount of 4, object pointer at
    0x7f823f35a7f0, internal representation 0x7f823f36b910:0x7f823f35b150,
    string representation "a*b"


    Christian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jtyler@21:1/5 to Christian Gollwitzer on Thu Apr 7 11:16:51 2022
    On 4/6/2022 10:26 PM, Christian Gollwitzer wrote:

    You can check this for yourself easily with
    tcl::unsupported::representation:

    (chris) 50 % set RE {a*b}
    a*b
    (chris) 51 % ::tcl::unsupported::representation $RE
    value is a pure string with a refcount of 5, object pointer at 0x7f823f35a7f0, string representation "a*b"
    (chris) 52 % regexp $RE "blaaaaaab"
    1
    (chris) 53 % ::tcl::unsupported::representation $RE
    value is a regexp with a refcount of 4, object pointer at
    0x7f823f35a7f0, internal representation 0x7f823f36b910:0x7f823f35b150,
    string representation "a*b"


        Christian




    Thanks, I was confusing tcl objects, I thought the object was the
    variable, not the data.

    Below I see that it doesn't increment the ref counter for the $RE, but
    does when a*b is entered literally, but they are the same object.

    Most interesting, time to re-read Ashok's book chapter 10.

    % set RE {a*b}
    a*b

    % ::tcl::unsupported::representation $RE
    value is a pure string with a refcount of 4, object pointer at 04E0BA00,
    string representation "a*b"

    % ::tcl::unsupported::representation $RE
    value is a pure string with a refcount of 4, object pointer at 04E0BA00,
    string representation "a*b"

    % ::tcl::unsupported::representation a*b
    value is a pure string with a refcount of 5, object pointer at 04E0BA00,
    string representation "a*b"

    % ::tcl::unsupported::representation a*b
    value is a pure string with a refcount of 6, object pointer at 04E0BA00,
    string representation "a*b"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Gollwitzer@21:1/5 to All on Thu Apr 7 21:47:56 2022
    Am 07.04.22 um 20:16 schrieb jtyler:
    On 4/6/2022 10:26 PM, Christian Gollwitzer wrote:
    Thanks, I was confusing tcl objects, I thought the object was the
    variable, not the data.

    Below I see that it doesn't increment the ref counter for the $RE, but
    does when a*b is entered literally, but they are the same object.

    ...this is complicated. If you look at the reference count values ifrom
    the the interactive shell, in order to print it, the refcount increases
    for the string. Addditionally, for literals there is some kind of
    literal caching going on. In a compiled proc, the bytecode compiler also
    moves the literal out of the place so that repeated execution of a
    command wit a literal value in a loop uses always the cached Tcl_obj and
    so on.

    Not easy to follow this in detail, in the code you just need to make
    sure that every increment is paired with a decrement and then the
    lifetime management works correctly.

    Christian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)