• possible bug in "regexp" syntax ".*?"

    From aotto1968@21:1/5 to All on Sun Jun 26 21:05:14 2022
    Hi,

    my code

    # @cartouche
    while {[regexp -indices {@cartouche[^@]+@smallexample(.*?)@end smallexample[^@]+@end cartouche} $txt allI valI]} {
    set val [string trim [string range $txt {*}$valI]]
    set val [string map { \\@\{ @\{ } $val ]
    set txt [string replace $txt {*}$allI "~~~~~\n$val\n~~~~~" ]
    }

    with data from:

    @cartouche
    @smallexample
    # file: quote.cfg
    quote = "Criticism may not be agreeable, but it is necessary."
    " It fulfils the same function as pain in the human"
    " body. It calls attention to an unhealthy state of"
    " things.\n"
    "\t--Winston Churchill";
    @end smallexample
    @end cartouche

    @cartouche
    @smallexample
    # file: test.cfg
    info: @{
    name = "Winston Churchill";
    @@include "quote.cfg"
    country = "UK";
    @};
    @end smallexample
    @end cartouche

    create the *false* match:

    ~~~~~
    # file: quote.cfg
    quote = "Criticism may not be agreeable, but it is necessary."
    " It fulfils the same function as pain in the human"
    " body. It calls attention to an unhealthy state of"
    " things.\n"
    "\t--Winston Churchill";
    @end smallexample
    @end cartouche

    @cartouche
    @smallexample
    # file: test.cfg
    info: @{
    name = "Winston Churchill";
    @@include "quote.cfg"
    country = "UK";
    @};
    ~~~~~

    because of error in ".*?" → should find the *smalles* match.
    the *good* solution would be:

    ~~~~~
    # file: quote.cfg
    quote = "Criticism may not be agreeable, but it is necessary."
    " It fulfils the same function as pain in the human"
    " body. It calls attention to an unhealthy state of"
    " things.\n"
    "\t--Winston Churchill";
    ~~~~~

    ~~~~~
    # file: test.cfg
    info: @{
    name = "Winston Churchill";
    @@include "quote.cfg"
    country = "UK";
    @};
    ~~~~~

    example online: https://regex101.com/r/AWkpiL/1

    mfg

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aotto1968@21:1/5 to All on Sun Jun 26 21:19:31 2022
    Just as info, the following code do the job (split-by-hand)

    # @cartouche
    while {[regexp -indices {@cartouche[^@]+@smallexample} $txt allI]} {
    lassign $allI start1I start2I
    if {![regexp -indices -start $start2I {@end\s+smallexample[^@]+@end\s+cartouche} $txt allI]} {
    error "don't find end of '@cartouche'"
    }
    lassign $allI end1I end2I
    set val [string trim [string range $txt $start2I+1 $end1I-1]]
    set val [string map { \\@\{ @\{ } $val ]
    set txt [string replace $txt $start1I $end2I "~~~~~\n$val\n~~~~~" ]
    }

    mfg

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Gollwitzer@21:1/5 to All on Sun Jun 26 21:40:22 2022
    Am 26.06.22 um 21:05 schrieb aotto1968:
    Hi,

    my code

      # @cartouche
      while {[regexp -indices {@cartouche[^@]+@smallexample(.*?)@end smallexample[^@]+@end cartouche} $txt allI valI]} {

    I haven't checked the details, but your RE mixes greedy (+) and
    non-greedy (.*?) quantifiers. AFAIK, this can lead to hard to understand behaviour. Quote from the manpage: https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.html
    "The matching rules for REs containing both normal and non-greedy
    quantifiers have changed since early beta-test versions of this package.
    (The new rules are much simpler and cleaner, but do not work as hard at guessing the user's real intentions.) "

    The page you used to test the RE uses PCRE, the most widely known RE
    engine, as opposed to Henry Spencer's RE engine as used by Tcl. In
    certain corner cases (esp. for maliciously crafted REs), the algorithm
    used by PCRE can be exponentially slow whereas Spencer's engine is
    polynomial, however, on average for non-malicious REs PCRE is faster.

    One could argue that Tcl 9 should migrate to PCRE and drop Spencer's RE,
    which is also badly maintained. Someone with more moment than me could
    write a TIP about it ;)

    Christian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From heinrichmartin@21:1/5 to Christian Gollwitzer on Sun Jun 26 22:55:05 2022
    On Sunday, June 26, 2022 at 9:40:27 PM UTC+2, Christian Gollwitzer wrote:
    Am 26.06.22 um 21:05 schrieb aotto1968:
    Hi,

    my code

    # @cartouche
    while {[regexp -indices {@cartouche[^@]+@smallexample(.*?)@end smallexample[^@]+@end cartouche} $txt allI valI]} {
    I haven't checked the details, but your RE mixes greedy (+) and
    non-greedy (.*?) quantifiers. AFAIK, this can lead to hard to understand behaviour. Quote from the manpage: https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.html
    "The matching rules for REs containing both normal and non-greedy quantifiers have changed since early beta-test versions of this package. (The new rules are much simpler and cleaner, but do not work as hard at guessing the user's real intentions.) "

    You are looking for the description of the _preference_ of a RE. Take-away message: you cannot mix greedy and non-greedy matching; and the *first* quantifier determines the preference of the full RE.

    I often fall back to _not_ create the super-duper long regexp that matches the whole thing from a stream (I am using Expect a lot), but match the header first (anchored at the start of the buffer, discarding junk), then find the matching footer; i.e.
    when magic[1] does not help, go the extra mile and implement that state-machine.

    Having that said, you could also look into packages that facilitate parsing (lexers). There used to be fickle/fcl (that was used by tcldoc; I am writing in past-tense, because a quick web search did not show their current homes).

    [1] https://xkcd.com/208/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to heinrichmartin on Mon Jun 27 13:49:46 2022
    heinrichmartin <martin.heinrich@frequentis.com> wrote:
    Having that said, you could also look into packages that facilitate
    parsing (lexers). There used to be fickle/fcl (that was used by
    tcldoc; I am writing in past-tense, because a quick web search did
    not show their current homes).

    [1] https://xkcd.com/208/

    Tcllib's PEG parser generator is relatively easy to use, once one
    grasps the syntax.

    http://tmml.sourceforge.net/doc/tcllib/peg.html

    There was also a nice example posted here to the group some months back
    for a PEG grammar. I forget now who posted it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Oleg Nemanov@21:1/5 to All on Mon Jul 11 05:44:08 2022
    воскресенье, 26 июня 2022 г. в 22:40:27 UTC+3, Christian Gollwitzer:
    One could argue that Tcl 9 should migrate to PCRE and drop Spencer's RE, which is also badly maintained. Someone with more moment than me could
    write a TIP about it ;)

    No-no. Let's stay RE engine as is :-). If someone want to use PCRE he can do this
    with external lib even now.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)