• why is $ not literal mid-string in an ERE?

    From Ed Morton@21:1/5 to All on Thu Aug 18 08:27:46 2022
    When I write a regexp that has a `$` in the middle of it I write it as
    either of:

    sed 's/foo\$bar/stuff/'
    sed 's/foo[$]bar/stuff/'

    so that it's clear the `$` should be treated literally. Given that, I've
    never noticed before that an unescaped `$` mid-regexp is treated
    differently in BREs vs EREs, e.g.:

    $ echo 'foo$bar' | sed 's/foo$bar/stuff/'
    stuff

    $ echo 'foo$bar' | sed -E 's/foo$bar/stuff/'
    foo$bar
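
    The escaped and bracketed spellings above sidestep the difference
    entirely. A minimal sketch, assuming a sed that accepts -E (GNU and
    BSD sed do); the variable names are only for illustration:

```shell
# Replace a literal "foo$bar". Neither spelling depends on how an
# unescaped mid-pattern $ is parsed, so BRE and ERE should agree.
input='foo$bar'

bre_escaped=$(printf '%s\n' "$input" | sed 's/foo\$bar/stuff/')
bre_bracket=$(printf '%s\n' "$input" | sed 's/foo[$]bar/stuff/')
ere_escaped=$(printf '%s\n' "$input" | sed -E 's/foo\$bar/stuff/')
ere_bracket=$(printf '%s\n' "$input" | sed -E 's/foo[$]bar/stuff/')

# Print the four results, one per line.
printf '%s\n' "$bre_escaped" "$bre_bracket" "$ere_escaped" "$ere_bracket"
```

    All four commands should print "stuff".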

    As far as I can see, the relevant quotes of the POSIX spec (https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html)
    for BREs are:

    -----
    $
    The <dollar-sign> shall be special when used as an anchor.

    A <dollar-sign> ( '$' ) shall be an anchor when used as the last
    character of an entire BRE. The implementation may treat a
    <dollar-sign> as an anchor when used as the last character of a
    subexpression. The <dollar-sign> shall anchor the expression (or
    optionally subexpression) to the end of the string being matched; the
    <dollar-sign> can be said to match the end-of-string following the
    last character.
    -----

    and for EREs (emphasis mine):

    -----
    $
    The <dollar-sign> shall be special when used as an anchor.

    A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
    expression or subexpression it ends to the end of a string; such an
    expression or subexpression can match only a sequence ending at the
    last character of a string. For example, the EREs "ef$" and "(ef$)"
    match "ef" in the string "abcdef", but fail to match in the string
    "cdefab", and **the ERE "e$f" is valid, but can never match because
    the 'f' prevents the expression "e$" from matching ending at the last
    character**.
    -----
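
    The ERE examples in that quote can be tried directly. This is only a
    sketch: `match` is a hypothetical helper, and it assumes a grep whose
    -E mode follows the quoted text (GNU grep appears to):

```shell
# match PATTERN STRING: print "match" if the ERE matches anywhere in
# STRING, otherwise "no match". Helper name is illustrative only.
match() {
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then
        echo 'match'
    else
        echo 'no match'
    fi
}

match 'ef$'   'abcdef'   # "ef" ends the string
match '(ef$)' 'abcdef'   # same, inside a subexpression
match 'ef$'   'cdefab'   # "ef" is not at the end
match 'e$f'   'e$f'      # $ is an anchor here, not a literal
match 'e$f'   'ef'       # the 'f' after the anchor can never match
```

    Under the quoted wording, the first two calls should report a match
    and the last three should not.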

    So, the BRE section doesn't explicitly state what `$` means when it's
    not at the end of a regexp but given the "special when used as an
    anchor" statement, it makes sense to take that as meaning it's literal
    otherwise and that is how the various tools I've tried are
    interpreting it.

    The ERE section, however, has that same statement about `$` being
    special when used as an anchor, but then goes on to state that when
    it's mid-regexp, e.g. `e$f`, it should NOT be treated literally even
    though doing so means the regexp that includes it can never match
    anything.

    That ERE specification seems odd - why interpret `$` in a way that's
    different from BREs and results in a regexp that can never match
    anything instead of simply treating it as literal, same as BREs do?

    Does anyone have any insight into why a `$` mid-regexp is treated that
    way in EREs?

    Ed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Oğuz@21:1/5 to Ed Morton on Thu Aug 18 21:29:19 2022
    On 8/18/22 4:27 PM, Ed Morton wrote:
    That ERE specification seems odd - why interpret `$` in a way that's different from BREs and results in a regexp that can never match
    anything instead of simply treating it as literal, same as BREs do?

    The standard simply documents existing practice here. Under XRAT A.9.3.8
    it says:

    The ability of '^', '$', and '*' to be non-special in certain circumstances may be confusing to
    some programmers, but this situation was changed only in a minor way from historical practice
    to avoid breaking many historical scripts. Some consideration was given to making the use of
    the anchoring characters undefined if not escaped and not at the beginning or end of strings.
    This would cause a number of historical BREs, such as "2^10", "$HOME", and "$1.35", that
    relied on the characters being treated literally, to become invalid.
    ERE anchoring has been different from BRE anchoring in all historical systems. An unescaped
    anchor character has never matched its literal counterpart outside a bracket expression. Some
    implementations treated "foo$bar" as a valid expression that never matched anything; others
    treated it as invalid. POSIX.1-202x mandates the former, valid unmatched behavior.
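
    The three historical BREs named in that rationale make a handy test.
    A rough sketch, assuming grep without -E uses BRE syntax and grep -E
    uses ERE syntax; the counter names are made up for the example:

```shell
# Match each historical pattern against its own text, once as a BRE and
# once as an ERE. Single quotes stop the shell from expanding $HOME and $1.
bre_hits=0
ere_hits=0
for pat in '2^10' '$HOME' '$1.35'; do
    if printf '%s\n' "$pat" | grep -q -- "$pat"; then
        bre_hits=$((bre_hits + 1))
    fi
    if printf '%s\n' "$pat" | grep -Eq -- "$pat"; then
        ere_hits=$((ere_hits + 1))
    fi
done
echo "BRE matches: $bre_hits, ERE matches: $ere_hits"
```

    If the implementation follows the rationale, every pattern should
    match itself as a BRE and none should match as an ERE.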

  • From Ed Morton@21:1/5 to All on Thu Aug 18 13:44:17 2022
    On 8/18/2022 1:29 PM, Oğuz wrote:
    On 8/18/22 4:27 PM, Ed Morton wrote:
    That ERE specification seems odd - why interpret `$` in a way that's
    different from BREs and results in a regexp that can never match
    anything instead of simply treating it as literal, same as BREs do?

    The standard simply documents existing practice here. Under XRAT A.9.3.8
    it says:

    OK, I see that at:

    https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_08


    The ability of '^', '$', and '*' to be non-special in certain circumstances may be confusing to
    some programmers, but this situation was changed only in a minor way from historical practice
    to avoid breaking many historical scripts. Some consideration was given to making the use of
    the anchoring characters undefined if not escaped and not at the beginning or end of strings.
    This would cause a number of historical BREs, such as "2^10", "$HOME", and "$1.35", that
    relied on the characters being treated literally, to become invalid.
    ERE anchoring has been different from BRE anchoring in all historical systems. An unescaped
    anchor character has never matched its literal counterpart outside a bracket expression. Some
    implementations treated "foo$bar" as a valid expression that never matched anything; others
    treated it as invalid. POSIX.1-202x mandates the former, valid unmatched behavior.

    Them discussing `foo$bar` in that article when I used that exact string
    in my question is quite a coincidence!

    Thanks,

    Ed.

  • From Helmut Waitzmann@21:1/5 to All on Thu Aug 18 18:01:13 2022
    Ed Morton <mortonspam@gmail.com>:

    When I write a regexp that has a `$` in the middle of it I write it
    as either of:

    sed 's/foo\$bar/stuff/'
    sed 's/foo[$]bar/stuff/'

    so that it's clear the `$` should be treated literally. Given that,
    I've never noticed before

    So do I.


    that an unescaped `$` mid-regexp is treated differently in BREs vs
    EREs, e.g.:

    $ echo 'foo$bar' | sed 's/foo$bar/stuff/'
    stuff

    $ echo 'foo$bar' | sed -E 's/foo$bar/stuff/'
    foo$bar

    As far as I can see, the relevant quotes of the POSIX spec
    (https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html)
    for BREs are:

    -----
    $
    The <dollar-sign> shall be special when used as an anchor.

    A <dollar-sign> ( '$' ) shall be an anchor when used as the last
    character of an entire BRE. The implementation may treat a
    <dollar-sign> as an anchor when used as the last character of a
    subexpression. The <dollar-sign> shall anchor the expression (or
    optionally subexpression) to the end of the string being matched; the
    <dollar-sign> can be said to match the end-of-string following the
    last character.
    -----

    and for EREs (emphasis mine):

    -----
    $
    The <dollar-sign> shall be special when used as an anchor.

    A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
    expression or subexpression it ends to the end of a string; such an
    expression or subexpression can match only a sequence ending at the
    last character of a string. For example, the EREs "ef$" and "(ef$)"
    match "ef" in the string "abcdef", but fail to match in the string
    "cdefab", and **the ERE "e$f" is valid, but can never match because
    the 'f' prevents the expression "e$" from matching ending at the last
    character**.
    -----

    So, the BRE section doesn't explicitly state what `$` means when
    it's not at the end of a regexp but given the "special when used as
    an anchor" statement, it makes sense to take that as meaning it's
    literal otherwise and that is how the various tools I've tried are
    interpreting it.

    The ERE section, however, has that same statement about `$` being
    special when used as an anchor, but then goes on to state that when
    it's mid-regexp, e.g. `e$f`, it should NOT be treated literally
    even though doing so means the regexp that includes it can never
    match anything.

    That ERE specification seems odd - why interpret `$` in a way
    that's different from BREs and results in a regexp that can never
    match anything instead of simply treating it as literal, same as
    BREs do?


    I feel it the other way round:  The BRE specification seems odd: 
    Why should an unescaped unbracketed dollar sign only be interpreted
    special when it's at the end of an expression?  (That constraint
    doesn't harm, though, because a dollar sign one wants to have its
    literal meaning can be escaped or bracketed even when not at the end
    of an expression.)

    Does anyone have any insight into why a `$` mid-regexp is treated
    that way in EREs?

    I don't know, but I guess it's the principle “Keep it simple”:  A
    dollar‐sign is special, unless it is either inside a bracket
    expression or preceded by a quoting backslash.  There is no
    additional constraining rule “unless it is inside of an expression”,
    which is unnecessary to bring dollar signs with their literal
    meaning into regular expressions.

    And I guess with BREs it was too late to abandon that constraining
    rule without breaking existing utilities.

    But that all is just a guess.

  • From Ed Morton@21:1/5 to Helmut Waitzmann on Thu Aug 18 15:40:43 2022
    On 8/18/2022 11:01 AM, Helmut Waitzmann wrote:
    Ed Morton <mortonspam@gmail.com>:

    When I write a regexp that has a `$` in the middle of it I write it as
    either of:

       sed 's/foo\$bar/stuff/'
       sed 's/foo[$]bar/stuff/'

    so that it's clear the `$` should be treated literally. Given that,
    I've never noticed before

    So do I.

    that an unescaped `$` mid-regexp is treated differently in BREs vs
    EREs, e.g.:

       $ echo 'foo$bar' | sed 's/foo$bar/stuff/'
       stuff

       $ echo 'foo$bar' | sed -E 's/foo$bar/stuff/'
       foo$bar

    As far as I can see, the relevant quotes of the POSIX spec
    (https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html) for BREs are:

    -----
    $
       The <dollar-sign> shall be special when used as an anchor.

    A <dollar-sign> ( '$' ) shall be an anchor when used as the last
    character of an entire BRE. The implementation may treat a
    <dollar-sign> as an anchor when used as the last character of a
    subexpression. The <dollar-sign> shall anchor the expression (or
    optionally subexpression) to the end of the string being matched; the
    <dollar-sign> can be said to match the end-of-string following the
    last character.
    -----

    and for EREs (emphasis mine):

    -----
    $
       The <dollar-sign> shall be special when used as an anchor.

    A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
    expression or subexpression it ends to the end of a string; such an
    expression or subexpression can match only a sequence ending at the
    last character of a string. For example, the EREs "ef$" and "(ef$)"
    match "ef" in the string "abcdef", but fail to match in the string
    "cdefab", and **the ERE "e$f" is valid, but can never match because
    the 'f' prevents the expression "e$" from matching ending at the last
    character**.
    -----

    So, the BRE section doesn't explicitly state what `$` means when it's
    not at the end of a regexp but given the "special when used as an
    anchor" statement, it makes sense to take that as meaning it's literal
    otherwise and that is how the various tools I've tried are
    interpreting it.

    The ERE section, however, has that same statement about `$` being
    special when used as an anchor, but then goes on to state that when
    it's mid-regexp, e.g. `e$f`, it should NOT be treated literally even
    though doing so means the regexp that includes it can never match
    anything.

    That ERE specification seems odd - why interpret `$` in a way that's
    different from BREs and results in a regexp that can never match
    anything instead of simply treating it as literal, same as BREs do?


    I feel it the other way round:  The BRE specification seems odd: Why
    should an unescaped unbracketed dollar sign only be interpreted special
    when it's at the end of an expression?  (That constraint doesn't harm,
    though, because a dollar sign one wants to have its literal meaning can
    be escaped or bracketed even when not at the end of an expression.)

    Does anyone have any insight into why a `$` mid-regexp is treated that
    way in EREs?

    I don't know, but I guess it's the principle “Keep it simple”:  A
    dollar‐sign is special, unless it is either inside a bracket expression
    or preceded by a quoting backslash.  There is no additional constraining
    rule “unless it is inside of an expression”, which is unnecessary to
    bring dollar signs with their literal meaning into regular expressions.

    Right but there's other characters that are treated as metachars based
    on context, e.g. `}`, `]`, and `)` are only metachars if they succeed
    `{`, `[`, and `(` respectively, otherwise they're literal, and `^`, `-`
    and `]` mean different things depending on where they appear inside a
    bracket expression, so it wouldn't be much of a leap to make the meaning
    of `^` and `$` outside of a bracket expression context-sensitive too.
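
    The position-dependent meanings inside a bracket expression can be
    checked with a few one-character inputs. A small sketch, assuming a
    POSIX-style grep -E; `check` is a hypothetical helper:

```shell
# check PATTERN STRING: print "match" or "no match" for an ERE.
check() {
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then
        echo 'match'
    else
        echo 'no match'
    fi
}

check '[]a]'  ']'   # ] first in the list is a literal ]
check '[^a]'  'b'   # ^ first in the list negates it
check '[^a]'  'a'   # so a itself is excluded
check '[a^]'  '^'   # ^ anywhere else is a literal ^
check '[a-]'  '-'   # - last in the list is a literal -
check '[a-c]' 'b'   # - between two characters forms a range
```

    Every call except the third should report a match.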

    Ed.


    And I guess with BREs it was too late to abandon that constraining rule without breaking existing utilities.

    But that all is just a guess.

  • From Kaz Kylheku@21:1/5 to Ed Morton on Fri Aug 19 17:34:15 2022
    On 2022-08-18, Ed Morton <mortonspam@gmail.com> wrote:
    When I write a regexp that has a `$` in the middle of it I write it as
    either of:

    sed 's/foo\$bar/stuff/'
    sed 's/foo[$]bar/stuff/'

    Because $ can have the special anchoring meaning, even when
    it occurs in the middle of a regex:

    $ grep -E 'abc$|def'

    matches lines ending in abc, or containing def.
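
    A quick way to see that, sketched with three made-up input lines
    (the variable name `out` is arbitrary):

```shell
# Feed three lines to the ERE: one ending in abc, one merely containing
# abc, and one containing def. Only the first and third should pass.
out=$(printf '%s\n' 'xyzabc' 'abcxyz' 'xdefx' | grep -E 'abc$|def')
printf '%s\n' "$out"
```

    The line `abcxyz` is filtered out because the `$` still anchors the
    first alternative even though it sits mid-pattern.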

    To get the behavior you want, the exact rule would have to be
    rooted in the abstract syntax: that a $ which has a right
    sibling in the syntax tree becomes automatically literal.

    so that it's clear the `$` should be treated literally. Given that, I've

    Treating characters literally or not based on their position in the
    syntax is a bad idea in the first place.

    For instance it's a misfeature of some regex implementations that )
    becomes literal, without escaping, if it is unmatched, rather than
    being flagged as a syntax error.

    Consistency is best.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From Ed Morton@21:1/5 to Kaz Kylheku on Fri Aug 19 19:30:05 2022
    On 8/19/2022 12:34 PM, Kaz Kylheku wrote:
    On 2022-08-18, Ed Morton <mortonspam@gmail.com> wrote:
    When I write a regexp that has a `$` in the middle of it I write it as
    either of:

    sed 's/foo\$bar/stuff/'
    sed 's/foo[$]bar/stuff/'

    Because $ can have the special anchoring meaning, even when
    it occurs in the middle of a regex:

    $ grep -E 'abc$|def'

    matches lines ending in abc, or containing def.

    All the relevant ERE text actually talks about regexp subexpressions in
    the same way the BRE text talks about whole regexps, I just didn't want
    to get into that, so in the above case the $ is at the end of a
    subexpression and so the $ is being treated consistently between BREs
    and EREs in that regard.

    Ed.
