On 8/18/22 4:27 PM, Ed Morton wrote:
That ERE specification seems odd - why interpret `$` in a way that's
different from BREs and results in a regexp that can never match
anything instead of simply treating it as literal, same as BREs do?
The standard simply documents existing practice here. Under XRAT A.9.3.8
it says:
The ability of '^', '$', and '*' to be non-special in certain
circumstances may be confusing to some programmers, but this situation
was changed only in a minor way from historical practice to avoid
breaking many historical scripts. Some consideration was given to
making the use of the anchoring characters undefined if not escaped and
not at the beginning or end of strings. This would cause a number of
historical BREs, such as "2^10", "$HOME", and "$1.35", that relied on
the characters being treated literally, to become invalid.
ERE anchoring has been different from BRE anchoring in all historical
systems. An unescaped anchor character has never matched its literal
counterpart outside a bracket expression. Some implementations treated
"foo$bar" as a valid expression that never matched anything; others
treated it as invalid. POSIX.1-202x mandates the former, valid
unmatched behavior.
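For anyone who wants to try the XRAT examples, here is a minimal sketch. It assumes a POSIX-like shell and a grep that follows the behavior quoted above (I would expect GNU grep to); the sample inputs and variable names are made up for illustration.

```shell
# BRE: '^' not at the start and '$' not at the end are ordinary characters,
# so the historical patterns "2^10" and "$1.35" match their literal text.
bre_caret=$(printf '%s\n' '2^10' | grep -c '2^10')
bre_dollar=$(printf '%s\n' 'cost: $1.35' | grep -c '$1.35')

# ERE: the same unescaped characters are anchors wherever they appear,
# so these patterns are valid but cannot match anything.
# grep -c still prints a count of 0; "|| true" only hides its exit status of 1.
ere_caret=$(printf '%s\n' '2^10' | grep -Ec '2^10' || true)
ere_dollar=$(printf '%s\n' 'cost: $1.35' | grep -Ec '$1.35' || true)

echo "BRE matches: $bre_caret $bre_dollar  ERE matches: $ere_caret $ere_dollar"
```

On an implementation that behaves as the quoted text describes, the BRE counts should be 1 and the ERE counts 0. Older or non-conforming tools may differ, which is the variation the rationale mentions.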
When I write a regexp that has a `$` in the middle of it I write it
as either of:
sed 's/foo\$bar/stuff/'
sed 's/foo[$]bar/stuff/'
so that it's clear the `$` should be treated literally. Given that,
I've never noticed before that an unescaped `$` mid-regexp is treated
differently in BREs vs EREs, e.g.:
$ echo 'foo$bar' | sed 's/foo$bar/stuff/'
stuff
$ echo 'foo$bar' | sed -E 's/foo$bar/stuff/'
foo$bar
As far as I can see, the relevant quotes of the POSIX spec
(https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html)
for BREs are:
-----
$
The <dollar-sign> shall be special when used as an anchor.
A <dollar-sign> ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a
<dollar-sign> as an anchor when used as the last character of a
subexpression. The <dollar-sign> shall anchor the expression (or
optionally subexpression) to the end of the string being matched; the
<dollar-sign> can be said to match the end-of-string following the
last character.
-----
and for EREs (emphasis mine):
-----
$
The <dollar-sign> shall be special when used as an anchor.
A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the
last character of a string. For example, the EREs "ef$" and "(ef$)"
match "ef" in the string "abcdef", but fail to match in the string
"cdefab", and **the ERE "e$f" is valid, but can never match because
the 'f' prevents the expression "e$" from matching ending at the last
character**.
-----
So, the BRE section doesn't explicitly state what `$` means when
it's not at the end of a regexp but given the "special when used as
an anchor" statement, it makes sense to take that as meaning it's
literal otherwise and that is how the various tools I've tried are
interpreting it.
The ERE section, however, has that same statement about `$` being
special when used as an anchor, but then goes on to state that when
it's mid-regexp, e.g. `e$f`, it should NOT be treated literally
even though doing so means the regexp that includes it can never
match anything.
That ERE specification seems odd - why interpret `$` in a way
that's different from BREs and results in a regexp that can never
match anything instead of simply treating it as literal, same as
BREs do?
Does anyone have any insight into why a `$` mid-regexp is treated
that way in EREs?
Ed Morton <mortonspam@gmail.com>:
When I write a regexp that has a `$` in the middle of it I write it as
either of:
sed 's/foo\$bar/stuff/'
sed 's/foo[$]bar/stuff/'
so that it's clear the `$` should be treated literally. Given that,
I've never noticed before
So do I.
that an unescaped `$` mid-regexp is treated differently in BREs vs
EREs, e.g.:
$ echo 'foo$bar' | sed 's/foo$bar/stuff/'
stuff
$ echo 'foo$bar' | sed -E 's/foo$bar/stuff/'
foo$bar
As far as I can see, the relevant quotes of the POSIX spec
(https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html) for BREs are:
-----
$
The <dollar-sign> shall be special when used as an anchor.
A <dollar-sign> ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a
<dollar-sign> as an anchor when used as the last character of a
subexpression. The <dollar-sign> shall anchor the expression (or
optionally subexpression) to the end of the string being matched; the
<dollar-sign> can be said to match the end-of-string following the
last character.
-----
and for EREs (emphasis mine):
-----
$
The <dollar-sign> shall be special when used as an anchor.
A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the
last character of a string. For example, the EREs "ef$" and "(ef$)"
match "ef" in the string "abcdef", but fail to match in the string
"cdefab", and **the ERE "e$f" is valid, but can never match because
the 'f' prevents the expression "e$" from matching ending at the last
character**.
-----
So, the BRE section doesn't explicitly state what `$` means when it's
not at the end of a regexp but given the "special when used as an
anchor" statement, it makes sense to take that as meaning it's literal
otherwise and that is how the various tools I've tried are
interpreting it.
The ERE section, however, has that same statement about `$` being
special when used as an anchor, but then goes on to state that when
it's mid-regexp, e.g. `e$f`, it should NOT be treated literally even
though doing so means the regexp that includes it can never match
anything.
That ERE specification seems odd - why interpret `$` in a way that's
different from BREs and results in a regexp that can never match
anything instead of simply treating it as literal, same as BREs do?
I feel it the other way round: The BRE specification seems odd: Why
should an unescaped, unbracketed dollar sign be interpreted as special
only when it's at the end of an expression? (That constraint does no
harm, though, because a dollar sign one wants to have its literal
meaning can be escaped or bracketed even when not at the end of an
expression.)
Does anyone have any insight into why a `$` mid-regexp is treated that
way in EREs?
I don't know, but I guess it's the principle “Keep it simple”: A dollar
sign is special unless it is either inside a bracket expression or
preceded by a quoting backslash. There is no additional constraining
rule “unless it is in the middle of an expression”, and none is needed
to get dollar signs with their literal meaning into regular expressions.
And I guess with BREs it was too late to abandon that constraining rule
without breaking existing utilities.
But all of that is just a guess.
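To illustrate that simpler ERE rule, here is a small sketch using the thread's own `foo$bar` example. It assumes a sed that accepts `-E` (GNU and BSD sed both do); the variable names are only for illustration.

```shell
input='foo$bar'

# Unescaped '$' in an ERE is an anchor, so the pattern cannot match
# and sed passes the line through unchanged.
ere_plain=$(printf '%s\n' "$input" | sed -E 's/foo$bar/stuff/')

# Either a quoting backslash or a bracket expression makes '$' literal.
ere_escaped=$(printf '%s\n' "$input" | sed -E 's/foo\$bar/stuff/')
ere_bracket=$(printf '%s\n' "$input" | sed -E 's/foo[$]bar/stuff/')

printf '%s\n' "$ere_plain" "$ere_escaped" "$ere_bracket"
```

The first line of output should be the unchanged input and the other two should be `stuff`. The escaped and bracketed forms behave the same way without `-E`, so they are safe to use in both BREs and EREs.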
On 2022-08-18, Ed Morton <mortonspam@gmail.com> wrote:
When I write a regexp that has a `$` in the middle of it I write it as
either of:
sed 's/foo\$bar/stuff/'
sed 's/foo[$]bar/stuff/'
Because $ can have the special anchoring meaning, even when
it occurs in the middle of a regex:
$ grep -E 'abc$|def'
matches lines ending in abc, or containing def.
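A runnable version of that alternation example might look like the sketch below; the three input lines are invented for the demonstration.

```shell
# 'abc$|def' has '$' in the middle of the pattern text, but it still ends
# the first alternative, so it anchors that branch to the end of the line.
matches=$(printf '%s\n' 'xyzabc' 'abcxyz' 'xdefx' | grep -E 'abc$|def')

printf '%s\n' "$matches"
```

Only `xyzabc` (ends in abc) and `xdefx` (contains def) should be printed; `abcxyz` is skipped because its abc is not at the end of the line.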