Forum: >>> Magnum BBS <<<

why is $ not literal mid-string in an ERE?

From Ed Morton@21:1/5 to All on Thu Aug 18 08:27:46 2022

When I write a regexp that has a `$` in the middle of it I write it as
either of:

sed 's/foo\$bar/stuff/'
sed 's/foo[$]bar/stuff/'

so that it's clear the `$` should be treated literally. Given that, I've
never noticed before that an unescaped `$` mid-regexp is treated
differently in BREs vs EREs, e.g.:

$ echo 'foo$bar' | sed 's/foo$bar/stuff/'
stuff

$ echo 'foo$bar' | sed -E 's/foo$bar/stuff/'
foo$bar

As far as I can see, the relevant quotes of the POSIX spec (https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html)
for BREs are:

-----
$
The <dollar-sign> shall be special when used as an anchor.

A <dollar-sign> ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a <dollar-sign>
as an anchor when used as the last character of a subexpression. The <dollar-sign> shall anchor the expression (or optionally subexpression)
to the end of the string being matched; the <dollar-sign> can be said to
match the end-of-string following the last character.
-----

and for EREs (emphasis mine):

-----
$
The <dollar-sign> shall be special when used as an anchor.

A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the last
character of a string. For example, the EREs "ef$" and "(ef$)" match
"ef" in the string "abcdef", but fail to match in the string "cdefab",
and **the ERE "e$f" is valid, but can never match because the 'f'
prevents the expression "e$" from matching ending at the last
character**.
-----

So, the BRE section doesn't explicitly state what `$` means when it's
not at the end of a regexp but given the "special when used as an
anchor" statement, it makes sense to take that as meaning it's literal otherwise and that is how the various tools I've tried are interpreting it.

The ERE section, however, has that same statement about `$` being
special when used as an anchor, but then goes on to state that when it's mid-regexp, e.g. `e$f`, it should NOT be treated literally even though
doing so means the regexp that includes it can never match anything.
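
For anyone who wants to reproduce the difference, here is a minimal sketch that runs the same input through the unescaped, escaped, and bracketed forms. It assumes a sed that accepts -E (GNU and BSD sed do); the variable names are only for illustration, and the expected values in the comments are what the outputs shown earlier and the quoted spec text suggest, not a guarantee for every implementation.

```shell
#!/bin/sh
# Run one input through several spellings of the pattern.
input='foo$bar'

# BRE, unescaped: '$' is not the last character, so it is taken literally.
bre_plain=$(printf '%s\n' "$input" | sed 's/foo$bar/stuff/')

# ERE, unescaped: '$' stays an anchor, so nothing should be replaced.
ere_plain=$(printf '%s\n' "$input" | sed -E 's/foo$bar/stuff/')

# Escaped or bracketed: literal in both flavours.
bre_escaped=$(printf '%s\n' "$input" | sed 's/foo\$bar/stuff/')
ere_escaped=$(printf '%s\n' "$input" | sed -E 's/foo\$bar/stuff/')
ere_bracket=$(printf '%s\n' "$input" | sed -E 's/foo[$]bar/stuff/')

printf 'BRE unescaped: %s\n' "$bre_plain"
printf 'ERE unescaped: %s\n' "$ere_plain"
printf 'BRE escaped:   %s\n' "$bre_escaped"
printf 'ERE escaped:   %s\n' "$ere_escaped"
printf 'ERE bracketed: %s\n' "$ere_bracket"
```

The escaped and bracketed spellings are the only ones that behave the same in both modes, which is why they are the safer habit.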

That ERE specification seems odd - why interpret `$` in a way that's
different from BREs and results in a regexp that can never match
anything instead of simply treating it as literal, same as BREs do?

Does anyone have any insight into why a `$` mid-regexp is treated that
way in EREs?

Ed.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From =?UTF-8?B?T8SfdXo=?=@21:1/5 to Ed Morton on Thu Aug 18 21:29:19 2022

On 8/18/22 4:27 PM, Ed Morton wrote:

That ERE specification seems odd - why interpret `$` in a way that's different from BREs and results in a regexp that can never match
anything instead of simply treating it as literal, same as BREs do?

The standard simply documents existing practice here. Under XRAT A.9.3.8
it says:

The ability of '^', '$', and '*' to be non-special in certain circumstances may be confusing to
some programmers, but this situation was changed only in a minor way from historical practice
to avoid breaking many historical scripts. Some consideration was given to making the use of
the anchoring characters undefined if not escaped and not at the beginning or end of strings.
This would cause a number of historical BREs, such as "2^10", "$HOME", and "$1.35", that
relied on the characters being treated literally, to become invalid.
ERE anchoring has been different from BRE anchoring in all historical systems. An unescaped
anchor character has never matched its literal counterpart outside a bracket expression. Some
implementations treated "foo$bar" as a valid expression that never matched anything; others
treated it as invalid. POSIX.1-202x mandates the former, valid unmatched behavior.
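
The historical BREs named in that rationale are easy to try. The following is a rough sketch, assuming a grep that supports -E and -c; the variable names are made up for the example. The ERE counts are what the quoted rationale leads one to expect rather than something verified on every implementation.

```shell
#!/bin/sh
# Each pattern is matched against its own text.
# Single quotes keep the shell from expanding $HOME and $1.

# As BREs, '^' and '$' away from the ends of the pattern are literal.
bre_caret=$(printf '%s\n' '2^10'  | grep -c '2^10')
bre_home=$(printf '%s\n' '$HOME' | grep -c '$HOME')
bre_price=$(printf '%s\n' '$1.35' | grep -c '$1.35')

# As EREs the same characters stay anchors, so these are expected to
# count zero matches. '|| true' only hides grep's non-zero exit status
# when nothing matches.
ere_caret=$(printf '%s\n' '2^10'  | grep -Ec '2^10') || true
ere_home=$(printf '%s\n' '$HOME' | grep -Ec '$HOME') || true

# Escaping restores the literal meaning in ERE mode.
ere_escaped=$(printf '%s\n' '2^10' | grep -Ec '2\^10')

printf 'BRE matches: %s %s %s\n' "$bre_caret" "$bre_home" "$bre_price"
printf 'ERE matches: %s %s\n' "$ere_caret" "$ere_home"
printf 'ERE escaped: %s\n' "$ere_escaped"
```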

From Ed Morton@21:1/5 to All on Thu Aug 18 13:44:17 2022

On 8/18/2022 1:29 PM, Oğuz wrote:

On 8/18/22 4:27 PM, Ed Morton wrote:

That ERE specification seems odd - why interpret `$` in a way that's
different from BREs and results in a regexp that can never match
anything instead of simply treating it as literal, same as BREs do?

The standard simply documents existing practice here. Under XRAT A.9.3.8
it says:

OK, I see that at:

https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_08

The ability of '^', '$', and '*' to be non-special in certain
circumstances may be confusing to
some programmers, but this situation was changed only in a minor way
from historical practice
to avoid breaking many historical scripts. Some consideration was
given to making the use of
the anchoring characters undefined if not escaped and not at the
beginning or end of strings.
This would cause a number of historical BREs, such as "2^10", "$HOME",
and "$1.35", that
relied on the characters being treated literally, to become invalid.
ERE anchoring has been different from BRE anchoring in all historical
systems. An unescaped
anchor character has never matched its literal counterpart outside a
bracket expression. Some
implementations treated "foo$bar" as a valid expression that never
matched anything; others
treated it as invalid. POSIX.1-202x mandates the former, valid
unmatched behavior.

Them discussing `foo$bar` in that article when I used that exact string
in my question is quite a coincidence!

Thanks,

Ed.

From Helmut Waitzmann@21:1/5 to All on Thu Aug 18 18:01:13 2022

Ed Morton <mortonspam@gmail.com>:

When I write a regexp that has a `$` in the middle of it I write it
as either of:

sed 's/foo\$bar/stuff/'
sed 's/foo[$]bar/stuff/'

so that it's clear the `$` should be treated literally. Given that,
I've never noticed before

So do I.

that an unescaped `$` mid-regexp is treated differently in BREs vs
EREs, e.g.:

$ echo 'foo$bar' | sed 's/foo$bar/stuff/'
stuff

$ echo 'foo$bar' | sed -E 's/foo$bar/stuff/'
foo$bar

As far as I can see, the relevant quotes of the POSIX spec
(https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html)
for BREs are:

-----
$
The <dollar-sign> shall be special when used as an anchor.

A <dollar-sign> ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a
<dollar-sign> as an anchor when used as the last character of a
subexpression. The <dollar-sign> shall anchor the expression (or
optionally subexpression) to the end of the string being matched; the
<dollar-sign> can be said to match the end-of-string following the
last character.
-----

and for EREs (emphasis mine):

-----
$
The <dollar-sign> shall be special when used as an anchor.

A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the
last character of a string. For example, the EREs "ef$" and "(ef$)"
match "ef" in the string "abcdef", but fail to match in the string
"cdefab", and **the ERE "e$f" is valid, but can never match because
the 'f' prevents the expression "e$" from matching ending at the last
character**.
-----

So, the BRE section doesn't explicitly state what `$` means when
it's not at the end of a regexp but given the "special when used as
an anchor" statement, it makes sense to take that as meaning it's
literal otherwise and that is how the various tools I've tried are
interpreting it.

The ERE section, however, has that same statement about `$` being
special when used as an anchor, but then goes on to state that when
it's mid-regexp, e.g. `e$f`, it should NOT be treated literally
even though doing so means the regexp that includes it can never
match anything.

That ERE specification seems odd - why interpret `$` in a way
that's different from BREs and results in a regexp that can never
match anything instead of simply treating it as literal, same as
BREs do?

I feel it the other way round: The BRE specification seems odd:
Why should an unescaped unbracketed dollar sign only be interpreted
special when it's at the end of an expression? (That constraint
doesn't harm, though, because a dollar sign one wants to have its
literal meaning can be escaped or bracketed even when not at the end
of an expression.)

Does anyone have any insight into why a `$` mid-regexp is treated
that way in EREs?

I don't know, but I guess it's the principle “Keep it simple”: A
dollar‐sign is special, unless it is either inside a bracket
expression or preceded by a quoting backslash. There is no
additional constraining rule “unless it is inside of an expression”,
which is unnecessary to bring dollar signs with their literal
meaning into regular expressions.

And I guess with BREs it was too late to abandon that constraining
rule without breaking existing utilities.

But that all is just a guess.

From Ed Morton@21:1/5 to Helmut Waitzmann on Thu Aug 18 15:40:43 2022

On 8/18/2022 11:01 AM, Helmut Waitzmann wrote:

Ed Morton <mortonspam@gmail.com>:

When I write a regexp that has a `$` in the middle of it I write it as
either of:

   sed 's/foo\$bar/stuff/'
   sed 's/foo[$]bar/stuff/'

so that it's clear the `$` should be treated literally. Given that,
I've never noticed before

So do I.

that an unescaped `$` mid-regexp is treated differently in BREs vs
EREs, e.g.:

   $ echo 'foo$bar' | sed 's/foo$bar/stuff/'
   stuff

   $ echo 'foo$bar' | sed -E 's/foo$bar/stuff/'
   foo$bar

As far as I can see, the relevant quotes of the POSIX spec
(https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html) for BREs are:

-----
$
   The <dollar-sign> shall be special when used as an anchor.

A <dollar-sign> ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a
<dollar-sign> as an anchor when used as the last character of a
subexpression. The <dollar-sign> shall anchor the expression (or
optionally subexpression) to the end of the string being matched; the
<dollar-sign> can be said to match the end-of-string following the
last character.
-----

and for EREs (emphasis mine):

-----
$
   The <dollar-sign> shall be special when used as an anchor.

A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the
last character of a string. For example, the EREs "ef$" and "(ef$)"
match "ef" in the string "abcdef", but fail to match in the string
"cdefab", and **the ERE "e$f" is valid, but can never match because
the 'f' prevents the expression "e$" from matching ending at the last
character**.
-----

So, the BRE section doesn't explicitly state what `$` means when it's
not at the end of a regexp but given the "special when used as an
anchor" statement, it makes sense to take that as meaning it's literal
otherwise and that is how the various tools I've tried are
interpreting it.

The ERE section, however, has that same statement about `$` being
special when used as an anchor, but then goes on to state that when
it's mid-regexp, e.g. `e$f`, it should NOT be treated literally even
though doing so means the regexp that includes it can never match
anything.

That ERE specification seems odd - why interpret `$` in a way that's
different from BREs and results in a regexp that can never match
anything instead of simply treating it as literal, same as BREs do?

I feel it the other way round: The BRE specification seems odd: Why
should an unescaped unbracketed dollar sign only be interpreted special
when it's at the end of an expression? (That constraint doesn't harm, though, because a dollar sign one wants to have its literal meaning can
be escaped or bracketed even when not at the end of an expression.)

Does anyone have any insight into why a `$` mid-regexp is treated that
way in EREs?

I don't know, but I guess it's the principle “Keep it simple”: A
dollar‐sign is special, unless it is either inside a bracket expression
or preceded by a quoting backslash. There is no additional constraining
rule “unless it is inside of an expression”, which is unnecessary to
bring dollar signs with their literal meaning into regular expressions.

Right but there's other characters that are treated as metachars based
on context, e.g. `}`, `]`, and `)` are only metachars if they succeed
`{`, `[`, and `(` respectively, otherwise they're literal, and `^`, `-`
and `]` mean different things depending on where they appear inside a
bracket expression, so it wouldn't be much of a leap to make the meaning
of `^` and `$` outside of a bracket expression context-sensitive too.
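
A few of those position-dependent cases can be checked directly. This is only a sketch, assuming a grep that supports -E and -c; the variable names are illustrative. It sticks to the cases whose behaviour is well defined (a `]` with no opening bracket, and the placement rules inside a bracket expression) and leaves out unmatched `)`, where implementations differ.

```shell
#!/bin/sh
# ']' with no preceding '[' is an ordinary character.
outside=$(printf '%s\n' 'a]b' | grep -Ec 'a]b')

# ']' placed first inside a bracket expression is a literal member.
first_bracket=$(printf '%s\n' ']' | grep -Ec '^[]x]$')

# '^' anywhere but first inside a bracket expression is a literal member.
late_caret=$(printf '%s\n' '^' | grep -Ec '^[x^]$')

# '-' placed last inside a bracket expression is a literal member.
last_hyphen=$(printf '%s\n' '-' | grep -Ec '^[x-]$')

printf 'a]b: %s  []x]: %s  [x^]: %s  [x-]: %s\n' \
    "$outside" "$first_bracket" "$late_caret" "$last_hyphen"
```

Each count is 1 when the character was taken literally in that position.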

Ed.

And I guess with BREs it was too late to abandon that constraining rule without breaking existing utilities.

But that all is just a guess.

From Kaz Kylheku@21:1/5 to Ed Morton on Fri Aug 19 17:34:15 2022

On 2022-08-18, Ed Morton <mortonspam@gmail.com> wrote:

When I write a regexp that has a `$` in the middle of it I write it as
either of:

sed 's/foo\$bar/stuff/'
sed 's/foo[$]bar/stuff/'

Because $ can have the special anchoring meaning, even when
it occurs in the middle of a regex:

$ grep -E 'abc$|def'

matches lines ending in abc, or containing def.
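
A quick way to see that alternation case in action, as a small sketch with made-up input lines and an illustrative variable name:

```shell
#!/bin/sh
# Four sample lines: only those ending in "abc" or containing "def"
# should be selected by the alternation.
matches=$(printf '%s\n' 'xyzabc' 'abcxyz' 'undefined' 'nothing' |
    grep -E 'abc$|def')

printf '%s\n' "$matches"
# prints:
# xyzabc
# undefined
```

Here 'abcxyz' is rejected because the `$` after abc still anchors its branch of the alternation, even though it sits in the middle of the whole pattern.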

To get the behavior you want, the exact rule would have to be
rooted in the abstract syntax: that a $ which has a right
sibling in the syntax tree becomes automatically literal.

so that it's clear the `$` should be treated literally. Given that, I've

Treating characters literally or not based on their position in the
syntax is a bad idea in the first place.

For instance, it's a misfeature of some regex implementations that )
becomes literal, without escaping, if it is unmatched, rather than being
flagged as a syntax error.

Consistency is best.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

From Ed Morton@21:1/5 to Kaz Kylheku on Fri Aug 19 19:30:05 2022

On 8/19/2022 12:34 PM, Kaz Kylheku wrote:

On 2022-08-18, Ed Morton <mortonspam@gmail.com> wrote:

When I write a regexp that has a `$` in the middle of it I write it as
either of:

sed 's/foo\$bar/stuff/'
sed 's/foo[$]bar/stuff/'

Because $ can have the special anchoring meaning, even when
it occurs in the middle of a regex:

$ grep -E 'abc$|def'

matches lines ending in abc, or containing def.

All the relevant ERE text actually talks about regexp subexpressions in
the same way the BRE text talks about whole regexps, I just didn't want
to get into that, so in the above case the $ is at the end of a
subexpression and so the $ is being treated consistently between BREs
and EREs in that regard.

Ed.
