Forum: >>> Magnum BBS <<<

Vanilla regex

From Tuxedo@21:1/5 to All on Sun Jul 2 19:14:01 2023

Can anyone assist with a regex using fairly standard and cross-system compatible methods?

It's for files containing wiki markup segments as follows:

[[File:Some File Name 0123.jpg|800px]]

Or maybe:

[[File:Some other file.jpg|250px]]

Or maybe:

[[File:Another file.jpg |600px|thumb]]

etc.

The unique identifiers for the relevant parts are the start of "[[File:" followed by ASCII characters making up a file name that ends in some file type, such as
.jpg, .JPEG, .Jpeg, .PNG or .gif, followed by a "|" pipe or closing "]]" brackets.

The regex needs to grab the filename portion, e.g. "Another file.jpg", keep
it in a variable and replace any spaces within it with underscores, so that the updated variable becomes "Another_file.jpg".

Thereafter, within the existing markup, for example:

[[File:Another file.jpg |600px|thumb]]

Insert the following markup after the first pipe:

link=https://example.com/display.pl?Another_file.jpg|

So the final markup becomes:

[[File:Another file.jpg |link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]

The spaces in the original "File: ..." name parts can remain as it's valid markup but the underscores need to exist in link=... strings.

There may be instances where a "|link=" occurrence already exists between the opening "[[File:" and its closing "]]" brackets. The regex
should skip such instances, so that the procedure can be re-run without conflicting with earlier replacements.

Many thanks for any example code and ideas.

Tuxedo

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to Janis Papanagnou on Sun Jul 2 20:09:03 2023

On 02.07.2023 20:02, Janis Papanagnou wrote:

On 02.07.2023 19:14, Tuxedo wrote:

[...]

You can do such replacements in modern shells, but since "using fairly standard" isn't exactly an exact specification I provide an example in (standard) awk...

awk '
BEGIN {
p = "link=https://example.com/display.pl?"
}
$0 !~ p && match($0,/\[\[File:[^]|]+/) {
f = substr($0, RSTART+7, RLENGTH-7)
sub(/ $/, "", f)

Make that

sub(/ +$/, "", f)

in case there's more than one spurious space after the file extension.

gsub(/ /, "_", f)
sub(/[|]/, "|" p f "|")
}
1
'

The first sub-condition skips lines that already match the pattern defined in variable p.
The second condition locates the "[[File:" markup where it appears.
The action strips trailing spaces so that they don't get replaced by '_',
and finally composes the link.

This code operates on lines containing only one of these patterns.
It assumes no spaces between '[[' and 'File:'.
It's also unclear whether you need to change multiple patterns in a
line or anything else, so it might need some tweaking or refinement.

Janis


From Janis Papanagnou@21:1/5 to Tuxedo on Sun Jul 2 20:02:44 2023

On 02.07.2023 19:14, Tuxedo wrote:

[...]

You can do such replacements in modern shells, but since "using fairly standard" isn't exactly an exact specification I provide an example in (standard) awk...

awk '
BEGIN {
    p = "link=https://example.com/display.pl?"
}
$0 !~ p && match($0,/\[\[File:[^]|]+/) {
    f = substr($0, RSTART+7, RLENGTH-7)
    sub(/ $/, "", f)
    gsub(/ /, "_", f)
    sub(/[|]/, "|" p f "|")
}
1
'

The first sub-condition skips lines that already match the pattern defined in variable p.
The second condition locates the "[[File:" markup where it appears.
The action strips trailing spaces so that they don't get replaced by '_',
and finally composes the link.

This code operates on lines containing only one of these patterns.
It assumes no spaces between '[[' and 'File:'.
It's also unclear whether you need to change multiple patterns in a
line or anything else, so it might need some tweaking or refinement.
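
As a quick check, the script can be wrapped in a shell function and fed the three sample lines from the original question. This is only a sketch: the function name add_link is made up for the illustration, and the fourth input line is an assumed example of markup that already carries a link.

```shell
# add_link is a hypothetical wrapper name; the awk body is the script posted above.
add_link() {
    awk '
    BEGIN {
        p = "link=https://example.com/display.pl?"
    }
    # Skip lines that already match p, otherwise locate "[[File:name"
    $0 !~ p && match($0, /\[\[File:[^]|]+/) {
        f = substr($0, RSTART+7, RLENGTH-7)   # file name without "[[File:"
        sub(/ $/, "", f)                      # drop one trailing space
        gsub(/ /, "_", f)                     # remaining spaces become "_"
        sub(/[|]/, "|" p f "|")               # insert the link after the first pipe
    }
    1
    '
}

add_link <<'EOF'
[[File:Some File Name 0123.jpg|800px]]
[[File:Some other file.jpg|250px]]
[[File:Another file.jpg |600px|thumb]]
[[File:Another file.jpg |link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]
EOF
```

The first three lines should come out with a link= parameter after the first pipe, while the fourth is printed unchanged, so running the function a second time over its own output makes no further changes.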

Janis


From Tuxedo@21:1/5 to Janis Papanagnou on Mon Jul 3 14:49:13 2023

Janis Papanagnou wrote:

[...]

Many thanks for posting the example along with the explanatory notes.

There could be a space between '[[' and 'File:' as it's not invalid markup, which I did not know until testing it just now. It is unlikely, however.

And there could be more than one pattern instance on a single line, but
again, it's unlikely. Multiple patterns are almost always on different
lines.

I will bear this in mind and tailor to my purpose.

Tuxedo


From Janis Papanagnou@21:1/5 to Ed Morton on Mon Jul 3 18:06:32 2023

On 03.07.2023 17:39, Ed Morton wrote:

On 7/2/2023 1:02 PM, Janis Papanagnou wrote:
<snip>

awk '
BEGIN {
p = "link=https://example.com/display.pl?"

That `?` at the end means "0 or 1 occurrences of the preceding
expression". I suspect you meant to make the `?` literal

Actually, no. What I meant was to keep the _sample_ *simple*,
since escaping regexp meta-symbols in strings that contain patterns
quickly becomes bulky.

What I considered (while preserving the simplicity) was to just
remove the '?' from the string...

p = "link=https://example.com/display.pl"

and later write

sub(/[|]/, "|" p "?" f "|")

but as I wrote, it's not worth the hassle for the sample where
more important questions were still unclear (as previously said).

Of course you could also just simplify the string (without the dot)
for the match

p = "link=https://example.com/display"

and this will still fail in case this pattern appears elsewhere
in the data. And how likely is it that the two dots in
link=https://example.com/display.pl
will match, say, link=https://exampleXcom/displayYpl? Not very,
don't you think?

Janis



From Ed Morton@21:1/5 to Tuxedo on Mon Jul 3 10:56:34 2023

On 7/3/2023 7:49 AM, Tuxedo wrote:
<snip>

There could be a space between '[[' and 'File:' as it's not invalid markup, which I did not know until testing it now. It's however unlikely.

And there could be more than one pattern instance on a single line, but again, it's unlikely. Multiple patterns are almost always on different
lines.

If you post a block of text containing concise, testable sample input
that covers ALL of your use cases and a separate block of text showing
the exact output you expect from that input then we'll have something we
can copy/paste to easily test a potential solution against and so we can
help you.

Regards,

Ed.


From Ed Morton@21:1/5 to Tuxedo on Mon Jul 3 10:50:21 2023

On 7/2/2023 12:14 PM, Tuxedo wrote:

Can anyone assist with a regex using fairly standard and cross-system compatible methods?

There are 2 different POSIX regex standards:

BRE: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03
ERE: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04

and another fairly commonly used regex notation:

PCRE: https://www.pcre.org/

Also, every tool has its own options, extensions, delimiters,
backreferences support, other enhancements/considerations, etc. for
whatever regex flavor(s) it supports.

So there is no "fairly standard and cross-system compatible" regex
notation. Janis gave you an answer in AWK that uses only POSIX
constructs, and that's the best you can do regarding portability and usability, as AWK is the most powerful mandatory POSIX
tool (i.e. it must be present on all Unix boxes) for manipulating text.
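
To make the BRE/ERE difference concrete, here is a small sketch that extracts the file name twice: once with sed, which uses BRE by default, and once with awk, which uses ERE. The function names name_bre and name_ere and the choice of sample line are made up for the illustration.

```shell
# Hypothetical helper names; both print the file-name part of one markup line.

# BRE (sed): grouping is \( \) and "one or more" is the interval \{1,\}
name_bre() {
    printf '%s\n' "$1" | sed 's/^\[\[File:\([^]|]\{1,\}\).*/\1/'
}

# ERE (awk): "one or more" is a plain +
name_ere() {
    printf '%s\n' "$1" |
    awk 'match($0, /\[\[File:[^]|]+/) { print substr($0, RSTART+7, RLENGTH-7) }'
}

line='[[File:Some other file.jpg|250px]]'
name_bre "$line"
name_ere "$line"
```

Both calls should print the same file name. The pattern is the same in spirit, but the notation for grouping and repetition differs between the two flavors.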

Regards,

Ed.


From Ed Morton@21:1/5 to Janis Papanagnou on Mon Jul 3 10:39:49 2023

On 7/2/2023 1:02 PM, Janis Papanagnou wrote:
<snip>

awk '
BEGIN {
p = "link=https://example.com/display.pl?"

That `?` at the end means "0 or 1 occurrences of the preceding
expression". I suspect you meant to make the `?` literal and you should
also make the `.`s literal:

p = "link=https://example[.]com/display[.]pl[?]"

Also consider whether or not word boundaries or, more likely, some other
method of avoiding undesirable substring matches is required.
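
A minimal sketch of the point about `?` and `.`, assuming a made-up look-alike string s that is not in the original data: when p is used as a dynamic regex, the dots match any character and the final `l` becomes optional, whereas the bracketed form only matches the literal text. The function name meta_demo is hypothetical.

```shell
# meta_demo is a hypothetical name; the string s is a made-up near miss.
meta_demo() {
    awk 'BEGIN {
        p = "link=https://example.com/display.pl?"        # as posted: "." and "?" act as operators
        q = "link=https://example[.]com/display[.]pl[?]"  # bracketed: every character is literal
        s = "link=https://exampleXcom/displayYp"
        r = (s ~ p) ? "yes" : "no"; print "p matches: " r
        r = (s ~ q) ? "yes" : "no"; print "q matches: " r
    }'
}

meta_demo
```

Under these assumptions the unescaped pattern accepts the look-alike and the bracketed one rejects it.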

Regards,

Ed.


From Ed Morton@21:1/5 to Janis Papanagnou on Mon Jul 3 13:07:09 2023

On 7/3/2023 12:57 PM, Janis Papanagnou wrote:

On 03.07.2023 18:06, Janis Papanagnou wrote:

On 03.07.2023 17:39, Ed Morton wrote:

[...]

and you should also make the `.`s literal:

p = "link=https://example[.]com/display[.]pl[?]"

I forgot to point out (in case it was not obvious)...

Variable p was used in two contexts, as pattern and as variable
to be printed literally; so above expression would not qualify.

I confess I just glanced at the script. Then leave the definition as it
was and change `$0 !~ p` to `!index($0,p)` if you want `p` treated
literally.
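
As a rough illustration of the index() variant, the sketch below filters out lines that already contain p as literal text. The function name unlinked and the three input lines are assumptions made for the example; the third line is a look-alike that a regex test with `$0 !~ p` would wrongly treat as already linked.

```shell
# unlinked is a hypothetical name: it prints only lines that do not yet contain p.
unlinked() {
    awk 'BEGIN { p = "link=https://example.com/display.pl?" }
         !index($0, p)    # index() compares characters literally, no regex involved
    '
}

unlinked <<'EOF'
[[File:A.jpg|link=https://example.com/display.pl?A.jpg|thumb]]
[[File:B.jpg|thumb]]
[[File:C.jpg|link=https://exampleXcom/displayYpl|thumb]]
EOF
```

Only the first line is dropped. Because index() never interprets p as a pattern, the same unmodified string can also be reused as replacement text.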

Ed.


From Janis Papanagnou@21:1/5 to Janis Papanagnou on Mon Jul 3 19:57:35 2023

On 03.07.2023 18:06, Janis Papanagnou wrote:

On 03.07.2023 17:39, Ed Morton wrote:

[...]

and you should also make the `.`s literal:

p = "link=https://example[.]com/display[.]pl[?]"

I forgot to point out (in case it was not obvious)...

Variable p was used in two contexts, as a pattern and as a variable
to be printed literally; so the above expression would not qualify.

Janis

[...]


From Janis Papanagnou@21:1/5 to Ed Morton on Mon Jul 3 21:15:47 2023

On 03.07.2023 20:07, Ed Morton wrote:

I confess I just glanced at the script. Then leave the definition as it
was and change `$0 !~ p` to `!index($0,p)` if you want `p` treated
literally.

Yes, indeed. It occurred to me only after my post. Maybe it's time for
an update now to add and summarize all the little details in one code
sample...

awk '
BEGIN { p = "link=https://example.com/display.pl" }

!index($0, p) && match($0, /\[\[File:[^]|]+/) {
    f = substr($0, RSTART+7, RLENGTH-7)
    sub(/ +$/, "", f)
    gsub(/ /, "_", f)
    sub(/[|]/, "|" p "?" f "|")
}

{ print }
'

Of course the OP has meanwhile mentioned a couple more requirements,
and some further tweaks could be added (replacing the ' ' with a character class, adding support for white-space
after "File:", etc.), but for the intended purpose it's sufficient, I
suppose.
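
For what it's worth, the extra requirements mentioned in the thread (white-space around "File:" and more than one instance per line) could be approached roughly as follows. This is a sketch under assumptions, not a tested solution for real wiki data: the names add_links and fix are made up, blanks and tabs are treated as the only white-space, and each segment is assumed to end at the first "]]".

```shell
# add_links and fix are hypothetical names used only for this sketch.
add_links() {
    awk '
    BEGIN { p = "link=https://example.com/display.pl" }

    # Return one "[[File:...]]" segment with the link inserted after its first pipe.
    function fix(seg,    f) {
        if (index(seg, p)) return seg            # already linked: leave untouched
        f = seg
        sub(/^\[\[[ \t]*File:[ \t]*/, "", f)     # strip the opening part
        sub(/[]|].*$/, "", f)                    # cut at the first "|" or "]"
        sub(/[ \t]+$/, "", f)                    # drop trailing white-space
        gsub(/[ \t]/, "_", f)                    # remaining white-space becomes "_"
        sub(/[|]/, "|" p "?" f "|", seg)         # no pipe in seg: nothing changes
        return seg
    }

    {
        out = ""; rest = $0
        # Walk through every segment on the line, left to right.
        while (match(rest, /\[\[[ \t]*File:[^]]*\]\]/)) {
            out = out substr(rest, 1, RSTART-1) fix(substr(rest, RSTART, RLENGTH))
            rest = substr(rest, RSTART+RLENGTH)
        }
        print out rest
    }
    '
}

printf '%s\n' 'x [[ File:A b.jpg |100px]] y [[File:C d.png|thumb]] z' | add_links
```

One limit inherited from sub(): an "&" in the replacement text stands for the matched string, so file names containing "&" would need extra escaping.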

Janis
