On 02.07.2023 19:14, Tuxedo wrote:
Can anyone assist with a regex using fairly standard and cross-system
compatible methods?
It's for files containing wiki markup segments as follows:
[[File:Some File Name 0123.jpg|800px]]
Or maybe:
[[File:Some other file.jpg|250px]]
Or maybe:
[[File:Another file.jpg |600px|thumb]]
etc.
The unique identifiers for the relevant parts are the start of "[[File:"
followed by ASCII making up file names ending in some file type, such as
.jpg, .JPEG, .Jpeg etc. .PNG, .gif, followed by a "|" pipe or closing "]]" brackets.
The regex needs to grab the filename portion, eg. "Another file.jpg", keep it in a variable and replace any spaces with underscore(s) within so this
updated variable becomes "Another_file.jpg"
Thereafter, within the existing markup, for example:
[[File:Another file.jpg |600px|thumb]]
Insert the following markup after the first pipe:
link=https://example.com/display.pl?Another_file.jpg|
So the final markup becomes:
[[File:Another file.jpg |
link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]
The spaces in the original "File: ..." name parts can remain as it's valid markup but the underscores need to exist in link=... strings.
There may be instances where "|link=" occurrences already exist within the opening of a "[[File:" and before its closing "]]" brackets. The regex
should avoid operating on such instances so the procedure can run without
conflict of previous replacement action.
Many thanks for any example code and ideas.
You can do such replacements in modern shells, but since "using fairly standard" isn't exactly an exact specification I provide an example in (standard) awk...
awk '
BEGIN {
p = "link=https://example.com/display.pl?"
}
$0 !~ p && match($0,/\[\[File:[^]|]+/) {
f = substr($0, RSTART+7, RLENGTH-7)
sub(/ $/, "", f)
gsub(/ /, "_", f)
sub(/[|]/, "|" p f "|")
}
1
'
The first sub-condition skips the pattern defined in variable p.
The second condition does a substitution where the pattern appears.
It strips trailing spaces so that they are not replaced by '_',
and finally composes the link.
This code operates on lines containing only one of these patterns.
It assumes no spaces between '[[' and 'File:'.
It's also unclear whether you need to change multiple patterns in a
line or anything else, so it might need some tweaking or refinement.
Janis
Tuxedo
Can anyone assist with a regex using fairly standard and cross-system compatible methods?
It's for files containing wiki markup segments as follows:
[[File:Some File Name 0123.jpg|800px]]
Or maybe:
[[File:Some other file.jpg|250px]]
Or maybe:
[[File:Another file.jpg |600px|thumb]]
etc.
The unique identifiers for the relevant parts are the start of "[[File:" followed by ASCII making up file names ending in some file type, such as .jpg, .JPEG, .Jpeg etc. .PNG, .gif, followed by a "|" pipe or closing "]]" brackets.
The regex needs to grab the filename portion, eg. "Another file.jpg", keep
it in a variable and replace any spaces with underscore(s) within so this updated variable becomes "Another_file.jpg"
Thereafter, within the existing markup, for example:
[[File:Another file.jpg |600px|thumb]]
Insert the following markup after the first pipe:
link=https://example.com/display.pl?Another_file.jpg|
So the final markup becomes:
[[File:Another file.jpg | link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]
The spaces in the original "File: ..." name parts can remain as it's valid markup but the underscores need to exist in link=... strings.
There may be instances where "|link=" occurrences already exist within the opening of a "[[File:" and before its closing "]]" brackets. The regex
should avoid operating on such instances so the procedure can run without conflict of previous replacement action.
Many thanks for any example code and ideas.
Tuxedo
On 02.07.2023 20:02, Janis Papanagnou wrote:
On 02.07.2023 19:14, Tuxedo wrote:
Can anyone assist with a regex using fairly standard and cross-system
compatible methods?
It's for files containing wiki markup segments as follows:
[[File:Some File Name 0123.jpg|800px]]
Or maybe:
[[File:Some other file.jpg|250px]]
Or maybe:
[[File:Another file.jpg |600px|thumb]]
etc.
The unique identifiers for the relevant parts are the start of "[[File:" followed by ASCII making up file names ending in some file type, such as .jpg, .JPEG, .Jpeg etc. .PNG, .gif, followed by a "|" pipe or closing
"]]" brackets.
The regex needs to grab the filename portion, eg. "Another file.jpg",
keep it in a variable and replace any spaces with underscore(s) within
so this updated variable becomes "Another_file.jpg"
Thereafter, within the existing markup, for example:
[[File:Another file.jpg |600px|thumb]]
Insert the following markup after the first pipe:
link=https://example.com/display.pl?Another_file.jpg|
So the final markup becomes:
[[File:Another file.jpg |
link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]
The spaces in the original "File: ..." name parts can remain as it's
valid markup but the underscores need to exist in link=... strings.
There may be instances where "|link=" occurrences already exist within
the opening of a "[[File:" and before its closing "]]" brackets. The
regex should avoid operating on such instances so the procedure can run
without conflict of previous replacement action.
Many thanks for any example code and ideas.
You can do such replacements in modern shells, but since "using fairly
standard" isn't exactly an exact specification I provide an example in
(standard) awk...
awk '
BEGIN {
p = "link=https://example.com/display.pl?"
}
$0 !~ p && match($0,/\[\[File:[^]|]+/) {
f = substr($0, RSTART+7, RLENGTH-7)
sub(/ $/, "", f)
sub(/ +$/, "", f)
In case there's more than one spurious space after the file extension.
gsub(/ /, "_", f)
sub(/[|]/, "|" p f "|")
}
1
'
The first sub-condition skips the pattern defined in variable p.
The second condition does a substitution where the pattern appears.
It strips trailing spaces so that they are not replaced by '_',
and finally composes the link.
This code operates on lines containing only one of these patterns.
It assumes no spaces between '[[' and 'File:'.
It's also unclear whether you need to change multiple patterns in a
line or anything else, so it might need some tweaking or refinement.
Janis
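To illustrate the correction above, here is a small sketch that compares the two trimming expressions on a name followed by two spaces. The helper name trim_then_underscore and its "one"/"all" switch are made up for this illustration and are not part of the posted script.

```shell
# trim_then_underscore MODE NAME
#   MODE "one" uses the original sub(/ $/, ...), anything else uses sub(/ +$/, ...)
trim_then_underscore() {
  awk -v mode="$1" -v f="$2" 'BEGIN {
    if (mode == "one") sub(/ $/, "", f)    # strips a single trailing space
    else               sub(/ +$/, "", f)   # strips the whole run of spaces
    gsub(/ /, "_", f)                      # remaining spaces become "_"
    print f
  }'
}

trim_then_underscore one 'Another file.jpg  '   # prints Another_file.jpg_
trim_then_underscore all 'Another file.jpg  '   # prints Another_file.jpg
```

With only one space stripped, the leftover space ends up as a trailing underscore in the link target, which is what the " +$" form avoids.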
On 7/2/2023 1:02 PM, Janis Papanagnou wrote:
<snip>
awk '
BEGIN {
p = "link=https://example.com/display.pl?"
That `?` at the end means "0 or 1 occurrences of the preceding
expression". I suspect you meant to make the `?` literal and you should
also make the `.`s literal:
p = "link=https://example[.]com/display[.]pl[?]"
Also consider whether or not word boundaries or, more likely, some other method of avoiding undesirable substring matches is required.
Regards,
Ed.
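A quick way to see the effect described above is to test a string against both forms of the pattern. This is only a sketch: the helper name matches and the test string with X and Y in place of the dots are invented for the demonstration.

```shell
# matches PATTERN STRING: report whether STRING matches PATTERN used as a dynamic regex
matches() {
  awk -v p="$1" -v s="$2" 'BEGIN { print (s ~ p ? "match" : "no match") }'
}

# Unescaped: each "." matches any character and "l?" makes the "l" optional.
matches 'link=https://example.com/display.pl?' 'link=https://exampleXcom/displayYp'
# prints: match

# Bracketed: "." and "?" are literal, so the same string no longer matches.
matches 'link=https://example[.]com/display[.]pl[?]' 'link=https://exampleXcom/displayYp'
# prints: no match
```

The unescaped form still matches the real link text, so the posted script works on the sample data; it is just looser than intended.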
There could be a space between '[[' and 'File:' as it's not invalid markup, which I did not know until testing it now. It's however unlikely.
And there could be more than one pattern instance on a single line, but again, it's unlikely. Multiple patterns are almost always on different
lines.
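For those two unlikely cases, one possible refinement is to walk through each line segment by segment instead of handling the line as a whole. The following is a sketch only, not the script posted earlier: the function name add_links_multi is made up, it assumes a "[[File:...]]" segment contains no nested "]" (for example no wiki links inside a caption), and it also drops spaces directly after "File:" when building the link name.

```shell
add_links_multi() {
  awk '
  BEGIN { p = "link=https://example.com/display.pl?" }
  {
    out = ""; rest = $0
    # Visit every complete [[File:...]] segment, allowing spaces after "[[".
    while (match(rest, /\[\[ *File:[^]]*\]\]/)) {
      seg  = substr(rest, RSTART, RLENGTH)
      out  = out substr(rest, 1, RSTART - 1)    # text before the segment
      rest = substr(rest, RSTART + RLENGTH)     # text after the segment
      bar  = index(seg, "|")                    # position of the first pipe
      if (bar && !index(seg, "|link=")) {       # skip segments already linked
        f = substr(seg, 1, bar - 1)
        sub(/^\[\[ *File: */, "", f)            # drop the "[[File:" prefix
        sub(/ +$/, "", f)                       # drop trailing spaces
        gsub(/ /, "_", f)                       # remaining spaces become "_"
        seg = substr(seg, 1, bar) p f "|" substr(seg, bar + 1)
      }
      out = out seg
    }
    print out rest
  }
  ' "$@"
}
```

Usage would be something like `add_links_multi page.txt > page.new`, where page.txt is a placeholder file name. Because each segment is checked for "|link=" on its own, running it a second time should leave already processed segments untouched.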
On 03.07.2023 18:06, Janis Papanagnou wrote:
On 03.07.2023 17:39, Ed Morton wrote:
[...]
and you should also make the `.`s literal:
p = "link=https://example[.]com/display[.]pl[?]"
I forgot to point out (in case it was not obvious)...
Variable p is used in two contexts: as a pattern and as a string
inserted literally into the output, so the bracketed expression above
would not work as it stands.
On 03.07.2023 17:39, Ed Morton wrote:
[...]
and you should also make the `.`s literal:
p = "link=https://example[.]com/display[.]pl[?]"
[...]
I confess I just glanced at the script. Then leave the definition as it
was and change `$0 !~ p` to `!index($0,p)` if you want `p` treated
literally.
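Putting the suggestions from this thread together, the script might end up looking roughly like the sketch below. It keeps the original definition of p, uses index() for the "already linked" test, and includes the " +$" trim. Wrapping it in a shell function named add_links is an assumption made here so it can be called on files or stdin.

```shell
add_links() {
  awk '
  BEGIN {
    p = "link=https://example.com/display.pl?"
  }
  # index() does a plain substring search, so p is never read as a regex.
  !index($0, p) && match($0, /\[\[File:[^]|]+/) {
    f = substr($0, RSTART + 7, RLENGTH - 7)   # text after "[[File:"
    sub(/ +$/, "", f)                          # drop trailing spaces
    gsub(/ /, "_", f)                          # remaining spaces become "_"
    sub(/[|]/, "|" p f "|")                    # insert after the first pipe
  }
  1
  ' "$@"
}

printf '%s\n' '[[File:Another file.jpg |600px|thumb]]' | add_links
```

The earlier caveats still apply: one pattern per line and no space between '[[' and 'File:'. A file name containing "&" would also need extra care, since "&" is special in the replacement string of sub().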