• Vanilla regex

    From Tuxedo@21:1/5 to All on Sun Jul 2 19:14:01 2023
    Can anyone assist with a regex using fairly standard and cross-system compatible methods?

    It's for files containing wiki markup segments as follows:

    [[File:Some File Name 0123.jpg|800px]]

    Or maybe:

    [[File:Some other file.jpg|250px]]

    Or maybe:

    [[File:Another file.jpg |600px|thumb]]

    etc.

    The unique identifiers for the relevant parts are the start of "[[File:" followed by ASCII making up file names ending in some file type, such as
    .jpg, .JPEG, .Jpeg etc. .PNG, .gif, followed by a "|" pipe or closing "]]" brackets.

    The regex needs to grab the filename portion, eg. "Another file.jpg", keep
    it in a variable and replace any spaces with underscore(s) within so this updated variable becomes "Another_file.jpg"

    Thereafter, within the existing markup, for example:

    [[File:Another file.jpg |600px|thumb]]

    Insert the following markup after the first pipe:

    link=https://example.com/display.pl?Another_file.jpg|

    So the final markup becomes:

    [[File:Another file.jpg |link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]

    The spaces in the original "File: ..." name parts can remain as it's valid markup but the underscores need to exist in link=... strings.

    There may be instances where "|link=" already exists between the opening "[[File:" and its closing "]]" brackets. The regex
    should skip such instances so the procedure can be re-run without conflicting with earlier replacements.

    Many thanks for any example code and ideas.

    Tuxedo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Sun Jul 2 20:09:03 2023
    On 02.07.2023 20:02, Janis Papanagnou wrote:
    > On 02.07.2023 19:14, Tuxedo wrote:
    > [...]

    > You can do such replacements in modern shells, but since "using fairly
    > standard" isn't exactly an exact specification I provide an example in
    > (standard) awk...
    >
    > awk '
    > BEGIN {
    >   p = "link=https://example.com/display.pl?"
    > }
    > $0 !~ p && match($0,/\[\[File:[^]|]+/) {
    >   f = substr($0, RSTART+7, RLENGTH-7)
    >   sub(/ $/, "", f)

    sub(/ +$/, "", f)

    In case there's more than one spurious space after the file extension.

    >   gsub(/ /, "_", f)
    >   sub(/[|]/, "|" p f "|")
    > }
    > 1
    > '
    >
    > [...]

    Janis



  • From Janis Papanagnou@21:1/5 to Tuxedo on Sun Jul 2 20:02:44 2023
    On 02.07.2023 19:14, Tuxedo wrote:
    > [...]

    You can do such replacements in modern shells, but since "using fairly standard" isn't exactly an exact specification I provide an example in (standard) awk...

    awk '
    BEGIN {
      p = "link=https://example.com/display.pl?"
    }
    $0 !~ p && match($0,/\[\[File:[^]|]+/) {
      f = substr($0, RSTART+7, RLENGTH-7)
      sub(/ $/, "", f)
      gsub(/ /, "_", f)
      sub(/[|]/, "|" p f "|")
    }
    1
    '

    The first sub-condition skips lines that already match the pattern defined in variable p.
    The second condition does a substitution on lines where a "[[File:" pattern appears.
    It strips a trailing space so that it doesn't get replaced by '_',
    and finally composes the link.

    This code operates on lines containing only one of these patterns.
    It assumes no spaces between '[[' and 'File:'.
    It's also unclear whether you need to change multiple patterns in a
    line or anything else, so it might need some tweaking or refinement.

    Janis
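
To see what match() and substr() pick out in the script above, here is a minimal sketch that prints RSTART, RLENGTH and the extracted name for one of the sample lines. The helper name show_match is made up for illustration; only POSIX awk features are assumed.

```shell
# show_match is a hypothetical helper name, used only for this illustration.
show_match() {
  awk 'match($0, /\[\[File:[^]|]+/) {
    # "[[File:" is 7 characters long, hence the +7 / -7 offsets.
    name = substr($0, RSTART + 7, RLENGTH - 7)
    printf "RSTART=%d RLENGTH=%d name=<%s>\n", RSTART, RLENGTH, name
  }'
}

printf '%s\n' '[[File:Another file.jpg |600px|thumb]]' | show_match
# prints: RSTART=1 RLENGTH=24 name=<Another file.jpg >
```

The space before the closing angle bracket is the trailing space that has to be stripped before the remaining spaces are turned into underscores.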


  • From Tuxedo@21:1/5 to Janis Papanagnou on Mon Jul 3 14:49:13 2023
    Janis Papanagnou wrote:

    > [...]


    Many thanks for posting the example along with the explanatory notes.

    There could be a space between '[[' and 'File:' as it's not invalid markup, which I did not know until testing it now. It is, however, unlikely.

    And there could be more than one pattern instance on a single line, but
    again, it's unlikely. Multiple patterns are almost always on different
    lines.

    I will bear this in mind and tailor to my purpose.

    Tuxedo

  • From Janis Papanagnou@21:1/5 to Ed Morton on Mon Jul 3 18:06:32 2023
    On 03.07.2023 17:39, Ed Morton wrote:
    > On 7/2/2023 1:02 PM, Janis Papanagnou wrote:
    > <snip>
    >> awk '
    >> BEGIN {
    >>   p = "link=https://example.com/display.pl?"
    >
    > That `?` at the end means "0 or 1 occurrences of the preceding
    > expression". I suspect you meant to make the `?` literal

    Actually, no. What I meant was to keep the _sample_ *simple*,
    since regexp meta-symbols in strings that contain patterns
    become bulky.

    What I considered (while preserving the simplicity) was to just
    remove the '?' from the string...

    p = "link=https://example.com/display.pl"

    and later write

    sub(/[|]/, "|" p "?" f "|")

    but as I wrote, it's not worth the hassle for the sample where
    more important questions were still unclear (as previously said).

    Of course you could also just simplify writing (without the dot)
    for the match

    p = "link=https://example.com/display"

    and this will still fail if the pattern appears elsewhere
    in the data. And how likely is it that the two dots in
    link=https://example.com/display.pl
    will match, say, link=https://exampleXcom/displayYpl? Not very,
    don't you think?

    Janis

    > and you should
    > also make the `.`s literal:
    >
    > p = "link=https://example[.]com/display[.]pl[?]"
    >
    > Also consider whether or not word boundaries or, more likely, some other
    > method of avoiding undesirable substring matches is required.
    >
    > Regards,
    >
    > Ed.


  • From Ed Morton@21:1/5 to Tuxedo on Mon Jul 3 10:56:34 2023
    On 7/3/2023 7:49 AM, Tuxedo wrote:
    <snip>
    > There could be a space between '[[' and 'File:' as it's not invalid
    > markup, which I did not know until testing it now. It's however unlikely.
    >
    > And there could be more than one pattern instance on a single line, but
    > again, it's unlikely. Multiple patterns are almost always on different
    > lines.

    If you post a block of text containing concise, testable sample input
    that covers ALL of your use cases and a separate block of text showing
    the exact output you expect from that input then we'll have something we
    can copy/paste to easily test a potential solution against and so we can
    help you.

    Regards,

    Ed.
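
As an illustration of the kind of test case asked for here, the sketch below pairs a block of sample input with the output expected from it and checks a candidate filter against both. The first three input lines come from the original post; the fourth (an already-linked line) and the expected output are assumptions derived by hand from the rule stated there. The candidate is the consolidated awk script posted later in this thread, and the names candidate, input, expected and actual are made up for the example.

```shell
# candidate: the filter under test (here the consolidated awk script from this thread).
candidate() {
  awk '
  BEGIN { p = "link=https://example.com/display.pl" }
  !index($0, p) && match($0, /\[\[File:[^]|]+/) {
    f = substr($0, RSTART+7, RLENGTH-7)
    sub(/ +$/, "", f)
    gsub(/ /, "_", f)
    sub(/[|]/, "|" p "?" f "|")
  }
  { print }
  '
}

# Sample input: three lines from the original post plus one already-linked line (assumed).
input='[[File:Some File Name 0123.jpg|800px]]
[[File:Some other file.jpg|250px]]
[[File:Another file.jpg |600px|thumb]]
[[File:Linked file.jpg|link=https://example.com/display.pl?Linked_file.jpg|100px]]'

# Expected output, written by hand from the rule in the original post.
expected='[[File:Some File Name 0123.jpg|link=https://example.com/display.pl?Some_File_Name_0123.jpg|800px]]
[[File:Some other file.jpg|link=https://example.com/display.pl?Some_other_file.jpg|250px]]
[[File:Another file.jpg |link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]
[[File:Linked file.jpg|link=https://example.com/display.pl?Linked_file.jpg|100px]]'

actual=$(printf '%s\n' "$input" | candidate)
if [ "$actual" = "$expected" ]; then echo PASS; else echo FAIL; fi
```

Further cases (a space after '[[', two instances on one line) can be added to both blocks in the same way once the desired output for them is decided.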

  • From Ed Morton@21:1/5 to Tuxedo on Mon Jul 3 10:50:21 2023
    On 7/2/2023 12:14 PM, Tuxedo wrote:
    > Can anyone assist with a regex using fairly standard and cross-system
    > compatible methods?

    There are 2 different POSIX regex standards:

    BRE: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03
    ERE: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04

    and another fairly commonly used regex notation:

    PCRE: https://www.pcre.org/

    Also, every tool has its own options, extensions, delimiters,
    backreferences support, other enhancements/considerations, etc. for
    whatever regex flavor(s) it supports.

    So there is no "fairly standard and cross-system compatible" regex
    notation but Janis gave you an answer using AWK and only using POSIX
    constructs within that script and that's the best you could do regarding portability and usability as AWK is the most powerful mandatory POSIX
    tool (i.e. must be present on all Unix boxes) for manipulating text.

    Regards,

    Ed.
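
A small sketch of how the two POSIX flavors differ for the pattern used in this thread, assuming a POSIX-conforming grep (plain grep uses BRE, grep -E uses ERE, as does awk). The variable name line is made up for the example; grep -c prints the number of matching lines.

```shell
line='[[File:Another file.jpg |600px|thumb]]'

# ERE (grep -E, awk): "+" is a quantifier, so the pattern matches.
printf '%s\n' "$line" | grep -E -c '\[\[File:[^]|]+'

# BRE (plain grep, sed): "+" is an ordinary character, so nothing matches.
# grep exits non-zero in that case, hence the "|| true".
printf '%s\n' "$line" | grep -c '\[\[File:[^]|]+' || true

# BRE spelling of "one or more": the interval expression \{1,\}.
printf '%s\n' "$line" | grep -c '\[\[File:[^]|]\{1,\}'
```

With a conforming grep the three commands print 1, 0 and 1, which is why a pattern written for awk cannot be pasted into sed or plain grep unchanged.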

  • From Ed Morton@21:1/5 to Janis Papanagnou on Mon Jul 3 10:39:49 2023
    On 7/2/2023 1:02 PM, Janis Papanagnou wrote:
    <snip>
    > awk '
    > BEGIN {
    >   p = "link=https://example.com/display.pl?"

    That `?` at the end means "0 or 1 occurrences of the preceding
    expression". I suspect you meant to make the `?` literal and you should
    also make the `.`s literal:

    p = "link=https://example[.]com/display[.]pl[?]"

    Also consider whether or not word boundaries or, more likely, some other
    method of avoiding undesirable substring matches is required.

    Regards,

    Ed.

  • From Ed Morton@21:1/5 to Janis Papanagnou on Mon Jul 3 13:07:09 2023
    On 7/3/2023 12:57 PM, Janis Papanagnou wrote:
    > On 03.07.2023 18:06, Janis Papanagnou wrote:
    >> On 03.07.2023 17:39, Ed Morton wrote:
    >> [...]
    >>
    >>> and you should also make the `.`s literal:
    >>>
    >>> p = "link=https://example[.]com/display[.]pl[?]"
    >
    > I forgot to point out (in case it was not obvious)...
    >
    > Variable p was used in two contexts, as pattern and as variable
    > to be printed literally; so above expression would not qualify.

    I confess I just glanced at the script. Then leave the definition as it
    was and change `$0 !~ p` to `!index($0,p)` if you want `p` treated
    literally.

    Ed.
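
The difference between the two tests can be sketched as follows. The helper name compare_match and the second test string are made up for illustration; p is the string from the original script.

```shell
# compare_match is a hypothetical helper; it tests one string both ways.
compare_match() {
  awk -v s="$1" 'BEGIN {
    p = "link=https://example.com/display.pl?"
    as_regex  = (s ~ p)            # p interpreted as an ERE
    as_string = (index(s, p) > 0)  # p compared character by character
    printf "regex=%d index=%d\n", as_regex, as_string
  }'
}

compare_match 'link=https://example.com/display.pl?Another_file.jpg'
# prints: regex=1 index=1
compare_match 'link=https://exampleXcom/displayYp'
# prints: regex=1 index=0
```

As a regex, each "." matches any character and "l?" makes the final "l" optional, so the second string matches as well; index() only reports the literal text.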

  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Mon Jul 3 19:57:35 2023
    On 03.07.2023 18:06, Janis Papanagnou wrote:
    > On 03.07.2023 17:39, Ed Morton wrote:
    > [...]
    >
    >> and you should also make the `.`s literal:
    >>
    >> p = "link=https://example[.]com/display[.]pl[?]"

    I forgot to point out (in case it was not obvious)...

    Variable p was used in two contexts, as pattern and as variable
    to be printed literally; so above expression would not qualify.

    Janis

    [...]

  • From Janis Papanagnou@21:1/5 to Ed Morton on Mon Jul 3 21:15:47 2023
    On 03.07.2023 20:07, Ed Morton wrote:

    > I confess I just glanced at the script. Then leave the definition as it
    > was and change `$0 !~ p` to `!index($0,p)` if you want `p` treated
    > literally.

    Yes, indeed. It occurred to me only after my post. Maybe it's time for
    an update now to add and summarize all the little details in one code
    sample...

    awk '
    BEGIN { p = "link=https://example.com/display.pl" }

    !index($0, p) && match($0, /\[\[File:[^]|]+/) {
      f = substr($0, RSTART+7, RLENGTH-7)
      sub(/ +$/, "", f)
      gsub(/ /, "_", f)
      sub(/[|]/, "|" p "?" f "|")
    }

    { print }
    '

    Of course the OP has meanwhile mentioned a couple more requirements, and
    some further tweaks could be added (replacing the ' ' by a character
    class, adding support for white-space after "File:", etc.), but for the
    intended purpose it's sufficient, I suppose.

    Janis
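
For the additional requirements mentioned in the thread (optional white-space after '[[' and after "File:", and more than one instance per line), one possible variant is sketched below. It is not from the thread: the wrapper name add_links and the variable names are made up, it assumes an existing link is spelled exactly "|link=", it leaves instances without any pipe unchanged, and it assumes file names contain neither "]" nor "|".

```shell
# add_links is a hypothetical wrapper name; base is the URL prefix from the thread.
add_links() {
  awk -v base='https://example.com/display.pl' '
  {
    out = ""; rest = $0
    # Handle every "[[File:name" occurrence on the line, left to right.
    while (match(rest, /\[\[[ \t]*File:[^]|]+/)) {
      tag  = substr(rest, RSTART, RLENGTH)            # "[[ File:name "
      head = substr(rest, 1, RSTART + RLENGTH - 1)    # text up to the end of the name
      rest = substr(rest, RSTART + RLENGTH)           # starts at "|" or "]]"

      # Parameters of this occurrence: the text before its closing "]]".
      stop = index(rest, "]]")
      params = (stop ? substr(rest, 1, stop - 1) : rest)

      # Only touch occurrences that have a pipe and no "|link=" yet.
      if (substr(rest, 1, 1) == "|" && !index(params, "|link=")) {
        f = tag
        sub(/^\[\[[ \t]*File:[ \t]*/, "", f)   # drop the "[[File:" prefix
        sub(/[ \t]+$/, "", f)                  # drop trailing blanks
        gsub(/[ \t]/, "_", f)                  # remaining blanks become "_"
        head = head "|link=" base "?" f
      }
      out = out head
    }
    print out rest
  }'
}

printf '%s\n' 'See [[ File:A b.jpg|10px]] and [[File:C d.png |link=x|thumb]]' | add_links
# prints: See [[ File:A b.jpg|link=https://example.com/display.pl?A_b.jpg|10px]] and [[File:C d.png |link=x|thumb]]
```

Because the "|link=" check is made per instance rather than per line, running the filter a second time over its own output leaves it unchanged.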
