• Read and return

    From Tuxedo@21:1/5 to All on Mon Sep 26 06:59:51 2022
    Hello,

    I have an HTML page with some links in a standard HTML format, such as:

    index.html:

    <!-- start links -->

    <a href="noix_de_muscade.html">
    <img src="nutmeg.jpg>
    Noix de muscate</a>

    <a href="poivre">
    <img src="pepper.jpg">
    Poivre</a>

    <a href="grains_de_cafe.html">
    <img src="coffee_beans.jpg>
    Grains de café</a>

    <!-- end links -->


    I would like to read the link values that appear in the segment of the document between the comments <!-- start links --> and <!-- end links -->, avoiding other links that may appear elsewhere above and below that segment.

    The relevant parts are the strings from:

    <a href="

    until:

    "

    ... so only until the first occurrence of a double quote " in each case, not necessarily including a closing bracket (">), as some links can look a bit different (as in <a href="green_coffee.jpg" onclick="...etc" >).

    What regex can be used to extract these strings?

    They should be in the same order of appearance as in the original HTML file.

    Afterwards, they simply need to be printed as a new (JS) array, like this:

    ("noix_de_muscade.html",
    "poivre.html",
    "grains_de_cafe.html");


    linkextract.pl:

    #!/usr/bin/perl -w

    open (data, "index.html");

    # capture part between <!-- start links --> and <!-- end links -->

    # extract the parts between each <a href=" and the
    # first " double quote that follows each

    # place in array in order of appearance

    }

    print $data;


    Many thanks for any advice and ideas.

    Tuxedo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andrzej Adam Filip@21:1/5 to Tuxedo on Mon Sep 26 10:28:26 2022
    0. Your HTML file is missing the closing quote in the src attribute of two img tags.
    1. You may use the HTML::TokeParser module or the more intuitive HTML::TokeParser::Simple:

    use strict;
    use warnings;
    use utf8;

    use IO::HTML;
    use HTML::TokeParser::Simple;

    # Make STDOUT UTF-8 encoded
    binmode(STDOUT,':utf8');

    # html_file - autodetect encoding of the html file
    my $p = HTML::TokeParser::Simple->new( html_file('x.html') );

    my $n;
    my @hrefs; # array to store detected href
    my $in_block;
    while ( my $token = $p->get_token ) {
        if( not $in_block ) {
            if( $token->is_comment() and $token->as_is() =~ /^<!-- start links -->$/ ) {
                $in_block = 1;
            }
            next;
        }elsif( $token->is_start_tag('a') and defined($token->get_attr('href'))){
            printf "%d: %s\n", ++$n, $token->get_attr('href');
            push( @hrefs, $token->get_attr('href'));
        }elsif( $token->is_comment() and $token->as_is() =~ /^<!-- end links -->$/ ) {
            last;
        }
    }
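
To get from @hrefs to the JS array the original post asks for, one option is a small serialisation step. This is a minimal sketch, assuming the core JSON::PP module is acceptable; the helper js_array and the variable name links are illustrative and not from the thread. It emits square brackets, the usual JS array-literal form, rather than the parentheses shown in the question.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: turn a list of href strings into a JS array literal.
# JSON::PP takes care of escaping quotes and backslashes inside the values.
sub js_array {
    my @hrefs = @_;
    return JSON::PP->new->encode( \@hrefs );
}

# Stand-in values; in the script above these would come from @hrefs.
my @hrefs = ( 'noix_de_muscade.html', 'poivre.html', 'grains_de_cafe.html' );
print 'var links = ', js_array(@hrefs), ";\n";
```

Using a JSON encoder instead of join() means an href that happens to contain a double quote or backslash still yields valid JS.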

    --
    [Andrew] Andrzej A. Filip

  • From Shvili, the Kookologist@21:1/5 to All on Mon Sep 26 11:20:19 2022
    On Mon, 26 Sep 2022 06:59:51 +0200, in article <tgrc6u$43sr$1@solani.org>, Tuxedo
    wrote:

    linkextract.pl:

    #!/usr/bin/perl -w

    open (data, "index.html");

    That is an extremely old form of Perl. Slightly modernised, it would be:

    #!/usr/bin/perl

    use strict;
    use warnings;

    open (my $data, '<', "index.html")
        or die "Couldn't open 'index.html' for reading: $!";



    # capture part between <!-- start links --> and <!-- end links -->

    my @links;

    while (<$data>) {
        if (/<!-- start links -->/ .. /<!-- end links -->/) {
            ...;
        }
    }

    # extract the parts between each <a href=" and the
    # first " double quote that follows each

    /<a href=("[^"]*")/

    # place in array in order of appearance

    push @links, $1

    print $data;

    print '(', join(', ', @links), ')';

    Putting these snippets together gives us:


    #!/usr/bin/perl

    use strict;
    use warnings;

    open (my $data, '<', "index.html")
        or die "Couldn't open 'index.html' for reading: $!";

    my @links;

    while (<$data>) {
        if (/<!-- start links -->/ .. /<!-- end links -->/) {
            push @links, $1 if /<a href=("[^"]*")/;
        }
    }

    print '(', join(', ', @links), ')';
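
The range operator in the script is doing the block selection: in scalar context it becomes true on the line matching the first pattern and stays true through the line matching the second. The sketch below shows that behaviour on an in-memory sample instead of index.html. The helper name links_in_block and the sample lines are made up for illustration, and the /g match is an addition that also picks up several links sharing one line.

```perl
use strict;
use warnings;

# Sample input held in memory; the thread's script reads index.html instead.
my $html = <<'HTML';
<a href="outside_before.html">ignored</a>
<!-- start links -->
<a href="noix_de_muscade.html">Noix</a> <a href="poivre.html">Poivre</a>
<a href="grains_de_cafe.html">Grains</a>
<!-- end links -->
<a href="outside_after.html">ignored</a>
HTML

# Hypothetical helper: collect href values between the two marker comments.
sub links_in_block {
    my ($text) = @_;
    my @links;
    for my $line ( split /\n/, $text ) {
        # Scalar-context range operator: true from the start marker line
        # through the end marker line, inclusive.
        next unless $line =~ /<!-- start links -->/ .. $line =~ /<!-- end links -->/;
        # /g in list context returns every capture on the line, in order.
        push @links, $line =~ /<a href="([^"]*)"/g;
    }
    return @links;
}

print join( "\n", links_in_block($html) ), "\n";
```

Here the capture group sits inside the quotes, so the values come back unquoted; the script above keeps the quotes in the capture so that join() can print them directly.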


  • From Eli the Bearded@21:1/5 to the Kookologist on Mon Sep 26 18:16:20 2022
    In comp.lang.perl.misc,
    Shvili, the Kookologist <kooks-and-cranks@kookology.invalid> wrote:

    Kookologist you say?

    Tuxedo wrote:
    I have an HTML page with some links in a standard HTML format, such as:
    <a href="noix_de_muscade.html">
    <img src="nutmeg.jpg>
    Noix de muscate</a>

    Oddly formatted HTML (or, worse, malformed like that missing quote) will
    be the bane of your existence using regexps to parse HTML.

    /<a href=("[^"]*")/

    <a class="linkmain" href="page1.html">
    <a href='page2.html'>
    <A HREF="page3.html">
    <a href=page4.html>
    <a href = "page5.html">
    <a
    href="page6.html">
    <!-- <a href="[[placeholder]]"> -->

    And that doesn't begin to cover the malformed HTML.
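
To make the point concrete, the sample lines can be run through the quoted pattern one line at a time, the way a line-by-line reader would see them. This is only an illustrative sketch: the helper name simple_href is made up, and the multi-line tag is split into two entries to mimic line-by-line reading. As far as I can tell, none of the six real links match, while the commented-out placeholder does.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the pattern quoted above to a single line and
# return the captured value (quotes included), or undef if nothing matched.
sub simple_href {
    my ($line) = @_;
    return $line =~ /<a href=("[^"]*")/ ? $1 : undef;
}

my @samples = (
    '<a class="linkmain" href="page1.html">',   # other attribute first
    q{<a href='page2.html'>},                   # single quotes
    '<A HREF="page3.html">',                    # upper case
    '<a href=page4.html>',                      # unquoted value
    '<a href = "page5.html">',                  # spaces around =
    '<a',                                       # tag split over two lines
    'href="page6.html">',
    '<!-- <a href="[[placeholder]]"> -->',      # commented-out link
);

for my $line (@samples) {
    my $href = simple_href($line);
    printf "%-42s %s\n", $line, defined $href ? "match: $href" : 'no match';
}
```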

    See the TokeParser answer from Andrzej Adam Filip for a better way.

    Elijah
    ------
    just because it looks easy doesn't mean it is

  • From Tuxedo@21:1/5 to Andrzej Adam Filip on Tue Sep 27 06:02:35 2022
    Thank you for sharing this solution.

    In this case, the HTML is pre-processed when a page is edited rather than
    on each and every web request, so I think I will use the procedure by Shvili
    the Kookologist in the next post, mainly to avoid keeping track of
    additional modules. I've not used HTML::TokeParser, but it appears useful for things that require more flexibility.

    Tuxedo

  • From Tuxedo@21:1/5 to the Kookologist on Tue Sep 27 06:11:22 2022
    Thank you for posting this old and modernised yet perfectly working
    solution.

    I will use it to capture links in the order they appear on an overview (index) page, keeping the navigational links between the individual pages in sync without having to manually update the same links in a separate (JS) script.

    It's an HTML pre-publication process.

    Tuxedo

  • From Tuxedo@21:1/5 to Eli the Bearded on Tue Sep 27 06:15:54 2022
    Thanks for pointing this out. I didn't see the malformed HTML. It would certainly break the regex procedure and perhaps also fail as an HTML link. Normally, I type a bit better :-)

    Tuxedo

