Hello,
I have an HTML page with some links in a standard HTML format, such as:
index.html:
<!-- start links -->
<a href="noix_de_muscade.html">
<img src="nutmeg.jpg>
Noix de muscate</a>
<a href="poivre">
<img src="pepper.jpg">
Poivre</a>
<a href="grains_de_cafe.html">
<img src="coffee_beans.jpg>
Grains de café</a>
<!-- end links -->
I would like to read the link values that appear in the segment of the document between the comments <!-- start links --> and <!-- end links -->, ignoring any other links that may appear above or below that segment.
The relevant parts are the strings from:
<a href="
until:
"
... so only up to the first following double quote ("), and not necessarily up to a closing bracket (">), as some links can look a bit different (as in <a href="green_coffee.jpg" onclick="...etc" >).
What regex can be used to extract these strings?
They should stay in the same order of appearance as in the original HTML file.
Afterwards, they simply need to be printed as a new (JS) array, like this:
("noix_de_muscade.html",
"poivre.html",
"grains_de_cafe.html");
linkextract.pl:
#!/usr/bin/perl -w
open (data, "index.html");
# capture part between <!-- start links --> and <!-- end links -->
# extract the parts between each <a href=" and the
# first " double quote that follows each
# place in array in order of appearance
}
print $data;
Many thanks for any advice and ideas.
Tuxedo
Tuxedo wrote:
I have an HTML page with some links in a standard HTML format, such as:
<a href="noix_de_muscade.html">
<img src="nutmeg.jpg>
Noix de muscate</a>
/<a href=("[^"]*")/
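A minimal sketch of how that pattern might be applied, assuming the whole file has been read into one string first. The sample HTML below (including the elsewhere.html and footer.html links) is made up for illustration, and the variable names are not from the thread:

```perl
use strict;
use warnings;

# Sample input modelled on the question; in practice this would be
# the contents of index.html.
my $html = <<'HTML';
<a href="elsewhere.html">outside</a>
<!-- start links -->
<a href="noix_de_muscade.html">
<img src="nutmeg.jpg">
Noix de muscade</a>
<a href="green_coffee.jpg" onclick="...etc" >
<!-- end links -->
<a href="footer.html">outside</a>
HTML

# Isolate the segment between the two marker comments.
# .*? is non-greedy and /s lets . match newlines.
my ($segment) = $html =~ /<!-- start links -->(.*?)<!-- end links -->/s;
$segment = '' unless defined $segment;

# In list context, /g returns every capture in order of appearance.
# The capture group keeps the surrounding double quotes.
my @links = $segment =~ /<a href=("[^"]*")/g;

print '(', join(",\n", @links), ");\n";
```

With this sample, only the two links inside the marked segment end up in @links, each still wrapped in its double quotes.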
Tuxedo <tuxedo@mailinator.net> wrote:
[...]
0. Your HTML file is missing the closing quotes in the src attributes of the img tags.
1. You may use the HTML::TokeParser module or the more intuitive HTML::TokeParser::Simple:
use strict;
use warnings;
use utf8;
use IO::HTML;
use HTML::TokeParser::Simple;
# Make STDOUT utf8 encoded
binmode(STDOUT, ':utf8');
# html_file - autodetect encoding of the html file
my $p = HTML::TokeParser::Simple->new( html_file('x.html') );
my $n;
my @hrefs; # array to store detected href
my $in_block;
while ( my $token = $p->get_token ) {
    if ( not $in_block ) {
        # skip everything until the start comment is seen
        if ( $token->is_comment() and $token->as_is() =~ /^<!-- start links -->$/ ) {
            $in_block = 1;
        }
        next;
    }
    elsif ( $token->is_start_tag('a') and defined( $token->get_attr('href') ) ) {
        printf "%d: %s\n", ++$n, $token->get_attr('href');
        push( @hrefs, $token->get_attr('href') );
    }
    elsif ( $token->is_comment() and $token->as_is() =~ /^<!-- end links -->$/ ) {
        # stop at the end comment
        last;
    }
}
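The loop above fills @hrefs but does not print the JS-style list the question asked for. One possible way to do that, as a sketch only: the @hrefs array here is a hard-coded stand-in for the one built by the parser, and js_quote is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the @hrefs array filled by the parser loop.
my @hrefs = ('noix_de_muscade.html', 'poivre', 'grains_de_cafe.html');

# Wrap a value in double quotes, escaping backslashes and double quotes
# so the result is a valid JS string literal.
sub js_quote {
    my ($s) = @_;
    $s =~ s/(["\\])/\\$1/g;
    return qq("$s");
}

# Parenthesised, comma-separated list as shown in the question.
my $js = '(' . join(",\n", map { js_quote($_) } @hrefs) . ');';
print "$js\n";
```

Because the parser returns attribute values without their quotes, the quoting has to be added back at output time, which is also the natural place to escape any awkward characters.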
On Mon, 26 Sep 2022 06:59:51 +0200, in article <tgrc6u$43sr$1@solani.org>, Tuxedo wrote:
[...]
linkextract.pl:
#!/usr/bin/perl -w
open (data, "index.html");
That is an extremely old form of Perl. Slightly modernised, it would be:
#!/usr/bin/perl
use strict;
use warnings;
open(my $data, '<', 'index.html')
    or die "Couldn't open 'index.html' for reading: $!";
# capture part between <!-- start links --> and <!-- end links -->
my @links;
while (<$data>) {
if (/<!-- start links -->/ .. /<!-- end links -->/) {
...;
}
}
# extract the parts between each <a href=" and the
# first " double quote that follows each
/<a href=("[^"]*")/
# place in array in order of appearance
push @links, $1
print $data;
print '(', join(', ', @links), ')';
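Putting these snippets together gives us:
#!/usr/bin/perl
use strict;
use warnings;
# three-argument open with a lexical filehandle
open(my $data, '<', 'index.html')
    or die "Couldn't open 'index.html' for reading: $!";
my @links;
while (<$data>) {
    # the range operator is true from the start comment through the end comment
    if (/<!-- start links -->/ .. /<!-- end links -->/) {
        # captures the first href on each line, double quotes included
        push @links, $1 if /<a href=("[^"]*")/;
    }
}
close $data;
print '(', join(', ', @links), ')';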
In comp.lang.perl.misc,
Shvili, the Kookologist <kooks-and-cranks@kookology.invalid> wrote:
Kookologist you say?
Tuxedo wrote:
I have an HTML page with some links in a standard HTML format, such as:
<a href="noix_de_muscade.html">
<img src="nutmeg.jpg>
Noix de muscate</a>
Oddly formatted HTML (or, worse, malformed like that missing quote) will
be the bane of your existence using regexps to parse HTML.
/<a href=("[^"]*")/
<a class="linkmain" href="page1.html">
<a href='page2.html'>
<A HREF="page3.html">
<a href=page4.html>
<a href = "page5.html">
<a
href="page6.html">
<!-- <a href="[[placeholder]]"> -->
And that doesn't begin to cover the malformed HTML.
See the TokeParser answer from Andrzej Adam Filip for a better way.
Elijah
------
just because it looks easy doesn't mean it is
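The variants listed above can be run against the simple pattern to see how it behaves. This is only an illustrative sketch; the @variants list copies the examples from the post, and the other names are made up:

```perl
use strict;
use warnings;

# The tag variants from the post, one string each.
my @variants = (
    '<a class="linkmain" href="page1.html">',
    q{<a href='page2.html'>},
    '<A HREF="page3.html">',
    '<a href=page4.html>',
    '<a href = "page5.html">',
    qq{<a\nhref="page6.html">},
    '<!-- <a href="[[placeholder]]"> -->',
);

my (@matched, @missed);
for my $tag (@variants) {
    if ($tag =~ /<a href=("[^"]*")/) {
        push @matched, $1;
    }
    else {
        push @missed, $tag;
    }
}

printf "matched: %s\n", join(', ', @matched);
printf "missed %d of %d variants\n", scalar @missed, scalar @variants;
```

The pattern misses all six real links and matches only the commented-out placeholder, which is a false positive. A tokenizing parser sees the comment as a comment and the attributes as attributes, so it avoids both kinds of error.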