Forum: >>> Magnum BBS <<<

From Tuxedo@21:1/5 to All on Mon Sep 26 06:59:51 2022

Hello,

I have an HTML page with some links in a standard HTML format, such as:

index.html:



<a href="noix_de_muscade.html">
<img src="nutmeg.jpg>
Noix de muscate</a>

<a href="poivre">
<img src="pepper.jpg">
Poivre</a>

<a href="grains_de_cafe.html">
<img src="coffee_beans.jpg>
Grains de café</a>



I would like to read the link values that appear in the segment of the document between the comments ( and ), avoiding other links that may appear elsewhere above and below that segment.

The relevant parts are the strings from:

<a href="

until:

"

... so only up to the first occurrence of a double quote ("), not necessarily including a closing bracket (">), as some links can look a bit different (as in <a href="green_coffee.jpg" onclick="...etc" >).

What regex can be used to extract these strings?

They should come out in the same order of appearance as in the original HTML file.

Afterwards, they simply need to be printed in a new (JS) array, like this:

("noix_de_muscade.html",
"poivre.html",
"grains_de_cafe.html");

linkextract.pl:

#!/usr/bin/perl -w

open (data, "index.html");

# capture part between  and 

# extract the parts between each <a href=" and the
# first " double quote that follows each

# place in array in order of appearance

}

print $data;

Many thanks for any advice and ideas.

Tuxedo

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andrzej Adam Filip@21:1/5 to Tuxedo on Mon Sep 26 10:28:26 2022

0. Your HTML file is missing the closing quotes in the src attributes of the img tags.
1. You may use the HTML::TokeParser module or the more intuitive HTML::TokeParser::Simple:

use strict;
use warnings;
use utf8;

use IO::HTML;
use HTML::TokeParser::Simple;

# Make STDOUT UTF-8 encoded
binmode(STDOUT,':utf8');

# html_file - autodetect encoding of the html file
my $p = HTML::TokeParser::Simple->new( html_file('x.html') );

my $n;
my @hrefs; # array to store detected href
my $in_block;
while ( my $token = $p->get_token ) {
    if( not $in_block ) {
        if( $token->is_comment() and $token->as_is() =~ /^$/ ) {
            $in_block = 1;
        }
        next;
    }elsif( $token->is_start_tag('a') and defined($token->get_attr('href'))){
        printf "%d: %s\n", ++$n, $token->get_attr('href');
        push( @hrefs, $token->get_attr('href'));
    }elsif( $token->is_comment() and $token->as_is() =~ /^$/ ) {
        last;
    }
}
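
The loop above collects the links in @hrefs but does not print them as a JavaScript array. A minimal sketch of that last step, assuming the core JSON::PP module is acceptable. The sample @hrefs values and the js_array helper are made up for illustration, and the output is a bracketed array literal rather than the parenthesised list shown in the question:

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical sample data standing in for the @hrefs collected by the loop.
my @hrefs = ('noix_de_muscade.html', 'poivre.html', 'grains_de_cafe.html');

# Turn a list of strings into the text of a JavaScript array literal.
# JSON::PP takes care of quoting and escaping each string.
sub js_array {
    my @items = @_;
    return JSON::PP->new->encode(\@items);
}

print js_array(@hrefs), "\n";
```

Letting an encoder do the quoting avoids broken output if a link ever contains a double quote or a backslash.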

--
[Andrew] Andrzej A. Filip


From Shvili, the Kookologist@21:1/5 to All on Mon Sep 26 11:20:19 2022

On Mon, 26 Sep 2022 06:59:51 +0200, in article <tgrc6u$43sr$1@solani.org>, Tuxedo
wrote:

linkextract.pl:

#!/usr/bin/perl -w

open (data, "index.html");

That is an extremely old form of Perl. Slightly modernised, it would be:

#!/usr/bin/perl

use strict;
use warnings;

open (my $data, "index.html")
    or die "Couldn't open 'index.html' for reading: $!";

# capture part between  and 

my @links;

while (<$data>) {
    if (// .. //) {
        ...;
    }
}

# extract the parts between each <a href=" and the
# first " double quote that follows each

/<a href=("[^"]*")/

# place in array in order of appearance

push @links, $1

print $data;

print '(', join(', ', @links), ')';

Putting these snippets together gives us:

#!/usr/bin/perl

use strict;
use warnings;

open (my $data, "index.html")
    or die "Couldn't open 'index.html' for reading: $!";

my @links;

while (<$data>) {
    if (// .. //) {
        push @links, $1 if /<a href=("[^"]*")/;
    }
}

print '(', join(', ', @links), ')';
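
For reference, a self-contained variant of the script above. The marker comments (links-start and links-end), the extract_links helper, and the sample HTML are assumptions made for this sketch, so substitute the comments your page actually uses. It reads from an in-memory string to stay runnable without index.html, uses three-argument open, and collects every link on a line rather than only the first:

```perl
use strict;
use warnings;

# ASSUMPTION: these marker comments are invented for this sketch;
# replace them with the comments that really delimit the block.
my $START = qr/<!--\s*links-start\s*-->/;
my $END   = qr/<!--\s*links-end\s*-->/;

# Return the href values found between the two marker comments,
# in order of appearance. Reads line by line from a filehandle.
sub extract_links {
    my ($fh) = @_;
    my @links;
    while (my $line = <$fh>) {
        # The range operator is true from the line with the start marker
        # through the line with the end marker, inclusive.
        if ($line =~ $START .. $line =~ $END) {
            # With /g in list context, every match on the line is returned.
            push @links, $line =~ /<a href="([^"]*)"/g;
        }
    }
    return @links;
}

# Hypothetical sample page with links above, inside, and below the block.
my $html = <<'HTML';
<a href="above.html">outside</a>
<!-- links-start -->
<a href="noix_de_muscade.html">
<img src="nutmeg.jpg">
Noix de muscade</a>
<a href="poivre.html">
<img src="pepper.jpg">
Poivre</a>
<!-- links-end -->
<a href="below.html">outside</a>
HTML

# Open the string as an in-memory file.
open(my $fh, '<', \$html) or die "Cannot open in-memory file: $!";
my @links = extract_links($fh);
close $fh;

print '(', join(', ', map { qq("$_") } @links), ");\n";
```

To run it against a real page, open 'index.html' with the same three-argument form and pass that filehandle to extract_links instead.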


From Eli the Bearded@21:1/5 to the Kookologist on Mon Sep 26 18:16:20 2022

In comp.lang.perl.misc,
Shvili, the Kookologist <kooks-and-cranks@kookology.invalid> wrote:

Kookologist you say?

Tuxedo wrote:

I have an HTML page with some links in a standard HTML format, such as:
<a href="noix_de_muscade.html">
<img src="nutmeg.jpg>
Noix de muscate</a>

Oddly formatted HTML (or, worse, malformed like that missing quote) will
be the bane of your existence using regexps to parse HTML.

/<a href=("[^"]*")/

<a class="linkmain" href="page1.html">
<a href='page2.html'>
<A HREF="page3.html">
<a href=page4.html>
<a href = "page5.html">
<a
href="page6.html">


And that doesn't begin to cover the malformed HTML.

See the TokeParser answer from Andrzej Adam Filip for a better way.
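
To illustrate how much a pattern has to grow just to cover the six variants listed above, here is a rough sketch. The extract_hrefs name and the pattern itself are assumptions for illustration, not something posted in this thread. It still breaks on malformed HTML and on tags inside comments or scripts, so a parser remains the safer choice:

```perl
use strict;
use warnings;

# Match an anchor start tag with an href attribute anywhere inside it.
# Handles double-quoted, single-quoted, and unquoted values, any letter
# case, and whitespace or newlines around the equals sign.
my $HREF = qr{
    <a\b            # start of an anchor tag
    [^>]*?          # any other attributes, possibly spanning lines
    \shref\s*=\s*   # the href attribute name and equals sign
    (?: "([^"]*)"   # group 1: double-quoted value
      | '([^']*)'   # group 2: single-quoted value
      | ([^\s>]+)   # group 3: unquoted value
    )
}xi;

# Return all href values in order of appearance.
# Takes the whole document as one string so tags may span lines.
sub extract_hrefs {
    my ($html) = @_;
    my @hrefs;
    while ($html =~ /$HREF/g) {
        # Exactly one of the three groups took part in the match.
        push @hrefs, $1 // $2 // $3;
    }
    return @hrefs;
}

my $html = <<'HTML';
<a class="linkmain" href="page1.html">
<a href='page2.html'>
<A HREF="page3.html">
<a href=page4.html>
<a href = "page5.html">
<a
href="page6.html">
HTML

print join("\n", extract_hrefs($html)), "\n";
```

Even this longer pattern only recognises the cases someone thought of in advance, which is the core problem with the regex approach.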

Elijah
------
just because it looks easy doesn't mean it is


From Tuxedo@21:1/5 to Andrzej Adam Filip on Tue Sep 27 06:02:35 2022

Thank you for sharing this solution.

In this case, which is pre-processing HTML in a page edit situation and not
for each and every web request, I think I will use the procedure by Shvili
the Kookologist in the next post, mainly to avoid keeping track of
additional modules. I've not used HTML::TokeParser but it appears useful for things that require more flexibility.

Tuxedo


From Tuxedo@21:1/5 to the Kookologist on Tue Sep 27 06:11:22 2022

Thank you for posting this old and modernised yet perfectly working
solution.

I will use it to capture links in the order they appear on an overview (index) page while keeping navigational links between the individual pages in sync, without manually having to update the same links in a separate (JS) script.

It's an HTML pre-publication process.

Tuxedo


From Tuxedo@21:1/5 to Eli the Bearded on Tue Sep 27 06:15:54 2022

Thanks for pointing this out. I didn't see the malformed HTML. It would certainly break the regex procedure and perhaps also fail as an HTML link. Normally, I type a bit better :-)

Tuxedo

