Forum: >>> Magnum BBS <<<

tdom html mode

From saitology9@21:1/5 to All on Tue Apr 25 12:31:08 2023

It seems like tdom html parsing doesn't work well with partial html
strings that don't necessarily include the full doctype/head/body/etc.
tags. tdom seems to return nodes only for the first tag and not the
rest; meaning that if there are two "" tags in sequence for example,
it processes only the first one.

That is fine if this is the expected behavior but if not, what is the
correct way to do this?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ted Nolan @21:1/5 to saitology9@gmail.com on Tue Apr 25 17:34:39 2023

In article <u28v8d$ueo3$1@dont-email.me>,
saitology9 <saitology9@gmail.com> wrote:

> It seems like tdom html parsing doesn't work well with partial html
> strings that don't necessarily include the full doctype/head/body/etc.
> tags. tdom seems to return nodes only for the first tag and not the
> rest; meaning that if there are two "" tags in sequence for example,
> it processes only the first one.
>
> That is fine if this is the expected behavior but if not, what is the
> correct way to do this?

I find that I always have better results with tdom parsing if I use the "-html5" option. Are you using that?
--
columbiaclosings.com
What's not in Columbia anymore..


From saitology9@21:1/5 to All on Tue Apr 25 14:40:35 2023

On 4/25/2023 1:34 PM, Ted Nolan <tednolan> wrote:

> I find that I always have better results with tdom parsing if I use the
> "-html5" option. Are you using that?

No, I am not. However, it doesn't recognize this option. I just reviewed
the tdom docs and there wasn't any mention of this option.

For reference, this is what I have:

% package req tdom
0.9.1

% dom parse -html "hello there"
domDoc010BC518


From Ted Nolan @21:1/5 to saitology9@gmail.com on Tue Apr 25 19:25:10 2023

In article <u296r4$vqbb$1@dont-email.me>,
saitology9 <saitology9@gmail.com> wrote:

> On 4/25/2023 1:34 PM, Ted Nolan <tednolan> wrote:
>
>> I find that I always have better results with tdom parsing if I use the
>> "-html5" option. Are you using that?
>
> No, I am not. However, it doesn't recognize this option. I just reviewed
> the tdom docs and there wasn't any mention of this option.
>
> For reference, this is what I have:
>
> % package req tdom
> 0.9.1
>
> % dom parse -html "hello there"
> domDoc010BC518

It's a compile option:

http://www.tdom.org/index.html/doc/trunk/doc/dom.html

-html5
This option is only available if tDOM was build with
--enable-html5. Try the featureinfo method if you need
to know if this feature is build in.

Mine (FreeBSD) has it:

===
ted@hotrod:~ % tclsh8.6
% package require tdom
0.9.1
% dom parse -html5 "hello there"
domDoc0x80097d140
===

That's not to say it would solve your problem, but as I say
I've had better luck with it.
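
A minimal sketch of the featureinfo check mentioned in the documentation excerpt above, assuming a tdom build that supports "dom featureinfo html5". The fallback to -html and the sample markup are illustrative assumptions, not something from this thread:

```tcl
package require tdom

# featureinfo reports 1 if the html5 (gumbo) parser was compiled in, else 0.
if {[dom featureinfo html5]} {
    set parseOpt -html5
} else {
    # Assumed fallback: the built-in, more lenient-but-simpler HTML reader.
    set parseOpt -html
}

# Sample markup is hypothetical.
set doc [dom parse $parseOpt {<p>hello there</p>}]
puts "parsed with $parseOpt"
puts [$doc asHTML]
$doc delete
```

Checking at run time avoids the "unknown option" error on builds configured without --enable-html5.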
--
columbiaclosings.com
What's not in Columbia anymore..


From saitology9@21:1/5 to All on Tue Apr 25 15:54:07 2023

On 4/25/2023 3:25 PM, Ted Nolan <tednolan> wrote:

> It's a compile option:

Thank you very much for your help. My version is not built with this
option. At the moment, it is not worth the trouble pursuing this any
further but it is good to know the option exists.


From Rolf Ade@21:1/5 to saitology9@gmail.com on Tue Apr 25 23:17:12 2023

saitology9 <saitology9@gmail.com> writes:

> It seems like tdom html parsing doesn't work well with partial html
> strings that don't necessarily include the full doctype/head/body/etc.
> tags. tdom seems to return nodes only for the first tag and not the
> rest; meaning that if there are two "" tags in sequence for
> example, it processes only the first one.
>
> That is fine if this is the expected behavior but if not, what is the
> correct way to do this?

You'd better update to the current tdom 0.9.3, which provides a solution
to your question.

The -html5 parser is very robust and digests nearly any tag soup. (It is
only available if it is built in, which requires the gumbo HTML5 parser
lib to be present at build time and the configure switch --enable-html5.)
However, it may not be the right tool for this problem, because it always
creates a single document root and inserts missing elements implied by
the context (such as <head>, <tbody>, etc.).

You want to parse an HTML fragment like

"hello there"

But what DOM tree do you expect to get from that? That document or
fragment doesn't have a single root, as an HTML or XML document must. So
if you are fine with getting a DOM _forest_ instead of a DOM tree, just do:

package require tdom 0.9.3
dom parse -html -forest "hello there" doc
$doc asXML

This script returns this to me:

hello
there

tDOM's DOM methods (and the XPath engine) work fine with such a
"forest", in a natural way. It is just that you can no longer rely on
the pattern

set root [$doc documentElement]

and assume that all of your data are descendants of that one root
(remember, you have a forest, not a tree).

The "other" root nodes besides the one you still get from [$doc
documentElement] are (next) siblings of that one. Or you can get all
the roots of your forest with [$doc childNodes]. I hope these hints get
you started.
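
The two ways of reaching the forest roots described above can be sketched as follows. The input markup is an assumption (the original fragment's tags did not survive in this archive), and rootNames is just an illustrative variable:

```tcl
package require tdom 0.9.3

# Assumed input: two sibling elements with no common parent.
set html {<p>hello</p><p>there</p>}
set doc [dom parse -html -forest $html]

# Way 1: ask the document for all of its top-level nodes.
set rootNames {}
foreach root [$doc childNodes] {
    lappend rootNames [$root nodeName]
    puts "root: [$root asXML -indent none]"
}

# Way 2: start at documentElement and follow nextSibling until it
# returns the empty string.
set node [$doc documentElement]
while {$node ne ""} {
    puts "sibling walk: [$node nodeName]"
    set node [$node nextSibling]
}

$doc delete
```

Both loops should visit the same nodes; [$doc childNodes] is the shorter form when all roots are wanted at once.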

rolf


From saitology9@21:1/5 to Rolf Ade on Tue Apr 25 18:37:17 2023

On 4/25/2023 5:17 PM, Rolf Ade wrote:

> But what DOM tree do you expect to get from that? That document or
> fragment doesn't have a single root as HTML or XML have to. So if you
> are fine with getting a DOM _forest_ instead of a DOM tree just do:
>
> package require tdom 0.9.3
> dom parse -html -forest "hello there" doc
> $doc asXML

Dear Rolf, thank you. Yes, I wanted the "forest" option. I am aware of
the difference between a tree and a forest. I have two versions of tdom
(0.9.1 and 0.9.2) and they both return a single node for the plain parse
command. So despite my writing a recursive function to navigate the
node's children as well as its siblings, I was not getting the full data
out. In any case, this was more of a curiosity on my part and not based
on any need.
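
For illustration only, a recursive traversal of the kind described above might look like this once the document is parsed with -forest (tdom 0.9.3 or later, per Rolf's reply). The proc name outline and the sample markup are hypothetical, not the poster's actual code:

```tcl
package require tdom 0.9.3

# Hypothetical helper: return one indented line per node below (and
# including) the given node. Elements show their tag name, other nodes
# (text, comments) show their value.
proc outline {node {depth 0}} {
    set indent [string repeat "  " $depth]
    if {[$node nodeType] ne "ELEMENT_NODE"} {
        return [list "$indent[$node nodeValue]"]
    }
    set lines [list "$indent<[$node nodeName]>"]
    foreach child [$node childNodes] {
        lappend lines {*}[outline $child [expr {$depth + 1}]]
    }
    return $lines
}

# Assumed sample fragment with two top-level elements.
set doc [dom parse -html -forest {<ul><li>one</li></ul><p>two</p>}]

# Start the recursion from every root of the forest, not just the first.
set lines {}
foreach root [$doc childNodes] {
    lappend lines {*}[outline $root]
}
puts [join $lines \n]
$doc delete
```

The recursion itself only descends into children; the sibling roots are covered by the outer loop over [$doc childNodes].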

I will look to upgrade tdom soon. Thanks for the heads up.

