• tdom html mode

    From saitology9@21:1/5 to All on Tue Apr 25 12:31:08 2023
    It seems like tdom html parsing doesn't work well with partial html
    strings that don't necessarily include the full doctype/head/body/etc.
    tags. tdom seems to return nodes only for the first tag and not the
    rest; meaning that if there are two "<p>" tags in sequence for example,
    it processes only the first one.

    That is fine if this is the expected behavior but if not, what is the
    correct way to do this?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ted Nolan @21:1/5 to saitology9@gmail.com on Tue Apr 25 17:34:39 2023
    In article <u28v8d$ueo3$1@dont-email.me>,
    saitology9 <saitology9@gmail.com> wrote:
    > It seems like tdom html parsing doesn't work well with partial html
    > strings that don't necessarily include the full doctype/head/body/etc.
    > tags. tdom seems to return nodes only for the first tag and not the
    > rest; meaning that if there are two "<p>" tags in sequence for example,
    > it processes only the first one.
    >
    > That is fine if this is the expected behavior but if not, what is the
    > correct way to do this?

    I find that I always have better results with tdom parsing if I use the "-html5" option. Are you using that?
    --
    columbiaclosings.com
    What's not in Columbia anymore..

  • From saitology9@21:1/5 to All on Tue Apr 25 14:40:35 2023
    On 4/25/2023 1:34 PM, Ted Nolan <tednolan> wrote:

    > I find that I always have better results with tdom parsing if I use the
    > "-html5" option. Are you using that?

    No, I am not, and my build doesn't recognize this option. I just reviewed
    the tdom docs and found no mention of it.

    For reference, this is what I have:

    % package req tdom
    0.9.1

    % dom parse -html "<p>hello</p> <p>there</p>"
    domDoc010BC518

  • From Ted Nolan @21:1/5 to saitology9@gmail.com on Tue Apr 25 19:25:10 2023
    In article <u296r4$vqbb$1@dont-email.me>,
    saitology9 <saitology9@gmail.com> wrote:
    > On 4/25/2023 1:34 PM, Ted Nolan <tednolan> wrote:
    >
    >> I find that I always have better results with tdom parsing if I use the
    >> "-html5" option. Are you using that?
    >
    > no I am not. However, it doesnt recognize this option. I just reviewed
    > the tdom docs and there wasn't any mention of this option.
    >
    > For reference, this is what I have:
    >
    > % package req tdom
    > 0.9.1
    >
    > % dom parse -html "<p>hello</p> <p>there</p>"
    > domDoc010BC518

    It's a compile option:

    http://www.tdom.org/index.html/doc/trunk/doc/dom.html

    -html5
    This option is only available if tDOM was build with
    --enable-html5. Try the featureinfo method if you need
    to know if this feature is build in.


    Mine (FreeBSD) has it:

    ===
    ted@hotrod:~ % tclsh8.6
    % package require tdom
    0.9.1
    % dom parse -html5 "<p>hello</p> <p>there</p>"
    domDoc0x80097d140
    ===

    That's not to say it would solve your problem, but as I say
    I've had better luck with it.
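
The featureinfo check mentioned in the documentation excerpt above can be sketched as follows. This is a minimal sketch, not a definitive recipe: the helper name `parseHtmlBestEffort` is hypothetical, and it assumes `dom featureinfo html5` returns a boolean. The call is wrapped in `catch` in case an older build lacks `featureinfo`.

```tcl
package require tdom

# Hypothetical helper (not part of tDOM): use -html5 when this build
# offers it, otherwise fall back to the plain -html parser.
proc parseHtmlBestEffort {html} {
    # featureinfo may be missing in old releases, hence the catch.
    if {![catch {dom featureinfo html5} hasHtml5] && $hasHtml5} {
        return [dom parse -html5 $html]
    }
    return [dom parse -html $html]
}

set doc [parseHtmlBestEffort "<p>hello</p> <p>there</p>"]
puts [$doc asHTML]
$doc delete
```

Note that the two parsers need not produce the same tree for the same input, so code that walks the result should not assume a particular root element.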
    --
    columbiaclosings.com
    What's not in Columbia anymore..

  • From saitology9@21:1/5 to All on Tue Apr 25 15:54:07 2023
    On 4/25/2023 3:25 PM, Ted Nolan <tednolan> wrote:

    > It's a compile option:

    Thank you very much for your help. My version is not built with this
    option. At the moment, it is not worth the trouble pursuing this any
    further but it is good to know the option exists.

  • From Rolf Ade@21:1/5 to saitology9@gmail.com on Tue Apr 25 23:17:12 2023
    saitology9 <saitology9@gmail.com> writes:
    > It seems like tdom html parsing doesn't work well with partial html
    > strings that don't necessarily include the full doctype/head/body/etc.
    > tags. tdom seems to return nodes only for the first tag and not the
    > rest; meaning that if there are two "<p>" tags in sequence for
    > example, it processes only the first one.
    >
    > That is fine if this is the expected behavior but if not, what is the
    > correct way to do this?

    You'd better update to the current tdom 0.9.3 (which provides a solution
    to your question).

    While the -html5 parser (if it is built in; that requires the gumbo
    HTML5 parser lib to be present at build time and the configure switch
    --enable-html5) is very robust (it digests nearly any tag soup), it may
    not be the right thing for this problem, because it always creates a
    single document root and inserts missing elements implied by the context
    (such as <head>, <tbody>, etc.).

    You want to parse an HTML fragment like

    "<p>hello</p> <p>there</p>"

    But what DOM tree do you expect to get from that? That document or
    fragment doesn't have a single root, as HTML or XML documents must. So if
    you are fine with getting a DOM _forest_ instead of a DOM tree, just do:

    package require tdom 0.9.3
    dom parse -html -forest "<p>hello</p> <p>there</p>" doc
    $doc asXML

    This script returns this to me:

    <p>hello</p>
    <p>there</p>

    tDOM's DOM methods (and the XPath engine) work fine with such a
    "forest", in a natural way. It is just that you can no longer rely on the
    pattern

    set root [$doc documentElement]

    and expect all of your data to be descendants of that one root (remember,
    you have a forest, not a tree).

    The "other" root nodes besides the one you still get from [$doc
    documentElement] are (next) siblings of that one. Or you can get all
    the roots of your forest with [$doc childNodes]. I hope these hints get
    you started.
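
As a sketch of the `[$doc childNodes]` approach described above, the following collects the name and text of every root element of the forest. The helper name `forestRoots` is hypothetical, and the code assumes tDOM 0.9.3 or later for the `-forest` option.

```tcl
package require tdom 0.9.3

# Hypothetical helper (not part of tDOM): parse an HTML fragment as a
# forest and return a list of {name text} pairs, one per root element.
proc forestRoots {html} {
    set doc [dom parse -html -forest $html]
    set result {}
    foreach root [$doc childNodes] {
        # Skip any top-level node that is not an element.
        if {[$root nodeType] eq "ELEMENT_NODE"} {
            lappend result [list [$root nodeName] [$root text]]
        }
    }
    $doc delete
    return $result
}

foreach pair [forestRoots "<p>hello</p> <p>there</p>"] {
    lassign $pair name text
    puts "$name: $text"
}
```

The same roots can also be reached by starting at `[$doc documentElement]` and following `nextSibling` until it returns an empty string.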

    rolf

  • From saitology9@21:1/5 to Rolf Ade on Tue Apr 25 18:37:17 2023
    On 4/25/2023 5:17 PM, Rolf Ade wrote:

    > But what DOM tree do you expect to get from that? That document or
    > fragment doesn't have a single root as HTML or XML have to. So if you
    > are fine with getting a DOM _forest_ instead of a DOM tree just do:
    >
    > package require tdom 0.9.3
    > dom parse -html -forest "<p>hello</p> <p>there</p>" doc
    > $doc asXML

    Dear Rolf, thank you. Yes, I wanted the "forest" option. I am aware of
    the difference between a tree and a forest. I have two versions of tdom
    (0.9.1 and 0.9.2) and they both return a single node for the plain parse
    command. So despite me writing a recursive function to navigate the
    node's children as well as its siblings, I was not getting the full data
    out. In any case, this was more of a curiosity on my part and not based
    on any need.
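
For illustration, a recursive walk of the kind described above might look like the sketch below once it starts from every root of a forest. The helper name `outline` is hypothetical, and `-forest` again assumes tDOM 0.9.3 or later.

```tcl
package require tdom 0.9.3

# Hypothetical helper (not part of tDOM): return an indented outline,
# one list element per line, of a node and all of its descendants.
proc outline {node {depth 0}} {
    set pad [string repeat "  " $depth]
    switch -- [$node nodeType] {
        ELEMENT_NODE {
            set lines [list "$pad[$node nodeName]"]
            foreach child [$node childNodes] {
                lappend lines {*}[outline $child [expr {$depth + 1}]]
            }
            return $lines
        }
        TEXT_NODE {
            return [list "$pad\"[$node nodeValue]\""]
        }
        default {
            # Comments, processing instructions, etc. are ignored here.
            return {}
        }
    }
}

set doc [dom parse -html -forest "<p>hello</p> <p>there</p>"]
foreach root [$doc childNodes] {
    if {[$root nodeType] eq "ELEMENT_NODE"} {
        puts [join [outline $root] \n]
    }
}
$doc delete
```

Recursing over `childNodes` covers the siblings at every level below a root, so only the top-level loop has to enumerate the roots themselves.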

    I will look to upgrade tdom soon. Thanks for the heads up.
