• Re: Safe handling of lists

    From Gerald Lester@21:1/5 to Luc on Sun Nov 26 13:50:26 2023
    On 11/26/23 13:29, Luc wrote:
    Me again. I have a problem dealing with lists.

    I wanted to count the words in a text widget that contains the text
    of a file. I decided to treat the whole thing like a list and iterate
    over it to count the list elements, possibly filtering some things
    out.

    It worked fine with a small file, but a large (very large) file
    triggers this:


    list element in quotes followed by "," instead of space
    while executing
    "foreach w $::FILECONTENT {
    incr wordcount
    }"
    (procedure "p.wc" line 5)
    invoked from within
    "p.wc"

    Also relevant,

    set ::FILECONTENT [$::text get 1.0 end]

    It's probably something obvious that I am missing again. Can someone
    please enlighten me?


    Every list is a string, but not every string is a list.

    I would suggest that you take a look at the following builtins:
    tcl_endOfWord str start
    tcl_startOfNextWord str start
    tcl_startOfPreviousWord str start
    tcl_wordBreakAfter str start
    tcl_wordBreakBefore str start
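
    For illustration, a word counter could be built on top of these (a
    minimal sketch only; it assumes the Unix default tcl_wordchars of \w,
    so punctuation is handled differently than by wc -w, and walking a
    very large string this way will not be fast):

    proc countWordsViaBuiltins {str} {
        set count 0
        # tcl_startOfNextWord only reports word starts that follow a
        # non-word character, so a word at index 0 is counted separately.
        if {[regexp {^\w} $str]} {
            incr count
        }
        set idx [tcl_startOfNextWord $str 0]
        while {$idx >= 0} {
            incr count
            set idx [tcl_startOfNextWord $str $idx]
        }
        return $count
    }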

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Sun Nov 26 16:29:14 2023
    Me again. I have a problem dealing with lists.

    I wanted to count the words in a text widget that contains the text
    of a file. I decided to treat the whole thing like a list and iterate
    over it to count the list elements, possibly filtering some things
    out.

    It worked fine with a small file, but a large (very large) file
    triggers this:


    list element in quotes followed by "," instead of space
    while executing
    "foreach w $::FILECONTENT {
    incr wordcount
    }"
    (procedure "p.wc" line 5)
    invoked from within
    "p.wc"

    Also relevant,

    set ::FILECONTENT [$::text get 1.0 end]

    It's probably something obvious that I am missing again. Can someone
    please enlighten me?

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Luc on Sun Nov 26 13:35:48 2023
    On 11/26/2023 11:29 AM, Luc wrote:
    Me again. I have a problem dealing with lists.

    I wanted to count the words in a text widget that contains the text
    of a file. I decided to treat the whole thing like a list and iterate
    over it to count the list elements, possibly filtering some things
    out.

    It worked fine with a small file, but a large (very large) file
    triggers this:


    list element in quotes followed by "," instead of space
    while executing
    "foreach w $::FILECONTENT {
    incr wordcount
    }"
    (procedure "p.wc" line 5)
    invoked from within
    "p.wc"

    Also relevant,

    set ::FILECONTENT [$::text get 1.0 end]

    It's probably something obvious that I am missing again. Can someone
    please enlighten me?

    I think you want [split]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Sun Nov 26 19:00:47 2023
    On Sun, 26 Nov 2023 13:35:48 -0800, et99 wrote:

    I think you want [split]

    **************************

    I am using split now. It's faster and "list safe" so it solves
    the problem I presented first.

    The new problem now is that cleaning up the huge string for proper
    counting is not fast enough.


    proc p.wc {} {
        set wordcount 0
        set content [$::text get 1.0 end]
        set cleancontent [string map "\n { } \t { }" $content]
        set wordcount [llength [split $cleancontent { }]]
        return $wordcount
    }


    Since it's called whenever some change is made to the text widget,
    typing becomes unacceptably slow.

    And I still haven't addded a line to clean all the multiple
    consecutive spaces, which changes the tally. I can't use regexp
    because it's too slow for what I want.

    Another (debatable?) problem is that the old code gave me a count
    that was a lot closer to the output of 'wc -w' in a terminal.
    This new one is way off, and it gives me a lower count! One would
    think it would be higher because of the consecutive spaces.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul Obermeier@21:1/5 to All on Sun Nov 26 23:29:19 2023
    On 26.11.2023 at 20:29, Luc wrote:
    Me again. I have a problem dealing with lists.

    I wanted to count the words in a text widget that contains the text
    of a file. I decided to treat the whole thing like a list and iterate
    over it to count the list elements, possibly filtering some things
    out.

    It worked fine with a small file, but a large (very large) file
    triggers this:


    list element in quotes followed by "," instead of space
    while executing
    "foreach w $::FILECONTENT {
    incr wordcount
    }"
    (procedure "p.wc" line 5)
    invoked from within
    "p.wc"

    Also relevant,

    set ::FILECONTENT [$::text get 1.0 end]

    It's probably something obvious that I am missing again. Can someone
    please enlighten me?


    Take a look at my CAWT extension. It contains a CountWords procedure,
    see https://www.tcl3d.org/cawt/download/CawtReference-Cawt.html#::Cawt::CountWords

    Paul

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Sun Nov 26 23:03:06 2023
    Luc <luc@sep.invalid> wrote:
    On Sun, 26 Nov 2023 23:29:19 +0100, Paul Obermeier wrote:

    Take a look at my CAWT extension. It contains a CountWords procedure,
    see https://www.tcl3d.org/cawt/download/CawtReference-Cawt.html#::Cawt::CountWords

    Paul
    **************************


    "CAWT (COM Automation With Tcl) is a utility package based on Twapi
    to script Microsoft Windows® applications with Tcl."

    But, if the source is available, you could look at the source to see
    how it performs "word counting" and use that for inspiration.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Paul Obermeier on Sun Nov 26 19:54:58 2023
    On Sun, 26 Nov 2023 23:29:19 +0100, Paul Obermeier wrote:

    Take a look at my CAWT extension. It contains a CountWords procedure,
    see https://www.tcl3d.org/cawt/download/CawtReference-Cawt.html#::Cawt::CountWords

    Paul
    **************************


    "CAWT (COM Automation With Tcl) is a utility package based on Twapi
    to script Microsoft Windows® applications with Tcl."

    Thanks, but I am on Linux.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul Obermeier@21:1/5 to All on Mon Nov 27 01:15:52 2023
    On 27.11.2023 at 00:03, Rich wrote:
    Luc <luc@sep.invalid> wrote:
    On Sun, 26 Nov 2023 23:29:19 +0100, Paul Obermeier wrote:

    Take a look at my CAWT extension. It contains a CountWords procedure,
    see
    https://www.tcl3d.org/cawt/download/CawtReference-Cawt.html#::Cawt::CountWords

    Paul
    **************************


    "CAWT (COM Automation With Tcl) is a utility package based on Twapi
    to script Microsoft Windows® applications with Tcl."

    But, if the source is available, you could look at the source to see
    how it performs "word counting" and use that for inspiration.

    That was the idea.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Luc on Sun Nov 26 16:37:37 2023
    On 11/26/2023 2:00 PM, Luc wrote:
    On Sun, 26 Nov 2023 13:35:48 -0800, et99 wrote:

    I think you want [split]

    **************************

    I am using split now. It's faster and "list safe" so it solves
    the problem I presented first.

    The new problem now is that cleaning up the huge string for proper
    counting is not fast enough.


    proc p.wc {} {
    set wordcount 0
    set content [$::text get 1.0 end]
    set cleancontent [string map "\n { } \t { }" $content]
    set wordcount [llength [split $cleancontent { }]]
    return $wordcount
    }


    Since it's called whenever some change is made to the text widget,
    typing becomes unacceptably slow.

    And I still haven't addded a line to clean all the multiple
    consecutive spaces, which changes the tally. I can't use regexp
    because it's too slow for what I want.

    Another (debatable?) problem is that the old code gave me a count
    that was a lot closer to the output of 'wc -w' in a terminal.
    This new one is way off, and it gives me a lower count! One would
    think it would be higher because of the consecutive spaces.


    Just wondering, why the string map to change newlines and tabs to spaces. Split can take those plus spaces in the splitchars string. In fact, I think that's the default anyway.
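
    For reference, split's default splitChars already cover space, tab and
    newline, but every separator character is its own split point, so a run
    of whitespace produces empty elements that inflate llength unless they
    are filtered out:

    % split "one two\nthree"
    one two three
    % split "one  two"
    one {} two
    % llength [split "one  two"]
    3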

    Also, are you saying this is calculated on every char the user types? Is that to keep a wordcount in say a status area? How big is the text we're talking about here?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From clt.to.davebr@dfgh.net@21:1/5 to All on Mon Nov 27 02:25:32 2023
    This is 2-3 times slower than p.wc, but ignores any number of spacing characters:

    proc wc {str} {
        llength [lmap x [split $str] {if {[string is space $x]} {continue} {set x}}]
    }

    It gives the same count as the wc utility on one 100k test file.

    The other problem is checking the count between keystrokes.

    Consider keeping track of the words above and below the edit window, then
    tracking lines moving into and out of the edit window and the word count
    in the edit window at each keystroke. Much more complicated, but it only
    processes a short segment of text on each keystroke.
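
    A rough sketch of how the visible slice could be picked out, using the
    text widget's @x,y index form, Luc's $::text and the wc proc above (the
    other variable names are invented):

    # Count only the lines currently visible in the widget
    # (first/last/visibleWords are invented names; wc is the proc above).
    set first [expr {int([$::text index @0,0])}]
    set last  [expr {int([$::text index "@0,[winfo height $::text]"])}]
    set visibleWords [wc [$::text get $first.0 "$last.0 lineend"]]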

    Dave B

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Mon Nov 27 00:05:47 2023
    On Sun, 26 Nov 2023 16:37:37 -0800, et99 wrote:

    Just wondering, why the string map to change newlines and tabs to spaces.
    Split can take those plus spaces in the splitchars string. In fact, I
    think that's the default anyway.

    Also, are you saying this is calculated on every char the user types? Is
    that to keep a wordcount in say a status area? How big is the text we're
    talking about here?

    **************************

    I decided to change all the newlines to spaces because I was afraid that

    this
    that

    might become 'thisthat' rather than 'this that' after the split.

    It probably wouldn't, but I wanted to make sure.

    Yes, it's a sort of text editor and the word count must be calculated
    after every change.

    (I had big plans for it but personal issues have forced me to put it
    on the back burner for God knows how long. I'm trying to fix this code
    right now because I got a copywriting job where counting words in real
    time is very useful. A ton of other things will be fixed... someday.)

    Anyway, there is a status bar with some information and the word count
    is supposed to be updated at every touch of the keyboard. I currently
    filter out arrow key movements. Need to rewrite the code and make it
    smarter.

    The "big text" I am using to assess performance is 18MB.

    $ wc /home/tcl/bigtext.txt
    218993 2758398 18421662 /home/tcl/bigtext.txt

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter Dean@21:1/5 to Luc on Mon Nov 27 17:27:24 2023
    On 27/11/23 05:29, Luc wrote:
    Me again. I have a problem dealing with lists.

    I wanted to count the words in a text widget that contains the text
    of a file. I decided to treat the whole thing like a list and iterate
    over it to count the list elements, possibly filtering some things
    out.

    It worked fine with a small file, but a large (very large) file
    triggers this:


    list element in quotes followed by "," instead of space
    while executing
    "foreach w $::FILECONTENT {
    incr wordcount
    }"
    (procedure "p.wc" line 5)
    invoked from within
    "p.wc"

    Also relevant,

    set ::FILECONTENT [$::text get 1.0 end]

    It's probably something obvious that I am missing again. Can someone
    please enlighten me?

    tcllib has splitx which splits on regexp eg

    ::textutil::split::splitx $l {\s+}

    splits on runs of spaces

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alan Grunwald@21:1/5 to Luc on Mon Nov 27 12:57:27 2023
    On 27/11/2023 03:05, Luc wrote:
    On Sun, 26 Nov 2023 16:37:37 -0800, et99 wrote:

    Just wondering, why the string map to change newlines and tabs to spaces.
    Split can take those plus spaces in the splitchars string. In fact, I
    think that's the default anyway.

    Also, are you saying this is calculated on every char the user types? Is
    that to keep a wordcount in say a status area? How big is the text we're
    talking about here?

    **************************

    I decided to change all the newlines to spaces because I was afraid that

    this
    that

    might become 'thisthat' rather than 'this that' after the split.

    It probably wouldn't, but I wanted to make sure.

    Yes, it's a sort of text editor and the word count must be calculated
    after every change.

    (I had big plans for it but personal issues have forced me to put it
    on the back burner for God knows how long. I'm trying to fix this code
    right now because I got a copywriting job where counting words in real
    time is very useful. A ton of other things will be fixed... someday.)

    Anyway, there is a status bar with some information and the word count
    is supposed to be updated at every touch of the keyboard. I currently
    filter out arrow key movements. Need to rewrite the code and make it
    smarter.

    The "big text" I am using to assess performance is 18MB.

    $ wc /home/tcl/bigtext.txt
    218993 2758398 18421662 /home/tcl/bigtext.txt

    Surely you only need to update the word count if the character inserted
    or deleted is a word separator? I assume you can tell whether this is
    the case.
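
    A sketch of that idea with plain key bindings; p.updateWordCount is an
    invented stand-in for whatever recomputes the count and refreshes the
    status bar, and pastes, cuts and selection deletes would still need
    separate handling:

    # Only these keys can change the number of words, so only they trigger a
    # re-count; p.updateWordCount is a placeholder for the status-bar refresh.
    foreach key {space Tab Return BackSpace Delete} {
        bind $::text <KeyRelease-$key> p.updateWordCount
    }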

    Alan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ralf Fassel@21:1/5 to All on Mon Nov 27 15:50:08 2023
    * Luc <luc@sep.invalid>
    | I am using split now. It's faster and "list safe" so it solves
    | the problem I presented first.

    | The new problem now is that cleaning up the huge string for proper
    | counting is not fast enough.


    | proc p.wc {} {
    | set wordcount 0
    | set content [$::text get 1.0 end]
    | set cleancontent [string map "\n { } \t { }" $content]
    | set wordcount [llength [split $cleancontent { }]]
    | return $wordcount
    | }


    | Since it's called whenever some change is made to the text widget,
    | typing becomes unacceptably slow.

    You could set up a timer to do the real work after a short period
    (500ms) of keyboard idle, and return the old count otherwise. Quick typists
    see the updated count only after they stop typing. Otherwise you would need
    to keep track of what is inserted/deleted and incr/decr the count based
    on that (fragile).
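
    A minimal debounce sketch of that idea, reusing Luc's p.wc from above;
    ::wordcount and ::recountTimer are invented names, with ::wordcount
    assumed to be what the status bar displays:

    set ::recountTimer {}

    proc p.scheduleRecount {} {
        # Each keystroke pushes the real count 500 ms into the future,
        # so only the last keystroke in a burst actually runs p.wc.
        # ::wordcount is assumed to be the status-bar variable.
        after cancel $::recountTimer
        set ::recountTimer [after 500 {set ::wordcount [p.wc]}]
    }

    bind $::text <KeyRelease> p.scheduleRecount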

    | And I still haven't added a line to clean all the multiple
    | consecutive spaces, which changes the tally. I can't use regexp
    | because it's too slow for what I want.

    As someone else suggested, ::textutil::split::splitx might be a solution (though it might be even slower due to the use of regexps, need to
    test), or else check the [string length] of elements and count them only
    as words if > 0.

    R'

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to et99@rocketship1.me on Tue Nov 28 01:49:52 2023
    et99 <et99@rocketship1.me> wrote:
    On 11/27/2023 6:50 AM, Ralf Fassel wrote:
    As someone else suggested, ::textutil::split::splitx might be a
    solution (though it might be even slower due to the use of regexps,
    need to test), or else check the [string length] of elements and
    count them only as words if > 0.

    I was wondering just how accurate it has to be if one is dealing with
    a 20mb file with perhaps 2 million words.

    It would seem that for such a large file, being a "little bit
    incorrect" on a real-time display could be acceptable. Esp. if there
    were a way to request an "accurate" (and slower) word count for the
    few times the exact value is needed.

    What about just counting all the spaces, tabs, and newlines in the
    text by using

    set totchars [string length $txt]
    set nowhites [string map {\n {} \t {} { } {}} $txt]
    set wordcount [expr { $totchars - [string length $nowhites] }]

    Won't this be as accurate as using split? And with all string
    operations, no costs of creating lists.

    But there is the creation of a copy of a 20MB string for the output of
    string map. That is probably still faster than all the small
    allocations of word-sized strings to populate a list.

    In timing tests, the [string length] calls seemed to be near zero, I
    guess the string objects have the length saved in them.

    For Tcl_Obj's holding strings, there is an explicit length count field
    in the struct, so "length of string" devolves to "retrieve the length
    field of the Tcl_Obj".

    And there's always the possibility of doing some of the heavy lifting
    in a separate thread.

    If an accurate, and nearly realtime, count is needed, and Luc does not
    want the GUI event loop to block while the count occurs, then using a
    second thread might be reasonable. That is if the time to 'get' the
    text and send it off to the second thread does not itself become the
    slow point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Ralf Fassel on Mon Nov 27 17:16:46 2023
    On 11/27/2023 6:50 AM, Ralf Fassel wrote:
    * Luc <luc@sep.invalid>
    | I am using split now. It's faster and "list safe" so it solves
    | the problem I presented first.

    | The new problem now is that cleaning up the huge string for proper
    | counting is not fast enough.


    | proc p.wc {} {
    | set wordcount 0
    | set content [$::text get 1.0 end]
    | set cleancontent [string map "\n { } \t { }" $content]
    | set wordcount [llength [split $cleancontent { }]]
    | return $wordcount
    | }


    | Since it's called whenever some change is made to the text widget,
    | typing becomes unacceptably slow.

    You could set up a timer to do the real work after a short period
    (500ms) of keyboard-idle, and return the old count else. Quick typists
    see the updated count only after they stop typing. Else you would need
    to keep track of what is inserted/deleted and incr/decr the count based
    on that (fragile).

    | And I still haven't added a line to clean all the multiple
    | consecutive spaces, which changes the tally. I can't use regexp
    | because it's too slow for what I want.

    As someone else suggested, ::textutil::split::splitx might be a solution (though it might be even slower due to the use of regexps, need to
    test), or else check the [string length] of elements and count them only
    as words if > 0.

    R'

    I was wondering just how accurate it has to be if one is dealing with a 20mb file with perhaps 2 million words.

    What about just counting all the spaces, tabs, and newlines in the text by using

    set totchars [string length $txt]
    set nowhites [string map {\n {} \t {} { } {}} $txt]
    set wordcount [expr { $totchars - [string length $nowhites] }]

    Won't this be as accurate as using split? And with all string operations, no costs of creating lists.

    In timing tests, the [string length] calls seemed to be near zero, I guess the string objects have the length saved in them.

    But I was also thinking, there could be a threshold, say for smaller files, i.e. where the text extracted

    $::text get 1.0 end

    was less than some value, then use a more accurate method, since impact on typing would be smaller.

    I suspect that there has to be a file read to get the 20megs in, where an extra second to set it up won't matter. Then an accurate vs. quicker method might yield a percentage difference. That percent could be factored into any new quick counts.

    But as suggested, doing it only when the user has stopped typing makes sense to me.

    And there's always the possibility of doing some of the heavy lifting in a separate thread.

    Interesting problem. The text editor I use doesn't count words dynamically, but rather has a statistics command.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Mon Nov 27 23:07:34 2023
    On Mon, 27 Nov 2023 17:16:46 -0800, et99 wrote:

    I was wondering just how accurate it has to be if one is dealing with a
    20mb file with perhaps 2 million words.

    For me, it doesn't, except that I'm using the same application for small
    and large files and I want the application to be able to handle all cases.

    Right now, I am only enabling the wc proc when I am doing my copywriting
    work. At all other times, it is disabled because typing becomes
    unacceptably slow on the large files I use regularly.


    And there's always the possibility of doing some of the heavy lifting
    in a separate thread.

    Hmm. That never crossed my mind. I don't remember ever coding with
    threads. I will have to look into that possibility.


    Interesting problem. The text editor I use doesn't count words
    dynamically, but rather has a statistics command.

    Now that did cross my mind. I wrote on MS Word and other word processors
    for most of my life and that's how they all do it. But my own editor has
    this or that nice little feature I made for myself that makes it all a
    better experience for me so I prefer to use it now.

    That's why we learn to code, right?

    The problem is, a statistics command doesn't really cut it.

    I am assigned a topic and a word count, usually 500 or 700. Rarely,
    1200. There are two approaches I can take:

    1. Splurge carelessly and rewrite to prune excesses later.

    2. Manage my verbosity as I go along. Sort of a regularity rally.

    I like #2 better. It's actually more enjoyable and it saves me quite
    some time. Sometimes the deadline is long, sometimes it isn't.

    I can only manage my verbosity as I go along if I know how many words
    I have put into it in real time.

    This is a very old idea. Even when I used MS Word on Windows, and that
    was literally 20 to 26 years ago, I craved something like that.
    I can finally have it.

    Or can I? Let's see.

    This is just a quick reply. I will take a more careful look at other
    aspects of your comments later. I will see what I can do with the code
    ideas you contributed. If I do find a good solution, I will wikify it.

    Many thanks for your interest.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Tue Nov 28 03:03:26 2023
    Luc <luc@sep.invalid> wrote:
    On Mon, 27 Nov 2023 17:16:46 -0800, et99 wrote:

    I was wondering just how accurate it has to be if one is dealing
    with a 20mb file with perhaps 2 million words.

    For me, it doesn't, except that I'm using the same application for
    small and large files and I want the application to be able to handle
    all cases.

    Right now, I am only enabling the wc proc when I am doing my
    copywriting work. At all other times, it is disabled because typing
    becomes unacceptably slow on the large files I use regularly.

    Thing is, given what you've described so far of your code, you are
    attempting to brute-force a real-time count, and doing so you've
    created something on the order of an O(n^2) or worse algorithm.

    You are "counting" lots of things over and over that you previously
    counted after the last keystroke, even though that keystroke only made (usually) a one character change to the entire file.

    You instead want to think about how to not count any more than you have
    to for each "count cycle".

    I pulled a copy of Gutenberg's copy of War and Peace (because it is a
    long book) from here: https://gutenberg.org/cache/epub/2600/pg2600.txt

    Then I concatenated six duplicates into a single file (to make an
    approximately 20MB file). I named that file "20mb".

    Then I set out to see what could be done by trying to not count
    everything after any change. This is ugly demostration code below, all
    of this would ideally be wrapped inside a oo::object and made prettier,
    but you get the raw demo below:

    #!/usr/bin/wish

    set wc 0

    label .wc -textvariable wc
    text .t
    pack .wc
    pack .t

    set fd [open 20mb RDONLY]
    .t insert end [read $fd]
    close $fd

    set num_lines [.t count -lines 0.0 end]

    # prefill per-line word-count cache
    set lcc [list 0]
    for {set i 1} {$i < $num_lines} {incr i} {
        lappend lcc [llength [regexp -all -inline {\S+} [.t get $i.0 "$i.0 lineend"]]]
    }

    # initial load word count
    set wc [tcl::mathop::+ {*}$lcc]

    # set modified flag of text widget to false
    .t edit modified 0

    proc modified {} {
        global lcc
        global wc
        lassign [split [.t index insert] .] line_num
        lset lcc $line_num [llength [regexp -all -inline {\S+} [.t get $line_num.0 "$line_num.0 lineend"]]]
        set wc [tcl::mathop::+ {*}$lcc]
        .t edit modified 0
    }

    bind .t <<Modified>> [list modified]

    This above counts, in real time, as I type, with the 20mb 6x War and
    Peace file loaded. If I get going typing I can just begin to sense a
    latency on typing, but I do really have to get on a good roll for that.
    The one place I *do* see a latency is for keyboard autorepeat, there is
    a clear delay then.

    But, while hammering out this demo I noticed that <<Modified>> is
    called twice for every keystroke (meaning this above is still doing
    twice the work it needs to do). I simply did not want to be bothered
    with working out why <<Modified>> is called twice with every keystroke,
    nor with working out how to not call it twice for every keystroke.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Rich on Mon Nov 27 21:13:22 2023
    On 11/27/2023 7:03 PM, Rich wrote:
    Luc <luc@sep.invalid> wrote:
    On Mon, 27 Nov 2023 17:16:46 -0800, et99 wrote:

    I was wondering just how accurate it has to be if one is dealing
    with a 20mb file with perhaps 2 million words.

    For me, it doesn't, except that I'm using the same application for
    small and large files and I want the application to be able to handle
    all cases.

    Right now, I am only enabling the wc proc when I am doing my
    copywriting work. At all other times, it is disabled because typing
    becomes unacceptably slow on the large files I use regularly.

    Thing is, given what you've described so far of your code, you are
    attempting to brute-force a real-time count, and doing so you've
    created something on the order of an O(n^2) or worse algorithm.

    You are "counting" lots of things over and over that you previously
    counted after the last keystroke, even though that keystroke only made (usually) a one character change to the entire file.

    You instead want to think about how to not count any more than you have
    to for each "count cycle".

    I pulled a copy of Gutenberg's copy of War and Peace (because it is a
    long book) from here: https://gutenberg.org/cache/epub/2600/pg2600.txt

    Then I concatenated six duplicates into a single file (to make an approximately 20MB file). I named that file "20mb".

    Then I set out to see what could be done by trying to not count
    everything after any change. This is ugly demonstration code below; all
    of this would ideally be wrapped inside an oo::object and made prettier,
    but you get the raw demo below:

    #!/usr/bin/wish

    set wc 0

    label .wc -textvariable wc
    text .t
    pack .wc
    pack .t

    set fd [open 20mb RDONLY]
    .t insert end [read $fd]
    close $fd

    set num_lines [.t count -lines 0.0 end]

    # prefill per-line word-count cache
    set lcc [list 0]
    for {set i 1} {$i < $num_lines} {incr i} {
    lappend lcc [llength [regexp -all -inline {\S+} [.t get $i.0 "$i.0 lineend"]]]
    }

    # initial load word count
    set wc [tcl::mathop::+ {*}$lcc]

    # set modified flag of text widget to false
    .t edit modified 0

    proc modified {} {
    global lcc
    global wc
    lassign [split [.t index insert] .] line_num
    lset lcc $line_num [llength [regexp -all -inline {\S+} [.t get $line_num.0 "$line_num.0 lineend"]]]
    set wc [tcl::mathop::+ {*}$lcc]
    .t edit modified 0
    }

    bind .t <<Modified>> [list modified]

    This above counts, in real time, as I type, with the 20mb 6x War and
    Peace file loaded. If I get going typing I can just begin to sense a
    latency on typing, but I do really have to get on a good roll for that.
    The one place I *do* see a latency is for keyboard autorepeat, there is
    a clear delay then.

    But, while hammering out this demo I noticed that <<Modified>> is
    called twice for every keystroke (meaning this above is still doing
    twice the work it needs to do). I simply did not want to be bothered
    with working out why <<Modified>> is called twice with every keystroke,
    nor with working out how to not call it twice for every keystroke.

    Hmmm, a line cache. Cool.

    And since lcc is a list one can insert and delete quickly when lines are added or removed keeping it in sync with the text widget.

    How to tell just what changed (e.g. a paste in the middle) I can't say.

    I think for changes on a line, the [tcl::mathop::+ {*}$lcc] can adjust the wc using the line's (new value - its old value) saving possibly a 20k item arglist for the mathop.
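
    That delta update could look roughly like this inside the modified proc,
    reusing its wc, lcc and line_num variables:

    # Adjust wc by the change on the edited line instead of re-summing lcc.
    set new [llength [regexp -all -inline {\S+} [.t get $line_num.0 "$line_num.0 lineend"]]]
    incr wc [expr {$new - [lindex $lcc $line_num]}]
    lset lcc $line_num $new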

    But I think this should work pretty well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Gollwitzer@21:1/5 to All on Tue Nov 28 09:39:33 2023
    On 27.11.23 at 08:27, Peter Dean wrote:
    tcllib has splitx which splits on regexp eg

    ::textutil::split::splitx $l {\s+}

    splits on runs of spaces

    For only the count, it is not required to split the list. regexp can do
    the counting:

    set string {This is a sentence with whitespaces in it.}
    regexp -all {\s+} $string

    returns the number of whitespace runs. With the uppercase \S it returns
    the number of non-whitespace runs, i.e. the word count (leading or
    trailing whitespace then no longer skews the result).

    Christian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter Dean@21:1/5 to Christian Gollwitzer on Tue Nov 28 09:19:50 2023
    Christian Gollwitzer <auriocus@gmx.de> wrote:
    On 27.11.23 at 08:27, Peter Dean wrote:
    tcllib has splitx which splits on regexp eg

    ::textutil::split::splitx $l {\s+}

    splits on runs of spaces
    For only the count, it is not required to split the list. regexp can do
    the counting:

    set string {This is a sentence with whitespaces in it.}
    regexp -all {\s+} $string

    returns the number of blanks. With the uppercase \S it returns the
    number of non-white (there could be whitespace before and after)

    Christian


    Impressive.
    Better to predict what the real question was than to answer the one asked.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ralf Fassel@21:1/5 to All on Tue Nov 28 14:20:38 2023
    * Rich <rich@example.invalid>
    | But, while hammering out this demo I noticed that <<Modified>> is
    | called twice for every keystroke (meaning this above is still doing
    | twice the work it needs to do). I simply did not want to be bothered
    | with working out why <<Modified>> is called twice with every keystroke,

    Might be due to the fact that you change the 'modified' flag in the
    callback, and looking at the C code for the text widget which handles
    the ".t edit modified 0":

    /*
     * Only issue the <<Modified>> event if the flag actually changed.
     * However, degree of modified-ness doesn't matter. [Bug 1799782]
     */
    if ((!oldModified) != (!setModified)) {
        GenerateModifiedEvent(textPtr);
    }

    HTH
    R'

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Ralf Fassel on Tue Nov 28 17:26:27 2023
    Ralf Fassel <ralfixx@gmx.de> wrote:
    * Rich <rich@example.invalid>
    | But, while hammering out this demo I noticed that <<Modified>> is
    | called twice for every keystroke (meaning this above is still doing
    | twice the work it needs to do). I simply did not want to be bothered
    | with working out why <<Modified>> is called twice with every keystroke,

    Might be due to the fact that you change the 'modified' flag in the
    callback, and looking at the C code for the text widget which handles
    the ".t edit modified 0":

    /*
    * Only issue the <<Modified>> event if the flag actually changed.
    * However, degree of modified-ness doesn't matter. [Bug 1799782]
    */
    if ((!oldModified) != (!setModified)) {
    GenerateModifiedEvent(textPtr);
    }

    Ah, that would be why then.

    As I said, I did not bother digging to find out why, nor to think about
    how to avoid the double calls.
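
    Given that explanation, one way to skip the second call would be to bail
    out when the flag is already clear, since the event triggered by our own
    ".t edit modified 0" arrives with the flag reset (an untested sketch of
    the demo's handler):

    proc modified {} {
        # The <<Modified>> we generate ourselves by clearing the flag below
        # arrives with the flag already 0, so it can simply be ignored.
        if {![.t edit modified]} {
            return
        }
        # ... update the per-line cache and the word count as before ...
        .t edit modified 0
    }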

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to et99@rocketship1.me on Tue Nov 28 17:24:46 2023
    et99 <et99@rocketship1.me> wrote:
    On 11/27/2023 7:03 PM, Rich wrote:
    Luc <luc@sep.invalid> wrote:
    Right now, I am only enabling the wc proc when I am doing my
    copywriting work. At all other times, it is disabled because
    typing becomes unacceptably slow on the large files I use
    regularly.

    Thing is, given what you've described so far of your code, you are
    attempting to brute-force a real-time count, and doing so you've
    created something on the order of an O(n^2) or worse algorithm.

    You are "counting" lots of things over and over that you previously
    counted after the last keystroke, even though that keystroke only made
    (usually) a one character change to the entire file.

    You instead want to think about how to not count any more than you have
    to for each "count cycle".

    Hmmm, a line cache. Cool.

    It reduces the need to count to only counting the current line being
    edited, which is where this example derives almost all of its speedup.

    And since lcc is a list one can insert and delete quickly when lines
    are added or removed keeping it in sync with the text widget.

    Yes, the example does not try to adjust the list for insert/delete
    operations. A real version would need to track inserts/deletes and
    adjust the line count cache accordingly. And yes, being a list, inserting/deleting one or more line items is reasonably quick.

    How to tell just what changed (e.g. a paste in the middle) I can't say.

    Probably have to shim the text widget and watch for all the operations
    that occur, and adjust the cache accordingly.
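
    A sketch of such a shim, using the usual rename trick; update_line_cache
    is an invented placeholder for whatever resyncs the cache:

    rename .t ._real_t
    proc .t {cmd args} {
        set result [._real_t $cmd {*}$args]
        if {$cmd in {insert delete replace}} {
            # An edit went through: resync the per-line word counts for
            # the affected range (placeholder proc).
            update_line_cache $cmd $args
        }
        return $result
    }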

    I think for changes on a line, the [tcl::mathop::+ {*}$lcc] can
    adjust the wc using the line's (new value - its old value) saving
    possibly a 20k item arglist for the mathop.

    Yes, I suspect there are opportunities here to avoid having to iterate
    over the entire list. I did not try to add those opportunities.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Wed Nov 29 01:54:54 2023
    OK, I found a good solution. It's similar to what some of you suggested,
    but different enough. I don't want to post all the code because the whole mechanism spans across multiple procs and the procs can be long and complicated, but here is a description of the mechanism and the core procs:

    - Open file, immediately count all words in the buffer (slow), count all
    words in the current line (fast).
    That big count is done only once with a thorough enough cleanup, including removal of multiple spaces. So it's accurate.

    - Create three global variables: ::WholeBufferWC, ::CurrentLineWC and ::MostBufferWC.

    In case you're wondering, $::MostBufferWC is everything except the current line.

    $::MostBufferWC = $::WholeBufferWC - $::CurrentLineWC
    or
    $::CurrentLineWC + $::MostBufferWC = $::WholeBufferWC

    You get the picture.

    - Also store the number of the current line in ::CURRLINE. My code
    already did that anyway for the status bar.

    - Additionally, create the ::PREVLINE variable which will be used in the
    new, improved proc.
    That is where the magic is. It lets me monitor never more than one or two
    lines at a time. So it's fast. Tested and approved on the large file.

    I am using two procs now:

    wc is called occasionally and used for counting words cleanly, taking care
    of all the spaces.

    wcglobal is called at every touch and does the whole line monitor
    management, getting counts from wc whenever necessary. Here they are:


    proc p.wc {content} {
        set wordcount 0
        regsub -all {[\s]+} [string trim $content] { } content
        set wordcount [llength [split $content { }]]
        return $wordcount
    }


    proc p.wcglobal {} {
        set ::currindex [$::text index insert]
        lassign [split $::currindex "."] ::CURRLINE ::CURRCOL
        set ::currentlinecontent [$::text get $::CURRLINE.0 "$::CURRLINE.0 lineend"]
        set ::CurrentLineWC [p.wc $::currentlinecontent]

        if {$::CURRLINE == $::PREVLINE} {
            set ::WholeBufferWC [expr {$::MostBufferWC + $::CurrentLineWC}]
        }
        # else do not add the current line

        set ::MostBufferWC [expr {$::WholeBufferWC - $::CurrentLineWC}]

        set ::PREVLINE $::CURRLINE

        if {$::WholeBufferWC == 0} {return ""}
        if {$::WholeBufferWC == 1} {return "1 word"}
        if {$::WholeBufferWC > 1} {return "$::WholeBufferWC words"}
    }


    Tested with arrow keys navigation, random mouse clicks and pressing Return
    at the end or in the middle of lines.

    It works!

    I'm eating my own dog food, so if there are bugs, I will find them soon.

    Thank you for all the help once again.


    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Luc on Tue Nov 28 22:18:23 2023
    On 11/28/2023 8:54 PM, Luc wrote:
    OK, I found a good solution. It's similar to what some of you suggested,
    but different enough. I don't want to post all the code because the whole mechanism spans across multiple procs and the procs can be long and complicated, but here is a description of the mechanism and the core procs:

    - Open file, immediately count all words in the buffer (slow), count all words in the current line (fast).
    That big count is done only once with a thorough enough cleanup, including removal of multiple spaces. So it's accurate.

    - Create three global variables: ::WholeBufferWC, ::CurrentLineWC and ::MostBufferWC.

    In case you're wondering, $::MostBufferWC is everything except the current line.

    $::MostBufferWC = $::WholeBufferWC - $::CurrentLineWC
    or
    $::CurrentLineWC + $::MostBufferWC = $::WholeBufferWC

    You get the picture.

    - Also store the number of the current line in ::CURRLINE. My code
    already did that anyway for the status bar.

    - Additionally, create the ::PREVLINE variable which will be used in the
    new, improved proc.
    That is where the magic is. It lets me monitor never more than one or two lines at a time. So it's fast. Tested and approved on the large file.

    I am using two procs now:

    wc is called occasionally and used for counting words cleanly, taking care
    of all the spaces.

    wcglobal is called at every touch and does the whole line monitor
    management, getting counts from wc whenever necessary. Here they are:


    proc p.wc {content} {
    set wordcount 0
    regsub -all {[\s]+} [string trim $content] { } content
    set wordcount [llength [split $content { }]]
    return $wordcount
    }


    proc p.wcglobal {} {
    set ::currindex [$::text index insert]
    lassign [split $::currindex "."] ::CURRLINE ::CURRCOL
    set ::currentlinecontent [$::text get $::CURRLINE.0 "$::CURRLINE.0 lineend"]
    set ::CurrentLineWC [p.wc $::currentlinecontent]

    if {$::CURRLINE == $::PREVLINE} {
    set ::WholeBufferWC [expr {$::MostBufferWC + $::CurrentLineWC}]
    }
    # else do not add the current line

    set ::MostBufferWC [expr {$::WholeBufferWC - $::CurrentLineWC}]

    set ::PREVLINE $::CURRLINE

    if {$::WholeBufferWC == 0} {return ""}
    if {$::WholeBufferWC == 1} {return "1 word"}
    if {$::WholeBufferWC > 1} {return "$::WholeBufferWC words"}
    }


    Tested with arrow keys navigation, random mouse clicks and pressing Return
    at the end or in the middle of lines.

    It works!

    I'm eating my own dog food, so if there are bugs, I will find them soon.

    Thank you for all the help once again.



    That looks great. I think if you want to have some fun, you could run your full recount in a second thread.

    package require Thread

    set ::tid [thread::create {
        proc p.wc {content} {
            set wordcount 0
            regsub -all {[\s]+} [string trim $content] { } content
            set wordcount [llength [split $content { }]]
            return $wordcount
        }
        proc recount {main_tid var} {
            set words [p.wc [tsv::get text x]]
            thread::send -async $main_tid "set ::$var $words"
        }
        thread::wait
    }]

    # when you want a clean full update

    tsv::set text x [.t get 1.0 end]  ;# copy to a thread shared var (20mb -> ~20ms)
    unset -nocomplain ::count_from_thread
    thread::send -async $::tid "recount [thread::id] count_from_thread"


    Then when you are doing a line count update in wcglobal,

    test [info exist ::count_from_thread] and if it exists,
    then use that for your current ::WholeBufferWC, if not, just use the
    current value until that variable gets set, maybe then after
    another few chars are entered by the user it will be ready. But
    there should be no impact on the user's typing (in theory) :)

    But don't queue another request until count_from_thread exists.
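
    Inside p.wcglobal that check could look roughly like this (the extra
    line keeps Luc's MostBufferWC invariant intact):

    if {[info exists ::count_from_thread]} {
        set ::WholeBufferWC $::count_from_thread
        set ::MostBufferWC [expr {$::WholeBufferWC - $::CurrentLineWC}]
        unset ::count_from_thread
    }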

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Wed Nov 29 17:43:26 2023
    Luc <luc@sep.invalid> wrote:
    OK, I found a good solution. It's similar to what some of you suggested,
    but different enough.
    ...
    Here they are:


    proc p.wc {content} {
    set wordcount 0
    regsub -all {[\s]+} [string trim $content] { } content
    set wordcount [llength [split $content { }]]
    return $wordcount
    }

    You can reduce the above to this:

    proc p.wc {content} {
        return [llength [regexp -all -inline {\S+} $content]]
    }

    And save having to create a trimmed version of content, save creating a
    second copy of content with runs of whitespace converted to a single
    space, and save having to then scan and split content on those spaces.

    Thank you for all the help once again.

    You are welcome.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From clt.to.davebr@dfgh.net@21:1/5 to All on Wed Nov 29 19:07:24 2023
    proc p.wc {content} {
    set wordcount 0
    regsub -all {[\s]+} [string trim $content] { } content
    set wordcount [llength [split $content { }]]
    return $wordcount
    }

    You can reduce the above to this:

    proc p.wc {content} {
    return [llength [regexp -all -inline {\S+} $content]]
    }


    Why not use Christian's suggestion from above?

    proc p.wc {content} {
        regexp -all {\S+} $content
    }

    or just use the command without the proc wrapper

    Dave B

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to clt.to.davebr@dfgh.net on Wed Nov 29 18:39:05 2023
    clt.to.davebr@dfgh.net wrote:

    proc p.wc {content} {
    set wordcount 0
    regsub -all {[\s]+} [string trim $content] { } content
    set wordcount [llength [split $content { }]]
    return $wordcount
    }

    You can reduce the above to this:

    proc p.wc {content} {
    return [llength [regexp -all -inline {\S+} $content]]
    }


    Why not use Christian's suggestion from above?

    proc p.wc {content} {
    regexp -all {\S+} $content
    }

    Without -inline you get back a boolean from regexp, and Luc wants a word count.

    With -inline you get the matches as a list, and the llength is to
    produce a 'count' (length of list) from the raw matches.

    or just use the command without the proc wrapper

    Valid ask, but in my case I just created the minimal change to Luc's
    example code, without going into other details.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter Dean@21:1/5 to clt.to.davebr@dfgh.net on Wed Nov 29 19:53:46 2023
    clt.to.davebr@dfgh.net wrote:

    proc p.wc {content} {
    set wordcount 0
    regsub -all {[\s]+} [string trim $content] { } content
    set wordcount [llength [split $content { }]]
    return $wordcount
    }

    You can reduce the above to this:

    proc p.wc {content} {
    return [llength [regexp -all -inline {\S+} $content]]
    }


    Why not use Christian's suggestion from above?

    proc p.wc {content} {
    regexp -all {\S+} $content
    }

    or just use the command without the proc wrapper

    Dave B


    +1

    Christian's comment forced me to
    man n regexp

    pd

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Rich on Wed Nov 29 12:02:01 2023
    On 11/29/2023 10:39 AM, Rich wrote:
    clt.to.davebr@dfgh.net wrote:

    proc p.wc {content} {
    set wordcount 0
    regsub -all {[\s]+} [string trim $content] { } content
    set wordcount [llength [split $content { }]]
    return $wordcount
    }

    You can reduce the above to this:

    proc p.wc {content} {
    return [llength [regexp -all -inline {\S+} $content]]
    }


    Why not use Christian's suggestion from above?

    proc p.wc {content} {
    regexp -all {\S+} $content
    }

    Without -inline you get back a boolean from regexp, and Luc wants a word count.

    With -inline you get the matches as a list, and the llength is to
    produce a 'count' (length of list) from the raw matches.

    or just use the command without the proc wrapper

    Valid ask, but in my case I just created the minimal change to Luc's
    example code, without going into other details.


    Actually, for regexp with -all it returns the count of matches as desired.

    So, based on the above posts, here's a very compact form using a Thread when a full update is desired, say after a multi-line change, e.g. cut or paste:


    set ::tid [thread::create] ;# with no script it just does a thread::wait

    Then when you want a clean full update, you can queue this

    tsv::set text x [$::text get 1.0 end]
    unset -nocomplain ::count_from_thread
    thread::send -async $::tid {regexp -all {\S+} [tsv::get text x]} count_from_thread

    Or... I suppose you could even just set ::WholeBufferWC

    tsv::set text x [$::text get 1.0 end]
    thread::send -async $::tid {regexp -all {\S+} [tsv::get text x]} ::WholeBufferWC

    The only reason I prefer the unset method on a separate variable is to avoid a possible race condition. As long as you don't re-enter the event loop inside your callback for a text widget change, I don't think that can happen.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to All on Wed Nov 29 17:13:58 2023
    On Wed, 29 Nov 2023 12:02:01 -0800, et99 wrote:

    So, based on the above posts, here's a very compact form using a Thread
    when a full update is desired, say after a multi-line change, e.g. cut or paste:
    **************************

    Ah, yes. Multi-line changes e.g. cut or paste. My code doesn't support
    any of that. I will have to think about it and find a fix.

    Of course, I will look into the code you provided later.

    Thank you.


    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to et99@rocketship1.me on Wed Nov 29 20:16:07 2023
    et99 <et99@rocketship1.me> wrote:
    On 11/29/2023 10:39 AM, Rich wrote:

    Without -inline you get back a boolean from regexp, and Luc
    wants a word count.

    With -inline you get the matches as a list, and the llength is to
    produce a 'count' (length of list) from the raw matches.

    or just use the command without the proc wrapper

    Valid ask, but in my case I just created the minimal change to Luc's
    example code, without going into other details.


    Actually, for regexp with -all it returns the count of matches as
    desired.

    Ah, learn something new every day. I had not noticed that -all returns
    a count of matches. That saves creating a list only to take its
    length and free it all over again.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Rich on Wed Nov 29 18:51:22 2023
    On Wed, 29 Nov 2023 20:16:07 -0000 (UTC), Rich wrote:

    Ah, learn something new every day. I had not noticed that -all returns
    a count of matches. That saves creating a list only to take its
    length and free it all over again.

    **************************

    "That saves" is debatable. In all my attempts, regexp always seems to be
    too slow for this particular task, so I'm only using it once, when the
    file is opened.

    I will check, but I'm skeptical.


    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Peter Dean on Wed Nov 29 14:43:49 2023
    On 11/29/2023 2:28 PM, Peter Dean wrote:
    clt.to.davebr@dfgh.net wrote:

    Off Topic

    Why do we now have three separate threads going on this one issue?

    I too see that, but only when using Thunderbird with eternal-september. I don't know why that happens, but they all get reduced to just one in google groups.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter Dean@21:1/5 to All on Wed Nov 29 22:28:52 2023
    clt.to.davebr@dfgh.net wrote:

    Off Topic

    Why do we now have three separate threads going on this one issue?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luc@21:1/5 to Peter Dean on Wed Nov 29 20:20:11 2023
    On Wed, 29 Nov 2023 22:28:52 -0000 (UTC), Peter Dean wrote:

    clt.to.davebr@dfgh.net wrote:

    Off Topic

    Why do we now have three separate threads going on this one issue?
    **************************

    I honestly have no idea what you are talking about. I only see one.

    Eternal September and Claws Mail here.

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From et99@21:1/5 to Luc on Wed Nov 29 16:15:43 2023
    On 11/29/2023 1:51 PM, Luc wrote:
    On Wed, 29 Nov 2023 20:16:07 -0000 (UTC), Rich wrote:

    Ah, learn something new every day. I had not noticed that -all returns
    a count of matches. That saves creating a list only to take its
    length and free it all over again.

    **************************

    "That saves" is debatable. In all my attempts, regexp always seems to be
    too slow for this particular task, so I'm only using it once, when the
    file is opened.

    I will check, but I'm skeptical.



    I think you're right. regexp/regsub are pretty costly on a very large 20mb text.

    I tested a 20mb string

    % set text [string repeat "abcdef abcdef abcdef abcdef abcdef abcdef abcdef abcdef abcdef abcdef\n" 290000] ; puts [string length $text]
    20300000

    % text .t
    % .t insert end $text

    Now...

    % timems {set foo [.t get 1.0 end]}
    35.405 milliseconds per iteration
    % timems {regexp -all {\S+} $foo}
    1,565.588 milliseconds per iteration
    % timems {llength [split $foo]}
    359.005 milliseconds per iteration
    % timems {regsub -all {[\s]+} [string trim $foo] { } content}
    1,489.499 milliseconds per iteration


    So, yes, I would only do these when you have a multi-line change or at startup.

    But it's pretty quick to get the text into a thread shared variable where the second thread can access it. And then running the regex/regsub etc. there shouldn't harm the responsiveness of the text widget.

    % package require Thread
    2.8.4
    % timems {tsv::set text x [.t get 1.0 end]}
    42.386 milliseconds per iteration


    fyi:

    proc timems {args} {
        set result [uplevel 1 time $args]
        set number [format %.3f [expr {( [lindex $result 0] / 1000. )}]]
        set number [regsub -all {\d(?=(\d{3})+($|\.))} $number {\0,}]
        return "[format %12s $number ] milliseconds [lrange $result 2 end]"
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter Dean@21:1/5 to et99@rocketship1.me on Thu Nov 30 00:39:16 2023
    et99 <et99@rocketship1.me> wrote:
    On 11/29/2023 2:28 PM, Peter Dean wrote:
    clt.to.davebr@dfgh.net wrote:

    Off Topic

    Why do we now have three separate threads going on this one issue?

    I too see that, but only when using Thunderbird with eternal-september. I don't know why that happens, but they all get reduced to just one in google groups.





    I see it on thunderbird and tin. The problem is apparent in the headers.

    Here's the header from the first message in this thread



    Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
    From: clt.to.davebr@dfgh.net
    Newsgroups: comp.lang.tcl
    Subject: Safe handling of lists
    Date: Wed, 29 Nov 23 19:07:24 GMT
    Organization: A noiseless patient Spider
    Lines: 29
    Message-ID: <4431701284844@dlp>
    Injection-Info: dont-email.me; posting-host="f44ff9c2793715596881771caca57821"; \011logging-data="999582"; mail-complaints-to="abuse@eternal-september.org";\011posting-account="U2FsdGVkX1//3dnI1pxBXMHbjk1eHXw1"
    Cancel-Lock: sha1:nB6mjHtg4IyMflAMnXSxF0SsjhI=
    Xref: news.eternal-september.org comp.lang.tcl:65599


    and here's my followup



    Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
    From: p.dean@gmx.com (Peter Dean)
    Newsgroups: comp.lang.tcl
    Subject: Re: Safe handling of lists
    Date: Wed, 29 Nov 2023 22:28:52 -0000 (UTC)
    Organization: A noiseless patient Spider
    Lines: 5
    Sender: <peter@arch1701.localdomain>
    Message-ID: <uk8dv3$10t7a$1@dont-email.me>
    References: <4431701284844@dlp>
    Injection-Date: Wed, 29 Nov 2023 22:28:52 -0000 (UTC)
    Injection-Info: dont-email.me; posting-host="8dec35c0e4543e0d8537605e1e3506b3"; \011logging-data="1078506"; mail-complaints-to="abuse@eternal-september.org";\011posting-account="U2FsdGVkX1+Ds1VaEGZd2zSG2P88S2g6"
    User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.63-1-lts (x86_64))
    Cancel-Lock: sha1:zwLoa6WlRqCtBM9RTfoXZjJptE8=
    Xref: news.eternal-september.org comp.lang.tcl:65606

    clt.to.davebr@dfgh.net wrote:

    Off Topic

    Why do we now have three separate threads going on this one issue?



    You can see that the References: field in mine points to the Message-ID: field in the OP. But there is no References: field in the OP's header. Therefore threading can only happen on the subject line, not the References: field as it is meant to.

    And let's not mention google groups please.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Thu Nov 30 02:17:13 2023
    Luc <luc@sep.invalid> wrote:
    On Wed, 29 Nov 2023 20:16:07 -0000 (UTC), Rich wrote:

    Ah, learn something new every day. I had not noticed that -all returns
    a count of matches. That saves creating a list only to take its
    length and free it all over again.

    **************************

    "That saves" is debatable. In all my attempts, regexp always seems to be
    too slow for this particular task, so I'm only using it once, when the
    file is opened.

    I will check, but I'm skeptical.

    You are not following what I said.

    For the way I outlined:

    [llength [regexp -all -inline ...]]

    The regex engine has to do:

    1) create a list to store the results (one memory allocation)

    2) create a list element (one per match) and append it to the list
    (one memory allocation per match)

    3) depending upon the length of the list, possibly realloc() the
    master list structure one or more times as the length grows

    So for a text file with 1 million words, we have 1,000,001 memory
    allocations and 1,000,000 append-to-list operations (although these
    are relatively fast, 1M of them do add up).

    All to return that list, do nothing more with it beyond taking its
    length, and then throw it all away (another 1,000,001 free() operations).


    Instead, leaving off -inline, the regex engine has to do:

    1) create an integer to store the match count

    2) perform "incr count" each time it finds a match (or, more likely,
    a C level count++ operation on a C 'int' variable)


    This second method has "saved" all the memory allocations to build up
    the large list, and "saved" all the free() operations to destroy the
    large list.

    So it has "saved" doing all that work. But I said nor implied nothing
    about whether using regexp to actually perform the counting operation
    was faster or not.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Peter Dean on Thu Nov 30 02:22:27 2023
    Peter Dean <p.dean@gmx.com> wrote:
    clt.to.davebr@dfgh.net wrote:

    Off Topic

    Why do we now have three separate threads going on this one issue?

    Someone's news reader broke the References: header.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter Dean@21:1/5 to Rich on Thu Nov 30 02:49:34 2023
    Rich <rich@example.invalid> wrote:
    Peter Dean <p.dean@gmx.com> wrote:
    clt.to.davebr@dfgh.net wrote:

    Off Topic

    Why do we now have three separate threads going on this one issue?

    Someone's news reader broke the References: header.

    And we've no clue about which newsreader or OS because it broke that header as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)