Forum: >>> Magnum BBS <<<

Re: Safe handling of lists

From Gerald Lester@21:1/5 to Luc on Sun Nov 26 13:50:26 2023

On 11/26/23 13:29, Luc wrote:

> Me again. I have a problem dealing with lists.
>
> I wanted to count the words in a text widget that contains the text
> of a file. I decided to treat the whole thing like a list and iterate
> over it to count the list elements, possibly filtering some things
> out.
>
> It worked fine with a small file, but a large (very large) file
> triggers this:
>
> list element in quotes followed by "," instead of space
> while executing
> "foreach w $::FILECONTENT {
> incr wordcount
> }"
> (procedure "p.wc" line 5)
> invoked from within
> "p.wc"
>
> Also relevant,
>
> set ::FILECONTENT [$::text get 1.0 end]
>
> It's probably something obvious that I am missing again. Can someone
> please enlighten me?

Every list is a string, but not every string is a list.

I would suggest that you take a look at the following builtins:
tcl_endOfWord str start
tcl_startOfNextWord str start
tcl_startOfPreviousWord str start
tcl_wordBreakAfter str start
tcl_wordBreakBefore str start
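
To make the "not every string is a list" point concrete, here is a minimal sketch. The sample sentence is made up for illustration; it triggers the same kind of error as the one reported because a quoted element is followed directly by a comma.

```tcl
# Hypothetical sample text: the closing quote is followed by a comma,
# which is fine as prose but is not a well-formed Tcl list.
set sample {He said "hello", then left.}

# [string is list] reports whether the string can be parsed as a list.
set isList [string is list $sample]

# List commands such as foreach or llength raise an error on such text.
set failed [catch {llength $sample} message]

# [split] never parses list syntax; it only cuts at separator characters.
set pieces [split $sample]
set pieceCount [llength $pieces]

puts "is list: $isList"
puts "llength failed: $failed ($message)"
puts "split gives $pieceCount pieces"
```

The point of the sketch is only that text taken from a widget should go through [split] (or a string-based counter) before any list command touches it.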

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to All on Sun Nov 26 16:29:14 2023

Me again. I have a problem dealing with lists.

I wanted to count the words in a text widget that contains the text
of a file. I decided to treat the whole thing like a list and iterate
over it to count the list elements, possibly filtering some things
out.

It worked fine with a small file, but a large (very large) file
triggers this:

list element in quotes followed by "," instead of space
while executing
"foreach w $::FILECONTENT {
incr wordcount
}"
(procedure "p.wc" line 5)
invoked from within
"p.wc"

Also relevant,

set ::FILECONTENT [$::text get 1.0 end]

It's probably something obvious that I am missing again. Can someone
please enlighten me?

--
Luc

From et99@21:1/5 to Luc on Sun Nov 26 13:35:48 2023

I think you want [split]

From Luc@21:1/5 to All on Sun Nov 26 19:00:47 2023

On Sun, 26 Nov 2023 13:35:48 -0800, et99 wrote:

I think you want [split]

**************************

I am using split now. It's faster and "list safe" so it solves
the problem I presented first.

The new problem now is that cleaning up the huge string for proper
counting is not fast enough.

proc p.wc {} {
    set wordcount 0
    set content [$::text get 1.0 end]
    set cleancontent [string map "\n { } \t { }" $content]
    set wordcount [llength [split $cleancontent { }]]
    return $wordcount
}

Since it's called whenever some change is made to the text widget,
typing becomes unacceptably slow.

And I still haven't added a line to clean all the multiple
consecutive spaces, which changes the tally. I can't use regexp
because it's too slow for what I want.

Another (debatable?) problem is that the old code gave me a count
that was a lot closer to the output of 'wc -w' in a terminal.
This new one is way off, and it gives me a lower count! One would
think it would be higher because of the consecutive spaces.

--
Luc
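
One possible explanation for the lower count (an observation on the proc above, not something stated in the thread): the map argument "\n { } \t { }" is itself parsed as a list, and list parsing treats newline and tab as element separators. If that reading is right, the map only replaces a space with a space, newlines and tabs survive, and words separated only by a line break are never split apart. A minimal sketch with a made-up sample string:

```tcl
# The map exactly as written in p.wc. After backslash substitution the
# newline and tab act as list separators, so only two elements remain.
set brokenMap "\n { } \t { }"
set brokenLen [llength $brokenMap]

# Building the map with [list] keeps the newline and tab as real keys.
set fixedMap [list \n " " \t " "]
set fixedLen [llength $fixedMap]

# Made-up sample: four words separated by newline, tab and space.
set sample "one\ntwo\tthree four"
set brokenCount [llength [split [string map $brokenMap $sample] { }]]
set fixedCount  [llength [split [string map $fixedMap  $sample] { }]]

puts "broken map: $brokenLen elements, $brokenCount words"
puts "fixed map:  $fixedLen elements, $fixedCount words"
```

With the [list] form the map is applied as intended, although runs of consecutive spaces still produce empty elements that inflate the tally.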

From Paul Obermeier@21:1/5 to All on Sun Nov 26 23:29:19 2023

Take a look at my CAWT extension. It contains a CountWords procedure,
see https://www.tcl3d.org/cawt/download/CawtReference-Cawt.html#::Cawt::CountWords

Paul

From Rich@21:1/5 to Luc on Sun Nov 26 23:03:06 2023

Luc <luc@sep.invalid> wrote:

> On Sun, 26 Nov 2023 23:29:19 +0100, Paul Obermeier wrote:
>
>> Take a look at my CAWT extension. It contains a CountWords procedure,
>> see https://www.tcl3d.org/cawt/download/CawtReference-Cawt.html#::Cawt::CountWords
>>
>> Paul
>
> **************************
>
> "CAWT (COM Automation With Tcl) is a utility package based on Twapi
> to script Microsoft Windows® applications with Tcl."

But, if the source is available, you could look at the source to see
how it performs "word counting" and use that for inspiration.

From Luc@21:1/5 to Paul Obermeier on Sun Nov 26 19:54:58 2023

On Sun, 26 Nov 2023 23:29:19 +0100, Paul Obermeier wrote:

> Take a look at my CAWT extension. It contains a CountWords procedure,
> see https://www.tcl3d.org/cawt/download/CawtReference-Cawt.html#::Cawt::CountWords
>
> Paul

**************************

"CAWT (COM Automation With Tcl) is a utility package based on Twapi
to script Microsoft Windows® applications with Tcl."

Thanks, but I am on Linux.

--
Luc

From Paul Obermeier@21:1/5 to All on Mon Nov 27 01:15:52 2023

On 27.11.2023 at 00:03, Rich wrote:

> Luc <luc@sep.invalid> wrote:
>
>> On Sun, 26 Nov 2023 23:29:19 +0100, Paul Obermeier wrote:
>>
>>> Take a look at my CAWT extension. It contains a CountWords procedure,
>>> see
>>> https://www.tcl3d.org/cawt/download/CawtReference-Cawt.html#::Cawt::CountWords
>>>
>>> Paul
>>
>> **************************
>>
>> "CAWT (COM Automation With Tcl) is a utility package based on Twapi
>> to script Microsoft Windows® applications with Tcl."
>
> But, if the source is available, you could look at the source to see
> how it performs "word counting" and use that for inspiration.

That was the idea.

From et99@21:1/5 to Luc on Sun Nov 26 16:37:37 2023

Just wondering, why the string map to change newlines and tabs to spaces. Split can take those plus spaces in the splitchars string. In fact, I think that's the default anyway.

Also, are you saying this is calculated on every char the user types? Is that to keep a wordcount in say a status area? How big is the text we're talking about here?

From clt.to.davebr@dfgh.net@21:1/5 to All on Mon Nov 27 02:25:32 2023

This is 2-3 times slower than p.wc, but it ignores any number of spacing characters:

proc wc {str} {
    llength [lmap x [split $str] {if {[string is space $x]} {continue} {set x}}]
}

It gives the same count as the wc utility on one 100k test file.

The other problem is checking the count between keystrokes.

Consider keeping track of the words above and below the edit window, then tracking lines moving into and out of the edit window and the word count in the edit window at each keystroke. Much more complicated, but it only processes a short segment of text on each keystroke.

Dave B

From Luc@21:1/5 to All on Mon Nov 27 00:05:47 2023

On Sun, 26 Nov 2023 16:37:37 -0800, et99 wrote:

> Just wondering, why the string map to change newlines and tabs to spaces.
> Split can take those plus spaces in the splitchars string. In fact, I
> think that's the default anyway.
>
> Also, are you saying this is calculated on every char the user types? Is
> that to keep a wordcount in say a status area? How big is the text we're
> talking about here?

**************************

I decided to change all the newlines to spaces because I was afraid that

this
that

might become 'thisthat' rather than 'this that' after the split.

It probably wouldn't, but I wanted to make sure.

Yes, it's a sort of text editor and the word count must be calculated
after every change.

(I had big plans for it but personal issues have forced me to put it
on the back burner for God knows how long. I'm trying to fix this code
right now because I got a copywriting job where counting words in real
time is very useful. A ton of other things will be fixed... someday.)

Anyway, there is a status bar with some information and the word count
is supposed to be updated at every touch of the keyboard. I currently
filter out arrow key movements. Need to rewrite the code and make it
smarter.

The "big text" I am using to assess performance is 18MB.

$ wc /home/tcl/bigtext.txt
218993 2758398 18421662 /home/tcl/bigtext.txt

--
Luc

From Peter Dean@21:1/5 to Luc on Mon Nov 27 17:27:24 2023

tcllib has splitx, which splits on a regexp, e.g.

::textutil::split::splitx $l {\s+}

splits on runs of whitespace.

From Alan Grunwald@21:1/5 to Luc on Mon Nov 27 12:57:27 2023

Surely you only need to update the word count if the character inserted
or deleted is a word separator? I assume you can tell whether this is
the case.

Alan

From Ralf Fassel@21:1/5 to All on Mon Nov 27 15:50:08 2023

* Luc <luc@sep.invalid>
| I am using split now. It's faster and "list safe" so it solves
| the problem I presented first.

| The new problem now is that cleaning up the huge string for proper
| counting is not fast enough.

| proc p.wc {} {
| set wordcount 0
| set content [$::text get 1.0 end]
| set cleancontent [string map "\n { } \t { }" $content]
| set wordcount [llength [split $cleancontent { }]]
| return $wordcount
| }

| Since it's called whenever some change is made to the text widget,
| typing becomes unacceptably slow.

You could set up a timer to do the real work after a short period
(500ms) of keyboard-idle, and return the old count else. Quick typists
see the updated count only after they stop typing. Else you would need
to keep track of what is inserted/deleted and incr/decr the count based
on that (fragile).

| And I still haven't added a line to clean all the multiple
| consecutive spaces, which changes the tally. I can't use regexp
| because it's too slow for what I want.

As someone else suggested, ::textutil::split::splitx might be a solution
(though it might be even slower due to the use of regexps, need to
test), or else check the [string length] of elements and count them only
as words if > 0.

R'
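
A minimal sketch of the idle-timer idea, assuming a hypothetical helper named schedule_count and a stand-in recount proc (neither is from the thread). Each call cancels the pending timer and starts a new one, so the expensive count runs once after the typing pauses. The vwait at the end is only there so the sketch runs under plain tclsh; a Tk application already has an event loop.

```tcl
# Hypothetical helper: (re)start a one-shot timer every time it is called.
# Only the last call within $delayMs milliseconds actually runs $script.
proc schedule_count {delayMs script} {
    global pendingCountId
    if {[info exists pendingCountId]} {
        after cancel $pendingCountId
    }
    set pendingCountId [after $delayMs $script]
}

# Stand-in for the expensive recount; here it only records that it ran.
set recounts 0
proc recount {} {
    global recounts
    incr recounts
}

# Simulate three quick keystrokes, then let the event loop run.
schedule_count 50 recount
schedule_count 50 recount
schedule_count 50 recount
after 300 {set finished 1}
vwait finished
puts "recount ran $recounts time(s)"
```

In the editor, the keystroke binding would call something like schedule_count 500 with the real counting proc, and the status bar would keep showing the old value until the timer fires.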

From Rich@21:1/5 to et99@rocketship1.me on Tue Nov 28 01:49:52 2023

et99 <et99@rocketship1.me> wrote:

> On 11/27/2023 6:50 AM, Ralf Fassel wrote:
>
>> As someone else suggested, ::textutil::split::splitx might be a
>> solution (though it might be even slower due to the use of regexps,
>> need to test), or else check the [string length] of elements and
>> count them only as words if > 0.
>
> I was wondering just how accurate it has to be if one is dealing with
> a 20mb file with perhaps 2 million words.

It would seem that for such a large file that being a "little bit
incorrect" on a real-time display could be acceptable. Esp. if there
were a way to request an "accurate" (and slower) word count for the
few times the exact value is needed.

> What about just counting all the spaces, tabs, and newlines in the
> text by using
>
> set totchars [string length $txt]
> set nowhites [string map {\n {} \t {} { } {}} $txt]
> set wordcount [expr { $totchars - [string length $nowhites] }]
>
> Won't this be as accurate as using split? And with all string
> operations, no costs of creating lists.

But there is the creating of a copy of a 20MB string for the output of
string map. That is probably still faster than all the small
allocations of word size strings to populate a list.

> In timing tests, the [string length] calls seemed to be near zero, I
> guess the string objects have the length saved in them.

For Tcl_Obj's holding strings, there is an explicit length count field
in the struct, so "length of string" devolves to "retrieve the length
field of the Tcl_Obj".

> And there's always the possibility of doing some of the heavy lifting
> in a separate thread.

If an accurate, and nearly realtime, count is needed, and Luc does not
want the GUI event loop to block while the count occurs, then using a
second thread might be reasonable. That is if the time to 'get' the
text and send it off to the second thread does not itself become the
slow point.

From et99@21:1/5 to Ralf Fassel on Mon Nov 27 17:16:46 2023

I was wondering just how accurate it has to be if one is dealing with a 20mb file with perhaps 2 million words.

What about just counting all the spaces, tabs, and newlines in the text by using

set totchars [string length $txt]
set nowhites [string map {\n {} \t {} { } {}} $txt]
set wordcount [expr { $totchars - [string length $nowhites] }]

Won't this be as accurate as using split? And with all string operations, no costs of creating lists.
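
A small check of that question on a made-up sample. The string-map difference counts whitespace characters, so it should match a word count only when every pair of words is separated by exactly one whitespace character; the regexp line is added here purely for comparison and is not part of the suggestion above.

```tcl
# Made-up sample with a double space and a blank line.
set txt "one  two\n\nthree"

# The approach above: characters removed by the map = whitespace characters.
set totchars [string length $txt]
set nowhites [string map {\n {} \t {} { } {}} $txt]
set whiteCount [expr {$totchars - [string length $nowhites]}]

# For comparison: count runs of non-whitespace characters.
set wordCount [regexp -all {\S+} $txt]

puts "whitespace characters: $whiteCount, words: $wordCount"
```

On text with doubled spaces or blank lines the two numbers drift apart, which fits the idea of treating the fast figure as an estimate.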

In timing tests, the [string length] calls seemed to be near zero, I guess the string objects have the length saved in them.

But I was also thinking, there could be a threshold, say for smaller files, i.e. where the text extracted

$::text get 1.0 end

was less than some value, then use a more accurate method, since impact on typing would be smaller.

I suspect that there has to be a file read to get the 20megs in, where an extra second to set it up won't matter. Then an accurate vs. quicker method might yield a percentage difference. That percent could be factored into any new quick counts.

But as suggested, doing it only when the user has stopped typing makes sense to me.

And there's always the possibility of doing some of the heavy lifting in a separate thread.

Interesting problem. The text editor I use doesn't count words dynamically, but rather has a statistics command.

From Luc@21:1/5 to All on Mon Nov 27 23:07:34 2023

On Mon, 27 Nov 2023 17:16:46 -0800, et99 wrote:

> I was wondering just how accurate it has to be if one is dealing with a
> 20mb file with perhaps 2 million words.

For me, it doesn't, except that I'm using the same application for small
and large files and I want the application to be able to handle all cases.

Right now, I am only enabling the wc proc when I am doing my copywriting
work. At all other times, it is disabled because typing becomes
unacceptably slow on the large files I use regularly.

> And there's always the possibility of doing some of the heavy lifting
> in a separate thread.

Hmm. That never crossed my mind. I don't remember ever coding with
threads. I will have to look into that possibility.

> Interesting problem. The text editor I use doesn't count words
> dynamically, but rather has a statistics command.

Now that did cross my mind. I wrote on MS Word and other word processors
for most of my life and that's how they all do it. But my own editor has
this or that nice little feature I made for myself that makes it all a
better experience for me so I prefer to use it now.

That's why we learn to code, right?

The problem is, a statistics command doesn't really cut it.

I am assigned a topic and a word count, usually 500 or 700. Rarely,
1200. There are two approaches I can take:

1. Splurge carelessly and rewrite to prune excesses later.

2. Manage my verbosity as I go along. Sort of a regularity rally.

I like #2 better. It's actually more enjoyable and it saves me quite
some time. Sometimes the deadline is long, sometimes it isn't.

I can only manage my verbosity as I go along if I know how many words
I have put into it in real time.

This is a very old idea. Even when I used MS Word on Windows, and that
was literally 20 to 26 years ago, I craved something like that.
I can finally have it.

Or can I? Let's see.

This is just a quick reply. I will take a more careful look at other
aspects of your comments later. I will see what I can do with the code
ideas you contributed. If I do find a good solution, I will wikify it.

Many thanks for your interest.

--
Luc

From Rich@21:1/5 to Luc on Tue Nov 28 03:03:26 2023

Luc <luc@sep.invalid> wrote:

> On Mon, 27 Nov 2023 17:16:46 -0800, et99 wrote:
>
>> I was wondering just how accurate it has to be if one is dealing
>> with a 20mb file with perhaps 2 million words.
>
> For me, it doesn't, except that I'm using the same application for
> small and large files and I want the application to be able to handle
> all cases.
>
> Right now, I am only enabling the wc proc when I am doing my
> copywriting work. At all other times, it is disabled because typing
> becomes unacceptably slow on the large files I use regularly.

Thing is, given what you've described so far of your code, you are
attempting to brute-force a real-time count, and doing so you've
created something on the order of an O(n^2) or worse algorithm.

You are "counting" lots of things over and over that you previously
counted after the last keystroke, even though that keystroke only made
(usually) a one character change to the entire file.

You instead want to think about how to not count any more than you have
to for each "count cycle".

I pulled a copy of Project Gutenberg's copy of War and Peace (because it
is a long book) from here: https://gutenberg.org/cache/epub/2600/pg2600.txt

Then I concatenated six duplicates into a single file (to make an
approximately 20MB file). I named that file "20mb".

Then I set out to see what could be done by trying to not count
everything after any change. This is ugly demonstration code below, all
of this would ideally be wrapped inside a oo::object and made prettier,
but you get the raw demo below:

#!/usr/bin/wish

set wc 0

label .wc -textvariable wc
text .t
pack .wc
pack .t

set fd [open 20mb RDONLY]
.t insert end [read $fd]
close $fd

set num_lines [.t count -lines 0.0 end]

# prefill per-line word-count cache (index 0 is a dummy, text lines start at 1)
set lcc [list 0]
for {set i 1} {$i < $num_lines} {incr i} {
    lappend lcc [llength [regexp -all -inline {\S+} [.t get $i.0 "$i.0 lineend"]]]
}

# initial load word count
set wc [tcl::mathop::+ {*}$lcc]

# set modified flag of text widget to false
.t edit modified 0

# recount only the line holding the insert cursor, then re-sum the cache
proc modified {} {
    global lcc
    global wc
    lassign [split [.t index insert] .] line_num
    lset lcc $line_num [llength [regexp -all -inline {\S+} [.t get $line_num.0 "$line_num.0 lineend"]]]
    set wc [tcl::mathop::+ {*}$lcc]
    .t edit modified 0
}

bind .t <<Modified>> [list modified]

This above counts, in real time, as I type, with the 20mb 6x War and
Peace file loaded. If I get going typing I can just begin to sense a
latency on typing, but I do really have to get on a good roll for that.
The one place I *do* see a latency is for keyboard autorepeat, there is
a clear delay then.

But, while hammering out this demo I noticed that <<Modified>> is
called twice for every keystroke (meaning this above is still doing
twice the work it needs to do). I simply did not want to be bothered
with working out why <<Modified>> is called twice with every keystroke,
nor with working out how to not call it twice for every keystroke.

From et99@21:1/5 to Rich on Mon Nov 27 21:13:22 2023

Hmmm, a line cache. Cool.

And since lcc is a list one can insert and delete quickly when lines are added or removed keeping it in sync with the text widget.

How to tell just what changed (e.g. a paste in the middle) I can't say.

I think for changes on a line, the [tcl::mathop::+ {*}$lcc] can adjust the wc using the line's (new value - its old value) saving possibly a 20k item arglist for the mathop.

But I think this should work pretty well.
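
The "new value minus old value" adjustment could look roughly like the sketch below. update_line_count is a hypothetical name, and the three-line cache is a stand-in for the lcc list from the demo; only the changed entry is touched and the total is moved by the difference instead of being re-summed.

```tcl
# Hypothetical helper: update one cache entry and adjust the running total
# by the difference, instead of re-summing the whole list.
proc update_line_count {lineNum lineText} {
    global lcc wc
    set new [regexp -all {\S+} $lineText]
    set old [lindex $lcc $lineNum]
    lset lcc $lineNum $new
    incr wc [expr {$new - $old}]
}

# Tiny stand-in for the cache: index 0 is unused, lines 1..3 follow.
set lcc [list 0 3 5 2]
set wc  [tcl::mathop::+ {*}$lcc]

# Line 2 went from five words to six.
update_line_count 2 "now this line has six words"
puts "total: $wc"
```

This still assumes the edit stayed on one existing line; inserted or deleted lines would need the cache itself to grow or shrink first.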

From Christian Gollwitzer@21:1/5 to All on Tue Nov 28 09:39:33 2023

On 27.11.23 at 08:27, Peter Dean wrote:

> tcllib has splitx which splits on regexp eg
>
> ::textutil::split::splitx $l {\s+}
>
> splits on runs of spaces

For only the count, it is not required to split the string into a list.
regexp can do the counting:

set string {This is a sentence with whitespaces in it.}
regexp -all {\s+} $string

returns the number of whitespace runs (7 here). With the uppercase \S
(that is, {\S+}) it returns the number of runs of non-whitespace, i.e.
the word count, even if there is whitespace before and after.

Christian
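
Wrapped into a proc, the \S variant might look like this. count_words is a hypothetical name and the sample string is made up; whether this is fast enough on an 18MB buffer is not something the sketch tries to answer.

```tcl
# Count words as runs of non-whitespace, without building a list.
# With -all and no match variables, regexp returns the number of matches.
proc count_words {text} {
    return [regexp -all {\S+} $text]
}

# Leading, doubled and trailing whitespace do not change the result.
set n [count_words "  leading,  doubled\tand\ntrailing   "]
puts "words: $n"
```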

From Peter Dean@21:1/5 to Christian Gollwitzer on Tue Nov 28 09:19:50 2023

Impressive.
Better to predict what the real question was than to answer the one asked.

From Ralf Fassel@21:1/5 to All on Tue Nov 28 14:20:38 2023

* Rich <rich@example.invalid>
| But, while hammering out this demo I noticed that <<Modified>> is
| called twice for every keystroke (meaning this above is still doing
| twice the work it needs to do). I simply did not want to be bothered
| with working out why <<Modified>> is called twice with every keystroke,

Might be due to the fact that you change the 'modified' flag in the
callback, and looking at the C code for the text widget which handles
the ".t edit modified 0":

    /*
     * Only issue the <<Modified>> event if the flag actually changed.
     * However, degree of modified-ness doesn't matter. [Bug 1799782]
     */
    if ((!oldModified) != (!setModified)) {
        GenerateModifiedEvent(textPtr);
    }

HTH
R'
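
If that is the cause, one way to skip the second pass is to check the flag at the top of the handler. This is only a sketch: on_modified and recount_current_line are hypothetical names, and the handler takes the widget and the recount command as arguments so it can be tried without the rest of the demo.

```tcl
# Hypothetical replacement for the demo's handler. $w is the text widget,
# $recount is a command prefix that refreshes the word count.
proc on_modified {w recount} {
    # Resetting the flag below fires <<Modified>> again; that second
    # pass sees the flag already cleared and returns without recounting.
    if {![$w edit modified]} {
        return
    }
    {*}$recount
    $w edit modified 0
}

# In a Tk script this would be wired up as, for example:
#   bind .t <<Modified>> [list on_modified .t recount_current_line]
```

The event still fires twice per keystroke, but the expensive part should then run only on the pass where the flag went from unset to set.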

From Rich@21:1/5 to Ralf Fassel on Tue Nov 28 17:26:27 2023

Ah, that would be why then.

As I said, I did not bother digging to find out why, nor to think about
how to avoid the double calls.

From Rich@21:1/5 to et99@rocketship1.me on Tue Nov 28 17:24:46 2023

et99 <et99@rocketship1.me> wrote:

On 11/27/2023 7:03 PM, Rich wrote:

Luc <luc@sep.invalid> wrote:

Right now, I am only enabling the wc proc when I am doing my
copywriting work. At all other times, it is disabled because
typing becomes unacceptably slow on the large files I use
regularly.

Thing is, given what you've described so far of your code, you are
attempting to brute-force a real-time count, and doing so you've
created something on the order of an O(n^2) or worse algorithm.

You are "counting" lots of things over and over that you previously
counted after the last keystroke, even though that keystroke only made
(usually) a one character change to the entire file.

You instead want to think about how to not count any more than you have
to for each "count cycle".

Hmmm, a line cache. Cool.

It reduces the need to count to only counting the current line being
edited. Which is where this example derives almost all of its speedup.

And since lcc is a list one can insert and delete quickly when lines
are added or removed keeping it in sync with the text widget.

Yes, the example does not try to adjust the list for insert/delete
operations. A real version would need to track inserts/deletes and
adjust the line count cache accordingly. And yes, being a list, inserting/deleting one or more line items is reasonably quick.
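A rough sketch of that bookkeeping, using plain list operations only and no text widget. The helper names (countwords, lcc_insert, lcc_delete) are made up for illustration, and lcc is assumed to hold one word count per line, with list index 0 standing for text line 1:

```tcl
# Hypothetical helpers; lcc holds one word count per text line.
proc countwords {line} {
    return [regexp -all {\S+} $line]
}

# Insert cached counts for new lines before 0-based list index idx.
proc lcc_insert {lccVar idx lines} {
    upvar 1 $lccVar lcc
    set counts [lmap l $lines {countwords $l}]
    set lcc [linsert $lcc $idx {*}$counts]
}

# Drop the cached counts for list indices first..last (inclusive).
proc lcc_delete {lccVar first last} {
    upvar 1 $lccVar lcc
    set lcc [lreplace $lcc $first $last]
}

set lcc {3 0 5}
lcc_insert lcc 1 {"two words" "and three more"}   ;# lcc is now {3 2 3 0 5}
lcc_delete lcc 3 3                                 ;# lcc is now {3 2 3 5}
```

This needs Tcl 8.6 for lmap. Working out which indices to pass in is the part that still depends on watching the widget's insert/delete operations.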

How to tell just what changed (e.g. a paste in the middle) I can't say.

Probably have to shim the text widget and watch for all the operations
that occur, and adjust the cache accordingly.

I think for changes on a line, the [tcl::mathop::+ {*}$lcc] can
adjust the wc using the line's (new value - its old value) saving
possibly a 20k item arglist for the mathop.

Yes, I suspect there are opportunities here to avoid having to iterate
over the entire list. I did not try to add those opportunities.
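For a change confined to one line, the delta idea could look something like this. It is only a sketch: lcc_update_line is a hypothetical name, and it assumes a running total is kept next to the cache:

```tcl
# Hypothetical sketch: lcc is the per-line count cache, total the running sum.
proc lcc_update_line {lccVar totalVar idx newline} {
    upvar 1 $lccVar lcc $totalVar total
    set new [regexp -all {\S+} $newline]
    set old [lindex $lcc $idx]
    lset lcc $idx $new
    # Adjust the total by (new - old) instead of summing the whole list again.
    incr total [expr {$new - $old}]
}

set lcc {3 2 5}
set total [tcl::mathop::+ {*}$lcc]      ;# full sum, done once up front
lcc_update_line lcc total 1 "now four words here"
puts "$lcc -> $total"
```

The full [tcl::mathop::+ {*}$lcc] is then only needed at startup or after a change that touches several lines.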

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to All on Wed Nov 29 01:54:54 2023

OK, I found a good solution. It's similar to what some of you suggested,
but different enough. I don't want to post all the code because the whole mechanism spans across multiple procs and the procs can be long and complicated, but here is a description of the mechanism and the core procs:

- Open file, immediately count all words in the buffer (slow), count all
words in the current line (fast).
That big count is done only once with a thorough enough cleanup, including removal of multiple spaces. So it's accurate.

- Create three global variables: ::WholeBufferWC, ::CurrentLineWC and ::MostBufferWC.

In case you're wondering, $::MostBufferWC is everything except the current line.

$::MostBufferWC = $::WholeBufferWC - $::CurrentLineWC
or
$::CurrentLineWC + $::MostBufferWC = $::WholeBufferWC

You get the picture.

- Also store the number of the current line in ::CURRLINE. My code
already did that anyway for the status bar.

- Additionally, create the ::PREVLINE variable which will be used in the
new, improved proc.
That is where the magic is. It lets me monitor never more than one or two
lines at a time. So it's fast. Tested and approved on the large file.

I am using two procs now:

wc is called occasionally and used for counting words cleanly, taking care
of all the spaces.

wcglobal is called at every touch and does the whole line monitor
management, getting counts from wc whenever necessary. Here they are:

proc p.wc {content} {
set wordcount 0
regsub -all {[\s]+} [string trim $content] { } content
set wordcount [llength [split $content { }]]
return $wordcount
}

proc p.wcglobal {} {
set ::currindex [$::text index insert]
lassign [split $::currindex "."] ::CURRLINE ::CURRCOL
set ::currentlinecontent [$::text get $::CURRLINE.0 "$::CURRLINE.0 lineend"]
set ::CurrentLineWC [p.wc $::currentlinecontent]

if {$::CURRLINE == $::PREVLINE} {
set ::WholeBufferWC [expr {$::MostBufferWC + $::CurrentLineWC}]
}
# else do not add the current line

set ::MostBufferWC [expr {$::WholeBufferWC - $::CurrentLineWC}]

set ::PREVLINE $::CURRLINE

if {$::WholeBufferWC == 0} {return ""}
if {$::WholeBufferWC == 1} {return "1 word"}
if {$::WholeBufferWC > 1} {return "$::WholeBufferWC words"}
}

Tested with arrow keys navigation, random mouse clicks and pressing Return
at the end or in the middle of lines.

It works!

I'm eating my own dog food, so if there are bugs, I will see them soon.

Thank you for all the help once again.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From et99@21:1/5 to Luc on Tue Nov 28 22:18:23 2023

On 11/28/2023 8:54 PM, Luc wrote:

OK, I found a good solution. It's similar to what some of you suggested,
but different enough. I don't want to post all the code because the whole mechanism spans across multiple procs and the procs can be long and complicated, but here is a description of the mechanism and the core procs:

- Open file, immediately count all words in the buffer (slow), count all words in the current line (fast).
That big count is done only once with a thorough enough cleanup, including removal of multiple spaces. So it's accurate.

- Create three global variables: ::WholeBufferWC, ::CurrentLineWC and ::MostBufferWC.

In case you're wondering, $::MostBufferWC is everything except the current line.

$::MostBufferWC = $::WholeBufferWC - $::CurrentLineWC
or
$::CurrentLineWC + $::MostBufferWC = $::WholeBufferWC

You get the picture.

- Also store the number of the current line in ::CURRLINE. My code
already did that anyway for the status bar.

- Additionally, create the ::PREVLINE variable which will be used in the
new, improved proc.
That is where the magic is. It lets me monitor never more than one or two lines at a time. So it's fast. Tested and approved on the large file.

I am using two procs now:

wc is called occasionally and used for counting words cleanly, taking care
of all the spaces.

wcglobal is called at every touch and does the whole line monitor
management, getting counts from wc whenever necessary. Here they are:

proc p.wc {content} {
set wordcount 0
regsub -all {[\s]+} [string trim $content] { } content
set wordcount [llength [split $content { }]]
return $wordcount
}

proc p.wcglobal {} {
set ::currindex [$::text index insert]
lassign [split $::currindex "."] ::CURRLINE ::CURRCOL
set ::currentlinecontent [$::text get $::CURRLINE.0 "$::CURRLINE.0 lineend"]
set ::CurrentLineWC [p.wc $::currentlinecontent]

if {$::CURRLINE == $::PREVLINE} {
set ::WholeBufferWC [expr {$::MostBufferWC + $::CurrentLineWC}]
}
# else do not add the current line

set ::MostBufferWC [expr {$::WholeBufferWC - $::CurrentLineWC}]

set ::PREVLINE $::CURRLINE

if {$::WholeBufferWC == 0} {return ""}
if {$::WholeBufferWC == 1} {return "1 word"}
if {$::WholeBufferWC > 1} {return "$::WholeBufferWC words"}
}

Tested with arrow keys navigation, random mouse clicks and pressing Return
at the end or in the middle of lines.

It works!

I'm eating my own dog food, so if there are bugs, I will see them soon.

Thank you for all the help once again.

That looks great. I think if you want to have some fun, you could run your full recount in a second thread.

package require Thread

set ::tid [thread::create {
proc p.wc {content} {
set wordcount 0
regsub -all {[\s]+} [string trim $content] { } content
set wordcount [llength [split $content { }]]
return $wordcount
}
proc recount {main_tid var} {
set words [p.wc [tsv::get text x]]
thread::send -async $main_tid "set ::$var $words"
}
thread::wait
}]

# when you want a clean full update

tsv::set text x [.t get 1.0 end] ;# copy to a thread shared var (20mb -> ~20ms)
unset -nocomplain ::count_from_thread
thread::send -async $::tid "recount [thread::id] count_from_thread"

Then when you are doing a line count update in wcglobal,

test [info exist ::count_from_thread] and if it exists,
then use that for your current ::WholeBufferWC, if not, just use the
current value until that variable gets set, maybe then after
another few chars are entered by the user it will be ready. But
there should be no impact on the user's typing (in theory) :)

But don't queue another request until count_from_thread exists.
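Something along these lines, perhaps. It is just a sketch: adopt_thread_count and the ::recount_pending flag are names I made up, and it assumes you call it inside p.wcglobal right after ::CurrentLineWC has been set:

```tcl
# Hypothetical helper for p.wcglobal; call it after ::CurrentLineWC is set.
# Returns 1 if a finished background count was adopted, 0 otherwise.
proc adopt_thread_count {} {
    if {![info exists ::count_from_thread]} {
        return 0    ;# recount still running, or none was requested
    }
    set ::WholeBufferWC $::count_from_thread
    # Keep the "everything except the current line" figure consistent.
    set ::MostBufferWC [expr {$::WholeBufferWC - $::CurrentLineWC}]
    unset ::count_from_thread
    set ::recount_pending 0    ;# assumed flag: set it to 1 when queueing a recount
    return 1
}

# Simulated use without a real thread:
set ::CurrentLineWC 2
set ::WholeBufferWC 10
set ::recount_pending 1
puts [adopt_thread_count]      ;# nothing has arrived yet
set ::count_from_thread 50     ;# pretend the worker's reply came in
puts [adopt_thread_count]
puts "$::WholeBufferWC $::MostBufferWC"
```

Because the variable is unset after it is used, the flag is what tells you whether a request is still in flight. The adopted count can lag behind whatever was typed while the thread was working.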

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Wed Nov 29 17:43:26 2023

Luc <luc@sep.invalid> wrote:

OK, I found a good solution. It's similar to what some of you suggested,
but different enough.
...
Here they are:

proc p.wc {content} {
set wordcount 0
regsub -all {[\s]+} [string trim $content] { } content
set wordcount [llength [split $content { }]]
return $wordcount
}

You can reduce the above to this:

proc p.wc {content} {
return [llength [regexp -all -inline {\S+} $content]]
}

And save having to create a trimmed version of content, save creating a
second copy of content with runs of whitespace converted to a single
space, and save having to then scan and split content on those spaces.

Thank you for all the help once again.

You are welcome.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From clt.to.davebr@dfgh.net@21:1/5 to All on Wed Nov 29 19:07:24 2023

proc p.wc {content} {
set wordcount 0
regsub -all {[\s]+} [string trim $content] { } content
set wordcount [llength [split $content { }]]
return $wordcount
}

You can reduce the above to this:

proc p.wc {content} {
return [llength [regexp -all -inline {\S+} $content]]
}

Why not use Christian's suggestion from above?

proc p.wc {content} {
regexp -all {\S+} $content
}

or just use the command without the proc wrapper

Dave B

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to clt.to.davebr@dfgh.net on Wed Nov 29 18:39:05 2023

clt.to.davebr@dfgh.net wrote:

proc p.wc {content} {
set wordcount 0
regsub -all {[\s]+} [string trim $content] { } content
set wordcount [llength [split $content { }]]
return $wordcount
}

You can reduce the above to this:

proc p.wc {content} {
return [llength [regexp -all -inline {\S+} $content]]
}

Why not use Christian's suggestion from above?

proc p.wc {content} {
regexp -all {\S+} $content
}

Without -inline you get back a boolean from regexp, and Luc wants a word count.

With -inline you get the matches as a list, and the llength is to
produce a 'count' (length of list) from the raw matches.

or just use the command without the proc wrapper

Valid ask, but in my case I just created the minimal change to Luc's
example code, without going into other details.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Peter Dean@21:1/5 to clt.to.davebr@dfgh.net on Wed Nov 29 19:53:46 2023

clt.to.davebr@dfgh.net wrote:

proc p.wc {content} {
set wordcount 0
regsub -all {[\s]+} [string trim $content] { } content
set wordcount [llength [split $content { }]]
return $wordcount
}

You can reduce the above to this:

proc p.wc {content} {
return [llength [regexp -all -inline {\S+} $content]]
}

Why not use Christian's suggestion from above?

proc p.wc {content} {
regexp -all {\S+} $content
}

or just use the command without the proc wrapper

Dave B

+1

Christian's comment forced me to
man n regexp

pd

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From et99@21:1/5 to Rich on Wed Nov 29 12:02:01 2023

On 11/29/2023 10:39 AM, Rich wrote:

clt.to.davebr@dfgh.net wrote:

proc p.wc {content} {
set wordcount 0
regsub -all {[\s]+} [string trim $content] { } content
set wordcount [llength [split $content { }]]
return $wordcount
}

You can reduce the above to this:

proc p.wc {content} {
return [llength [regexp -all -inline {\S+} $content]]
}

Why not use Christian's suggestion from above?

proc p.wc {content} {
regexp -all {\S+} $content
}

Without -inline you get back a boolean from regexp, and Luc wants a word count.

With -inline you get the matches as a list, and the llength is to
produce a 'count' (length of list) from the raw matches.

or just use the command without the proc wrapper

Valid ask, but in my case I just created the minimal change to Luc's
example code, without going into other details.

Actually, for regexp with -all it returns the count of matches as desired.
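A quick check you can paste into tclsh (the sample string is just something I made up):

```tcl
set s "  the quick   brown\tfox\n"

puts [regexp -all {\S+} $s]                     ;# 4: number of matches
puts [llength [regexp -all -inline {\S+} $s]]   ;# 4: same count via a temporary list
puts [regexp {\S+} $s]                          ;# 1: without -all, just matched or not
puts [regexp -all {\S+} ""]                     ;# 0: no words at all
```

So the 0/1 result only applies when -all is left off.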

So, based on the above posts, here's a very compact form using a Thread when a full update is desired, say after a multi-line change, e.g. cut or paste:

set ::tid [thread::create] ;# with no script it just does a thread::wait

Then when you want a clean full update, you can queue this

tsv::set text x [$::text get 1.0 end]
unset -nocomplain ::count_from_thread
thread::send -async $::tid {regexp -all {\S+} [tsv::get text x]} count_from_thread

Or... I suppose you could even just set ::WholeBufferWC

tsv::set text x [$::text get 1.0 end]
thread::send -async $::tid {regexp -all {\S+} [tsv::get text x]} ::WholeBufferWC

The only reason I prefer the unset method on a separate variable is to avoid a possible race condition. As long as you don't re-enter the event loop inside your callback for a text widget change, I don't think that can happen.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to All on Wed Nov 29 17:13:58 2023

On Wed, 29 Nov 2023 12:02:01 -0800, et99 wrote:

So, based on the above posts, here's a very compact form using a Thread
when a full update is desired, say after a multi-line change, e.g. cut or paste:

**************************

Ah, yes. Multi-line changes e.g. cut or paste. My code doesn't support
any of that. I will have to think about it and find a fix.

Of course, I will look into the code you provided later.

Thank you.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to et99@rocketship1.me on Wed Nov 29 20:16:07 2023

et99 <et99@rocketship1.me> wrote:

On 11/29/2023 10:39 AM, Rich wrote:

Without -inline [back] a boolean [is returned] from regex, and Luc
wants a word count.

With -inline you get the matches as a list, and the llength is to
produce a 'count' (length of list) from the raw matches.

or just use the command without the proc wrapper

Valid ask, but in my case I just created the minimal change to Luc's
example code, without going into other details.

Actually, for regexp with -all it returns the count of matches as
desired.

Ah, learn something new every day. I had not noticed that -all returns
a count of matches. That saves creating a list only to take its
length and free it all over again.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to Rich on Wed Nov 29 18:51:22 2023

On Wed, 29 Nov 2023 20:16:07 -0000 (UTC), Rich wrote:

Ah, learn something new every day. I had not noticed that -all returns
a count of matches. That saves creating a list only to take its
length and free it all over again.

**************************

"That saves" is debatable. In all my attempts, regexp always seems to be
too slow for this particular task, so I'm only using it once, when the
file is opened.

I will check, but I'm skeptical.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From et99@21:1/5 to Peter Dean on Wed Nov 29 14:43:49 2023

On 11/29/2023 2:28 PM, Peter Dean wrote:

clt.to.davebr@dfgh.net wrote:

Off Topic

Why do we now have three separate threads going on this one issue?

I too see that, but only when using Thunderbird with eternal-september. I don't know why that happens, but they all get reduced to just one in google groups.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Peter Dean@21:1/5 to All on Wed Nov 29 22:28:52 2023

clt.to.davebr@dfgh.net wrote:

Off Topic

Why do we now have three separate threads going on this one issue?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luc@21:1/5 to Peter Dean on Wed Nov 29 20:20:11 2023

On Wed, 29 Nov 2023 22:28:52 -0000 (UTC), Peter Dean wrote:

clt.to.davebr@dfgh.net wrote:

Off Topic

Why do we now have three separate threads going on this one issue?

**************************

I honestly have no idea what you are talking about. I only see one.

Eternal September and Claws Mail here.

--
Luc

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From et99@21:1/5 to Luc on Wed Nov 29 16:15:43 2023

On 11/29/2023 1:51 PM, Luc wrote:

On Wed, 29 Nov 2023 20:16:07 -0000 (UTC), Rich wrote:

Ah, learn something new every day. I had not noticed that -all returns
a count of matches. That saves creating a list only to take its
length and free it all over again.

**************************

"That saves" is debatable. In all my attempts, regexp always seems to be
too slow for this particular task, so I'm only using it once, when the
file is opened.

I will check, but I'm skeptical.

I think you're right. regexp/regsub are pretty costly on a very large 20mb text.

I tested a 20mb string

% set text [string repeat "abcdef abcdef abcdef abcdef abcdef abcdef abcdef abcdef abcdef abcdef\n" 290000] ; puts [string length $text]
20300000

% text .t
% .t insert end $text

Now...

% timems {set foo [.t get 1.0 end]}
35.405 milliseconds per iteration
% timems {regexp -all {\S+} $foo}
1,565.588 milliseconds per iteration
% timems {llength [split $foo]}
359.005 milliseconds per iteration
% timems {regsub -all {[\s]+} [string trim $foo] { } content}
1,489.499 milliseconds per iteration

So, yes, I would only do these when you have a multi-line change or at startup.

But it's pretty quick to get the text into a thread shared variable where the second thread can access it. And then running the regex/regsub etc. there shouldn't harm the responsiveness of the text widget.

% package require Thread
2.8.4
% timems {tsv::set text x [.t get 1.0 end]}
42.386 milliseconds per iteration

fyi:

proc timems {args} {
set result [uplevel 1 time $args]
set number [format %.3f [expr {( [lindex $result 0] / 1000. )}]]
set number [regsub -all {\d(?=(\d{3})+($|\.))} $number {\0,}]
return "[format %12s $number ] milliseconds [lrange $result 2 end]"
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Peter Dean@21:1/5 to et99@rocketship1.me on Thu Nov 30 00:39:16 2023

et99 <et99@rocketship1.me> wrote:

On 11/29/2023 2:28 PM, Peter Dean wrote:

clt.to.davebr@dfgh.net wrote:

Off Topic

Why do we now have three separate threads going on this one issue?

I too see that, but only when using Thunderbird with eternal-september. I don't know why that happens, but they all get reduced to just one in google groups.

I see it on thunderbird and tin. The problem is apparent in the headers.

Here's the header from the first message in this thread

Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: clt.to.davebr@dfgh.net
Newsgroups: comp.lang.tcl
Subject: Safe handling of lists
Date: Wed, 29 Nov 23 19:07:24 GMT
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <4431701284844@dlp>
Injection-Info: dont-email.me; posting-host="f44ff9c2793715596881771caca57821";
    logging-data="999582"; mail-complaints-to="abuse@eternal-september.org";
    posting-account="U2FsdGVkX1//3dnI1pxBXMHbjk1eHXw1"
Cancel-Lock: sha1:nB6mjHtg4IyMflAMnXSxF0SsjhI=
Xref: news.eternal-september.org comp.lang.tcl:65599

and here's my followup

Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: p.dean@gmx.com (Peter Dean)
Newsgroups: comp.lang.tcl
Subject: Re: Safe handling of lists
Date: Wed, 29 Nov 2023 22:28:52 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 5
Sender: <peter@arch1701.localdomain>
Message-ID: <uk8dv3$10t7a$1@dont-email.me>
References: <4431701284844@dlp>
Injection-Date: Wed, 29 Nov 2023 22:28:52 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8dec35c0e4543e0d8537605e1e3506b3";
    logging-data="1078506"; mail-complaints-to="abuse@eternal-september.org";
    posting-account="U2FsdGVkX1+Ds1VaEGZd2zSG2P88S2g6"
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.63-1-lts (x86_64))
Cancel-Lock: sha1:zwLoa6WlRqCtBM9RTfoXZjJptE8=
Xref: news.eternal-september.org comp.lang.tcl:65606

clt.to.davebr@dfgh.net wrote:

Off Topic

Why do we now have three separate threads going on this one issue?

You can see that the References: field in mine points to the Message-ID: field in the OP. But there is no References: field in the OP's header. Therefore threading can only happen on the subject line, not on the References: field as it is meant to.

And let's not mention google groups please.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Luc on Thu Nov 30 02:17:13 2023

Luc <luc@sep.invalid> wrote:

On Wed, 29 Nov 2023 20:16:07 -0000 (UTC), Rich wrote:

Ah, learn something new every day. I had not noticed that -all returns
a count of matches. That saves creating a list only to take its
length and free it all over again.

**************************

"That saves" is debatable. In all my attempts, regexp always seems to be
too slow for this particular task, so I'm only using it once, when the
file is opened.

I will check, but I'm skeptical.

You are not following what I said.

For the way I outlined:

[llength [regexp -all -inline ...]]

The regex engine has to do:

1) create a list to store the results (one memory allocation)

2) create a list element (one per match) and append it to the list
(one memory allocation per match)

3) depending upon the length of the list, possibly realloc() the
master list structure one or more times as the length grows

So for a text file with 1 million words, we have 1,000,001 memory
allocations and 1,000,000 append to list operations (although these
are relatively fast, 1M of them do add up).

All to return that list, do nothing more with it beyond take its
length, and then throw it all away (another 1,000,001 free() operations).

Instead, leaving off -inline, the regex engine has to do:

1) create an integer to store the match count

2) perform "incr count" each time it finds a match (or, more likely,
a C level count++ operation on a C 'int' variable)

This second method has "saved" all the memory allocations to build up
the large list, and "saved" all the free() operations to destroy the
large list.

So it has "saved" doing all that work. But I neither said nor implied
anything about whether using regexp to actually perform the counting
operation was faster or not.
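If you want to see the difference on your own machine, a sketch like this compares the two forms. The input string and the repeat count of 5 are arbitrary choices of mine, and the timings will vary:

```tcl
# Synthetic input: 100,000 six-letter "words" (an arbitrary test size).
set text [string repeat "abcdef " 100000]

# Both forms must agree on the count; only the work done differs.
set n1 [regexp -all {\S+} $text]
set n2 [llength [regexp -all -inline {\S+} $text]]

puts "count only:        $n1 words, [time {regexp -all {\S+} $text} 5]"
puts "-inline + llength: $n2 words, [time {llength [regexp -all -inline {\S+} $text]} 5]"
```

Both lines run the same regex scan. The second one additionally builds and frees the list of matches.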

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Peter Dean on Thu Nov 30 02:22:27 2023

Peter Dean <p.dean@gmx.com> wrote:

clt.to.davebr@dfgh.net wrote:

Off Topic

Why do we now have three separate threads going on this one issue?

Someone's news reader broke the References: header.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Peter Dean@21:1/5 to Rich on Thu Nov 30 02:49:34 2023

Rich <rich@example.invalid> wrote:

Peter Dean <p.dean@gmx.com> wrote:

clt.to.davebr@dfgh.net wrote:

Off Topic

Why do we now have three separate threads going on this one issue?

Someone's news reader broke the References: header.

And we've no clue about which newsreader or OS because it broke that header as well.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
