Me again. I have a problem dealing with lists.
I wanted to count the words in a text widget that contains the text
of a file. I decided to treat the whole thing like a list and iterate
over it to count the list elements, possibly filtering some things
out.
It worked fine with a small file, but a large (very large) file
triggers this:
list element in quotes followed by "," instead of space
while executing
"foreach w $::FILECONTENT {
incr wordcount
}"
(procedure "p.wc" line 5)
invoked from within
"p.wc"
Also relevant,
set ::FILECONTENT [$::text get 1.0 end]
It's probably something obvious that I am missing again. Can someone
please enlighten me?
I think you want [split]
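A minimal sketch of why the original foreach failed and why [split] avoids it: foreach parses its argument as a Tcl list, and free text that contains quotes or braces is not always a valid list. The sample text below is made up for illustration.

```tcl
# Hypothetical sample text; any prose with a quoted word directly
# followed by punctuation triggers the same problem.
set text "He said \"hello\", then left."

# Treated as a list, the quotes are parsed as list syntax and the
# comma after the closing quote makes the list invalid.
set isList [string is list $text]
set failed [catch {llength $text} msg]

# split never parses list syntax; it only cuts at the given characters.
set words [split $text " \t\n"]
set count [llength $words]
```

Here $isList is 0 and the catch reports an error, while the split version yields 5 elements.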
On Sun, 26 Nov 2023 23:29:19 +0100, Paul Obermeier wrote:
Take a look at my CAWT extension. It contains a CountWords procedure,
see https://www.tcl3d.org/cawt/download/CawtReference-Cawt.html#::Cawt::CountWords
Paul
"CAWT (COM Automation With Tcl) is a utility package based on Twapi
to script Microsoft Windows® applications with Tcl."
Luc <luc@sep.invalid> wrote:
"CAWT (COM Automation With Tcl) is a utility package based on Twapi
to script Microsoft Windows® applications with Tcl."
But, if the source is available, you could look at the source to see
how it performs "word counting" and use that for inspiration.
On Sun, 26 Nov 2023 13:35:48 -0800, et99 wrote:
I think you want [split]
I am using split now. It's faster and "list safe" so it solves
the problem I presented first.
The new problem now is that cleaning up the huge string for proper
counting is not fast enough.
proc p.wc {} {
    set wordcount 0
    set content [$::text get 1.0 end]
    set cleancontent [string map "\n { } \t { }" $content]
    set wordcount [llength [split $cleancontent { }]]
    return $wordcount
}
Since it's called whenever some change is made to the text widget,
typing becomes unacceptably slow.
And I still haven't added a line to clean all the multiple
consecutive spaces, which changes the tally. I can't use regexp
because it's too slow for what I want.
Another (debatable?) problem is that the old code gave me a count
that was a lot closer to the output of 'wc -w' in a terminal.
This new one is way off, and it gives me a lower count! One would
think it would be higher because of the consecutive spaces.
Just wondering, why the string map to change newlines and tabs to spaces?
Split can take those plus spaces in the splitchars string. In fact, I
think that's the default anyway.
Also, are you saying this is calculated on every char the user types? Is
that to keep a wordcount in say a status area? How big is the text we're
talking about here?
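A small sketch of that point, assuming a hypothetical helper named countWords: [split] accepts several split characters at once, so no prior string map is needed, but runs of separators produce empty elements that have to be skipped when counting.

```tcl
# Hypothetical helper: split on space, tab and newline in one go and
# count only the non-empty elements.
proc countWords {text} {
    set n 0
    foreach w [split $text " \t\n"] {
        if {$w ne ""} { incr n }
    }
    return $n
}

# Made-up sample with a newline, a tab and runs of spaces.
set sample "this\nthat  and\tthe   other"
set naive  [llength [split $sample " \t\n"]]   ;# includes empty elements
set words  [countWords $sample]                ;# words only
```

With this sample the plain llength gives 8 because of the empty elements, while countWords gives 5. A newline between "this" and "that" separates them just like a space does.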
On Sun, 26 Nov 2023 16:37:37 -0800, et99 wrote:
Just wondering, why the string map to change newlines and tabs to spaces.
Split can take those plus spaces in the splitchars string. In fact, I
think that's the default anyway.
Also, are you saying this is calculated on every char the user types? Is
that to keep a wordcount in say a status area? How big is the text we're
talking about here?
I decided to change all the newlines to spaces because I was afraid that
this
that
might become 'thisthat' rather than 'this that' after the split.
It probably wouldn't, but I wanted to make sure.
Yes, it's a sort of text editor and the word count must be calculated
after every change.
(I had big plans for it but personal issues have forced me to put it
on the back burner for God knows how long. I'm trying to fix this code
right now because I got a copywriting job where counting words in real
time is very useful. A ton of other things will be fixed... someday.)
Anyway, there is a status bar with some information and the word count
is supposed to be updated at every touch of the keyboard. I currently
filter out arrow key movements. Need to rewrite the code and make it
smarter.
The "big text" I am using to assess performance is 18MB.
$ wc /home/tcl/bigtext.txt
218993 2758398 18421662 /home/tcl/bigtext.txt
On 11/27/2023 6:50 AM, Ralf Fassel wrote:
As someone else suggested, ::textutil::split::splitx might be a
solution (though it might be even slower due to the use of regexps,
need to test), or else check the [string length] of elements and
count them only as words if > 0.
I was wondering just how accurate it has to be if one is dealing with
a 20mb file with perhaps 2 million words.
What about just counting all the spaces, tabs, and newlines in the
text by using
set totchars [string length $txt]
set nowhites [string map {\n {} \t {} { } {}} $txt]
set wordcount [expr { $totchars - [string length $nowhites] }]
Won't this be as accurate as using split? And with all string
operations, no costs of creating lists.
In timing tests, the [string length] calls seemed to be near zero, I
guess the string objects have the length saved in them.
And there's always the possibility of doing some of the heavy lifting
in a separate thread.
* Luc <luc@sep.invalid>
| I am using split now. It's faster and "list safe" so it solves
| the problem I presented first.
| The new problem now is that cleaning up the huge string for proper
| counting is not fast enough.
| proc p.wc {} {
| set wordcount 0
| set content [$::text get 1.0 end]
| set cleancontent [string map "\n { } \t { }" $content]
| set wordcount [llength [split $cleancontent { }]]
| return $wordcount
| }
| Since it's called whenever some change is made to the text widget,
| typing becomes unacceptably slow.
You could set up a timer to do the real work after a short period
(500ms) of keyboard-idle, and return the old count else. Quick typists
see the updated count only after they stop typing. Else you would need
to keep track of what is inserted/deleted and incr/decr the count based
on that (fragile).
| And I still haven't added a line to clean all the multiple
| consecutive spaces, which changes the tally. I can't use regexp
| because it's too slow for what I want.
As someone else suggested, ::textutil::split::splitx might be a
solution (though it might be even slower due to the use of regexps, need
to test), or else check the [string length] of elements and count them
only as words if > 0.
R'
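A rough sketch of the idle-timer idea, under the assumption that the text can be fetched through a command prefix. The namespace wcidle and all names in it are hypothetical; the 500 ms default follows the suggestion above.

```tcl
# Every change cancels the pending recount and schedules a new one, so
# the expensive count runs only after typing has paused.
namespace eval wcidle {
    variable timer ""     ;# id of the pending [after] event, if any
    variable delay 500    ;# idle time in milliseconds
    variable runs  0      ;# number of full recounts actually performed
    variable count 0      ;# last computed word count
}

# Call this on every change; getText is a command prefix returning the text.
proc wcidle::changed {getText} {
    variable timer
    variable delay
    after cancel $timer
    set timer [after $delay [list ::wcidle::recount $getText]]
}

# The real work, run from the event loop once the keyboard has been idle.
proc wcidle::recount {getText} {
    variable runs
    variable count
    set count [regexp -all {\S+} [uplevel #0 $getText]]
    incr runs
}

# In a Tk application (assumed widget name .t) this might be wired up as:
#   bind .t <<Modified>> {wcidle::changed {.t get 1.0 end-1c}}
```

Until the timer fires, the status bar would keep showing the old value of wcidle::count.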
I was wondering just how accurate it has to be if one is dealing with a
20mb file with perhaps 2 million words.
And there's always the possibility of doing some of the heavy lifting
in a separate thread.
Interesting problem. The text editor I use doesn't count words
dynamically, but rather has a statistics command.
On Mon, 27 Nov 2023 17:16:46 -0800, et99 wrote:
I was wondering just how accurate it has to be if one is dealing
with a 20mb file with perhaps 2 million words.
For me, it doesn't, except that I'm using the same application for
small and large files and I want the application to be able to handle
all cases.
Right now, I am only enabling the wc proc when I am doing my
copywriting work. At all other times, it is disabled because typing
becomes unacceptably slow on the large files I use regularly.
Luc <luc@sep.invalid> wrote:
Right now, I am only enabling the wc proc when I am doing my
copywriting work. At all other times, it is disabled because typing
becomes unacceptably slow on the large files I use regularly.
Thing is, given what you've described so far of your code, you are
attempting to brute-force a real-time count, and doing so you've
created something on the order of an O(n^2) or worse algorithm.
You are "counting" lots of things over and over that you previously
counted after the last keystroke, even though that keystroke only made
(usually) a one character change to the entire file.
You instead want to think about how to not count any more than you have
to for each "count cycle".
I pulled a copy of Project Gutenberg's War and Peace (because it is a
long book) from here: https://gutenberg.org/cache/epub/2600/pg2600.txt
Then I concatenated six duplicates into a single file (to make an
approximately 20MB file). I named that file "20mb".
Then I set out to see what could be done by trying to not count
everything after any change. This is ugly demonstration code below, all
of this would ideally be wrapped inside an oo::object and made prettier,
but you get the raw demo below:
#!/usr/bin/wish
set wc 0
label .wc -textvariable wc
text .t
pack .wc
pack .t
set fd [open 20mb RDONLY]
.t insert end [read $fd]
close $fd
set num_lines [.t count -lines 1.0 end]
# prefill per-line word-count cache (index 0 is a dummy, lines start at 1)
set lcc [list 0]
# <= so that a last line without a trailing newline is counted too
for {set i 1} {$i <= $num_lines} {incr i} {
    lappend lcc [llength [regexp -all -inline {\S+} [.t get $i.0 "$i.0 lineend"]]]
}
# initial load word count
set wc [tcl::mathop::+ {*}$lcc]
# set modified flag of text widget to false
.t edit modified 0
proc modified {} {
    global lcc
    global wc
    # only the line holding the insert cursor is recounted
    lassign [split [.t index insert] .] line_num
    lset lcc $line_num [llength [regexp -all -inline {\S+} [.t get $line_num.0 "$line_num.0 lineend"]]]
    set wc [tcl::mathop::+ {*}$lcc]
    .t edit modified 0
}
bind .t <<Modified>> [list modified]
This above counts, in real time, as I type, with the 20mb 6x War and
Peace file loaded. If I get going typing I can just begin to sense a
latency on typing, but I do really have to get on a good roll for that.
The one place I *do* see a latency is for keyboard autorepeat, there is
a clear delay then.
But, while hammering out this demo I noticed that <<Modified>> is
called twice for every keystroke (meaning this above is still doing
twice the work it needs to do). I simply did not want to be bothered
with working out why <<Modified>> is called twice with every keystroke,
nor with working out how to not call it twice for every keystroke.
tcllib has splitx which splits on regexp, e.g.
::textutil::split::splitx $l {\s+}
splits on runs of spaces
On 27.11.23 at 08:27, Peter Dean wrote:
tcllib has splitx which splits on regexp, e.g.
::textutil::split::splitx $l {\s+}
splits on runs of spaces
For only the count, it is not required to split the list. regexp can do
the counting:
set string {This is a sentence with whitespaces in it.}
regexp -all {\s+} $string
returns the number of runs of whitespace. With the uppercase \S+ it
returns the number of runs of non-whitespace, i.e. the words (there could
be whitespace before and after).
Christian
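To make the difference between the two patterns concrete, here is a short sketch using the same sample sentence. The padded variant is an added assumption to show that leading and trailing whitespace does not change the word count.

```tcl
set string {This is a sentence with whitespaces in it.}

# With -all and no match variables, regexp returns the number of matches.
set gaps  [regexp -all {\s+} $string]   ;# runs of whitespace
set words [regexp -all {\S+} $string]   ;# runs of non-whitespace

# Hypothetical padded copy: extra whitespace around and inside the text.
set padded "  This is  a sentence\twith whitespaces in it.\n"
set paddedWords [regexp -all {\S+} $padded]
```

For the sample sentence this gives 7 whitespace runs and 8 words, and the padded copy still counts 8 words.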
* Rich <rich@example.invalid>
| But, while hammering out this demo I noticed that <<Modified>> is
| called twice for every keystroke (meaning this above is still doing
| twice the work it needs to do). I simply did not want to be bothered
| with working out why <<Modified>> is called twice with every keystroke,
Might be due to the fact that you change the 'modified' flag in the
callback, and looking at the C code for the text widget which handles
the ".t edit modified 0":
/*
* Only issue the <<Modified>> event if the flag actually changed.
* However, degree of modified-ness doesn't matter. [Bug 1799782]
*/
if ((!oldModified) != (!setModified)) {
GenerateModifiedEvent(textPtr);
}
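If that is indeed the cause, one possible guard is to ignore the event that arrives with the flag already cleared. This is only a sketch: the proc name onModified, its arguments and the widget name .t are assumptions, not code from the posts above.

```tcl
# Hypothetical handler: w is the text widget, recount a command prefix
# doing the real work. Returns 1 if a recount ran, 0 if it was skipped.
proc onModified {w recount} {
    # Resetting the flag below flips it from 1 to 0, which fires
    # <<Modified>> a second time; that event sees the flag cleared.
    if {![$w edit modified]} { return 0 }
    uplevel #0 $recount
    $w edit modified 0
    return 1
}

# In a Tk application this might be bound as (recountCurrentLine is a
# placeholder for the per-line update):
#   bind .t <<Modified>> [list onModified .t recountCurrentLine]
```

The handler takes the widget as an argument so that the same proc can serve several text widgets.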
On 11/27/2023 7:03 PM, Rich wrote:
Thing is, given what you've described so far of your code, you are
attempting to brute-force a real-time count, and doing so you've
created something on the order of an O(n^2) or worse algorithm.
You are "counting" lots of things over and over that you previously
counted after the last keystroke, even though that keystroke only made
(usually) a one character change to the entire file.
You instead want to think about how to not count any more than you have
to for each "count cycle".
Hmmm, a line cache. Cool.
And since lcc is a list one can insert and delete quickly when lines
are added or removed keeping it in sync with the text widget.
How to tell just what changed (e.g. a paste in the middle) I can't say.
I think for changes on a line, instead of re-running
[tcl::mathop::+ {*}$lcc], one can adjust the wc by the line's
(new value - old value), saving possibly a 20k item arglist for the mathop.
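A sketch of that idea in plain Tcl, without any widget: a per-line cache plus a running total that is adjusted by the difference instead of being re-summed. All proc names here are hypothetical, and deciding which lines actually changed (e.g. after a paste) is left open, as noted above.

```tcl
# Words on one line: number of runs of non-whitespace.
proc lineWords {line} { regexp -all {\S+} $line }

# Build the cache; index 0 is a dummy so indices match text widget lines.
proc cacheInit {lines} {
    set lcc [list 0]
    set total 0
    foreach line $lines {
        set n [lineWords $line]
        lappend lcc $n
        incr total $n
    }
    return [list $lcc $total]
}

# Line num changed: adjust the total by (new - old), no re-summing.
proc cacheUpdate {lccVar totalVar num newText} {
    upvar 1 $lccVar lcc $totalVar total
    set new [lineWords $newText]
    incr total [expr {$new - [lindex $lcc $num]}]
    lset lcc $num $new
}

# A new line was inserted so that it becomes line num.
proc cacheInsert {lccVar totalVar num newText} {
    upvar 1 $lccVar lcc $totalVar total
    set new [lineWords $newText]
    set lcc [linsert $lcc $num $new]
    incr total $new
}

# Line num was deleted.
proc cacheDelete {lccVar totalVar num} {
    upvar 1 $lccVar lcc $totalVar total
    incr total [expr {-[lindex $lcc $num]}]
    set lcc [lreplace $lcc $num $num]
}
```

A caller would keep the list and the total in two variables, e.g. `lassign [cacheInit $lines] lcc total`, and pass their names to the update procs.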
OK, I found a good solution. It's similar to what some of you suggested,
but different enough. I don't want to post all the code because the whole
mechanism spans across multiple procs and the procs can be long and
complicated, but here is a description of the mechanism and the core procs:
- Open file, immediately count all words in the buffer (slow), count all
  words in the current line (fast).
  That big count is done only once with a thorough enough cleanup,
  including removal of multiple spaces. So it's accurate.
- Create three global variables: ::WholeBufferWC, ::CurrentLineWC and
  ::MostBufferWC.
  In case you're wondering, $::MostBufferWC is everything except the
  current line.
  $::MostBufferWC = $::WholeBufferWC - $::CurrentLineWC
  or
  $::CurrentLineWC + $::MostBufferWC = $::WholeBufferWC
  You get the picture.
- Also store the number of the current line in ::CURRLINE. My code
  already did that anyway for the status bar.
- Additionally, create the ::PREVLINE variable which will be used in the
  new, improved proc.
That is where the magic is. It lets me monitor never more than one or two
lines at a time. So it's fast. Tested and approved on the large file.
I am using two procs now:
wc is called occasionally and used for counting words cleanly, taking care
of all the spaces.
wcglobal is called at every touch and does the whole line monitor
management, getting counts from wc whenever necessary. Here they are:
proc p.wc {content} {
    set wordcount 0
    regsub -all {[\s]+} [string trim $content] { } content
    set wordcount [llength [split $content { }]]
    return $wordcount
}
proc p.wcglobal {} {
    set ::currindex [$::text index insert]
    lassign [split $::currindex "."] ::CURRLINE ::CURRCOL
    set ::currentlinecontent [$::text get $::CURRLINE.0 "$::CURRLINE.0 lineend"]
    set ::CurrentLineWC [p.wc $::currentlinecontent]
    if {$::CURRLINE == $::PREVLINE} {
        set ::WholeBufferWC [expr {$::MostBufferWC + $::CurrentLineWC}]
    }
    # else do not add the current line
    set ::MostBufferWC [expr {$::WholeBufferWC - $::CurrentLineWC}]
    set ::PREVLINE $::CURRLINE
    if {$::WholeBufferWC == 0} {return ""}
    if {$::WholeBufferWC == 1} {return "1 word"}
    if {$::WholeBufferWC > 1} {return "$::WholeBufferWC words"}
}
Tested with arrow keys navigation, random mouse clicks and pressing Return
at the end or in the middle of lines.
It works!
I'm eating my own dog food, so if there are bugs, I'll see them soon.
Thank you for all the help once again.
proc p.wc {content} {
set wordcount 0
regsub -all {[\s]+} [string trim $content] { } content
set wordcount [llength [split $content { }]]
return $wordcount
}
You can reduce the above to this:
proc p.wc {content} {
    return [llength [regexp -all -inline {\S+} $content]]
}
Why not use Christian's suggestion from above?
proc p.wc {content} {
    regexp -all {\S+} $content
}
or just use the command without the proc wrapper
Dave B
clt.to.davebr@dfgh.net wrote:
Why not use Christian's suggestion from above?
proc p.wc {content} {
    regexp -all {\S+} $content
}
Without -inline you get back a boolean from regexp, and Luc wants a word count.
With -inline you get the matches as a list, and the llength is to
produce a 'count' (length of list) from the raw matches.
or just use the command without the proc wrapper
Valid ask, but in my case I just created the minimal change to Luc's
example code, without going into other details.
So, based on the above posts, here's a very compact form using a Thread
when a full update is desired, say after a multi-line change, e.g. cut or
paste:
On 11/29/2023 10:39 AM, Rich wrote:
Without -inline [back] a boolean [is returned] from regex, and Luc
wants a word count.
With -inline you get the matches as a list, and the llength is to
produce a 'count' (length of list) from the raw matches.
or just use the command without the proc wrapper
Valid ask, but in my case I just created the minimal change to Luc's
example code, without going into other details.
Actually, for regexp with -all it returns the count of matches as
desired.
Ah, learn something new every day. I had not noticed that -all returns
a count of matches. That saves creating a list only to take its
length and free it all over again.
clt.to.davebr@dfgh.net wrote:
Off Topic
Why do we now have three separate threads going on this one issue?
On Wed, 29 Nov 2023 20:16:07 -0000 (UTC), Rich wrote:
Ah, learn something new every day. I had not noticed that -all returns
a count of matches. That saves creating a list only to take its
length and free it all over again.
"That saves" is debatable. In all my attempts, regexp always seems to be
too slow for this particular task, so I'm only using it once, when the
file is opened.
I will check, but I'm skeptical.
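One way to check is Tcl's built-in [time] command. The sketch below compares the regsub-plus-split counter with the plain regexp -all counter on made-up data; the proc names wcSplit and wcRegexp and the sample text are assumptions, and the timings will depend on the machine and on the real buffer contents.

```tcl
# Counter in the style of the posted p.wc: collapse whitespace, then split.
proc wcSplit {text} {
    regsub -all {\s+} [string trim $text] { } text
    llength [split $text { }]
}

# Counter using only regexp -all, which returns the number of matches.
proc wcRegexp {text} { regexp -all {\S+} $text }

# Made-up test data; substitute the real buffer contents here.
set sample [string repeat "the quick  brown\tfox\n" 20000]

set a [wcSplit $sample]
set b [wcRegexp $sample]
puts "split:  $a words, [time {wcSplit  $sample} 5]"
puts "regexp: $b words, [time {wcRegexp $sample} 5]"
```

Both counters should report the same number of words, so only the timing lines need comparing.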
On 11/29/2023 2:28 PM, Peter Dean wrote:
clt.to.davebr@dfgh.net wrote:
Off Topic
Why do we now have three separate threads going on this one issue?
I too see that, but only when using Thunderbird with eternal-september.
I don't know why that happens, but they all get reduced to just one in
Google Groups.
Peter Dean <p.dean@gmx.com> wrote:
clt.to.davebr@dfgh.net wrote:
Off Topic
Why do we now have three separate threads going on this one issue?
Someone's news reader broke the References: header.