Now that I've got faster hardware and have had a chance to feed over 184 million articles to a new server, I notice a steep decline in
performance from the beginning of the process to the end. When looking
at the news.daily output it appears the majority of the time is spent
dealing with the history file. I would expect article/overview writing
or perl filtering to take more time, but perhaps there is more to
dealing with the history file than I understand.
Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:
> Now that I've got faster hardware and have had a chance to feed over 184
> million articles to a new server, I notice a steep decline in
> performance from the beginning of the process to the end. When looking
> at the news.daily output it appears the majority of the time is spent
> dealing with the history file. I would expect article/overview writing
> or perl filtering to take more time, but perhaps there is more to
> dealing with the history file than I understand.
It sounds like you didn't run news.daily while you were feeding in the articles. My guess is that the history file index size was too small, so
you got tons of page overflows, which slows everything down considerably.
The history file is dynamically resized as part of the news.daily process, although that process is not really designed for the case of feeding in
tons of new articles.
Pre-sizing the history file to be much larger at the start might have
helped if I'm right about the possible cause.
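
(For what it's worth, when thinking about sizing it can help to look at
what is actually on disk. If I remember the file layout right, a
non-tagged-hash build of dbz keeps its tables next to the text history,
so something like:

  ls -lh <pathdb in inn.conf>/history*

shows the text history itself (one line per article) next to
history.dir, history.index, and history.hash, which are the dbz
parameter file and the hash/offset tables that the sizing applies to.)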
There is almost certainly a better algorithm to use for the history
database than what INN currently does, which is more than 30 years old.
(Just throwing everything into a modern SQL database might be a
substantial improvement, although the history file has very specific characteristics, so at least in theory an algorithm chosen precisely for
its type of data would be fastest.)
The history file is ~18GB. How would one go about pre-sizing it? That is
one topic I don't think I've stumbled across.
I'll restore from snapshot and manually create the history file. If I'm
reading the manpage correctly, the value used for the '-s' flag is the
number of articles expected (each line being an article)?
I'd probably want something like this, assuming that's correct?
makedbz -i -o -s 200000000
Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:
> The history file is ~18GB. How would one go about pre-sizing it? That is
> one topic I don't think I've stumbled across.
(Note that this is unnecessary now; since you've run news.daily after
feeding in all the articles, the history index will have been resized to
match the current number of entries, so this problem hopefully won't
matter any more. If you're still seeing ongoing slowness, then I
misunderstood.)
Running makedbz manually will let you provide the -s flag, which specifies the number of entries to size the history file for. news.daily will use
the current size as the expected size. When you're going to feed in a ton
of articles, you want to pass in something much, much larger, roughly the number of entries you're expecting to have at the end. The relevant instructions in INSTALL are:
| Next, you need to create an empty F<history> database. To do this, type:
|
| cd <pathdb in inn.conf>
| touch history
| makedbz -i -o
which possibly should mention the -s flag.
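
Concretely, the same steps with -s added would presumably look something
like this (using the 200 million figure from earlier in the thread;
substitute the number of articles you actually expect to end up with):

  cd <pathdb in inn.conf>
  touch history
  makedbz -i -o -s 200000000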
Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:
> I'll restore from snapshot and manually create the history file. If I'm
> reading the manpage correctly, the value used for the '-s' flag is the
> number of articles expected (each line being an article)?
Yes, that's correct. The actual history index size will then be larger
than that to try to make it so that the number of entries is about 2/3rds
the size of the index.
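
In other words, taking that 2/3rds figure at face value, sizing for 200
million entries should give an index with very roughly:

  200,000,000 / (2/3) = ~300,000,000 slots

give or take whatever rounding dbz does internally.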
(If I remember correctly, the dbz algorithm uses linear probing for hash collisions, so you really want a sparse hash. If the linear probe goes
for too long, it goes to a separate overflow table, and once that happens, the performance really tanks *and* the history file size bloats. It's
just a bad time all around, sort of like a system going to swap. I'm
betting your history file went to multiple overflow tables because it
started massively undersized.)
> I'd probably want something like this, assuming that's correct?
> makedbz -i -o -s 200000000
Looks good to me, assuming that's the number of articles you're dealing
with.
9 hours and ~69 million articles in and it looks to be keeping a steady pace.
Looking at "ME time" lines, the perl filter is taking the most time now, which
I'd expect (on to remove misplaced binaries brought in from pullnews).
innd[11136]: ME time 600000 hishave 2543(1522320) hiswrite 130947(1522320) hissync 6637(2) idle 126732(1830166) artclean 2611(1522320) artwrite 19037(1522211) artcncl 69(547) hisgrep/artcncl 20(547) overv 42242(1522211) perl 229908(1522320) python 95(1522320) nntpread 3881(1830166) artparse 11883(3089741) artlog 4580(1523299) datamove 491(1886519)
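
(If I'm reading innd's "ME time" format right, the interval is 600000 ms
and each field is cumulative milliseconds with the event count in
parentheses, so the big contributors in that sample are roughly:

  perl      229908 ms  (~38% of the interval, 1,522,320 filter calls)
  hiswrite  130947 ms  (~22%)
  idle      126732 ms  (~21%)
  overv      42242 ms   (~7%)

which lines up with the Perl filter now dominating and history writes a
distant second.)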
On 26.07.23, Jesse Rehmer wrote:
> 9 hours and ~69 million articles in and it looks to be keeping a steady pace.
> Looking at "ME time" lines, the perl filter is taking the most time now, which
> I'd expect (one to remove misplaced binaries brought in from pullnews).
> innd[11136]: ME time 600000 hishave 2543(1522320) hiswrite 130947(1522320) hissync 6637(2) idle 126732(1830166) artclean 2611(1522320) artwrite 19037(1522211) artcncl 69(547) hisgrep/artcncl 20(547) overv 42242(1522211) perl 229908(1522320) python 95(1522320) nntpread 3881(1830166) artparse 11883(3089741) artlog 4580(1523299) datamove 491(1886519)
What hardware is this?
ZFS on HDD/SSD/cache?
Do you use the default cleanfeed, or have you changed any "optimizations"?