• Re: INN performance curve - why so much time dealing with the history file

    From Russ Allbery@21:1/5 to Jesse Rehmer on Tue Jul 25 19:44:04 2023
    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    Now that I've got faster hardware and have had a chance to feed over 184 million articles to a new server, I notice a steep decline in
    performance from the beginning of the process to the end. When looking
    at the news.daily output it appears the majority of the time is spent
    dealing with the history file. I would expect article/overview writing
    or perl filtering to take more time, but perhaps there is more to
    dealing with the history file than I understand.

    It sounds like you didn't run news.daily while you were feeding in the articles. My guess is that the history file index size was too small, so
    you got tons of page overflows, which slows everything down considerably.
    The history file is dynamically resized as part of the news.daily process, although that process is not really designed for the case of feeding in
    tons of new articles.

    Pre-sizing the history file to be much larger at the start might have
    helped if I'm right about the possible cause.

    There is almost certainly a better algorithm to use for the history
    database than what INN currently does, which is more than 30 years old.
    (Just throwing everything into a modern SQL database might be a
    substantial improvement, although the history file has very specific characteristics, so at least in theory an algorithm chosen precisely for
    its type of data would be fastest.)

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Jesse Rehmer@21:1/5 to Russ Allbery on Tue Jul 25 22:10:52 2023
    On 7/25/23 9:44 PM, Russ Allbery wrote:
    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    Now that I've got faster hardware and have had a chance to feed over 184
    million articles to a new server, I notice a steep decline in
    performance from the beginning of the process to the end. When looking
    at the news.daily output it appears the majority of the time is spent
    dealing with the history file. I would expect article/overview writing
    or perl filtering to take more time, but perhaps there is more to
    dealing with the history file than I understand.

    It sounds like you didn't run news.daily while you were feeding in the articles. My guess is that the history file index size was too small, so
    you got tons of page overflows, which slows everything down considerably.
    The history file is dynamically resized as part of the news.daily process, although that process is not really designed for the case of feeding in
    tons of new articles.

    Pre-sizing the history file to be much larger at the start might have
    helped if I'm right about the possible cause.

    There is almost certainly a better algorithm to use for the history
    database than what INN currently does, which is more than 30 years old.
    (Just throwing everything into a modern SQL database might be a
    substantial improvement, although the history file has very specific characteristics, so at least in theory an algorithm chosen precisely for
    its type of data would be fastest.)


    Not sure why my client formatted the first message so horribly; in the
    past it has never tried to do HTML-ish things. I trimmed some of the
    output so it wouldn't wrap below. For the first 6 hours or so the
    performance is stellar, but then trails off pretty drastically.

    Correct, I didn't run news.daily until after the injection run completed.

    The history file is ~18GB. How would one go about pre-sizing it? That is
    one topic I don't think I've stumbled across.

    Date Articles Art/sec
    Jul 16 10:03:43 - 10:59:59 6608118 1956.80
    Jul 16 11:00:00 - 11:59:59 13467714 3741.03
    Jul 16 12:00:00 - 12:59:59 8000940 2222.48
    Jul 16 13:00:00 - 13:59:59 5748819 1596.89
    Jul 16 14:00:00 - 14:59:59 5122241 1422.84
    Jul 16 15:00:00 - 15:59:59 4287868 1191.07
    Jul 16 16:00:00 - 16:59:59 4034400 1120.67
    Jul 16 17:00:00 - 17:59:59 3548680 985.74
    Jul 16 18:00:00 - 18:59:59 3238737 899.65
    Jul 16 19:00:00 - 19:59:59 3179352 883.15
    Jul 16 20:00:00 - 20:59:59 2861511 794.86
    Jul 16 21:00:00 - 21:59:59 2557273 710.35
    Jul 16 22:00:00 - 22:59:59 2476205 687.83
    Jul 16 23:00:00 - 23:59:59 2470653 686.29
    Jul 17 00:00:00 - 00:59:59 2377684 660.47
    Jul 17 01:00:00 - 01:59:59 2214251 615.07
    Jul 17 02:00:00 - 02:59:59 2160533 600.15
    Jul 17 03:00:00 - 03:59:59 2138072 593.91
    Jul 17 04:00:00 - 04:59:59 2064017 573.34
    Jul 17 05:00:00 - 05:59:59 1975211 548.67
    Jul 17 06:00:00 - 06:59:59 1902269 528.41
    Jul 17 07:00:00 - 07:59:59 1859862 516.63
    Jul 17 08:00:00 - 08:59:59 1826617 507.39
    Jul 17 09:00:00 - 09:59:59 1803042 500.85
    Jul 17 10:00:00 - 10:59:59 1779735 494.37
    Jul 17 11:00:00 - 11:59:59 1724235 478.95
    Jul 17 12:00:00 - 12:59:59 1694753 470.76
    Jul 17 13:00:00 - 13:59:59 1687997 468.89
    Jul 17 14:00:00 - 14:59:59 1689864 469.41
    Jul 17 15:00:00 - 15:59:59 1667083 463.08
    Jul 17 16:00:00 - 16:59:59 1614929 448.59
    Jul 17 17:00:00 - 17:59:59 1551492 430.97
    Jul 17 18:00:00 - 18:59:59 1510100 419.47
    Jul 17 19:00:00 - 19:59:59 1516064 421.13
    Jul 17 20:00:00 - 20:59:59 1504238 417.84
    Jul 17 21:00:00 - 21:59:59 1511102 419.75
    Jul 17 22:00:00 - 22:59:59 1498772 416.33
    Jul 17 23:00:00 - 23:59:59 1459980 405.55
    Jul 18 00:00:00 - 00:59:59 1414708 392.97
    Jul 18 01:00:00 - 01:59:59 1404954 390.26
    Jul 18 02:00:00 - 02:59:59 1376430 382.34
    Jul 18 03:00:00 - 03:59:59 1378259 382.85
    Jul 18 04:00:00 - 04:59:59 1390281 386.19
    Jul 18 05:00:00 - 05:59:59 1386335 385.09
    Jul 18 06:00:00 - 06:59:59 1355294 376.47
    Jul 18 07:00:00 - 07:59:59 1327220 368.67
    Jul 18 08:00:00 - 08:59:59 1296572 360.16
    Jul 18 09:00:00 - 09:59:59 1276394 354.55
    Jul 18 10:00:00 - 10:59:59 1280991 355.83
    Jul 18 11:00:00 - 11:59:59 1276734 354.65
    Jul 18 12:00:00 - 12:59:59 1330133 369.48
    Jul 18 13:00:00 - 13:59:59 1321197 367.00
    Jul 18 14:00:00 - 14:59:59 1270985 353.05
    Jul 18 15:00:00 - 15:59:59 1251052 347.51
    Jul 18 16:00:00 - 16:59:59 1238624 344.06
    Jul 18 17:00:00 - 17:59:59 1227250 340.90
    Jul 18 18:00:00 - 18:59:59 1203798 334.39
    Jul 18 19:00:00 - 19:59:59 1225960 340.54
    Jul 18 20:00:00 - 20:59:59 1221275 339.24
    Jul 18 21:00:00 - 21:59:59 1199617 333.23
    Jul 18 22:00:00 - 22:59:59 1178009 327.22
    Jul 18 23:00:00 - 23:59:59 1154628 320.73
    Jul 19 00:00:00 - 00:59:59 1145227 318.12
    Jul 19 01:00:00 - 01:59:59 1111125 308.65
    Jul 19 02:00:00 - 02:59:59 1095739 304.37
    Jul 19 03:00:00 - 03:59:59 1093542 303.76
    Jul 19 04:00:00 - 04:59:59 1089209 302.56
    Jul 19 05:00:00 - 05:59:59 1090842 303.01
    Jul 19 06:00:00 - 06:59:59 1074421 298.45
    Jul 19 07:00:00 - 07:59:59 1054212 292.84
    Jul 19 08:00:00 - 08:59:59 1044447 290.12
    Jul 19 09:00:00 - 09:59:59 931031 258.62
    Jul 19 10:00:00 - 10:59:59 1079244 299.79
    Jul 19 11:00:00 - 11:59:59 1069259 297.02
    Jul 19 12:00:00 - 12:59:59 1067895 296.64
    Jul 19 13:00:00 - 13:59:59 1056197 293.39
    Jul 19 14:00:00 - 14:59:59 1064508 295.70
    Jul 19 15:00:00 - 15:59:59 1053532 292.65
    Jul 19 16:00:00 - 16:59:59 1035573 287.66
    Jul 19 17:00:00 - 17:59:59 1017242 282.57
    Jul 19 18:00:00 - 18:59:59 1019435 283.18
    Jul 19 19:00:00 - 19:59:59 1013209 281.45
    Jul 19 20:00:00 - 20:59:59 1003259 278.68
    Jul 19 21:00:00 - 21:59:59 998980 277.49
    Jul 19 22:00:00 - 22:59:59 999129 277.54
    Jul 19 23:00:00 - 23:59:59 1005058 279.18
    Jul 20 00:00:00 - 00:59:59 1004154 278.93
    Jul 20 01:00:00 - 01:59:59 983326 273.15
    Jul 20 02:00:00 - 02:59:59 984133 273.37
    Jul 20 03:00:00 - 03:59:59 956548 265.71
    Jul 20 04:00:00 - 04:59:59 955865 265.52
    Jul 20 05:00:00 - 05:59:59 949230 263.68
    Jul 20 06:00:00 - 06:59:59 938727 260.76
    Jul 20 07:00:00 - 07:59:59 947608 263.22
    Jul 20 08:00:00 - 08:59:59 955909 265.53
    Jul 20 09:00:00 - 09:59:59 951293 264.25
    Jul 20 10:00:00 - 10:59:59 939124 260.87
    Jul 20 11:00:00 - 11:59:59 945338 262.59
    Jul 20 12:00:00 - 12:59:59 913587 253.77
    Jul 20 13:00:00 - 13:59:59 926263 257.30
    Jul 20 14:00:00 - 14:59:59 930761 258.54
    Jul 20 15:00:00 - 15:59:59 908011 252.23
    Jul 20 16:00:00 - 16:59:59 863137 239.76
    Jul 20 17:00:00 - 17:59:59 851113 236.42
    Jul 20 18:00:00 - 18:59:59 843252 234.24
    Jul 20 19:00:00 - 19:25:57 218494 140.33

  • From Russ Allbery@21:1/5 to Jesse Rehmer on Tue Jul 25 20:24:55 2023
    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    The history file is ~18GB. How would one go about pre-sizing it? That is
    one topic I don't think I've stumbled across.

    (Note that this is unnecessary now; now that you've run news.daily after feeding in all the articles, it will have been resized to match its
    current size, so this problem hopefully won't matter any more. If you're
    still seeing ongoing slowness, then I misunderstood.)

    Running makedbz manually will let you provide the -s flag, which specifies
    the number of entries to size the history file for. news.daily will use
    the current size as the expected size. When you're going to feed in a ton
    of articles, you want to pass in something much, much larger, roughly the number of entries you're expecting to have at the end. The relevant instructions in INSTALL are:

    | Next, you need to create an empty F<history> database. To do this, type:
    |
    | cd <pathdb in inn.conf>
    | touch history
    | makedbz -i -o

    which possibly should mention the -s flag.
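
    For instance, a pre-sized variant of those steps might read as follows
    (a sketch only; the -s argument is a placeholder for however many
    articles you expect to feed in):

        cd <pathdb in inn.conf>
        touch history
        makedbz -i -o -s <expected number of history lines>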

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Russ Allbery@21:1/5 to Jesse Rehmer on Tue Jul 25 20:41:16 2023
    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    I'll restore from snapshot and manually create the history file. If I'm
    reading the manpage correctly, the value used for the '-s' flag is the
    number of articles expected (each line being an article)?

    Yes, that's correct. The actual history index size will then be larger
    than that to try to make it so that the number of entries is about 2/3rds
    the size of the index.

    (If I remember correctly, the dbz algorithm uses linear probing for hash collisions, so you really want a sparse hash. If the linear probe goes
    for too long, it goes to a separate overflow table, and once that happens,
    the performance really tanks *and* the history file size bloats. It's
    just a bad time all around, sort of like a system going to swap. I'm
    betting your history file went to multiple overflow tables because it
    started massively undersized.)
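
    As a rough sketch of what that 2/3rds target implies for, say, 200 million
    entries (the byte figure assumes the 4-byte fseek offsets described in the
    dbz(3) page quoted later in this thread; actual sizes depend on how INN
    was built):

        entries=200000000
        slots=$((entries * 3 / 2))                # index kept about 2/3 full
        echo "index slots:  $slots"               # ~300,000,000
        echo ".index bytes: $((slots * 4))"       # ~1.2 GB with 4-byte offsets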

    I'd probably want something like this assuming that's correct?

    makedbz -i -o -s 200000000

    Looks good to me, assuming that's the number of articles you're dealing
    with.

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Jesse Rehmer@21:1/5 to Russ Allbery on Tue Jul 25 22:35:18 2023
    On 7/25/23 10:24 PM, Russ Allbery wrote:
    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    The history file is ~18GB. How would one go about pre-sizing it? That is
    one topic I don't think I've stumbled across.

    (Note that this is unnecessary now; now that you've run news.daily after feeding in all the articles, it will have been resized to match its
    current size, so this problem hopefully won't matter any more. If you're still seeing ongoing slowness, then I misunderstood.)

    Running makedbz manually will let you provide the -s flag, which specifies the number of entries to size the history file for. news.daily will use
    the current size as the expected size. When you're going to feed in a ton
    of articles, you want to pass in something much, much larger, roughly the number of entries you're expecting to have at the end. The relevant instructions in INSTALL are:

    | Next, you need to create an empty F<history> database. To do this, type:
    |
    | cd <pathdb in inn.conf>
    | touch history
    | makedbz -i -o

    which possibly should mention the -s flag.


    Got it! This run was a test to see just how long it would take, so the
    new box is kind of a scratch spot for further sorting later.

    I'll restore from snapshot and manually create the history file. If I'm
    reading the manpage correctly, the value used for the '-s' flag is the
    number of articles expected (each line being an article)?

    I'd probably want something like this assuming that's correct?

    makedbz -i -o -s 200000000

  • From go-while@21:1/5 to Russ Allbery on Wed Jul 26 12:50:14 2023
    On 26.07.23 05:41, Russ Allbery wrote:
    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    I'd probably want something like this assuming that's correct?

    makedbz -i -o -s 200000000

    Looks good to me, assuming that's the number of articles you're dealing
    with.


    thanks! hits me hard.
    i noticed the same slowdown, to the point of being almost unusable,
    because i didn't want any expiry and disabled the cronjobs... *kiss*

  • From go-while@21:1/5 to Russ Allbery on Wed Jul 26 12:33:49 2023
    On 26.07.23 05:41, Russ Allbery wrote:
    (If I remember correctly, the dbz algorithm uses linear probing for hash collisions, so you really want a sparse hash. If the linear probe goes
    for too long, it goes to a separate overflow table, and once that happens, the performance really tanks *and* the history file size bloats. It's
    just a bad time all around, sort of like a system going to swap. I'm
    betting your history file went to multiple overflow tables because it
    started massively undersized.)

    https://manpages.debian.org/testing/inn2-dev/dbz.3.en.html

    DESCRIPTION
    These functions provide an indexing system for rapid random access to a
    text file (the base file).

    Dbz stores offsets into the base text file for rapid retrieval. All
    retrievals are keyed on a hash value that is generated by the
    HashMessageID() function.

    Dbzinit opens a database, an index into the base file base, consisting
    of files base.dir , base.index , and base.hash which must already exist.
    (If the database is new, they should be zero-length files.) Subsequent
    accesses go to that database until dbzclose is called to close the database.

    Dbzfetch searches the database for the specified key, returning the corresponding value if any, if <--enable-tagged-hash at configure> is specified. If <--enable-tagged-hash at configure> is not specified, it
    returns true and content of ivalue is set. Dbzstore stores the key -
    value pair in the database, if <--enable-tagged-hash at configure> is specified. If <--enable-tagged-hash at configure> is not specified, it
    stores the content of ivalue. Dbzstore will fail unless the database
    files are writable. Dbzexists will verify whether or not the given hash
    exists or not. Dbz is optimized for this operation and it may be
    significantly faster than dbzfetch().

    Dbzfresh is a variant of dbzinit for creating a new database with more
    control over details.

    Dbzfresh's size parameter specifies the size of the first hash table
    within the database, in key-value pairs. Performance will be best if the
    number of key-value pairs stored in the database does not exceed about
    2/3 of size. (The dbzsize function, given the expected number of
    key-value pairs, will suggest a database size that meets these
    criteria.) Assuming that an fseek offset is 4 bytes, the .index file
    will be 4 * size bytes. The .hash file will be DBZ_INTERNAL_HASH_SIZE *
    size bytes (the .dir file is tiny and roughly constant in size) until
    the number of key-value pairs exceeds about 80% of size. (Nothing awful
    will happen if the database grows beyond 100% of size, but accesses will
    slow down quite a bit and the .index and .hash files will grow somewhat.)

    Dbz stores up to DBZ_INTERNAL_HASH_SIZE bytes of the message-id's hash
    in the .hash file to confirm a hit. This eliminates the need to read the
    base file to handle collisions. This replaces the tagmask feature in
    previous dbz releases.

    A size of ``0'' given to dbzfresh is synonymous with the local default;
    the normal default is suitable for tables of 5,000,000 key-value pairs.
    Calling dbzinit(name) with the empty name is equivalent to calling dbzfresh(name, 0).

    When databases are regenerated periodically, as in news, it is simplest
    to pick the parameters for a new database based on the old one. This
    also permits some memory of past sizes of the old database, so that a
    new database size can be chosen to cover expected fluctuations. Dbzagain
    is a variant of dbzinit for creating a new database as a new generation
    of an old database. The database files for oldbase must exist. Dbzagain
    is equivalent to calling dbzfresh with a size equal to the result of
    applying dbzsize to the largest number of entries in the oldbase
    database and its previous 10 generations.

    When many accesses are being done by the same program, dbz is massively
    faster if its first hash table is in memory. If the ``pag_incore'' flag
    is set to INCORE_MEM, an attempt is made to read the table in when the
    database is opened, and dbzclose writes it out to disk again (if it was
    read successfully and has been modified). Dbzsetoptions can be used to
    set the pag_incore and exists_incore flag to new value which should be ``INCORE_NO'', ``INCORE_MEM'', or ``INCORE_MMAP'' for the .hash and
    .index files separately; this does not affect the status of a database
    that has already been opened. The default is ``INCORE_NO'' for the
    .index file and ``INCORE_MMAP'' for the .hash file. The attempt to read
    the table in may fail due to memory shortage; in this case dbz fails
    with an error. Stores to an in-memory database are not (in general)
    written out to the file until dbzclose or dbzsync, so if robustness in
    the presence of crashes or concurrent accesses is crucial, in-memory
    databases should probably be avoided or the writethrough option should
    be set to ``true'';

    If the nonblock option is ``true'', then writes to the .hash and .index
    files will be done using non-blocking I/O. This can be significantly
    faster if your platform supports non-blocking I/O with files.

    Dbzsync causes all buffers etc. to be flushed out to the files. It is
    typically used as a precaution against crashes or concurrent accesses
    when a dbz-using process will be running for a long time. It is a
    somewhat expensive operation, especially for an in-memory database.

    Concurrent reading of databases is fairly safe, but there is no
    (inter)locking, so concurrent updating is not.

    An open database occupies three stdio streams and two file descriptors;
    Memory consumption is negligible (except for stdio buffers) except for in-memory databases.
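
    In practice those index files sit next to the text history file under
    pathdb (with the default, non-tagged-hash build), so a quick sketch for
    eyeballing how full the database is:

        cd <pathdb in inn.conf>
        ls -l history history.dir history.index history.hash
        wc -l < history    # one line per stored key-value pair (article)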

  • From Jesse Rehmer@21:1/5 to Russ Allbery on Wed Jul 26 13:02:17 2023
    On Jul 25, 2023 at 10:41:16 PM CDT, "Russ Allbery" <eagle@eyrie.org> wrote:

    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    I'll restore from snapshot and manually create the history file. If I'm
    reading the manpage correctly, the value used for the '-s' flag is the
    number of articles expected (each line being an article)?

    Yes, that's correct. The actual history index size will then be larger
    than that to try to make it so that the number of entries is about 2/3rds
    the size of the index.

    (If I remember correctly, the dbz algorithm uses linear probing for hash collisions, so you really want a sparse hash. If the linear probe goes
    for too long, it goes to a separate overflow table, and once that happens, the performance really tanks *and* the history file size bloats. It's
    just a bad time all around, sort of like a system going to swap. I'm
    betting your history file went to multiple overflow tables because it
    started massively undersized.)

    I'd probably want something like this assuming that's correct?

    makedbz -i -o -s 200000000

    Looks good to me, assuming that's the number of articles you're dealing
    with.

    9 hours and ~69 million articles in and it looks to be keeping a steady pace.

    Looking at "ME time" lines, the perl filter is taking the most time now, which I'd expect (on to remove misplaced binaries brought in from pullnews).

    innd[11136]: ME time 600000 hishave 2543(1522320) hiswrite 130947(1522320) hissync 6637(2) idle 126732(1830166) artclean 2611(1522320) artwrite 19037(1522211) artcncl 69(547) hisgrep/artcncl 20(547) overv 42242(1522211) perl 229908(1522320) python 95(1522320) nntpread 3881(1830166) artparse 11883(3089741) artlog 4580(1523299) datamove 491(1886519)
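
    For context, each figure on that line is (as I read the innd timer
    summary) the cumulative milliseconds spent in that phase during the
    600000 ms reporting interval, with the event count in parentheses, so
    perl's 229908 ms is roughly 38% of the window. A quick sketch for turning
    the latest logged line into per-phase percentages (the news.notice path
    is illustrative; adjust for your syslog setup):

        grep 'ME time' /var/log/news/news.notice | tail -1 | awk '{
            for (i = 1; i <= NF; i++)
                if ($i == "time") { total = $(i + 1); start = i + 2; break }
            for (i = start; i < NF; i += 2) {
                ms = $(i + 1); sub(/\(.*/, "", ms)   # strip the "(count)" part
                printf "%-16s %5.1f%%\n", $i, 100 * ms / total
            }
        }'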

  • From Billy G. (go-while)@21:1/5 to Jesse Rehmer on Sun Jul 30 09:34:54 2023
    On 26.07.23, Jesse Rehmer wrote:

    9 hours and ~69 million articles in and it looks to be keeping a steady pace.

    Looking at "ME time" lines, the perl filter is taking the most time now, which
    I'd expect (on to remove misplaced binaries brought in from pullnews).

    innd[11136]: ME time 600000 hishave 2543(1522320) hiswrite 130947(1522320) hissync 6637(2) idle 126732(1830166) artclean 2611(1522320) artwrite 19037(1522211) artcncl 69(547) hisgrep/artcncl 20(547) overv 42242(1522211) perl 229908(1522320) python 95(1522320) nntpread 3881(1830166) artparse 11883(3089741) artlog 4580(1523299) datamove 491(1886519)

    what hardware is this?

    zfs on hdd/ssd/cache?

    do you use the default cleanfeed or did you change any "optimization"?

  • From Julien ÉLIE@21:1/5 to All on Sun Jul 30 17:47:00 2023
    Hi Russ,

    Again an interesting insight into the dbz database. Thanks, Jesse, for this thread!


    Running makedbz manually will let you provide the -s flag, which specifies the number of entries to size the history file for. news.daily will use
    the current size as the expected size. When you're going to feed in a ton
    of articles, you want to pass in something much, much larger, roughly the number of entries you're expecting to have at the end. The relevant instructions in INSTALL are:

    | Next, you need to create an empty F<history> database. To do this, type:
    |
    | cd <pathdb in inn.conf>
    | touch history
    | makedbz -i -o

    which possibly should mention the -s flag.

    Indeed!
    Additional wording:

    Next, you need to create an empty history database. To do this, type:

    cd <pathdb in inn.conf>
    touch history
    makedbz -i -o

    + makedbz will then create a database optimized for handling about
    + 6,000,000 articles (or 500,000 if the slower tagged hash format is
    + used). If you expect to inject more articles than that, use the -s flag
    + to specify the number of entries to size the initial history file for.
    + To pre-size it for 100,000,000 articles, type:
    +
    + makedbz -i -o -s 100000000
    +
    + This initial size does not limit the number of articles the news server
    + will accept. It will just get slower when that size is exceeded, until
    + the next run of news.daily which will appropriately resize it.

    I'll also update the makedbz documentation to make it clearer:

    -i To ignore the old database when determining the size of the new one
    to create, use the -i flag. Using the -o or -s flags implies the -i
    flag.

    + When the old database is ignored, and a size is not specified with
    + -s, makedbz will count the number of lines of the current text
    + history file, add 10% to that count (for the next articles to
    + arrive), and another 50% (or 100% if the slower tagged hash format
    + is used) to determine the size of the new database to create. The
    + aim is to optimize the performance of the database, keeping it
    + filled below 2/3 of its size (or 1/2 with the tagged hash format).
    +
    + If no text history file exists, the new one will have the default
    + creation size (see -s).


    -s *size*
    makedbz will also ignore any old database if the -s flag is used to
    specify the approximate number of entries in the new database.
    Accurately specifying the size is an optimization that will create a
    more efficient database.
    + The news server will still accept more articles, but will be slower.
    Size is measured in key-value pairs (i.e. lines). (The size should
    be the estimated eventual size of the file, typically the size of
    the old file.)
    +
    + The effective size used will be larger, to optimize the performance
    + of the database.
    For more information, see -i and the discussion of dbzfresh and
    dbzsize in libinn_dbz(3).

    + The default is 6,666,666 when creating a new history database. (If
    + the slower tagged hash format is used, the default is 500,000.)


    --
    Julien ÉLIE

    « Be not afraid of going slowly; be afraid only of standing still. »

  • From Jesse Rehmer@21:1/5 to All on Mon Jul 31 14:10:28 2023
    On Jul 30, 2023 at 2:34:54 AM CDT, ""Billy G." <go-while)" <no-reply@no.spam> wrote:

    On 26.07.23, Jesse Rehmer wrote:

    9 hours and ~69 million articles in and it looks to be keeping a steady pace.

    Looking at "ME time" lines, the perl filter is taking the most time now, which
    I'd expect (on to remove misplaced binaries brought in from pullnews).

    innd[11136]: ME time 600000 hishave 2543(1522320) hiswrite 130947(1522320) hissync 6637(2) idle 126732(1830166) artclean 2611(1522320) artwrite 19037(1522211) artcncl 69(547) hisgrep/artcncl 20(547) overv 42242(1522211) perl 229908(1522320) python 95(1522320) nntpread 3881(1830166) artparse 11883(3089741) artlog 4580(1523299) datamove 491(1886519)

    what hardware is this?

    zfs on hdd/ssd/cache?

    do you use the default cleanfeed or did you change any "optimization"?

    Dell R440 with NVMe storage, ESXi 8 with FreeBSD 13.2 using ZFS.

    The cleanfeed I was using here is stripped down and only checks for misplaced binaries.

    As far as 'tuning' goes, I set icdsynccount to 10000 in inn.conf. I
    experimented with values up to 50000 but found no gain, and when set too
    high I get several seconds (3-7) of pauses in article acceptance while it
    does the sync, so it seemed more efficient overall to keep it lower.

    I also set cycbuffupdate to 10000 in cycbuff.conf, though I'm not sure
    whether that had any impact; changing icdsynccount from 10 to 10000 was
    the big change for me.
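
    For reference, a sketch of where those two settings live (values copied
    from this post; the comments are mine, not from the sample configs):

        # inn.conf: sync history/active every 10000 articles instead of the
        # default 10
        icdsynccount:           10000

        # cycbuff.conf: write the cycbuff header back every 10000 articles
        cycbuffupdate:10000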
