• working with cnfs

    From Nigel Reed@21:1/5 to All on Mon Apr 17 10:34:54 2023
    Hi all,

    earlier this week, my news server ran out of inodes and stopped
    accepting articles. I want to say a big thank you to Julien who helped
    me get back up and running by converting to CNFS. I never thought
    I'd get enough articles to have a problem but obviously that wasn't the
    case.

    Now that I've switched, I have a couple of issues, and both are about
    using grep to find articles. I'm hoping someone has done this sort of
    thing before so I don't have to reinvent the wheel.

    If I'm looking for something in particular, I could just go into the
    spool/articles directory and start grepping. This isn't possible with
    CNFS. So the questions are:

    1. How can I search the entire spool for a given phrase?

    2. How can I search a given hierarchy recursively for a given phrase?

    I'm sure I'm not the first person who wants to do this, so hopefully
    someone has a solution.

    Thanks,
    Nigel


    --
    End Of The Line BBS - Plano, TX
    telnet endofthelinebbs.com 23

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard@21:1/5 to All on Mon Apr 17 17:08:40 2023
    [Please do not mail me a copy of your followup]

    Nigel Reed <sysop@endofthelinebbs.com> spake the secret code <20230417103454.440dbdc0@wibble.sysadmininc.com> thusly:

    > me get back up and running by converting to CNFS.

    Hey! I learned something today :). I was previously unfamiliar with
    CNFS, but it makes perfect sense. I've been hacking on trn and
    thinking of a similar in-memory data structure for all the text that
    comes back to the news reader.

    > If I'm looking for something in particular, I could just go into the
    > spool/articles directory and start grepping. This isn't possible with
    > CNFS. So the questions are:
    >
    > 1. How can I search the entire spool for a given phrase?
    >
    > 2. How can I search a given hierarchy recursively for a given phrase?
    >
    > I'm sure I'm not the first person who wants to do this, so hopefully
    > someone has a solution.

    This could be done with some shell scripts and netcat talking to your
    news server's nntp port. I imagine there's probably something better
    by now though.
    --
    "The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
    The Terminals Wiki <http://terminals-wiki.org>
    The Computer Graphics Museum <http://computergraphicsmuseum.org>
    Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

  • From Julien ÉLIE@21:1/5 to All on Mon Apr 17 20:15:43 2023
    Hi Nigel,

    > 1. How can I search the entire spool for a given phrase?
    >
    > 2. How can I search a given hierarchy recursively for a given phrase?
    >
    > I'm sure I'm not the first person who wants to do this, so hopefully
    > someone has a solution.

    Unfortunately, I'm not aware of such a tool for storage methods other
    than tradspool. Even timehash won't help with the second point, as
    articles are not classified by hierarchy.

    If someone has ever written a script to do that, I would happily add it
    to INN.

    Otherwise, I would just suggest retrieving articles one by one from the
    history file and parsing them.
    To do that, take the last field of each line in the history file and
    give its value to "sm -q".

    Example for the first article in the history file in pathdb:

    % head -n1 history | cut -f3 | sm -q

    You'll get the article on standard output. You can then grep it for
    whatever you want.
    You now need to iterate over each article.

    Of course a more complex script should be written if your search has
    several parameters (in a hierarchy, from someone, etc.).
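
    A minimal sketch of that loop in Python, under stated assumptions: the
    history file is tab-separated with the storage token as the last (third)
    field, a token looks like "@...@", lines without a token are skipped, and
    "sm" is on the PATH. The function names and the history path passed in
    are illustrative, not part of INN.

```python
import subprocess
from typing import Iterable, Iterator


def tokens_from_history(lines: Iterable[str]) -> Iterator[str]:
    """Yield the storage token (third tab-separated field) of each history line.

    Assumption: a token looks like "@...@"; lines that have no third field
    are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2].startswith("@") and fields[2].endswith("@"):
            yield fields[2]


def fetch_article(token: str, sm: str = "sm") -> bytes:
    """Return whatever `sm -q TOKEN` writes to standard output."""
    result = subprocess.run([sm, "-q", token], capture_output=True, check=False)
    return result.stdout


def search_history(history_path: str, phrase: bytes, sm: str = "sm") -> Iterator[str]:
    """Yield the token of every article whose raw text contains `phrase`."""
    with open(history_path, encoding="utf-8", errors="replace") as history:
        for token in tokens_from_history(history):
            if phrase in fetch_article(token, sm):
                yield token
```

    This starts one sm process per article, so it is only practical for
    small samples; it shows the shape of the loop rather than a fast way to
    scan a whole spool.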

    --
    Julien ÉLIE

    « Ira furor breuis est. » (Horace)

  • From Nigel Reed@21:1/5 to iulius@nom-de-mon-site.com.invalid on Mon Apr 17 15:50:45 2023
    On Mon, 17 Apr 2023 20:15:43 +0200
    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote:

    > Hi Nigel,
    >
    >> 1. How can I search the entire spool for a given phrase?
    >>
    >> 2. How can I search a given hierarchy recursively for a given
    >> phrase?
    >>
    >> I'm sure I'm not the first person who wants to do this, so hopefully
    >> someone has a solution.
    >
    > Unfortunately, I'm not aware of such a tool for storage methods other
    > than tradspool. Even timehash won't help with the second point, as
    > articles are not classified by hierarchy.
    >
    > If someone has ever written a script to do that, I would happily add
    > it to INN.
    >
    > Otherwise, I would just suggest retrieving articles one by one from
    > the history file and parsing them.
    > To do that, take the last field of each line in the history file
    > and give its value to "sm -q".
    >
    > Example for the first article in the history file in pathdb:
    >
    > % head -n1 history | cut -f3 | sm -q
    >
    > You'll get the article on standard output. You can then grep it for
    > whatever you want.
    > You now need to iterate over each article.
    >
    > Of course a more complex script should be written if your search has
    > several parameters (in a hierarchy, from someone, etc.).

    At this point I have over 7 million articles. Do you have any idea how
    long that is going to take? :)

    Maybe I should have just got a new disk and formatted it with 3 times
    the inodes!




  • From Adam W.@21:1/5 to Nigel Reed on Mon Apr 17 21:37:49 2023
    Nigel Reed <sysop@endofthelinebbs.com> wrote:

    > 1. How can I search the entire spool for a given phrase?
    >
    > 2. How can I search a given hierarchy recursively for a given phrase?

    I don't have a direct answer (other than what Julien said), but if the
    search phrase is always the same, then you could add another newsfeeds
    entry and feed all new articles to the program with "Tp". This program
    (script) could do the grep and, for example, send you an email, or post
    to a private group, or whatever (~15 years ago I used to have a similar
    setup, reposting all replies to my posts to a private group, accessible
    only by me -- it was more convenient to reply to them this way).

    Now I'm using it to count articles on Polish newsgroups to produce results
    like these:

    http://news.chmurka.net/top15.php

    The line is:

    chmurka.postprocessor.pl\
        :!*,pl.*,alt.pl.*\
        :Tp:/usr/local/news/local/bin/post-processor-pl.sh %s

    Now that I think of it, such a grepping script could be built by
    writing a program that reads the CNFS file (the format has to be
    described somewhere, or can be deduced from the source code) and maybe
    dumps its contents to stdout (or searches for a phrase and dumps the
    whole article), but I don't know of anything like that.

    On the other hand, CNFS is a binary format, but posts are stored in it
    as text, so maybe something like this will suffice?

    strings cnfs-file | grep phrase
    grep -a phrase cnfs-file
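
    A rough Python equivalent of the "grep -a" idea, as a sketch only: it
    memory-maps a file and reports the byte offset and the enclosing line of
    each hit. It assumes the phrase appears as the same contiguous bytes in
    the buffer (no encoding or line-folding differences) and that the phrase
    is not empty. The function name is illustrative.

```python
import mmap
from typing import Iterator, Tuple


def grep_binary(path: str, phrase: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (byte offset, enclosing line) for each occurrence of `phrase`."""
    with open(path, "rb") as handle:
        # mmap cannot map an empty file, so stop early in that case.
        handle.seek(0, 2)
        if handle.tell() == 0:
            return
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as data:
            position = data.find(phrase)
            while position != -1:
                # Expand the hit to the line that contains it.
                start = data.rfind(b"\n", 0, position) + 1
                end = data.find(b"\n", position)
                if end == -1:
                    end = len(data)
                yield position, bytes(data[start:end]).rstrip(b"\r")
                # A line with several hits is reported once per hit.
                position = data.find(phrase, position + 1)
```

    A hit only gives an offset inside the buffer file; it does not say which
    article or storage token the text belongs to.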

  • From Nigel Reed@21:1/5 to Adam W. on Mon Apr 17 17:31:41 2023
    On Mon, 17 Apr 2023 21:37:49 -0000 (UTC) gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) wrote:

    > Nigel Reed <sysop@endofthelinebbs.com> wrote:
    >
    >> 1. How can I search the entire spool for a given phrase?
    >>
    >> 2. How can I search a given hierarchy recursively for a given
    >> phrase?
    >
    > I don't have a direct answer (other than what Julien said), but if
    > the search phrase is always the same, then you could add another
    > newsfeeds entry and feed all new articles to the program with "Tp".
    > This program (script) could do the grep and, for example, send you an
    > email, or post to a private group, or whatever (~15 years ago I used
    > to have a similar setup, reposting all replies to my posts to a
    > private group, accessible only by me -- it was more convenient to
    > reply to them this way).
    >
    > Now I'm using it to count articles on Polish newsgroups to produce
    > results like these:
    >
    > http://news.chmurka.net/top15.php
    >
    > The line is:
    >
    > chmurka.postprocessor.pl\
    >     :!*,pl.*,alt.pl.*\
    >     :Tp:/usr/local/news/local/bin/post-processor-pl.sh %s
    >
    > Now that I think of it, such a grepping script could be built by
    > writing a program that reads the CNFS file (the format has to be
    > described somewhere, or can be deduced from the source code) and
    > maybe dumps its contents to stdout (or searches for a phrase and
    > dumps the whole article), but I don't know of anything like that.
    >
    > On the other hand, CNFS is a binary format, but posts are stored in it
    > as text, so maybe something like this will suffice?
    >
    > strings cnfs-file | grep phrase
    > grep -a phrase cnfs-file

    Unfortunately, the queries would be ad hoc. I looked at CPAN: 212,470
    modules, and nothing that works with CNFS.


  • From Russ Allbery@21:1/5 to Nigel Reed on Mon Apr 17 16:04:00 2023
    Nigel Reed <sysop@endofthelinebbs.com> writes:

    > At this point I have over 7 million articles. Do you have any idea how
    > long that is going to take? :)

    grep is probably somewhat more optimized than sm when reading files, but
    searching 7 million articles is just going to be slow no matter how you
    retrieve the article. sm is retrieving the article in mostly similar ways
    (mmap) to how grep is retrieving it.

    Searching 7 million articles without an index is just going to be slow no
    matter how you do it. This is why when search is an anticipated
    operation, people pre-create the search index.

    (There have been multiple proposals for a search capability in NNTP and
    ways to integrate search into INN over the years, but none of them have
    stuck, in part because the open source search tools keep changing and
    everyone stops using the old ones.)

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Jesse Rehmer@21:1/5 to All on Mon Apr 17 23:02:28 2023
    On Apr 17, 2023 at 3:50:45 PM CDT, "Nigel Reed" <sysop@endofthelinebbs.com> wrote:

    > At this point I have over 7 million articles. Do you have any idea how
    > long that is going to take? :)
    >
    > Maybe I should have just got a new disk and formatted it with 3 times
    > the inodes!

    I weighed the pros/cons of CNFS, and the issue you describe was one that
    kept me from using it on my main box. If you want to stick with
    tradspool, I recommend using ZFS. On a relatively small (>1TB) disk I
    have somewhere near 280,000,000 articles in tradspool.

  • From Richard@21:1/5 to All on Mon Apr 17 22:28:25 2023
    [Please do not mail me a copy of your followup]

    Nigel Reed <sysop@endofthelinebbs.com> spake the secret code <20230417155045.21198a0b@wibble.sysadmininc.com> thusly:

    > At this point I have over 7 million articles. Do you have any idea how
    > long that is going to take? :)

    If this is something you're often doing, then I suggest you build some
    sort of keyword index and use INN to automatically feed all incoming
    articles into the indexer and in the background run an indexer on all
    your existing articles to catch up the old references.

    Then you would query your index for the keywords to find relevant
    articles that might contain your whole phrase.
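
    A toy version of such a keyword index in Python, as a sketch only: it
    maps each lowercased word to the set of article identifiers containing
    it, and a phrase query returns the articles that contain every word of
    the phrase. Those candidates would still need to be fetched and checked
    for the exact phrase. The class name, the tokenization rule, and the use
    of plain strings as article identifiers (storage tokens or Message-IDs)
    are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Assumed tokenization: runs of ASCII letters and digits, lowercased.
WORD = re.compile(r"[a-z0-9]+")


class KeywordIndex:
    """In-memory inverted index: word -> set of article identifiers."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, article_id: str, text: str) -> None:
        """Record every distinct word of `text` under `article_id`."""
        for word in set(WORD.findall(text.lower())):
            self.postings[word].add(article_id)

    def candidates(self, phrase: str) -> Set[str]:
        """Return articles containing all words of `phrase`, in any order."""
        words = WORD.findall(phrase.lower())
        if not words:
            return set()
        word_sets = [self.postings.get(word, set()) for word in words]
        return set.intersection(*word_sets)
```

    New articles would call add() as they arrive, and a background job would
    call it for the existing ones. A real deployment would want the index on
    disk rather than in memory.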
  • From Jesse Rehmer@21:1/5 to Russ Allbery on Mon Apr 17 23:32:16 2023
    On Apr 17, 2023 at 6:04:00 PM CDT, "Russ Allbery" <eagle@eyrie.org> wrote:

    > Nigel Reed <sysop@endofthelinebbs.com> writes:
    >
    >> At this point I have over 7 million articles. Do you have any idea how
    >> long that is going to take? :)
    >
    > grep is probably somewhat more optimized than sm when reading files, but
    > searching 7 million articles is just going to be slow no matter how you
    > retrieve the article. sm is retrieving the article in mostly similar
    > ways (mmap) to how grep is retrieving it.
    >
    > Searching 7 million articles without an index is just going to be slow
    > no matter how you do it. This is why when search is an anticipated
    > operation, people pre-create the search index.
    >
    > (There have been multiple proposals for a search capability in NNTP and
    > ways to integrate search into INN over the years, but none of them have
    > stuck, in part because the open source search tools keep changing and
    > everyone stops using the old ones.)

    The scope is beyond me, but if anyone out there wants a large dataset to
    feed into something like ElasticSearch, I can provide it (and probably
    the hosting/infrastructure needs).

  • From Adam W.@21:1/5 to Nigel Reed on Tue Apr 18 10:51:02 2023
    Nigel Reed <sysop@endofthelinebbs.com> wrote:

    > Unfortunately, the queries would be ad hoc. I looked at CPAN: 212,470
    > modules, and nothing that works with CNFS.

    If you're into Perl, you could look at the source of cnfsstat. It parses
    the buffers directly.

    I'm looking at include/inn/storage.h (it's used by the storage manager,
    frontends/sm.c). It seems there's a nice API to retrieve all articles
    one by one:

    ARTHANDLE *SMnext(ARTHANDLE *article, const RETRTYPE amount);

    There's also a manual page in doc/man/libinnstorage.3 (man libinnstorage).
    If you want to write something to retrieve articles from CNFS (or, in
    general, from storage manager), you could start there.

    "The SMnext function is similar in function to SMretrieve except that it
    is intended for traversing the method's article store sequentially. To
    start a query, SMnext should be called with a NULL pointer ARTHANDLE.
    Then SMnext returns ARTHANDLE which should be used for the next query. If
    a NULL pointer ARTHANDLE is returned, no articles are left to be queried.
    If data of ARTHANDLE is NULL pointer or len of ARTHANDLE is 0, it
    indicates the article may be corrupted and should be cancelled by
    SMcancel. The data area indicated by ARTHANDLE should not be modified."

    There's also some overview search function, but it's not for article
    bodies, so it probably won't be useful.

    One could write a program that retrieves articles one by one and performs
    some operations on them -- for example, matching them with a regex
    (there's a regexec API for that, declared in regex.h) and, if the regex
    matches, printing some information (storage token, or Message-ID, or
    whatever). The ARTHANDLE structure (defined in include/inn/storage.h)
    contains all the required data, including the storage token.
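
    The matching half of such a program can be sketched in Python. This is
    not a binding to libinnstorage: the `articles` argument is a hypothetical
    stand-in for whatever loop calls SMnext in C, assumed to yield (storage
    token, raw article bytes) pairs. Articles with empty data are skipped,
    mirroring the man page's note that they may be corrupted.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Matches a Message-ID header line and captures the <...> identifier.
MESSAGE_ID = re.compile(
    rb"^Message-ID:[ \t]*(<[^>\r\n]+>)", re.IGNORECASE | re.MULTILINE
)


def scan_articles(
    articles: Iterable[Tuple[str, bytes]], pattern: "re.Pattern"
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (token, Message-ID or None) for each article matching `pattern`."""
    for token, data in articles:
        if not data:
            continue  # empty data: possibly corrupted, nothing to search
        if pattern.search(data):
            # Only look for Message-ID in the headers (before the first blank line).
            headers = re.split(rb"\r?\n\r?\n", data, maxsplit=1)[0]
            found = MESSAGE_ID.search(headers)
            message_id = found.group(1).decode("ascii", "replace") if found else None
            yield token, message_id
```

    The pattern is compiled from bytes because articles are handled as raw
    data here; decoding headers or bodies is left out of the sketch.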

    Anyway, if you want to do a text search, it would be best if you made a
    dictionary first (for example, linking all words to the tokens of the
    articles that contain them), but it would be better to use a ready-made
    indexer for that (there has to be one)...

  • From Nigel Reed@21:1/5 to Adam W. on Tue Apr 18 11:56:17 2023
    On Tue, 18 Apr 2023 10:51:02 -0000 (UTC) gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) wrote:


    > Anyway, if you want to do a text search, it would be best if you made
    > a dictionary first (for example, linking all words to tokens that
    > contain them), but it would be better to use a ready-made indexer for
    > that (there has to be one)...

    I think you're right. I set up my CNFS files to be 5 GB each, and it took
    about 10 minutes to grep for my name in one of them when piping the
    output of strings, so I don't see any other way than creating an index
    of some kind.

