Forum: >>> Magnum BBS <<<

working with cnfs

From Nigel Reed@21:1/5 to All on Mon Apr 17 10:34:54 2023

Hi all,

Earlier this week, my news server ran out of inodes and stopped
accepting articles. I want to say a big thank you to Julien, who helped
me get back up and running by converting to CNFS. I never thought
I'd get enough articles to have a problem, but obviously that wasn't the
case.

Now that I've switched, I have a couple of issues, and both are about
using grep to find articles. I'm hoping someone has done this sort of
thing before so I don't have to reinvent the wheel.

If I'm looking for something in particular, I could just go into the
spool/articles directory and start grepping. This isn't possible with
CNFS. So my questions are:

1. How can I search the entire spool for a given phrase?

2. How can I search a given hierarchy recursively for a given phrase?

I'm sure I'm not the first person who wants to do this, so hopefully
someone has a solution.

Thanks,
Nigel

--
End Of The Line BBS - Plano, TX
telnet endofthelinebbs.com 23

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Richard@21:1/5 to All on Mon Apr 17 17:08:40 2023

[Please do not mail me a copy of your followup]

Nigel Reed <sysop@endofthelinebbs.com> spake the secret code <20230417103454.440dbdc0@wibble.sysadmininc.com> thusly:

> me get back up and running by converting to CNFS.

Hey! I learned something today :). I was previously unfamiliar with
CNFS, but it makes perfect sense. I've been hacking on trn and
thinking of a similar in-memory data structure for all the text that
comes back to the news reader.

> If I'm looking for something in particular, I could just go into the
> spool/articles directory and start grepping. This isn't possible with
> CNFS. So my questions are:
>
> 1. How can I search the entire spool for a given phrase?
>
> 2. How can I search a given hierarchy recursively for a given phrase?
>
> I'm sure I'm not the first person who wants to do this, so hopefully
> someone has a solution.

This could be done with some shell scripts and netcat talking to your
news server's nntp port. I imagine there's probably something better
by now though.
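
A rough sketch of that NNTP approach, written in Python instead of shell and
netcat. It assumes the server accepts unauthenticated reader connections on
port 119; search_group, command and read_multiline are names made up for
this sketch, and the host and group in the usage line are placeholders. It
only relies on the standard GROUP and ARTICLE commands (RFC 3977) and makes
one round trip per article, so it will not be fast.

```python
import socket


def read_multiline(stream):
    """Read one NNTP multi-line data block and undo dot-stuffing."""
    lines = []
    while True:
        raw = stream.readline()
        if not raw:  # connection closed early
            break
        line = raw.rstrip(b"\r\n")
        if line == b".":  # a lone dot terminates the block
            break
        if line.startswith(b"."):  # the server doubled a leading dot
            line = line[1:]
        lines.append(line)
    return lines


def command(stream, text):
    """Send one command line and return the server's status line."""
    stream.write(text.encode("utf-8") + b"\r\n")
    stream.flush()
    return stream.readline().decode("utf-8", "replace").strip()


def search_group(host, group, phrase, port=119):
    """Yield (article number, matching line) for each hit in one newsgroup."""
    needle = phrase.encode("utf-8")
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rwb")
        stream.readline()               # greeting line (200 or 201)
        command(stream, "MODE READER")  # ask for the reader command set
        status = command(stream, "GROUP " + group).split()
        if not status or status[0] != "211":
            raise RuntimeError("GROUP failed: " + " ".join(status))
        low, high = int(status[2]), int(status[3])
        for number in range(low, high + 1):
            if not command(stream, "ARTICLE %d" % number).startswith("220"):
                continue  # typically 423: no article with that number
            for line in read_multiline(stream):
                if needle in line:
                    yield number, line.decode("utf-8", "replace")
        command(stream, "QUIT")
```

Usage would be along the lines of
`for number, line in search_group("localhost", "comp.lang.c", "some phrase"): print(number, line)`.
It covers a single group; for a whole hierarchy it would have to be repeated
for each group name.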
--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Terminals Wiki <http://terminals-wiki.org>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

From Julien ÉLIE@21:1/5 to All on Mon Apr 17 20:15:43 2023

Hi Nigel,

> 1. How can I search the entire spool for a given phrase?
>
> 2. How can I search a given hierarchy recursively for a given phrase?
>
> I'm sure I'm not the first person who wants to do this, so hopefully
> someone has a solution.

I'm unfortunately not aware of such a tool for storage methods other
than tradspool. Even timehash won't address the second point, as its
articles are not organized by hierarchy.

If someone has ever written a script to do that, I would happily add it
to INN.

Otherwise, I would just suggest retrieving articles one by one from the
history file, and parsing them.
To do that, take the last field of each line in the history file, and
give its value to "sm -q".

Example for the first article in the history file in pathdb:

% head -n1 history | cut -f3 | sm -q

You'll get the article on standard output. You can then grep it for
whatever you want.
You now need to iterate over each article.

Of course a more complex script should be written if your search has
several parameters (in a hierarchy, from someone, etc.).
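
A minimal sketch of that iteration in Python. It assumes, as in the
"cut -f3" example above, that the token is the third tab-separated field of
a history line and starts with "@"; lines without a token are skipped. The
function names are made up for this sketch, and "sm" is expected to be in
the PATH. One sm process is started per article, so on millions of articles
this will take a long time.

```python
import subprocess


def history_tokens(lines):
    """Yield the storage token from each history line that has one."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: hash, timestamps, token. Entries without a
        # stored article have no third field and are skipped.
        if len(fields) >= 3 and fields[2].startswith("@"):
            yield fields[2]


def fetch(token, sm="sm"):
    """Return the raw article that "sm -q" prints for one token on stdin."""
    result = subprocess.run([sm, "-q"], input=token.encode("ascii") + b"\n",
                            stdout=subprocess.PIPE, check=False)
    return result.stdout


def search_history(history_path, phrase, sm="sm"):
    """Yield the token of every stored article that contains phrase."""
    needle = phrase.encode("utf-8")
    with open(history_path, encoding="ascii", errors="replace") as history:
        for token in history_tokens(history):
            if needle in fetch(token, sm):
                yield token
```

Something like `for token in search_history("history", "some phrase"): print(token)`
would then list the tokens of matching articles, which can be given back to
sm to read them.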

--
Julien ÉLIE

« Ira furor breuis est. » (Horace)

From Nigel Reed@21:1/5 to iulius@nom-de-mon-site.com.invalid on Mon Apr 17 15:50:45 2023

On Mon, 17 Apr 2023 20:15:43 +0200
Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote:

> Hi Nigel,
>
> > 1. How can I search the entire spool for a given phrase?
> >
> > 2. How can I search a given hierarchy recursively for a given
> > phrase?
> >
> > I'm sure I'm not the first person who wants to do this, so hopefully
> > someone has a solution.
>
> I'm unfortunately not aware of such a tool for storage methods other
> than tradspool. Even timehash won't address the second point, as its
> articles are not organized by hierarchy.
>
> If someone has ever written a script to do that, I would happily add
> it to INN.
>
> Otherwise, I would just suggest retrieving articles one by one from
> the history file, and parsing them.
> To do that, take the last field of each line in the history file,
> and give its value to "sm -q".
>
> Example for the first article in the history file in pathdb:
>
> % head -n1 history | cut -f3 | sm -q
>
> You'll get the article on standard output. You can then grep it for
> whatever you want.
> You now need to iterate over each article.
>
> Of course a more complex script should be written if your search has
> several parameters (in a hierarchy, from someone, etc.).

At this point I have over 7 million articles. Do you have any idea how
long that is going to take? :)

Maybe I should have just got a new disk and formatted it with 3 times
the inodes!

--
End Of The Line BBS - Plano, TX
telnet endofthelinebbs.com 23

From Adam W.@21:1/5 to Nigel Reed on Mon Apr 17 21:37:49 2023

Nigel Reed <sysop@endofthelinebbs.com> wrote:

> 1. How can I search the entire spool for a given phrase?
>
> 2. How can I search a given hierarchy recursively for a given phrase?

I don't have a direct answer (other than what Julien said), but if the
search phrase is always the same, then you could add another newsfeeds
entry and feed all new articles to the program with "Tp". This program
(script) could do the grep and, for example, send you an email, or post
to a private group, or whatever (~15 years ago I used to have a similar
setup, reposting all replies to my posts to a private group, accessible
only by me -- it was more convenient to reply to them this way).

Now I'm using it to count articles on Polish newsgroups to produce results
like these:

http://news.chmurka.net/top15.php

The line is:

chmurka.postprocessor.pl\
:!*,pl.*,alt.pl.*\
:Tp:/usr/local/news/local/bin/post-processor-pl.sh %s

Now that I think of it, such a grepping script could be done by writing a
program that reads the CNFS file (the format has to be described
somewhere, or can be deduced from the source code) and maybe dumps its
contents to stdout (or searches for a phrase and dumps the whole article),
but I don't know of anything like that.

On the other hand, CNFS is a binary format, but posts are stored in a text
format, so maybe something like this will suffice?

strings cnfs-file | grep phrase
grep -a phrase cnfs-file
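
The same idea as "grep -a", sketched in Python for the case where the byte
offset of each hit is also wanted. It knows nothing about the CNFS layout:
it just treats the buffer as bytes and reports the text line around every
match. grep_binary is a name made up for this sketch, and the file must not
be empty (mmap refuses to map an empty file).

```python
import mmap


def grep_binary(path, phrase):
    """Yield (byte offset of line start, line) for lines containing phrase."""
    if not phrase:
        raise ValueError("phrase must not be empty")
    needle = phrase.encode("utf-8")
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as data:
            pos = data.find(needle)
            while pos != -1:
                # rfind returns -1 when no newline precedes, giving start 0
                start = data.rfind(b"\n", 0, pos) + 1
                end = data.find(b"\n", pos)
                if end == -1:
                    end = len(data)
                line = data[start:end].rstrip(b"\r")
                yield start, line.decode("utf-8", "replace")
                pos = data.find(needle, end)  # resume after this line
```

Like the two commands above, this returns matching lines rather than whole
articles and gives no storage token. Since the buffers are cyclic, it may
also hit articles that have already expired but are not overwritten yet.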

From Nigel Reed@21:1/5 to Adam W. on Mon Apr 17 17:31:41 2023

On Mon, 17 Apr 2023 21:37:49 -0000 (UTC) gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) wrote:

> Nigel Reed <sysop@endofthelinebbs.com> wrote:
>
> > 1. How can I search the entire spool for a given phrase?
> >
> > 2. How can I search a given hierarchy recursively for a given
> > phrase?
>
> I don't have a direct answer (other than what Julien said), but if
> the search phrase is always the same, then you could add another
> newsfeeds entry and feed all new articles to the program with "Tp".
> This program (script) could do the grep and, for example, send you an
> email, or post to a private group, or whatever (~15 years ago I used
> to have a similar setup, reposting all replies to my posts to a
> private group, accessible only by me -- it was more convenient to
> reply to them this way).
>
> Now I'm using it to count articles on Polish newsgroups to produce
> results like these:
>
> http://news.chmurka.net/top15.php
>
> The line is:
>
> chmurka.postprocessor.pl\
> :!*,pl.*,alt.pl.*\
> :Tp:/usr/local/news/local/bin/post-processor-pl.sh %s
>
> Now that I think of it, such a grepping script could be done by
> writing a program that reads the CNFS file (the format has to be
> described somewhere, or can be deduced from the source code) and
> maybe dumps its contents to stdout (or searches for a phrase and
> dumps the whole article), but I don't know of anything like that.
>
> On the other hand, CNFS is a binary format, but posts are stored in a
> text format, so maybe something like this will suffice?
>
> strings cnfs-file | grep phrase
> grep -a phrase cnfs-file

Unfortunately the queries would be ad-hoc. I was looking at CPAN,
212,470 modules and nothing to work with CNFS.

--
End Of The Line BBS - Plano, TX
telnet endofthelinebbs.com 23

From Russ Allbery@21:1/5 to Nigel Reed on Mon Apr 17 16:04:00 2023

Nigel Reed <sysop@endofthelinebbs.com> writes:

> At this point I have over 7 million articles. Do you have any idea how
> long that is going to take? :)

grep is probably somewhat more optimized than sm when reading files, but
searching 7 million articles is just going to be slow no matter how you
retrieve the article. sm is retrieving the article in mostly similar ways
(mmap) to how grep is retrieving it.

Searching 7 million articles without an index is just going to be slow no
matter how you do it. This is why when search is an anticipated
operation, people pre-create the search index.

(There have been multiple proposals for a search capability in NNTP and
ways to integrate search into INN over the years, but none of them have
stuck, in part because the open source search tools keep changing and
everyone stops using the old ones.)

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

From Jesse Rehmer@21:1/5 to All on Mon Apr 17 23:02:28 2023

On Apr 17, 2023 at 3:50:45 PM CDT, "Nigel Reed" <sysop@endofthelinebbs.com> wrote:

> At this point I have over 7 million articles. Do you have any idea how
> long that is going to take? :)
>
> Maybe I should have just got a new disk and formatted it with 3 times
> the inodes!

I weighed the pros/cons of CNFS and the issue you describe was one that kept
me from using it on my main box. If you want to stick with tradspool I
recommend using ZFS. On a relatively small (>1TB) disk I have somewhere
near 280,000,000 articles in tradspool.

From Richard@21:1/5 to All on Mon Apr 17 22:28:25 2023

[Please do not mail me a copy of your followup]

Nigel Reed <sysop@endofthelinebbs.com> spake the secret code <20230417155045.21198a0b@wibble.sysadmininc.com> thusly:

> At this point I have over 7 million articles. Do you have any idea how
> long that is going to take? :)

If this is something you're often doing, then I suggest you build some
sort of keyword index, use INN to automatically feed all incoming
articles into the indexer, and run an indexer in the background on all
your existing articles to catch up on the old ones.

Then you would query your index for the keywords to find relevant
articles that might contain your whole phrase.
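
As a toy illustration of that two-step lookup, here is a small in-memory
inverted index in Python. KeywordIndex and its methods are invented for this
sketch, article ids stand in for storage tokens, and "fetch" is whatever
function the caller has for getting an article's text. A real setup would
use an on-disk indexer instead.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


class KeywordIndex:
    """In-memory inverted index: lower-cased word -> set of article ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, article_id, text):
        """Record every distinct word of text under article_id."""
        for word in set(WORD.findall(text.lower())):
            self.postings[word].add(article_id)

    def candidates(self, phrase):
        """Ids of articles containing every word of phrase, in any order."""
        words = WORD.findall(phrase.lower())
        if not words:
            return set()
        sets = [self.postings.get(word, set()) for word in words]
        return set.intersection(*sets)

    def search(self, phrase, fetch):
        """Fetch each candidate and keep those with the exact phrase."""
        wanted = phrase.lower()
        return sorted(article_id for article_id in self.candidates(phrase)
                      if wanted in fetch(article_id).lower())
```

Incoming articles would go through add() as they arrive, and only the few
candidates returned by the keyword lookup need to be read back to confirm
the whole phrase.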
--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Terminals Wiki <http://terminals-wiki.org>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

From Jesse Rehmer@21:1/5 to Russ Allbery on Mon Apr 17 23:32:16 2023

On Apr 17, 2023 at 6:04:00 PM CDT, "Russ Allbery" <eagle@eyrie.org> wrote:

> Nigel Reed <sysop@endofthelinebbs.com> writes:
>
> > At this point I have over 7 million articles. Do you have any idea how
> > long that is going to take? :)
>
> grep is probably somewhat more optimized than sm when reading files, but
> searching 7 million articles is just going to be slow no matter how you
> retrieve the article. sm is retrieving the article in mostly similar ways
> (mmap) to how grep is retrieving it.
>
> Searching 7 million articles without an index is just going to be slow no
> matter how you do it. This is why when search is an anticipated
> operation, people pre-create the search index.
>
> (There have been multiple proposals for a search capability in NNTP and
> ways to integrate search into INN over the years, but none of them have
> stuck, in part because the open source search tools keep changing and
> everyone stops using the old ones.)

The scope is beyond me, but if anyone out there wants a large dataset to
feed into something like ElasticSearch, I can provide it (and probably the
hosting/infrastructure needs).

From Adam W.@21:1/5 to Nigel Reed on Tue Apr 18 10:51:02 2023

Nigel Reed <sysop@endofthelinebbs.com> wrote:

> Unfortunately the queries would be ad-hoc. I was looking at CPAN,
> 212,470 modules and nothing to work with CNFS.

If you're into Perl, you could see the source of cnfsstat. It parses
buffers directly.

I'm looking into include/inn/storage.h (it's used by the storage manager,
frontends/sm.c). Seems there's a nice API to retrieve all articles one by
one:

ARTHANDLE *SMnext(ARTHANDLE *article, const RETRTYPE amount);

There's also a manual page in doc/man/libinnstorage.3 (man libinnstorage).
If you want to write something to retrieve articles from CNFS (or, in
general, from storage manager), you could start there.

"The SMnext function is similar in function to SMretrieve except that it
is intended for traversing the method's article store sequentially. To
start a query, SMnext should be called with a NULL pointer ARTHANDLE.
Then SMnext returns ARTHANDLE which should be used for the next query. If
a NULL pointer ARTHANDLE is returned, no articles are left to be queried.
If data of ARTHANDLE is NULL pointer or len of ARTHANDLE is 0, it
indicates the article may be corrupted and should be cancelled by
SMcancel. The data area indicated by ARTHANDLE should not be modified."

There's also some overview search function, but it's not for article
bodies, so it probably won't be useful.

One could write a program that retrieves articles one by one and performs
some operations on them -- for example, matching them with regex (there's
a regexec API for that, declared in regex.h) and if regex matches,
printing some information (storage token, or Message-ID, or whatever).
The ARTHANDLE structure (defined in include/inn/storage.h) contains all
required data, including storage token.

Anyway, if you want to do a text search, it would be best if you made a
dictionary first (for example, linking all words to tokens that contain
them), but it would be better to use a ready-made indexer for that (there
has to be one)...

From Nigel Reed@21:1/5 to Adam W. on Tue Apr 18 11:56:17 2023

On Tue, 18 Apr 2023 10:51:02 -0000 (UTC) gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) wrote:

> Anyway, if you want to do a text search, it would be best if you made
> a dictionary first (for example, linking all words to tokens that
> contain them), but it would be better to use a ready-made indexer for
> that (there has to be one)...

I think you're right. I set up my CNFS files to be 5 GB each, and it took
about 10 minutes to grep for my name in one of them when piping the
output of strings, so I don't see any other way than creating an index
of some kind.

--
End Of The Line BBS - Plano, TX
telnet endofthelinebbs.com 23
