• Archiving Usenet 2003-2025

    From Jason Evans@21:1/5 to All on Sun Jun 1 13:34:55 2025
    A few months ago, I posted about my Usenet archiver application. Since then,
    I have completely retooled it and rewritten it in Python, and it is now a very capable tool.

    In January, I began a project that I had started many times before but never finished: archiving Usenet newsgroups from 2003 until the current year. To do this, I am using a paid Usenet provider, downloading every newsgroup in mbox format, and compressing the results with gzip. You might be wondering why I have been at this since January and am still not done. That's because paid Usenet providers prioritize binary groups over text groups. I am not archiving binary groups, but when one slips under my radar, I can easily see that far more of it has been downloaded than of other newsgroups in the same amount of time.

    Anyway, since January, I have downloaded approximately 2TB of newsgroups. Which newsgroups? The list so far is on my GitHub, linked below. If any well-known groups are missing, please let me know, and I will add them to my queue. You might be wondering where I get my list of newsgroups. I began with the semi-official list from isc.org (https://ftp.isc.org/usenet/CONFIG/newsgroups.gz). I have omitted only the following: test groups (e.g., misc.test), binary groups, and some alt groups that deal with pedophilia. Next, I got a list of the newsgroups that are
    carried by eternal-september and started a new queue based on that, downloading all of the groups that are not in the isc list. There are a lot
    of them, and I'm hoping to have them done in the coming weeks. I am
    downloading approximately 95 newsgroups at a time in parallel. The limit
    from my Usenet provider is 100 downloads at a time.
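
    The workflow described above (one gzip-compressed mbox per group, about 95 groups in parallel under a provider limit of 100) can be sketched roughly as follows. This is only an illustration, not the archiver's actual code: `fetch_group`, `archive_group`, and `archive_all` are hypothetical names, and `fetch_group` returns a placeholder article instead of talking to a real NNTP server.

```python
import gzip
import mailbox
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from email.message import EmailMessage

MAX_PARALLEL = 95  # stays under the provider's limit of 100 connections


def fetch_group(group):
    """Hypothetical stand-in for the real NNTP download of one group."""
    msg = EmailMessage()
    msg["From"] = "poster@example.invalid"
    msg["Newsgroups"] = group
    msg["Subject"] = f"placeholder article for {group}"
    msg["Message-ID"] = f"<1@{group}.example.invalid>"
    msg.set_content("body text\n")
    return [msg]


def archive_group(group, out_dir):
    """Write one group's articles to <group>.mbox, then gzip the file."""
    mbox_path = os.path.join(out_dir, f"{group}.mbox")
    box = mailbox.mbox(mbox_path)
    try:
        for msg in fetch_group(group):
            box.add(msg)
        box.flush()
    finally:
        box.close()
    gz_path = mbox_path + ".gz"
    with open(mbox_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(mbox_path)  # keep only the compressed copy
    return gz_path


def archive_all(groups, out_dir):
    """Archive many groups with at most MAX_PARALLEL running at once."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        return list(pool.map(lambda g: archive_group(g, out_dir), groups))
```

    Capping the worker pool below the provider's limit leaves a few connections free for retries or manual checks.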

    I'll update again later when I begin uploading them to the Internet Archive.

    https://github.com/tgeek77/usenet_archiver/blob/main/fetch_log.txt

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Billy G. (go-while)@21:1/5 to Jason Evans on Fri Jun 13 20:36:57 2025
    Cool project idea, i already did the same.

    Here is everything you can get from archive.org and probably everything
    you can get from the biggest paid providers....

    10 TB of text, mostly unfiltered. Maybe some Google Groups spam is missing.

    The archive is live and connected via peering so nothing else to do, it archives on its own.

    The Server is written by me and lacks some commands.

    Text Usenet Archive
    Host: lux-feed1.newsdeef.eu
    Port: 119 or 563 SSL
    User: usenet
    Pass: archive

    Please don't hit it too hard, but connections are limited anyway.

    You can get me on discord: https://discord.gg/rECSbHHFzp

    If anybody can take a full copy: I'm happy to share!!!


    On 01.06.25 15:34, Jason Evans wrote:
    A few months ago, I posted about my Usenet archiver application. Since then, I have completely retooled it, rewrote it in Python, and it is now a very capable tool.
    ...
    ...
    The limit
    from my Usenet provider is 100 downloads at a time.

    I'll update again later when I begin uploading them to the Internet Archive.

    https://github.com/tgeek77/usenet_archiver/blob/main/fetch_log.txt

    --
    .......
    Billy G. (go-while)

  • From Urs Janßen@21:1/5 to no-reply@no.spam on Sat Jun 14 06:56:19 2025
    In Billy G. (go-while) <no-reply@no.spam> wrote:
    The Server is written by me and lacks some commands.

    LIST OVERVIEW.FMT is buggy (doubled status/end response):

    | > CAPABILITIES
    [...]
    | < LIST ACTIVE ACTIVE.TIMES COUNTS DISTRIB.PATS DISTRIBUTIONS HEADERS MODERATORS MOTD NEWSGROUPS OVERVIEW.FMT SUBSCRIPTIONS

    | > LIST OVERVIEW.FMT
    | < 215 Order of fields in overview database.
    ! < 215 Order of fields in overview database.
    | < Subject:
    | < From:
    | < Date:
    | < Message-ID:
    | < References:
    | < Bytes:
    | < Lines:
    | < Xref:full
    | < .
    ! < .

    LIST MOTD never returns
    | > LIST MOTD
    Connection closed by foreign host.

    LIST COUNTS never returns
    | > LIST COUNTS
    Connection closed by foreign host.

    If they are not implemented, don't advertise them in CAPABILITIES.
    There might be more worth fixing...
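
    The doubled status line and doubled terminator in the LIST OVERVIEW.FMT transcript above can be detected mechanically. The following is a minimal sketch, assuming the raw response has already been split into lines without CRLF; `parse_multiline` is a hypothetical helper, not part of any NNTP library.

```python
def parse_multiline(lines):
    """Split a multi-line NNTP response into (status, data, problems).

    `lines` is the raw response, one string per line, CRLF removed.
    """
    problems = []
    status = lines[0]
    rest = lines[1:]

    # A repeat of the status line directly after the first one is a bug.
    while rest and rest[0] == status:
        problems.append("duplicate status line")
        rest = rest[1:]

    data = []
    terminated = False
    for i, line in enumerate(rest):
        if line == ".":
            terminated = True
            # Anything after the lone "." does not belong to this response.
            if rest[i + 1:]:
                problems.append("data after terminator")
            break
        # Undo dot-stuffing: a leading "." on a data line was doubled.
        data.append(line[1:] if line.startswith(".") else line)

    if not terminated:
        problems.append("missing terminator")
    return status, data, problems
```

    Fed the transcript above, this should report both a duplicate status line and data after the terminator, while still returning the eight overview fields.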

  • From Colin Macleod@21:1/5 to All on Mon Jun 16 15:38:32 2025
    "Billy G. (go-while)" <no-reply@no.spam> posted:

    Cool project idea, i already did the same.

    Here is everything you can get from archive.org and probably everything
    you can get from the biggest paid providers....


    Impressive, some content goes back to 1983, before the "Great Renaming",
    but checking comp.lang.tcl also shows a new message posted today.
    There are nearly half a million groups listed, but many appear to be bogus
    with names which are typos and no or minimal content.

    I wonder if I could add lux-feed1.newsdeef.eu as another upstream server for
    my newsgrouper.org web/usenet gateway? I'm loading XOVER data into a local database to support efficient searching; I'm also loading some old article
    data from the Internet Archive and the "utzoo" archive into another local database, but for most articles I then pull the content on-demand from
    these servers:
    - eternal-september.org for recent posts
    - blueworldhosting.com for history back to 2003

    Could I add your server to this list?
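
    Loading XOVER data into a local database for searching, as described above, might look roughly like this. It is a sketch under assumptions, not newsgrouper's actual schema: the table layout and the names `parse_overview_line`, `open_db`, and `store` are made up for illustration, and SQLite stands in for whatever database is really used. The field order follows the standard overview layout (article number, Subject, From, Date, Message-ID, References, Bytes, Lines, then optional extras such as Xref).

```python
import sqlite3

# Standard overview fields, in order; extras such as Xref may follow.
FIELDS = ("num", "subject", "sender", "date",
          "message_id", "refs", "bytes", "lines")


def parse_overview_line(line):
    """Turn one tab-separated XOVER line into a dict of named fields."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) < len(FIELDS):
        raise ValueError(f"short overview line: {line!r}")
    row = dict(zip(FIELDS, parts))
    row["num"] = int(row["num"])
    return row


def open_db(path=":memory:"):
    """Open (or create) the hypothetical overview database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS overview ("
        "grp TEXT, num INTEGER, subject TEXT, sender TEXT, "
        "date TEXT, message_id TEXT, refs TEXT, "
        "PRIMARY KEY (grp, num))"
    )
    db.execute(
        "CREATE INDEX IF NOT EXISTS by_mid ON overview (message_id)"
    )
    return db


def store(db, group, lines):
    """Insert overview lines for one group; returns the number stored."""
    rows = [parse_overview_line(line) for line in lines]
    db.executemany(
        "INSERT OR REPLACE INTO overview VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(group, r["num"], r["subject"], r["sender"],
          r["date"], r["message_id"], r["refs"]) for r in rows],
    )
    db.commit()
    return len(rows)
```

    Indexing by Message-ID means an article body can later be requested from whichever upstream server carries it, independent of per-server article numbers.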

    --
    Colin Macleod ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ https://cmacleod.me.uk

    Warning: Gumption level low, top-up when possible!

  • From Jason Evans@21:1/5 to All on Sun Jun 22 19:34:39 2025
    On 6/13/25 1:36 PM, Billy G. (go-while) wrote:
    Cool project idea, i already did the same.

    Can you provide a link to your archives or are they only on your news
    server? What newsgroup list did you use for gathering groups? I used the
    group list from isc.org
    (https://ftp.isc.org/pub/usenet/CONFIG/newsgroups) supplemented with
    Eternal September's list.

    Also, how did you get your archives? I developed a script to do this for
    me because I couldn't find a reliable way to do this otherwise. I also
    have the ability to download groups from a specific time frame which I
    hope to use every year to archive groups year-by-year. Are you doing
    something like that also?
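
    Selecting articles from a specific time frame, for example one calendar year, could be done by filtering on the Date header. The sketch below is an assumption about how such a filter might work, not the script's real logic; `year_window` and `in_window` are hypothetical names, and articles with unparseable dates are simply skipped.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def year_window(year):
    """Return (start, end) covering one calendar year in UTC."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    return start, end


def in_window(date_header, start, end):
    """True if the article's Date header falls in [start, end)."""
    try:
        dt = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # unparseable date: leave it out of this window
    if dt.tzinfo is None:
        # No usable zone information: treat the timestamp as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return start <= dt < end
```

    A yearly run would call `year_window` once and apply `in_window` to each article's Date header before downloading the body.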

    Jason

  • From Colin Macleod@21:1/5 to All on Mon Jun 23 10:58:28 2025
    Colin Macleod <user7@newsgrouper.org.invalid> posted:

    "Billy G. (go-while)" <no-reply@no.spam> posted:

    Cool project idea, i already did the same.

    Here is everything you can get from archive.org and probably everything
    you can get from the biggest paid providers....


    Impressive, some content goes back to 1983, before the "Great Renaming",
    but checking comp.lang.tcl also shows a new message posted today.

    I tried Billy's server lux-feed1.newsdeef.eu again today, but now I just
    get "481 Denied" responses. Is it still available?

    --
    Colin Macleod ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ https://cmacleod.me.uk

    🧙 Is there anybody there?
    👻 Is that a trick question? I'm here in spirit but not in body!

  • From Billy G. (go-while)@21:1/5 to Colin Macleod on Mon Jul 28 11:59:43 2025
    On 16.06.25 17:38, Colin Macleod wrote:
    "Billy G. (go-while)" <no-reply@no.spam> posted:

    Cool project idea, i already did the same.

    Here is everything you can get from archive.org and probably everything
    you can get from the biggest paid providers....


    Impressive, some content goes back to 1983, before the "Great Renaming",
    but checking comp.lang.tcl also shows a new message posted today.
    There are nearly half a million groups listed, but many appear to be bogus with names which are typos and no or minimal content.

    Could I add your server to this list?


    Oh, I did not see this here!

    Yes, you can! :)

    That's everything that was available at archive.org,
    and sucked from most providers too...
    all spam included.

    --
    .......
    Billy G. (go-while)
    https://pugleaf.net
    @Newsgroup: rocksolid.nodes.help

  • From Billy G. (go-while)@21:1/5 to Jason Evans on Mon Jul 28 12:25:09 2025
    On 23.06.25 02:34, Jason Evans wrote:
    On 6/13/25 1:36 PM, Billy G. (go-while) wrote:
    Cool project idea, i already did the same.

    Can you provide a link to your archives or are they only on your news
    server? What newsgroup list did you use for gathering groups? I used the group list from isc.org
    (https://ftp.isc.org/pub/usenet/CONFIG/newsgroups) supplemented with
    Eternal September's list.

    Also, how did you get your archives? I developed a script to do this for
    me because I couldn't find a reliable way to do this otherwise. I also
    have the ability to download groups from a specific time frame which I
    hope to use every year to archive groups year-by-year. Are you doing something like that also?

    Jason

    archive.org -> mbox2nntp ( https://github.com/go-while/mbox2nntp )

    https://archive.org/details/usenethistorical

    https://archive.org/details/usenet

    You need multiple TB of free space and more time to download them all =)

    Text Usenet Archive
    Host: lux-feed1.newsdeef.eu
    Port: 119 or 563 SSL
    User: usenet
    Pass: archive

    storage moved to SSD with 25 Gbps uplink and performance is crazy!

    You can use 1000 conns. Server will be happy serving you!

    Please ask if you need more conns! :D

    --
    .......
    Billy G. (go-while)
    https://pugleaf.net
    @Newsgroup: rocksolid.nodes.help

  • From Jesse Rehmer@21:1/5 to All on Sat Aug 30 04:11:03 2025
    On Jun 13, 2025 at 1:36:57 PM CDT, "Billy G. (go-while)" <no-reply@no.spam> wrote:

    Cool project idea, i already did the same.

    Here is everything you can get from archive.org and probably everything
    you can get from the biggest paid providers....

    10 TB of text, mostly unfiltered. Maybe some Google Groups spam is missing.

    The archive is live and connected via peering so nothing else to do, it archives on its own.

    The Server is written by me and lacks some commands.

    Text Usenet Archive
    Host: lux-feed1.newsdeef.eu
    Port: 119 or 563 SSL
    User: usenet
    Pass: archive

    Please don't hit it too hard, but connections are limited anyway.

    You can get me on discord: https://discord.gg/rECSbHHFzp

    If anybody can take a full copy: I'm happy to share!!!

    Hey Billy, I've been meaning to reach out to you. Mind contacting me via e-mail?

    I'd like to know if there is a more efficient way than using suck/pullnews to obtain the archive? I had been putting together an archive at news.blueworldhosting.com, but have a number of holes and never got around to seriously importing the mbox files from archive.org.

    I know you're in early development stages, but if you'd like someone to test pushing/streaming articles via NNTP I'm interested. I have a lot of bandwidth and performant hardware, always a good test case for testing NNTP streaming.

  • From Billy G.@21:1/5 to Billy G. on Sat Aug 30 23:50:26 2025
    On 30.08.25 23:45, Billy G. wrote:
    I've a tool to send many groups concurrently to nntp server via ihave.

    Side note:
    INN will not accept many of the old articles because dates are weird...

    --
    .......
    Billy G. (go-while)
    https://pugleaf.net
    @Newsgroup: rocksolid.nodes.help
    irc.pugleaf.net:6697 (SSL) #lounge
    discord: https://discord.gg/rECSbHHFzp

  • From Billy G.@21:1/5 to Jesse Rehmer on Sat Aug 30 23:45:28 2025
    On 30.08.25 05:11, Jesse Rehmer wrote:

    If anybody can take a full copy: I'm happy to share!!!

    Hey Billy, I've been meaning to reach out to you. Mind contacting me via e-mail?

    I'd like to know if there is a more efficient way than using suck/pullnews to obtain the archive? I had been putting together an archive at news.blueworldhosting.com, but have a number of holes and never got around to seriously importing the mbox files from archive.org.

    I know you're in early development stages, but if you'd like someone to test pushing/streaming articles via NNTP I'm interested. I have a lot of bandwidth and performant hardware, always a good test case for testing NNTP streaming.

    Hi!

    Using suck is the worst way to download from the newsdeef archive.
    The overview is not a database but a flat file with offset indexes
    for every 100 articles only, so downloading by article number is slow.

    Articles are stored under the sha256 hash of their message-id.
    The best way is to request '(X)HDR message-id' in a group first,
    then suck by message-id: that gives max performance.
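
    The approach described above (list message-ids with '(X)HDR message-id', then fetch by message-id) can be sketched as follows. This is an illustration only: `storage_key`, `parse_hdr_lines`, and `fetch_commands` are hypothetical names, and whether the server hashes the message-id with or without its angle brackets is an assumption here, not something stated in the thread.

```python
import hashlib


def storage_key(message_id):
    """sha256 hex digest of a message-id.

    Assumption: the full '<...>' form is hashed as UTF-8 bytes.
    """
    return hashlib.sha256(message_id.encode("utf-8")).hexdigest()


def parse_hdr_lines(lines):
    """Extract message-ids from '(X)HDR message-id' data lines.

    Each data line has the form '<article number> <message-id>'.
    """
    mids = []
    for line in lines:
        num, _, value = line.partition(" ")
        value = value.strip()
        if num.isdigit() and value.startswith("<") and value.endswith(">"):
            mids.append(value)
    return mids


def fetch_commands(mids):
    """Build one ARTICLE command per message-id."""
    return [f"ARTICLE {mid}" for mid in mids]
```

    Fetching by message-id lets a hash-keyed store look the article up directly, rather than resolving an article number through the sparse overview index.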

    I have a tool to send many groups concurrently to an NNTP server via IHAVE.

    I'll send you an email later.


    Import to pugleaf databases is completed (only *sex* missing).

    Plan is to share the database snapshots via torrent (10 TB+)

    Source Code is online too: https://github.com/go-while/go-pugleaf

    --
    .......
    Billy G. (go-while)
    https://pugleaf.net
    @Newsgroup: rocksolid.nodes.help
    irc.pugleaf.net:6697 (SSL) #lounge
    discord: https://discord.gg/rECSbHHFzp

  • From Jesse Rehmer@21:1/5 to Billy G. on Sat Aug 30 23:42:45 2025
    On Aug 30, 2025 at 5:45:28 PM CDT, "Billy G." <no-reply@no.spam> wrote:

    On 30.08.25 05:11, Jesse Rehmer wrote:

    If anybody can take a full copy: I'm happy to share!!!

    Hey Billy, I've been meaning to reach out to you. Mind contacting me via
    e-mail?

    I'd like to know if there is a more efficient way than using suck/pullnews to
    obtain the archive? I had been putting together an archive at
    news.blueworldhosting.com, but have a number of holes and never got around to
    seriously importing the mbox files from archive.org.

    I know you're in early development stages, but if you'd like someone to test pushing/streaming articles via NNTP I'm interested. I have a lot of bandwidth
    and performant hardware, always a good test case for testing NNTP streaming.

    Hi!

    using suck is worst way to download from the newsdeef archive.
    the overview is not a database but a flat file with offset indexes
    for every 100 articles only and downloading by article number is slow.

    Articles are stored as sha256 hash from message-id.
    best way is requesting '(X)HDR message-id' in a group first,
    then suck message-ids: results in max performance.

    I've a tool to send many groups concurrently to nntp server via ihave.

    FWIW - that's the default behavior of suck, it uses XHDR, builds a database of Message-IDs, and uses ARTICLE <MID> to fetch (among other things).

    If you have a tool to push, that's great, we can sort out details via e-mail.

    I'm aware of a few challenges getting the older messages accepted by INN. From what I've observed so far, it's primarily articles originating from ANews and BNews (seems it was primarily used from 1981-1982, based on what I'm seeing rejected). I'll have to sort out how to deal with that later.
