This may or may not be an INN 2.8.0 issue, but I noticed after upgrading and starting to inject older articles that I'm seeing errors about syntax errors in Message-IDs, even though I have the following set in etc/inn.conf:
syntaxchecks: [ laxmid ]
I see in the manpage for inn.conf:
When laxmid is set, Message-IDs containing ".." in the left
part are accepted, as well as Message-IDs with two "@".
I assume that because Message-IDs like the one below do not fall within those parameters they are still rejected, but is there a way to accept these?
<3f71e4a7_3@aeinews.> - 435 Syntax error in message-ID
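For reference, here is a rough Python sketch (an illustration of the strict grammar, not INN's actual C code) showing why this particular ID trips the check: the right-hand side "aeinews." ends with a dot, which dot-atom-text does not allow.

import re

# Simplified sketch of the strict msg-id syntax (dot-atom-text on both sides
# of "@", no-fold-literal omitted); illustration only, not INN's implementation.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"
STRICT_MSGID = re.compile(rf"^<{DOT_ATOM}@{DOT_ATOM}>$")

print(bool(STRICT_MSGID.match("<3f71e4a7_3@aeinews.>")))   # False: right part ends in "."
print(bool(STRICT_MSGID.match("<3f71e4a7_3@aeinews.us>"))) # True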
Jesse Rehmer wrote:
This may or may not be an INN 2.8.0 issue, but I noticed after upgrading and starting to inject older articles that I'm seeing errors about syntax errors in Message-IDs, even though I have the following set in etc/inn.conf:
syntaxchecks: [ laxmid ]
I see in the manpage for inn.conf:
When laxmid is set, Message-IDs containing ".." in the left
part are accepted, as well as Message-IDs with two "@".
I assume that because Message-IDs like the one below do not fall within those parameters they are still rejected, but is there a way to accept these?
<3f71e4a7_3@aeinews.> - 435 Syntax error in message-ID
I found when adding older (1980s) articles to an inn2 server, it was necessary
to modify some headers else they were rejected. I found that trying to get inn2
to accept them was not the answer, modifying the article header was.
How old are these 'older articles' you are injecting?
You can see my results in a web interface here: http://www.novalink.us or in inn2 here: news.novalink.us:119
On Jul 16, 2023 at 10:25:17 AM CDT, "Retro Guy" <Retro Guy> wrote:
Jesse Rehmer wrote:
This may or may not be an INN 2.8.0 issue, but I noticed after upgrading and starting to inject older articles that I'm seeing errors about syntax errors in Message-IDs, even though I have the following set in etc/inn.conf:
syntaxchecks: [ laxmid ]
I see in the manpage for inn.conf:
When laxmid is set, Message-IDs containing ".." in the left
part are accepted, as well as Message-IDs with two "@".
I assume that because Message-IDs like the one below do not fall within those parameters they are still rejected, but is there a way to accept these?
<3f71e4a7_3@aeinews.> - 435 Syntax error in message-ID
I found when adding older (1980s) articles to an inn2 server, it was necessary
to modify some headers else they were rejected. I found that trying to get inn2
to accept them was not the answer, modifying the article header was.
How old are these 'older articles' you are injecting?
You can see my results in a web interface here: http://www.novalink.us or in inn2 here: news.novalink.us:119
2003-ish and forward.
On Jul 16, 2023 at 10:25:17 AM CDT, "Retro Guy" <Retro Guy> wrote:
How old are these 'older articles' you are injecting?
You can see my results in a web interface here: http://www.novalink.us or in
inn2 here: news.novalink.us:119
2003-ish and forward.
syntaxchecks: [ laxmid ]
I see in the manpage for inn.conf:
When laxmid is set, Message-IDs containing ".." in the left
part are accepted, as well as Message-IDs with two "@".
I assume that because Message-IDs like the one below do not fall within those parameters they are still rejected, but is there a way to accept these?
<3f71e4a7_3@aeinews.> - 435 Syntax error in message-ID
I found when adding older (1980s) articles to an inn2 server, it was necessary to modify some headers else they were rejected. I found that trying to
get inn2 to accept them was not the answer, modifying the article header was.
Hi Jesse,
Over these past years, I have often seen questions about syntax checks.
Maybe laxmid should allow more Message-IDs than only the ones with ".."
and two "@"?
Strictly speaking, a dot (".") must be followed by another non-special char, so <a.@b> and <a@b.> are invalid per RFC.
I suggest changing the behaviour of laxmid so that innd accepts even
more Message-IDs. For instance, in the common dot-atom-text syntax, just check that we have "<", at least one non-special char, "@", at least one non-special char, and ">".
no-fold-literal is kept untouched but dot-atom-text is changed.
The syntax per RFC is:
msg-id = "<" msg-id-core ">"
msg-id-core = id-left "@" id-right
id-left = dot-atom-text
id-right = dot-atom-text / no-fold-literal
dot-atom-text = 1*atext *("." 1*atext)
no-fold-literal = "[" *mdtext "]"
mdtext = %d33-61 / ; The rest of the US-ASCII
%d63-90 / ; characters not including
%d94-126 ; ">", "[", "]", or "\"
atext = ALPHA / DIGIT / ; Printable US-ASCII
"!" / "#" / ; characters not including
"$" / "%" / ; specials. Used for atoms.
"&" / "'" /
"*" / "+" /
"-" / "/" /
"=" / "?" /
"^" / "_" /
"`" / "{" /
"|" / "}" /
"~"
laxmid would accept for innd:
dot-atom-text = 1*(atext / "." / "@")
At least, I think it would cope with all Message-IDs in the wild. (Are
there ones without any "@" at all?)
As for nnrpd, laxmid would go on having the current behaviour of
allowing ".." and two "@" as this was a request in 2017 from a news
admin with users having broken posting agents sending such Message-IDs.
No need for now to allow the injection of even more broken Message-IDs.
Any thoughts about that change?
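To make that concrete, here is a rough Python sketch of the relaxed check (just an illustration of the grammar above with a simplified atext class, not the C code innd would use):

import re

# Illustration only: "<", then one or more atext characters, "." or "@" in
# any order and number, then ">".  Not INN's actual implementation.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
LAX_MSGID = re.compile(rf"^<(?:{ATEXT}|[.@])+>$")

for msgid in ("<3f71e4a7_3@aeinews.>", "<a.@b>", "<a@b.>", "<no.at.sign.at.all>"):
    print(msgid, bool(LAX_MSGID.match(msgid)))
# All four match, including the made-up last one without any "@".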
laxmid would accept for innd:
dot-atom-text = 1*(atext / "." / "@")
At least, I think it would cope with all Message-IDs in the wild.
As for nnrpd, laxmid would go on having the current behaviour of
allowing ".." and two "@" as this was a request in 2017 from a news
admin with users having broken posting agents sending such Message-IDs.
No need for now to allow the injection of even more broken Message-IDs.
Any thoughts about that change?
Bringing this up again because I have found BNews Message-IDs cannot be injected without modification, and there are a ton of them from various sources I'd rather not attempt to modify. One source is online via NNTP, so it would be easy to use pullnews or suck, which would be the path of least resistance.
<bnews.sri-unix.2509> - 435 Syntax error in message-ID
Hi Jesse,
laxmid would accept for innd:
dot-atom-text = 1*(atext / "." / "@")
At least, I think it would cope with all Message-IDs in the wild.
As for nnrpd, laxmid would go on having the current behaviour of
allowing ".." and two "@" as this was a request in 2017 from a news
admin with users having broken posting agents sending such Message-IDs.
No need for now to allow the injection of even more broken Message-IDs.
Any thoughts about that change?
Bringing this up again because I have found BNews Message-IDs cannot be
injected without modification, and there are a ton of them from various
sources I'd rather not attempt to modify. One source is online via NNTP, so it would be easy to use pullnews or suck, which would be the path of least resistance.
<bnews.sri-unix.2509> - 435 Syntax error in message-ID
Sorry for having forgotten your request. I bet I was waiting for your approval
of my suggested change (innd would accept 0 to 2 '@', but not nnrpd, whose behaviour would remain unchanged) before starting to work on it.
I think the following patch will work:
--- a/lib/messageid.c
+++ b/lib/messageid.c
@@ -127,8 +127,8 @@ InitializeMessageIDcclass(void)
 ** When stripspaces is true, whitespace at the beginning and at the end
 ** of MessageID are discarded.
 **
-** When laxsyntax is true, '@' can occur twice in MessageID, and '..' is
-** also accepted in the left part of MessageID.
+** When laxsyntax is true, '@' can occur twice in MessageID, or never occur,
+** and '..' is also accepted in the left part of MessageID.
 */
 bool
 IsValidMessageID(const char *MessageID, bool stripspaces, bool laxsyntax)
@@ -155,6 +155,12 @@ IsValidMessageID(const char *MessageID, bool stripspaces, bool laxsyntax)
     /* Scan local-part: "<dot-atom-text". */
     if (*p++ != '<')
         return false;
+
+    /* In case there's no '@' in the Message-ID and laxsyntax is set, just
+     * check the syntax of the Message-ID as though it had no left part. */
+    if (laxsyntax && strchr((const char *) p, '@') == NULL)
+        return IsValidRightPartMessageID((const char *) p, stripspaces, true);
+
     for (;; p++) {
         if (midatomchar(*p)) {
             while (midatomchar(*++p))
--- a/nnrpd/post.c
+++ b/nnrpd/post.c
@@ -471,6 +471,10 @@ ProcessHeaders(char *idbuff, bool needmoderation)
     if (!IsValidMessageID(HDR(HDR__MESSAGEID), true, laxmid)) {
         return "Can't parse Message-ID header field body";
     }
+    /* Do not accept a Message-ID without an '@', even if laxmid is set. */
+    if (laxmid && strchr(HDR(HDR__MESSAGEID), '@') == NULL) {
+        return "Missing @ in Message-ID header field body";
+    }
 
     /* Set the Path header field. */
     if (HDR(HDR__PATH) == NULL || PERMaccessconf->strippath) {
If you can confirm it suits your need, and you are now able to inject BNews articles downloaded by pullnews, it would be great.
I'll also add a note in the documentation to warn that when laxmid is set, remote peers may reject articles with a syntactically invalid Message-ID.
This does get past the Message-ID header issue, but presents a new one with the Date header.
437 Bad "Date" header field -- "Fri Jul 9 03:46:46 1982"
I was looking at lib/date.c, but it's a bit complex for me to digest. I
see references in comments to "lax mode" and I'm not sure if this is an undocumented option or maybe something that was removed in the past?
innd always uses lax mode for date parsing.
That date is in ctime(3) format, which isn't supported by INN even in lax mode. That format was already forbidden in the first article format
standard (RFC 850) from June 1983, and a lot of articles before that are
in the completely incompatible A News format that INN has never attempted
to parse. It looks like you have a transitional article that is in B News format but is still using the ctime(3) format for Date.
RFC 850 says:
Note in particular that ctime format:
Wdy Mon DD HH:MM:SS YYYY
is not acceptable because it is not a valid ARPANET date.
However, since older software still generates this format,
news implementations are encouraged to accept this format
and translate it into an acceptable format.
I wouldn't object to supporting this in INN in lax mode, but it's not entirely trivial to add without accidentally breaking something else
because the order of the date elements is significantly different than a standardized date. It would take someone a bit of time to figure out how
to safely incorporate it into parsedate_rfc5322_lax. (For example, the
code that skips over a leading day of the week would also currently skip
over the month.) It might be easiest to add a separate ctime parser and to just attempt a ctime parse whenever the date is otherwise invalid. I'm not sure how wide of a range of formats the old ctime dates came in.
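For what it's worth, the ctime(3) layout itself is easy to handle in isolation; the hard part is merging it into parsedate_rfc5322_lax without breaking the existing grammar. A rough Python sketch (nothing to do with INN's C parser) of a standalone ctime parse:

from datetime import datetime, timezone

def parse_ctime(value):
    # Parse a ctime(3)-style date such as "Fri Jul  9 03:46:46 1982".
    # ctime(3) pads the day of month with a space, so collapse whitespace
    # first.  The string carries no zone, so UTC is assumed here, which is
    # itself a guess for very old articles.
    normalized = " ".join(value.split())
    dt = datetime.strptime(normalized, "%a %b %d %H:%M:%S %Y")
    return dt.replace(tzinfo=timezone.utc)

print(parse_ctime("Fri Jul  9 03:46:46 1982"))
# 1982-07-09 03:46:46+00:00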
Mostly they've all been in the same format, except a few outliers that
have invalid Date and Posted headers, but do have a Date-Received header that's more appropriate.
The outliers seem to follow this pattern:
Date: Wed, 31-Dec-69 18:59:59 EDT
Posted: Wed Dec 31 18:59:59 1969
Date-Received: Sun, 28-Jul-85 00:57:37 EDT
Those I assume will require changing the Date header, at minimum before they'd
be accepted.
I did find a few more Message-ID variations that are rejected:
<[OFFICE-3]GVT-RICH-490UQ> - 435 Syntax error in message-ID
<366@mimir..dmt.oz> - 435 Syntax error in message-ID
<[MC.LCS.MIT.EDU].851959.860315.KFL> - 435 Syntax error in message-ID
Hi Jesse,
Date: Wed, 31-Dec-69 18:59:59 EDT
Posted: Wed Dec 31 18:59:59 1969
Date-Received: Sun, 28-Jul-85 00:57:37 EDT
Those I assume will require changing the Date header, at minimum before they'd
be accepted.
Wouldn't it be possible to somehow rewrite the Date header field before injecting the article? (The old one could be kept in an X-Date header field.) There may be news readers that are unable to parse them too.
In Python:

from email.utils import format_datetime, parsedate_to_datetime
format_datetime(parsedate_to_datetime("Wed, 31-Dec-69 18:59:59 EDT"))
'Wed, 31 Dec 1969 18:59:59 -0400'
I did find a few more Message-ID variations that are rejected:
<[OFFICE-3]GVT-RICH-490UQ> - 435 Syntax error in message-ID
<366@mimir..dmt.oz> - 435 Syntax error in message-ID
<[MC.LCS.MIT.EDU].851959.860315.KFL> - 435 Syntax error in message-ID
These are indeed invalid domain names. I am under the impression that
the laxsyntax check should just ensure there are 1 to 248 (authorized)
chars surrounded by brackets, without verifying the number, order and
place of '.', '[', ']', etc.
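If it helps to picture it, a rough Python equivalent of that even laxer rule could be the following (an illustration only, not the C check; the 248 comes from the usual 250-octet limit on msg-id including the angle brackets):

import re

# "<", then 1 to 248 printable US-ASCII characters other than "<", ">" and
# whitespace, then ">".  Sketch of the idea, not INN's implementation.
VERY_LAX_MSGID = re.compile(r"^<[\x21-\x3b\x3d\x3f-\x7e]{1,248}>$")

for msgid in ("<[OFFICE-3]GVT-RICH-490UQ>",
              "<366@mimir..dmt.oz>",
              "<[MC.LCS.MIT.EDU].851959.860315.KFL>"):
    print(msgid, bool(VERY_LAX_MSGID.match(msgid)))
# All three of the rejected examples above would match.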
Date: Wed, 31-Dec-69 18:59:59 EDT
Wouldn't it be possible to somehow rewrite the Date header field before
injecting the article? (The old one could be kept in an X-Date header
field.) There may be news readers that are unable to parse them too.
Will have to do something to bring these articles in. Since these exist on a server that speaks (limited) NNTP, I am trying to bring in as many as possible
without having to download, modify, and inject.
Wouldn't it be possible to somehow rewrite the Date header field before injecting the article? (The old one could be kept in an X-Date header field.) There may be news readers that are unable to parse them too.
In Python:
from email.utils import format_datetime, parsedate_to_datetime
format_datetime(parsedate_to_datetime("Wed, 31-Dec-69 18:59:59 EDT"))
'Wed, 31 Dec 1969 18:59:59 -0400'
As your use is very specific, and writing a proper parser for that kind of date is time-consuming and complicated (as Russ told us), I wonder whether the faster approach wouldn't be to add a new level of control in syntaxchecks:
syntaxchecks: [ laxmid laxdate ]
which would just take the current time (= arrival time) as the posting
date when the Date header field exists and is not parseable. The
information is recorded in the history file and overview for expiry
purpose, so it shouldn't break anything as far as I see.
Julien <iulius@nom-de-mon-site.com.invalid> writes:
As your use is very specific, and writing a proper parser for that kind of date is time-consuming and complicated (as Russ told us), I wonder whether the faster approach wouldn't be to add a new level of control in syntaxchecks:
syntaxchecks: [ laxmid laxdate ]
which would just take the current time (= arrival time) as the posting
date when the Date header field exists and is not parseable. The information is recorded in the history file and overview for expiry purpose, so it shouldn't break anything as far as I see.
Maybe "ignoredate" in this case instead of "laxdate"? This is a bit
different than what "laxmid" means (ignore invalid message IDs and use the >message ID anyway), since it's ignoring the date entirely for the purposes
of all the other things INN does with dates. That way we would keep
laxdate in case we ever want to enable strict standards-enforcing date >parsing or provide a different laxer date parser that understands ctime(3) >and dashes, etc.
The actual field stored in overview for clients will still be the value of >the Date header so far as I can see, so that will be invalid (not parsable
by clients) unless you put something different into the overview. That's >already the case with the existing lax date parsing, so might not matter
and will be true for any of the proposals for handling old dates. I'm not >sure how many clients try to parse the date information in overview and do >something with it.
Russ Allbery <eagle@eyrie.org> wrote:
Julien <iulius@nom-de-mon-site.com.invalid> writes:
As your use is very specific, and writing a proper parser for that kind of date is time-consuming and complicated (as Russ told us), I wonder whether the faster approach wouldn't be to add a new level of control in syntaxchecks:
syntaxchecks: [ laxmid laxdate ]
which would just take the current time (= arrival time) as the posting
date when the Date header field exists and is not parseable. The
information is recorded in the history file and overview for expiry
purpose, so it shouldn't break anything as far as I see.
Maybe "ignoredate" in this case instead of "laxdate"? This is a bit
different than what "laxmid" means (ignore invalid message IDs and use the >> message ID anyway), since it's ignoring the date entirely for the purposes >> of all the other things INN does with dates. That way we would keep
laxdate in case we ever want to enable strict standards-enforcing date
parsing or provide a different laxer date parser that understands ctime(3) >> and dashes, etc.
The actual field stored in overview for clients will still be the value of the Date header so far as I can see, so that will be invalid (not parsable by clients) unless you put something different into the overview. That's
already the case with the existing lax date parsing, so might not matter
and will be true for any of the proposals for handling old dates. I'm not
sure how many clients try to parse the date information in overview and do something with it.
Could I make a suggestion as a nonprogrammer but someone who has spent countless hours putting data into a consistent pattern or good syntax?
Think of the Date header as temporary, and that at some point it might
be nice if it reflected the original date but in modern syntax. I'm suggesting this because a Date header that reflects when the article was appended to
an archive is not going to be helpful to a newsreader when it comes to sorting. And a whole lot of articles are going to share an identical
Date header, since articles are going to be appended in huge batches.
Add a special X- header with an explicit value reflecting where the
article came from, in a specific pattern, so it can be readily found.
Then another X- header with a preliminary analysis of the Date string.
In the example
Date: Wed, 31-Dec-69 18:59:59 EDT
Leave the punctuation as is. Look at the alphanumeric pattern: X =
capital letter, x = lower-case letter, N = numeral, _ = space:

X-Date-Pattern: Xxx,_NN-Xxx-NN_NN:NN:NN_XXX
Some time later, perhaps a year later, someone might analyze this. If a three-alpha group recognizable as a day is in the first Xxx group (which
might be an XXX), then it's a day. It might have an optional "." and the "," isn't always going to be present as a separator. Similarly, the second three-alpha group might be a month.
The two-digit year could be confused with a time element but there
probably won't be dates prior to the Unix epoch.
The three-alpha time zone isn't necessarily unique worldwide, but prior
to a certain date we know that the articles were from the United States
only.
If the transformation into modern syntax results in a logical day-date combination, then that's passed one test that the transformation was
valid.
There's going to be a lot of eyeballing necessary but this could
possibly be a way to choose which articles have dates requiring further analysis.
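A rough sketch of that fingerprinting idea (just an illustration of the mapping described above, in Python since that is what has been used elsewhere in this thread):

def date_pattern(value):
    # Map a raw Date string to the pattern alphabet above:
    # X = capital letter, x = lower-case letter, N = numeral, _ = space;
    # all other punctuation is left as-is.
    out = []
    for ch in value:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("N")
        elif ch == " ":
            out.append("_")
        else:
            out.append(ch)
    return "".join(out)

print(date_pattern("Wed, 31-Dec-69 18:59:59 EDT"))
# Xxx,_NN-Xxx-NN_NN:NN:NN_XXX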
take the current time (= arrival time) as the posting
date when the Date header field exists and is not parseable.
Maybe "ignoredate" in this case instead of "laxdate"? This is a bit
different than what "laxmid" means (ignore invalid message IDs and use the message ID anyway), since it's ignoring the date entirely for the purposes
of all the other things INN does with dates.
The actual field stored in overview for clients will still be the value of the Date header so far as I can see, so that will be invalid (not parsable
by clients) unless you put something different into the overview. That's already the case with the existing lax date parsing, so might not matter
and will be true for any of the proposals for handling old dates. I'm not sure how many clients try to parse the date information in overview and do something with it.
One of the ultimate goals of my archive is to sort the history file by posted date and re-feed to another INN instance, so article numbering in the 'final' archive is chronological. Though, I'm starting to think I'll never get to that
point. :-)
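(Something like this rough sketch is what I have in mind, assuming the usual history text layout of hash, arrival~expires~posted and storage token separated by tabs, with "-" for unknown fields; it would need checking against the real file before trusting it:)

def posted_time(line):
    # Second tab-separated field is arrival~expires~posted; fall back to the
    # arrival time when the posted subfield is missing or "-".
    fields = line.rstrip("\n").split("\t")
    times = fields[1].split("~")
    posted = times[2] if len(times) > 2 and times[2] != "-" else times[0]
    return int(posted)

with open("history") as fh:
    entries = [line for line in fh if "\t" in line]

for line in sorted(entries, key=posted_time):
    print(line, end="")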
Julien's suggestion would work to get the articles injected to the spool, but could present other issues, primarily with article numbering.
When "Billy G." announced they had the archive.org Usenet content available via NNTP, I was elated as it would save a ton of work, but between the Date issue, and a lot of articles having duplicated headers that INN won't accept (not sure if this is caused by his import process or if the source material has duplicate headers), I'm starting to think I need to go back to dealing with the source material directly. They built their own NNTP implementation for this purpose, and I didn't think about 'complaince' of the content initially.
That leaves me a few battles to win. Like Adam, I do not have a programmer's mindset, so dealing with detecting date format issues and duplicated headers isn't straightforward for me, at least not when dealing with hundreds of millions of articles.
Hi Jesse,
One of the ultimate goals of my archive is to sort the history file by
posted date and re-feed to another INN instance, so article numbering
in the 'final' archive is chronological. Though, I'm starting to think
I'll never get to that point. :-)
Oh yes, I'm sorry about that. Indeed, if INN does not know how to parse
the Date header field, it won't be able to store the actual posting date
in the history file.
Incidentally, I am really unsure that Diablo performs better. Its
parsedate function only has 40 lines and does not handle dashes, so you
won't have the posting date either.
https://github.com/jpmens/diablo/blob/master/lib/parsedate.c
If someone has the time and the skill to write a decent parser in C to
decode dates in ctime(3) format, we could add it to INN and achieve your dream :)
Julien's suggestion would work to get the articles injected to the
spool, but could present other issues, primarily with article
numbering.
Are you still interested in the ignoredate setting then?
As for laxmid, I think the change we discussed is still worthwhile to
have.
That leaves me a few battles to win. Like Adam, I do not have a
programmer's mindset, so dealing with detecting date format issues and
duplicated headers isn't straightforward for me, at least not when
dealing with hundreds of millions of articles.
I would tend to think that the best move would be for Billy's news
server to do the translation job and provide syntactically valid news articles per the current Netnews standard. It would achieve
interoperability with current news readers and news servers.
I totally agree with Adam, who recommends that the Date header field
reflect the original date, but in modern syntax. That's the point
of data conservation. One should ensure that old material is still
readable by modern software, so header fields should be adapted. The
point of a readable archive is to provide access to messages and notably their contents (body), not to have difficulties in sorting them, etc.
Of course, the original header fields and removed duplicated ones could
still be provided in X- header fields for the ones interested in seeing
the original material without modification.
Think about the videos of your childhood or of your grand-parents. The important thing is probably not duplicating the magnetic VHS or Super 8mm
contents in the same format, but having them in a modern and
still viewable format (though the overall quality may have decreased
because of digitisation artifacts). Good for you, of course, if you still
have the appropriate obsolete hardware able to read them in their original form, but it will be less and less practical and you'll have to keep
it working or find a compatible one second-hand.

Likewise for old videos or music encoded in an obscure codec format from a
Windows 95 codec pack, or documents written with software that no longer exists. The important thing (at least to me) is that they are converted for modern software so as not to lose them for good...
The same goes for old A News or B News article formats which somehow
need a bit of translation.
a lot of articles having duplicated headers that INN won't accept
(not sure if this is caused by his import process or if the source material has duplicate headers)
Do you have some examples of articles with duplicate headers?
Path should show where article came from (eg: !any-name.mbox.zip/gz)
I have date parsing in Go and it's a mess.
Many old articles are using any format you can think of...
and INN does not like many of the older ones.
That's why I wrote my own less strict server to get it all in.
I scanned the archive vs blueworld (last year) and sucked everything
from your server to the archive. Wasn't that much.
Another feature my server can do is re-ordering the overview, which I
already did because I believe there is not much more to find.
It's all in go-pugleaf databases (sqlite3) per newsgroup too.
Headers and body, easily parseable into any format we need.

Proper date parsing exists, and manipulating the Date: header to be
X-Date and injecting a valid RFC date while sending, wherever the
header is wrongly formatted, should be easy - but some wrong date headers
are spam from badly written tools and could be discarded.
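A rough sketch of that Date to X-Date rewrite (not go-pugleaf code; just the idea in Python with email.utils, like the earlier example in this thread, and sketch-level only since real articles need byte-safe handling of continuation lines):

from email import message_from_string
from email.utils import format_datetime, parsedate_to_datetime

def fix_date(article_text):
    # Rename a non-RFC Date header to X-Date and inject a reformatted one.
    msg = message_from_string(article_text)
    raw = msg.get("Date")
    if raw is None:
        return article_text
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return article_text  # unparseable even for Python: leave for manual review
    msg["X-Date"] = raw
    del msg["Date"]
    msg["Date"] = format_datetime(parsed)
    return msg.as_string()

print(fix_date("Date: Wed, 31-Dec-69 18:59:59 EDT\nFrom: a@b\n\nbody\n"))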
I have date parsing in Go and it's a mess.
What is interesting is that a lot of the articles from 1981-1982 in Billy's archive have a valid Date header and also have these headers:
X-Google-Info: Converted from the original A-News header
X-Google-Info: Converted from the original B-News header
Yet, there are a fair number of articles that have the date issue, and I
need to purge my history file of rejected articles and re-run the suck
where the Date header caused a lot of rejections.
I didn't grab the output for Message-IDs to inspect the articles in
depth. It is possible they are duplicates with different Message-IDs as
I have a good amount of articles from the same time period in my spool.
I remember when looking at the Utzoo archive that they didn't have
Message-ID headers, but Article-ID headers that weren't compatible, so
either Google or someone else converted those.
The format above is the most common. Then there's a handful of articles
whose real date is impossible to determine:
Aug 30 13:38:03.129 - localhost <369@psivax.UUCP> 437 Bad "Date" header
field -- "Wed, 31-Dec-69 18:59:59 EST"
Aug 30 13:41:24.347 - localhost <702@mmintl.UUCP> 437 Bad "Date" header
field -- "Wed, 31-Dec-69 18:59:59 EDT"
Aug 30 13:42:35.106 - localhost <1305@mtgzz.UUCP> 437 Bad "Date" header
field -- "Wed, 31-Dec-69 18:59:59 EDT"
Aug 30 13:47:25.759 - localhost <1639@qubix.UUCP> 437 Bad "Date" header
field -- "Wed, 31-Dec-69 18:59:59 EDT"
Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:
What is interesting is a lot of the articles from 1981-1982 in Billy's
archive with a valid Date header and also have these headers:
X-Google-Info: Converted from the original A-News header
X-Google-Info: Converted from the original B-News header
Yet, there are a fair amount of articles that have the date issue and I
need to purge my history file of rejected articles and re-run the suck
where the Date header caused a lot of rejections.
My recollection is that the team at Google (maybe it was at DejaNews?)
that did this ingestion started with INN at the time, which was probably
INN 1.x or at least before I rewrote parsedate, and thus probably only rewrote the Date headers that failed with the original yacc-based INN
date parser that I think might have been copied from C News.
I added support for every date format in an article that we had on
Stanford's spool at the time, but I seem to recall I didn't attempt to support every date format the C News parser supported. (I think it was originally based on some other yacc date parser from somewhere else? My memory on all of this is pretty vague since this was 15 or 20 years ago
at least, so someone should check me before relying on any of this.) It accepted all sorts of interesting stuff.
I didn't grab the output for Message-IDs to inspect the articles in
depth. It is possible they are duplicates with different Message-IDs as
I have a good amount of articles from the same time period in my spool.
I remember when looking at the Utzoo archive that they didn't have
Message-ID headers, but Article-ID headers that weren't compatible, so
either Google or someone else converted those.
Google (or DejaNews) ingested a bunch of A-News articles and those *definitely* require conversion (they don't look anything like a modern article), so they definitely wrote a converter.
I have date parsing in Go and it's a mess.
Many old articles are using any format you can think of...
and INN does not like many of the older ones.
That's why I wrote my own less strict server to get it all in.
Would it be possible through filter_innd.pl to take the value of X-Google-ArrivalTime and replace the Date header with that value?
Also, I'm trying to strip the X-Google-Attributes and X-Google-Thread
headers from all articles, but cannot get it to work. I've added these
header values in innd/innd.c, so they are available to $hdr, but can't
seem to unset them, or I'm not doing it in a way that works with the rest
of cleanfeed. The basic code works to unset headers when a user posts
through filter_nnrpd.pl, but doesn't seem to in filter_innd.pl?
Also, I'm trying to strip the X-Google-Attributes and X-Google-Thread headers from all articles, but cannot get it to work.
Aug 30 13:41:24.347 - localhost <702@mmintl.UUCP>
437 Bad "Date" header field -- "Wed, 31-Dec-69 18:59:59 EDT"
Aug 30 22:49:59.188 - localhost <27375@philabs.UUCP>
437 Duplicate "Path" header field
Aug 30 23:58:57.078 - localhost <426@novavax.UUCP>
437 Duplicate "Date" header field
On 02.09.25 19:42, Jesse Rehmer wrote:
Also, I'm trying to strip the X-Google-Attributes and X-Google-Thread headers from all articles, but cannot get it to work.
Don't waste your time trying.
I can remove headers on-the-fly in Go catching continued lines too.
I gave up at some point trying to import all of the old stuff into INN.
Got too many declines, and you'd have to write code for each article not
going in... check why, what's wrong, and think about how to fix it...