• Re: ASCII to ASCII compression.

    From bart@21:1/5 to Malcolm McLean on Thu Jun 6 17:55:58 2024
    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    What's the problem with compressing to binary (using existing, efficient utilities), then turning that binary into ASCII (like Mime or Base64)?
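    As a minimal sketch of that pipeline (assuming zlib is available for the
    compression step; the Base64 encoder below is deliberately bare-bones with
    no line wrapping, and an input as short as the example line will come out
    longer, not shorter):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    static const char b64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /* Encode n bytes as Base64 and print them. */
    static void base64_print(const unsigned char *p, size_t n)
    {
        for (size_t i = 0; i < n; i += 3) {
            unsigned v = p[i] << 16;
            if (i + 1 < n) v |= p[i + 1] << 8;
            if (i + 2 < n) v |= p[i + 2];
            putchar(b64[(v >> 18) & 63]);
            putchar(b64[(v >> 12) & 63]);
            putchar(i + 1 < n ? b64[(v >> 6) & 63] : '=');
            putchar(i + 2 < n ? b64[v & 63] : '=');
        }
        putchar('\n');
    }

    int main(void)
    {
        const char *text = "Mary had a little lamb, its fleece was white as snow";
        uLong srclen = (uLong)strlen(text);
        uLongf destlen = compressBound(srclen);
        unsigned char *dest = malloc(destlen);

        if (!dest || compress(dest, &destlen, (const Bytef *)text, srclen) != Z_OK)
            return 1;
        base64_print(dest, destlen);   /* a line this short will actually expand */
        free(dest);
        return 0;
    }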

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Malcolm McLean on Thu Jun 6 17:56:54 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    I must not be smart as I can't see any connection to the topic of this
    group!

    Is there a compression algorithm which converts human language ASCII text
    to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    Obviously such algorithms exist. One that is used a lot is just base64 encoding of binary compressed text, but that won't beat something
    specifically crafted for the task which is presumably what you are
    asking for. I don't know of anything aimed at that task specifically.

    One thing you should specify is whether you need it to work on small
    texts, or, even better, at what sort of size you want the pay-off to
    start to kick in. For example, the xz+base64 encoding of the complete
    works of Shakespeare is still less than 40% of the size of the original
    but your single line will end up much larger using that off-the-shelf
    scheme.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to bart on Thu Jun 6 20:02:55 2024
    On Thu, 6 Jun 2024 17:55:58 +0100
    bart <bc@freeuk.com> wrote:

    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see
    the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language
    ASCII text to compressed ASCII, preferably only "isgraph"
    characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    What's the problem with compressing to binary (using existing,
    efficient utilities), then turning that binary into ASCII (like Mime
    or Base64)?


    Or, if we want to make a job just a little bit more interesting, we can
    convert to base94, producing ~9% smaller size than base94 :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Malcolm McLean on Thu Jun 6 18:15:43 2024
    On 2024-06-06, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    That could be any algorithm, followed by encoding of binary data
    to ASCII.

    https://en.wikipedia.org/wiki/Binary-to-text_encoding

    That page has a table of various schemes and their packing density.

    Some like Base91 or Base94 use almost all the printable characters
    and have better than 80% coding efficiency.



    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul@21:1/5 to Malcolm McLean on Thu Jun 6 15:23:33 2024
    On 6/6/2024 12:25 PM, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?


    The purpose of doing this, is to satisfy transmission through a 7 bit channel. In the history of networking, not all channels were eight-bit transparent.
    (On the equipment in question, this was called "robbed-bit signaling".)
    For example, BASE64 is valued for its 7 bit channel properties, the ability
    to pass through a pipe which is not 8 bit transparent. Even to this day,
    your email attachments may traverse the network in BASE64 format.

    That is one reason, that email or USENET clients to this day, have
    both 7 bit and 8 bit content encoding methods. It's to handle the
    unlikely possibility that 7 bit transmission channels still exist.
    They likely do exist.

    Paul

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From bart@21:1/5 to Malcolm McLean on Thu Jun 6 22:49:33 2024
    On 06/06/2024 22:26, Malcolm McLean wrote:
    On 06/06/2024 20:23, Paul wrote:
    On 6/6/2024 12:25 PM, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the
    relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?


    The purpose of doing this, is to satisfy transmission through a 7 bit
    channel.
    In the history of networking, not all channels were eight-bit
    transparent.
    (On the equipment in question, this was called "robbed-bit signaling".)
    For example, BASE64 is valued for its 7 bit channel properties, the
    ability
    to pass through a pipe which is not 8 bit transparent. Even to this day,
    your email attachments may traverse the network in BASE64 format.

    That is one reason, that email or USENET clients to this day, have
    both 7 bit and 8 bit content encoding methods. It's to handle the
    unlikely possibility that 7 bit transmission channels still exist.
    They likely do exist.

    Yes. If you store data as 8-bit binaries then it's inherently risky.
    There's usually no recovery from a single bit getting corrupted.

    Whilst if you store as ASCII, the data can usually be recovered very
    easily if something goes wrong with the physical storage. "And God said"
    becomes "And G$d said", and even with this tiny text, you can still read
    it perfectly well.

    But you are suggesting storing the compressed data as meaningless ASCII
    such as:

    QWE£$543GtT£$"||x|VVBB?

    If one bit gets flipped, then it will just be slightly different
    meaningless ASCII; there's no way to detect it except checksums, CRCs
    and the like.

    In any case, the error detection won't be done by a human, but machine.

    Possibly a human might detect, when back in plain text, that 'Mary hid a
    little lamb' should have been 'had', but now this is getting silly,
    needing to rely on knowledge of nursery rhymes.

    Trillions of bytes of binary data must be transmitted every day (perhaps
    every minute; I've no idea); how often have you encountered a
    transmission error?

    Compression schemes tend to have error-detection built-in; I'm sure
    comms do as well, as well as storage device controllers and drivers.
    People have this sort of thing in hand already!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mikko@21:1/5 to Malcolm McLean on Fri Jun 7 07:47:00 2024
    On 2024-06-06 16:25:37 +0000, Malcolm McLean said:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    There are compression algorithms that can be adapted to any possible
    size of input and output character sets, including that both are
    ASCII and that the output character set is a subset of the input set.

    Restricting the input set to ASCII may be too strong. Files that should
    be ASCII files sometimes contain non-ASCII bytes. The output should be
    restricted to the 94 visible characters but the decompressor should
    accept at least full ASCII and skip the invalid characters as
    insignificant. That permits addition of line breaks and perhaps other
    spaces that could be useful, for example when the file is printed for
    debugging.

    --
    Mikko

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mikko@21:1/5 to Malcolm McLean on Fri Jun 7 08:03:04 2024
    On 2024-06-06 19:02:56 +0000, Malcolm McLean said:

    On 06/06/2024 17:55, bart wrote:
    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the
    relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    What's the problem with compressing to binary (using existing,
    efficient utilities), then turning that binary into ASCII (like Mime or
    Base64)?

    Because if a single bit flips in a zip archive, it's likely the entire archive will be lost. This scheme is robust. We can embed compressed
    text in programs, and if it is corrupted, only a single line will become unreadable.

    The purpose of compression is to remove redundancy, and with it the
    possibility of detecting and correcting errors. If an error tolerance is
    needed then that must be added
    after the compression. The best solution is to use the best compression
    and then the best error checking. The meaning of the latter "best" depends
    on the requirements on reliability and compression. In any case there is
    no hard limit to the amount of possible undetected corruption.

    --
    Mikko

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mikko@21:1/5 to Malcolm McLean on Fri Jun 7 08:20:03 2024
    On 2024-06-06 19:09:03 +0000, Malcolm McLean said:

    On 06/06/2024 17:56, Ben Bacarisse wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    Not strictly a C programming question, but smart people will see the
    relevance to the topic, which is portability.

    I must not be smart as I can't see any connection to the topic of this
    group!

    Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    Obviously such algorithms exist. One that is used a lot is just base64
    encoding of binary compressed text, but that won't beat something
    specifically crafted for the task which is presumably what you are
    asking for. I don't know of anything aimed at that task specifically.

    One thing you should specify is whether you need it to work on small
    texts, or, even better, at what sort of size you want the pay-off to
    start to kick in. For example, the xz+base64 encoding of the complete
    works of Shakespeare is still less than 40% of the size of the original
    but your single line will end up much larger using that off-the-shelf
    scheme.

    What I was thinking of was using Huffman codes to convert ASCII to a
    string of bits.

    Works if one knows, at the time one makes one's compression and
    decompression algorithms, how often each short sequence of characters
    will be used in the files that will be compressed. If you have an
    adaptive Huffman coding (or any other adaptive coding) a single error
    will corrupt the rest of your line. If you reset the adaptation at the
    end of each line it does not adapt well and the result is not much
    better than without adaptation. If you reset the adaptation at the
    end of each page you can have better compression but an error corrupts
    the rest of the page.

    For ordinary texts (except short ones) and many other purposes Lempel-Ziv
    and its variants work better than Huffman.

    --
    Mikko

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Malcolm McLean on Fri Jun 7 11:36:43 2024
    On 06/06/2024 21:02, Malcolm McLean wrote:
    On 06/06/2024 17:55, bart wrote:
    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the
    relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    What's the problem with compressing to binary (using existing,
    efficient utilities), then turning that binary into ASCII (like Mime
    or Base64)?

    Because if a single bit flips in a zip archive, it's likely the entire archive will be lost. This scheme is robust. We can emed compressed text
    in programs, and if it is corruped, only a single line will become unreadable.

    Ah, you want something that will work like your newsreader program that randomly changes letters or otherwise corrupts your spelling while
    leaving most of it readable? :-)

    Pass the data through a compressor and then add forward error correction mechanisms such as Reed-Solomon codes. Then convert to ASCII base64 or similar.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mikko@21:1/5 to Malcolm McLean on Fri Jun 7 15:52:25 2024
    On 2024-06-07 09:00:57 +0000, Malcolm McLean said:

    On 07/06/2024 06:20, Mikko wrote:
    On 2024-06-06 19:09:03 +0000, Malcolm McLean said:

    On 06/06/2024 17:56, Ben Bacarisse wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    Not strictly a C programming question, but smart people will see the
    relevance to the topic, which is portability.

    I must not be smart as I can't see any connection to the topic of this
    group!

    Is there a compression algorithm which converts human language ASCII text
    to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    Obviously such algorithms exist. One that is used a lot is just base64
    encoding of binary compressed text, but that won't beat something
    specifically crafted for the task which is presumably what you are
    asking for. I don't know of anything aimed at that task specifically.

    One thing you should specify is whether you need it to work on small
    texts, or, even better, at what sort of size you want the pay-off to
    start to kick in. For example, the xz+base64 encoding of the complete
    works of Shakespeare is still less than 40% of the size of the original
    but your single line will end up much larger using that off-the-shelf
    scheme.

    What I was thinking of was using Huffman codes to convert ASCII to a
    string of bits.

    Works if one knows, at the time one makes one's compression and
    decompression algorithms, how often each short sequence of characters
    will be used in the files that will be compressed. If you have an
    adaptive Huffman coding (or any other adaptive coding) a single error
    will corrupt the rest of your line. If you reset the adaptation at the
    end of each line it does not adapt well and the result is not much
    better than without adaptation. If you reset the adaptation at the
    end of each page you can have better compression but an error corrupts
    the rest of the page.

    For ordinary texts (except short ones) and many other purposes Lempel-Ziv
    and its variants work better than Huffman.

    Yes, but Huffman is easy to decode. It's the sort of project you give
    to people who have just got past the beginner stage but aren't very experienced programmers yet, whilst implementing Lempel-Ziv is a job
    for someone who knows what he is doing.

    Because the lines will often be very short, adaptive Huffman coding is
    no good. I need a fixed Huffman table with 128 entries, one for each
    7-bit value, plus one for "stop". I wonder if any such standard table exists.

    You don't need a standard table. You need statistics. Once you have the statistics the table is easy to construct with Huffman's algorithm.
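
    As a rough sketch of that step (the frequencies here are whatever counts
    you have gathered from representative text; 129 symbols covers the 128
    ASCII values plus the "stop" code mentioned elsewhere in the thread, and
    turning the resulting lengths into canonical codes is a separate,
    mechanical step):

    #include <stdio.h>

    #define NSYM 129   /* 128 ASCII values plus a "stop" symbol */

    /* Compute Huffman code lengths by repeatedly merging the two lightest
       live nodes.  O(NSYM^2), which is fine for 129 symbols. */
    static void huffman_lengths(const unsigned long freq[NSYM], int len[NSYM])
    {
        unsigned long weight[2 * NSYM];
        int parent[2 * NSYM], alive[2 * NSYM], nodes = NSYM;

        for (int i = 0; i < NSYM; i++) {
            weight[i] = freq[i] ? freq[i] : 1;  /* unseen symbols still get a code */
            alive[i] = 1;
            parent[i] = -1;
        }
        for (int merges = 0; merges < NSYM - 1; merges++) {
            int a = -1, b = -1;
            for (int i = 0; i < nodes; i++) {   /* find the two lightest live nodes */
                if (!alive[i]) continue;
                if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
                else if (b < 0 || weight[i] < weight[b]) { b = i; }
            }
            weight[nodes] = weight[a] + weight[b];
            alive[nodes] = 1;
            parent[nodes] = -1;
            alive[a] = alive[b] = 0;
            parent[a] = parent[b] = nodes;
            nodes++;
        }
        for (int i = 0; i < NSYM; i++) {        /* a leaf's code length is its depth */
            int d = 0;
            for (int p = parent[i]; p != -1; p = parent[p]) d++;
            len[i] = d;
        }
    }

    int main(void)
    {
        const char *sample = "Mary had a little lamb, its fleece was white as snow";
        unsigned long freq[NSYM] = {0};
        int len[NSYM];

        for (const char *p = sample; *p; p++)
            freq[(unsigned char)*p & 127]++;
        freq[128] = 1;                          /* the "stop" symbol */
        huffman_lengths(freq, len);
        printf("code length for 'e': %d bits\n", len['e']);
        return 0;
    }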

    --
    Mikko

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Malcolm McLean on Fri Jun 7 16:45:03 2024
    On 07/06/2024 14:43, Malcolm McLean wrote:
    On 07/06/2024 10:36, David Brown wrote:
    On 06/06/2024 21:02, Malcolm McLean wrote:
    On 06/06/2024 17:55, bart wrote:
    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see
    the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language
    ASCII text to compressed ASCII, preferably only "isgraph" characters?
    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    What's the problem with compressing to binary (using existing,
    efficient utilities), then turning that binary into ASCII (like Mime
    or Base64)?

    Because if a single bit flips in a zip archive, it's likely the
    entire archive will be lost. This scheme is robust. We can emed
    compressed text in programs, and if it is corruped, only a single
    line will become unreadable.

    Ah, you want something that will work like your newsreader program
    that randomly changes letters or otherwise corrupts your spelling
    while leaving most of it readable?  :-)

    Pass the data through a compressor and then add forward error correction
    mechanisms such as Reed-Solomon codes.  Then convert to ASCII base64
    or similar.

    Yes, exactly.

    I want a system for compression which is robust to corruption, can be
    stored as text, and with a compressor / decompressor which can be
    written by a child hobby programmer with only a very little bit of
    experience of programming.


    That last "requirement" is completely unrealistic. Forget it. Then you already have a solution, as I outlined above.

    I don't think it is remotely helpful to have error correction in your
    format. You have to handle email extraordinarily badly to have any
    issues transferring 8-bit binary data, and you certainly won't have
    trouble if you are using Base64. Either the email will arrive
    correctly, or it will not arrive at all. At most, you could add a CRC
    check after compressing and before Base64 encoding.

    But then, I don't think any of this stuff will be remotely useful in
    practice. But if you are enjoying working on it, that's all the
    motivation and justification anyone could ever need. So if you want
    error correction and compression, that's fine.

    That's what I need for Baby X. The FileSystem XML files can get very
    large, and of course Baby X programmers are going to ask about
    compression. And I don't think there is an existing system, and so I
    shall devise one.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul@21:1/5 to bart on Fri Jun 7 11:22:12 2024
    On 6/6/2024 5:49 PM, bart wrote:
    On 06/06/2024 22:26, Malcolm McLean wrote:
    On 06/06/2024 20:23, Paul wrote:
    On 6/6/2024 12:25 PM, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?


    The purpose of doing this, is to satisfy transmission through a 7 bit channel.
    In the history of networking, not all channels were eight-bit transparent.
    (On the equipment in question, this was called "robbed-bit signaling".)
    For example, BASE64 is valued for its 7 bit channel properties, the ability
    to pass through a pipe which is not 8 bit transparent. Even to this day,
    your email attachments may traverse the network in BASE64 format.

    That is one reason, that email or USENET clients to this day, have
    both 7 bit and 8 bit content encoding methods. It's to handle the
    unlikely possibility that 7 bit transmission channels still exist.
    They likely do exist.

    Yes. If you store data as 8-bit binaries then it's inherently risky. There's usually no recovery from a single bit getting corrupted.

    Whilst if you store as ASCII, the data can usually be recovered very easily if something goes wrong with the physical storage. "And God said"
    becomes "And G$d said", and even with this tiny text, you can still read
    it perfectly well.

    But you are suggesting storing the compressed data as meaningless ASCII such as:

    QWE£$543GtT£$"||x|VVBB?

    If one bit gets flipped, then it will just be slightly different meaningless ASCII; there's no way to detect it except checksums, CRCs and the like.

    In any case, the error detection won't be done by a human, but machine.

    Possibly a human might detect, when back in plain text, that 'Mary hid a little lamb' should have been 'had', but now this is getting silly, needing to rely on knowledge of nursery rhymes.

    Trillions of bytes of binary data must be transmitted every day (perhaps every minute; I've no idea); how often have you encountered a transmission error?

    Compression schemes tend to have error-detection built-in; I'm sure comms do as well, as well as storage device controllers and drivers. People have this sort of thing in hand already!



    ZIP (of WinZIP fame) has a CRC computed per file. The decompression
    step will tell you if a file is corrupted. The column of CRC values
    is shown in some of the unpacking software (and if you run a CRC check separately on the file at a later date, you can compare).

    [Picture]

    https://i.postimg.cc/DwQgPQP3/ZIP-CRC-field.gif

    True repair capability requires a better code. The Reed-Solomon code David Brown mentions is an example of such a code. A three-dimensional version on CDs makes the CD very resistant to errors. By the time the Reed-Solomon code cannot repair a CD, the CD surface is so bad that the laser can no longer lock to the groove.
    Rather than Reed Solomon complaining it cannot correct the data, instead
    it is the optical drive reporting it cannot find the groove using the laser.

    Storage media also has repair capability. A typical SSD (NAND flash storage device)
    has 10% overhead for corrections. A 512-byte sector has an extra 51 bytes set aside for error correction. When your SSD slows down to 300MB/sec from 530MB/sec,
    that means that every sector being read had errors, and is being corrected by a processor inside the SSD drive. This is a "normal" state of affairs for TLC
    or QLC based drives. Some 2.5" flash devices, have a three core ARM processor, and at least one of the cores does error correction.

    But on an archival format with extreme compression, finding that "someone had wasted an extra 10% on error correction capability" would of course
    annoy a user expecting the extreme compression to save them money (for storage).

    When selecting a "scheme", you have to decide what kind of error-type you
    are protecting against.

    For example, on hard drives, someone postulated they were protecting against single-bit (independent, does not correlate with other single-bits) errors.
    The Fire codes (polynomial) were the result. There is some small probability
    of multiple bits (perhaps an error multiplication effect in the DSP-based
    data recovery on read). At the time, no one considered that a heavy-weight method
    was necessary.

    When you expect to be losing whole sectors, whole files, whole pieces of media, there are PAR codes for that. But these were determined to be not mathematically
    sound, so serious archival use might not use them. The idea would be, if an archive spanned ten CDs, you would burn one or two more CDs (generated by PAR), and if any of the twelve CDs total was bad, PAR could regenerate the
    missing information (if any). Of the 12 CDs, any two could go missing, and they could then be regenerated.

    A simpler to understand scheme is to burn duplicate CD copies of the same information.
    If you lose a CD, or if the media surface degrades completely, you have the second CD. And that does not involve any complex PAR method :-) It's easier
    for the human to understand.

    Paul

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Harnden@21:1/5 to Malcolm McLean on Fri Jun 7 20:06:09 2024
    On 07/06/2024 19:49, Malcolm McLean wrote:
    On 07/06/2024 13:52, Mikko wrote:
    On 2024-06-07 09:00:57 +0000, Malcolm McLean said:

    Yes, but Huffman is easy to decode. It's the sort of project you give
    to people who have just got past the beginner stage but aren't very
    experienced programmers yet, whilst implementing Lempel-Ziv is a job
    for someone who knows what he is doing.

    Because the lines will often be very short, adaptive Huffman coding
    is no good. I need a fixed Huffman table with 128 entries, one for each
    7-bit value, plus one for "stop". I wonder if any such standard table
    exists.

    You don't need a standard table. You need statistics. Once you have the
    statistics the table is easy to construct with Huffman's algorithm.

    No, you do. The text might be very short, like "Mary had a little lamb",
    and you will compress it because you know that you are being fed
    meaningful ASCII. For example, even this tiny fragment contains the
    letter "e", which would have a short Huffman code. And four a's and two
    t's, which are the third and the second most common letters. So it
    should compress.

    And we're compressing each line independently, and choosing a visually distinctive ASCII character as the line break. So anyone seeing the compressed data will immediately be able to home in on the line breaks,
    and will be able to fix any corruption without special tools.

    And you have a standard table which never changes. And so that makes the decompressor much easier to write.

    Will your Baby X be able to handle, say, UTF-8?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul@21:1/5 to Malcolm McLean on Fri Jun 7 16:57:08 2024
    On 6/7/2024 8:43 AM, Malcolm McLean wrote:
    On 07/06/2024 10:36, David Brown wrote:
    On 06/06/2024 21:02, Malcolm McLean wrote:
    On 06/06/2024 17:55, bart wrote:
    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    What's the problem with compressing to binary (using existing, efficient utilities), then turning that binary into ASCII (like Mime or Base64)?

    Because if a single bit flips in a zip archive, it's likely the entire archive will be lost. This scheme is robust. We can emed compressed text in programs, and if it is corruped, only a single line will become unreadable.

    Ah, you want something that will work like your newsreader program that randomly changes letters or otherwise corrupts your spelling while leaving most of it readable?  :-)

    Pass the data through a compressor and then add forward error correction mechanisms such as Reed-Solomon codes.  Then convert to ASCII base64 or similar.

    Yes, exactly.

    I want a system for compression which is robust to corruption, can be stored as text, and with a compressor / decompressor which can be written by a child hobby programmer with only a very little bit of experience of programming.

    That's what I need for Baby X. The FileSystem XML files can get very large, and of course Baby X programmers are going to ask about compression. And I don't think there is an existing system, and so I shall devise one.


    "XML Compression"

    https://link.springer.com/referenceworkentry/10.1007/978-1-4899-7993-3_783-2

    "The size increase incurred by publishing data in XML format is
    estimated to be as much as 400 % [14], making it a prime target for compression.

    While standard general-purpose compressors, such as
    zip, gzip or bzip, typically compress XML data reasonably well...
    "

    Show us a "dir" or an "ls -al" so we can better understand
    the magnitude of what you're working on.

    Lots of things have used ZIP, implicitly or explicitly, mainly
    because it is a kind of standard and does not form a barrier to access.

    In addition, if a structure is voluminous (a thousand control files representing one project), users appreciate having them stored in
    a container, rather than filling the file system with fluff. A ZIP
    can do that too. And if the ZIP has a convenient library you can
    get from FOSS-land, that could save time on building a standards
    based container.

    But what's more important than any techie adventure is not
    annoying your users. What do the users want most? The ability
    to edit the files in question at a moment's notice? Or would
    the files, 99.999% of the time, comfortably remain hidden from view?

    If the "blob" involved was 100GB, then yes, I'd compress it :-)
    If it is 4KB, well, those little files are a nuisance no matter
    what you do. I would leave that uncompressed, unless I could
    containerize it perhaps.

    As an example, Mozilla has used .jsonlz4 as a file format solution.
    I have no idea what problem they thought they were solving,
    but I can tell you I consider the solution obnoxious and inconsiderate
    of the user. LZ4 decompressors are not a stockroom item. I had
    to write a very short program, so I could deal with that. Mozilla
    has made a perfect example of what not to do, by doing that.

    Paul

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Malcolm McLean on Sun Jun 9 11:44:13 2024
    On Fri, 7 Jun 2024 10:03:46 +0100
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:

    On 07/06/2024 05:47, Mikko wrote:
    On 2024-06-06 16:25:37 +0000, Malcolm McLean said:

    Not strictly a C programming question, but smart people will see
    the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language
    ASCII text to compressed ASCII, preferably only "isgraph"
    characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    There are compression algorithms that can be adapted to any possible
    size of input and output character sets, including that both are
    ASCII and that the output character set is a subset of the input
    set.

    Restricting the input set to ASCII may be too strong. Files that
    should be ASCII files sometimes contain non-ascii bytes. The output
    should be restricted to the 94 visible characters but the
    decompressor should accept at least full ASCII and skip the invalid characters as insignificant.
    That permits addition of line breaks and perhaps other spaces that
    could be useful for example when the file is printed for debugging.

    That's exactly the idea. The system is robust to white space. You can
    add spaces to your heart's content, and they are just skipped.

    Robustness to white spaces necessarily weakens robustness to bit flips.
    Not that your set of requirements made much sense to start with...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lowell Gilbert@21:1/5 to Malcolm McLean on Sun Jun 9 19:20:16 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 06/06/2024 20:23, Paul wrote:
    On 6/6/2024 12:25 PM, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    The purpose of doing this, is to satisfy transmission through a 7
    bit channel.
    In the history of networking, not all channels were eight-bit transparent.
    (On the equipment in question, this was called "robbed-bit signaling".)
    For example, BASE64 is valued for its 7 bit channel properties, the ability
    to pass through a pipe which is not 8 bit transparent. Even to this day,
    your email attachments may traverse the network in BASE64 format.
    That is one reason, that email or USENET clients to this day, have
    both 7 bit and 8 bit content encoding methods. It's to handle the
    unlikely possibility that 7 bit transmission channels still exist.
    They likely do exist.

    Yes. If you store data as 8-bit binaries then it's inherently
    risky. There's usually no recovery from a single bit getting corrupted.

    Whilst if you store as ASCII, the data can usually be recovered very
    easily if something goes wrong with the physical storage. "And God said"
    becomes "And G$d said", and even with this tiny text, you can still read
    it perfectly well.

    That example only works because it doesn't include compression.


    --
    Lowell Gilbert, embedded/networking software engineer
    http://be-well.ilk.org/~lowell/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lew Pitcher@21:1/5 to Malcolm McLean on Mon Jun 10 00:45:08 2024
    On Thu, 06 Jun 2024 17:25:37 +0100, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    I'm afraid that you have conflicting requirements here. In effect,
    you want to take an array of values (each within the range of
    0 to 127) and
    a) make the array shorter ("compress it"), and
    b) express the individual elements of this shorter array with
    a range of 96 values ("isgraph() characters")

    Because you reduce the number of values each result element
    can carry, each result element can only express a fraction
    (96/128'ths) of the corresponding source element. Thus,
    with the isgraph() requirement, the result will take /more/
    elements to express the same data as the source did.

    However, you want /compression/, which implies that you want
    the result to be smaller than the source. And, therein lies
    the conflict.
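
    (As a rough worked figure: one character drawn from the 94 isgraph()
    characters carries at most log2(94), about 6.55 bits, against 7 bits for
    an arbitrary ASCII value, so a straight re-encoding needs roughly
    7/6.55, about 1.07 times as many characters. Any net shrinkage therefore
    has to come from first removing redundancy in the text, which is what the
    Huffman or Lempel-Ziv step discussed elsewhere in the thread is for.)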

    Can you help clarify this for me?
    --
    Lew Pitcher
    "In Skills We Trust"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From bart@21:1/5 to Malcolm McLean on Mon Jun 10 01:45:14 2024
    On 10/06/2024 01:22, Malcolm McLean wrote:
    On 10/06/2024 00:20, Lowell Gilbert wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 06/06/2024 20:23, Paul wrote:
    On 6/6/2024 12:25 PM, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see
    the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language
    ASCII text to compressed ASCII, preferably only "isgraph" characters?
    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    The purpose of doing this, is to satisfy transmission through a 7
    bit channel.
    In the history of networking, not all channels were eight-bit
    transparent.
    (On the equipment in question, this was called "robbed-bit signaling".)
    For example, BASE64 is valued for its 7 bit channel properties, the ability
    to pass through a pipe which is not 8 bit transparent. Even to this day,
    your email attachments may traverse the network in BASE64 format.
    That is one reason, that email or USENET clients to this day, have
    both 7 bit and 8 bit content encoding methods. It's to handle the
    unlikely possibility that 7 bit transmission channels still exist.
    They likely do exist.

    Yes. If you store data as 8-bit binaries then it's inherently
    risky. There's usually no recovery from a single bit getting corrupted.

    Whilst if you store as ASCII, the data can usually be recovered very
    easily if something goes wrong with the physical storage. "And God said"
    becomes "And G$d said", and even with this tiny text, you can still read
    it perfectly well.

    That example only works because it doesn't include compression.


    Yes, so the ASCII to ASCCI compression scheme needs to be almost as
    robust. Any corruption will corrupt only a single line. And you can
    examine near-nonsense ASCII in a way you can't examine binary. Yiu can
    lod it into a text editor and look for the difference between two versions.


    (I think some single-bit corruption has crept into your post!)

    If you try to compress ASCII in any worthwhile manner, then it's going
    to become as meaningless as any binary. It's certainly not going to
    resemble English prose with its multiple layers of redundancy.

    But take this bit of ASCII into which I've introduced a 1-bit error:

    414 949 812 809

    Can you tell which digit was affected? And this is uncompressed. Text
    files can contain data like this. They can have tables, or source code,
    or people's surnames with odd spellings, or maybe even encrypted (not compressed) text.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Malcolm McLean on Mon Jun 10 14:29:30 2024
    On Mon, 10 Jun 2024 07:12:57 +0100
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:

    On 10/06/2024 01:45, Lew Pitcher wrote:
    On Thu, 06 Jun 2024 17:25:37 +0100, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see
    the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language
    ASCII text to compressed ASCII, preferably only "isgraph"
    characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    I'm afraid that you have conflicting requirements here. In effect,
    you want to take an array of values (each within the range of
    0 to 127) and
    a) make the array shorter ("compress it"), and
    b) express the individual elements of this shorter array with
    a range of 96 values ("isgraph() characters")

    Because you reduce the number of values each result element
    can carry, each result element can only express a fraction
    (96/128'ths) of the corresponding source element. Thus,
    with the isgraph() requirement, the result will take /more/
    elements to express the same data as the source did.

    However, you want /compression/, which implies that you want
    the result to be smaller than the source. And, therein lies
    the conflict.

    Can you help clarify this for me?

    We have a fixed Huffman tree which is part of the algorithm and
    optimised for ASCII. And we take each line of text and compress it to a
    binary string, using the Huffman table. Then we code the binary string
    six bits at a time using a 64-character subset of ASCII. And then we
    append a special character which is chosen to be visually
    distinctive.

    So the input is

    Mary had a little lamb,
    its fleece was white as snow,
    and everywhere that Mary went,
    the lamb was sure to go.

    And we get the output.

    CVbGNh£-H$£*MMH&-VVdsE3w2as3-vv$G^&ggf-


    And whether it is shorter or not depends on whether the fixed Huffman
    table is any good.
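
    A sketch of just that packing step in C (the fixed Huffman table itself is
    assumed: codes[] and lens[] below stand in for whatever bit patterns and
    bit counts the agreed table assigns to each symbol of the line):

    #include <stdint.h>
    #include <stddef.h>

    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    #define LINE_SEP '-'   /* the visually distinctive end-of-line marker */

    /* Pack one line's worth of Huffman codes six bits at a time into the
       64-character alphabet, then append the separator.  Returns the number
       of characters written to out. */
    size_t encode_line(const unsigned *codes, const int *lens,
                       size_t nsym, char *out)
    {
        uint64_t acc = 0;   /* bit accumulator, most significant bits first */
        int nbits = 0;
        size_t outlen = 0;

        for (size_t i = 0; i < nsym; i++) {
            acc = (acc << lens[i]) | codes[i];
            nbits += lens[i];
            while (nbits >= 6) {
                out[outlen++] = alphabet[(acc >> (nbits - 6)) & 63];
                nbits -= 6;
            }
        }
        if (nbits > 0)      /* flush the final partial group, zero-padded */
            out[outlen++] = alphabet[(acc << (6 - nbits)) & 63];
        out[outlen++] = LINE_SEP;
        return outlen;
    }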


    Take something that is a little bigger than the text above. It does not
    have to be much bigger. One page from any book will do ("Alice's
    Adventures in Wonderland" is used most often for that purpose).
    Apply your compression procedure.
    Then run an automatic test that applies all possible single-bit flips,
    decompresses, and counts the number of mismatches vs the original text.
    The test will report the case with the maximal number of mismatches.
    Look at the most corrupted text.
    If your fixed Huffman table is any good, you'll see that the output is
    corrupted rather seriously; most likely at least one sentence will be
    unrecognizable.
    Alternatively, if your fixed Huffman table is no good, your output will
    be as big as or bigger than the input.

    Popular corpus of samples for compression tests:
    https://corpus.canterbury.ac.nz/descriptions/
    http://corpus.canterbury.ac.nz/resources/cantrbry.zip

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Malcolm McLean on Mon Jun 10 18:55:34 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    We have a fixed Huffman tree which is part of the algorithm and optimised
    for ASCII. And we take each line of text and compress it to a binary string,
    using the Huffman table. Then we code the binary string six bits at a time
    using a 64-character subset of ASCII. And then we append a special character
    which is chosen to be visually distinctive.

    So the input is

    Mary had a little lamb,
    its fleece was white as snow,
    and everywhere that Mary went,
    the lamb was sure to go.

    And we get the output.

    CVbGNh-H$*MMH&-VVdsE3w2as3-vv$G^&ggf-

    It would be more like

    pOHcDdz8v3cz5Nl7WP2gno5krTqU6g/ZynQYlawju8rxyhMT6B30nDusHrWaE+TZf1KdKmJ9Fb6orB

    (That's an actual example using an optimal Huffman encoding for that
    input and the conventional base 64 encoding. I can post the code table,
    if you like.)

    And whether it is shorter or not depends on whether the fixed Huffman table is any good.

    If I use a bigger corpus of English text to derive the Huffman codes,
    the encoding becomes less efficient (of course) so those 110 characters
    need more like 83 base 64 encoded bytes to represent them. Is 75% of
    the size worth it?

    What is the use-case where there is so much English text that a little compression is worthwhile?

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Malcolm McLean on Mon Jun 10 21:28:39 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 10/06/2024 18:55, Ben Bacarisse wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    We have a fixed Huffman tree which is part of the algorithm and optimised
    for ASCII. And we take each line of text and compress it to a binary string,
    using the Huffman table. Then we code the binary string six bits at a time
    using a 64-character subset of ASCII. And then we append a special character
    which is chosen to be visually distinctive.

    So the input is

    Mary had a little lamb,
    its fleece was white as snow,
    and everywhere that Mary went,
    the lamb was sure to go.

    And we get the output.

    CVbGNh-H$*MMH&-VVdsE3w2as3-vv$G^&ggf-
    It would be more like
    pOHcDdz8v3cz5Nl7WP2gno5krTqU6g/ZynQYlawju8rxyhMT6B30nDusHrWaE+TZf1KdKmJ9Fb6orB
    (That's an actual example using an optimal Huffman encoding for that
    input and the conventional base 64 encoding. I can post the code table,
    if you like.)

    And whether it is shorter or not depends on whether the fixed Huffman table is any good.
    If I use a bigger corpus of English text to derive the Huffman codes,
    the encoding becomes less efficient (of course) so those 110 characters
    need more like 83 base 64 encoded bytes to represent them. Is 75% of
    the size worth it?
    What is the use-case where there is so much English text that a little
    compression is worthwhile?

    The FileSystem XML files. They are uncompressed, and as you can take in entire folders, they can be very large.

    I don't know what the XML file system is for either so explaining one by
    the other doesn't help. I was hoping for a use -- a user story -- that
    would help me understand what the point of all this is.

    Tell me as a story: A user wants to ... what? And having a directory of
    large text files in a structured XML text helps because they can
    ... what? And if they were a quarter of the size it would be better
    because ... why?

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon Jun 17 04:45:04 2024
    On Thu, 6 Jun 2024 20:02:55 +0300, Michael S wrote:

    Or, if we want to make a job just a little bit more interesting, we can convert to base94, producing ~9% smaller size than base94 :-)

    You mean smaller than Base64?

    I just spent some hours yesterday implementing the ASCII85 encoding in C
    code. This was something Adobe added to PostScript level 2; not sure if
    anybody else used it.

    By using only 85 instead of 94 printable characters, it could reserve some
    for special uses. For example, four bytes of zero are represented by a
    single “z” character. Also “~” is not used because it is part of the PostScript string delimiter for strings in ASCII85 format.
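
    The per-group core of that encoding is small. A sketch for one full
    4-byte group (stream framing, the final short group and the "~>"
    terminator are left out):

    #include <stdint.h>
    #include <stdio.h>

    /* Read a 4-byte group as a big-endian 32-bit value and write five
       base-85 digits using '!' (33) through 'u' (117); an all-zero group
       collapses to the single character 'z'. */
    static int a85_group(const uint8_t b[4], char out[5])
    {
        uint32_t v = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
                   | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];

        if (v == 0) {
            out[0] = 'z';
            return 1;
        }
        for (int i = 4; i >= 0; i--) {
            out[i] = (char)('!' + v % 85);
            v /= 85;
        }
        return 5;
    }

    int main(void)
    {
        const uint8_t group[4] = { 'M', 'a', 'r', 'y' };
        char out[6] = { 0 };
        int n = a85_group(group, out);

        out[n] = '\0';
        printf("%s\n", out);
        return 0;
    }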

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Jun 17 13:10:12 2024
    On Mon, 17 Jun 2024 04:45:04 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Thu, 6 Jun 2024 20:02:55 +0300, Michael S wrote:

    Or, if we want to make a job just a little bit more interesting, we
    can convert to base94, producing ~9% smaller size than base94 :-)

    You mean smaller than Base64?


    Yes.

    I just spent some hours yesterday implementing the ASCII85 encoding
    in C code. This was something Adobe added to PostScript level 2; not
    sure if anybody else used it.

    By using only 85 instead of 94 printable characters, it could reserve
    some for special uses. For example, four bytes of zero are
    represented by a single “z” character. Also “~” is not used because it is part of the PostScript string delimiter for strings in ASCII85
    format.

    I didn't look at the existing standards.
    The nice thing about base 94 is that you can encode 9 arbitrary octets
    into 11 isgraph() characters and that the code for encode/decode is
    simple and reasonably fast even in absence of 64-bit integer types.

    For base 85 you encode 8 octets to 10 characters, which is even a
    little simpler (or more than a little when 64-bit integers are
    available), but 2.3% less dense.
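
    A sketch of that 9-octets-to-11-characters step, using long division over
    the octets so that nothing wider than a plain unsigned is needed (the
    alphabet here is simply '!' (33) through '~' (126); a real scheme would
    pin down its own ordering):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Treat the 9 octets as a 72-bit big-endian number and write it out as
       11 base-94 digits, least significant digit last. */
    static void base94_encode9(const uint8_t in[9], char out[11])
    {
        uint8_t num[9];

        memcpy(num, in, 9);
        for (int i = 10; i >= 0; i--) {
            unsigned rem = 0;
            for (int j = 0; j < 9; j++) {   /* divide the 72-bit number by 94 */
                unsigned cur = rem * 256u + num[j];
                num[j] = (uint8_t)(cur / 94u);
                rem = cur % 94u;
            }
            out[i] = (char)('!' + rem);
        }
    }

    int main(void)
    {
        uint8_t sample[9];
        char enc[12] = { 0 };

        memcpy(sample, "Mary had ", 9);     /* any 9 octets will do */
        base94_encode9(sample, enc);
        printf("%s\n", enc);
        return 0;
    }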

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)