Not strictly a C programming question, but smart people will see the
relevance to the topic, which is portability.

Is there a compression algorithm which converts human-language ASCII
text to compressed ASCII, preferably only "isgraph" characters?
So "Mary had a little lamb, its fleece was white as snow".
Would become
QWE£$543GtT£$"||x|VVBB?
On 06/06/2024 17:25, Malcolm McLean wrote:
> Is there a compression algorithm which converts human-language ASCII
> text to compressed ASCII, preferably only "isgraph" characters?
> So "Mary had a little lamb, its fleece was white as snow".
> Would become
> QWE£$543GtT£$"||x|VVBB?
What's the problem with compressing to binary (using existing,
efficient utilities), then turning that binary into ASCII (like MIME
or Base64)?
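
For what it's worth, the second half of that pipeline is tiny. A
minimal sketch of a standard (RFC 4648) Base64 encoder in C; the input
buffer here is just an illustrative stand-in for real compressor
output:

/* Take already-compressed bytes (from gzip, xz, etc.) and re-encode
 * them as Base64 so the result passes through any channel that can
 * carry printable ASCII.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode n bytes of src; returns a malloc'd NUL-terminated string. */
char *base64_encode(const unsigned char *src, size_t n)
{
    size_t outlen = 4 * ((n + 2) / 3);
    char *out = malloc(outlen + 1);
    size_t i, j = 0;
    if (!out)
        return NULL;
    for (i = 0; i + 2 < n; i += 3) {
        unsigned v = (src[i] << 16) | (src[i+1] << 8) | src[i+2];
        out[j++] = b64[(v >> 18) & 63];
        out[j++] = b64[(v >> 12) & 63];
        out[j++] = b64[(v >> 6) & 63];
        out[j++] = b64[v & 63];
    }
    if (i < n) {                       /* 1 or 2 trailing bytes */
        unsigned v = src[i] << 16;
        if (i + 1 < n)
            v |= src[i+1] << 8;
        out[j++] = b64[(v >> 18) & 63];
        out[j++] = b64[(v >> 12) & 63];
        out[j++] = (i + 1 < n) ? b64[(v >> 6) & 63] : '=';
        out[j++] = '=';
    }
    out[j] = '\0';
    return out;
}

int main(void)
{
    /* Pretend this came out of a real compressor. */
    const unsigned char packed[] = { 0x1f, 0x8b, 0x08, 0x00, 0x4d, 0x61 };
    char *text = base64_encode(packed, sizeof packed);
    if (text) {
        puts(text);
        free(text);
    }
    return 0;
}

Pipe any stock compressor's output through something like this and the
result survives a 7-bit channel.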
On 06/06/2024 20:23, Paul wrote:
> On 6/6/2024 12:25 PM, Malcolm McLean wrote:
>> Is there a compression algorithm which converts human-language
>> ASCII text to compressed ASCII, preferably only "isgraph"
>> characters?
>
> The purpose of doing this is to satisfy transmission through a 7-bit
> channel.
>
> In the history of networking, not all channels were eight-bit
> transparent. (On the equipment in question, this was called
> "robbed-bit signaling".)
>
> For example, Base64 is valued for its 7-bit channel properties, the
> ability to pass through a pipe which is not 8-bit transparent. Even
> to this day, your email attachments may traverse the network in
> Base64 format.
>
> That is one reason that email and Usenet clients to this day have
> both 7-bit and 8-bit content encoding methods. It's to handle the
> unlikely possibility that 7-bit transmission channels still exist.
> They likely do exist.

Yes. If you store data as 8-bit binaries then it's inherently risky.
There's usually no recovery from a single bit getting corrupted.
Whilst if you store as ASCII, the data can usually be recovered very
easily if something goes wrong with the physical storage. "And God
said" becomes "And G$d said", and even with this tiny text, you can
still read it perfectly well.
On 06/06/2024 17:55, bart wrote:
> What's the problem with compressing to binary (using existing,
> efficient utilities), then turning that binary into ASCII (like MIME
> or Base64)?

Because if a single bit flips in a zip archive, it's likely the entire
archive will be lost. This scheme is robust. We can embed compressed
text in programs, and if it is corrupted, only a single line will
become unreadable.
On 06/06/2024 17:56, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>> Not strictly a C programming question, but smart people will see
>> the relevance to the topic, which is portability.
>
> I must not be smart as I can't see any connection to the topic of
> this group!
>
>> Is there a compression algorithm which converts human-language
>> ASCII text to compressed ASCII, preferably only "isgraph"
>> characters?
>
> Obviously such algorithms exist. One that is used a lot is just
> base64 encoding of binary compressed text, but that won't beat
> something specifically crafted for the task, which is presumably
> what you are asking for. I don't know of anything aimed at that
> task specifically.
>
> One thing you should specify is whether you need it to work on small
> texts, or, even better, at what sort of size you want the pay-off to
> start to kick in. For example, the xz+base64 encoding of the
> complete works of Shakespeare is still less than 40% of the size of
> the original, but your single line will end up much larger using
> that off-the-shelf scheme.

What I was thinking of was using Huffman codes to convert ASCII to a
string of bits.
On 07/06/2024 06:20, Mikko wrote:
> On 2024-06-06 19:09:03 +0000, Malcolm McLean said:
>> What I was thinking of was using Huffman codes to convert ASCII to
>> a string of bits.
>
> Works if one knows, at the time one makes one's compression and
> decompression algorithms, how often each short sequence of
> characters will be used in the files that will be compressed. If you
> have an adaptive Huffman coding (or any other adaptive coding), a
> single error will corrupt the rest of your line. If you reset the
> adaptation at the end of each line it does not adapt well and the
> result is not much better than without adaptation. If you reset the
> adaptation at the end of each page you can have better compression,
> but an error corrupts the rest of the page.
>
> For ordinary texts (except short ones) and many other purposes,
> Lempel-Ziv and its variants work better than Huffman.

Yes, but Huffman is easy to decode. It's the sort of project you give
to people who have just got past the beginner stage but aren't very
experienced programmers yet, whilst implementing Lempel-Ziv is a job
for someone who knows what he is doing.

Because the lines will often be very short, adaptive Huffman coding is
no good. I need a fixed Huffman table with 128 entries, one for each
7-bit value, plus one for "stop". I wonder if any such standard table
exists.
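
For concreteness, a minimal sketch of what encoding with such a fixed
table might look like. No standard table exists as far as I know, so
the handful of entries below are invented (though prefix-free) purely
for the demo; a real table would cover all 128 values plus the stop
code and be derived from English letter frequencies:

/* Fixed-table idea: one (code, length) pair per byte value plus a
 * stop code, all agreed in advance so the decompressor needs no
 * header.
 */
#include <stdio.h>
#include <string.h>

struct hcode { unsigned bits; int len; };   /* code value, MSB-first */

/* Hypothetical prefix-free codes for the demo alphabet only. */
static struct hcode table[128];
static const struct hcode STOP = { 0x3F, 6 };   /* 111111 */

static void init_demo_table(void)
{
    table[' '] = (struct hcode){ 0x0, 2 };  /* 00     */
    table['a'] = (struct hcode){ 0x1, 2 };  /* 01     */
    table['m'] = (struct hcode){ 0x4, 3 };  /* 100    */
    table['r'] = (struct hcode){ 0x5, 3 };  /* 101    */
    table['y'] = (struct hcode){ 0x6, 3 };  /* 110    */
    table['h'] = (struct hcode){ 0x1C, 5 }; /* 11100  */
    table['d'] = (struct hcode){ 0x1D, 5 }; /* 11101  */
    table['l'] = (struct hcode){ 0x1E, 5 }; /* 11110  */
    table['b'] = (struct hcode){ 0x3E, 6 }; /* 111110 */
}

/* Append one code to the bit buffer, most significant bit first. */
static void put_code(unsigned char *buf, size_t *nbits, struct hcode c)
{
    for (int i = c.len - 1; i >= 0; i--, (*nbits)++)
        if ((c.bits >> i) & 1)
            buf[*nbits / 8] |= 0x80 >> (*nbits % 8);
}

int main(void)
{
    unsigned char buf[64] = {0};
    size_t nbits = 0;
    const char *line = "mary had a lamb";

    init_demo_table();
    for (const char *p = line; *p; p++)
        put_code(buf, &nbits, table[(unsigned char)*p]);
    put_code(buf, &nbits, STOP);

    printf("%zu chars -> %zu bits\n", strlen(line), nbits);
    return 0;
}

A real decoder simply walks the same table viewed as a tree, one bit
at a time, which is what makes the beginner-friendly claim plausible.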
On 07/06/2024 10:36, David Brown wrote:
> On 06/06/2024 21:02, Malcolm McLean wrote:
>> Because if a single bit flips in a zip archive, it's likely the
>> entire archive will be lost. This scheme is robust. We can embed
>> compressed text in programs, and if it is corrupted, only a single
>> line will become unreadable.
>
> Ah, you want something that will work like your newsreader program
> that randomly changes letters or otherwise corrupts your spelling
> while leaving most of it readable? :-)
>
> Pass the data through a compressor and then add forward error
> correction mechanisms such as Reed-Solomon codes. Then convert to
> ASCII base64 or similar.

Yes, exactly.

I want a system for compression which is robust to corruption, can be
stored as text, and with a compressor / decompressor which can be
written by a child hobby programmer with only a very little bit of
experience of programming.

That's what I need for Baby X. The FileSystem XML files can get very
large, and of course Baby X programmers are going to ask about
compression. And I don't think there is an existing system, so I
shall devise one.
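
Reed-Solomon itself is well beyond a beginner project, but the flavour
of the forward-error-correction suggestion above can be shown with the
much simpler Hamming(7,4) code: every 4 data bits gain 3 parity bits,
and any single flipped bit in a 7-bit block is located by the syndrome
and repaired. A sketch:

#include <stdio.h>

/* Encode 4 data bits (d3 d2 d1 d0) into a 7-bit codeword. */
unsigned hamming74_encode(unsigned d)
{
    unsigned d0 = d & 1, d1 = (d >> 1) & 1,
             d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    unsigned p1 = d0 ^ d1 ^ d3;      /* covers positions 1,3,5,7 */
    unsigned p2 = d0 ^ d2 ^ d3;      /* covers positions 2,3,6,7 */
    unsigned p3 = d1 ^ d2 ^ d3;      /* covers positions 4,5,6,7 */
    /* bit positions 1..7 hold p1 p2 d0 p3 d1 d2 d3 */
    return p1 | (p2 << 1) | (d0 << 2) | (p3 << 3) |
           (d1 << 4) | (d2 << 5) | (d3 << 6);
}

/* Correct a single flipped bit (if any) and return the data bits. */
unsigned hamming74_decode(unsigned c)
{
    unsigned b[8];
    for (int i = 1; i <= 7; i++)
        b[i] = (c >> (i - 1)) & 1;
    unsigned s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    unsigned s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    unsigned s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    unsigned syndrome = s1 | (s2 << 1) | (s3 << 2);
    if (syndrome)                    /* syndrome = error position */
        b[syndrome] ^= 1;
    return b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3);
}

int main(void)
{
    for (unsigned d = 0; d < 16; d++) {
        unsigned c = hamming74_encode(d);
        unsigned bad = c ^ (1u << (d % 7));   /* flip one bit */
        if (hamming74_decode(bad) != d)
            puts("correction failed");
    }
    puts("all single-bit errors corrected");
    return 0;
}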
On 06/06/2024 22:26, Malcolm McLean wrote:
> Yes. If you store data as 8-bit binaries then it's inherently risky.
> There's usually no recovery from a single bit getting corrupted.
> Whilst if you store as ASCII, the data can usually be recovered very
> easily if something goes wrong with the physical storage. "And God
> said" becomes "And G$d said", and even with this tiny text, you can
> still read it perfectly well.

But you are suggesting storing the compressed data as meaningless
ASCII such as:

  QWE£$543GtT£$"||x|VVBB?

If one bit gets flipped, then it will just be slightly different
meaningless ASCII; there's no way to detect it except with checksums,
CRCs and the like. In any case, the error detection won't be done by
a human, but by machine.

Possibly a human might detect, when back in plain text, that "Mary
hid a little lamb" should have been "had", but now this is getting
silly, needing to rely on knowledge of nursery rhymes.

Trillions of bytes of binary data must be transmitted every day
(perhaps every minute; I've no idea); how often have you encountered
a transmission error? Compression schemes tend to have error
detection built in; I'm sure comms protocols do as well, as do
storage device controllers and drivers. People have this sort of
thing in hand already!
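
For scale, the kind of check mentioned above costs only a few lines of
C. A minimal sketch of a bitwise CRC-32 (the polynomial used by zip
and PNG); the string is just the example from this post:

#include <stdio.h>
#include <string.h>

/* Reflected CRC-32, computed bit by bit (no lookup table). */
unsigned long crc32(const unsigned char *p, size_t n)
{
    unsigned long crc = 0xFFFFFFFFUL;
    while (n--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320UL & (0UL - (crc & 1)));
    }
    return crc ^ 0xFFFFFFFFUL;
}

int main(void)
{
    const char *line = "QWE$543GtT$\"||x|VVBB?";
    printf("%08lX\n", crc32((const unsigned char *)line, strlen(line)));
    return 0;
}

Appended to each encoded line, a value like this detects a flipped bit
even in "meaningless" text, which a human reader cannot.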
On 07/06/2024 13:52, Mikko wrote:
> On 2024-06-07 09:00:57 +0000, Malcolm McLean said:
>> Because the lines will often be very short, adaptive Huffman coding
>> is no good. I need a fixed Huffman table with 128 entries, one for
>> each 7-bit value, plus one for "stop". I wonder if any such
>> standard table exists.
>
> You don't need a standard table. You need statistics. Once you have
> the statistics the table is easy to construct with Huffman's
> algorithm.

No, you do. The text might be very short, like "Mary had a little
lamb", and you will compress it because you know that you are being
fed meaningful ASCII. For example, even this tiny fragment contains
the letter "e", which would have a short Huffman code. And four a's
and two t's, which are the third and the second most common letters.
So it should compress.

And we're compressing each line independently, and choosing a visually
distinctive ASCII character as the line break. So anyone seeing the
compressed data will immediately be able to home in on the line
breaks, and will be able to fix any corruption without special tools.

And you have a standard table which never changes, and so that makes
the decompressor much easier to write.
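
Mikko's construction step really is straightforward. A sketch in C,
with invented placeholder frequencies (real ones would be counted from
a large English corpus): it computes the code length for each of the
129 symbols, and assigning canonical codes from those lengths then
gives the fixed table:

#include <stdio.h>

#define NSYM 129                 /* 128 ASCII values + 1 stop code */

int main(void)
{
    long weight[2 * NSYM];       /* leaf and internal node weights  */
    int parent[2 * NSYM];        /* 0 while still an unmerged root  */
    int nnodes = NSYM;

    /* Placeholder statistics: every symbol seen once, common
     * letters seen more often.  Replace with real counts. */
    for (int i = 0; i < NSYM; i++) {
        weight[i] = 1;
        parent[i] = 0;
    }
    weight[' '] = 180; weight['e'] = 100; weight['t'] = 70;
    weight['a'] = 65;  weight['o'] = 60;  weight['n'] = 55;

    /* Huffman's algorithm: repeatedly merge the two lightest roots.
     * Internal nodes start at index NSYM, so parent index 0 can
     * safely mean "no parent yet". */
    while (1) {
        int lo1 = -1, lo2 = -1;
        for (int i = 0; i < nnodes; i++) {
            if (parent[i])
                continue;
            if (lo1 < 0 || weight[i] < weight[lo1]) {
                lo2 = lo1; lo1 = i;
            } else if (lo2 < 0 || weight[i] < weight[lo2]) {
                lo2 = i;
            }
        }
        if (lo2 < 0)
            break;               /* one root left: the tree is done */
        weight[nnodes] = weight[lo1] + weight[lo2];
        parent[nnodes] = 0;
        parent[lo1] = parent[lo2] = nnodes;
        nnodes++;
    }

    /* Code length of each symbol = depth of its leaf. */
    for (int i = 0; i < NSYM; i++) {
        int len = 0;
        for (int n = i; parent[n]; n = parent[n])
            len++;
        if (i == ' ' || i == 'e' || i == 'z')
            printf("symbol %3d: %d bits\n", i, len);
    }
    return 0;
}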
On 07/06/2024 05:47, Mikko wrote:
> On 2024-06-06 16:25:37 +0000, Malcolm McLean said:
>> Is there a compression algorithm which converts human-language
>> ASCII text to compressed ASCII, preferably only "isgraph"
>> characters?
>
> There are compression algorithms that can be adapted to any possible
> size of input and output character sets, including that both are
> ASCII and that the output character set is a subset of the input
> set.
>
> Restricting the input set to ASCII may be too strong. Files that
> should be ASCII files sometimes contain non-ASCII bytes. The output
> should be restricted to the 94 visible characters, but the
> decompressor should accept at least full ASCII and skip the invalid
> characters as insignificant. That permits addition of line breaks
> and perhaps other spaces that could be useful, for example when the
> file is printed for debugging.

That's exactly the idea. The system is robust to white space. You can
add spaces to your heart's content, and they are just skipped.
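
A sketch of such a tolerant reader in C, assuming a 64-character
alphabet: it returns successive 6-bit groups and silently skips
everything else, so inserted line breaks and spaces do no harm:

#include <stdio.h>
#include <string.h>

static const char alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Return the next 6-bit group from *s, or -1 at end of string. */
int next_group(const char **s)
{
    while (**s) {
        const char *hit = strchr(alphabet, **s);
        (*s)++;
        if (hit)
            return (int)(hit - alphabet);
        /* not in the alphabet: skipped as insignificant */
    }
    return -1;
}

int main(void)
{
    const char *text = "TWF y eS\nBoYWQ ==";  /* spaces, \n, '=' added */
    const char *p = text;
    int g;
    while ((g = next_group(&p)) >= 0)
        printf("%d ", g);
    putchar('\n');
    return 0;
}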
On 10/06/2024 00:20, Lowell Gilbert wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>> Whilst if you store as ASCII, the data can usually be recovered
>> very easily if something goes wrong with the physical storage.
>> "And God said" becomes "And G$d said", and even with this tiny
>> text, you can still read it perfectly well.
>
> That example only works because it doesn't include compression.

Yes, so the ASCII-to-ASCII compression scheme needs to be almost as
robust. Any corruption will corrupt only a single line. And you can
examine near-nonsense ASCII in a way you can't examine binary. You
can load it into a text editor and look for the difference between
two versions.
On 10/06/2024 01:45, Lew Pitcher wrote:
> On Thu, 06 Jun 2024 17:25:37 +0100, Malcolm McLean wrote:
>> Is there a compression algorithm which converts human-language
>> ASCII text to compressed ASCII, preferably only "isgraph"
>> characters?
>
> I'm afraid that you have conflicting requirements here. In effect,
> you want to take an array of values (each within the range of 0 to
> 127) and
> a) make the array shorter ("compress it"), and
> b) express the individual elements of this shorter array with a
>    range of 94 values ("isgraph() characters").
>
> Because you reduce the number of values each result element can
> carry, each result element can only express a fraction of the
> information in the corresponding source element (log2(94)/log2(128),
> roughly 94%). Thus, with the isgraph() requirement, the result will
> take /more/ elements to express the same data than the source did.
>
> However, you want /compression/, which implies that you want the
> result to be smaller than the source. And therein lies the conflict.
> Can you help clarify this for me?

We have a fixed Huffman tree which is part of the algorithm and
optimised for ASCII. And we take each line of text and compress it to
a binary string, using the Huffman table. Then we code the binary
string six bits at a time using a 64-character subset of ASCII. And
then we append a special character which is chosen to be visually
distinctive.

So the input is

  Mary had a little lamb,
  its fleece was white as snow,
  and everywhere that Mary went,
  the lamb was sure to go.

And we get output like

  CVbGNh£-H$£*MMH&-VVdsE3w2as3-vv$G^&ggf-

And whether it is shorter or not depends on whether the fixed Huffman
table is any good: each output character carries six bits, so the
table has to average fewer than six bits per input character before
the scheme beats the plain text.
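
A sketch of those last two steps in C; the bit buffer below is a
made-up stand-in for real Huffman output, and '-' makes a good
terminator precisely because it lies outside the 64-character subset:

#include <stdio.h>

static const char subset[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
#define LINE_MARK '-'            /* the visually distinctive stop */

/* Emit one line: the bit string six bits at a time, then the mark. */
void emit_line(const unsigned char *bits, size_t nbits)
{
    for (size_t i = 0; i < nbits; i += 6) {
        unsigned v = 0;
        for (size_t j = i; j < i + 6; j++) {
            v <<= 1;
            /* bits past the end of the string pad with zeros */
            if (j < nbits && (bits[j / 8] & (0x80 >> (j % 8))))
                v |= 1;
        }
        putchar(subset[v]);
    }
    putchar(LINE_MARK);
    putchar('\n');
}

int main(void)
{
    /* 53 bits of pretend Huffman output for one line of verse */
    unsigned char bits[] = { 0x9A, 0xC5, 0x71, 0xD0, 0x5F, 0x2F, 0xC0 };
    emit_line(bits, 53);
    return 0;
}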
On 10/06/2024 18:55, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>> We have a fixed Huffman tree which is part of the algorithm and
>> optimised for ASCII. [...] And we get output like
>>   CVbGNh-H$*MMH&-VVdsE3w2as3-vv$G^&ggf-
>
> It would be more like
>
> pOHcDdz8v3cz5Nl7WP2gno5krTqU6g/ZynQYlawju8rxyhMT6B30nDusHrWaE+TZf1KdKmJ9Fb6orB
>
> (That's an actual example using an optimal Huffman encoding for that
> input and the conventional base 64 encoding. I can post the code
> table, if you like.)
>
>> And whether it is shorter or not depends on whether the fixed
>> Huffman table is any good.
>
> If I use a bigger corpus of English text to derive the Huffman
> codes, the encoding becomes less efficient (of course), so those 110
> characters need more like 83 base-64-encoded bytes to represent
> them. Is 75% of the size worth it?
>
> What is the use-case where there is so much English text that a
> little compression is worthwhile?

The FileSystem XML files. They are uncompressed, and as you can take
in entire folders, they can be very large.
On Thu, 6 Jun 2024 20:02:55 +0300, Michael S wrote:
> Or, if we want to make a job just a little bit more interesting, we
> can convert to base94, producing ~9% smaller size than base94 :-)

You mean smaller than Base64?

I just spent some hours yesterday implementing the ASCII85 encoding
in C code. This was something Adobe added to PostScript level 2; not
sure if anybody else used it.

By using only 85 instead of 94 printable characters, it could reserve
some for special uses. For example, four bytes of zero are
represented by a single "z" character. Also "~" is not used, because
it is part of the PostScript delimiter for strings in ASCII85 format.
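
A sketch of the ASCII85 group encoding just described: four bytes
become one 32-bit value written as five base-85 digits offset from
'!', with 'z' as the shorthand for an all-zero group. (Adobe's full
filter also handles a partial final group and the "~>" terminator,
omitted here for brevity.)

#include <stdio.h>

void encode_group(const unsigned char b[4])
{
    unsigned long v = ((unsigned long)b[0] << 24) |
                      ((unsigned long)b[1] << 16) |
                      ((unsigned long)b[2] << 8)  |
                       (unsigned long)b[3];
    if (v == 0) {                /* special case: 'z', not "!!!!!" */
        putchar('z');
        return;
    }
    char out[5];
    for (int i = 4; i >= 0; i--) {   /* base-85 digits, high first */
        out[i] = (char)('!' + v % 85);
        v /= 85;
    }
    fwrite(out, 1, 5, stdout);
}

int main(void)
{
    const unsigned char zero[4] = { 0, 0, 0, 0 };
    const unsigned char man[4]  = { 'M', 'a', 'n', ' ' };
    encode_group(man);           /* the classic example: "9jqo^" */
    encode_group(zero);          /* prints just "z" */
    putchar('\n');
    return 0;
}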