• Re: ASCII to ASCII compression.

    From bart@21:1/5 to Malcolm McLean on Thu Jun 6 17:55:58 2024
    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    What's the problem with compressing to binary (using existing, efficient utilities), then turning that binary into ASCII (like Mime or Base64)?
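    As a minimal sketch of that pipeline (assuming zlib is available for the
    compression step; the Base64 encoder below is deliberately bare-bones with
    no line wrapping, and an input as short as the example line will come out
    longer, not shorter):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    static const char b64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /* Encode n bytes as Base64 and print them. */
    static void base64_print(const unsigned char *p, size_t n)
    {
        for (size_t i = 0; i < n; i += 3) {
            unsigned v = p[i] << 16;
            if (i + 1 < n) v |= p[i + 1] << 8;
            if (i + 2 < n) v |= p[i + 2];
            putchar(b64[(v >> 18) & 63]);
            putchar(b64[(v >> 12) & 63]);
            putchar(i + 1 < n ? b64[(v >> 6) & 63] : '=');
            putchar(i + 2 < n ? b64[v & 63] : '=');
        }
        putchar('\n');
    }

    int main(void)
    {
        const char *text = "Mary had a little lamb, its fleece was white as snow";
        uLong srclen = (uLong)strlen(text);
        uLongf destlen = compressBound(srclen);
        unsigned char *dest = malloc(destlen);

        if (!dest || compress(dest, &destlen, (const Bytef *)text, srclen) != Z_OK)
            return 1;
        base64_print(dest, destlen);   /* a line this short will actually expand */
        free(dest);
        return 0;
    }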

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Malcolm McLean on Thu Jun 6 17:56:54 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    I must not be smart as I can't see any connection to the topic of this
    group!

    Is there a compression algorithm which converts human language ASCII text
    to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    Obviously such algorithms exist. One that is used a lot is just base64 encoding of binary compressed text, but that won't beat something
    specifically crafted for the task which is presumably what you are
    asking for. I don't know of anything aimed at that task specifically.

    One thing you should specify is whether you need it to work on small
    texts, or, even better, at what sort of size you want the pay-off to
    start to kick in. For example, the xz+base64 encoding of the complete
    works of Shakespeare is still less than 40% of the size of the original
    but your single line will end up much larger using that off-the-shelf
    scheme.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to bart on Thu Jun 6 20:02:55 2024
    On Thu, 6 Jun 2024 17:55:58 +0100
    bart <bc@freeuk.com> wrote:

    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see
    the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language
    ASCII text to compressed ASCII, preferably only "isgraph"
    characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    What's the problem with compressing to binary (using existing,
    efficient utilities), then turning that binary into ASCII (like Mime
    or Base64)?


    Or, if we want to make a job just a little bit more interesting, we can
    convert to base94, producing ~9% smaller size than base94 :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Malcolm McLean on Thu Jun 6 18:15:43 2024
    On 2024-06-06, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    That could be any algorithm, followed by encoding of binary data
    to ASCII.

    https://en.wikipedia.org/wiki/Binary-to-text_encoding

    That page has a table of various schemes and their packing density.

    Some like Base91 or Base94 use almost all the printable characters
    and have better than 80% coding efficiency.



    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul@21:1/5 to Malcolm McLean on Thu Jun 6 15:23:33 2024
    On 6/6/2024 12:25 PM, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?


    The purpose of doing this, is to satisfy transmission through a 7 bit channel. In the history of networking, not all channels were eight-bit transparent.
    (On the equipment in question, this was called "robbed-bit signaling".)
    For example, BASE64 is valued for its 7 bit channel properties, the ability
    to pass through a pipe which is not 8 bit transparent. Even to this day,
    your email attachments may traverse the network in BASE64 format.

    That is one reason, that email or USENET clients to this day, have
    both 7 bit and 8 bit content encoding methods. It's to handle the
    unlikely possibility that 7 bit transmission channels still exist.
    They likely do exist.

    Paul

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From bart@21:1/5 to Malcolm McLean on Thu Jun 6 22:49:33 2024
    On 06/06/2024 22:26, Malcolm McLean wrote:
    On 06/06/2024 20:23, Paul wrote:
    On 6/6/2024 12:25 PM, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the
    relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?


    The purpose of doing this, is to satisfy transmission through a 7 bit
    channel.
    In the history of networking, not all channels were eight-bit
    transparent.
    (On the equipment in question, this was called "robbed-bit signaling".)
    For example, BASE64 is valued for its 7 bit channel properties, the
    ability
    to pass through a pipe which is not 8 bit transparent. Even to this day,
    your email attachments may traverse the network in BASE64 format.

    That is one reason, that email or USENET clients to this day, have
    both 7 bit and 8 bit content encoding methods. It's to handle the
    unlikely possibility that 7 bit transmission channels still exist.
    They likely do exist.

    Yes. If you store data as 8-bit binaries then it's inherently risky.
    There's usually no recovery from a single bit getting corrupted.

    Whilst if you store as ASCII, the data can usually be recovered very
    easily if something goes wrong with the physical storage. "And God said"
    becomes "And G$d said", and even with this tiny text, you can still read
    it perfectly well.

    But you are suggesting storing the compressed data as meaningless ASCII
    such as:

    QWE£$543GtT£$"||x|VVBB?

    If one bit gets flipped, then it will just be slightly different
    meaningless ASCII; there's no way to detect it except checksums, CRCs
    and the like.

    In any case, the error detection won't be done by a human, but machine.

    Possibly a human might detect, when back in plain text, that 'Mary hid a
    little lamb' should have been 'had', but now this is getting silly,
    needing to rely on knowledge of nursery rhymes.

    Trillions of bytes of binary data must be transmitted every day (perhaps
    every minute; I've no idea); how often have you encountered a
    transmission error?

    Compression schemes tend to have error-detection built-in; I'm sure
    comms do as well, as well as storage device controllers and drivers.
    People have this sort of thing in hand already!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mikko@21:1/5 to Malcolm McLean on Fri Jun 7 07:47:00 2024
    On 2024-06-06 16:25:37 +0000, Malcolm McLean said:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    There are compression algorithms that can be adapted to any possible
    size of input and output character sets, including that both are
    ASCII and that the output character set is a subset of the input set.

    Restricting the input set to ASCII may be too strong. Files that should
    be ASCII files sometimes contain non-ASCII bytes. The output should be
    restricted to the 94 visible characters but the decompressor should
    accept at least full ASCII and skip the invalid characters as
    insignificant. That permits addition of line breaks and perhaps other
    spaces that could be useful, for example when the file is printed for
    debugging.

    --
    Mikko

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mikko@21:1/5 to Malcolm McLean on Fri Jun 7 08:03:04 2024
    On 2024-06-06 19:02:56 +0000, Malcolm McLean said:

    On 06/06/2024 17:55, bart wrote:
    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the
    relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    What's the problem with compressing to binary (using existing,
    efficient utilities), then turning that binary into ASCII (like Mime or
    Base64)?

    Because if a single bit flips in a zip archive, it's likely the entire archive will be lost. This scheme is robust. We can embed compressed
    text in programs, and if it is corrupted, only a single line will become unreadable.

    The purpose of compression is to remove redundancy, and with it the
    possibility of detecting and correcting errors. If an error tolerance is
    needed then that must be added
    after the compression. The best solution is to use the best compression
    and then the best error checking. The meaning of the latter "best" depends
    on the requirements on reliability and compression. In any case there is
    no hard limit to the amount of possible undetected corruption.

    --
    Mikko

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mikko@21:1/5 to Malcolm McLean on Fri Jun 7 08:20:03 2024
    On 2024-06-06 19:09:03 +0000, Malcolm McLean said:

    On 06/06/2024 17:56, Ben Bacarisse wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    Not strictly a C programming question, but smart people will see the
    relevance to the topic, which is portability.

    I must not be smart as I can't see any connection to the topic of this
    group!

    Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    Obviously such algorithms exist. One that is used a lot is just base64
    encoding of binary compressed text, but that won't beat something
    specifically crafted for the task which is presumably what you are
    asking for. I don't know of anything aimed at that task specifically.

    One thing you should specify is whether you need it to work on small
    texts, or, even better, at what sort of size you want the pay-off to
    start to kick in. For example, the xz+base64 encoding of the complete
    works of Shakespeare is still less than 40% of the size of the original
    but your single line will end up much larger using that off-the-shelf
    scheme.

    What I was thinking of was using Huffman codes to convert ASCII to a
    string of bits.

    Works if one knows, at the time one makes one's compression and
    decompression algorithms, how often each short sequence of characters
    will be used in the files that will be compressed. If you have an
    adaptive Huffman coding (or any other adaptive coding) a single error
    will corrupt the rest of your line. If you reset the adaptation at the
    end of each line it does not adapt well and the result is not much
    better than without adaptation. If you reset the adaptation at the
    end of each page you can have better compression but an error corrupts
    the rest of the page.

    For ordinary texts (except short ones) and many other purposes Lempel-Ziv
    and its variants work better than Huffman.

    --
    Mikko

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Malcolm McLean on Fri Jun 7 11:36:43 2024
    On 06/06/2024 21:02, Malcolm McLean wrote:
    On 06/06/2024 17:55, bart wrote:
    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the
    relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    What's the problem with compressing to binary (using existing,
    efficient utilities), then turning that binary into ASCII (like Mime
    or Base64)?

    Because if a single bit flips in a zip archive, it's likely the entire archive will be lost. This scheme is robust. We can emed compressed text
    in programs, and if it is corruped, only a single line will become unreadable.

    Ah, you want something that will work like your newsreader program that randomly changes letters or otherwise corrupts your spelling while
    leaving most of it readable? :-)

    Pass the data through a compressor and then add forward error correction mechanisms such as Reed-Solomon codes. Then convert to ASCII base64 or similar.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mikko@21:1/5 to Malcolm McLean on Fri Jun 7 15:52:25 2024
    On 2024-06-07 09:00:57 +0000, Malcolm McLean said:

    On 07/06/2024 06:20, Mikko wrote:
    On 2024-06-06 19:09:03 +0000, Malcolm McLean said:

    On 06/06/2024 17:56, Ben Bacarisse wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    Not strictly a C programming question, but smart people will see the
    relevance to the topic, which is portability.

    I must not be smart as I can't see any connection to the topic of this
    group!

    Is there a compression algorithm which converts human language ASCII text
    to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    Obviously such algorithms exist. One that is used a lot is just base64
    encoding of binary compressed text, but that won't beat something
    specifically crafted for the task which is presumably what you are
    asking for. I don't know of anything aimed at that task specifically.

    One thing you should specify is whether you need it to work on small
    texts, or, even better, at what sort of size you want the pay-off to
    start to kick in. For example, the xz+base64 encoding of the complete
    works of Shakespeare is still less than 40% of the size of the original
    but your single line will end up much larger using that off-the-shelf
    scheme.

    What I was thinking of was using Huffman codes to convert ASCII to a
    string of bits.

    Works if one knows, at the time one makes one's compression and
    decompression algorithms, how often each short sequence of characters
    will be used in the files that will be compressed. If you have an
    adaptive Huffman coding (or any other adaptive coding) a single error
    will corrupt the rest of your line. If you reset the adaptation at the
    end of each line it does not adapt well and the result is not much
    better than without adaptation. If you reset the adaptation at the
    end of each page you can have better compression but an error corrupts
    the rest of the page.

    For ordinary texts (except short ones) and many other purposes Lempel-Ziv
    and its variants work better than Huffman.

    Yes, but Huffman is easy to decode. It's the sort of project you give
    to people who have just got past the beginner stage but aren't very experienced programmers yet, whilst implementing Lempel-Ziv is a job
    for someone who knows what he is doing.

    Because the lines will often be very short, adaptive Huffman coding is
    no good. I need a fixed Huffman table with 128 entries, one for each
    7-bit value, plus one for "stop". I wonder if any such standard table exists.

    You don't need a standard table. You need statistics. Once you have the statistics the table is easy to construct with Huffman's algorithm.
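
    As a rough sketch of that step (the frequencies here are whatever counts
    you have gathered from representative text; 129 symbols covers the 128
    ASCII values plus the "stop" code mentioned elsewhere in the thread, and
    turning the resulting lengths into canonical codes is a separate,
    mechanical step):

    #include <stdio.h>

    #define NSYM 129   /* 128 ASCII values plus a "stop" symbol */

    /* Compute Huffman code lengths by repeatedly merging the two lightest
       live nodes.  O(NSYM^2), which is fine for 129 symbols. */
    static void huffman_lengths(const unsigned long freq[NSYM], int len[NSYM])
    {
        unsigned long weight[2 * NSYM];
        int parent[2 * NSYM], alive[2 * NSYM], nodes = NSYM;

        for (int i = 0; i < NSYM; i++) {
            weight[i] = freq[i] ? freq[i] : 1;  /* unseen symbols still get a code */
            alive[i] = 1;
            parent[i] = -1;
        }
        for (int merges = 0; merges < NSYM - 1; merges++) {
            int a = -1, b = -1;
            for (int i = 0; i < nodes; i++) {   /* find the two lightest live nodes */
                if (!alive[i]) continue;
                if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
                else if (b < 0 || weight[i] < weight[b]) { b = i; }
            }
            weight[nodes] = weight[a] + weight[b];
            alive[nodes] = 1;
            parent[nodes] = -1;
            alive[a] = alive[b] = 0;
            parent[a] = parent[b] = nodes;
            nodes++;
        }
        for (int i = 0; i < NSYM; i++) {        /* a leaf's code length is its depth */
            int d = 0;
            for (int p = parent[i]; p != -1; p = parent[p]) d++;
            len[i] = d;
        }
    }

    int main(void)
    {
        const char *sample = "Mary had a little lamb, its fleece was white as snow";
        unsigned long freq[NSYM] = {0};
        int len[NSYM];

        for (const char *p = sample; *p; p++)
            freq[(unsigned char)*p & 127]++;
        freq[128] = 1;                          /* the "stop" symbol */
        huffman_lengths(freq, len);
        printf("code length for 'e': %d bits\n", len['e']);
        return 0;
    }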

    --
    Mikko

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Malcolm McLean on Fri Jun 7 16:45:03 2024
    On 07/06/2024 14:43, Malcolm McLean wrote:
    On 07/06/2024 10:36, David Brown wrote:
    On 06/06/2024 21:02, Malcolm McLean wrote:
    On 06/06/2024 17:55, bart wrote:
    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see
    the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language
    ASCII text to compressed ASCII, preferably only "isgraph" characters?
    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    What's the problem with compressing to binary (using existing,
    efficient utilities), then turning that binary into ASCII (like Mime
    or Base64)?

    Because if a single bit flips in a zip archive, it's likely the
    entire archive will be lost. This scheme is robust. We can emed
    compressed text in programs, and if it is corruped, only a single
    line will become unreadable.

    Ah, you want something that will work like your newsreader program
    that randomly changes letters or otherwise corrupts your spelling
    while leaving most of it readable?  :-)

    Pass the data through a compressor and then add forward error correction
    mechanisms such as Reed-Solomon codes.  Then convert to ASCII base64
    or similar.

    Yes, exactly.

    I want a system for compression which is robust to corruption, can be
    stored as text, and with a compressor / decompressor which can be
    written by a child hobby programmer with only a very little bit of
    experience of programming.


    That last "requirement" is completely unrealistic. Forget it. Then you already have a solution, as I outlined above.

    I don't think it is remotely helpful to have error correction in your
    format. You have to handle email extraordinarily badly to have any
    issues transferring 8-bit binary data, and you certainly won't have
    trouble if you are using Base64. Either the email will arrive
    correctly, or it will not arrive at all. At most, you could add a CRC
    check after compressing and before Base64 encoding.

    But then, I don't think any of this stuff will be remotely useful in
    practice. But if you are enjoying working on it, that's all the
    motivation and justification anyone could ever need. So if you want
    error correction and compression, that's fine.

    That's what I need for Baby X. The FileSystem XML files can get very
    large, and of course Baby X programmers are going to ask about
    compression. And I don't think there is an existing system, and so I
    shall devise one.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul@21:1/5 to bart on Fri Jun 7 11:22:12 2024
    On 6/6/2024 5:49 PM, bart wrote:
    On 06/06/2024 22:26, Malcolm McLean wrote:
    On 06/06/2024 20:23, Paul wrote:
    On 6/6/2024 12:25 PM, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?


    The purpose of doing this, is to satisfy transmission through a 7 bit channel.
    In the history of networking, not all channels were eight-bit transparent.
    (On the equipment in question, this was called "robbed-bit signaling".)
    For example, BASE64 is valued for its 7 bit channel properties, the ability
    to pass through a pipe which is not 8 bit transparent. Even to this day,
    your email attachments may traverse the network in BASE64 format.

    That is one reason, that email or USENET clients to this day, have
    both 7 bit and 8 bit content encoding methods. It's to handle the
    unlikely possibility that 7 bit transmission channels still exist.
    They likely do exist.

    Yes. If you store data as 8-bit binaries then it's inherently risky. There's usually no recovery from a single bit getting corrupted.

    Whilst if you store as ASCII, the data can usually be recovered very easily if something goes wrong with the physical storage. "And God said"
    becomes "And G$d said", and even with this tiny text, you can still read
    it perfectly well.

    But you are suggesting storing the compressed data as meaningless ASCII such as:

    QWE£$543GtT£$"||x|VVBB?

    If one bit gets flipped, then it will just be slightly different meaningless ASCII; there's no way to detect it except checksums, CRCs and the like.

    In any case, the error detection won't be done by a human, but machine.

    Possibly a human might detect, when back in plain text, that 'Mary hid a little lamb' should have been 'had', but now this is getting silly, needing to rely on knowledge of nursery rhymes.

    Trillions of bytes of binary data must be transmitted every day (perhaps every minute; I've no idea); how often have you encountered a transmission error?

    Compression schemes tend to have error-detection built-in; I'm sure comms do as well, as well as storage device controllers and drivers. People have this sort of thing in hand already!



    ZIP (of WinZIP fame) has a CRC computed per file. The decompression
    step will tell you if a file is corrupted. The column of CRC values
    is shown in some of the unpacking software (and if you run a CRC check separately on the file at a later date, you can compare).

    [Picture]

    https://i.postimg.cc/DwQgPQP3/ZIP-CRC-field.gif

    True repair capability requires a better code. The Reed-Solomon code David Brown mentions is an example of such a code. A three-dimensional version on CDs makes the CD very resistant to errors. By the time the Reed-Solomon code cannot repair a CD, the CD surface is so bad that the laser can no longer lock to the groove.
    Rather than Reed Solomon complaining it cannot correct the data, instead
    it is the optical drive reporting it cannot find the groove using the laser.

    Storage media also has repair capability. A typical SSD (NAND flash storage device)
    has 10% overhead for corrections. A 512-byte sector has an extra 51 bytes set aside for error correction. When your SSD slows down to 300MB/sec from 530MB/sec,
    that means that every sector being read had errors, and is being corrected by a processor inside the SSD drive. This is a "normal" state of affairs for TLC
    or QLC based drives. Some 2.5" flash devices, have a three core ARM processor, and at least one of the cores does error correction.

    But on an archival format with extreme compression, finding that "someone had wasted an extra 10% on error correction capability" would of course
    annoy a user expecting the extreme compression to save them money (for storage).

    When selecting a "scheme", you have to decide what kind of error-type you
    are protecting against.

    For example, on hard drives, someone postulated they were protecting against single-bit (independent, does not correlate with other single-bits) errors.
    The Fire codes (polynomial) were the result. There is some small probability
    of multiple bits (perhaps an error multiplication effect in the DSP-based
    data recovery on read). At the time, no one considered that a heavy-weight method
    was necessary.

    When you expect to be losing whole sectors, whole files, whole pieces of media, there are PAR codes for that. But these were determined to be not mathematically
    sound, so serious archival use might not use them. The idea would be, if an archive spanned ten CDs, you would burn one or two more CDs (generated by PAR), and if any of the twelve CDs total was bad, PAR could regenerate the
    missing information (if any). Of the 12 CDs, any two could go missing, and they could then be regenerated.

    A simpler to understand scheme is to burn duplicate CD copies of the same information.
    If you lose a CD, or if the media surface degrades completely, you have the second CD. And that does not involve any complex PAR method :-) It's easier
    for the human to understand.

    Paul

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Harnden@21:1/5 to Malcolm McLean on Fri Jun 7 20:06:09 2024
    On 07/06/2024 19:49, Malcolm McLean wrote:
    On 07/06/2024 13:52, Mikko wrote:
    On 2024-06-07 09:00:57 +0000, Malcolm McLean said:

    Yes, but Huffman is easy to decode. It's the sort of project you give
    to people who have just got past the beginner stage but aren't very
    experienced programmers yet, whilst implementing Lempel-Ziv is a job
    for someone who knows what he is doing.

    Because the lines will often be very short, adaptive Huffman coding
    is no good. I need a fixed Huffman table with 128 entries, one for each
    7-bit value, plus one for "stop". I wonder if any such standard table
    exists.

    You don't need a standard table. You need statistics. Once you have the
    statistics the table is easy to construct with Huffman's algorithm.

    No, you do. The text might be very short, like "Mary had a little lamb",
    and you will compress it because you know that you are being fed
    meaningful ASCII. For example, even this tiny fragment contains the
    letter "e", which would have a short Huffman code. And four a's and two
    t's, which are the third and the second most common letters. So it
    should compress.

    And we're compressing each line independently, and choosing a visually distinctive ASCII character as the line break. So anyone seeing the compressed data will immediately be able to home in on the line breaks,
    and will be able to fix any corruption without special tools.

    And you have a standard table which never changes. And so that makes the decompressor much easier to write.

    Will your Baby X be able to handle, say, UTF-8?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul@21:1/5 to Malcolm McLean on Fri Jun 7 16:57:08 2024
    On 6/7/2024 8:43 AM, Malcolm McLean wrote:
    On 07/06/2024 10:36, David Brown wrote:
    On 06/06/2024 21:02, Malcolm McLean wrote:
    On 06/06/2024 17:55, bart wrote:
    On 06/06/2024 17:25, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    What's the problem with compressing to binary (using existing, efficient utilities), then turning that binary into ASCII (like Mime or Base64)?

    Because if a single bit flips in a zip archive, it's likely the entire archive will be lost. This scheme is robust. We can emed compressed text in programs, and if it is corruped, only a single line will become unreadable.

    Ah, you want something that will work like your newsreader program that randomly changes letters or otherwise corrupts your spelling while leaving most of it readable?  :-)

    Pass the data through a compressor and then add forward error correction mechanisms such as Reed-Solomon codes.  Then convert to ASCII base64 or similar.

    Yes, exactly.

    I want a system for compression which is robust to corruption, can be stored as text, and with a compressor / decompressor which can be written by a child hobby programmer with only a very little bit of experience of programming.

    That's what I need for Baby X. The FileSystem XML files can get very large, and of course Baby X programmers are going to ask about compression. And I don't think there is an existing system, and so I shall devise one.


    "XML Compression"

    https://link.springer.com/referenceworkentry/10.1007/978-1-4899-7993-3_783-2

    "The size increase incurred by publishing data in XML format is
    estimated to be as much as 400 % [14], making it a prime target for compression.

    While standard general-purpose compressors, such as
    zip, gzip or bzip, typically compress XML data reasonably well...
    "

    Show us a "dir" or an "ls -al" so we can better understand
    the magnitude of what you're working on.

    Lots of things have used ZIP, implicitly or explicitly, mainly
    because it is a kind of standard and does not form a barrier to access.

    In addition, if a structure is voluminous (a thousand control files representing one project), users appreciate having them stored in
    a container, rather than filling the file system with fluff. A ZIP
    can do that too. And if the ZIP has a convenient library you can
    get from FOSS-land, that could save time on building a standards
    based container.

    But what's more important than any techie adventure is not
    annoying your users. What do the users want most? The ability
    to edit the files in question at a moment's notice? Or would
    the files, 99.999% of the time, comfortably remain hidden from view?

    If the "blob" involved was 100GB, then yes, I'd compress it :-)
    If it is 4KB, well, those little files are a nuisance no matter
    what you do. I would leave that uncompressed, unless I could
    containerize it perhaps.

    As an example, Mozilla has used .jsonlz4 as a file format solution.
    I have no idea what problem they thought they were solving,
    but I can tell you I consider the solution obnoxious and inconsiderate
    of the user. LZ4 decompressors are not a stockroom item. I had
    to write a very short program, so I could deal with that. Mozilla
    has made a perfect example of what not to do, by doing that.

    Paul

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Malcolm McLean on Sun Jun 9 11:44:13 2024
    On Fri, 7 Jun 2024 10:03:46 +0100
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:

    On 07/06/2024 05:47, Mikko wrote:
    On 2024-06-06 16:25:37 +0000, Malcolm McLean said:

    Not strictly a C programming question, but smart people will see
    the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language
    ASCII text to compressed ASCII, preferably only "isgraph"
    characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    There are compression algorithms that can be adapted to any possible
    size of input and output character sets, including that both are
    ASCII and that the output character set is a subset of the input
    set.

    Restricting the input set to ASCII may be too strong. Files that
    should be ASCII files sometimes contain non-ascii bytes. The output
    should be restricted to the 94 visible characters but the
    decompressor should accept at least full ASCII and skip the invalid characters as insignificant.
    That permits addition of line breaks and perhaps other spaces that
    could be useful for example when the file is printed for debugging.

    That's exactly the idea. The system is robust to white space. You can
    add spaces to your heart's content, and they are just skipped.

    Robustness to white spaces necessarily weakens robustness to bit flips.
    Not that your set of requirements made much sense to start with...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lowell Gilbert@21:1/5 to Malcolm McLean on Sun Jun 9 19:20:16 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 06/06/2024 20:23, Paul wrote:
    On 6/6/2024 12:25 PM, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE$543GtT$"||x|VVBB?

    The purpose of doing this, is to satisfy transmission through a 7
    bit channel.
    In the history of networking, not all channels were eight-bit transparent.
    (On the equipment in question, this was called "robbed-bit signaling".)
    For example, BASE64 is valued for its 7 bit channel properties, the ability
    to pass through a pipe which is not 8 bit transparent. Even to this day,
    your email attachments may traverse the network in BASE64 format.
    That is one reason, that email or USENET clients to this day, have
    both 7 bit and 8 bit content encoding methods. It's to handle the
    unlikely possibility that 7 bit transmission channels still exist.
    They likely do exist.

    Yes. If you store data as 8-bit binaries then it's inherently
    risky. There's usually no recovery from a single bit getting corrupted.

    Whilst if you store as ASCII, the data can usually be recovered very
    easily if something goes wrong with the physical storage. "And God said"
    becomes "And G$d said", and even with this tiny text, you can still read
    it perfectly well.

    That example only works because it doesn't include compression.


    --
    Lowell Gilbert, embedded/networking software engineer
    http://be-well.ilk.org/~lowell/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lew Pitcher@21:1/5 to Malcolm McLean on Mon Jun 10 00:45:08 2024
    On Thu, 06 Jun 2024 17:25:37 +0100, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language ASCII
    text to compressed ASCII, preferably only "isgraph" characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    I'm afraid that you have conflicting requirements here. In effect,
    you want to take an array of values (each within the range of
    0 to 127) and
    a) make the array shorter ("compress it"), and
    b) express the individual elements of this shorter array with
    a range of 96 values ("isgraph() characters")

    Because you reduce the number of values each result element
    can carry, each result element can only express a fraction
    (96/128'ths) of the corresponding source element. Thus,
    with the isgraph() requirement, the result will take /more/
    elements to express the same data as the source did.

    However, you want /compression/, which implies that you want
    the result to be smaller than the source. And, therein lies
    the conflict.
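
    (As a rough worked figure: one character drawn from the 94 isgraph()
    characters carries at most log2(94), about 6.55 bits, against 7 bits for
    an arbitrary ASCII value, so a straight re-encoding needs roughly
    7/6.55, about 1.07 times as many characters. Any net shrinkage therefore
    has to come from first removing redundancy in the text, which is what the
    Huffman or Lempel-Ziv step discussed elsewhere in the thread is for.)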

    Can you help clarify this for me?
    --
    Lew Pitcher
    "In Skills We Trust"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From bart@21:1/5 to Malcolm McLean on Mon Jun 10 01:45:14 2024
    On 10/06/2024 01:22, Malcolm McLean wrote:
    On 10/06/2024 00:20, Lowell Gilbert wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 06/06/2024 20:23, Paul wrote:
    On 6/6/2024 12:25 PM, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see
    the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language
    ASCII text to compressed ASCII, preferably only "isgraph" characters?
    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    The purpose of doing this, is to satisfy transmission through a 7
    bit channel.
    In the history of networking, not all channels were eight-bit
    transparent.
    (On the equipment in question, this was called "robbed-bit signaling".)
    For example, BASE64 is valued for its 7 bit channel properties, the ability
    to pass through a pipe which is not 8 bit transparent. Even to this day,
    your email attachments may traverse the network in BASE64 format.
    That is one reason, that email or USENET clients to this day, have
    both 7 bit and 8 bit content encoding methods. It's to handle the
    unlikely possibility that 7 bit transmission channels still exist.
    They likely do exist.

    Yes. If you store data as 8-bit binaries then it's inherently
    risky. There's usually no recovery from a single bit getting corrupted.

    Whilst if you store as ASCII, the data can usually be recovered very
    easily if something goes wrong with the physical storage. "And God said"
    becomes "And G$d said", and even with this tiny text, you can still read
    it perfectly well.

    That example only works because it doesn't include compression.


    Yes, so the ASCII to ASCCI compression scheme needs to be almost as
    robust. Any corruption will corrupt only a single line. And you can
    examine near-nonsense ASCII in a way you can't examine binary. Yiu can
    lod it into a text editor and look for the difference between two versions.


    (I think some single-bit corruption has crept into your post!)

    If you try to compress ASCII in any worthwhile manner, then it's going
    to become as meaningless as any binary. It's certainly not going to
    resemble English prose with its multiple layers of redundancy.

    But take this bit of ASCII into which I've introduced a 1-bit error:

    414 949 812 809

    Can you tell which digit was affected? And this is uncompressed. Text
    files can contain data like this. They can have tables, or source code,
    or people's surnames with odd spellings, or maybe even encrypted (not compressed) text.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Malcolm McLean on Mon Jun 10 14:29:30 2024
    On Mon, 10 Jun 2024 07:12:57 +0100
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:

    On 10/06/2024 01:45, Lew Pitcher wrote:
    On Thu, 06 Jun 2024 17:25:37 +0100, Malcolm McLean wrote:

    Not strictly a C programming question, but smart people will see
    the relevance to the topic, which is portability.

    Is there a compression algorithm which converts human language
    ASCII text to compressed ASCII, preferably only "isgraph"
    characters?

    So "Mary had a little lamb, its fleece was white as snow".

    Would become

    QWE£$543GtT£$"||x|VVBB?

    I'm afraid that you have conflicting requirements here. In effect,
    you want to take an array of values (each within the range of
    0 to 127) and
    a) make the array shorter ("compress it"), and
    b) express the individual elements of this shorter array with
    a range of 96 values ("isgraph() characters")

    Because you reduce the number of values each result element
    can carry, each result element can only express a fraction
    (96/128'ths) of the corresponding source element. Thus,
    with the isgraph() requirement, the result will take /more/
    elements to express the same data as the source did.

    However, you want /compression/, which implies that you want
    the result to be smaller than the source. And, therein lies
    the conflict.

    Can you help clarify this for me?

    We have a fixed Huffman tree which is part of the algorithm and
    optimised for ASCII. And we take each line of text and compress it to a
    binary string, using the Huffman table. Then we code the binary string
    six bits at a time using a 64-character subset of ASCII. And then we
    append a special character which is chosen to be visually
    distinctive.

    So the input is

    Mary had a little lamb,
    its fleece was white as snow,
    and everywhere that Mary went,
    the lamb was sure to go.

    And we get the output.

    CVbGNh£-H$£*MMH&-VVdsE3w2as3-vv$G^&ggf-


    And whether it is shorter or not depends on whether the fixed Huffman
    table is any good.
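
    A sketch of just that packing step in C (the fixed Huffman table itself is
    assumed: codes[] and lens[] below stand in for whatever bit patterns and
    bit counts the agreed table assigns to each symbol of the line):

    #include <stdint.h>
    #include <stddef.h>

    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    #define LINE_SEP '-'   /* the visually distinctive end-of-line marker */

    /* Pack one line's worth of Huffman codes six bits at a time into the
       64-character alphabet, then append the separator.  Returns the number
       of characters written to out. */
    size_t encode_line(const unsigned *codes, const int *lens,
                       size_t nsym, char *out)
    {
        uint64_t acc = 0;   /* bit accumulator, most significant bits first */
        int nbits = 0;
        size_t outlen = 0;

        for (size_t i = 0; i < nsym; i++) {
            acc = (acc << lens[i]) | codes[i];
            nbits += lens[i];
            while (nbits >= 6) {
                out[outlen++] = alphabet[(acc >> (nbits - 6)) & 63];
                nbits -= 6;
            }
        }
        if (nbits > 0)      /* flush the final partial group, zero-padded */
            out[outlen++] = alphabet[(acc << (6 - nbits)) & 63];
        out[outlen++] = LINE_SEP;
        return outlen;
    }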


    Take something that is a little bigger than the text above. It does not
    have to be much bigger. One page from any book will do ("Alice's
    Adventures in Wonderland" is used most often for that purpose).
    Apply your compression procedure.
    Then run an automatic test that applies all possible single-bit flips,
    decompresses, and counts the number of mismatches vs the original text.
    The test will report the case with the maximal number of mismatches.
    Look at the most corrupted text.
    If your fixed Huffman table is any good, you'll see that the output is
    corrupted rather seriously; most likely at least one sentence will be
    unrecognizable.
    Alternatively, if your fixed Huffman table is no good, your output will
    be as big as or bigger than the input.

    Popular corpus of samples for compression tests:
    https://corpus.canterbury.ac.nz/descriptions/
    http://corpus.canterbury.ac.nz/resources/cantrbry.zip

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Malcolm McLean on Mon Jun 10 18:55:34 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    We have a fixed Huffman tree which is part of the algorithm and optimised
    for ASCII. And we take each line of text and compress it to a binary string,
    using the Huffman table. Then we code the binary string six bits at a time
    using a 64-character subset of ASCII. And then we append a special character
    which is chosen to be visually distinctive.

    So the input is

    Mary had a little lamb,
    its fleece was white as snow,
    and everywhere that Mary went,
    the lamb was sure to go.

    And we get the output.

    CVbGNh-H$*MMH&-VVdsE3w2as3-vv$G^&ggf-

    It would be more like

    pOHcDdz8v3cz5Nl7WP2gno5krTqU6g/ZynQYlawju8rxyhMT6B30nDusHrWaE+TZf1KdKmJ9Fb6orB

    (That's an actual example using an optimal Huffman encoding for that
    input and the conventional base 64 encoding. I can post the code table,
    if you like.)

    And whether it is shorter or not depends on whether the fixed Huffman table is any good.

    If I use a bigger corpus of English text to derive the Huffman codes,
    the encoding becomes less efficient (of course) so those 110 characters
    need more like 83 base 64 encoded bytes to represent them. Is 75% of
    the size worth it?

    What is the use-case where there is so much English text that a little compression is worthwhile?

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Malcolm McLean on Mon Jun 10 21:28:39 2024
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    On 10/06/2024 18:55, Ben Bacarisse wrote:
    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    We have a fixed Huffman tree which is part of the algorithm and optimised
    for ASCII. And we take each line of text and compress it to a binary string,
    using the Huffman table. Then we code the binary string six bits at a time
    using a 64-character subset of ASCII. And then we append a special character
    which is chosen to be visually distinctive.

    So the input is

    Mary had a little lamb,
    its fleece was white as snow,
    and everywhere that Mary went,
    the lamb was sure to go.

    And we get the output.

    CVbGNh-H$*MMH&-VVdsE3w2as3-vv$G^&ggf-
    It would be more like
    pOHcDdz8v3cz5Nl7WP2gno5krTqU6g/ZynQYlawju8rxyhMT6B30nDusHrWaE+TZf1KdKmJ9Fb6orB
    (That's an actual example using an optimal Huffman encoding for that
    input and the conventional base 64 encoding. I can post the code table,
    if you like.)

    And whether it is shorter or not depends on whether the fixed Huffman table is any good.
    If I use a bigger corpus of English text to derive the Huffman codes,
    the encoding becomes less efficient (of course) so those 110 characters
    need more like 83 base 64 encoded bytes to represent them. Is 75% of
    the size worth it?
    What is the use-case where there is so much English text that a little
    compression is worthwhile?

    The FileSystem XML files. They are uncompressed, and as you can take in entire folders, they can be very large.

    I don't know what the XML file system is for either so explaining one by
    the other doesn't help. I was hoping for a use -- a user story -- that
    would help me understand what the point of all this is.

    Tell me as a story: A user wants to ... what? And having a directory of
    large text files in a structured XML text helps because they can
    ... what? And if they were a quarter of the size it would be better
    because ... why?

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon Jun 17 04:45:04 2024
    On Thu, 6 Jun 2024 20:02:55 +0300, Michael S wrote:

    Or, if we want to make a job just a little bit more interesting, we can convert to base94, producing ~9% smaller size than base94 :-)

    You mean smaller than Base64?

    I just spent some hours yesterday implementing the ASCII85 encoding in C
    code. This was something Adobe added to PostScript level 2; not sure if
    anybody else used it.

    By using only 85 instead of 94 printable characters, it could reserve some
    for special uses. For example, four bytes of zero are represented by a
    single “z” character. Also “~” is not used because it is part of the PostScript string delimiter for strings in ASCII85 format.
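
    The per-group core of that encoding is small. A sketch for one full
    4-byte group (stream framing, the final short group and the "~>"
    terminator are left out):

    #include <stdint.h>
    #include <stdio.h>

    /* Read a 4-byte group as a big-endian 32-bit value and write five
       base-85 digits using '!' (33) through 'u' (117); an all-zero group
       collapses to the single character 'z'. */
    static int a85_group(const uint8_t b[4], char out[5])
    {
        uint32_t v = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
                   | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];

        if (v == 0) {
            out[0] = 'z';
            return 1;
        }
        for (int i = 4; i >= 0; i--) {
            out[i] = (char)('!' + v % 85);
            v /= 85;
        }
        return 5;
    }

    int main(void)
    {
        const uint8_t group[4] = { 'M', 'a', 'r', 'y' };
        char out[6] = { 0 };
        int n = a85_group(group, out);

        out[n] = '\0';
        printf("%s\n", out);
        return 0;
    }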

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Jun 17 13:10:12 2024
    On Mon, 17 Jun 2024 04:45:04 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Thu, 6 Jun 2024 20:02:55 +0300, Michael S wrote:

    Or, if we want to make a job just a little bit more interesting, we
    can convert to base94, producing ~9% smaller size than base94 :-)

    You mean smaller than Base64?


    Yes.

    I just spent some hours yesterday implementing the ASCII85 encoding
    in C code. This was something Adobe added to PostScript level 2; not
    sure if anybody else used it.

    By using only 85 instead of 94 printable characters, it could reserve
    some for special uses. For example, four bytes of zero are
    represented by a single “z” character. Also “~” is not used because it is part of the PostScript string delimiter for strings in ASCII85
    format.

    I didn't look at the existing standards.
    The nice thing about base 94 is that you can encode 9 arbitrary octets
    into 11 isgraph() characters and that the code for encode/decode is
    simple and reasonably fast even in absence of 64-bit integer types.

    For base 85 you encode 8 octets to 10 characters, which is even a
    little simpler (or more than a little when 64-bit integers are
    available), but 2.3% less dense.
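
    A sketch of that 9-octets-to-11-characters step, using long division over
    the octets so that nothing wider than a plain unsigned is needed (the
    alphabet here is simply '!' (33) through '~' (126); a real scheme would
    pin down its own ordering):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Treat the 9 octets as a 72-bit big-endian number and write it out as
       11 base-94 digits, least significant digit last. */
    static void base94_encode9(const uint8_t in[9], char out[11])
    {
        uint8_t num[9];

        memcpy(num, in, 9);
        for (int i = 10; i >= 0; i--) {
            unsigned rem = 0;
            for (int j = 0; j < 9; j++) {   /* divide the 72-bit number by 94 */
                unsigned cur = rem * 256u + num[j];
                num[j] = (uint8_t)(cur / 94u);
                rem = cur % 94u;
            }
            out[i] = (char)('!' + rem);
        }
    }

    int main(void)
    {
        uint8_t sample[9];
        char enc[12] = { 0 };

        memcpy(sample, "Mary had ", 9);     /* any 9 octets will do */
        base94_encode9(sample, enc);
        printf("%s\n", enc);
        return 0;
    }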

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)