• Re: Script to conditionally find and compress files recursively

    From D@21:1/5 to J Newman on Tue Jun 11 10:51:45 2024
    On Tue, 11 Jun 2024, J Newman wrote:

    Hi, I'm interested in writing a script that will:

    1. Find and compress files recursively
    2. After the first 5 seconds of compressing, if the compression ratio >1 (i.e. the compressed file will be larger than the uncompressed file), it tries another compression algorithm.
    3. If the other compression algorithm still has a ratio >1, it tries another algorithm, until a list is exhausted.
    4. If the list is exhausted, it skips compressing that file.

    Any suggestions on how to proceed?


    Difficult to estimate compression ratio without analyzing the entire file.
    In theory you could say something based on the file type, but that's the
    best I can come up with.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Joe Beanfish@21:1/5 to J Newman on Tue Jun 11 14:58:23 2024
    On Tue, 11 Jun 2024 14:53:27 +0800, J Newman wrote:

    Hi, I'm interested in writing a script that will:

    1. Find and compress files recursively
    2. After the first 5 seconds of compressing, if the compression ratio >1 (i.e. the compressed file will be larger than the uncompressed file), it tries another compression algorithm.
    3. If the other compression algorithm still has a ratio >1, it tries
    another algorithm, until a list is exhausted.
    4. If the list is exhausted, it skips compressing that file.

    Any suggestions on how to proceed?

    You could use dd to extract a representative chunk of the file to
    compress and compare size.

    uncompressedsize=$(dd status=none if="$file" bs=1M count=1 | wc -c)
    compressedsize=$(dd status=none if="$file" bs=1M count=1 | $compresscmd | wc -c)

    You could get fancy and try all the compression commands you have
    and pick the one with smallest output for the actual compression.
    That's all assuming the beginning of the file is representative of
    the content throughout. If it's not, no way to tell without compressing
    the whole thing.
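
    A rough sketch along those lines, assuming GNU coreutils and that gzip,
    bzip2 and xz are installed (the candidate list and the 1 MiB sample size
    are only placeholders, not something from the original post):

    #!/bin/bash
    # Guess the best compressor for a file from its first 1 MiB.
    file=$1
    sample=$(mktemp)
    dd status=none if="$file" bs=1M count=1 of="$sample"
    uncompressed=$(wc -c < "$sample")

    best_cmd=""
    best_size=$uncompressed
    for cmd in gzip bzip2 xz; do
        size=$($cmd -c "$sample" | wc -c)
        if [ "$size" -lt "$best_size" ]; then
            best_size=$size
            best_cmd=$cmd
        fi
    done
    rm -f "$sample"

    if [ -n "$best_cmd" ]; then
        echo "best guess for $file: $best_cmd ($best_size/$uncompressed bytes)"
    else
        echo "no compressor beat the raw sample; skip $file"
    fi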

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Grant Taylor@21:1/5 to J Newman on Tue Jun 11 22:21:00 2024
    On 6/11/24 01:53, J Newman wrote:
    Any suggestions on how to proceed?

    As others have said, it's very difficult to tell within the first five
    seconds what the ultimate compression ratio will be.

    If you have the disk space, compress using all of the compression
    options and then remove all but the smallest file.

    Then go on to the next file.
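
    A brute-force sketch of that approach, assuming GNU stat and that gzip,
    bzip2 and xz are the candidates (the output naming is simplified and the
    disk-space check is left out):

    #!/bin/bash
    # Compress with every candidate, keep only the smallest output,
    # and keep the original if nothing beats it.
    file=$1
    best=""
    best_size=$(stat -c %s "$file")

    for cmd in gzip bzip2 xz; do
        $cmd -c "$file" > "$file.$cmd"
        size=$(stat -c %s "$file.$cmd")
        if [ "$size" -lt "$best_size" ]; then
            best_size=$size
            best="$file.$cmd"
        fi
    done

    for cmd in gzip bzip2 xz; do
        [ "$file.$cmd" = "$best" ] || rm -f "$file.$cmd"
    done

    if [ -n "$best" ]; then
        rm -f "$file"
    else
        echo "kept $file uncompressed"
    fi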



    --
    Grant. . . .

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Kettlewell@21:1/5 to Grant Taylor on Wed Jun 12 08:17:11 2024
    Grant Taylor <gtaylor@tnetconsulting.net> writes:
    On 6/11/24 01:53, J Newman wrote:
    Any suggestions on how to proceed?

    As others have said, it's very difficult to tell within the first five seconds what the ultimate compression ratio will be.

    Not just difficult but impossible in general: the input file could
    change character in its second half, switching the overall result from
    one that is (for example) a gzip win to an xz win.

    --
    https://www.greenend.org.uk/rjk/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From D@21:1/5 to Richard Kettlewell on Wed Jun 12 10:13:43 2024
    On Wed, 12 Jun 2024, Richard Kettlewell wrote:

    Grant Taylor <gtaylor@tnetconsulting.net> writes:
    On 6/11/24 01:53, J Newman wrote:
    Any suggestions on how to proceed?

    As others have said, it's very difficult to tell within the first five
    seconds what the ultimate compression ratio will be.

    Not just difficult but impossible in general: the input file could
    change character in its second half, switching the overall result from
    one that is (for example) a gzip win to an xz win.



    This is true! The only things I can imagine are parsing the file type and,
    from that file type, drawing conclusions about the compressibility of the
    data, or doing a flawed statistical analysis; but as said, the end could be
    vastly different from the start.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anssi Saari@21:1/5 to J Newman on Thu Jun 13 10:13:20 2024
    J Newman <jenniferkatenewman@gmail.com> writes:

    It's true that you cannot tell within the first 5 seconds what the
    ultimate compression ratio will be, but it seems to me (from
    compressing avi/mp4/mov files with lzma -9evv) that you can tell
    within +/- 5% to a high degree of confidence, what the ultimate
    compression ratio will be given the first 5 seconds.

    Well then, I believe the solution was already posted. Grab 5% of your
    files with dd and see how it compresses.

    I'm a little curious, what kind of space savings do you expect to get by
    doing this? And wouldn't it make more sense to re-encode for lower
    bitrate if space saving is your goal?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From D@21:1/5 to J Newman on Thu Jun 13 11:55:23 2024
    On Thu, 13 Jun 2024, J Newman wrote:

    On 12/06/2024 16:13, D wrote:


    On Wed, 12 Jun 2024, Richard Kettlewell wrote:

    Grant Taylor <gtaylor@tnetconsulting.net> writes:
    On 6/11/24 01:53, J Newman wrote:
    Any suggestions on how to proceed?

    As others have said, it's very difficult to tell within the first five
    seconds what the ultimate compression ratio will be.

    Not just difficult but impossible in general: the input file could
    change character in its second half, switching the overall result from
    one that is (for example) a gzip win to an xz win.



    This is true! The only things I can imagine are parsing the file type and,
    from that file type, drawing conclusions about the compressibility of the
    data, or doing a flawed statistical analysis; but as said, the end could be
    vastly different from the start.

    OK, good point... as mentioned elsewhere, my experience is with compressing
    video files with lzma.

    But if we accept that the script will sometimes make mistakes in choosing
    the compression algorithm, which option do you suggest as the one with the
    fewest errors: parsing the file type, or trying to compress each file for
    the first 5 seconds?


    Hmm, I'd say parse file types first, and perhaps keep a little database
    that maps file type to compression algorithm; if that doesn't yield
    anything, proceed with "brute force".
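
    A rough sketch of such a mapping in bash, assuming the file(1) utility is
    available and gzip/xz/zstd are the candidate compressors (the table
    entries are only illustrative):

    #!/bin/bash
    # Map a file's MIME type to a preferred compressor; fall back to
    # "brute force" (try everything) when the type is unknown.
    declare -A prefer=(
        [text/plain]="xz"
        [application/json]="zstd"
        [image/jpeg]="skip"    # already compressed, don't bother
        [video/mp4]="skip"
    )

    file=$1
    mime=$(file --brief --mime-type "$file")
    choice=${prefer[$mime]:-bruteforce}

    case $choice in
        skip)       echo "skipping $file ($mime)" ;;
        bruteforce) echo "no mapping for $mime, try every compressor on $file" ;;
        *)          echo "compressing $file with $choice"; "$choice" -k "$file" ;;
    esac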

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From D@21:1/5 to Anssi Saari on Thu Jun 13 11:55:09 2024
    On Thu, 13 Jun 2024, Anssi Saari wrote:

    J Newman <jenniferkatenewman@gmail.com> writes:

    It's true that you cannot tell within the first 5 seconds what the
    ultimate compression ratio will be, but it seems to me (from
    compressing avi/mp4/mov files with lzma -9evv) that you can tell
    within +/- 5% to a high degree of confidence, what the ultimate
    compression ratio will be given the first 5 seconds.

    Well then, I believe the solution was already posted. Grab 5% of your
    files with dd and see how it compresses.

    I'm a little curious, what kind of space savings do you expect to get by doing this? And wouldn't it make more sense to re-encode for lower
    bitrate if space saving is your goal?


    If it's about space saving, don't forget deduplication. Alternatively,
    depending on your file system of choice, you could use file-system
    functionality to save space as well, but caveat emptor: always keep
    off-site (or at least off-machine) backups.
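
    For example (both assume tools/filesystems you may not have: fdupes for
    deduplication, btrfs for transparent compression; the paths are
    placeholders):

    # Report duplicate files under a directory.
    fdupes -r /path/to/data

    # On btrfs, enable transparent compression for a mount:
    #   mount -o compress=zstd /dev/sdX /mnt/data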

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Computer Nerd Kev@21:1/5 to Anssi Saari on Fri Jun 14 09:06:21 2024
    Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> wrote:
    J Newman <jenniferkatenewman@gmail.com> writes:

    It's true that you cannot tell within the first 5 seconds what the
    ultimate compression ratio will be, but it seems to me (from
    compressing avi/mp4/mov files with lzma -9evv) that you can tell
    within +/- 5% to a high degree of confidence, what the ultimate
    compression ratio will be given the first 5 seconds.

    Well then, I believe the solution was already posted. Grab 5% of your
    files with dd and see how it compresses.

    The solution that I see grabs the first 1MB, but it would make more
    sense to sample e.g. 1% of the file size in five places within the
    file. For a 100MB file that's a 1MB sample, and 100MB/5 = 20MB, so use
    dd to grab one 1MB sample from the start of the file and then four more
    at an offset that increments by 20MB each time. Store these separately,
    compress them separately, then average the compression ratio of all
    the samples.
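
    A sketch of that sampling scheme, assuming GNU dd (for iflag=skip_bytes,
    count_bytes) and gzip as an arbitrary test compressor:

    #!/bin/bash
    # Sample ~1% of the file at five evenly spaced offsets and report
    # the combined compression ratio of the samples.
    file=$1
    filesize=$(stat -c %s "$file")
    samplesize=$(( filesize / 100 ))   # each sample is 1% of the file
    step=$(( filesize / 5 ))
    tmp=$(mktemp)

    total_in=0
    total_out=0
    for i in 0 1 2 3 4; do
        dd status=none if="$file" of="$tmp" bs=64K \
           iflag=skip_bytes,count_bytes skip=$(( i * step )) count="$samplesize"
        total_in=$(( total_in + $(stat -c %s "$tmp") ))
        total_out=$(( total_out + $(gzip -c "$tmp" | wc -c) ))
    done
    rm -f "$tmp"

    echo "estimated ratio for $file: $total_out / $total_in bytes"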

    I'm a little curious, what kind of space savings do you expect to get by doing this? And wouldn't it make more sense to re-encode for lower
    bitrate if space saving is your goal?

    Maybe he's using lossless video compression? Otherwise yes it seems
    like the wrong approach.

    --
    __ __
    #_ < |\| |< _#

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Computer Nerd Kev@21:1/5 to Computer Nerd Kev on Fri Jun 14 12:25:06 2024
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> wrote:

    Well then, I believe the solution was already posted. Grab 5% of your
    files with dd and see how it compresses.

    The solution that I see grabs the first 1MB, but it would make more
    sense to sample eg. 1% of the file size in five places within the
    file. 100MB file = 1MB sample, 100MB/5 = 20MB, so use dd to grab
    one 1MB sample from the start of the file then four more at an
    offset that increments by 20MB each time. Store these separately,
    compress them separately, then average the compression ratio of all
    the samples.

    Also, for some types of data (if it's not all video), like text, some
    more advanced compressors build a dictionary to better compress
    larger files. But this requires a minimum file size, so small
    samples might not reflect the compression ratio of the whole file
    with a dictionary included. A solution is to pre-generate a
    dictionary from a collection of the same type of files you're
    compressing; then you could compress the small samples using that
    dictionary and get a more accurate result.
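
    zstd can do exactly that; a sketch, assuming zstd is installed and that
    the sample directory and file names are placeholders:

    # Train a shared dictionary on a collection of files of the same kind.
    zstd --train samples/*.txt -o mytype.dict

    # Compress a small sample with that dictionary so the dictionary
    # overhead doesn't skew the estimated ratio.
    dd status=none if=bigfile.txt bs=1M count=1 | zstd -D mytype.dict | wc -c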

    --
    __ __
    #_ < |\| |< _#

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Grant Taylor@21:1/5 to All on Thu Jun 13 22:35:26 2024
    On 6/13/24 04:55, D wrote:
    perhaps have a little database that maps file type to compression algorithm

    case ${FILE##*.} in
        txt)
            # ...
            ;;
        jpg|jpeg)
            # JPEG
            ;;
        *)
            echo "unknown file type"
            ;;
    esac

    ;-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From D@21:1/5 to Grant Taylor on Fri Jun 14 11:07:15 2024
    On Thu, 13 Jun 2024, Grant Taylor wrote:

    On 6/13/24 04:55, D wrote:
    perhaps have a little database that maps file type to compression algorithm

    case ${FILE##*.} in
        txt)
            # ...
            ;;
        jpg|jpeg)
            # JPEG
            ;;
        *)
            echo "unknown file type"
            ;;
    esac

    ;-)


    See... halfway there! Just cut n' paste and fill in the details. =)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)