• Re: Script to conditionally find and compress files recursively

    From D@21:1/5 to J Newman on Tue Jun 11 10:51:45 2024
    On Tue, 11 Jun 2024, J Newman wrote:

    Hi, I'm interested in writing a script that will:

    1. Find and compress files recursively
    2. After the first 5 seconds of compressing, if the compression ratio >1 (i.e. the compressed file will be larger than the uncompressed file), it tries another compression algorithm.
    3. If the other compression algorithm still has a ratio >1, it tries another algorithm, until a list is exhausted.
    4. If the list is exhausted, it skips compressing that file.

    Any suggestions on how to proceed?


    Difficult to estimate compression ratio without analyzing the entire file.
    In theory you could say something based on the file type, but that's the
    best I can come up with.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Joe Beanfish@21:1/5 to J Newman on Tue Jun 11 14:58:23 2024
    On Tue, 11 Jun 2024 14:53:27 +0800, J Newman wrote:

    Hi, I'm interested in writing a script that will:

    1. Find and compress files recursively
    2. After the first 5 seconds of compressing, if the compression ratio >1 (i.e. the compressed file will be larger than the uncompressed file), it tries another compression algorithm.
    3. If the other compression algorithm still has a ratio >1, it tries
    another algorithm, until a list is exhausted.
    4. If the list is exhausted, it skips compressing that file.

    Any suggestions on how to proceed?

    You could use dd to extract a representative chunk of the file to
    compress and compare size.

    uncompressedsize=$(dd status=none if="$file" bs=1M count=1 | wc -c)
    compressedsize=$(dd status=none if="$file" bs=1M count=1 | $compresscmd | wc -c)

    You could get fancy and try all the compression commands you have
    and pick the one with smallest output for the actual compression.
    That's all assuming the beginning of the file is representative of
    the content throughout. If it's not, no way to tell without compressing
    the whole thing.
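
    A rough sketch along those lines, assuming GNU coreutils and that gzip,
    bzip2 and xz are installed (the candidate list and the 1 MiB sample size
    are only placeholders, not something from the original post):

    #!/bin/bash
    # Guess the best compressor for a file from its first 1 MiB.
    file=$1
    sample=$(mktemp)
    dd status=none if="$file" bs=1M count=1 of="$sample"
    uncompressed=$(wc -c < "$sample")

    best_cmd=""
    best_size=$uncompressed
    for cmd in gzip bzip2 xz; do
        size=$($cmd -c "$sample" | wc -c)
        if [ "$size" -lt "$best_size" ]; then
            best_size=$size
            best_cmd=$cmd
        fi
    done
    rm -f "$sample"

    if [ -n "$best_cmd" ]; then
        echo "best guess for $file: $best_cmd ($best_size/$uncompressed bytes)"
    else
        echo "no compressor beat the raw sample; skip $file"
    fi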

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Grant Taylor@21:1/5 to J Newman on Tue Jun 11 22:21:00 2024
    On 6/11/24 01:53, J Newman wrote:
    Any suggestions on how to proceed?

    As others have said, it's very difficult to tell within the first five
    seconds what the ultimate compression ratio will be.

    If you have the disk space, compress using all of the compression
    options and then remove all but the smallest file.

    Then go on to the next file.
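
    A brute-force sketch of that approach, assuming GNU stat and that gzip,
    bzip2 and xz are the candidates (the output naming is simplified and the
    disk-space check is left out):

    #!/bin/bash
    # Compress with every candidate, keep only the smallest output,
    # and keep the original if nothing beats it.
    file=$1
    best=""
    best_size=$(stat -c %s "$file")

    for cmd in gzip bzip2 xz; do
        $cmd -c "$file" > "$file.$cmd"
        size=$(stat -c %s "$file.$cmd")
        if [ "$size" -lt "$best_size" ]; then
            best_size=$size
            best="$file.$cmd"
        fi
    done

    for cmd in gzip bzip2 xz; do
        [ "$file.$cmd" = "$best" ] || rm -f "$file.$cmd"
    done

    if [ -n "$best" ]; then
        rm -f "$file"
    else
        echo "kept $file uncompressed"
    fi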



    --
    Grant. . . .

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Kettlewell@21:1/5 to Grant Taylor on Wed Jun 12 08:17:11 2024
    Grant Taylor <gtaylor@tnetconsulting.net> writes:
    On 6/11/24 01:53, J Newman wrote:
    Any suggestions on how to proceed?

    As others have said, it's very difficult to tell within the first five seconds what the ultimate compression ratio will be.

    Not just difficult but impossible in general: the input file could
    change character in its second half, switching the overall result from
    one that is (for example) a gzip win to an xz win.

    --
    https://www.greenend.org.uk/rjk/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From D@21:1/5 to Richard Kettlewell on Wed Jun 12 10:13:43 2024
    On Wed, 12 Jun 2024, Richard Kettlewell wrote:

    Grant Taylor <gtaylor@tnetconsulting.net> writes:
    On 6/11/24 01:53, J Newman wrote:
    Any suggestions on how to proceed?

    As others have said, it's very difficult to tell within the first five
    seconds what the ultimate compression ratio will be.

    Not just difficult but impossible in general: the input file could
    change character in its second half, switching the overall result from
    one that is (for example) a gzip win to an xz win.



    This is true! The only things I can imagine are parsing the file type and,
    from that file type, drawing conclusions about the compressibility of the
    data, or doing a flawed statistical analysis; but as said, the end could be
    vastly different from the start.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anssi Saari@21:1/5 to J Newman on Thu Jun 13 10:13:20 2024
    J Newman <jenniferkatenewman@gmail.com> writes:

    It's true that you cannot tell within the first 5 seconds what the
    ultimate compression ratio will be, but it seems to me (from
    compressing avi/mp4/mov files with lzma -9evv) that you can tell
    within +/- 5% to a high degree of confidence, what the ultimate
    compression ratio will be given the first 5 seconds.

    Well then, I believe the solution was already posted. Grab 5% of your
    files with dd and see how it compresses.

    I'm a little curious, what kind of space savings do you expect to get by
    doing this? And wouldn't it make more sense to re-encode for lower
    bitrate if space saving is your goal?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From D@21:1/5 to J Newman on Thu Jun 13 11:55:23 2024
    On Thu, 13 Jun 2024, J Newman wrote:

    On 12/06/2024 16:13, D wrote:


    On Wed, 12 Jun 2024, Richard Kettlewell wrote:

    Grant Taylor <gtaylor@tnetconsulting.net> writes:
    On 6/11/24 01:53, J Newman wrote:
    Any suggestions on how to proceed?

    As others have said, it's very difficult to tell within the first five
    seconds what the ultimate compression ratio will be.

    Not just difficult but impossible in general: the input file could
    change character in its second half, switching the overall result from
    one that is (for example) a gzip win to an xz win.



    This is true! The only things I can imagine are parsing the file type and,
    from that file type, drawing conclusions about the compressibility of the
    data, or doing a flawed statistical analysis; but as said, the end could be
    vastly different from the start.

    OK, good point... as mentioned elsewhere, my experience is with compressing
    video files with lzma.

    But if we accept that the script will sometimes make mistakes in choosing
    the compression algorithm, which option do you suggest as the one with the
    fewest errors: parsing the file type, or trying to compress each file for
    the first 5 seconds?


    Hmm, I'd say parse file types first, and perhaps keep a little database
    that maps file type to compression algorithm; if that doesn't yield
    anything, proceed with "brute force".
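
    A rough sketch of such a mapping in bash, assuming the file(1) utility is
    available and gzip/xz/zstd are the candidate compressors (the table
    entries are only illustrative):

    #!/bin/bash
    # Map a file's MIME type to a preferred compressor; fall back to
    # "brute force" (try everything) when the type is unknown.
    declare -A prefer=(
        [text/plain]="xz"
        [application/json]="zstd"
        [image/jpeg]="skip"    # already compressed, don't bother
        [video/mp4]="skip"
    )

    file=$1
    mime=$(file --brief --mime-type "$file")
    choice=${prefer[$mime]:-bruteforce}

    case $choice in
        skip)       echo "skipping $file ($mime)" ;;
        bruteforce) echo "no mapping for $mime, try every compressor on $file" ;;
        *)          echo "compressing $file with $choice"; "$choice" -k "$file" ;;
    esac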

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From D@21:1/5 to Anssi Saari on Thu Jun 13 11:55:09 2024
    On Thu, 13 Jun 2024, Anssi Saari wrote:

    J Newman <jenniferkatenewman@gmail.com> writes:

    It's true that you cannot tell within the first 5 seconds what the
    ultimate compression ratio will be, but it seems to me (from
    compressing avi/mp4/mov files with lzma -9evv) that you can tell
    within +/- 5% to a high degree of confidence, what the ultimate
    compression ratio will be given the first 5 seconds.

    Well then, I believe the solution was already posted. Grab 5% of your
    files with dd and see how it compresses.

    I'm a little curious, what kind of space savings do you expect to get by doing this? And wouldn't it make more sense to re-encode for lower
    bitrate if space saving is your goal?


    If it's about space saving, don't forget deduplication. Alternatively,
    depending on your file system of choice, you could use file-system
    functionality to save space as well, but caveat emptor: always keep
    off-site (or at least off-machine) backups.
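
    For example (both assume tools/filesystems you may not have: fdupes for
    deduplication, btrfs for transparent compression; the paths are
    placeholders):

    # Report duplicate files under a directory.
    fdupes -r /path/to/data

    # On btrfs, enable transparent compression for a mount:
    #   mount -o compress=zstd /dev/sdX /mnt/data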

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Computer Nerd Kev@21:1/5 to Anssi Saari on Fri Jun 14 09:06:21 2024
    Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> wrote:
    J Newman <jenniferkatenewman@gmail.com> writes:

    It's true that you cannot tell within the first 5 seconds what the
    ultimate compression ratio will be, but it seems to me (from
    compressing avi/mp4/mov files with lzma -9evv) that you can tell
    within +/- 5% to a high degree of confidence, what the ultimate
    compression ratio will be given the first 5 seconds.

    Well then, I believe the solution was already posted. Grab 5% of your
    files with dd and see how it compresses.

    The solution that I see grabs the first 1MB, but it would make more
    sense to sample e.g. 1% of the file size in five places within the
    file. For a 100MB file that's a 1MB sample, and 100MB/5 = 20MB, so use
    dd to grab one 1MB sample from the start of the file and then four more
    at an offset that increments by 20MB each time. Store these separately,
    compress them separately, then average the compression ratio of all
    the samples.
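
    A sketch of that sampling scheme, assuming GNU dd (for iflag=skip_bytes,
    count_bytes) and gzip as an arbitrary test compressor:

    #!/bin/bash
    # Sample ~1% of the file at five evenly spaced offsets and report
    # the combined compression ratio of the samples.
    file=$1
    filesize=$(stat -c %s "$file")
    samplesize=$(( filesize / 100 ))   # each sample is 1% of the file
    step=$(( filesize / 5 ))
    tmp=$(mktemp)

    total_in=0
    total_out=0
    for i in 0 1 2 3 4; do
        dd status=none if="$file" of="$tmp" bs=64K \
           iflag=skip_bytes,count_bytes skip=$(( i * step )) count="$samplesize"
        total_in=$(( total_in + $(stat -c %s "$tmp") ))
        total_out=$(( total_out + $(gzip -c "$tmp" | wc -c) ))
    done
    rm -f "$tmp"

    echo "estimated ratio for $file: $total_out / $total_in bytes"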

    I'm a little curious, what kind of space savings do you expect to get by doing this? And wouldn't it make more sense to re-encode for lower
    bitrate if space saving is your goal?

    Maybe he's using lossless video compression? Otherwise yes it seems
    like the wrong approach.

    --
    __ __
    #_ < |\| |< _#

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Computer Nerd Kev@21:1/5 to Computer Nerd Kev on Fri Jun 14 12:25:06 2024
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> wrote:

    Well then, I believe the solution was already posted. Grab 5% of your
    files with dd and see how it compresses.

    The solution that I see grabs the first 1MB, but it would make more
    sense to sample eg. 1% of the file size in five places within the
    file. 100MB file = 1MB sample, 100MB/5 = 20MB, so use dd to grab
    one 1MB sample from the start of the file then four more at an
    offset that increments by 20MB each time. Store these separately,
    compress them separately, then average the compression ratio of all
    the samples.

    Also, for some types of data (if it's not all video), like text, some
    more advanced compressors build a dictionary to better compress
    larger files. But this requires a minimum file size, so small
    samples might not reflect the compression ratio of the whole file
    with a dictionary included. A solution is to pre-generate a
    dictionary from a collection of the same type of files you're
    compressing; then you could compress the small samples using that
    dictionary and get a more accurate result.
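
    zstd can do exactly that; a sketch, assuming zstd is installed and that
    the sample directory and file names are placeholders:

    # Train a shared dictionary on a collection of files of the same kind.
    zstd --train samples/*.txt -o mytype.dict

    # Compress a small sample with that dictionary so the dictionary
    # overhead doesn't skew the estimated ratio.
    dd status=none if=bigfile.txt bs=1M count=1 | zstd -D mytype.dict | wc -c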

    --
    __ __
    #_ < |\| |< _#

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Grant Taylor@21:1/5 to All on Thu Jun 13 22:35:26 2024
    On 6/13/24 04:55, D wrote:
    perhaps have a little database that maps file type to compression algorithm

    case ${FILE##*.} in
        txt)
            # ...
            ;;
        jpg|jpeg)
            # JPEG
            ;;
        *)
            echo "unknown file type"
            ;;
    esac

    ;-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From D@21:1/5 to Grant Taylor on Fri Jun 14 11:07:15 2024
    On Thu, 13 Jun 2024, Grant Taylor wrote:

    On 6/13/24 04:55, D wrote:
    perhaps have a little database that maps file type to compression algorithm

    case ${FILE##*.} in
        txt)
            # ...
            ;;
        jpg|jpeg)
            # JPEG
            ;;
        *)
            echo "unknown file type"
            ;;
    esac

    ;-)


    See... halfway there! Just cut n' paste and fill in the details. =)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)