Compression ratio gets worse with level? #157

Open
nabijaczleweli opened this issue Dec 15, 2024 · 1 comment

nabijaczleweli commented Dec 15, 2024

Given:

seq 0 10000000 | awk '{printf "%08X\n", $1}' | base16 -d > ints
od -tx4 ints -An | tr a-f A-F | base16 -d > ints2

i.e. ints is 4-byte integers [0, 10000000] in big endian, and ints2 is the same in little endian.
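For readers without `base16` at hand, here is a rough Python equivalent of the two files. It is only a sketch: `pack_ints` and `COUNT` are names made up for this example, and the count is kept small for illustration. Note that the `od -tx4` step above byte-swaps only because `od` reads words in host order, so that pipeline assumes a little-endian machine; the `struct` version does not depend on the host.

```python
import struct


def pack_ints(count: int, big_endian: bool) -> bytes:
    """Pack the integers 0..count (inclusive) as 4-byte unsigned values."""
    fmt = (">" if big_endian else "<") + f"{count + 1}I"
    return struct.pack(fmt, *range(count + 1))


# Hypothetical small count; the issue itself uses 10000000.
COUNT = 1000
ints = pack_ints(COUNT, big_endian=True)    # same layout as the file "ints"
ints2 = pack_ints(COUNT, big_endian=False)  # same layout as the file "ints2"

print(ints[4:8].hex())   # the integer 1, big endian    -> 00000001
print(ints2[4:8].hex())  # the integer 1, little endian -> 01000000
```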

Then after

for i in $(seq 0 9); do xz -$i < ints > ints.xz-$i & :; done
for i in $(seq 0 9); do xz -$i < ints2 > ints2.xz-$i & :; done

I see

-rw-r--r-- 1 nabijaczleweli users 446.2k 12-15 17:27 ints2.xz-0
-rw-r--r-- 1 nabijaczleweli users   486k 12-15 17:24 ints2.xz-1
-rw-r--r-- 1 nabijaczleweli users 538.5k 12-15 17:24 ints2.xz-2
-rw-r--r-- 1 nabijaczleweli users 644.2k 12-15 17:24 ints2.xz-3
-rw-r--r-- 1 nabijaczleweli users 767.5k 12-15 17:24 ints2.xz-4
-rw-r--r-- 1 nabijaczleweli users     1M 12-15 17:24 ints2.xz-5
-rw-r--r-- 1 nabijaczleweli users 985.3k 12-15 17:24 ints2.xz-6
-rw-r--r-- 1 nabijaczleweli users   1.3M 12-15 17:24 ints2.xz-7
-rw-r--r-- 1 nabijaczleweli users   2.1M 12-15 17:24 ints2.xz-8
-rw-r--r-- 1 nabijaczleweli users   2.3M 12-15 17:24 ints2.xz-9
-rw-r--r-- 1 nabijaczleweli users   2.3M 12-15 17:27 ints.xz-0
-rw-r--r-- 1 nabijaczleweli users   2.3M 12-15 17:24 ints.xz-1
-rw-r--r-- 1 nabijaczleweli users   2.3M 12-15 17:24 ints.xz-2
-rw-r--r-- 1 nabijaczleweli users   2.3M 12-15 17:24 ints.xz-3
-rw-r--r-- 1 nabijaczleweli users   2.3M 12-15 17:24 ints.xz-4
-rw-r--r-- 1 nabijaczleweli users   2.3M 12-15 17:24 ints.xz-5
-rw-r--r-- 1 nabijaczleweli users   2.3M 12-15 17:24 ints.xz-6
-rw-r--r-- 1 nabijaczleweli users   1.8M 12-15 17:24 ints.xz-7
-rw-r--r-- 1 nabijaczleweli users   2.3M 12-15 17:24 ints.xz-8
-rw-r--r-- 1 nabijaczleweli users   2.3M 12-15 17:24 ints.xz-9

So, besides integer endianness utterly defeating the compressor, (a) xz -7 of ints is significantly smaller than at every other level, and (b) output size grows with the compression level for ints2.

It'd be nice if that were the other way around I think.

Testing on bookworm (5.4.1-0.2).

Member

Larhzu commented Dec 18, 2024

It's a curious result. LZMA SDK 24.09 produces results that, at least roughly, show some similar behavior.

I didn't investigate why it happens. Typically big endian compresses better than little endian. However, here little endian might benefit from the fact that the most random byte is always after 0x00, but again, I didn't actually investigate.
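The layout observation itself is easy to check, even if its effect on the encoder is not. The sketch below (helper names are invented for this example) collects the byte that sits immediately before each integer's lowest, fastest-changing byte. It only confirms the byte layout; it does not show that this is why the little endian file compresses better.

```python
import struct


def pack_ints(count: int, big_endian: bool) -> bytes:
    """Pack the integers 0..count (inclusive) as 4-byte unsigned values."""
    fmt = (">" if big_endian else "<") + f"{count + 1}I"
    return struct.pack(fmt, *range(count + 1))


def bytes_before_low_byte(data: bytes, big_endian: bool) -> set:
    """Values of the byte stored right before each integer's lowest byte."""
    # Big endian: the lowest byte is at offset 3, preceded by the same
    # integer's second-lowest byte. Little endian: the lowest byte is at
    # offset 0, preceded by the previous integer's highest byte.
    low = 3 if big_endian else 0
    return {data[i + low - 1] for i in range(4, len(data), 4)}


COUNT = 100000  # hypothetical, smaller than in the issue
be_context = bytes_before_low_byte(pack_ints(COUNT, True), True)
le_context = bytes_before_low_byte(pack_ints(COUNT, False), False)

print(len(be_context))  # 256: the preceding byte takes every value
print(le_context)       # {0}: the preceding byte is always 0x00
```

All values here are below 2^24, as in the original files, so the highest byte is always zero.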

The differences between compression presets are weirder. If one changes only the dictionary size, keeping everything else the same, in some cases a smaller dictionary makes the file a lot smaller. The same happens with the latest LZMA SDK.
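One way to try the dictionary-size experiment without the command line is Python's `lzma` module, which wraps liblzma. This is a sketch under assumptions: `xz_size` is a made-up helper, the input is much smaller than the files in the issue, and the three dictionary sizes are arbitrary picks, so the anomaly may or may not show up at this scale. The rough CLI counterpart would be something like `xz --lzma2=preset=6,dict=64KiB`.

```python
import lzma
import struct


def pack_ints(count: int, big_endian: bool) -> bytes:
    """Pack the integers 0..count (inclusive) as 4-byte unsigned values."""
    fmt = (">" if big_endian else "<") + f"{count + 1}I"
    return struct.pack(fmt, *range(count + 1))


def xz_size(data: bytes, dict_size: int, preset: int = 6) -> int:
    """Size of the .xz output when only the LZMA2 dictionary size is overridden."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": preset, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


data = pack_ints(200000, big_endian=False)  # hypothetical, smaller input
for dict_size in (64 << 10, 1 << 20, 8 << 20):
    print(dict_size, xz_size(data, dict_size))
```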

I'm not sure if this is only a funny anomaly where a specific input tricks the encoder onto a wrong path, or if there is something worth improving due to these results. Artificial files like this don't represent real-world files well at all. I tried with zstd --ultra -22 too, and that produces a much smaller result from the big endian file: 2.49 MiB (BE) vs. 7.80 MiB (LE). With gzip -9 it's 20.3 MiB (BE) vs. 13.2 MiB (LE).

When nearby bytes have values that are close to each other (these two files, bitmap images, PCM audio, timestamps in a log), a simple delta filter makes a big difference:

$ xz -T1 -c --delta=dist=4 --lzma2=lp=2,lc=2 ints | wc -c
5980
$ xz -T1 -c --delta=dist=4 --lzma2=lp=2,lc=2 ints2 | wc -c
5984

(Typically one wants a 4-byte distance paired with matching LZMA2 options pb=2,lp=2,lc=2, but it doesn't matter above with such an extreme input file.)
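To make the effect of the delta filter more concrete, here is a small Python sketch. `delta_encode` is a hand-written illustration of byte-wise delta coding with a given distance, not code from XZ Utils; the filter chain below mirrors the `--delta=dist=4 --lzma2=lp=2,lc=2` options through the `lzma` module and assumes the default preset 6. The input is a smaller, hypothetical range than in the issue.

```python
import lzma
import struct


def pack_ints(count: int, big_endian: bool) -> bytes:
    """Pack the integers 0..count (inclusive) as 4-byte unsigned values."""
    fmt = (">" if big_endian else "<") + f"{count + 1}I"
    return struct.pack(fmt, *range(count + 1))


def delta_encode(data: bytes, dist: int) -> bytes:
    """Subtract from each byte the byte `dist` positions earlier (mod 256)."""
    out = bytearray(data)  # the first `dist` bytes are kept unchanged
    for i in range(dist, len(data)):
        out[i] = (data[i] - data[i - dist]) & 0xFF
    return bytes(out)


def xz_delta_size(data: bytes, dist: int = 4) -> int:
    """Size of the .xz output with a Delta + LZMA2 filter chain."""
    filters = [
        {"id": lzma.FILTER_DELTA, "dist": dist},
        {"id": lzma.FILTER_LZMA2, "preset": 6, "lp": 2, "lc": 2},
    ]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


be = pack_ints(200000, big_endian=True)
le = pack_ints(200000, big_endian=False)

# After delta coding, consecutive integers turn into the same repeated pattern.
print(delta_encode(be[:12], 4).hex())  # 000000000000000100000001
print(delta_encode(le[:12], 4).hex())  # 000000000100000001000000

for name, data in (("BE", be), ("LE", le)):
    print(name, len(lzma.compress(data, preset=6)), xz_delta_size(data))
```

The last loop prints the plain preset 6 size next to the Delta+LZMA2 size for each byte order; on this kind of input the second number should be far smaller.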

So when you have a special kind of data, specializing the compression method helps. For example, with PCM audio, Delta+LZMA2 is better than plain LZMA2. But FLAC and other special-purpose compressors produce much smaller results and do it much faster too.

The encoder in XZ Utils is based on an old LZMA SDK version. Some day it should be updated. Any encoder tweaks need to wait for that, and I likely won't touch the encoder in the very near future.
