
compression usage for data repos #107

Open
rptaylor opened this issue Mar 10, 2022 · 13 comments
Labels: discussion needed, enhancement, help wanted, low priority

Comments

@rptaylor
Contributor

rptaylor commented Mar 10, 2022

By default CVMFS compresses files. For data repos, however, different considerations apply: some datasets consist of incompressible binary data (or a data format that is already natively compressed), while others may be highly compressible.

Of course, if compressible files are stored uncompressed, storage space and bandwidth are wasted, while if files are double compressed, CPU time is wasted processing them on the publisher and, more importantly, on clients, slowing down data access. In some cases double compression can actually increase file size.

Using the gateway interface we can address this because we have the option of controlling compression on each transaction like this:
cvmfs_server publish -Z none # don't compress
cvmfs_server publish -Z default # compress

Now I have set the data repo back to compression by default. So if -Z none is exposed to analysts as a publishing option in the scripts, they can use it to avoid double compression when incompressible binary data is published.
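
For illustration only, a publishing wrapper could expose that choice as a flag; the script name, flag, and argument handling below are hypothetical and not part of the existing publishing scripts:

$ cat publish_dataset.sh
#!/bin/bash
# Hypothetical wrapper exposing the compression choice to analysts.
# Usage: ./publish_dataset.sh <repository> [--no-compress]

repo="$1"
compression="default"
if [ "$2" = "--no-compress" ]; then
    compression="none"
fi

# Pass the chosen setting straight through to the gateway publish command.
cvmfs_server publish -Z "$compression" "$repo"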

@mboisson
Member

The problem though is that it isn't trivial to figure out which one is best for a specific transaction.

@rptaylor
Contributor Author

Right. I am supposing that the analysts publishing and curating the content should have some domain-specific knowledge of the nature of the dataset. For example, I know that the ROOT format typically compresses the contents of the data structures in its files.

Barring that, there could be a helper script (maybe a pre-transaction hook using the publisher server tools) that looks at the files and guesses, though it might not be as effective as a human at understanding the nature of the data. For example, here is a list of file extensions our backup software uses to avoid double compression; a rough sketch of such a helper follows the list.

* Don't waste time trying to compress these files during backup
   exclude.compression "/.../*.ace"
   exclude.compression "/.../*.arc"
   exclude.compression "/.../*.arj"
   exclude.compression "/.../*.avi"
   exclude.compression "/.../*.bzip2"
   exclude.compression "/.../*.cab"
   exclude.compression "/.../*.ear"
   exclude.compression "/.../*.exe"
   exclude.compression "/.../*.gho"
   exclude.compression "/.../*.gif"
   exclude.compression "/.../*.gpg"
   exclude.compression "/.../*.gzip"
   exclude.compression "/.../*.gz"
   exclude.compression "/.../*.jar"
   exclude.compression "/.../*.jpeg"
   exclude.compression "/.../*.jpg"
   exclude.compression "/.../*.lha"
   exclude.compression "/.../*.lzh"
   exclude.compression "/.../*.mov"
   exclude.compression "/.../*.mp3"
   exclude.compression "/.../*.mpeg"
   exclude.compression "/.../*.mpg"
   exclude.compression "/.../*.png"
   exclude.compression "/.../*.psd"
   exclude.compression "/.../*.rar"
   exclude.compression "/.../*.rrd"
   exclude.compression "/.../*.sea"
   exclude.compression "/.../*.sit"
   exclude.compression "/.../*.sitx"
   exclude.compression "/.../*.swf"
   exclude.compression "/.../*.tar.bz2"
   exclude.compression "/.../*.tar.gz"
   exclude.compression "/.../*.tgz"
   exclude.compression "/.../*.tiff"
   exclude.compression "/.../*.tif"
   exclude.compression "/.../*.war"
   exclude.compression "/.../*.wav"
   exclude.compression "/.../*.Z"
   exclude.compression "/.../*.zip"
   exclude.compression "/.../*.zoo"
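
As a minimal sketch of that kind of helper, the script below suggests a -Z setting based on the extensions found in a staging directory. The extension list, the 50% threshold, and the script name are placeholders invented for illustration:

$ cat guess_compression.sh
#!/bin/bash
# Sketch of a pre-transaction helper: suggest -Z none when most files in a
# staging directory already look compressed, judging by extension alone.
# The extension list and the 50% threshold are arbitrary choices.

dir="$1"
compressed_ext='gz|tgz|bz2|xz|zip|rar|7z|jpg|jpeg|png|gif|mp3|mp4|avi|mov'

total=$(find "$dir" -type f | wc -l)
already=$(find "$dir" -type f | grep -E -i -c "\.(${compressed_ext})$")

if [ "$total" -gt 0 ] && [ $((100 * already / total)) -ge 50 ]; then
    echo "-Z none"
else
    echo "-Z default"
fi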

Anyway we can consider it as a potential optimization.

@mboisson
Member

mboisson commented Apr 5, 2022

I really can't think of a way to implement that. Software or dataset installers don't necessarily have domain knowledge of what they are installing. Maybe @jshleap has an idea?

@mboisson added the enhancement, help wanted, and discussion needed labels on Apr 5, 2022
@jshleap

jshleap commented Apr 5, 2022

It is complex because you can also have a mix of types in a dataset. The way we are doing it in the biorepo is to process all files known to be compressed (like the list above) separately from those that aren't. That, of course, means some files will slip through because they don't have a known extension, but they can easily be added to the array if someone knows they are incompressible.

@mboisson
Member

mboisson commented Apr 5, 2022

But how do you do that?

@jshleap

jshleap commented Apr 5, 2022

The way we push is very different from the RSNT's: we loop over individual files, checking size and count, and if the threshold is passed, then we push. This allows us to loop over the files that are not in the compressed-extensions array separately.
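
The batching loop described here might look roughly like the following sketch; the thresholds, the push_batch placeholder, and the script name are invented for illustration and are not the actual BioRepo push_data script:

$ cat batch_push_sketch.sh
#!/bin/bash
# Sketch of the batching idea: accumulate files until a size or count
# threshold is reached, then push the batch. Thresholds and the push step
# are placeholders only.

max_batch_bytes=$((10 * 1024 * 1024 * 1024))   # 10 GB per batch (arbitrary)
max_batch_files=10000                          # 10k files per batch (arbitrary)

batch_bytes=0
batch_files=0
batch_list=$(mktemp)

push_batch() {
    # Placeholder: a real script would open a transaction, copy the files
    # listed in "$batch_list", and run cvmfs_server publish (with or without -Z none).
    echo "pushing $batch_files files ($batch_bytes bytes)"
    : > "$batch_list"
    batch_bytes=0
    batch_files=0
}

while IFS= read -r -d '' f; do
    size=$(stat -c %s "$f")
    printf '%s\n' "$f" >> "$batch_list"
    batch_bytes=$((batch_bytes + size))
    batch_files=$((batch_files + 1))
    if [ "$batch_bytes" -ge "$max_batch_bytes" ] || [ "$batch_files" -ge "$max_batch_files" ]; then
        push_batch
    fi
done < <(find "$1" -type f -print0)

# Push whatever is left over.
[ "$batch_files" -gt 0 ] && push_batch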

@mboisson
Member

mboisson commented Apr 5, 2022

Where is that script ?

@jshleap

jshleap commented Apr 5, 2022

In this private repo: https://github.com/ComputeCanada/BioRepo/blob/main/scripts/push_data. I can add you (but don't judge :P)

@mboisson
Member

mboisson commented Apr 5, 2022

If you loop through all files, something like this would be more useful than a fixed list based on extensions:

$ cat compress_factor.sh
#!/bin/bash
# Estimate how compressible a file is by gzipping a sample of it.

file="$1"

# Compress the first 20 kB of the file and measure the result in bytes.
compressed_size=$(dd if="$file" count=20 bs=1024 2>/dev/null | gzip -c -9 | wc -c)
# Integer ratio of the 20 kB sample to its compressed size (1 means essentially incompressible).
compression_factor=$((20*1024 / $compressed_size))
echo $compression_factor

It reads the first 20 kB, compresses it, and calculates a compression factor. It won't be reliable if the file is too small.

$ for f in *.*; do echo $f; ./compress_factor.sh $f; done
compress_factor.sh
136
gcc-9.1.0.tar.gz
1
keys-restricted.dev.computecanada.ca.tar
14
ld_debug_2.txt
27
ld_debug.txt
27
nupack-4.0.0.27-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
1
nupack-4.0.0.27-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
1
nupack-4.0.0.27-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
1
pi-test.py
10
privacyidea_pam_ldap.py
3
privacyidea_pam.py
3
strace-cuda11.0
27
strace_paraview.txt
24
testOpenVKL-warning-bug.py
21
wavelet.png
1

@jshleap

jshleap commented Apr 5, 2022

Pretty cool. I will have to refactor the code since I do batches, but I can create the groups with this code! Thanks!

@rptaylor
Contributor Author

rptaylor commented Apr 5, 2022

Software or datasets installers don't necessarily have domain knowledge of what they are installing.

Potentially rhetorical question - should we be publishing data sets if we don't have any analysts with at least some basic knowledge of what those data sets are?

That script is a nice idea, since the only truly reliable way to determine whether a file is compressible is to compress it and see. Maybe analysts could use it as a tool to examine datasets, see which files are compressible, and make a judgement call on the overall compressibility of the dataset or of specific subdirectories.
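
As one possible workflow (a sketch that assumes compress_factor.sh from the earlier comment sits in the working directory), an analyst could survey a whole dataset and get per-file factors plus a rough average:

$ cat survey_compressibility.sh
#!/bin/bash
# Survey a dataset directory with compress_factor.sh (from the comment above)
# and report per-file compression factors plus an overall average, as a
# rough aid for judging whether a dataset or subdirectory is compressible.

dir="$1"
total=0
count=0

while IFS= read -r -d '' f; do
    factor=$(./compress_factor.sh "$f")
    printf '%s\t%s\n' "$factor" "$f"
    total=$((total + factor))
    count=$((count + 1))
done < <(find "$dir" -type f -print0)

[ "$count" -gt 0 ] && echo "average compression factor: $((total / count)) over $count files"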

@mboisson
Member

mboisson commented Jul 13, 2023

I'm looking at open issues and got back to this one. What is lost in the event of double compression? Publishing time? If so, that's not really a big motivation.

And what should the threshold be for a given transaction to decide whether we should compress everything or nothing?
Hypothetical scenario: a transaction contains one very large incompressible file and a large number of small compressible files.

@rptaylor
Contributor Author

I suggested https://sft.its.cern.ch/jira/browse/CVM-2078 some time ago, but there has been no feedback on it.

Double compression does cost publishing time; if we're talking about doing things the right way to scale to datasets of ~10-100 TB, it should be part of the picture. Measurements would be helpful to quantify it.

Compressing an incompressible file can also increase its size.

It also costs time and CPU for decompression on a compute node the first time a file is retrieved and placed into the cache. Again, we would need to measure how much.
