
compression usage for data repos #107

Open
rptaylor opened this issue Mar 10, 2022 · 13 comments
Labels: discussion needed, enhancement, help wanted, low priority

Comments

@rptaylor
Contributor

rptaylor commented Mar 10, 2022

By default CVMFS compresses files. For data repos, however, different considerations apply: some datasets consist of incompressible binary data (or a data format that is already natively compressed), while others may be highly compressible.

Of course, if compressible files are stored uncompressed, storage space and bandwidth are wasted, while if files are double compressed, CPU time is wasted processing them on the publisher and, more importantly, on clients, slowing down data access. In some cases double compression can actually increase file size.

Using the gateway interface we can address this because we have the option of controlling compression on each transaction like this:
cvmfs_server publish -Z none # don't compress
cvmfs_server publish -Z default # compress

Now I have set the data repo back to compression by default. So if -Z none is exposed to analysts as a publishing option in the scripts, they can use it to avoid double compression when incompressible binary data is published.
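
For illustration only, a publishing wrapper could expose that choice as a flag; the script name, flag, and argument handling below are hypothetical and not part of the existing publishing scripts:

$ cat publish_dataset.sh
#!/bin/bash
# Hypothetical wrapper exposing the compression choice to analysts.
# Usage: ./publish_dataset.sh <repository> [--no-compress]

repo="$1"
compression="default"
if [ "$2" = "--no-compress" ]; then
    compression="none"
fi

# Pass the chosen setting straight through to the gateway publish command.
cvmfs_server publish -Z "$compression" "$repo"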

@mboisson
Member

The problem though is that it isn't trivial to figure out which one is best for a specific transaction.

@rptaylor
Contributor Author

Right. I am supposing that the analysts publishing and curating the content should have some domain-specific knowledge of the nature of the dataset. For example, I know that the ROOT format typically compresses the contents of the data structures in its files.

Barring that, there could be a helper script (maybe a pre-transaction hook using the publisher server tools) that looks at the files and guesses, though it might not be as effective as a human at understanding the nature of the data. For example, here is a list of file extensions our backup software uses to avoid double compression; a rough sketch of such a helper follows the list.

* Don't waste time trying to compress these files during backup
   exclude.compression "/.../*.ace"
   exclude.compression "/.../*.arc"
   exclude.compression "/.../*.arj"
   exclude.compression "/.../*.avi"
   exclude.compression "/.../*.bzip2"
   exclude.compression "/.../*.cab"
   exclude.compression "/.../*.ear"
   exclude.compression "/.../*.exe"
   exclude.compression "/.../*.gho"
   exclude.compression "/.../*.gif"
   exclude.compression "/.../*.gpg"
   exclude.compression "/.../*.gzip"
   exclude.compression "/.../*.gz"
   exclude.compression "/.../*.jar"
   exclude.compression "/.../*.jpeg"
   exclude.compression "/.../*.jpg"
   exclude.compression "/.../*.lha"
   exclude.compression "/.../*.lzh"
   exclude.compression "/.../*.mov"
   exclude.compression "/.../*.mp3"
   exclude.compression "/.../*.mpeg"
   exclude.compression "/.../*.mpg"
   exclude.compression "/.../*.png"
   exclude.compression "/.../*.psd"
   exclude.compression "/.../*.rar"
   exclude.compression "/.../*.rrd"
   exclude.compression "/.../*.sea"
   exclude.compression "/.../*.sit"
   exclude.compression "/.../*.sitx"
   exclude.compression "/.../*.swf"
   exclude.compression "/.../*.tar.bz2"
   exclude.compression "/.../*.tar.gz"
   exclude.compression "/.../*.tgz"
   exclude.compression "/.../*.tiff"
   exclude.compression "/.../*.tif"
   exclude.compression "/.../*.war"
   exclude.compression "/.../*.wav"
   exclude.compression "/.../*.Z"
   exclude.compression "/.../*.zip"
   exclude.compression "/.../*.zoo"
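
As a minimal sketch of that kind of helper, the script below suggests a -Z setting based on the extensions found in a staging directory. The extension list, the 50% threshold, and the script name are placeholders invented for illustration:

$ cat guess_compression.sh
#!/bin/bash
# Sketch of a pre-transaction helper: suggest -Z none when most files in a
# staging directory already look compressed, judging by extension alone.
# The extension list and the 50% threshold are arbitrary choices.

dir="$1"
compressed_ext='gz|tgz|bz2|xz|zip|rar|7z|jpg|jpeg|png|gif|mp3|mp4|avi|mov'

total=$(find "$dir" -type f | wc -l)
already=$(find "$dir" -type f | grep -E -i -c "\.(${compressed_ext})$")

if [ "$total" -gt 0 ] && [ $((100 * already / total)) -ge 50 ]; then
    echo "-Z none"
else
    echo "-Z default"
fi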

Anyway we can consider it as a potential optimization.

@mboisson
Member

mboisson commented Apr 5, 2022

I really can't think of a way to implement that. Software or dataset installers don't necessarily have domain knowledge of what they are installing. Maybe @jshleap has an idea?

@mboisson added the enhancement, help wanted, and discussion needed labels on Apr 5, 2022
@jshleap

jshleap commented Apr 5, 2022

It is complex because you can also have a mix of types in a dataset. The way we are doing it in the biorepo is to process all files known to be compressed (like the list above) separately from those that aren't. That, of course, means some files will slip through because they don't have a known extension, but they can easily be added to the array if someone knows they are incompressible.

@mboisson
Member

mboisson commented Apr 5, 2022

But how do you do that?

@jshleap

jshleap commented Apr 5, 2022

The way we push is very different from the RSNT's: we loop over individual files, checking size and count, and if the threshold is passed, then we push. This allows us to loop over the files that are not in the compressed-extensions array separately.
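
The batching loop described here might look roughly like the following sketch; the thresholds, the push_batch placeholder, and the script name are invented for illustration and are not the actual BioRepo push_data script:

$ cat batch_push_sketch.sh
#!/bin/bash
# Sketch of the batching idea: accumulate files until a size or count
# threshold is reached, then push the batch. Thresholds and the push step
# are placeholders only.

max_batch_bytes=$((10 * 1024 * 1024 * 1024))   # 10 GB per batch (arbitrary)
max_batch_files=10000                          # 10k files per batch (arbitrary)

batch_bytes=0
batch_files=0
batch_list=$(mktemp)

push_batch() {
    # Placeholder: a real script would open a transaction, copy the files
    # listed in "$batch_list", and run cvmfs_server publish (with or without -Z none).
    echo "pushing $batch_files files ($batch_bytes bytes)"
    : > "$batch_list"
    batch_bytes=0
    batch_files=0
}

while IFS= read -r -d '' f; do
    size=$(stat -c %s "$f")
    printf '%s\n' "$f" >> "$batch_list"
    batch_bytes=$((batch_bytes + size))
    batch_files=$((batch_files + 1))
    if [ "$batch_bytes" -ge "$max_batch_bytes" ] || [ "$batch_files" -ge "$max_batch_files" ]; then
        push_batch
    fi
done < <(find "$1" -type f -print0)

# Push whatever is left over.
[ "$batch_files" -gt 0 ] && push_batch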

@mboisson
Member

mboisson commented Apr 5, 2022

Where is that script ?

@jshleap

jshleap commented Apr 5, 2022

In this private repo: https://github.com/ComputeCanada/BioRepo/blob/main/scripts/push_data. I can add you (but don't judge :P)

@mboisson
Member

mboisson commented Apr 5, 2022

If you loop through all files, something like this would be more useful than a fixed list based on extensions:

$ cat compress_factor.sh
#!/bin/bash
# Estimate how compressible a file is by gzipping a sample of it.

file="$1"

# Compress the first 20 kB of the file and measure the result in bytes.
compressed_size=$(dd if="$file" count=20 bs=1024 2>/dev/null | gzip -c -9 | wc -c)
# Integer ratio of the 20 kB sample to its compressed size (1 means essentially incompressible).
compression_factor=$((20*1024 / $compressed_size))
echo $compression_factor

It reads the first 20 kB, compresses it, and calculates a compression factor. It won't be reliable if the file is too small.

$ for f in *.*; do echo $f; ./compress_factor.sh $f; done
compress_factor.sh
136
gcc-9.1.0.tar.gz
1
keys-restricted.dev.computecanada.ca.tar
14
ld_debug_2.txt
27
ld_debug.txt
27
nupack-4.0.0.27-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
1
nupack-4.0.0.27-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
1
nupack-4.0.0.27-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
1
pi-test.py
10
privacyidea_pam_ldap.py
3
privacyidea_pam.py
3
strace-cuda11.0
27
strace_paraview.txt
24
testOpenVKL-warning-bug.py
21
wavelet.png
1

@jshleap

jshleap commented Apr 5, 2022

Pretty cool. I will have to refactor the code since I do batches, but I can create the groups with this code! Thanks!

@rptaylor
Contributor Author

rptaylor commented Apr 5, 2022

Software or datasets installers don't necessarily have domain knowledge of what they are installing.

Potentially rhetorical question - should we be publishing data sets if we don't have any analysts with at least some basic knowledge of what those data sets are?

That script is a nice idea, since the only truly reliable way to determine whether a file is compressible is to compress it and see. Maybe analysts could use it as a tool to examine datasets, see which files are compressible, and make a judgement call on the overall compressibility of the dataset or of specific subdirectories.
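
As one possible workflow (a sketch that assumes compress_factor.sh from the earlier comment sits in the working directory), an analyst could survey a whole dataset and get per-file factors plus a rough average:

$ cat survey_compressibility.sh
#!/bin/bash
# Survey a dataset directory with compress_factor.sh (from the comment above)
# and report per-file compression factors plus an overall average, as a
# rough aid for judging whether a dataset or subdirectory is compressible.

dir="$1"
total=0
count=0

while IFS= read -r -d '' f; do
    factor=$(./compress_factor.sh "$f")
    printf '%s\t%s\n' "$factor" "$f"
    total=$((total + factor))
    count=$((count + 1))
done < <(find "$dir" -type f -print0)

[ "$count" -gt 0 ] && echo "average compression factor: $((total / count)) over $count files"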

@mboisson
Member

mboisson commented Jul 13, 2023

I'm looking at open issues and got back to this one. What is lost in the event of double compression? Publishing time? If so, that's not really a big motivation.

And what should the threshold be for a given transaction to decide whether we should compress everything or nothing?
Hypothetical scenario: a transaction contains one very large incompressible file and a large number of small compressible files.

@rptaylor
Contributor Author

I suggested https://sft.its.cern.ch/jira/browse/CVM-2078 some time ago, but there has been no feedback on it.

Double compression does cost publishing time; if we're talking about doing things the right way to scale to datasets of ~10-100 TB, it should be part of the picture. Measurements would be helpful to quantify it.

Compressing an incompressible file can also increase its size.

It also costs time and CPU for decompression on a compute node the first time a file is retrieved and placed into the cache. Again, we would need to measure how much.
