compression usage for data repos #107
Comments
> The problem though is that it isn't trivial to figure out which one is best for a specific transaction.
Right. I am supposing that the analysts publishing and curating the content should have some domain-specific knowledge of the nature of the dataset. For example, I know that the ROOT format typically compresses the contents of the data structures in its files. Barring that, there could be a helper script (maybe a pre-transaction hook using the publisher server tools) that looks at the files and guesses, though it might not be as effective as a human at understanding the nature of the data. E.g. here is a list of file extensions our backup software uses to avoid double compression.
Anyway, we can consider it a potential optimization.
I really can't think of a way to implement that. Software or dataset installers don't necessarily have domain knowledge of what they are installing. Maybe @jshleap has an idea?
It is complex because you can also have a mix of types in a dataset. How we are doing it in the BioRepo is to process all files known to be compressed (like the list above) separately from those that aren't. That, of course, means that some already-compressed files will slip through because they don't have a recognized extension, but those can easily be added to the array if someone knows they are incompressible.
But how do you do that?
The way we push is very different from the RSNT's: we loop over individual files to check size and number, and if the threshold is passed, then we push. This allows us to loop over files not in the compressed-extensions array.
Where is that script?
It's in this private repo (I can add you, but don't judge :P): https://github.com/ComputeCanada/BioRepo/blob/main/scripts/push_data
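The push_data script itself is private, so here is only a minimal sketch of the grouping step described above, assuming a bash loop, a placeholder extension list, and a hypothetical DATASET_DIR variable; the size/count batching threshold is left out for brevity:

```bash
#!/bin/bash
# Illustrative sketch only -- not the real push_data script.
# Group files by whether their extension marks them as already compressed;
# the "precompressed" group would then be published with -Z none and the
# rest with the default compression.
compressed_ext=("gz" "bz2" "xz" "zip" "zst" "bam" "cram")   # placeholder list

precompressed=()
compressible=()

while IFS= read -r -d '' f; do
    ext="${f##*.}"
    if printf '%s\n' "${compressed_ext[@]}" | grep -qxF "$ext"; then
        precompressed+=("$f")
    else
        compressible+=("$f")
    fi
done < <(find "${DATASET_DIR:-.}" -type f -print0)

echo "${#precompressed[@]} files to push with -Z none"
echo "${#compressible[@]} files to push with -Z default"
```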
If you loop through all files, something like this would be more useful than a fixed list based on extensions: a short script that reads the first 20 kB of each file, compresses it, and calculates a compression factor. It won't be reliable if the file is too small.
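The snippet itself isn't reproduced above; a minimal bash sketch of the same idea, assuming gzip as the compressor and a hard-coded 20 kB sample, could be:

```bash
#!/bin/bash
# Rough compressibility probe: compress the first 20 kB of a file and
# report the compression factor (sample size / compressed size).
# Unreliable for files smaller than the sample.
file="$1"
sample_bytes=20480

sample_size=$(head -c "$sample_bytes" "$file" | wc -c)
compressed_size=$(head -c "$sample_bytes" "$file" | gzip -c | wc -c)

# A factor close to 1 (or below) suggests the file is already compressed.
echo "scale=2; $sample_size / $compressed_size" | bc
```

A factor near 1 would put the file in the already-compressed group; anything well above that is probably worth compressing.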
Pretty cool. I will have to refactor the code since I do batches, but I can create the groups with this code! Thanks!
Potentially rhetorical question: should we be publishing data sets if we don't have any analysts with at least some basic knowledge of what those data sets are? That script is a nice idea, as the only truly reliable way to determine whether a file is compressible is to compress it and see. Maybe analysts could use it as a tool to examine datasets, see which files are compressible, and make a judgement call on the overall compressibility of the dataset or of specific subdirectories.
I'm looking at open issues and got back to this one. What is lost in the event of double compression? Publishing time? If so, that's not really a big motivation. And what should the threshold be for a given transaction, to decide whether we should compress everything or nothing?
I suggested https://sft.its.cern.ch/jira/browse/CVM-2078 some time ago but got no feedback on it. Double compression does cost publishing time; if we're talking about doing things the right way to scale to data sets of ~10-100 TB, it should be part of the picture. Measurements would help quantify it. Compressing an incompressible file can also potentially increase its size. It also costs time and CPU for decompression on a compute node the first time a file is retrieved and placed into the cache. How much, we would again need to measure.
By default CVMFS compresses files; however, for data repos different considerations apply, since some datasets consist of incompressible binary data (or a data format that is already natively compressed) while others may be highly compressible.
Of course, if compressible files are left uncompressed, storage space and bandwidth are wasted, while if already-compressed files are compressed again, CPU time is wasted processing them on the publisher and, more importantly, on clients, slowing down data access. In some cases double compression could actually increase file size.
Using the gateway interface we can address this, because we have the option of controlling compression on each transaction like this:

cvmfs_server publish -Z none      # don't compress
cvmfs_server publish -Z default   # compress

Now I have set the data repo back to compression by default. So if -Z none is exposed to analysts with a publishing option in the scripts, they can use that to avoid double compression when incompressible binary data is published.
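As a purely hypothetical illustration of exposing that option, a small wrapper around the cvmfs_server calls quoted above might look like the following; the script name, the --no-compress flag, and the rsync step are assumptions, not part of any existing tooling:

```bash
#!/bin/bash
# Hypothetical publish wrapper exposing compression control to analysts.
# Usage: publish_dataset.sh [--no-compress] <repo> <source_dir>
compression="default"
if [ "$1" = "--no-compress" ]; then
    compression="none"
    shift
fi
repo="$1"
src="$2"

cvmfs_server transaction "$repo"          # open a transaction on the publisher
rsync -a "$src/" "/cvmfs/$repo/"          # stage the dataset into the repo
cvmfs_server publish -Z "$compression" "$repo"
```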