
[Question] Comparison with the zarr format? #527

Open

julioasotodv opened this issue Sep 19, 2024 · 13 comments

@julioasotodv

Hi,

I know that safetensors are widely used nowadays in HF, and the comparisons made in this repo's README file make a lot of sense.

However, I am surprised to see that there is no comparison with zarr, which is probably the most widely used format for storing tensors in a universal, compressed, and scalable way.

Is there any particular reason why safetensors was created instead of just using zarr, which has been around for longer (and has nice benefits such as good performance in object storage reads and writes)?

Thank you!

@User21T

User21T commented Nov 5, 2024

Hello.

I don't represent Hugging Face or its position on the issue.
However, I think the main reason safetensors was created rather than reusing zarr is that the latter is a universal format for storing any kind of tensor, whereas safetensors was specifically designed to store machine learning models and to work within the HF ecosystem.
It guarantees better performance, security, and integration with ML-specific dtypes (bfloat16, FP8).

If I'm wrong, please correct me.

@julioasotodv
Author

Thanks for the answer! I believe that zarr offers the same as safetensors and more (chunking, different compressions, and so on), except perhaps some of the specific dtypes such as bf16.

Thank you!

@User21T

User21T commented Nov 16, 2024

You're welcome.

@TomNicholas

TomNicholas commented Jan 1, 2025

I had the exact same question as @julioasotodv.

It's interesting to see in the readme that HDF5 is described as

Apparently now discouraged for TF/Keras. Seems like a great fit otherwise actually.

Zarr is heavily inspired by HDF5's data model, and so for all the same reasons would also be a great fit for this problem.

However, I think the main reason safetensors was created rather than reusing zarr is that the latter is a universal format for storing any kind of tensor, whereas safetensors was specifically designed to store machine learning models and to work within the HF ecosystem.

An ML-specific format is only an advantage over a more general format if there is some feature that the ML domain needs that the general format cannot provide. Otherwise Hugging Face are just duplicating effort only to end up with a tool with fewer features and a smaller community.

People have used Zarr in an ML context at scale - I believe Google used it to checkpoint model weights when training Gemini.

It guarantees better performance, security, and integration with ML-specific dtypes (bfloat16, FP8).

Can you elaborate on this? The relevant questions (and my stab at answers) are:

  • Does safetensors actually have better performance than Zarr? (Zarr v3 has async loading and decompression of chunks, so is really very performant)
  • Is it similarly easy to load data from storage to GPU memory from both formats? (you can definitely load zarr to GPU memory)
  • Are there major architectural differences between safetensors and Zarr? (IIUC they both have a json header that stores metadata about binary blobs of floating point data. Safetensors doesn't actually put the json in a separate file from the binary data though, so it is more analogous to (Cloud-Optimized) HDF5. The other big difference seems to be safetensors not having compression, and therefore no chunking? Though "multipart safetensors" sound like uncompressed chunks stored in different files... See the header-reading sketch after this list.)
  • Can one create a zip bomb using Zarr? (I don't believe so...)
  • Is there any other way in which Zarr could present a security risk? (I highly doubt it, you're just reading json + decompressing chunks of bytes containing floating point data)
  • Can Zarr be extended to support ML-specific types? (it certainly could, through some new zarr extension to add those dtypes)
  • Could a community standard be formed to agree upon ML-specific schema conventions in Zarr? (surely - a similar thing already exists for microscope data in Zarr and is happening for geospatial data in Zarr)
  • Is there some feature needed to integrate smoothly with the HF ecosystem that Zarr can't provide? (I have no idea)
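
For reference, the entire safetensors layout can be read in a few lines. A minimal sketch, assuming a local file (the name is a placeholder), following the published spec of an 8-byte little-endian header length, then a JSON header, then the raw tensor bytes:

import json
import struct

with open("model.safetensors", "rb") as f:
    (header_len,) = struct.unpack("<Q", f.read(8))   # u64 header size
    header = json.loads(f.read(header_len))          # name -> dtype/shape/data_offsets

for name, info in header.items():
    if name == "__metadata__":                       # optional free-form metadata entry
        continue
    start, end = info["data_offsets"]                # offsets into the byte buffer
    print(name, info["dtype"], info["shape"], end - start, "bytes")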

In fact, one might even imagine using the "virtual zarr" approach to make existing data in safetensor form accessible via zarr library implementations, which would be interesting as a way to very efficiently read safetensor data into applications which understand zarr without duplicating the safetensor data...

I don't represent Hugging Face or its position on the issue.

I would be very curious to hear an opinion from Hugging Face! :)

@Narsil
Collaborator

Narsil commented Jan 2, 2025

I wasn't aware at all of zarr's existence when creating this format.

HDF5 was getting deprecated in TF/Keras at the time of writing this lib, which means building upon it didn't look like a great choice.

HDF5's main issue at the time was the lack of support for BF16, which is now ubiquitous in ML (and was already quite widely used then). The same probably goes for FP8 (the E4M3/E5M2 variants).

Zarr seems to try to store compression schemes, which is IMHO counterproductive. There are currently several alternatives for quantization, and every month there's a new one, each needing very different information, layouts and constraints, not to mention alignment.

It seems that other libraries are currently able to build on top of safetensors to store their compression information as they see fit: https://github.com/neuralmagic/compressed-tensors https://huggingface.co/bartowski/Llama-3.3-70B-Instruct-exl2/tree/4_25 etc.

The issue with trying to support those is that, since they are not standard, they have no equivalent in most libraries (GGUF compressed tensors do not exist in torch, for instance). That makes it very hard to both cover everything and stay simple, and it makes reading safetensors from other libraries a never-ending battle.

The format hasn't changed a bit since inception, except for the addition of FP8 (which is a real hardware dtype).

zarr seems to have issues in its specs: https://zarr-specs.readthedocs.io/en/latest/v3/data-types.html Are there other sources ?

Just as background: safetensors was created at a time when the core focus was to stop distributing inherently unsafe files (pickle files from torch). As a side effect we created a format that makes it trivial to do lazy loading (either over the filesystem or over the network), which helps a lot with larger and larger systems (TP and PP for training or inference were starting to be heavily used).

For the detailed questions:

Does safetensors actually have better performance than Zarr? (Zarr v3 has async loading and decompression of chunks, so is really very performant)

Safetensors too. In ML, tensors are not compressible by design; safetensors is zero-copy, which is pretty hard to beat.

Is it similarly easy to load data from storage to GPU memory from both formats?

Yes, it is. If zarr supports compression you cannot guarantee that you can do that though (whereas safetensors is guaranteed to be zero-copy).

Can one create a zip bomb using Zarr? (I don't believe so...)

If Zarr implements zip, then it is vulnerable. There's no way around it. Your implementation may not be vulnerable, but that's implementation-specific: any other implementor might be, or users may be unzipping manually.
If you are disabling zip bombs, then you are not implementing ZIP itself, but something different.

Can Zarr be extended to support ML-specific types? (it certainly could, through some new zarr extension to add those dtypes)

Sure.

@TomNicholas

Thank you for the detailed answer!

HDF5 was getting deprecated in TF/Keras at the time of writing this lib, which means building upon it didn't look like a great choice.

Yeah I'm not advocating for HDF5, just pointing out the similarities.

zarr seems to have issues in its specs: https://zarr-specs.readthedocs.io/en/latest/v3/data-types.html Are there other sources ?

Zarr v3 is still in beta. Zarr v2 is widely used though (spec here). For the purposes of this discussion the difference isn't very important - other than that one of the aims of v3 is to support adding new custom dtypes.

Zarr seems to try to store compression schemes, which is IMHO counterproductive. There are currently several alternatives for quantization, and every month there's a new one, each needing very different information, layouts and constraints, not to mention alignment.

Interesting. It would be good to know whether these different quantization approaches could all be treated as Zarr codecs or not. I can see there might be a difference in project scope here though, with other libraries building on top of safetensors to handle this step in a variety of ways.

Safetensors too. In ML, tensors are not compressible by design; safetensors is zero-copy, which is pretty hard to beat.

Yeah makes sense. I would have thought a Zarr reader implementation could be zero-copy for the case of uncompressed data too though.

Yes, it is. If zarr supports compression you cannot guarantee that you can do that though (whereas safetensors is guaranteed to be zero-copy).

I see. Obviously compression is useful for the other users of Zarr, but it is optional. So perhaps one could imagine simply defining "safe Zarr" as any Zarr store that is not compressed using certain codecs... You could also check for the presence of these codecs before attempting to decompress anything, and implement a "zero-copy safe zarr reader" that just refuses to open compressed data.
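
A rough sketch of that last idea, assuming the zarr-python v2 API (Array.compressor / Array.filters) and a placeholder store path:

import zarr

def open_uncompressed(path):
    arr = zarr.open_array(path, mode="r")
    # refuse anything that would require running a codec
    if arr.compressor is not None or arr.filters:
        raise ValueError(f"{path} uses codecs; refusing to decompress")
    return arr

weights = open_uncompressed("weights.zarr")   # plain bytes on disk, safe to read zero-copy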

If Zarr implements zip, then it is vulnerable.

Zarr supports zlib compression, but if I understand correctly it doesn't implement zip in general. I have asked upstream to check my understanding though, see zarr-developers/zarr-python#2625.

People have used Zarr in an ML context at scale

FYI the relevant project to read about is Google TensorStore, which is a C++ implementation of Zarr, used for checkpointing LLMs at scale.

@csaybar

csaybar commented Jan 11, 2025

Does safetensors actually have better performance than Zarr? (Zarr v3 has async loading and decompression of chunks, so is really very performant)

A key advantage of safetensors over Zarr is its ability to handle chunks "dynamically", with virtually no overhead. In contrast, Zarr relies on a static chunking strategy (it now also introduces sharding) set at creation time.

When the Zarr chunks perfectly align with the user's request, safetensors and Zarr deliver comparable performance. Safetensors is still slightly faster, but perhaps only because of its Rust implementation compared to Zarr's pure-Python code.

Safetensors is about as fast as Zarr:

Safetensors: 10.7 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Zarr: 13.5 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, if the user's request doesn't perfectly match the Zarr chunks, the performance difference can be pretty big. Safetensors can easily be 10 to 100 times faster than Zarr in many cases.

Safetensors is ~41 times faster than Zarr for the same data request:

Safetensors: 212 µs ± 21.3 µs  per loop (mean ± std. dev. of 7 runs, 10 loops each)
Zarr: 8.65 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

https://colab.research.google.com/drive/1ewxNez6R-q9tWtGBp4JsQrKMFrTS5BG_?usp=sharing
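
A condensed sketch of the kind of comparison the notebook runs, assuming the zarr-python v2 API, an uncompressed array for a like-for-like test, and placeholder file names:

import time
import numpy as np
import zarr
from safetensors import safe_open
from safetensors.numpy import save_file

data = np.random.rand(1024, 1024).astype("float32")

save_file({"x": data}, "x.safetensors")
z = zarr.open("x.zarr", mode="w", shape=data.shape, chunks=(128, 1024),
              dtype=data.dtype, compressor=None)
z[:] = data

# read a slice that does NOT align with the 128-row zarr chunks
t0 = time.perf_counter()
with safe_open("x.safetensors", framework="numpy") as f:
    part_st = f.get_slice("x")[100:200, :]
t1 = time.perf_counter()
part_zr = zarr.open("x.zarr", mode="r")[100:200, :]
t2 = time.perf_counter()

print(f"safetensors: {(t1 - t0) * 1e3:.2f} ms, zarr: {(t2 - t1) * 1e3:.2f} ms")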

Over a network, if your server supports multiple range requests, you can perform any partial read from a Safetensors file using just 3 GET requests, without needing caching.
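
To illustrate the three requests (assuming the server honours HTTP Range headers; the URL and tensor name are placeholders):

import json
import struct
import requests

url = "https://example.com/model.safetensors"

def fetch(start, end):                               # inclusive byte range
    r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    r.raise_for_status()
    return r.content

n = struct.unpack("<Q", fetch(0, 7))[0]              # GET 1: header length
header = json.loads(fetch(8, 8 + n - 1))             # GET 2: JSON header
start, end = header["TENSOR_NAME"]["data_offsets"]   # offsets relative to end of header
raw = fetch(8 + n + start, 8 + n + end - 1)          # GET 3: only that tensor's bytes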

FYI the relevant project to read about is Google TensorStore, which is a C++ implementation of Zarr, used for checkpointing LLMs at scale.

Google TensorStore is more than a C++ implementation of Zarr. Zarr is just one of the many drivers it supports. https://google.github.io/tensorstore/driver/index.html

@nenb

nenb commented Jan 13, 2025

@csaybar I repeated the benchmarks on my local machines.

New Mac M-series
It appeared that there was considerable caching going on, and that this heavily distorted the results. By looking at individual loads from disk (i.e. replacing %%timeit with something like %%timeit -n 1 -r 1) I could find no obvious difference between zarr and safetensors for the benchmarks. If anything, zarr was slightly faster for both benchmarks.

Older Lenovo Thinkpad with Ubuntu
The caching impact was less obvious here. zarr was slightly faster for the 'perfect alignment' case, and then between 2-5 times slower for the non-aligned case.


I also repeated the benchmarks on Colab but:
i) looked at individual loads from disk and
ii) added compressor=None when creating the zarr array (as this seemed like a more like-for-like comparison)

Colab
There did not appear to be any clear difference in the 'perfect alignment' case. zarr appeared to be about 10 times slower for the non-aligned case.


Based on the variability of these results, I would be hesitant to draw any definitive conclusions from these benchmarks. In particular:

Safetensors can easily be 10 to 100 times faster than Zarr in many cases.

does not match my personal experience. (But it's quite possible that it's just because I have very limited personal experience!)

I'm also hesitant to conclude that any differences between the formats relate to the chunking strategy. safetensors is using mmap under-the-hood, and it seems plausible to me that any differences could in fact be related to the use of mmap, rather than anything related to 'dynamic chunking'?

I'm interested in seeing more benchmarks in the community comparing safetensors and zarr performance for different use-cases, thanks for sharing!

@csaybar

csaybar commented Jan 13, 2025

Hi @nenb, you're absolutely right! I'm not very familiar with the Zarr Python implementation, but it seems the default compression is not None. The documentation is messy due to the transition between v2 and v3, but it seems that compressor=None works.

To make a fair comparison, I'll use cProfile instead of timeit to avoid potential caching in Zarr/Safetensor. I'm also reporting the best score from 10 runs.

Results:

https://colab.research.google.com/drive/1-yuHw5Id2bJDpKkilHBQ7QeRPfM8u8VB?usp=sharing

  • 'Perfect alignment' case: Safetensors is ~4x faster.

    • Safetensors: 0.034 seconds
    • Zarr: 0.129 seconds
  • No 'perfect alignment' case: Safetensors is ~56x faster.

    • Safetensors: 0.002 seconds
    • Zarr: 0.112 seconds

I'm also hesitant to conclude that any differences between the formats are related to the chunking strategy. Safetensor uses mmap under the hood, and it seems plausible that any differences might stem from the use of mmap, rather than 'dynamic chunking.'

Safetensors doesn't have a chunking strategy; everything is fixed by default. It uses mmap because it can: with no compression, everything can be mapped just by knowing __metadata__: {'TENSOR_NAME': {'dtype': 'F32', 'shape': [768, 2304], 'data_offsets': [541012992, 548090880]}}.
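
Roughly, that is all a zero-copy read needs. A sketch with a placeholder file name, reusing the tensor entry quoted above:

import json
import mmap
import struct
import numpy as np

with open("model.safetensors", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

(n,) = struct.unpack("<Q", mm[:8])
header = json.loads(mm[8:8 + n].decode("utf-8"))
info = header["TENSOR_NAME"]
start, end = info["data_offsets"]                 # relative to the end of the header
tensor = np.frombuffer(mm, dtype=np.float32,      # 'F32'
                       count=(end - start) // 4,
                       offset=8 + n + start).reshape(info["shape"])   # view, no copy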

Check the number of function calls Zarr has to make for the same request: 85,053 (Zarr) vs 5 (safetensors). If Zarr were coded in C or Rust, it could potentially match safetensors' speed at "perfect alignment". HDF5 is faster than Zarr because its core is in C. Zarr’s main devs prefer to keep everything in Python, I’m not sure why.

[screenshots of the cProfile output omitted]

@nenb

nenb commented Jan 13, 2025

@csaybar

To make a fair comparison, I'll use cProfile instead of timeit to avoid potential caching in Zarr/Safetensor. I'm also reporting the best score from 10 runs.

In your runs you are profiling the import time of zarr. A lot of the imports in the zarr Python package only occur after doing some action, i.e. import zarr alone will not trigger all the imports required for most operations. The overhead from this should be negligible (tens of milliseconds for the benchmark cases), and because CPython caches modules this overhead only happens the first time. However, because the above benchmarks operate on such small data chunks and hence complete very quickly, the zarr import overhead is distorting the result. The imports are also the reason for the very high number of function calls (which should be less than 1,000 in reality).

To have a realistic benchmark, I would suggest using larger data chunks so that the imports do not distort the results and/or rewriting the benchmark so as to remove these imports from the results. Because in practice, I don't really mind if there is a cost of tens of milliseconds on the very first read - the rest of my I/O times will completely dwarf this (and if they don't, zarr is not the correct tool here!).

Safetensors doesn't have a chunking strategy; everything is fixed by default. It uses mmap because it can: with no compression, everything can be mapped just by knowing metadata: {'TENSOR_NAME': {'dtype': 'F32', 'shape': [768, 2304], 'data_offsets': [541012992, 548090880]}}.

My point was that mmap seemed like a very plausible candidate as to why safetensors is faster than zarr in some of the benchmarks above. But mmap is a feature of the OS - I don't believe that it will be relevant when loading from cloud storage, which is a very common use-case. The results when loading data from cloud-storage (rather than from on disk) could look very different. Cloud-optimized geotiffs vs zarr would probably be the relevant benchmark here?

Zarr’s main devs prefer to keep everything in Python, I’m not sure why.

I am a user and not a dev, and so I can only speculate. But there are multiple implementations of zarr (JS, Java, Rust, ...), and I think that the Python implementation is probably considered the 'reference implementation' as Python is probably the most familiar language to most of us who use zarr (researchers especially). The size of the community around zarr-python is evidence in favour of this i.e. it's not just 1/2 key contributors.

@csaybar

csaybar commented Jan 13, 2025

In your runs you are profiling the import time of zarr. A lot of the imports in the zarr Python package only occur after doing some action, i.e. import zarr alone will not trigger all the imports required for most operations. The overhead from this should be negligible (tens of milliseconds for the benchmark cases), and because CPython caches modules this overhead only happens the first time. However, because the above benchmarks operate on such small data chunks and hence complete very quickly, the zarr import overhead is distorting the result. The imports are also the reason for the very high number of function calls (which should be less than 1,000 in reality).

The profiling is specifically conducted within the context manager, not considering the import time. I added
the same imports to all the scripts to ensure that the import time is not considered in the profiling.

from safetensors import safe_open
from safetensors.numpy import save_file
import numpy as np
import zarr
import cProfile
import pstats

The Profile class can also be used as a context manager (supported only in the cProfile module; see Context Manager Types):

https://docs.python.org/3/library/profile.html
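
For example, a minimal version of that pattern, profiling only the read itself (Profile works as a context manager in Python 3.8+; file and tensor names are placeholders):

import cProfile
import pstats
from safetensors import safe_open

with cProfile.Profile() as pr:                    # profile only this block
    with safe_open("x.safetensors", framework="numpy") as f:
        part = f.get_slice("x")[100:200, :]

pstats.Stats(pr).sort_stats("cumulative").print_stats(10)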

To have a realistic benchmark, I would suggest using larger data chunks so that the imports do not distort the results and/or rewriting the benchmark so as to remove these imports from the results. Because in practice, I don't really mind if there is a cost of tens of milliseconds on the very first read - the rest of my I/O times will completely dwarf this (and if they don't, zarr is not the correct tool here!).

What counts as "realistic" will depend on the use case. The example I used corresponds to a typical Sentinel-2 data cube, the kind you can download using stackstac, for instance. But in my experience, bigger will be worse for Python Zarr. Consider that:

  • This is a single-threaded experiment, deliberately designed to measure one core performance.
  • The overhead becomes multiplicative when scaled across larger datasets.
  • Zarr users don't know how slow it is because they use Zarr in multi-threaded environments, where the overhead is hidden by the parallelism.

My point was that mmap seemed like a very plausible candidate as to why safetensors is faster than zarr in some of the benchmarks above. But mmap is a feature of the OS - I don't believe that it will be relevant when loading from cloud storage, which is a very common use-case. The results when loading data from cloud-storage (rather than from on disk) could look very different. Cloud-optimized geotiffs vs zarr would probably be the relevant benchmark here?

In my opinion, in a server context the only thing that matters is minimising the number of GET operations, especially when serving millions of clients. Safetensors is also better here. As the layout is fixed, the client-side flow can be:

  • Client-side estimation of required bytes.
  • Optimization of GET operations.
  • Byte download execution.

To illustrate the GET optimization (step 2), consider this example:

File size: 1000 bytes
Required byte ranges: 0-20, 30-40, 900-1000

optimized approach (at the client side):
- Combine 0-20 and 30-40 into a single GET (0-40)
- Separate GET for 900-1000
Result: 2 GET operations instead of 3

You can do the same in Zarr and GeoTIFF too, but only at the "chunk" level; safetensors lets you do it at the "byte" level. A toy version of the merge step is sketched below.
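
A toy version of that client-side coalescing (merge byte ranges whose gap is below some threshold into a single GET; the threshold here is arbitrary):

def coalesce(ranges, max_gap=16):
    # ranges: sorted list of (start, end) byte ranges; returns the merged list
    merged = [list(ranges[0])]
    for start, end in ranges[1:]:
        if start - merged[-1][1] <= max_gap:      # close enough: extend the previous GET
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

print(coalesce([(0, 20), (30, 40), (900, 1000)]))   # [(0, 40), (900, 1000)]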

In a machine learning context, compression is irrelevant, that is why safetensors shines. In a geoscience context, data usually has high autocorrelation, so you can easily get a 10x compression ratio for some datasets (especially climate-related ones) with a lossless compressor. This is the reason why formats like safetensors are not popular in the geoscience community, IMO.

I am a user and not a dev, and so I can only speculate. But there are multiple implementations of zarr (JS, Java, Rust, ...), and I think that the Python implementation is probably considered the 'reference implementation' as Python is probably the most familiar language to most of us who use zarr (researchers especially). The size of the community around zarr-python is evidence in favour of this i.e. it's not just 1/2 key contributors.

Beyond zarr-python, the other implementations lack comparable support.

@nenb

nenb commented Jan 13, 2025

The profiling is specifically conducted within the context manager, not considering the import time. I added
the same imports to all the scripts to ensure that the import time is not considered in the profiling.

This is not correct. You can see this in the cProfile output - look at the number of calls to the importlib package as a clue. As a crude fix, you can add import fsspec to your list of imports, and this should change the results quite dramatically - the number of function calls should go from ~ 100,000 to ~ 1,000, and the runtime should drop accordingly.

I think it's important that a lot of the statements that have been made are also backed-up with benchmarks (and I'll admit that I am also guilty of this!).

There are only two benchmarks available in this thread. I wanted to point out that they i) vary quite a bit across machines and ii) when some subtleties are accounted for, the differences are not as large as originally suggested.

In a machine learning context, compression is irrelevant, that is why safetensors shines.

I do want to address this though, as I have seen it mentioned before, and I am not sure what it relates to.

If we consider the Qwen2.5-1.5B-Instruct LLM on HF for example (I picked this as it is one of the most downloaded, I have similar experience with other LLMs), using the Zstandard compression algorithm (with blosc for example), I get a compression ratio of 1.5. Sure, it mightn't be as high as 2 or 2.5 which can happen in geospatial applications, but it's certainly not irrelevant.
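
For what it's worth, a ratio like that can be checked by compressing the raw weights file with Zstandard and comparing sizes (a rough sketch; the file name is a placeholder, and compressing chunk by chunk makes the ratio approximate):

import os
import zstandard as zstd

path = "model.safetensors"
raw = os.path.getsize(path)
compressed = 0
cctx = zstd.ZstdCompressor(level=3)
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):   # compress 1 MiB at a time
        compressed += len(cctx.compress(chunk))
print(f"ratio ≈ {raw / compressed:.2f}")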

I don't mean to claim that this is true for all use-cases, but if it's true for one of the most downloaded LLMs on HF, then I don't think it's correct to say that compression is irrelevant.

@csaybar

csaybar commented Jan 13, 2025

This is not correct. You can see this in the cProfile output - look at the number of calls to the importlib package as a clue. As a crude fix, you can add import fsspec to your list of imports, and this should change the results quite dramatically - the number of function calls should go from ~ 100,000 to ~ 1,000, and the runtime should drop accordingly.

I disagree here. To me, this sounds like a lack of optimization. But I understand your point and I think it is a valid claim.
What specifically are we benchmarking? It really depends on the context.

I think it's important that a lot of the statements that have been made are also backed-up with benchmarks (and I'll admit that I am also guilty of this!). There are only two benchmarks available in this thread. I wanted to point out that they i) vary quite a bit across machines and ii) when some subtleties are accounted for, the differences are not as large as originally suggested.

Exactly! That's why I shared the Colab notebooks. It would be very interesting to see counterexamples.

If we consider the Qwen2.5-1.5B-Instruct LLM on HF for example (I picked this as it is one of the most downloaded, I have similar experience with other LLMs), using the Zstandard compression algorithm (with blosc for example), I get a compression ratio of 1.5. Sure, it mightn't be as high as 2 or 2.5 which can happen in geospatial applications, but it's certainly not irrelevant.

In my experience (maybe I'm not the best person to talk about this), if you have a high compression ratio, quantizing your model may be the better approach. I have been playing with Qwen2.5-1.5B-Instruct-GPTQ-Int8: I just compressed it with zip and tar.xz, and in both cases the ratio is close to 1 compared to the original safetensors file.
