
Save spectrograms with np.savez_compressed #1

Open · 3 tasks
NickleDave opened this issue May 30, 2024 · 3 comments
Comments

NickleDave (Contributor) commented May 30, 2024

  • can we reduce file size
  • without affecting training
  • and without requiring a ton of re-engineering of dataset prep / the datapipe class

Currently we use a "just a bunch of files" approach, which lets us pair the same npz file--the spectrogram, the input to the model--with multiple npy files--the labels, the target of the model.
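
For concreteness, a minimal sketch of loading one such input/target pair; the filenames and the array key "s" are hypothetical, not the project's actual conventions:

```python
import numpy as np

# "Just a bunch of files" layout: one npz per spectrogram (model input),
# plus separate npy files for the targets (filenames/keys are hypothetical).
spect_npz = np.load("song1.spect.npz")   # NpzFile with named arrays
spect = spect_npz["s"]                   # 2-D array: frequency bins x time bins
labels = np.load("song1.labels.npy")     # per-frame labels used as the training target
```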

Sort of a worst case would be if we get a big benefit from jamming all the spectrograms into a single zarr archive, but then have to re-engineer all the code that assumes the spectrograms exist as separate files: the prep step, the dataset class, etc. The main reason to prefer separate files is tracking metadata and readability, but maybe I am overvaluing this.

This doesn't need to be the highest priority, but it could make it easier to upload the dataset.

edit: if we were to cram all the spectrograms into a single zarr archive, then we might want to access it with a mem-mapping approach. The DAS docs suggest it's not easy to squeeze good performance out of this:

While zarr, h5py, and xarray provide mechanisms for out-of-memory access, they tend to be slower in our experience or require fine tuning to reach the performance reached with memmapped npy files.

I have previously found examples of pytorch + zarr in other domains, but similarly got the impression that it's not a simple, clear process to follow and is not easy to troubleshoot. Although the point about just mem-mapping npy files makes me wonder if I should try that.
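
For reference, the memmapped-npy approach looks roughly like this; the filename is hypothetical, and note that memmapping applies to plain .npy files, it doesn't help with arrays inside a compressed .npz archive:

```python
import numpy as np

# np.load with mmap_mode returns an array backed by the file on disk,
# so only the slices you actually index get read into memory.
spect = np.load("song1.spect.npy", mmap_mode="r")   # hypothetical filename
window = np.asarray(spect[:, 1000:2000])            # reads just this window from disk
```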

NickleDave changed the title from "Test storing spectrograms with zarr" to "Save spectrograms with np.savez_compressed" on Oct 4, 2024
NickleDave (Contributor, Author) commented:

Renamed this issue.

I realized that numpy actually has built-in compression (np.savez_compressed), and used that to quickly test how much space we could save and whether it would affect training, without trying to figure out zarr and memmapping.

Tested this some and found that it does indeed save a lot of space. I tried it on one directory of spectrograms from Mouse Pup Calls and it went from 1.9G to 93M.
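
Roughly, the test amounts to re-saving each file with np.savez_compressed and comparing sizes; a sketch, with hypothetical paths:

```python
import numpy as np
from pathlib import Path

# Re-save an existing npz with np.savez_compressed (zlib-compressed zip archive)
# and compare file sizes. Paths here are hypothetical.
src = Path("spects/song1.spect.npz")
with np.load(src) as npz:
    arrays = {key: npz[key] for key in npz.files}

dst = src.with_name(src.name.replace(".npz", ".compressed.npz"))
np.savez_compressed(dst, **arrays)
print(f"{src.stat().st_size} -> {dst.stat().st_size} bytes")
```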

I then did some tests of the dataloader; I will attach a notebook.
My current understanding is that, at least for TweetyNet trained on Bengalese finch song spectrograms, we are in the compute-bound regime, where the time to shovel a batch through the GPU >>> the time for all the CPUs to load a batch.

If you just look at the loading times of the files, you might think we pay a big cost:
[figure: comparison of file loading times]

But if you measure the time it takes a pytorch dataloader to fetch a batch and put it on the GPU, and also include the time to run it through the model, it seems like there's no real difference:
[figure: time to fetch a batch and run it through the model]

I should double-check that this is true and that it holds for other datasets and models. But if it does, then we should just do this for everything.
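
For reference, a rough sketch of that batch-level measurement (not the exact code in the attached zip); the dataset and model are passed in, and the dataset is assumed to yield (input, target) pairs:

```python
import time
import torch
from torch.utils.data import DataLoader

def time_one_epoch(dataset, model, batch_size=64, num_workers=4, device="cuda"):
    """Time DataLoader fetch + transfer to GPU + forward pass over one pass of the data."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    model = model.to(device).eval()
    start = time.perf_counter()
    with torch.no_grad():
        for spect, _labels in loader:       # assumes (input, target) batches
            model(spect.to(device))
        if device == "cuda":
            torch.cuda.synchronize()        # include queued GPU work in the wall-clock time
    return time.perf_counter() - start
```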

NickleDave (Contributor, Author) commented:

I am remembering that I compressed everything and then ran actual experiments, and noticed that they seemed qualitatively slower for canary song, presumably in part because those spectrograms have 257 frequency bins as opposed to the 152 that Bengalese finch song has.

So: we should definitely measure with actual experiments.

But even if it does slow things down, we should save all spectrograms compressed anyway, and then provide a script that downloads the compressed files and decompresses them, basically by opening and re-saving them all.
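
A sketch of what that decompression step could look like, assuming a hypothetical directory of compressed .npz files:

```python
import numpy as np
from pathlib import Path

# Post-download step: open every compressed .npz and re-save it uncompressed,
# trading disk space for load speed. The directory name is hypothetical.
for path in Path("dataset/spects").glob("*.npz"):
    with np.load(path) as npz:
        arrays = {key: npz[key] for key in npz.files}
    np.savez(path, **arrays)   # np.savez writes an uncompressed zip archive in place
```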

Also note that I tried compressing everything, including the indexing vectors, and those were surprisingly slow. There must be some numerical property of the indexing vectors that makes them slow to compress -- because they're all integers? Or because they are mostly monotonic runs of values? So, don't do that.

NickleDave (Contributor, Author) commented:

This zip has a script and a notebook that I used to benchmark -- IIRC, I made the script from the notebook to be sure there was no effect from the Jupyter server (maybe me being ignorant about how they interact):
benchmark-dataloader.zip
