
Save spectrograms with np.savez_compressed #1

Open · 3 tasks
NickleDave opened this issue May 30, 2024 · 3 comments
Comments

NickleDave (Contributor) commented May 30, 2024

  • can we reduce file size
  • without affecting training
  • and without requiring a ton of re-engineering of dataset prep / the datapipe class

Currently we use a "just a bunch of files" approach, which lets us pair the same npz file--the spectrogram, the input to the model--with multiple npy files--the labels, the target of the model.
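
For concreteness, a minimal sketch of loading one such input/target pair; the filenames and the array key "s" are hypothetical, not the project's actual conventions:

```python
import numpy as np

# "Just a bunch of files" layout: one npz per spectrogram (model input),
# plus separate npy files for the targets (filenames/keys are hypothetical).
spect_npz = np.load("song1.spect.npz")   # NpzFile with named arrays
spect = spect_npz["s"]                   # 2-D array: frequency bins x time bins
labels = np.load("song1.labels.npy")     # per-frame labels used as the training target
```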

Sort of a worst case would be if we get a big benefit from jamming all the spectrograms into a single zarr archive, but then have to re-engineer all the code that assumes the spectrograms exist as separate files: the prep step, the dataset class, etc. The main reason to prefer separate files is tracking metadata and readability, but maybe I am overvaluing this.

This doesn't need to be the highest priority, but it could make it easier to upload the dataset.

edit: if we were to cram all the spectrograms into a single zarr archive, then we might want to access it with a mem-mapping approach. The DAS docs suggest it's not easy to squeeze good performance out of this:

While zarr, h5py, and xarray provide mechanisms for out-of-memory access, they tend to be slower in our experience or require fine tuning to reach the performance reached with memmapped npy files.

I have previously found examples of pytorch + zarr in other domains, but similarly got the impression that it's not a simple, clear process to follow and is not easy to troubleshoot. Although the point about just mem-mapping npy files makes me wonder if I should try that.
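
For reference, the memmapped-npy approach looks roughly like this; the filename is hypothetical, and note that memmapping applies to plain .npy files, it doesn't help with arrays inside a compressed .npz archive:

```python
import numpy as np

# np.load with mmap_mode returns an array backed by the file on disk,
# so only the slices you actually index get read into memory.
spect = np.load("song1.spect.npy", mmap_mode="r")   # hypothetical filename
window = np.asarray(spect[:, 1000:2000])            # reads just this window from disk
```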

NickleDave changed the title from "Test storing spectrograms with zarr" to "Save spectrograms with np.savez_compressed" on Oct 4, 2024
NickleDave (Contributor, Author) commented:

Renamed this issue.

I realized that numpy actually has built-in compression (np.savez_compressed), and used that to quickly test how much space we could save and whether it would affect training, without trying to figure out zarr and memmapping.

Tested this some and found that it does indeed save a lot of space. I tried it on one directory of spectrograms from Mouse Pup Calls and it went from 1.9G to 93M.
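
Roughly, the test amounts to re-saving each file with np.savez_compressed and comparing sizes; a sketch, with hypothetical paths:

```python
import numpy as np
from pathlib import Path

# Re-save an existing npz with np.savez_compressed (zlib-compressed zip archive)
# and compare file sizes. Paths here are hypothetical.
src = Path("spects/song1.spect.npz")
with np.load(src) as npz:
    arrays = {key: npz[key] for key in npz.files}

dst = src.with_name(src.name.replace(".npz", ".compressed.npz"))
np.savez_compressed(dst, **arrays)
print(f"{src.stat().st_size} -> {dst.stat().st_size} bytes")
```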

I then did some tests of the dataloader; I will attach a notebook.
My current understanding is that, at least for TweetyNet trained on Bengalese finch song spectrograms, we are in the compute-bound regime, where the time to shovel a batch through the GPU >>> the time for all the CPUs to load a batch.

If you just look at the loading times of the files, you might think we pay a big cost:
[figure: comparison of file loading times]

But if you measure the time it takes a pytorch dataloader to fetch a batch and put it on the GPU, and also include the time to run it through the model, it seems like there's no real difference:
[figure: time to fetch a batch and run it through the model]

I should double-check that this is true and that it holds for other datasets and models. But if it does, then we should just do this for everything.
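
For reference, a rough sketch of that batch-level measurement (not the exact code in the attached zip); the dataset and model are passed in, and the dataset is assumed to yield (input, target) pairs:

```python
import time
import torch
from torch.utils.data import DataLoader

def time_one_epoch(dataset, model, batch_size=64, num_workers=4, device="cuda"):
    """Time DataLoader fetch + transfer to GPU + forward pass over one pass of the data."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    model = model.to(device).eval()
    start = time.perf_counter()
    with torch.no_grad():
        for spect, _labels in loader:       # assumes (input, target) batches
            model(spect.to(device))
        if device == "cuda":
            torch.cuda.synchronize()        # include queued GPU work in the wall-clock time
    return time.perf_counter() - start
```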

NickleDave (Contributor, Author) commented:

I am remembering that I compressed everything and then ran actual experiments, and noticed that they seemed qualitatively slower for canary song, presumably in part because those spectrograms have 257 frequency bins as opposed to the 152 that Bengalese finch song has.

So: we should definitely measure with actual experiments.

But even if it does slow things down, we should save all spectrograms compressed anyway, and then provide a script that downloads the compressed files and decompresses them, basically by opening and re-saving them all.
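
A sketch of what that decompression step could look like, assuming a hypothetical directory of compressed .npz files:

```python
import numpy as np
from pathlib import Path

# Post-download step: open every compressed .npz and re-save it uncompressed,
# trading disk space for load speed. The directory name is hypothetical.
for path in Path("dataset/spects").glob("*.npz"):
    with np.load(path) as npz:
        arrays = {key: npz[key] for key in npz.files}
    np.savez(path, **arrays)   # np.savez writes an uncompressed zip archive in place
```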

Also note that I tried compressing everything, including the indexing vectors, and those were surprisingly slow. There must be some numerical property of the indexing vectors that makes them slow to compress -- because they're all integers? Or because they are mostly monotonic runs of values? So, don't do that.

NickleDave (Contributor, Author) commented:

This zip has a script and a notebook that I used to benchmark -- IIRC, I made the script from the notebook to be sure there was no effect from the Jupyter server (maybe me being ignorant about how they interact):
benchmark-dataloader.zip
