Save spectrograms with np.savez_compressed
#1
I remember that I compressed everything and then ran actual experiments, and noticed that they seemed qualitatively slower for canary song, presumably in part because those spectrograms have 257 frequency bins, as opposed to the 152 that Bengalese finch song has. So: we should definitely measure with actual experiments.

But even if compression does slow things down, we should save all spectrograms compressed anyway, and then provide a script that downloads the compressed files and decompresses them, basically by opening and re-saving them all.

Also note that I tried compressing everything, including the indexing vectors, and those were surprisingly slow. There must be some numerical property of the indexing vectors that makes them slow to compress -- because they're all integers? Or because they are mostly monotonic runs of values? So, don't do that.
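A minimal sketch of what that decompress-and-resave script could do, assuming each spectrogram lives in its own .npz archive whose keys we just copy over (the `specs/` directory name is an assumption):

```python
# Hypothetical sketch: re-save compressed .npz spectrogram files uncompressed.
from pathlib import Path

import numpy as np

SPEC_DIR = Path("specs")  # assumed location of the downloaded compressed files

for npz_path in sorted(SPEC_DIR.glob("*.npz")):
    with np.load(npz_path) as archive:
        arrays = {key: archive[key] for key in archive.files}
    # np.savez writes the same keys back without zlib compression,
    # trading disk space for faster loads at training time.
    np.savez(npz_path, **arrays)
```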
This zip has a script and a notebook that I used to benchmark -- IIRC, I made the script from the notebook to be sure there was no effect from the Jupyter server (maybe that's me being ignorant about how they interact).
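The zipped script is the authoritative version, but the shape of the benchmark would be roughly this (directory names are hypothetical):

```python
# Hypothetical benchmark: compare load times of compressed vs. uncompressed .npz.
import time
from pathlib import Path

import numpy as np


def time_loads(paths, n_repeats=5):
    """Return mean seconds to fully load every archive in `paths`."""
    durations = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        for path in paths:
            with np.load(path) as archive:
                for key in archive.files:
                    _ = archive[key]  # forces the disk read / decompression
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


compressed = sorted(Path("specs_compressed").glob("*.npz"))    # hypothetical dirs
uncompressed = sorted(Path("specs_uncompressed").glob("*.npz"))
print(f"compressed:   {time_loads(compressed):.3f} s")
print(f"uncompressed: {time_loads(uncompressed):.3f} s")
```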
Currently we use a "just a bunch of files" approach, which lets us pair the same npz file -- the spectrogram, the input to a model -- with multiple npy files -- the labels, the targets of the model.
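To illustrate the pairing (file names and array keys here are made up):

```python
import numpy as np

# one spectrogram archive can serve several different label files
spect = np.load("bird0-song1.spect.npz")["s"]      # hypothetical name and key
labels_coarse = np.load("bird0-song1.coarse.npy")  # one labeling scheme
labels_fine = np.load("bird0-song1.fine.npy")      # another, same spectrogram
```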
Sort of a worst case might be one where we get a big benefit from jamming all the spectrograms into a single zarr archive, but that means we would have to re-engineer all the code that assumes the spectrograms exist as separate files: the prep step, the dataset class, etc. The reason to prefer the separate files is mainly for tracking metadata and for readability, but maybe I am overvaluing this.
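For comparison, packing everything into one zarr archive might look roughly like this sketch (array names and chunk sizes are assumptions, and real code would also need to carry per-spectrogram metadata):

```python
# Hypothetical sketch: pack all spectrograms into a single zarr archive (zarr v2 API).
import numpy as np
import zarr

# assumed mapping from spectrogram ID to 2-D (freq x time) array, e.g. from the prep step
spectrograms = {"bird0-song1": np.random.rand(257, 1000).astype(np.float32)}

root = zarr.open_group("spects.zarr", mode="w")
for spect_id, spect in spectrograms.items():
    # chunk along time so a training window can be read without loading the whole array
    root.create_dataset(spect_id, data=spect, chunks=(spect.shape[0], 512))
```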
This doesn't need to be highest priority, but it could make it easier to upload the dataset.
edit: if we were to cram all the spectrograms into a single zarr archive, then we might want to access it with a memory-mapping approach. The DAS docs suggest it's not easy to squeeze good performance out of this:
I did previously find examples for PyTorch + zarr in other domains, but similarly got the impression that it's not a simple, clear process to follow, and that it's not easy to troubleshoot. Although the point about just memory-mapping npy makes me wonder if I should try that.
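The minimal version of memory-mapping npy is just passing `mmap_mode` to `np.load`; note this only works for plain .npy files, not for compressed .npz archives (file name is hypothetical):

```python
import numpy as np

# np.load with mmap_mode returns a memory-mapped array: slices are read
# from disk lazily, so taking a training window doesn't load the whole file.
spect = np.load("bird0-song1.spect.npy", mmap_mode="r")
window = np.asarray(spect[:, 1000:2000])  # copy just this window into RAM
```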