Optimize BSS use of FFT with cupy, speedup of up to 3x for full tracks #83
Also, there could be a "super-performant" config with cupy, stacking multiple 1D FFTs (respecting GPU memory allocation limits) and using pinned host/GPU memory and FFT plans - I'll continue working in that direction.
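A rough sketch of what that direction could look like, assuming cupy's `alloc_pinned_memory` and `get_fft_plan` APIs (the frame count and FFT size are placeholder values, not from the branch):

```python
import numpy as np
import cupy as cp
from cupyx.scipy.fftpack import get_fft_plan

# Placeholder batch of windowed frames to transform in one shot.
n_frames, n_fft = 4096, 2048
frames = np.random.randn(n_frames, n_fft).astype(np.float32)

# Stage the host data in pinned (page-locked) memory so the host->device
# copy is faster.
pinned = cp.cuda.alloc_pinned_memory(frames.nbytes)
host_buf = np.frombuffer(pinned, frames.dtype, frames.size).reshape(frames.shape)
host_buf[...] = frames

gpu_frames = cp.asarray(host_buf)  # one bulk transfer instead of many small ones

# Build a cuFFT plan once and reuse it for every batch of the same shape,
# instead of letting cupy re-plan on each call.
plan = get_fft_plan(gpu_frames, axes=-1, value_type='R2C')
with plan:
    spectra = cp.fft.rfft(gpu_frames, axis=-1)  # 4096 stacked 1D FFTs in one call
```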
Optimized every slow line (discovered through kernprof + line_profiler): master...sevagh:feat/cupy-accel. This brings it down to roughly 1 minute to compute the IRM mask and perform a BSS evaluation on one full-length MUSDB18 track:
This is down from the 3+ minutes originally:
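The kernprof/line_profiler workflow is roughly: decorate the hot function with `@profile` (injected into builtins by kernprof at runtime) and run the script with `kernprof -l -v`. The function below is just an illustrative placeholder, not the actual profiled code:

```python
import museval

@profile  # noqa: F821 -- provided by kernprof when run as `kernprof -l -v script.py`
def evaluate_one_track(track, estimates):
    # line_profiler then reports per-line timings inside the decorated code,
    # which is how the slow FFT calls can be located.
    return museval.eval_mus_track(track, estimates)
```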
@sevagh I think this would be great. Do the regression tests pass using this?
How can I run the tests?
Install the test environment.
OK, my most recent commits get the regression tests passing. Casting explicitly to float32 was creating huge errors in SAR/SIR/ISR, so I just removed those casts. I made the cupy install optional (although pinned to CUDA 11.4, which is rather recent). One other note/idiosyncrasy: it's best to clear the cupy FFT cache between BSS evaluations of large songs, so I added a helper function for that and call it between evaluations in real code.
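The helper itself isn't reproduced above; a minimal sketch of the idea, assuming cupy's plan-cache and memory-pool APIs (the driver loop and variable names are illustrative), could look like this:

```python
import cupy as cp
import museval

def clear_cupy_cache():
    # Drop cached cuFFT plans and return pooled GPU memory to the driver,
    # so memory doesn't accumulate across evaluations of long tracks.
    cp.fft.config.get_plan_cache().clear()
    cp.get_default_memory_pool().free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()

def evaluate_tracks(tracks, estimates):
    # Hypothetical driver loop: evaluate each track, then clear the cache.
    results = []
    for track in tracks:
        results.append(museval.eval_mus_track(track, estimates[track.name]))
        clear_cupy_cache()
    return results
```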
Passing regression test:
Hello,
I have been working on some potential performance optimizations for the BSS evaluation (which is rather slow/compute-intensive for full tracks).
Baseline measurement with the original museval code (the total execution also involves computing the IRM, adapted from https://github.com/sigsep/sigsep-mus-oracle/blob/master/IRM.py):
The original code takes ~3:20 minutes.
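For context, the IRM oracle being benchmarked is roughly the following (a condensed sketch in the spirit of IRM.py, not a verbatim copy; the STFT size and masking exponent are assumptions):

```python
import numpy as np
from scipy.signal import stft, istft
import museval

def irm_estimates(track, alpha=2, nperseg=4096, eps=1e-10):
    # STFT of the mixture: track.audio is (samples, channels) in musdb.
    _, _, X = stft(track.audio.T, nperseg=nperseg)
    # Power spectrograms of the true sources.
    P = {name: np.abs(stft(src.audio.T, nperseg=nperseg)[2]) ** alpha
         for name, src in track.sources.items()}
    total = sum(P.values()) + eps
    estimates = {}
    for name, p in P.items():
        mask = p / total                          # ideal ratio mask for this source
        _, audio = istft(mask * X, nperseg=nperseg)
        estimates[name] = audio.T[: track.audio.shape[0]]
    return estimates

# The slow part being optimized is the BSS evaluation of these estimates:
# scores = museval.eval_mus_track(track, irm_estimates(track))
```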
The second optimization uses cupy and the GPU, which is in my opinion a big cost/burden for end users. Installing the CUDA toolkit etc. is no joke. Here is the code: master...sevagh:feat/cupy-accel
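(Not necessarily how the branch does it, but one low-touch way to route FFTs to the GPU is scipy's backend mechanism, which cupy supports:)

```python
import scipy.fft
import cupy as cp
import cupyx.scipy.fft as cufft

# Register cupy's FFT module as the scipy.fft backend; scipy.fft calls on
# cupy arrays are then dispatched to cuFFT on the GPU.
scipy.fft.set_global_backend(cufft)

x = cp.random.randn(2, 44100 * 10)   # e.g. ten seconds of stereo audio on the GPU
X = scipy.fft.rfft(x, axis=-1)       # runs on the GPU via cuFFT
```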
However, the performance is rather good at ~1:20 minutes, roughly 2.5x faster than the original code:
One final note is that the CUDA/cupy version has slight differences in the outputs due to numerical precision differences. It doesn't look too significant to me - here's an excerpt of a diff between the evaluated json files, showing small differences in the BSS scores:
I'm also trying to find a way to use CPU parallelism with scipy.fft, combining several of the FFTs into a single call, but this isn't helping as much as the CUDA change. My code attempts can be seen here: master...sevagh:multiple-1d-fft
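A sketch of that CPU-side idea, assuming scipy.fft's batched transforms and its `workers` argument (shapes are illustrative):

```python
import numpy as np
import scipy.fft

# Placeholder batch of windowed frames; in real code these would come from
# the framing step of the BSS metrics.
frames = np.random.randn(4096, 2048)

# One call computes all 4096 1D FFTs along the last axis, parallelized over
# the available CPU cores via `workers`.
spectra = scipy.fft.rfft(frames, axis=-1, workers=-1)
```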
I'm aware of the separate repo for bss at https://github.com/sigsep/bsseval/ but I wasn't sure which project to discuss it in - I'm using museval because I'm trying to recreate the SiSec 2018 testbench.