spykingcircus2: optimization and speedup for in vitro HD-MEA recordings with low SNR #3543
Thanks for your interest. In fact, I have a working branch that should drastically speed up the whole algorithm, making use of recent changes discussed with @samuelgarcia. Currently, no GPU is used, but with a powerful machine and tens of cores, this should be much faster than the numbers you reported.
Thank you very kindly for being willing to have a look at this. I'm eager to learn how you troubleshoot this, as it is an effort I'm quite new to. I've created a separate blob container in Azure with the recording and minimal code; you can download it using the azcopy command:
It contains a hybrid ground-truth recording under the hybridgt_20241011 directory, and the code to read the recording and ground-truth sorting data and perform basic sorting is in the read_minimal.py file.

Update: @yger let me know if you have access issues; it should be working until 28/11, if I set it up right.
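Since read_minimal.py itself is not quoted in the thread, here is a rough, hypothetical sketch of what a minimal read-and-sort script for a hybrid ground-truth dataset could look like with SpikeInterface; the paths and the comparison step are assumptions, not the actual contents of the file:

```python
import spikeinterface.full as si

# Hypothetical paths inside the hybridgt_20241011 directory
recording = si.load_extractor("hybridgt_20241011/recording")    # hybrid HD-MEA recording
gt_sorting = si.load_extractor("hybridgt_20241011/gt_sorting")  # injected ground-truth spikes

# Basic sorting with SpykingCircus2 default parameters
sorting = si.run_sorter("spykingcircus2", recording, folder="sc2_output", verbose=True)

# Score the sorter output against the hybrid ground truth
comparison = si.compare_sorter_to_ground_truth(gt_sorting, sorting)
print(comparison.get_performance())
```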
Ok, so I've made some tests, thanks for sharing the data. I have a working branch called "sc2_recording_slices" that will be merged soon, and with the default params in this branch the code takes 3000 s to run on my machine with 28 jobs. An important point is that because I have enough RAM, the file is written to memory, and that might also speed everything up; you should double-check whether this is the case for you as well. So we have an improvement, but it is still (a bit) long. However, given that the longest step is the template matching, another possibility to speed things up further is to switch the default template-matching engine from circus-omp-svd to wobble. I won't discuss the differences in depth, but the results should be broadly similar, and matching should be faster (x1.5, I would say). I'll test that. You can try it on your side by updating the params of SC2:

params = {"matching": {"method": "wobble"}}

I'll keep digging a bit, and keep you posted.
I must mention that in this new branch you now have the option to use the GPU during the fitting procedure (with both circus-omp-svd and wobble). You can do so by setting

params = {"matching": {"engine": "torch", "torch_device": "cuda"}}

However, this has not yet been properly benchmarked, and I don't know what is best: a few cores with a GPU, or lots of cores without one. I should also have asked: are you using Linux or Windows? That might play an important role as well.
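To make the two suggestions above concrete, here is a minimal sketch of how these overrides could be passed to run_sorter; the key names come from the comments above, while the paths, the output folder name, and the assumption that the wobble method and the torch engine can be combined in a single matching dict are mine:

```python
import spikeinterface.full as si

# Reuse the recording loaded as in read_minimal.py (placeholder path; adjust to your data)
recording = si.load_extractor("hybridgt_20241011/recording")

# Combined override based on the two comments above: wobble matching (faster, ~x1.5),
# fitted with the torch engine on the GPU.
sorting = si.run_sorter(
    "spykingcircus2",
    recording,
    folder="sc2_wobble_gpu",
    matching={"method": "wobble", "engine": "torch", "torch_device": "cuda"},
)
```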
Thank you so much for looking into this: these are some exciting developments! I am working on a Linux distribution; the RAM can be tailored to a given situation. Some example (pure CPU) configurations:
About the caching of the recording in memory: I'm not so sure how to set these parameters (I left them at their default values).

Note: on trying to rerun, I get this notification right before the "detect peaks using locally_exclusive" step.
Ok, good to know that you are on Linux, because that is what I'm using as well, and the multiprocessing mode of spikeinterface is known to work better there than on Windows.

The caching can be optimized, given your amount of RAM. What the code will do is try to fit the preprocessed recording into RAM (in float32, so it might be larger than the original file if the raw data are in int16), provided that a fraction of 0.5 (memory_limit) of your RAM is free, available, and big enough to receive the recording. Given that you see the warning, the recording is not being preloaded into RAM. You can try to increase memory_limit if you are willing to devote more RAM to SC2.

Not preloading into RAM is not a major deal, and it will be like that anyway for very long recordings, but because the algorithm makes multiple passes over the data (to find peaks, to match peaks, ...), be sure that you have an SSD there, because disk I/O is the main bottleneck. In addition, without caching, the preprocessing steps are re-applied every time chunks are reloaded. This is fine as long as your preprocessing is not too complicated, but otherwise it is good to know that there might be some speed gain there. If there is not enough RAM, you can still cache the preprocessed file to a folder, but that requires you to have enough disk space.

I'll keep playing with it, and we'll push the PR into main. I'll let you know.
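As a concrete illustration of the caching discussion, here is a rough sketch of how the caching behaviour might be configured; the cache_preprocessing key, its mode values, and delete_cache are my guesses at SC2's parameter names based on this thread (memory_limit, the zarr approach, and the folder mode are all mentioned here), so check them against the sorter's actual defaults first:

```python
import spikeinterface.full as si

# Print the real default parameters to confirm the caching-related key names
print(si.get_default_sorter_params("spykingcircus2"))

# Hypothetical caching configurations discussed above
cache_in_ram = {
    "mode": "memory",     # preload the preprocessed (float32) recording into RAM
    "memory_limit": 0.8,  # allow up to 80% of free RAM instead of the 0.5 default
}
cache_on_disk = {
    "mode": "folder",      # cache the preprocessed recording on (ideally SSD) disk
    "delete_cache": True,  # assumed flag: remove the cached copy once sorting is done
}

# sorting = si.run_sorter(
#     "spykingcircus2", recording, folder="sc2_output",
#     cache_preprocessing=cache_in_ram,  # or cache_on_disk if RAM is too small
# )
```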
Then I guess this is just a display issue with the progress bar. Weird, because I've never seen that, but I'll look into it.
Four hours later, the output still hasn't changed; the system seems completely stalled on this.

Update: tried the zarr approach and the exact same thing occurred, on the 64-core system. All cores are fully used, there is no output in the command line, and nothing seems to happen. Trying folder mode, as it seems to be the only remaining way to go about it...
Ok, then forget about the caching, but this is strange. As I said, caching is a plus, but the speedup should not be major; I'll redo some benchmarks with/without it to test that. While waiting for the new branch, you can also change the fitting engine to wobble, which would already give you a gain.
Running SC2 on a 10-minute/37 GB in vitro HD-MEA recording (from hiPSC cultures) yields some surprisingly good results without too much fine-tuning (using SI 0.101 for this).
I have a couple of questions about getting it to run more smoothly:
It takes about 10,000 s (roughly 3 hours) to run; is there any way I can configure it to run faster?
I tried changing the number of jobs to 80% (28 jobs) and setting the chunk size so that it would take bigger chunks (see the hedged job_kwargs sketch after these questions). It seems that many cores are used, but their individual memory usage is very low.
Should I preferably spin up a system with more cores (currently running in Azure)?
I have access to a GPU, but SC2 doesn't use GPU acceleration, right?
If I were to try to fine-tune a limited set of parameters for the low-SNR, 1023-readout-channel use case, which should I fine-tune?
I'm assuming the following, but I may have missed some:
I'm also having trouble identifying, from the logging information, which step is in progress at a given point in time.
I made a chart based on the code to try to understand the flow.
Is there a way to enable logging that reflects the high-level steps (or would it be okay if I tried to add logging)?
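Regarding the n_jobs/chunk-size experiments mentioned above, here is a rough sketch of how parallelization is usually controlled in SpikeInterface; the specific values are placeholders, and whether SC2 honours global job kwargs in every step is an assumption worth verifying:

```python
import spikeinterface.full as si

# Global job kwargs apply to most parallelized steps (peak detection, matching, ...).
# n_jobs can be an integer or a fraction of the available cores (0.8 -> ~80%).
si.set_global_job_kwargs(
    n_jobs=0.8,
    chunk_duration="1s",  # larger chunks mean fewer, bigger reads per worker
    progress_bar=True,
)

# SC2 also accepts job kwargs directly as a sorter parameter, e.g.:
# sorting = si.run_sorter(
#     "spykingcircus2", recording, folder="sc2_output",
#     job_kwargs={"n_jobs": 0.8, "chunk_duration": "1s"},
# )
```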