
spykingcircus2: optimization and speedup for in vitro HD-MEA recordings with low SNR #3543

Open · Djoels opened this issue Nov 19, 2024 · 10 comments
Labels: question (General question regarding SI sorters) · Related to sorters module

Djoels commented Nov 19, 2024

Running SC2 on a 10-minute / 37 GB in vitro HD-MEA recording (from hiPSC cultures) yields some surprisingly good results without too much fine-tuning (using SI 0.101 for this).
I have a couple of questions about getting it to run more smoothly:

It takes about 10,000 s (roughly 3 hours) to run. Is there any way I can configure it to run faster?
I tried setting the number of jobs to 80% of the cores (28 jobs) and increasing the chunk size so that it processes bigger chunks. It seems that many cores are used, but their individual memory usage is very low.
Should I preferably spin up a system with more cores (I'm currently running in Azure)?
I have access to a GPU, but SC2 doesn't use GPU acceleration, right?
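For reference, this is roughly how I set the parallelization (a minimal sketch, assuming the global job_kwargs mechanism; the values are just what I tried):

import spikeinterface.full as si

# 28 jobs (~80% of the cores) and larger chunks per worker
si.set_global_job_kwargs(n_jobs=28, chunk_duration="10s", progress_bar=True)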

If I were to fine-tune a limited set of parameters for the low-SNR, 1023-readout-channel use case, which ones should I focus on?
I'm assuming the following, but I may have missed some (see the sketch after this list):

  • detect_threshold
  • radius_um
  • the matching engine method (I'm not sure of its impact; is there a clear choice for the use case I'm describing?)
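Something like this is what I had in mind (a sketch; the nested parameter names follow the SC2 defaults as I understand them, recording is my already-loaded recording, and the values are placeholders rather than recommendations):

from spikeinterface.sorters import run_sorter

# placeholder overrides, not tuned values
params = {
    "general": {"radius_um": 75},              # spatial radius shared across steps
    "detection": {"detect_threshold": 5},      # peak detection threshold
    "matching": {"method": "circus-omp-svd"},  # or another matching engine
}
sorting = run_sorter("spykingcircus2", recording, folder="sc2_output", **params)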

I'm also having trouble identifying from the logging information which step is in progress at a given point in time.
I made a chart based on the code to try to understand the flow.
Is there a way to enable logging that reflects the high-level steps (or would it be OK if I try to add logging)?
[attached image: flow chart of the SC2 pipeline steps]
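So far the only hook I found is the verbose flag of run_sorter (shown below, assuming it is forwarded to SC2), but in my runs it didn't clearly mark the high-level steps:

from spikeinterface.sorters import run_sorter

sorting = run_sorter("spykingcircus2", recording, folder="sc2_output", verbose=True)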

Djoels changed the title from "spykingcircus2: optimization and speedup for HD-MEA with low SNR" to "spykingcircus2: optimization and speedup for in vitro HD-MEA recordings with low SNR" on Nov 19, 2024
alejoe91 added labels: question (General question regarding SI sorters), Related to sorters module (Nov 19, 2024)
yger (Collaborator) commented Nov 19, 2024

Thanks for your interest. In fact, I have a working branch that should speed up the whole algorithm quite drastically, making use of recent changes discussed with @samuelgarcia. Currently, no GPU is used, but with a powerful machine and tens of cores this should be much faster than the numbers you reported.
The best would be, if you are willing, to share your 37 GB file (or an even smaller one if you want) so that I can validate everything on it before merging into master. Then we can discuss the optimization of the parameters. Your pipeline chart is correct; this is the generic workflow of the algorithm.

Djoels (Author) commented Nov 20, 2024

Thank you very kindly for being willing to have a look at this. I'm eager to learn how you troubleshoot it, as this is an effort I'm still quite new to.

I've created a separate blob container in Azure with the recording and minimal code:
https://storageczispikesort.blob.core.windows.net/safesharecont01?sv=2023-01-03&spr=https%2Chttp&st=2024-11-21T00%3A00%3A00Z&se=2024-11-28T00%3A00%3A00Z&sr=c&sp=rl&sig=o%2FNBVdjwvdN%2F8quqQ2kjrVMgkKBp3eNp3l87UKG3KdM%3D

You can download it using the azcopy command:

azcopy copy "<URL_FROM_ABOVE>" <local_dir> --recursive

It contains a hybrid ground-truth recording under the hybridgt_20241011 directory; the code to read the recording and the ground-truth sorting data and to perform a basic sorting is in the read_minimal.py file.
I also added a pip freeze (freeze.txt) of an environment in which the run took approximately 10,000 seconds.

Update: @yger, let me know if you have access issues; the link should work until 28/11 if I set it up right.

yger (Collaborator) commented Nov 27, 2024

OK, so I've run some tests; thanks for sharing the data. I have a working branch called "sc2_recording_slices" that will be merged soon, and with the default params in this branch the code takes 3000 s to run on my machine with 28 jobs. An important point is that, because I have enough RAM, the file is written to memory, and that might also speed everything up; you need to double-check that this is also the case for you. So we have an improvement, but it is still (a bit) long.

However, given that the longest step is the template matching, another possibility to speed things up further is to switch the default template-matching engine from circus-omp-svd to wobble. I won't discuss the differences in depth, but results should be broadly similar, and matching should be faster (about 1.5x, I would say). I'll test that. You can try it on your side by updating the params of SC2:

params = {"matching" : {'method' : 'wobble'}}

I'll keep digging a bit and keep you posted.

yger (Collaborator) commented Nov 27, 2024

I must mention that in this new branch you now have the option to use a GPU during the fitting procedure (with both circus-omp-svd and wobble). You can do so by setting:

params = {"matching" : {'engine' : 'torch', 'torch_device' : 'cuda'}}

However, this has not yet been properly benchmarked, and I don't know what is best: a few cores with a GPU, or lots of cores without one. I should also have asked: are you using Linux or Windows? That might play an important role as well...

Djoels (Author) commented Nov 27, 2024

Thank you so much for looking into this: these are some exciting developments!

I am working on a Linux distribution. The RAM can be tailored to a given situation; some example (pure-CPU) configurations:

  • Standard_D15_v2: 20 cores, 140 GB RAM, 1000 GB disk
  • Standard_D32s_v3: 32 cores, 128 GB RAM, 256 GB storage
  • Standard_D64_v3: 64 cores, 256 GB RAM, 1600 GB storage

About caching the recording in memory, I'm not sure how to set these parameters (I left them at the default values):
'cache_preprocessing': {'mode': 'memory', 'memory_limit': 0.5, 'delete_cache': True}
If I have 140 GB of RAM at my disposal, it should suffice to have 50%, i.e. 70 GB, allocated for the recording, right?
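I assume overriding them would look roughly like this (a sketch; cache_preprocessing appears in the SC2 defaults, so I'm guessing it is passed like any other top-level parameter):

from spikeinterface.sorters import run_sorter

params = {"cache_preprocessing": {"mode": "memory", "memory_limit": 0.5, "delete_cache": True}}
sorting = run_sorter("spykingcircus2", recording, folder="sc2_output", **params)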

Note: when trying to rerun, I get this notification right before the "detect peaks using locally_exclusive" step:
Recording too large to be preloaded in RAM...

yger (Collaborator) commented Nov 28, 2024

OK, good to know that you are on Linux, because this is what I'm using also, and the multiprocessing mode of spikeinterface is known to work better there than on Windows.

The caching can be optimized, given your amount of RAM. What the code does is try to fit the preprocessed recording into RAM (in float32, so its size may be larger than the original if the raw data are int16), provided that the fraction memory_limit (0.5 by default) of your RAM is free, available, and big enough to hold the recording. Given that you see the warning, the recording is not being preloaded into RAM. You can try to increase memory_limit if you are willing to devote more RAM to SC2.

Not preloading into RAM is not a major problem, and will be the case anyway for very long recordings, but because the algorithm makes multiple passes over the data (to find peaks, to match peaks, ...), make sure the data sit on an SSD, because disk I/O is the main bottleneck. Also, without caching, the preprocessing steps are re-applied every time chunks are reloaded. This is fine as long as your preprocessing is not too complicated, but otherwise it is good to know there may be some speed to gain there. If there is not enough RAM, you can still cache the preprocessed file to a folder, but that requires enough disk space. I'll keep playing with it and we'll push the PR into main; I'll let you know.
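Roughly, the decision looks like this (a sketch of the idea, not the exact code inside SC2; psutil is just my shorthand for the RAM check):

import numpy as np
import psutil

# estimated size of the preprocessed recording once cast to float32
n_samples = recording.get_num_samples()
n_channels = recording.get_num_channels()
float32_bytes = n_samples * n_channels * np.dtype("float32").itemsize

memory_limit = 0.5  # fraction of the available RAM SC2 may use for the cache
available = psutil.virtual_memory().available

if float32_bytes < memory_limit * available:
    cache = {"mode": "memory"}  # preprocessed traces kept in RAM
else:
    cache = {"mode": "folder"}  # fall back to caching on (SSD) disk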

Djoels (Author) commented Nov 28, 2024

When running SC2 again on a similar recording, I see this in the command line:
write_memory_recording: 0%| | 0/602 [00:00<?, ?it/s]

However, it doesn't appear to be moving. Maybe this has to do with how the progress is displayed; I can't recall ever seeing it go to completion, it just suddenly finishes, I guess.

CPU usage is at 100% (on all but 2 cores) during this time:
[attached image: CPU usage screenshot]

Update: I also tried a much bigger cluster:
[attached image: CPU usage screenshot on the larger cluster]
and, as it stands, I see the same issue...

yger (Collaborator) commented Nov 28, 2024

Then I guess this is just a display issue with the progress bar. Weird, because I have never seen that, but I'll look into it.

Djoels (Author) commented Nov 28, 2024

Four hours later the output still hasn't changed; the system seems completely stalled on this step.
Maybe I should try the zarr approach?

Update: I tried the zarr approach and exactly the same thing occurred, on the 64-core system. All cores are fully used, there is no output in the command line, and nothing seems to happen.

Trying folder mode now, as it seems to be the only remaining way to go about it...

yger (Collaborator) commented Nov 29, 2024

OK, then forget about the caching, but this is strange. As said, caching is a plus, but the speedup should not be major; I'll redo some benchmarks with and without it to test that. While waiting for the new branch, you can also change the fitting engine to wobble, which would already be a gain.
