Speed up prediction - batch_size & num_workers #731

charleygros · 2023-05-21T22:18:12Z

Hi there,

I just wanted to double-check with you on the best way to speed up prediction.

I am working with one-hour-long audio files. I would like to use a model I trained on these files to predict the presence of a species in 5 s clips.

I am using CPUs at prediction time.

predict(samples, batch_size=1, num_workers=0)

Playing around with these three parameters, it seems that the fastest way is to loop over the files and call predict on one file at a time, passing a one-element list of audio file paths as input, something like:

for fname in lst_fname_to_predict:
    pred = predict([fname], batch_size=1, num_workers=1)
    ...

I doubt that's the best way to do it.

What would you recommend? For example, splitting each one-hour file before calling predict and increasing batch_size? I'm not sure, and I'm curious to hear your recommendations.

Many thanks in advance,
Charley


charleygros · 2023-06-02T23:43:33Z

Author

@sammlapp: Wondering if you may be able to help us here? Many thanks in advance

1 reply

sammlapp Jun 3, 2023
Maintainer

Hi, sorry for the delay, we've been doing fieldwork. I'll write a complete answer tomorrow, but basically you just want to make a dataframe with a multi-index (file path, start time of clip, end time of clip) and pass that to predict, then use the largest batch size you can without running out of memory and num_workers = 8 perhaps, or more depending on the number of CPUs in your machine.
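As a rough illustration of the multi-index dataframe described above, here is a minimal sketch. The helper names `clip_windows` and `build_clip_index`, the `file_durations` dict, and the index level names ("file", "start_time", "end_time") are assumptions made for this example, not something confirmed in the thread; check the OpenSoundscape docs for the exact format your version expects.

```python
def clip_windows(file_duration, clip_duration):
    """Return (start, end) times in seconds for consecutive, non-overlapping clips.

    A trailing remainder shorter than clip_duration is dropped.
    """
    n_clips = int(file_duration // clip_duration)
    return [(i * clip_duration, (i + 1) * clip_duration) for i in range(n_clips)]


def build_clip_index(file_durations, clip_duration):
    """Build an empty DataFrame indexed by (file, start_time, end_time).

    file_durations: hypothetical dict mapping audio path -> duration in seconds.
    The level names below are an assumption, not taken from the thread.
    """
    import pandas as pd  # imported here so clip_windows has no third-party dependency

    rows = [
        (path, start, end)
        for path, duration in file_durations.items()
        for start, end in clip_windows(duration, clip_duration)
    ]
    index = pd.MultiIndex.from_tuples(rows, names=["file", "start_time", "end_time"])
    return pd.DataFrame(index=index)
```

For a one-hour file and 5 s clips, `clip_windows(3600, 5)` yields 720 windows; the resulting dataframe would then be what gets passed to predict in place of a list of paths.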

sammlapp · 2023-06-06T14:16:23Z

Maintainer

Hey @charleygros, prediction on long files with OpenSoundscape >=0.8.0 is actually very simple (even simpler than my message above). All you need to do is pass a list of files (or a dataframe with file paths in the index), and the CNN.predict() method will take care of splitting your files into clips of the appropriate length. The CNN object's .preprocessor attribute will use the same clip duration that was used to train the model. You can use the predict method's num_workers argument to parallelize preprocessing of samples across CPU processes, and batch_size to increase prediction speed by preparing and running many samples at once.

For example, with OpenSoundscape 0.9.0:

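from glob import glob

from opensoundscape import load_model

# collect every audio file to run prediction on
all_files_for_prediction = glob('my_data_path/*.WAV')

# load the trained model (the path must be a quoted string, not backticks)
model = load_model('my_saved_cnn.model')

# predict() splits each file into clips; tune num_workers and batch_size to your machine
preds = model.predict(all_files_for_prediction, num_workers=8, batch_size=256)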

In general, increasing batch_size and num_workers will speed up prediction up to some threshold where you run out of memory, CPUs, or I/O speed. In particular, num_workers will be limited by CPUs and I/O speed, while batch_size will be limited by memory.
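One way to turn the CPU limit above into a starting value for num_workers is sketched below. The helper `suggest_num_workers` and its `cap` and `reserve` parameters are hypothetical, not part of OpenSoundscape; treat the result as a first guess to adjust by timing a few files.

```python
import os


def suggest_num_workers(cap=8, reserve=1):
    """Pick a worker count no larger than `cap`, leaving `reserve` CPUs free."""
    n_cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cap, n_cpus - reserve))
```

The value could then be passed as `num_workers=suggest_num_workers()`. There is no equally simple rule for batch_size, since the memory used per sample depends on the model and the clip length.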

4 replies

charleygros Jun 7, 2023
Author

Thank you @sammlapp, that's helpful and appreciated.
I will go with that 👍

Note: In a use case where I have hundreds of one-hour audio files to process, a progress bar is useful to help with my patience :-) At the moment, I am using the workaround below, with a tqdm for loop over filenames and saving outputs at each iteration. But it is very likely to be slower than the solution you suggested. Thanks again.

from opensoundscape import load_model
from glob import glob
from tqdm import tqdm

all_files_for_prediction = glob('my_data_path/*.WAV')

model = load_model('my_saved_cnn.model')

for fname in tqdm(all_files_for_prediction):
    pred = model.predict([fname], num_workers = 8, batch_size=256)

    # Apply predict_multi_target_labels(scores, threshold)
    # ...

    # Save positive samples, audio clip and associated spectrogram
    # ...

sammlapp Jun 8, 2023
Maintainer

Great. This should actually be nearly as fast, since each file contains many samples that can be processed in parallel and batched. You could also use opensoundscape's wandb integration for more detailed real-time logging, including progress on prediction and visualization of samples (see docs).

charleygros Jun 9, 2023
Author

Yes! I have been working with opensoundscape's wandb recently during model training, and think that's a great addition. Many thanks for integrating this great tool.
I should also try it at prediction time, you are right.

I also like to save the predictions 'on the fly' (i.e. one file at a time) so that I can start working on the QA/QC while the model continues to generate predictions.

Thanks again for your insights here @sammlapp , really appreciated 👍

sammlapp Jun 9, 2023
Maintainer

Agreed - saving predictions frequently also avoids losing progress if something goes astray. We prefer to predict in some sort of loop rather than run a single predict call for hours/days :)
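The loop-and-save pattern discussed above could look roughly like the sketch below. To keep it self-contained, it takes a hypothetical `predict_fn` callable (path in, list of (start, end, score) rows out) instead of calling OpenSoundscape directly; the function name, the CSV layout, and the one-CSV-per-audio-file convention are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def predict_with_checkpoints(files, predict_fn, out_dir):
    """Run predict_fn on one file at a time, saving each result as a CSV.

    Files that already have a CSV in out_dir are skipped, so an interrupted
    run can be restarted without repeating finished work.
    predict_fn is a hypothetical callable: path -> list of (start, end, score).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for path in files:
        # output is keyed by file stem, so stems are assumed to be unique
        out_path = out_dir / (Path(path).stem + ".csv")
        if out_path.exists():
            continue  # finished in an earlier run
        rows = predict_fn(path)
        tmp_path = out_path.with_suffix(".tmp")
        with open(tmp_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["start_time", "end_time", "score"])
            writer.writerows(rows)
        tmp_path.replace(out_path)  # rename only after the file is fully written
        processed.append(path)
    return processed
```

In practice `predict_fn` would wrap something like `model.predict([path], num_workers=8, batch_size=256)`, and the iterable of files could be wrapped in tqdm for a progress bar. If predict returns a dataframe, writing it with its own `to_csv` method may be simpler than the csv module used here.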
