5x pytorch performance increase on CPU, 1 thread #408
-
UPD: onnx < 1s.
P.S. Admittedly, this was a very quick proof-of-concept test, so I may have screwed something up, even though the result probabilities are all within 2e-6 of the reference probabilities.
-
Hi @IntendedConsequence , I managed to implement your idea in V5 and got a great speedup. The only problem I have is with exporting to onnx: I have to split the model into an encoder and a decoder, but other than that, no problems.

```python
import numpy as np
import onnxruntime

opts = onnxruntime.SessionOptions()
opts.inter_op_num_threads = 0
opts.intra_op_num_threads = 0
opts.log_severity_level = 2

encoder_session = onnxruntime.InferenceSession(
    "encoder.onnx",
    providers=["CPUExecutionProvider"],
    sess_options=opts,
)
decoder_session = onnxruntime.InferenceSession(
    "decoder.onnx",
    providers=["CPUExecutionProvider"],
    sess_options=opts,
)


def forward(x: np.ndarray, num_samples: int, context_size_samples: int):
    assert (
        x.ndim == 2
    ), "Input should be a 2D tensor with size (batch_size, num_samples)"
    assert (
        x.shape[1] % num_samples == 0
    ), "Input size should be a multiple of num_samples"
    num_audio = x.shape[0]
    state = np.zeros((2, num_audio, 128), dtype="float32")

    # Split the audio into windows and build each window's context from the
    # tail of the previous window (the first window gets a zero context).
    x = x.reshape(num_audio, -1, num_samples)
    context = x[..., -context_size_samples:].copy()  # copy so the input isn't zeroed in place
    context[:, -1] = 0
    context = np.roll(context, 1, 1)
    x = np.concatenate([context, x], 2)
    x = x.reshape(-1, num_samples + context_size_samples)

    # Stateless encoder: all windows of all audios in one batched call.
    x = encoder_session.run(None, {"input": x})[0]
    x = x.reshape(num_audio, -1, 128)

    # Stateful decoder: one window at a time, carrying a single state along.
    decoder_outputs = []
    for window in np.split(x, x.shape[1], axis=1):
        out, state = decoder_session.run(
            None, {"input": window.squeeze(1), "state": state}
        )
        decoder_outputs.append(out)

    out = np.stack(decoder_outputs, axis=1).squeeze(-1)
    return out
```

It accepts a whole input file; no batching is needed, as any system can handle the whole file in one pass.
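For reference, a minimal usage sketch for the forward() above (the audio file name, the use of soundfile for loading, and the 512-sample window / 64-sample context values for v5 at 16 kHz are assumptions, not part of the original snippet):

```python
# Hypothetical usage of forward() from the snippet above.
import numpy as np
import soundfile as sf

audio, sr = sf.read("example.wav", dtype="float32")  # mono, 16 kHz assumed
num_samples = 512             # v5 window size at 16 kHz
context_size_samples = 64     # assumed v5 context size at 16 kHz

# Pad so the length is a multiple of num_samples, then add a batch dimension.
pad = (-len(audio)) % num_samples
audio = np.pad(audio, (0, pad))

probs = forward(audio[None, :], num_samples, context_size_samples)
print(probs.shape)  # (1, num_windows): one speech probability per 512-sample window
```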
-
Hi, this is very interesting. I'm also trying this now for the …
I'm measuring the performance over a single chunk of 512 samples, the value recommended by the devs (in fact, it's the only value supported for the new v5 model). The fact that you initially had 512*3-sample windows could have made your inference more efficient.
I replicated the same 512-element chunk across the batch dimension for batch sizes 1-256. Set the … With …
So I could only get a 2x improvement from this change. Comments are welcome.
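For reference, a rough timing sketch of this kind of measurement (not the exact script used here; the model path, the `input`/`state`/`sr` input names, and the (2, batch, 128) state shape are assumptions about the v5 ONNX model and may need adjusting):

```python
import time

import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession(
    "silero_vad.onnx", providers=["CPUExecutionProvider"]
)
chunk = np.random.randn(1, 512).astype("float32")  # one 512-sample chunk

for batch_size in (1, 8, 32, 128, 256):
    x = np.repeat(chunk, batch_size, axis=0)       # replicate across the batch dim
    state = np.zeros((2, batch_size, 128), dtype="float32")
    sr = np.array(16000, dtype="int64")
    t0 = time.perf_counter()
    for _ in range(100):
        out, state = sess.run(None, {"input": x, "state": state, "sr": sr})
    dt = (time.perf_counter() - t0) / 100
    print(f"batch={batch_size}: {dt * 1e3:.2f} ms/call, "
          f"{dt * 1e3 / batch_size:.3f} ms/chunk")
```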
-
tl;dr: Smarter batching without splitting the whole audio. Use batched inference for the steps up to and including the encoder, then a batch size of 1 for the lstm and decoder, because those last two parts are the only ones that depend on state.
All that without affecting precision at all, as would be the case with regular batching, where one would have to split the entire audio into multiple independent slices, each processed using its own independent state. But if batching is only used on the part of the graph up to the encoder, you can still process the entire audio with a single state.
Longer version:
When I was looking inside the jit and onnx model internals, I noticed that the majority of the computation the model performs during inference does not depend on state (the hn and cn inputs to the lstm).
The model consists of roughly 6 steps. The first 4 of those depend only on the audio data; the state variables hn and cn are only actually used and updated at the lstm step. This means the first 4 steps are easy to isolate and parallelize, and only the last 2 steps need to be computed sequentially.
Note: technically the decoder can also be batched, resulting in more speedups, but I didn't test that.
While this is good news for GPU and multithreaded inference, I did not expect a 5x performance increase on a single thread.
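To make the split concrete, a minimal sketch of the idea (not the restored Silero code itself; `stateless_prefix`, `lstm`, `decoder`, the feature shapes, and the hidden size are placeholders, and the prefix is assumed to produce one feature vector per window):

```python
# Sketch: batch the state-free steps, then run lstm + decoder sequentially
# with a single state so precision is unaffected.
import torch

torch.set_grad_enabled(False)
torch.set_num_threads(1)


def run_vad(windows, stateless_prefix, lstm, decoder, hidden_size, batch_size=32):
    """windows: float32 tensor of shape (num_windows, window_samples)."""
    # 1) Stateless steps: many windows per call, order doesn't matter.
    feats = torch.cat(
        [stateless_prefix(windows[i:i + batch_size])
         for i in range(0, windows.shape[0], batch_size)],
        dim=0,
    )  # (num_windows, feat_dim)

    # 2) Stateful steps: lstm + decoder, strictly sequential, single state.
    h = torch.zeros(1, 1, hidden_size)
    c = torch.zeros(1, 1, hidden_size)
    probs = []
    for f in feats:
        out, (h, c) = lstm(f.view(1, 1, -1), (h, c))  # seq_len=1, batch=1
        probs.append(decoder(out.view(1, -1)))
    return torch.cat(probs, dim=0)
```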
Version: Silero VAD v3.1, 16kHz. Restored python code from .jit file (for implementing the model architecture in C/ggml)
Sample count: 1536
Pytorch version: 2.1-cpu

```python
torch.set_grad_enabled(False)
torch.set_num_threads(1)
torch.set_num_interop_threads(1)
```

Batch size for first 4 steps: 32
Batch size for lstm and decoder: 1
Result: 600s of audio data is processed in 2-3 seconds.
And this is just using my restored pytorch silero code without any TorchScript, no jit.script or jit.trace at all.
Running the official v3.1 silero_vad.jit with batch_size=1 on the same 600s of audio data takes 10.4 seconds.
A batch size above 32 doesn't result in much improvement on my cpu.
I didn't look at the V4 internals closely, but if the lstm is still near the bottom of the graph, the preceding steps can probably benefit from batching too.
I also didn't test what effect this kind of batching has on onnx model inference, and I didn't conduct any GPU/CUDA tests.