VAD probability very low using ONNX on Android #284

lhr0909 · 2023-01-09T03:32:34Z

lhr0909
Jan 9, 2023

Hi, I am building an Android app that uses the ONNX VAD model to detect speech in the AudioRecord stream. I got the ONNX model integrated, but the probabilities it returns are far too low: I have to set the threshold to 0.15 to get reasonably accurate detection.

I realized that the recording amplitude was a bit too low, so I multiply the samples by a gain, but that doesn't seem to solve the issue.

Here is the integration point; please let me know if there is anything I have done wrong. I can also provide a sample recording.

voice-0e0d0a5c-4d51-4209-a51d-c0e32b39ede3.pcm.zip

The zip file contains a raw PCM mono recording: 16000 Hz sample rate, big-endian float format. The samples in this file have already been multiplied by a gain of 10^(6/20), i.e. +6 dB.
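For anyone who wants to inspect a recording in this format, it can be decoded with a few lines of Python. This is only a minimal sketch: it assumes "float" means 32-bit IEEE samples (the post does not state the width), and the helper names `gain_from_db` and `decode_pcm_f32be` are illustrative, not part of Silero VAD.

```python
import math
import struct


def gain_from_db(db: float) -> float:
    """Convert a gain in decibels to a linear amplitude factor: 10^(dB/20)."""
    return math.pow(10.0, db / 20.0)


def decode_pcm_f32be(raw: bytes) -> list:
    """Decode raw mono PCM stored as big-endian 32-bit floats (assumed width)."""
    count = len(raw) // 4  # 4 bytes per sample; a trailing partial sample is ignored
    return list(struct.unpack(f">{count}f", raw[:count * 4]))


# The gain mentioned in the post: 6 dB is roughly a factor of two in amplitude.
GAIN_6DB = gain_from_db(6.0)

# Round trip on two synthetic samples (both exactly representable in float32).
samples = decode_pcm_f32be(struct.pack(">2f", 0.25, -0.5))
```

To read an actual file, one would pass `open(path, "rb").read()` to `decode_pcm_f32be`; at 16000 Hz, `len(samples) / 16000` then gives the duration in seconds.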

Answered by snakers4


snakers4 · 2023-01-09T03:43:29Z

snakers4
Jan 9, 2023
Maintainer

Hi,

I cannot really comment on the language used for Android, but it looks like your code is not stateful, i.e. the VAD should keep its state for the duration of the session. With ONNX, the best illustration is here:

silero-vad/utils_vad.py

Lines 63 to 68 in e7c4539

    
    if sr in [8000, 16000]:
        ort_inputs = {'input': x.numpy(), 'h': self._h, 'c': self._c, 'sr': np.array(sr, dtype='int64')}
        ort_outs = self.session.run(None, ort_inputs)
        out, self._h, self._c = ort_outs
    else:
        raise ValueError()

It looks like you always re-create these tensors:

https://github.com/lhr0909/live-subtitles-rokid-ar/blob/f19ecc197d3bee6484fa7145f73607bb90d77869/app/src/main/java/chat/senses/livesubs/TranscribeViewModel.kt#L199-L200

In practical terms, this means that on each chunk the VAD "thinks" it is hearing a new audio. Also note that there is quite a bit of logic in get_speech_timestamps which probably must be ported.

As a debugging tool, I also recommend running your audio through the provided utils with visualize_probs = True (e.g. in Colab) to see what the internal states look like for your audio, so that you can set the thresholds properly.
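To make the point about state concrete, here is a minimal sketch of the pattern, not the actual wrapper from utils_vad.py. `StatefulVad` and `FakeSession` are hypothetical names: the fake session is a stand-in that only counts chunks, so the example runs without the model. With a real `onnxruntime.InferenceSession`, the inputs would have to be NumPy arrays as in the snippet quoted above.

```python
class StatefulVad:
    """Carries the recurrent state (h, c) from one chunk to the next."""

    def __init__(self, session, initial_h, initial_c, sr=16000):
        if sr not in (8000, 16000):
            raise ValueError("sr must be 8000 or 16000")
        self.session = session
        self._initial = (initial_h, initial_c)
        self.sr = sr
        self.reset_states()

    def reset_states(self):
        # Call once per new recording, never once per chunk.
        self._h, self._c = self._initial

    def __call__(self, chunk):
        ort_inputs = {"input": chunk, "h": self._h, "c": self._c, "sr": self.sr}
        # The returned h and c replace the stored ones, so the next chunk
        # continues where this one stopped.
        out, self._h, self._c = self.session.run(None, ort_inputs)
        return out


class FakeSession:
    """Stand-in for an ONNX session (NOT the real model): it just counts
    how many chunks it has seen via the state passed in."""

    def run(self, output_names, inputs):
        seen = inputs["h"] + 1
        return seen, seen, inputs["c"]


vad = StatefulVad(FakeSession(), initial_h=0, initial_c=0)

# Correct usage: the state survives between chunks.
stateful = [vad(chunk) for chunk in ("a", "b", "c")]

# The bug described above: the state is thrown away before every chunk.
stateless = []
for chunk in ("a", "b", "c"):
    vad.reset_states()
    stateless.append(vad(chunk))
```

In the stateless loop the stand-in model never gets past its first step, which is the toy equivalent of the VAD treating every chunk as the start of a new audio.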

4 replies

lhr0909 Jan 9, 2023
Author

Thank you for the quick response! @snakers4 after storing the state vectors it is working now!

snakers4 Jan 9, 2023
Maintainer

Also make sure to tune the hyper-params for thresholds properly.

lhr0909 Jan 9, 2023
Author

@snakers4 I was using the torch hub model, and I don't recall having to tune the hyper-params. Is tuning also needed for the PyTorch model?

snakers4 Jan 9, 2023
Maintainer

The torch.hub model, the PyTorch model and the ONNX model are essentially the same model.
torch.hub is just used as a very light packaging tool, and the underlying models for .jit and .onnx are the same.
If you need to get the hyper-params once for your app, it is worth doing:

silero-vad/utils_vad.py

Lines 161 to 172 in e7c4539

    
    def get_speech_timestamps(audio: torch.Tensor,
                              model,
                              threshold: float = 0.5,
                              sampling_rate: int = 16000,
                              min_speech_duration_ms: int = 250,
                              max_speech_duration_s: float = float('inf'),
                              min_silence_duration_ms: int = 100,
                              window_size_samples: int = 512,
                              speech_pad_ms: int = 30,
                              return_seconds: bool = False,
                              visualize_probs: bool = False,
                              progress_tracking_callback: Callable[[float], None] = None):

Typically it can be done quite quickly simply by looking at the probability chart.
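As a rough illustration of what the threshold and duration parameters control, here is a deliberately simplified sketch of turning per-window probabilities into speech segments. It is not the code from utils_vad.py: `probs_to_segments` is a hypothetical name, and the real get_speech_timestamps also handles speech_pad_ms, max_speech_duration_s and other details that are left out here.

```python
def probs_to_segments(probs, threshold=0.5, window_size_samples=512,
                      sampling_rate=16000, min_speech_duration_ms=250,
                      min_silence_duration_ms=100):
    """Turn per-window speech probabilities into (start, end) sample ranges."""
    min_speech = sampling_rate * min_speech_duration_ms // 1000
    min_silence = sampling_rate * min_silence_duration_ms // 1000
    segments = []
    start = None       # first sample of the segment being built
    silence_at = None  # sample where the current run of silence began
    for i, p in enumerate(probs):
        pos = i * window_size_samples
        if p >= threshold:
            silence_at = None  # speech resumed, forget the short pause
            if start is None:
                start = pos
        elif start is not None:
            if silence_at is None:
                silence_at = pos
            # Close the segment only once the pause is long enough.
            if pos + window_size_samples - silence_at >= min_silence:
                if silence_at - start >= min_speech:
                    segments.append((start, silence_at))
                start = silence_at = None
    # Audio ended while still inside a segment.
    total = len(probs) * window_size_samples
    if start is not None and total - start >= min_speech:
        segments.append((start, total))
    return segments


# Ten speech-like windows, a pause, a two-window blip, then silence.
probs = [0.9] * 10 + [0.1] * 5 + [0.9] * 2 + [0.1] * 5
segments = probs_to_segments(probs)
```

With the defaults, the short blip is dropped because it is shorter than min_speech_duration_ms. Raising or lowering `threshold` against a plot of `probs` is the kind of quick tuning described above.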

lhr0909 · 2023-01-15T08:19:40Z

lhr0909
Jan 15, 2023
Author

For anyone who is trying to integrate the ONNX model into Android, this is a working version I put together in a ViewModel that takes a live stream of PCM audio input and returns the speech buffers. It is a bit rough since I don't have much Kotlin experience, but it should show how the ONNX model is used in Kotlin/Java/Android.

I will spend some time setting up an Android library just for Silero VAD integration, and will incorporate the get_speech_timestamps method from Python.

https://github.com/lhr0909/live-subtitles-rokid-ar/blob/c6c5dd34b53150b79c75b214e6c8facbc552d7ec/app/src/main/java/chat/senses/livesubs/TranscribeViewModel.kt

0 replies
