Garbage output on multi channel audio and audio above 24khz #9

Open
Quackdoc opened this issue Aug 18, 2023 · 4 comments

Quackdoc commented Aug 18, 2023

It seems the audio decoder is picky about what gets fed into it.

Audio mediainfo

General
Complete name                            : C:\Users\Quack\code\whisper-burn\slap.wav
Format                                   : Wave
File size                                : 788 KiB
Duration                                 : 4 s 203 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 1 536 kb/s
Writing application                      : Lavf58.29.100

Audio
Format                                   : PCM
Format settings                          : Little / Signed
Codec ID                                 : 1
Duration                                 : 4 s 203 ms
Bit rate mode                            : Constant
Bit rate                                 : 1 536 kb/s
Channel(s)                               : 2 channels
Sampling rate                            : 48.0 kHz
Bit depth                                : 16 bits
Stream size                              : 788 KiB (100%)

Audio file: https://cdn.discordapp.com/attachments/615105639567589376/1141946730485665893/slap.wav

target\release\whisper.exe .\slap.wav small_en        08/18/2023 12:07:51 AM
Loading waveform...
Loading model...
Chunk 0:  (screaming)

Chunk 1:  (screeching)

Transcribed text:  (screeching)

whisper-ctranslate2:

whisper-ctranslate2.exe slap.wav --model tiny.en      08/18/2023 12:10:23 AM
Detected language 'English' with probability 1.000000
[00:00.000 --> 00:04.000]  Also, it's not always useful.
Transcription results written to 'C:\Users\Quack\code\whisper-burn' directory

EDIT: Transcoding the audio file with ffmpeg -i .\slap.wav -ar SAMPLE_RATE -ac 1 slap-edit.wav seems to make it work. The input needs to be both single channel and 41khz or lower.

At 41khz the output was

Chunk 0:  Oh, son, it's not all you are.

Transcribed text:  Oh, son, it's not all you are.

At 24khz and below it is

Chunk 0:  also it's not always useful.

Transcribed text:  also it's not always useful
Quackdoc changed the title from "Garbage text on custom file" to "Garbage output on multi channel audio and audio above 24khz" on Aug 18, 2023

jbrough commented Aug 18, 2023

The Whisper model itself expects 16 kHz mono.

Quackdoc commented

Ah, that makes sense. I would assume burn doesn't do sample-rate downsampling or channel downmixing.
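
As a rough illustration of the channel downmixing mentioned here, the sketch below averages the channels of interleaved 16-bit PCM into a single mono stream. The function name `downmix_to_mono` is hypothetical and is not part of whisper-burn or burn. It also assumes the samples have already been read from the WAV file as interleaved `i16` values.

```rust
/// Downmix interleaved signed 16-bit PCM to mono f32 samples in [-1.0, 1.0).
/// Hypothetical helper for illustration; `channels` is the channel count
/// reported by the WAV header (2 for the file in this issue).
fn downmix_to_mono(interleaved: &[i16], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        // One frame holds one sample per channel; a trailing partial frame is dropped.
        .chunks_exact(channels)
        .map(|frame| {
            // Scale each sample to [-1.0, 1.0) and average across channels.
            let sum: f32 = frame.iter().map(|&s| s as f32 / 32768.0).sum();
            sum / channels as f32
        })
        .collect()
}
```

Averaging keeps the result within the original amplitude range, at the cost of cancelling content that is out of phase between channels.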


Quackdoc commented Aug 29, 2023

This is partially addressed by 4080a33, but if I get the time I plan on looking into resampling and channel downmixing. I do have some work done; however, I was using dasp, which has proven itself to be rather unusable, so I'm looking into different crates.

I looked into fon and it seems like it may work, but I don't like that it hasn't been active since Feb '22.

Currently looking into other crates.
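
For context on what the resampling step has to do, here is a minimal dependency-free sketch that changes the sample rate of mono audio by linear interpolation. It is not the approach taken in 4080a33 or by any of the crates mentioned, and the name `resample_linear` is hypothetical. It applies no low-pass filter, so going from 48 kHz to 16 kHz this way aliases content above 8 kHz; a dedicated resampling crate would normally be preferable.

```rust
/// Resample mono audio by linear interpolation between neighbouring samples.
/// Hypothetical helper for illustration only: no anti-aliasing filter is applied.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    assert!(from_hz > 0 && to_hz > 0, "sample rates must be non-zero");
    if input.is_empty() {
        return Vec::new();
    }
    // Preserve duration: out_len / to_hz is about input.len() / from_hz.
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    // Number of input samples advanced per output sample.
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            // Clamp both neighbours so the final output sample never reads out of bounds.
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}
```

Calling `resample_linear(&mono, 48_000, 16_000)` on a mono buffer would yield one third as many samples, matching the 16 kHz rate the model expects.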


jbrough commented Sep 6, 2023

@Quackdoc have a look at https://github.com/HEnquist/rubato

It does what you need. I've had no success with the synchronous FFT methods yet, but SincFixedIn, which is in their main example, works well.

Here's how I'm using it - I have a pop at the end but the main downsampling is very good:

https://github.com/wavey-ai/soundkit/blob/75bf99c0e220bcfa380c6ae72e626257fb4790e0/src/audio_pipeline.rs#L67

(I had a feeling the synchronous FFT resampling method might be better for wasm, but I haven't tested it there and may have misunderstood what it's designed for, as the output is terribly distorted. Still investigating. Hopefully SincInterpolationType::Linear is good enough for real-time use cases.)
