2x slowdown compared to whisper-turbo #257

Open
obenjiro opened this issue Sep 26, 2024 · 4 comments

@obenjiro (Contributor) commented Sep 26, 2024

I created a prototype using Whisper-Turbo, which performed well and processed files quickly. I was using an 8-bit quantized medium model (specifically this one: https://rmbl.us/whisper-turbo/medium-q8g16.bin). However, since Whisper-Turbo is no longer supported, I had to switch to Ratchet.

In Ratchet, I used the FL33TW00D-HF/whisper-medium model with an 8-bit quantized medium bin (https://rmbl.us/FL33TW00D-HF/whisper-medium:medium_q8.bin). Unfortunately, this model was about 2x-3x slower than Whisper-Turbo. It's possible that the slowdown is due to changes in the runtime environment rather than the model itself.

Here's a test with a 45-second audio file (same model size, whisper-medium 8-bit):
Whisper-Turbo: 20 sec
Ratchet: 62 sec

I've been experimenting with DistilWhisperLargeV3 and I'm seeing some impressive results: it can process a 45-second audio file in just 13 seconds. However, it seems to be limited to English-language inputs only, so it doesn't work for non-English languages :/

Could you help me out by checking whether there's a multilingual version of the DistilWhisperLargeV3 model available on Hugging Face (maybe we could use https://huggingface.co/distil-whisper/distil-large-v3)? Or maybe we could look into Whisper Medium and figure out the cause of the slower processing time?

@FL33TW00D (Collaborator) commented

Hi @obenjiro,
Thanks for your continued support for these projects.

whisper-turbo had a lot of whisper-specific optimizations, so it's unsurprising that Ratchet is 3x slower.
That said, none of those ideas are lost, and I'd love to ship them in Ratchet.

There is no multilingual version of Distil-Whisper, unfortunately, so regular Whisper would be the best way to go.

@FL33TW00D (Collaborator) commented

There is now multilingual "whisper turbo"! Stay tuned!

@obenjiro (Contributor, Author) commented Oct 2, 2024

Thanks for your hard work on the project! :)

> whisper-turbo had a lot of whisper-specific optimizations,

Can I help with bringing some of them to Ratchet? Maybe there's some low-hanging fruit I could tackle (I have some experience with Rust, Wasm, and JS/TS).

> There is now multilingual "whisper turbo"! Stay tuned!

Yeah, saw that too 🚀 (P.S. the quality of the Whisper-Turbo model is more than good)

@FL33TW00D (Collaborator) commented

> Can I help with bringing some of them to Ratchet? Maybe there's some low-hanging fruit I could tackle

The most interesting one (and most likely the one with the biggest impact, if my memory serves correctly) was the caching of compiled executables for static models.

Whisper consists of two models, an encoder and a decoder. At this time, Ratchet JIT compiles each of the models like the following:
`resolve() -> allocate_storage() -> compile_gpu() -> executable.dispatch()`

For the encoder model, where everything is completely static (only the input data changes; structure/dims remain the same), this is wasteful. We could cache the executable and just swap in new input data. This would be much faster, and it's what whisper-turbo did.
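A minimal sketch of that caching idea (all names here, `GraphKey`, `Executable`, `ExecutableCache`, are hypothetical stand-ins, not Ratchet's actual API):

```rust
use std::collections::HashMap;

/// Hypothetical key for a compiled graph: if the model structure and
/// input dims match, the same executable can be reused.
#[derive(Hash, PartialEq, Eq, Clone)]
struct GraphKey {
    model_id: String,
    input_dims: Vec<usize>,
}

/// Stand-in for the GPU executable produced by `compile_gpu()`.
struct Executable;

impl Executable {
    /// Run the precompiled pipeline on fresh input data.
    fn dispatch(&self, input: &[f32]) -> Vec<f32> {
        // In a real runtime: write `input` into the already-allocated
        // GPU buffers and submit the cached command buffer.
        input.to_vec()
    }
}

/// Cache of compiled executables for static models like the encoder.
#[derive(Default)]
struct ExecutableCache {
    compiled: HashMap<GraphKey, Executable>,
}

impl ExecutableCache {
    /// Cache hit: skip resolve/allocate/compile and only swap input data.
    /// Cache miss: pay the JIT cost once, then store the executable.
    fn run(&mut self, key: GraphKey, input: &[f32]) -> Vec<f32> {
        let exe = self.compiled.entry(key).or_insert_with(|| {
            // resolve() -> allocate_storage() -> compile_gpu()
            Executable
        });
        exe.dispatch(input)
    }
}

fn main() {
    let mut cache = ExecutableCache::default();
    let key = GraphKey {
        model_id: "encoder".into(),
        input_dims: vec![1, 80, 3000], // e.g. a log-mel spectrogram shape
    };
    // First call pays the JIT compile cost; later calls only dispatch.
    let _first = cache.run(key.clone(), &[0.0; 4]);
    let _second = cache.run(key, &[1.0; 4]);
}
```

The decoder is harder to cache this way, since its shapes change as the generated sequence grows, which is presumably part of the challenge mentioned below.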

How we model this and do it in a more general way is a bit of a challenge!
