
Questions: using v3 turbo + quantization + faster-whisper #9

Open

thiswillbeyourgithub opened this issue Oct 2, 2024 · 2 comments

thiswillbeyourgithub commented Oct 2, 2024

Hi @abb128 ,

I'm running a faster-whisper-server backend on a computer and am blown away by the rapid advancement in the field.

Notably, even on my cheap low-end hardware I'm able to transcribe text far faster than I can speak it, using the large v3 turbo model with int8 quantization. Switching to faster-whisper, to the v3 turbo model, and to quantization each resulted in a subjective leap in speed.
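
For concreteness, here is a minimal sketch of the kind of setup I mean, assuming the faster-whisper Python package; the model identifier and audio file name are just placeholders and may depend on your faster-whisper version:

```python
from faster_whisper import WhisperModel

# Load the turbo model with int8 quantization; "auto" picks the GPU if one is available.
model = WhisperModel("large-v3-turbo", device="auto", compute_type="int8")

# transcribe() returns a lazy generator of segments plus metadata about the audio.
segments, info = model.transcribe("recording.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```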

Naturally, some questions come to mind regarding FUTO's voice input on Android:

  1. Does FUTO plan on redoing the training with the latest v3 turbo models? They are more compact, so we can imagine having near-real-time transcription with non-large models on Android, right? If this isn't planned, why not, and what is missing? Is there anything the community can do to help?
  2. Looking at the notebook, it actually seems that this work was based on the original Whisper implementation by OpenAI. Is there a reason not to use faster-whisper from SYSTRAN? It's a much faster implementation. I made a FUTO shout-out there, by the way.
  3. In the same vein, I don't see any quantization applied in the notebook; would that be a big speed enhancement too? By making the model smaller and faster, we could also considerably reduce the model loading time.

Thanks a lot for everything you've been doing.

abb128 (Collaborator) commented Jan 6, 2025

Hi, thank you for the comments. I haven't tried evaluating the v3 turbo models yet, but I plan to do so. As I understand it, the encoder is still large-sized, which may present difficulties running on mobile and memory-constrained devices, but I'd be interested to hear about your experience. In the app we use whisper.cpp and quantized models, so there's not much reason to switch to faster-whisper over it; in fact, the benchmarks in their README indicate it uses 2x more memory in the case of small models on CPU.

thiswillbeyourgithub (Author) commented:

> Hi, thank you for the comments. I haven't tried evaluating the v3 turbo models yet, but I plan to do so.

Great to hear!

> As I understand it, the encoder is still large-sized, which may present difficulties running on mobile and memory-constrained devices

I have not seen this mentioned anywhere. I thought it was a true drop-in replacement. I could totally be wrong, though.

> but I'd be interested to hear about your experience.

I have no experience using the v3 models on Android or with acft fine-tunes. I only have experience with your vanilla models and faster-whisper v3 on my computer.

I can only report on using it with my years-old consumer hardware (GTX 1080): the large v3 model with faster-whisper and int8 quantization has extremely low latency, maybe 5 to 10 times better than direct OpenAI API calls! I have never noticed any difference in quality between the OpenAI API and my own version; it is just much faster and only takes about 900 MB of VRAM, IIRC.
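
For what it's worth, here is a rough sketch of how I time a transcription locally; the audio path and model name are placeholders, and VRAM usage is simply read off nvidia-smi while it runs:

```python
import time
from faster_whisper import WhisperModel

# int8 quantization on an older CUDA GPU such as a GTX 1080.
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

start = time.perf_counter()
segments, _ = model.transcribe("recording.wav")
# The generator is lazy, so joining the text is what actually runs the decoder.
text = " ".join(segment.text.strip() for segment in segments)
print(f"Transcribed in {time.perf_counter() - start:.2f}s: {text[:80]}...")
```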

> In the app we use whisper.cpp and quantized models, so there's not much reason to switch to faster-whisper over it; in fact, the benchmarks in their README indicate it uses 2x more memory in the case of small models on CPU.

I didn't know about this. Still, a lot of people with today's phones and their large amounts of RAM might not be bothered by using twice as much memory and would prefer it over waiting. In any case, when I see what can be done with 900 MB on my GPU, I'm optimistic about what could be done on a phone. Well, you have proved it already, actually, but I mean we could maybe get a Pareto improvement.
