Questions: using v3 turbo + quantization + faster-whisper #9
Hi, thank you for the comments. I haven't tried evaluating the v3 turbo models yet but plan to do so. As I understand it, the encoder is still large-sized, which may present difficulties on mobile and memory-constrained devices, but I'd be interested to hear about your experience. In the app we use whisper.cpp and quantized models, so there's not much reason to switch to faster-whisper; in fact, the benchmarks in its README indicate it uses 2x more memory in the case of small models on CPU.
Great to hear!
I hadn't seen that mentioned anywhere; I thought it was a true drop-in replacement. I could totally be wrong, though.
I have no experience with using the v3 models on Android or with acft fine-tunes. I only have experience with your vanilla models and faster-whisper v3 on my computer, so I can only report on using it with my consumer hardware from years ago (a GTX 1080): the large-v3 model with faster-whisper and int8 quantization has extremely low latency, maybe 5 to 10 times better than direct OpenAI API calls! I have never noticed any difference in quality between the OpenAI API and my own version; it is just much faster and only takes about 900 MB of VRAM, IIRC.
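For reference, a minimal sketch of what such a setup might look like with the faster-whisper Python API; the model name, audio path, and compute_type below are illustrative assumptions, not the exact configuration described above:

```python
# Sketch: running large-v3 with faster-whisper and int8 quantization on a CUDA GPU.
from faster_whisper import WhisperModel

# "int8_float16" quantizes weights to int8 while keeping activations in fp16;
# plain "int8" can be used on CPU-only machines.
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

# "audio.wav" is a placeholder input file.
segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```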
I didn't know about that. Still, a lot of people with today's phones and their huge amounts of RAM might not be bothered by taking up twice as much memory and would prefer it over waiting. In any case, seeing what can be done with 900 MB on my GPU, I'm optimistic about what could be done on a phone. Well, you proved that already, actually, but I meant we can maybe have a Pareto improvement.
Hi @abb128 ,
I'm running a faster-whisper-server backend on a computer and am blown away by the rapid advancement in the field.
Notably, even on my cheap, low-end hardware I can transcribe text far faster than I can speak it, using the large-v3-turbo model with int8 quantization. Switching to faster-whisper, to the v3 turbo model, and to quantization each produced a subjective leap in speed.
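As a rough sketch of how such a backend can be queried, assuming faster-whisper-server exposes its OpenAI-compatible /v1/audio/transcriptions endpoint; the port, file name, and model id below are assumptions, not verified settings:

```python
# Sketch: sending audio to a local faster-whisper-server instance via the
# OpenAI-compatible transcription endpoint.
from openai import OpenAI

# Assumed local address and port; the server does not require a real API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("speech.wav", "rb") as audio_file:  # placeholder audio file
    transcript = client.audio.transcriptions.create(
        model="Systran/faster-whisper-large-v3-turbo",  # assumed model id
        file=audio_file,
    )

print(transcript.text)
```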
Naturally, some questions come to mind regarding FUTO's voice input on Android:
Thanks a lot for everything you've been doing.