Quantized whisper example #574
Hi, are there any plans for a quantized whisper model example?

Comments
We certainly hope that candle will be a good place for experimenting with quantization, as it should be much easier to add new models compared to llama.cpp.
I'll maybe look into implementing a quantized whisper example, but I first have to finish implementing the k-quants and somehow have to create a k-quantized whisper model.
Ah, enjoy Armored Core 6 😄
Awesome, thanks. Right now I'm trying to get the whisper example (modified a little) running on Android.
Any update? Thanks for creating https://github.com/huggingface/candle/tree/main/candle-wasm-examples and https://github.com/huggingface/candle/tree/main/candle-examples/examples/quantized. I think it would be great to have a quantized wasm example serving a GGUF model; does anyone else have the same thought?
Mentioning @radames explicitly in case he has some interest in making yet another wasm-based example, for the quantized models this time.
@limcheekin I did get it working on Android.
Thanks @LaurentMazare, I'll look into creating a wasm version for the quantized llama example.
A tricky bit with this is that even with 4-bit quantization, a 7B model would be ~4GB, which may be slow to load and may also break the memory limit of wasm-32 (the default wasm target). Maybe there is a good 1B or 2B model that we could use in this demo; would have to check TheBloke's model list on the Hub.
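As a back-of-the-envelope check on that ~4GB figure (a sketch, assuming a llama.cpp-style Q4_0 layout where each 32-weight block stores 4-bit values plus an fp16 scale, i.e. roughly 4.5 bits per weight; other quantization formats have different overhead):

```rust
// Rough quantized-model size estimate; ignores non-quantized tensors
// and file metadata, so real gguf files are a bit larger.
fn approx_size_gb(n_params: f64, bits_per_weight: f64) -> f64 {
    n_params * bits_per_weight / 8.0 / 1e9
}

fn main() {
    // Q4_0-style blocks: 32 x 4-bit weights + one fp16 scale
    // = 18 bytes per 32 weights, i.e. ~4.5 bits per weight.
    println!("7B   @ 4.5 bits: ~{:.1} GB", approx_size_gb(7.0e9, 4.5)); // ~3.9 GB
    println!("1.1B @ 4.5 bits: ~{:.1} GB", approx_size_gb(1.1e9, 4.5)); // ~0.6 GB
}
```

A 1.1B model lands well under the wasm-32 limit, which is why a smaller model is attractive for the demo.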
https://github.com/jzhang38/TinyLlama is a good candidate @LaurentMazare
Agreed. Specifically the chat model at https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1.
Is it quantized, or just a smaller model? We might need to ask TheBloke to quantize it.
This comes very appropriately; I'm also building something for Android (native, no wasm), and 7B is indeed a bit too much for my poor phone: https://github.com/Narsil/hf-chat. This unquantized 1.1B might still be a bit much; a GGUF/GPTQ version of it might help.
A quantized version of TinyLlama is working at https://huggingface.co/spaces/kirp/tinyllama-chat
Hi @limcheekin, just finished PR #966 adding a phi 1.5 quantized example as a wasm module; you can try the demo here.
Great! Thanks for sharing.
@radames Did you get my question above?
@limcheekin The current BERT implementation should handle nearly any BERT-like model, meaning smaller models should work as well.
Sorry @limcheekin, didn't get notified. Thanks @LLukas22, I think that's right, any BERT-like model might work. Just keep in mind you want to load smaller models on the wasm version, since there's a 4GB memory limit on WebAssembly, and the fetch/cache API seems to crash with large files (>2GB); ideally we will need to shard large models.
@soupslurpr coming back to your original question, I've just merged a quantized whisper example (model code). You can use it from the whisper example with the `--quantized` flag.
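A minimal sketch of the invocation, assuming the flag is indeed named `--quantized` (check the example's `--help` output for the exact name):

```bash
# Run the whisper example against the quantized weights
# (flag name assumed from the comment above; verify with --help).
cargo run --example whisper --release -- --quantized
```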
Amazing @LaurentMazare! I'll update the wasm example!
Thank you a ton @LaurentMazare, it will help a lot with running it fast on Android devices.
Thanks for the information. Do you have any plans for getting around that 4GB memory limit, given the implementation status of the memory64 proposal across WebAssembly runtimes? https://github.com/WebAssembly/memory64/blob/main/proposals/memory64/Overview.md
@LaurentMazare How did you get the gguf files? On the whisper.cpp repo I only see ggml conversion scripts.
You can use the following command to convert a safetensors file to a gguf file, so there is actually no dependency on whisper.cpp.
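Assuming the command referred to is candle's `tensor-tools` example with its `quantize` subcommand, the conversion would look something like the sketch below; the quantization type and file names are illustrative:

```bash
# Quantize a safetensors checkpoint straight into a gguf file with
# candle's tensor-tools (quantized type and paths are placeholders).
cargo run --example tensor-tools --release -- \
  quantize --quantized-type q4k model.safetensors --out-file model-q4k.gguf
```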