
Quantized whisper example #574

Closed
soupslurpr opened this issue Aug 24, 2023 · 26 comments

Comments

@soupslurpr

Hi, are there any plans for a quantized whisper model example?

@LaurentMazare
Collaborator

We certainly hope that candle will be a good place for experimenting with quantization, as it should be much easier to add new models compared to llama.cpp.
That said, there are no immediate plans for a quantized whisper. @LLukas22 has been pushing a lot on adding new quantizations and initially mentioned some interest in quantized stable-diffusion / whisper in #359, so hopefully we'll be in a good place to try these soonish.

@LLukas22
Contributor

I'll maybe look into implementing a quantized whisper example, but I first have to finish implementing the k-quants and somehow create a k-quantized gguf whisper file. And since Armored Core 6 releases today, most of my free time will go there. But theoretically you only have to swap out the current matmuls with qmatmuls, and it should just work if you load in quantized tensors.
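
A minimal sketch of that swap, assuming candle's quantized API (`QMatMul`, `gguf_file`); the `QuantizedLinear` struct and `load_linear` helper are made-up names for illustration, and exact signatures may differ between candle versions:

```rust
use candle_core::quantized::{gguf_file, QMatMul};
use candle_core::{Device, Result, Tensor};

// Hypothetical drop-in replacement for the f32 Linear used in the whisper model:
// the weight matmul goes through QMatMul instead of Tensor::matmul.
struct QuantizedLinear {
    weight: QMatMul,      // quantized weights, used directly in the matmul
    bias: Option<Tensor>, // bias can stay in f32
}

impl QuantizedLinear {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        let xs = self.weight.forward(xs)?;
        match &self.bias {
            Some(b) => xs.broadcast_add(b),
            None => Ok(xs),
        }
    }
}

// Load one quantized weight tensor by name from a gguf file's content.
fn load_linear(
    content: &gguf_file::Content,
    reader: &mut std::fs::File,
    name: &str,
    device: &Device,
) -> Result<QuantizedLinear> {
    let w = content.tensor(reader, &format!("{name}.weight"), device)?;
    Ok(QuantizedLinear {
        weight: QMatMul::from_qtensor(w)?,
        bias: None,
    })
}
```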

@LaurentMazare
Collaborator

Ah, enjoy Armored Core 6 😄
I'll add some helper functions to write gguf files, and if I find some time I might give a try at some very basic quantization.

@soupslurpr
Author

Awesome, thanks! Right now I'm trying to get the whisper example (modified a little) running on Android.

@limcheekin

> Awesome, thanks! Right now I'm trying to get the whisper example (modified a little) running on Android.

Any update?

Thanks for creating https://github.com/huggingface/candle/tree/main/candle-wasm-examples and https://github.com/huggingface/candle/tree/main/candle-examples/examples/quantized.

I'm thinking it would be great if we had a quantized wasm example serving a gguf model. Anyone here have the same thought?

@LaurentMazare
Collaborator

LaurentMazare commented Sep 18, 2023

Mentioning @radames explicitly in case he has some interest in making yet another wasm-based example, this time for the quantized models.

@soupslurpr
Author

@limcheekin I did get it working on Android.

@radames
Contributor

radames commented Sep 18, 2023

Thanks @LaurentMazare, I'll look into creating a wasm version of the quantized llama example.

@LaurentMazare
Collaborator

A tricky bit with this is that even with 4-bit quantization, a 7B model would be ~4GB, which may be slow to load and may also break the memory limit of wasm-32 (the default wasm target). Maybe there is a good 1B or 2B model that we could use in this demo; we'd have to check TheBloke's model list on the Hub.
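
(Rough arithmetic to back that up, assuming a q4_0-style format where 32 weights take 18 bytes, i.e. about 4.5 bits per weight including block scales: 7e9 weights × 18/32 bytes ≈ 3.9 GB for the weights alone, before activations or the KV cache.)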

@abodacs

abodacs commented Sep 18, 2023

https://github.com/jzhang38/TinyLlama is a good candidate @LaurentMazare

@limcheekin

> https://github.com/jzhang38/TinyLlama is a good candidate @LaurentMazare

Agreed. Specifically the chat model at https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1.

@radames
Contributor

radames commented Sep 19, 2023

Is it quantized? Or just a smaller model? We might need to ask TheBloke to quantize it.

@Narsil
Collaborator

Narsil commented Sep 19, 2023

This comes very appropriately, as I'm also building something for Android (native, no wasm), and 7B is indeed a bit too much for my poor phone:

https://github.com/Narsil/hf-chat

This unquantized 1.1B might still be a bit much; a GGUF/GPTQ version of it might help.

@limcheekin

> A tricky bit with this is that even with 4-bit quantization, a 7B model would be ~4GB, which may be slow to load and may also break the memory limit of wasm-32 (the default wasm target). Maybe there is a good 1B or 2B model that we could use in this demo; we'd have to check TheBloke's model list on the Hub.

A quantized version of TinyLlama is working at https://huggingface.co/spaces/kirp/tinyllama-chat

@radames
Contributor

radames commented Sep 26, 2023

Hi @limcheekin, I just finished PR #966 adding a phi 1.5 quantized example as a wasm module. You can try the demo here:
https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm

@limcheekin

> Hi @limcheekin, I just finished PR #966 adding a phi 1.5 quantized example as a wasm module. You can try the demo here: https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm

Great! Thanks for sharing.
I found another example at https://huggingface.co/spaces/radames/Candle-BERT-Semantic-Similarity-Wasm; may I know if it supports bge and gte models, besides e5 models?

@limcheekin

> > Hi @limcheekin, I just finished PR #966 adding a phi 1.5 quantized example as a wasm module. You can try the demo here: https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm
>
> Great! Thanks for sharing. I found another example at https://huggingface.co/spaces/radames/Candle-BERT-Semantic-Similarity-Wasm; may I know if it supports bge and gte models, besides e5 models?

@radames Did you get my question above?

@LLukas22
Contributor

LLukas22 commented Oct 2, 2023

@limcheekin The current BERT implementation should handle nearly any BERT-like model, meaning bge and gte should just work.

@radames
Contributor

radames commented Oct 2, 2023

Sorry @limcheekin, I didn't get notified. Thanks @LLukas22, I think that's right, any BERT-like model might work. Just keep in mind you want to load smaller models in the wasm version, since there's a 4GB memory limit on WebAssembly, and the fetch/cache API seems to crash with files larger than 2GB; ideally we will need to shard large models.

@LaurentMazare
Collaborator

@soupslurpr coming back to your original question, I've just merged a quantized whisper example, model code. You can use it from the whisper example with the --quantized flag. That said, it uses q4_0 quantization by default, which makes for very tiny weight files (23.3MB instead of 151MB) but certainly affects performance. It's likely possible to achieve better results by tweaking the quantization type, and the model/example code should be independent of this anyway.
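
A minimal invocation, assuming the usual candle whisper example setup (any other flags are the same as for the non-quantized run):

cargo run --example whisper --release -- --quantized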

@radames
Contributor

radames commented Oct 2, 2023

Amazing, @LaurentMazare! I'll update the wasm example!

@soupslurpr
Author

Thank you a ton @LaurentMazare, it will help a lot with running it fast on Android devices

@limcheekin

> Sorry @limcheekin, I didn't get notified. Thanks @LLukas22, I think that's right, any BERT-like model might work. Just keep in mind you want to load smaller models in the wasm version, since there's a 4GB memory limit on WebAssembly, and the fetch/cache API seems to crash with files larger than 2GB; ideally we will need to shard large models.

Thanks for the information.

Do you have any plans for getting around that 4GB memory limit?

Given the following implementation status of memory64 in the WebAssembly runtimes: https://github.com/WebAssembly/memory64/blob/main/proposals/memory64/Overview.md

(screenshot of the memory64 implementation status table from the linked proposal)

@soupslurpr
Author

@LaurentMazare How did you get the gguf files? In the whisper.cpp repo I only see ggml conversion scripts.

@LaurentMazare
Collaborator

You can use the following command to convert a safetensors file to a gguf file, so there is actually no dependency on whisper.cpp:

cargo run --example tensor-tools --release -- quantize --quantization q8_0 model.safetensors -out model-q80.gguf
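
For doing the same conversion programmatically (e.g. from an app rather than the CLI), here is a rough sketch built on candle's quantized API; the functions exist in the crate, but the exact signatures may differ between candle versions, so treat this as an outline rather than verified code:

```rust
use candle_core::quantized::{gguf_file, GgmlDType, QTensor};
use candle_core::{Device, Result};

// Quantize every tensor of a safetensors file to q8_0 and write a gguf file,
// roughly what the tensor-tools quantize subcommand does.
fn quantize_to_gguf(src: &str, dst: &str) -> Result<()> {
    let device = Device::Cpu;
    let tensors = candle_core::safetensors::load(src, &device)?;
    let mut quantized = Vec::new();
    for (name, tensor) in tensors.iter() {
        quantized.push((name.clone(), QTensor::quantize(tensor, GgmlDType::Q8_0)?));
    }
    let refs: Vec<(&str, &QTensor)> = quantized.iter().map(|(n, q)| (n.as_str(), q)).collect();
    let mut out = std::fs::File::create(dst)?;
    // No extra metadata is written here, only the quantized tensors.
    gguf_file::write(&mut out, &[], &refs)?;
    Ok(())
}
```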

@HajarMazaheri

Hi guys
How can I use quantization-aware training (QAT) for the Whisper model?
