
Quantized whisper example #574

Closed
soupslurpr opened this issue Aug 24, 2023 · 26 comments

Comments

@soupslurpr

Hi, are there any plans for a quantized whisper model example?

@LaurentMazare
Collaborator

We certainly hope that candle will be a good place for experimenting with quantization, as it should be much easier to add new models compared to llama.cpp.
That said, there are no immediate plans for a quantized whisper. @LLukas22 has been pushing a lot on adding new quantizations and initially mentioned some interest in quantized stable-diffusion / whisper in #359, so hopefully we'll be in a good place to try these soonish.

@LLukas22
Contributor

I'll maybe look into implementing a quantized whisper example, but I first have to finish implementing the k-quants and somehow create a k-quantized gguf whisper file. And since Armored Core 6 releases today, most of my free time will go there. But theoretically you only have to swap out the current matmuls with qmatmuls, and it should just work if you load in quantized tensors.
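
A minimal sketch of that swap, assuming candle's quantized API (`QMatMul`, `gguf_file`); the `QuantizedLinear` struct and `load_linear` helper are made-up names for illustration, and exact signatures may differ between candle versions:

```rust
use candle_core::quantized::{gguf_file, QMatMul};
use candle_core::{Device, Result, Tensor};

// Hypothetical drop-in replacement for the f32 Linear used in the whisper model:
// the weight matmul goes through QMatMul instead of Tensor::matmul.
struct QuantizedLinear {
    weight: QMatMul,      // quantized weights, used directly in the matmul
    bias: Option<Tensor>, // bias can stay in f32
}

impl QuantizedLinear {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        let xs = self.weight.forward(xs)?;
        match &self.bias {
            Some(b) => xs.broadcast_add(b),
            None => Ok(xs),
        }
    }
}

// Load one quantized weight tensor by name from a gguf file's content.
fn load_linear(
    content: &gguf_file::Content,
    reader: &mut std::fs::File,
    name: &str,
    device: &Device,
) -> Result<QuantizedLinear> {
    let w = content.tensor(reader, &format!("{name}.weight"), device)?;
    Ok(QuantizedLinear {
        weight: QMatMul::from_qtensor(w)?,
        bias: None,
    })
}
```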

@LaurentMazare
Collaborator

Ah, enjoy Armored Core 6 😄
I'll add some helper functions to write gguf files, and if I find some time I might give a try at some very basic quantization.

@soupslurpr
Author

Awesome, thanks! Right now I'm trying to get the whisper example (modified a little) running on Android.

@limcheekin

> Awesome, thanks! Right now I'm trying to get the whisper example (modified a little) running on Android.

Any update?

Thanks for creating https://github.com/huggingface/candle/tree/main/candle-wasm-examples and https://github.com/huggingface/candle/tree/main/candle-examples/examples/quantized.

I'm thinking it would be great if we had a quantized wasm example serving a gguf model. Anyone here have the same thought?

@LaurentMazare
Collaborator

LaurentMazare commented Sep 18, 2023

Mentioning @radames explicitly in case he has some interest in making yet another wasm-based example, this time for the quantized models.

@soupslurpr
Author

@limcheekin I did get it working on Android.

@radames
Contributor

radames commented Sep 18, 2023

Thanks @LaurentMazare, I'll look into creating a wasm version of the quantized llama example.

@LaurentMazare
Collaborator

A tricky bit with this is that even with 4-bit quantization, a 7B model would be ~4GB, which may be slow to load and may also break the memory limit of wasm-32 (the default wasm target). Maybe there is a good 1B or 2B model that we could use in this demo; we'd have to check TheBloke's model list on the Hub.
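
(Rough arithmetic to back that up, assuming a q4_0-style format where 32 weights take 18 bytes, i.e. about 4.5 bits per weight including block scales: 7e9 weights × 18/32 bytes ≈ 3.9 GB for the weights alone, before activations or the KV cache.)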

@abodacs

abodacs commented Sep 18, 2023

https://github.com/jzhang38/TinyLlama is a good candidate @LaurentMazare

@limcheekin

> https://github.com/jzhang38/TinyLlama is a good candidate @LaurentMazare

Agreed. Specifically the chat model at https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1.

@radames
Contributor

radames commented Sep 19, 2023

Is it quantized? Or just a smaller model? We might need to ask TheBloke to quantize it.

@Narsil
Collaborator

Narsil commented Sep 19, 2023

This comes very appropriately, as I'm also building something for Android (native, no wasm), and 7B is indeed a bit too much for my poor phone:

https://github.com/Narsil/hf-chat

This unquantized 1.1B might still be a bit much; a GGUF/GPTQ version of it might help.

@limcheekin

> A tricky bit with this is that even with 4-bit quantization, a 7B model would be ~4GB, which may be slow to load and may also break the memory limit of wasm-32 (the default wasm target). Maybe there is a good 1B or 2B model that we could use in this demo; we'd have to check TheBloke's model list on the Hub.

A quantized version of TinyLlama is working at https://huggingface.co/spaces/kirp/tinyllama-chat

@radames
Contributor

radames commented Sep 26, 2023

Hi @limcheekin, I just finished PR #966 adding a phi 1.5 quantized example as a wasm module. You can try the demo here:
https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm

@limcheekin

> Hi @limcheekin, I just finished PR #966 adding a phi 1.5 quantized example as a wasm module. You can try the demo here: https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm

Great! Thanks for sharing.
I found another example at https://huggingface.co/spaces/radames/Candle-BERT-Semantic-Similarity-Wasm; may I know if it supports bge and gte models, besides e5 models?

@limcheekin

> > Hi @limcheekin, I just finished PR #966 adding a phi 1.5 quantized example as a wasm module. You can try the demo here: https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm
>
> Great! Thanks for sharing. I found another example at https://huggingface.co/spaces/radames/Candle-BERT-Semantic-Similarity-Wasm; may I know if it supports bge and gte models, besides e5 models?

@radames Did you get my question above?

@LLukas22
Contributor

LLukas22 commented Oct 2, 2023

@limcheekin The current BERT implementation should handle nearly any BERT-like model, meaning bge and gte should just work.

@radames
Contributor

radames commented Oct 2, 2023

Sorry @limcheekin, I didn't get notified. Thanks @LLukas22, I think that's right, any BERT-like model might work. Just keep in mind you want to load smaller models in the wasm version, since there's a 4GB memory limit on WebAssembly, and the fetch/cache API seems to crash with files larger than 2GB; ideally we will need to shard large models.

@LaurentMazare
Collaborator

@soupslurpr coming back to your original question, I've just merged a quantized whisper example, model code. You can use it from the whisper example with the --quantized flag. That said, it uses q4_0 quantization by default, which makes for very tiny weight files (23.3MB instead of 151MB) but certainly affects performance. It's likely possible to achieve better results by tweaking the quantization type, and the model/example code should be independent of this anyway.
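
A minimal invocation, assuming the usual candle whisper example setup (any other flags are the same as for the non-quantized run):

cargo run --example whisper --release -- --quantized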

@radames
Contributor

radames commented Oct 2, 2023

Amazing, @LaurentMazare! I'll update the wasm example!

@soupslurpr
Author

Thank you a ton @LaurentMazare, it will help a lot with running it fast on Android devices

@limcheekin

> Sorry @limcheekin, I didn't get notified. Thanks @LLukas22, I think that's right, any BERT-like model might work. Just keep in mind you want to load smaller models in the wasm version, since there's a 4GB memory limit on WebAssembly, and the fetch/cache API seems to crash with files larger than 2GB; ideally we will need to shard large models.

Thanks for the information.

Do you have any plans for getting around that 4GB memory limit?

Given the following implementation status of memory64 in the WebAssembly runtimes: https://github.com/WebAssembly/memory64/blob/main/proposals/memory64/Overview.md

(screenshot of the memory64 implementation status table from the linked proposal)

@soupslurpr
Author

@LaurentMazare How did you get the gguf files? In the whisper.cpp repo I only see ggml conversion scripts.

@LaurentMazare
Collaborator

You can use the following command to convert a safetensors file to a gguf file, so there is actually no dependency on whisper.cpp:

cargo run --example tensor-tools --release -- quantize --quantization q8_0 model.safetensors -out model-q80.gguf
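
For doing the same conversion programmatically (e.g. from an app rather than the CLI), here is a rough sketch built on candle's quantized API; the functions exist in the crate, but the exact signatures may differ between candle versions, so treat this as an outline rather than verified code:

```rust
use candle_core::quantized::{gguf_file, GgmlDType, QTensor};
use candle_core::{Device, Result};

// Quantize every tensor of a safetensors file to q8_0 and write a gguf file,
// roughly what the tensor-tools quantize subcommand does.
fn quantize_to_gguf(src: &str, dst: &str) -> Result<()> {
    let device = Device::Cpu;
    let tensors = candle_core::safetensors::load(src, &device)?;
    let mut quantized = Vec::new();
    for (name, tensor) in tensors.iter() {
        quantized.push((name.clone(), QTensor::quantize(tensor, GgmlDType::Q8_0)?));
    }
    let refs: Vec<(&str, &QTensor)> = quantized.iter().map(|(n, q)| (n.as_str(), q)).collect();
    let mut out = std::fs::File::create(dst)?;
    // No extra metadata is written here, only the quantized tensors.
    gguf_file::write(&mut out, &[], &refs)?;
    Ok(())
}
```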

@HajarMazaheri

Hi guys
How can I use quantization-aware training (QAT) for the Whisper model?
