
4 bit quantization support #260

Open
bil-ash opened this issue Oct 6, 2024 · 1 comment

bil-ash commented Oct 6, 2024

I would like to use this library for in-browser ML inference because, with the upcoming CPU support, it will be a better fit than:

  1. ggml (llama.cpp/whisper.cpp) - Ratchet supports both CPU and GPU, and can use the GPU on devices where WebGPU is available, giving better performance
  2. web-llm (which is WebGPU-only) - Ratchet will have a CPU backend, allowing inference even on devices where WebGPU is not supported (many Android browsers)
  3. onnx - Ratchet is lighter than ONNX

However, all three of them support 4-bit quantization, whereas (apparently) Ratchet only supports 8-bit. 4-bit quantization is essential: without it, it is impossible to run whisper-v3-turbo or llama-3.2-1b in the browser on devices with limited RAM. So, please support 4-bit quantization soon.
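To make the RAM argument concrete: at 8 bits per weight, a 1B-parameter model needs roughly 1 GB for weights alone; at ~4.5 bits per weight (4-bit values plus per-block scales) it drops to roughly 560 MB. As a rough illustration of how such formats work, here is a minimal sketch of symmetric 4-bit block quantization in Rust, in the spirit of ggml's Q4_0; this is a hypothetical example, not Ratchet's actual format or API:

```rust
// Minimal sketch of symmetric 4-bit block quantization (in the spirit of
// ggml's Q4_0; NOT Ratchet's actual implementation). Each block of 32 f32
// weights becomes one f32 scale plus 16 bytes of packed nibbles:
// ~4.5 bits/weight instead of 32.

const BLOCK_SIZE: usize = 32;

struct BlockQ4 {
    scale: f32,                   // per-block dequantization scale
    packed: [u8; BLOCK_SIZE / 2], // two 4-bit values per byte
}

fn quantize_block(weights: &[f32; BLOCK_SIZE]) -> BlockQ4 {
    // Map the block's largest magnitude onto the signed 4-bit range [-8, 7].
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };

    // Quantize to signed 4 bits, then bias by +8 to store as an unsigned nibble.
    let q = |w: f32| ((w / scale).round().clamp(-8.0, 7.0) as i8 + 8) as u8;

    let mut packed = [0u8; BLOCK_SIZE / 2];
    for (i, pair) in weights.chunks_exact(2).enumerate() {
        packed[i] = q(pair[0]) | (q(pair[1]) << 4);
    }
    BlockQ4 { scale, packed }
}

fn dequantize_block(block: &BlockQ4) -> [f32; BLOCK_SIZE] {
    let mut out = [0.0f32; BLOCK_SIZE];
    for (i, &byte) in block.packed.iter().enumerate() {
        // Unpack each nibble, remove the +8 bias, and rescale.
        out[2 * i] = ((byte & 0x0F) as i8 - 8) as f32 * block.scale;
        out[2 * i + 1] = ((byte >> 4) as i8 - 8) as f32 * block.scale;
    }
    out
}
```

Stored this way, 1e9 parameters take about 1e9 × 4.5 bits ≈ 560 MB, versus ~1 GB at 8-bit, which is exactly the margin that decides whether a model fits in a memory-constrained browser tab.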

@FL33TW00D
Collaborator

Hey @bil-ash,
Thanks for raising these points.

We have done some work on 4-bit quantization here, but it's not yet complete.

CPU and 4-bit support are both very important to us; stay tuned.
