
Quantized models on Cuda #1250

Open
EmilLindfors opened this issue Nov 3, 2023 · 11 comments

Comments

@EmilLindfors

Hello!
Are there any plans to implement quantized models on CUDA devices?
It would be great to be able to run the forthcoming 14B Mistral on a 3090 with e.g. Q8.
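As a rough, back-of-the-envelope sanity check (assuming GGML's Q8_0 layout of 34 bytes per block of 32 weights, i.e. about 8.5 bits per weight; the 14B parameter count here is illustrative), a model of that size should indeed fit in a 3090's 24 GB:

```rust
// Rough VRAM estimate for Q8_0 weights (assumption: GGML's Q8_0 stores
// 34 bytes per block of 32 weights -- a 2-byte scale plus 32 i8 values).
fn q8_0_weight_bytes(n_params: u64) -> u64 {
    n_params / 32 * 34
}

fn main() {
    let bytes = q8_0_weight_bytes(14_000_000_000);
    let gb = bytes as f64 / 1e9;
    // ~14.9 GB of weights, leaving headroom on a 24 GB RTX 3090 for
    // activations and the KV cache.
    println!("Q8_0 weights for 14B params: {:.1} GB", gb);
}
```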

@LaurentMazare
Collaborator

Hello,
Yes, there is a plan to support this, though it's certainly at least a couple of weeks away. It's good to know there is some demand for it. If other people also think it would be useful, please comment below so that we can bump the priority (though it will have to wait at least until I get my desktop computer back in ~10 days).

@trigger-happy

Commenting here: I'd love to have CUDA + quantization support as well.

@LLukas22
Contributor

LLukas22 commented Nov 5, 2023

I already created another issue for this some time ago: #655

I'm also very interested in getting CUDA acceleration working for quantized tensors, but I think it would be wise to wait for #1230 to mature a bit, as it already added the Device scaffolding to the quantized implementation, which we will also need in order to support CUDA acceleration.

Other than that, this should theoretically be relatively simple, as the quantized CUDA kernels already exist in the ggml/llama.cpp projects. They even have some matmul kernels now, in addition to the older vecdot kernels.
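For readers unfamiliar with those kernels, here is a scalar reference sketch (an assumption-laden illustration, not ggml's actual code) of what a Q8_0 "vecdot" kernel computes: the dot product of two quantized rows, one 32-element block at a time, with the integer multiply-accumulate done per block and the two float scales applied once per block.

```rust
// Scalar reference of a Q8_0 vecdot: ggml's CUDA kernels do the inner
// integer sum with dp4a instructions, but the arithmetic is the same.
const QK8_0: usize = 32;

struct BlockQ8_0 {
    d: f32,          // per-block scale (stored as f16 in ggml; f32 here for simplicity)
    qs: [i8; QK8_0], // quantized values
}

fn vec_dot_q8_0(xs: &[BlockQ8_0], ys: &[BlockQ8_0]) -> f32 {
    xs.iter()
        .zip(ys)
        .map(|(x, y)| {
            // Integer multiply-accumulate within the block...
            let isum: i32 = x
                .qs
                .iter()
                .zip(&y.qs)
                .map(|(&a, &b)| a as i32 * b as i32)
                .sum();
            // ...then one float multiply by the two block scales.
            x.d * y.d * isum as f32
        })
        .sum()
}
```

A matmul kernel is then "many vecdots" with better data reuse, which is why the newer matmul kernels mentioned above outperform repeated vecdot calls.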

@danielclough
Contributor

danielclough commented Nov 23, 2023

Commenting to bump priority, as requested.

> Yes there is a plan to have this supported [...] If other people also think that it would be useful, please comment below so that we can bump the priority for this.

Related open issues

Support for quantisation: #359
CUDA support for QMatMul: #655
Error: no cuda implementation for qmatmul: #696
You are here: #1250

@miketang84

Come on, we need it. Thanks!

@np33kf

np33kf commented Jan 4, 2024

This will be a game changer for running bigger LLMs on consumer-grade GPUs, since memory is the main constraint. Big thanks for everyone's efforts. This is an awesome framework! I love Rust and can't stand coding in Python...

@EricLBuehler
Member

Looks like an exciting development.

@miketang84

miketang84 commented Feb 11, 2024

Hi, may I politely ask about the progress on quantization support for CUDA? Is there any news? I have some time in the coming days and could help, if needed.

@LLukas22
Contributor

@miketang84

Due to time constraints, I wasn't able to dive deeper into this. However, for enabling gguf quantizations with CUDA, essentially three steps are required:

  1. There's a need to implement QCudaStorage, similar to how QMetalStorage was implemented for Metal, as seen here: QMetalStorage.
  2. The CUDA kernels from ggml-cuda.cu must be ported to candle-kernels and properly integrated into the build process.
  3. Implementation of cuda_fwd for QTensor is needed, akin to the metal_fwd implementation found here: metal_fwd.
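Step 1 might look roughly like the sketch below. To be clear, these are hypothetical stand-in types, not candle's actual API: a real QCudaStorage would wrap a cudarc device buffer (e.g. `CudaSlice<u8>`) the way QMetalStorage wraps a Metal buffer, and candle's GgmlDType enum covers many more formats.

```rust
// Hypothetical sketch of a CUDA-side quantized storage, mirroring the
// QMetalStorage pattern: a raw device buffer plus the dtype and element
// count needed to dispatch the right dequantize/matmul kernel.
#![allow(dead_code)]

#[derive(Debug, Clone, Copy, PartialEq)]
enum GgmlDType {
    F32,
    Q4_0,
    Q8_0,
}

// Stand-in for a device allocation; a real implementation would hold a
// cudarc::driver::CudaSlice<u8> and a device handle instead.
struct CudaBuffer {
    bytes: Vec<u8>,
}

struct QCudaStorage {
    buffer: CudaBuffer,
    dtype: GgmlDType,
    elem_count: usize,
}

impl QCudaStorage {
    fn new(bytes: Vec<u8>, dtype: GgmlDType, elem_count: usize) -> Self {
        Self { buffer: CudaBuffer { bytes }, dtype, elem_count }
    }

    // Bytes per quantized block for this dtype (GGML layouts: Q8_0 packs
    // 32 weights into 34 bytes, Q4_0 packs 32 weights into 18 bytes).
    fn block_bytes(&self) -> usize {
        match self.dtype {
            GgmlDType::F32 => 4,   // one f32 per "block" of one element
            GgmlDType::Q4_0 => 18, // 2-byte scale + 16 bytes of 4-bit nibbles
            GgmlDType::Q8_0 => 34, // 2-byte scale + 32 i8 values
        }
    }
}
```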

@akhildevelops

akhildevelops commented Feb 15, 2024

Please link any ongoing PR / branch for this feature, if work on it has started.

@LaurentMazare
Collaborator

You can check out #1754, which contains a first implementation of CUDA support for quantized models. It's certainly not optimal in terms of performance and there are a bunch of optimizations/kernels still to be added, but I hope to merge a first cut later today.
