
Quantized models on Cuda #1250

Open
EmilLindfors opened this issue Nov 3, 2023 · 11 comments

Comments

@EmilLindfors

Hello!
Are there any plans to implement quantized models on CUDA devices?
It would be great to be able to run the forthcoming 14B Mistral on a 3090 with e.g. Q8.
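As a rough, back-of-the-envelope sanity check (assuming GGML's Q8_0 layout of 34 bytes per block of 32 weights, i.e. about 8.5 bits per weight; the 14B parameter count here is illustrative), a model of that size should indeed fit in a 3090's 24 GB:

```rust
// Rough VRAM estimate for Q8_0 weights (assumption: GGML's Q8_0 stores
// 34 bytes per block of 32 weights -- a 2-byte scale plus 32 i8 values).
fn q8_0_weight_bytes(n_params: u64) -> u64 {
    n_params / 32 * 34
}

fn main() {
    let bytes = q8_0_weight_bytes(14_000_000_000);
    let gb = bytes as f64 / 1e9;
    // ~14.9 GB of weights, leaving headroom on a 24 GB RTX 3090 for
    // activations and the KV cache.
    println!("Q8_0 weights for 14B params: {:.1} GB", gb);
}
```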

@LaurentMazare
Collaborator

Hello,
Yes, there is a plan to support this, though it's certainly at least a couple of weeks away. It's good to know there is some demand for it. If other people also think it would be useful, please comment below so that we can bump the priority (though it will have to wait at least until I get my desktop computer back in ~10 days).

@trigger-happy

Commenting here: I'd love to have CUDA + quantization support as well.

@LLukas22
Contributor

LLukas22 commented Nov 5, 2023

I already created another issue for this some time ago: #655

I'm also very interested in getting CUDA acceleration working for quantized tensors, but I think it would be wise to wait for #1230 to mature a bit, as it already added the Device scaffolding to the quantized implementation, which we will also need in order to support CUDA acceleration.

Other than that, this should theoretically be relatively simple, as the quantized CUDA kernels already exist in the ggml/llama.cpp projects. They even have some matmul kernels now, in addition to the older vecdot kernels.
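For readers unfamiliar with those kernels, here is a scalar reference sketch (an assumption-laden illustration, not ggml's actual code) of what a Q8_0 "vecdot" kernel computes: the dot product of two quantized rows, one 32-element block at a time, with the integer multiply-accumulate done per block and the two float scales applied once per block.

```rust
// Scalar reference of a Q8_0 vecdot: ggml's CUDA kernels do the inner
// integer sum with dp4a instructions, but the arithmetic is the same.
const QK8_0: usize = 32;

struct BlockQ8_0 {
    d: f32,          // per-block scale (stored as f16 in ggml; f32 here for simplicity)
    qs: [i8; QK8_0], // quantized values
}

fn vec_dot_q8_0(xs: &[BlockQ8_0], ys: &[BlockQ8_0]) -> f32 {
    xs.iter()
        .zip(ys)
        .map(|(x, y)| {
            // Integer multiply-accumulate within the block...
            let isum: i32 = x
                .qs
                .iter()
                .zip(&y.qs)
                .map(|(&a, &b)| a as i32 * b as i32)
                .sum();
            // ...then one float multiply by the two block scales.
            x.d * y.d * isum as f32
        })
        .sum()
}
```

A matmul kernel is then "many vecdots" with better data reuse, which is why the newer matmul kernels mentioned above outperform repeated vecdot calls.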

@danielclough
Contributor

danielclough commented Nov 23, 2023

Commenting to bump priority, as requested.

> Yes there is a plan to have this supported [...] If other people also think that it would be useful, please comment below so that we can bump the priority for this.

Related open issues

Support for quantisation: #359
CUDA support for QMatMul: #655
Error: no cuda implementation for qmatmul: #696
You are here: #1250

@miketang84

Come on, we need it. Thanks!

@np33kf

np33kf commented Jan 4, 2024

This will be a game changer for running bigger LLMs on consumer-grade GPUs, since memory is the main constraint. Big thanks for everyone's efforts. This is an awesome framework! I love Rust and can't stand coding in Python...

@EricLBuehler
Member

Looks like an exciting development.

@miketang84

miketang84 commented Feb 11, 2024

Hi, may I politely ask about the progress on quantization support for CUDA? Is there any news? I have some time in the coming days and could help, if needed.

@LLukas22
Contributor

@miketang84

Due to time constraints, I wasn't able to dive deeper into this. However, for enabling gguf quantizations with CUDA, essentially three steps are required:

  1. There's a need to implement QCudaStorage, similar to how QMetalStorage was implemented for Metal, as seen here: QMetalStorage.
  2. The CUDA kernels from ggml-cuda.cu must be ported to candle-kernels and properly integrated into the build process.
  3. Implementation of cuda_fwd for QTensor is needed, akin to the metal_fwd implementation found here: metal_fwd.
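Step 1 might look roughly like the sketch below. To be clear, these are hypothetical stand-in types, not candle's actual API: a real QCudaStorage would wrap a cudarc device buffer (e.g. `CudaSlice<u8>`) the way QMetalStorage wraps a Metal buffer, and candle's GgmlDType enum covers many more formats.

```rust
// Hypothetical sketch of a CUDA-side quantized storage, mirroring the
// QMetalStorage pattern: a raw device buffer plus the dtype and element
// count needed to dispatch the right dequantize/matmul kernel.
#![allow(dead_code)]

#[derive(Debug, Clone, Copy, PartialEq)]
enum GgmlDType {
    F32,
    Q4_0,
    Q8_0,
}

// Stand-in for a device allocation; a real implementation would hold a
// cudarc::driver::CudaSlice<u8> and a device handle instead.
struct CudaBuffer {
    bytes: Vec<u8>,
}

struct QCudaStorage {
    buffer: CudaBuffer,
    dtype: GgmlDType,
    elem_count: usize,
}

impl QCudaStorage {
    fn new(bytes: Vec<u8>, dtype: GgmlDType, elem_count: usize) -> Self {
        Self { buffer: CudaBuffer { bytes }, dtype, elem_count }
    }

    // Bytes per quantized block for this dtype (GGML layouts: Q8_0 packs
    // 32 weights into 34 bytes, Q4_0 packs 32 weights into 18 bytes).
    fn block_bytes(&self) -> usize {
        match self.dtype {
            GgmlDType::F32 => 4,   // one f32 per "block" of one element
            GgmlDType::Q4_0 => 18, // 2-byte scale + 16 bytes of 4-bit nibbles
            GgmlDType::Q8_0 => 34, // 2-byte scale + 32 i8 values
        }
    }
}
```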

@akhildevelops

akhildevelops commented Feb 15, 2024

Please link any ongoing PR / branch for this feature, if work on it has started.

@LaurentMazare
Collaborator

You can check out #1754, which contains a first implementation of CUDA support for quantized models. It's certainly not optimal in terms of performance and there are a bunch of optimizations/kernels still to be added, but I hope to merge a first cut later today.
