Quantized models on Cuda #1250
Hello,
Commenting here, I'd love to have Cuda + quantization support as well.
I already created another issue for this some time ago: #655. I'm also very interested in getting cuda acceleration working for quantized tensors, but I think it would be wise to wait for #1230 to mature a bit, as they already added the whole […]. Other than that, this should theoretically be relatively simple, as the quantized cuda kernels already exist in the […].
Commenting to bump priority, as requested.
Related open issues: Support for quantisation: #359
Come on, we need it, thanks.
This will be a game changer for running bigger LLMs on consumer-grade GPUs, as memory is the main constraint. Big thanks for the efforts of everyone. This is an awesome framework! Love Rust and cannot stand to code in Python...
Looks like an exciting development.
Hi guys, may I politely ask about the progress on supporting quantization on cuda? Is there any new info? In the coming days I have some time to help, if needed.
Due to time constraints, I wasn't able to dive deeper into this. However, for enabling gguf quantizations with CUDA, essentially three steps are required: […]
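To make that concrete, here is a minimal sketch of the kind of structure such an integration might need. Everything in it (the `BlockQ4` and `QStorage` types, the nibble layout, the dequantize-then-matmul fallback) is a hypothetical illustration in the spirit of gguf's Q4_0 blocks and llama.cpp's kernels, not candle's actual implementation.

```rust
// Minimal sketch (hypothetical types, not candle's real API): a quantized
// storage that can live on either the CPU or a CUDA device, plus the simple
// "dequantize, then run a regular matmul" fallback that a first CUDA
// implementation could use before fused kernels are ported from llama.cpp.

/// 4-bit block quantization similar in spirit to gguf's Q4_0:
/// 32 weights share one f32 scale, each weight stored as a 4-bit value.
struct BlockQ4 {
    scale: f32,
    // 32 quantized values packed two-per-byte.
    qs: [u8; 16],
}

#[allow(dead_code)]
enum QStorage {
    /// Blocks held in host memory.
    Cpu(Vec<BlockQ4>),
    /// Blocks copied to a CUDA buffer (represented here by a device ordinal
    /// and raw bytes; a real implementation would hold a device pointer).
    Cuda { device_ordinal: usize, raw: Vec<u8> },
}

impl QStorage {
    /// Dequantize into f32 so an existing (already CUDA-capable) matmul can
    /// consume the result: the first half of the "dequantize -> matmul" fallback.
    fn dequantize(&self) -> Vec<f32> {
        match self {
            QStorage::Cpu(blocks) => blocks
                .iter()
                .flat_map(|b| {
                    (0..32).map(move |i| {
                        let byte = b.qs[i / 2];
                        let nibble = if i % 2 == 0 { byte & 0x0f } else { byte >> 4 };
                        // Center the 4-bit value around zero, then rescale.
                        (nibble as f32 - 8.0) * b.scale
                    })
                })
                .collect(),
            QStorage::Cuda { .. } => {
                // A real backend would launch a dequantize kernel here and
                // keep the result on the device instead of copying back.
                unimplemented!("launch CUDA dequantize kernel")
            }
        }
    }
}

fn main() {
    let storage = QStorage::Cpu(vec![BlockQ4 { scale: 0.5, qs: [0x21; 16] }]);
    let weights = storage.dequantize();
    println!("first dequantized values: {:?}", &weights[..4]);
}
```

The idea is that a first CUDA path can reuse the existing matmul once the data is dequantized on-device, with fused dequantize-matmul kernels added later for performance.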
Please link any ongoing PR / branch for this feature, if work on it has started.
You can check out #1754, which contains a first implementation of cuda support for quantized models. It's certainly not optimal in terms of performance and there are a bunch of optimizations/kernels to be added, but I hope to merge a first cut later today.
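For anyone landing here later, a rough sketch of what loading a gguf-quantized model onto a CUDA device can look like once that work is in. This assumes the post-#1754 API shape (`Device::cuda_if_available`, `ModelWeights::from_gguf` taking a device) and uses anyhow for error handling; exact module paths and signatures may differ between candle versions, and the file name is a placeholder.

```rust
// Sketch of loading a gguf-quantized model onto a CUDA device with candle,
// assuming the API shape after #1754 (paths/signatures may differ by version).
use candle_core::quantized::gguf_file;
use candle_core::Device;
use candle_transformers::models::quantized_llama::ModelWeights;

fn main() -> anyhow::Result<()> {
    // Falls back to CPU if no CUDA device is available.
    let device = Device::cuda_if_available(0)?;

    // Placeholder path to a gguf file with q8_0 weights.
    let mut file = std::fs::File::open("model-q8_0.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;

    // After #1754, the quantized weights can be placed on the CUDA device directly.
    let model = ModelWeights::from_gguf(content, &mut file, &device)?;
    // ... run the usual forward / sampling loop with `model` ...
    let _ = model;
    Ok(())
}
```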
Hello!
Are there any plans to implement quantized models on cuda devices?
It would be great to be able to run the forthcoming 14b mistral on a 3090 with e.g. q_8.
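For rough context on why q_8 is the interesting target here (back-of-the-envelope, weights only, ignoring activations and the KV cache): a 14B-parameter model at f16 needs about 14e9 × 2 bytes ≈ 28 GB, more than a 3090's 24 GB, while at 8-bit quantization (about 1 byte per weight plus block scales) the weights come to roughly 14 to 15 GB, leaving room for the KV cache and activations.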