Support for quantisation #359
Hey @okpatil4u, we are looking at running GGML models directly. We're focusing on Q4 for now as it seems the most popular, see #360.
I would also very much appreciate quantization support, as it allows users with weaker hardware to run models. I'm guessing most of the work will be implementing the quantized matmul kernels?
Pretty much yes, and probably loading GGML/GPTQ weights (GPTQ already uses safetensors so it shouldn't be too hard).
Is there a plan to support both GGML and GPTQ in the current project? I must admit I'm not very familiar with GPTQ, but from what I understand, it seems to use a different quantization format compared to GGML. Would this difference necessitate the creation of separate matmul kernels?

On a related note, I believe I could contribute to the loading of GGML files. Additionally, I'm interested in assisting with the implementation of GGML quantization support, though I'm uncertain about where to begin. I presume that one of the initial steps might involve extending the TensorStorage to accommodate quantized tensors and facilitate the loading of data into these tensors.
Yes, it's on my list to look at quantization support this week. It's currently not very clear how we will do this: we already support a bunch of different backends/dtypes/ops that this will have to interact with, and we would like the changes for quantization to be as non-intrusive as possible.
Concerning the operations that can directly utilize quantized weights, GGML appears to implement real quantized operations specifically for its matmul/vec-dot routines. Integrating this functionality into Candle could be relatively straightforward if the matmul could consume quantized weights directly.

To extend support to all other operations, we could introduce an automatic dequantization mechanism. This would be necessary for operations that don't natively support working with quantized types, ensuring seamless integration and compatibility across the system.
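As an illustration of what such a fallback could look like, here is a minimal Rust sketch; the `MaybeQuantized` type and `softmax_fallback` function are made up for the example and are not candle's actual API:

```rust
// Hypothetical tensor that is either a plain f32 buffer or a quantized one.
enum MaybeQuantized {
    F32(Vec<f32>),
    Quantized(Vec<u8>),
}

impl MaybeQuantized {
    // Ops with a real quantized kernel (e.g. the matmul/vec-dot) would match on the
    // enum and use the quantized data directly; everything else goes through this
    // fallback, which materializes f32 values first.
    fn to_f32(&self, dequantize: impl Fn(&[u8]) -> Vec<f32>) -> Vec<f32> {
        match self {
            MaybeQuantized::F32(v) => v.clone(),
            MaybeQuantized::Quantized(q) => dequantize(q),
        }
    }
}

// An op without quantized support simply dequantizes its input first.
fn softmax_fallback(x: &MaybeQuantized, dequantize: impl Fn(&[u8]) -> Vec<f32>) -> Vec<f32> {
    let v = x.to_f32(dequantize);
    let max = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = v.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Toy "dequantizer" for demonstration: each byte maps back to byte/16.
    let deq = |q: &[u8]| q.iter().map(|&b| b as f32 / 16.0).collect::<Vec<_>>();
    let t = MaybeQuantized::Quantized(vec![0, 8, 16, 24]);
    println!("{:?}", softmax_fallback(&t, deq));
}
```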
Right, that's exactly the plan: the initial version will stick to using quantized weights only in the matmul and dequantizing for the other operations.

As preliminary work, I've added some simd bits within candle earlier today so that we can use them in the matrix multiply. I'm currently adding the quantized matmul based on the llama.cpp code. And as you mentioned, longer term we might think about having a proper dtype for quantized tensors, but that's a huge lift so I would prefer to avoid it if we can get a large part of the benefit via the simple approach.
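In case it helps to picture what the quantized matmul operates on: ggml's q4_0 format, as I understand it, stores 32 weights per block as one scale plus 32 packed 4-bit values. Below is a rough Rust sketch of the block layout and its dequantization (the scale is kept as an f32 here to stay dependency-free; ggml actually stores it as an f16):

```rust
const QK4_0: usize = 32;

struct BlockQ40 {
    d: f32,              // per-block scale
    qs: [u8; QK4_0 / 2], // 4-bit quants, two per byte
}

fn dequantize_q4_0(block: &BlockQ40, out: &mut [f32; QK4_0]) {
    for (j, &byte) in block.qs.iter().enumerate() {
        // Low nibble holds weight j, high nibble holds weight j + 16,
        // both offset by 8 so the quantized range is [-8, 7].
        let x0 = (byte & 0x0F) as i16 - 8;
        let x1 = (byte >> 4) as i16 - 8;
        out[j] = x0 as f32 * block.d;
        out[j + QK4_0 / 2] = x1 as f32 * block.d;
    }
}

fn main() {
    // A block whose nibbles are all 8 decodes to all zeros.
    let block = BlockQ40 { d: 0.5, qs: [0x88; 16] };
    let mut out = [0f32; QK4_0];
    dequantize_q4_0(&block, &mut out);
    println!("{:?}", &out[..4]);
}
```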
There has been a bit of progress on quantization, though it's certainly very early days. Roughly, the recent changes introduce a new quantized tensor type together with a ggml example that uses it. This can be run via:

cargo run --example ggml --profile=release-with-debug

This will download some weights from the hf hub. If you want to use local weights, you can point the example at a local model file instead. As I mentioned, this is currently a proof of concept and not something polished at all.
The next steps would involve lining things up with ggml (though it's unclear to me how to do this as I haven't found how to print intermediate tensors out of ggml) and completing the missing quantization functions. @LLukas22 (and others), if you're interested we could certainly use some help here: typically if you want to add some of the missing bits, add some simd optimizations, or help lining things up, we can just sync up on this thread so as to split the work.
Outstanding work! 👌 I took a quick glance at your commits, and I can't wait to play around with them tomorrow. SIMD stuff isn't really my strong suit, but with the ggml code as a guide, I can probably contribute a bit. By the way, I noticed a lot of unimplemented code blocks in quantization types other than q4_0. Are you thinking of supporting all those formats, or are you mainly focusing on q4_0?
Right, most of the quantization types are unsupported; I'm planning to add the most commonly used ones first.
I've been testing on my Ryzen 3700 machine (which doesn't support AVX512), and I must say, the results are genuinely impressive. Initially, I ran the example without your AVX implementation, and it was somewhat sluggish, clocking in at 0.2 t/s. Then I decided to give your AVX implementation a try, but getting it to work was a bit tricky: Rust seemed reluctant to enable the target features on its own, so I added the following to my .cargo/config.toml:

[build]
rustflags = "-C target-cpu=native"

This change led to a massive performance boost, over 10x, reaching about 2.29 t/s. I didn't stop there, though: I also parallelized the matrix multiplications, which sped things up further.

For comparison, running the same model through Rustformer's GGML backend gives me around 8.6 t/s. That means the current implementation is only about 25% slower than ggml.
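For readers wondering what "parallelizing the matmul" amounts to, here is a rough rayon-based sketch of parallelizing over output rows (made-up names, assuming a per-row quantized dot product; not the actual patch):

```rust
use rayon::prelude::*;

// Hypothetical shape of a row-major quantized matmul: `a` is (m, k) in f32, each
// entry of `b_rows` holds one quantized row of the (n, k) weight matrix, and
// `vec_dot` is the per-row quantized dot product (like ggml's vec_dot).
fn qmatmul_parallel(
    a: &[f32],
    b_rows: &[Vec<u8>],
    m: usize,
    k: usize,
    vec_dot: impl Fn(&[f32], &[u8]) -> f32 + Sync,
) -> Vec<f32> {
    let n = b_rows.len();
    let mut out = vec![0f32; m * n];
    // Parallelize over the output rows; each chunk of the output is written by
    // exactly one thread so no synchronization is needed.
    out.par_chunks_mut(n).enumerate().for_each(|(i, out_row)| {
        let a_row = &a[i * k..(i + 1) * k];
        for (j, b_row) in b_rows.iter().enumerate() {
            out_row[j] = vec_dot(a_row, b_row);
        }
    });
    out
}

fn main() {
    // Toy example: 2x3 activations against two "quantized" rows of 3 bytes,
    // where the fake vec_dot just treats the bytes as small integers.
    let a = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b_rows = vec![vec![1u8, 0, 0], vec![0u8, 1, 0]];
    let vec_dot =
        |x: &[f32], q: &[u8]| x.iter().zip(q).map(|(x, &q)| x * q as f32).sum::<f32>();
    println!("{:?}", qmatmul_parallel(&a, &b_rows, 2, 3, vec_dot));
}
```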
That's great, thanks for trying it out and for the comparison numbers! Yes, we should add some multi-threading in there; the trickiness is that we want to avoid using all the cores (on very large boxes) on tiny loads as it can actually be counter-productive. But maybe to start with we just use a simple thread pool.

Re the avx flags: building with the appropriate target features enabled (e.g. via a .cargo/config.toml like yours) should hopefully turn things on. That said, we should certainly document how to enable these instructions for crates that use candle, and we should also print to the user whether avx/neon/... was enabled or not. (edit: ah, and also probably obvious, but it's not "my" avx implementation but the ggml one that I just ported to whatever was available in rust stable; the speed discrepancy btw might be due to some avx instructions being nightly only)
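As a sketch of the "print whether avx/neon was enabled" idea, the standard library's runtime feature detection can be used on x86, and cfg! shows what the binary was compiled with; this is just an illustration, not what candle ships:

```rust
fn print_simd_support() {
    // Report which SIMD paths the current binary can actually use, so users
    // notice when they forgot to enable the target features at build time.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        println!("avx detected at runtime:  {}", is_x86_feature_detected!("avx"));
        println!("avx2 detected at runtime: {}", is_x86_feature_detected!("avx2"));
        println!("fma detected at runtime:  {}", is_x86_feature_detected!("fma"));
        // Whether the code was *compiled* to use them is a separate question:
        println!("compiled with avx: {}", cfg!(target_feature = "avx"));
    }
    #[cfg(target_arch = "aarch64")]
    {
        println!("compiled with neon: {}", cfg!(target_feature = "neon"));
    }
}

fn main() {
    print_simd_support();
}
```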
Waves in Windows user
I thought about adding a
Ah, feel free to make a PR with the proper target and features so that it works on your side, ideally using
Sounds like a good idea. Not sure how large a threadpool object is, or whether it always starts the threads; maybe there is a cost to having these on a per-op basis.
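For what it's worth, one way to amortize the pool cost is to keep a single lazily-created pool shared by all quantized matmuls, with a capped thread count; a sketch assuming rayon, not candle's actual implementation:

```rust
use std::sync::OnceLock;

// A lazily-initialized pool so the threads are only spawned on first use and
// shared across all quantized matmuls rather than created per call.
static QMATMUL_POOL: OnceLock<rayon::ThreadPool> = OnceLock::new();

fn qmatmul_pool() -> &'static rayon::ThreadPool {
    QMATMUL_POOL.get_or_init(|| {
        // Cap the thread count so tiny workloads on very large boxes don't pay
        // more in synchronization than they gain from parallelism.
        let n = std::thread::available_parallelism()
            .map(|n| n.get().min(8))
            .unwrap_or(1);
        rayon::ThreadPoolBuilder::new()
            .num_threads(n)
            .build()
            .expect("failed to build thread pool")
    })
}

fn main() {
    // Example: run some row-wise work inside the shared pool.
    qmatmul_pool().install(|| {
        use rayon::prelude::*;
        let rows: Vec<usize> = (0..64).collect();
        let sums: Vec<usize> = rows.par_iter().map(|r| r * 2).collect();
        println!("processed {} rows", sums.len());
    });
}
```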
Also, I just merged the support for avx in the q6k matmul that is commonly used for the last matrix multiply when using mostly q4_0. Not sure if the model you were trying had it, but if that's the case you may want to update; on my side it's a 10 to 15% speedup so it should put us really close to ggml (looking only at this multiplication itself, the speedup is ~8x).
Yup, can confirm the speedup. I could look into adding q4k and q5k support next. Regarding testing, should I just mimic the q6k tests?
Great to add q4k and q5k support, and yes, mimicking the q6k tests sounds good.
First on my agenda is to concentrate on implementing some of the missing quantization functions. As for aligning with the ggml implementation, I'm not sure yet how best to approach that.
Sounds good. When it comes to matmul tests, we actually already have some that you can see here for q6k for example; these were super helpful as there were some bugs in the first cut. This file also has some tests for the quantization/dequantization roundtrip. Fair enough re aligning with ggml.
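For the new formats, a quantize/dequantize roundtrip check in the same spirit could look like the sketch below; the quantize/dequantize closures stand in for the real block functions and the error bound is illustrative only:

```rust
// Deterministic pseudo-random input, quantize, dequantize, and report the worst
// reconstruction error, so failures are reproducible without an RNG dependency.
fn roundtrip_error(
    quantize: impl Fn(&[f32]) -> Vec<u8>,
    dequantize: impl Fn(&[u8]) -> Vec<f32>,
) -> f32 {
    let mut state = 0x12345678u32;
    let mut next = || {
        // Simple LCG producing values roughly in [-0.5, 0.5).
        state = state.wrapping_mul(1664525).wrapping_add(1013904223);
        (state >> 8) as f32 / (1 << 24) as f32 - 0.5
    };
    let src: Vec<f32> = (0..256).map(|_| next()).collect();
    let deq = dequantize(&quantize(&src));
    src.iter()
        .zip(deq.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0f32, f32::max)
}

fn main() {
    // Demo with a crude fixed-scale 8-bit "quantizer" just to show the shape of the check.
    let err = roundtrip_error(
        |xs| xs.iter().map(|&x| ((x * 128.0) as i8) as u8).collect::<Vec<u8>>(),
        |qs| qs.iter().map(|&q| q as i8 as f32 / 128.0).collect::<Vec<f32>>(),
    );
    assert!(err < 0.01, "roundtrip error too large: {err}");
    println!("max roundtrip error: {err}");
}
```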
Re lining up with ggml, I've made some progress on that front.
That's great news 👍 Just a heads up: they are planning to release the new GGUF file format tomorrow, meaning the current way of reading ggml files will break. I didn't have much time, but I started to port some of the k-quant quantizations over and added some more unit tests to ensure the quantization actually works as expected.
Ah interesting re the gguf format, thanks for pointing this out. (edit: forgot to mention, but I've also added simd support for neon (used on Mac M1/M2) in the q4_0/q8_0 and q6k/q8k vecdot. Pretty nice speedup on these, and overall the text generation is almost twice as fast as before.)
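For reference, here is a scalar sketch of what a q4_0 × q8_0 vec-dot computes, using the block layout assumed earlier with f32 scales; the real ggml/candle kernels perform the same arithmetic with avx/neon intrinsics:

```rust
// Scalar q4_0 x q8_0 dot product over matching blocks of 32 values.
// Each q4 block is (scale, 16 packed nibbles); each q8 block is (scale, 32 i8s).
fn vec_dot_q4_0_q8_0(q4: &[(f32, [u8; 16])], q8: &[(f32, [i8; 32])]) -> f32 {
    let mut acc = 0f32;
    for ((d4, qs4), (d8, qs8)) in q4.iter().zip(q8.iter()) {
        let mut isum = 0i32;
        for j in 0..16 {
            let x0 = (qs4[j] & 0x0F) as i32 - 8; // low nibble -> value j
            let x1 = (qs4[j] >> 4) as i32 - 8;   // high nibble -> value j + 16
            isum += x0 * qs8[j] as i32 + x1 * qs8[j + 16] as i32;
        }
        // Accumulate the integer dot product scaled by both block scales.
        acc += isum as f32 * d4 * d8;
    }
    acc
}

fn main() {
    // One block of each: every q4 nibble decodes to 1 (raw value 9), every q8 value is 2.
    let q4 = [(0.5f32, [0x99u8; 16])];
    let q8 = [(1.0f32, [2i8; 32])];
    println!("{}", vec_dot_q4_0_q8_0(&q4, &q8)); // 32 * (1 * 2) * 0.5 * 1.0 = 32
}
```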
As far as I know, existing ggml files can be converted to the new GGUF format, so the old format shouldn't become unusable overnight.
Yeah, it takes some time but I'm getting there. Just another random thought: how hard would it be to add CUDA support for the qmatmuls?
Ah that sounds good, hopefully we won't have to deprecate the old format too soon on our side (I imagine that it's more of a burden to have multiple formats in their codebase than in ours, so they just want to support "one" format at a time).
Well, we can just port their cuda kernels so it shouldn't be that hard. That said, I think I would rather prioritize performance on M1/M2 before this as it's more appealing to a bunch of potential users.
There were some improvements to the k-quants today, I guess we want to sync these improvements? I also started implementing some of the missing k-quant functions, mostly mirroring the existing q6k tests. Should we just replicate this approach, or is there a better way to test matmuls? Currently I find it a bit hard to debug potential errors.
Re tests, I think the current ones for matmul are actually pretty good as they generate random values with a fixed seed (see here); that said, I certainly won't object to having additional testing :)
Re the k-quant improvements, good point, we should look into syncing them.
Nice. Hopefully the hf tokenizer file will be included in the uploaded gguf models. Since I struggled a bit with the tokenizer setup, having everything in a single file would be a nice improvement.
The vocab/tokens are indeed already in the gguf file.
Ah that's great, that will make things a lot easier. The content of a gguf file can already be inspected with the tensor-tools example, e.g.:

cargo run --example tensor-tools --release -- ls ../llama.cpp/models/7B/ggml-model-q4_0.gguf --verbose

My hope is to add more to this command line tool, e.g. conversion functions from one format to another, quantization functions, etc.
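As a rough illustration of how little is needed to peek at such a file, the sketch below checks the GGUF magic and prints the header version; this is based on my reading of the gguf spec, not on the tensor-tools code:

```rust
use std::fs::File;
use std::io::Read;

// Minimal sketch: check the GGUF magic and read the format version from a file
// header. Field layouts beyond this point depend on the spec version, so only
// the stable prefix is read here.
fn gguf_version(path: &str) -> std::io::Result<u32> {
    let mut f = File::open(path)?;
    let mut magic = [0u8; 4];
    f.read_exact(&mut magic)?;
    if &magic != b"GGUF" {
        return Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "not a gguf file",
        ));
    }
    let mut version = [0u8; 4];
    f.read_exact(&mut version)?;
    Ok(u32::from_le_bytes(version)) // gguf headers are little-endian
}

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: gguf-version <file>");
    println!("gguf version: {}", gguf_version(&path)?);
    Ok(())
}
```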
I would like to contribute to candle by porting some of the avx implementations for the k-quants.
Ah, actually I've started adding these and hope to have a PR ready later today (had to step out for a bit); it will also add a bunch more testing and should make lining things up with the C++ version easier. Sorry for the collision @abuelnasr0
@abuelnasr0 I only looked at the avx implementation for the k-quants but I haven't started porting anything. If you want, you could also give the CUDA implementation a try.
Alright, I played around a bit with the CUDA implementation but I had some trouble with the cuda kernels.
rustformers/llm supports Q2 to Q8 quants in various varieties. Would it be possible to quantize the existing models and run them in this repo?