Support for quantisation #359

Closed
okpatil4u opened this issue Aug 9, 2023 · 36 comments

Comments

@okpatil4u

rustformers/llm supports Q2 to Q8 quants in several variants. Would it be possible to quantize the existing models and run them in this repo?

@Narsil
Collaborator

Narsil commented Aug 9, 2023

Hey @okpatil4u,

We are looking at running GGML models directly. We're focusing on Q4 for now as it seems to be the most popular (#360).

@LLukas22
Contributor

LLukas22 commented Aug 9, 2023

I would also very much appreciate quantization support, as it allows users with weaker hardware to run models. I'm guessing most of the work will be implementing the quantized matmul kernels?

@Narsil
Collaborator

Narsil commented Aug 11, 2023

Pretty much yes, and probably loading GGML/GPTQ weights (GPTQ is already safetensors so shouldn't be too hard)

@LLukas22
Contributor

LLukas22 commented Aug 14, 2023

Is there a plan to support both GGML and GPTQ in the current project? I must admit, I'm not very familiar with GPTQ, but from what I understand, it seems to use a different quantization format compared to GGML. Would this difference necessitate the creation of separate matmul kernels?

On a related note, I believe I could contribute to the loading of GGML files. Additionally, I'm interested in assisting with the implementation of GGML quantization support, though I'm uncertain about where to begin. I presume that one of the initial steps might involve extending the TensorStorage to accommodate quantized tensors and facilitate the loading of data into these tensors.

@LaurentMazare
Collaborator

Yes, it's on my list to look at quantization support this week. It's currently not very clear how we will do this: we already support a bunch of different backends/dtypes/ops that this will have to interact with, and we would like the changes for quantization to be as non-intrusive as possible.
That's why the idea was to start by being able to load ggml weights which is almost ready, and then see how we could adapt some of the ops to use the quantized weights directly.

@LLukas22
Contributor

Concerning the operations that can directly utilize quantized weights, GGML appears to implement real quantized operations specifically for its q x f32 matmul operation. Within this operation, many quantized vector-dot operations are performed.

Integrating this functionality into Candle could be relatively straightforward if f32 is consistently the target of these quantized operations. This approach would mean that most of the weights could remain quantized during a forward call, without requiring adjustments to any of the other implemented operations.

To extend support to all other operations, we could introduce an automatic dequantization mechanism. This would be necessary for operations that don't natively support working with quantized types, ensuring seamless integration and compatibility across the system.
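
To sketch the idea (hypothetical types and names, not candle's actual API): weights stay quantized for the dedicated q x f32 matmul, and everything else goes through the dequantization fallback.

trait QuantizedOps {
    // Dedicated kernel: quantized weights times f32 activations, built from
    // quantized vector-dot products as in ggml.
    fn matmul_f32(&self, activations: &[f32], out: &mut [f32]);

    // Fallback for ops that have no quantized implementation: expand back to f32.
    fn dequantize(&self) -> Vec<f32>;
}

fn apply_op(
    weights: &dyn QuantizedOps,
    activations: &[f32],
    out: &mut [f32],
    has_quantized_kernel: bool,
) {
    if has_quantized_kernel {
        // The weights stay quantized for the whole forward call.
        weights.matmul_f32(activations, out);
    } else {
        // Automatic dequantization for everything else, then the regular f32 path.
        let _full_precision = weights.dequantize();
        // ... existing f32 op ...
    }
}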

@LaurentMazare
Collaborator

Right, that's exactly the plan: the initial version will stick to using f32 tensors internally and just have specific matmul using the quantized weights. It's also what Sasha has been doing in llama2.rs with good performance.

As preliminary work, I've added some simd bits within candle earlier today so that we can use them in the matrix multiply. I'm currently adding the quantized matmul based on the llama.cpp code.

And as you mentioned, longer term we might think about having a proper dtype for quantized tensors, but that's a huge lift so I'd prefer to avoid it if we can get a large part of the benefit via the simple approach.
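
For anyone following along, here's a rough sketch of the q4_0 block layout used by llama.cpp and a scalar (no simd) dot product against f32 values; the real kernels live in k_quants.rs, this is just an illustration.

use half::f16; // the `half` crate provides the f16 scale type

const QK4_0: usize = 32;

#[repr(C)]
struct BlockQ4_0 {
    d: f16,              // per-block scale
    qs: [u8; QK4_0 / 2], // 32 weights packed as two 4-bit values per byte
}

// Scalar dot product of one quantized block against 32 f32 activations: the
// stored nibbles 0..=15 are offset by 8, low nibbles cover the first 16
// weights and high nibbles the last 16.
fn vec_dot_q4_0_f32(block: &BlockQ4_0, x: &[f32; QK4_0]) -> f32 {
    let mut sum = 0f32;
    for (i, &byte) in block.qs.iter().enumerate() {
        sum += ((byte & 0x0F) as i32 - 8) as f32 * x[i];
        sum += ((byte >> 4) as i32 - 8) as f32 * x[i + QK4_0 / 2];
    }
    sum * block.d.to_f32()
}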

@LaurentMazare
Collaborator

There has been a bit of progress on quantization, though it's certainly very early days. Roughly, the recent changes introduce a new QTensor type that is only used to create linear layers, and there is an example that uses it for a ggml llama model here.

This can be run via:

cargo run --example ggml --profile=release-with-debug

This will download some weights from the hf hub. If you want to use local weights, you can set --model ggml-model-q4_0.bin where the .bin file has been generated by the llama.cpp quantize command.

As I mentioned this is currently a proof of concept and not something polished at all:

  • It is very, very slow: the ops are implemented in a naive way, with no simd and no multi-threading.
  • When using Q4 the model is very unstable; it hasn't been lined up properly with ggml yet, so it's unclear where this numerical instability is coming from.
  • Most of the functions in k_quants.rs are todos; we should prioritize the ones we care about.

The next steps would involve: lining things up with ggml (though it's unclear to me how to do this, as I haven't found a way to print intermediate tensors out of ggml), completing the functions in k_quants.rs and optimizing them, and profiling the current code to see where the time is spent, so that hopefully we can make this much faster.

@LLukas22 (and others), if you're interested we could certainly use some help here, typically if you want to add some of the missing bits/add some simd optimizations/help lining things up we can just sync up on this thread so as to split the work.

@LLukas22
Contributor

Outstanding work! 👌

I took a quick glance at your commits, and I can't wait to play around with them tomorrow. SIMD stuff isn't really my strong suit, but with the ggml code as a guide, I can probably contribute a bit.

By the way, I noticed a lot of unimplemented code blocks in quantization types other than q4_0. Are you thinking of supporting all those formats, or are you mainly focusing on q4_0?

@LaurentMazare
Collaborator

Right, most of the quantization types are unsupported. I'm planning to add the vec_dot for Q6K at the moment, as it would directly be used by the last fc layer of the q4_0 llama quantization (in the current implementation this is converted to an f32 matmul, which is likely a lot slower).
Certainly happy if you want to implement some of the others. Also, I took a stab at simd'ing q4_0 in #474 but strangely it ended up being slower than the default implementation. My guess is that the compiler already does a good job of vectorizing out of the box (these are mostly int types rather than the float types that rust wouldn't auto-vectorize), so anyway there's no rush in having the simd version.
If you add some stuff, don't hesitate to add tests in quantized_tests.rs; typically matmul/from-float/to-float are good to cover.
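
As an illustration of the from-float/to-float round-trip shape, here's a self-contained sketch using a simplified q4_0-style block (not the candle implementation):

fn quantize_block(src: &[f32; 32]) -> (f32, [u8; 16]) {
    // Map the largest magnitude onto the signed 4-bit range [-8, 7].
    let amax = src.iter().fold(0f32, |m, &v| m.max(v.abs()));
    let d = amax / 7.0;
    let inv = if d == 0.0 { 0.0 } else { 1.0 / d };
    let mut qs = [0u8; 16];
    for i in 0..16 {
        let lo = ((src[i] * inv).round().clamp(-8.0, 7.0) as i32 + 8) as u8;
        let hi = ((src[i + 16] * inv).round().clamp(-8.0, 7.0) as i32 + 8) as u8;
        qs[i] = lo | (hi << 4);
    }
    (d, qs)
}

fn dequantize_block(d: f32, qs: &[u8; 16]) -> [f32; 32] {
    let mut out = [0f32; 32];
    for i in 0..16 {
        out[i] = ((qs[i] & 0x0F) as i32 - 8) as f32 * d;
        out[i + 16] = ((qs[i] >> 4) as i32 - 8) as f32 * d;
    }
    out
}

#[test]
fn round_trip_error_stays_within_one_step() {
    let src: [f32; 32] = core::array::from_fn(|i| (i as f32 - 16.0) / 4.0);
    let (d, qs) = quantize_block(&src);
    let restored = dequantize_block(d, &qs);
    for (a, b) in src.iter().zip(restored.iter()) {
        // Quantization is lossy, so only check that the error stays below one step.
        assert!((a - b).abs() <= d, "{a} vs {b}");
    }
}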

@LLukas22
Contributor

I've been testing on my Ryzen 3700 machine (which doesn't support AVX512), and I must say, the results are genuinely impressive.

Initially, I ran the example without your AVX implementation, and it was somewhat sluggish, clocking in at 0.2 t/s.

Then I decided to give your AVX implementation a try, but getting it to work was a bit tricky. Rust seemed reluctant to set the avx target feature. I finally managed to force it on by adding the following to my .cargo\config.toml file:

[build]
rustflags = "-C target-cpu=native"

This change led to a massive performance boost, over 10x, reaching about 2.29 t/s.

I didn't stop there, though. I parallelized the vec_dot calls using rayon, which gave me another 3x performance increase, bringing it up to 6.53 t/s.

For comparison, running the same model through Rustformer's GGML backend gives me around 8.6 t/s. That means the current implementation is only about 25% slower than llama.cpp. In my opinion, that's quite an achievement. Well done!
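
The rayon parallelization is roughly this shape (simplified sketch, not the exact code; vec_dot stands in for the quantized dot product):

use rayon::prelude::*;

// Each output row of the matmul is independent, so the rows can be split across
// threads; `vec_dot` computes the dot product for one (row, column) pair.
fn parallel_matmul(dst: &mut [f32], m: usize, n: usize, vec_dot: impl Fn(usize, usize) -> f32 + Sync) {
    assert_eq!(dst.len(), m * n);
    dst.par_chunks_mut(n).enumerate().for_each(|(row, out_row)| {
        for (col, out) in out_row.iter_mut().enumerate() {
            *out = vec_dot(row, col);
        }
    });
}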

@LaurentMazare
Collaborator

LaurentMazare commented Aug 17, 2023

That's great, thanks for trying it out and for the comparison numbers! Yes, we should add some multi-threading in there; the trickiness is that we want to avoid using all the cores (on very large boxes) on tiny loads, as it can actually be counterproductive. But maybe to start with we just use utils::get_num_threads to control the number of threads (this will use the number of physical cores, which is usually better than the number of logical ones in the case of hyper-threading).

Re the avx trickiness, is it because you're using a separate crate? In the main candle repo we should have:

cat .cargo/config.toml
[target.x86_64-unknown-linux-gnu]
rustflags = ["-C", "target-cpu=native"]

[target.aarch64-apple-darwin]
rustflags = ["-C", "target-cpu=native"]

[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+simd128"]

This should hopefully turn things on. That said, we should certainly document how to enable these instructions for crates that use candle, and we should also print to the user whether avx/neon/... was enabled or not.

(edit: ah, and also probably obvious, but it's not "my" avx implementation but the ggml one that I just ported to whatever was available in rust stable; the speed discrepancy btw might be due to some avx instructions being nightly only)

@LLukas22
Contributor

LLukas22 commented Aug 17, 2023

This should hopefully turn things on. That said, we should certainly document how to enable these instructions for crates that use candle, and we should also print to the user whether avx/neon/... was enabled or not.

Waves in windows user

Yes we should add some multi-threading in there, the trickiness is that we want to avoid using all the cores (on very large boxes) on tiny loads as it can actually be counter productive.

I thought about adding a rayon::ThreadPoolBuilder to the QMatMul op, which could be initialized during the from_qtensor call. At that point we know the dimensions of one of the tensors used in the matmul, which could be helpful in figuring out a sensible thread count.
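
Something along these lines, with a made-up heuristic just to sketch the idea:

use rayon::{ThreadPool, ThreadPoolBuilder};

// Size the pool from the weight matrix's row count so small matmuls don't fan
// out across every core (256 rows per thread is an arbitrary placeholder).
fn pool_for_rows(rows: usize) -> ThreadPool {
    let threads = ((rows + 255) / 256).clamp(1, num_cpus::get_physical());
    ThreadPoolBuilder::new()
        .num_threads(threads)
        .build()
        .expect("failed to build rayon thread pool")
}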

@LaurentMazare
Collaborator

Waves in windows user

Ah, feel free to make a PR with the proper target and features so that it works on your side, ideally using native so that it works for all the simd instruction sets.

I thought about adding a rayon::ThreadPoolBuilder to the QMatMul op, which could be initialized during the from_qtensor call. At that point we know the dimensions of one of the tensors used in the matmul, which could be helpful in figuring out a sensible thread count.

Sounds like a good idea. I'm not sure how large a threadpool object is; if it always starts the threads, there may be a cost to having one per QMatMul. Did you come across par_for_each? We use it in a couple of places to tightly control the number of threads that are working (though it has the downside of being less fine-grained than parallelising into lots of small tasks); you can grep the codebase to see how we already use it.

@LaurentMazare
Collaborator

Also, I just merged the support for avx in the q6k matmul that is commonly used for the last matrix multiply when using mostly q4_0. Not sure if the model you were trying had it, but if so you may want to update: on my side it's a 10 to 15% speedup, so it should put us really close to ggml (looking only at this multiplication itself, the speedup is ~8x).

@LLukas22
Contributor

Yup, I can confirm the q6k avx implementation gets the llama implementation pretty close to ggml's performance. I'll probably look into supporting q4k and q5k, as this should enable us to run most of the q4 k-quants, which should perform a lot better than the original q4_0 format.

Regarding testing, should I just mimic the q6k tests?

@LaurentMazare
Collaborator

Great to add q4k and q5k support, and yes, mimicking the q6k tests sounds good.
We should also try to line up the inference with what llama.cpp produces: ideally, when setting the temperature to 0 we should produce the same thing (up to some hopefully small numerical errors). That said, it's probably not a super easy thing to do, and I'm currently a bit worried by some of the text that I generated in my examples, which seemed a bit poor.

@LLukas22
Contributor

First on my agenda is to concentrate on implementing some of the from_float functions. I want to make sure that roundtrips are functioning correctly through thorough testing. Once that's squared away, the next step will be to validate that the qmatmuls are producing accurate results. This should be relatively straightforward if we can create test inputs from f32 tensors. However, I'll likely need some time for this part, as my goal is to make the functions more readable and easier to understand during the porting process.

As for aligning with the llama.cpp inference, it's not currently a primary focus for me. While the ability to run GGML files produced by llama.cpp would be a nice enhancement, my main interest lies in quantization support for other models like stable-diffusion or whisper. For those, I plan to use the existing inference code as a template.

@LaurentMazare
Collaborator

Sounds good. When it comes to matmul tests, we actually already have some that you can see here for q6k, for example; these were super helpful, as there were some bugs in the first cut. This file also has some tests for the from_float / to_float round-tripping, so hopefully we can cover all functionalities there.

Fair enough re aligning with llama.cpp; it's certainly a pretty difficult task. I feel there is some value in it, as it helps discover small bugs, but I'll have a go at it on my side. And certainly looking forward to you trying out quantization for the other models - one of the main selling points of candle is that it should make these experimentations a lot easier than llama.cpp, where introducing new models/architectures is a bit harder.

@LaurentMazare
Collaborator

Re lining up with llama.cpp, it took me forever to debug this but it turned out there was some subtle bug in the way rope embeddings were computed on the candle side (totally my fault).
I've just merged #518 which fixes this. After applying the change, the generated sequences with temperature 0 seem to exactly match the llama.cpp ones (both for f16 and q4_0), and the generated sequences when using some non-zero temperature look a lot better. We definitely have to get better at figuring out this kind of issue but hopefully this makes it possible to also use the full text generation loop as a test for new quantizations.
Also with multi-threading, the generation speed seems to be within 10% of llama.cpp on my ryzen 2600x which is great.

@LLukas22
Contributor

That's great news 👍

Just a heads up: they are planning to release the new GGUF file format tomorrow, which means the current way of reading ggml files will break.

I didn't have much time, but I started to port some of the k-quant quantizations over and added some more unit tests to ensure the quantization actually works as expected. q4k is already finished, but I'm having some trouble with q2k and q5k, which produce strange results with larger positive numbers. I'll probably set up the same test suite in the ggml repo to help me debug this and ensure that the results match.

@LaurentMazare
Collaborator

LaurentMazare commented Aug 20, 2023

Ah interesting re the gguf format, thanks for pointing this out.
By "the current way of reading ggml files will break" you mean that we will need to support gguf specifically or that we'll have to do breaking changes in the way we currently read files? Looking at the discussion on the llama.cpp thread, I would have hoped that it's just a matter of supporting the new format and that the current support for the "old" file format will continue to work but maybe I'm missing some things.
And yeah the experience of lining things up is certainly tricky, more unit tests will certainly be good and would be very keen on your idea to have the same tests on both sides!

edit: forgot to mention, but I've also added simd support for neon (used on mac M1/M2) in the q4_0/q8_0 and q6k/q8k vec_dot. Pretty nice speedup on these, and overall the text generation is almost twice as fast as before.

@LLukas22
Contributor

By "the current way of reading ggml files will break" you mean that we will need to support gguf specifically or that we'll have to do breaking changes in the way we currently read files? Looking at the discussion on the llama.cpp thread, I would have hoped that it's just a matter of supporting the new format and that the current support for the "old" file format will continue to work but maybe I'm missing some things.

As far as I know the GGJT file format will be deprecated. The original author of the GGUF spec is currently working on a Rust implementation, so we could probably use that to read the new files.

And yeah, the experience of lining things up is certainly tricky; more unit tests will certainly be good, and I would be very keen on your idea of having the same tests on both sides!

Yeah, it takes some time but I'm getting there. Just another random thought: how hard would it be to add CUDA support for the qmatmuls?

@LaurentMazare
Collaborator

As far as I know the GGJT file format will be deprecated. The original author of the GGUF spec is currently working on a Rust implementation, so we could probably use that to read the new files.

Ah that sounds good, hopefully we won't have to deprecate the old format too soon on our side (I imagine that it's more of a burden to have multiple formats in their codebase than in ours, so they just want to support "one" format at a time).

Yeah, it takes some time but I'm getting there. Just another random thought: how hard would it be to add CUDA support for the qmatmuls?

Well, we can just port their cuda kernels so it shouldn't be that hard. That said, I think I would rather prioritize performance on M1/M2 before this, as it's more appealing to a bunch of potential users.

@LLukas22
Contributor

There were some improvements to the k-quants today; I guess we want to sync these improvements?

I also started implementing some of the vec_dot functions but unit testing them is a bit iffy. GGML just uses this function to generate some test data:

void generate_data(float offset, size_t n, float * dst) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = 0.1 + 2*cosf(i + offset);
    }
}

Should we just replicate this or is there a better way to test matmuls? Currently I find it a bit hard to debug potential errors.
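
If we do replicate it, a direct Rust port would be tiny:

// Deterministic pseudo-data matching ggml's generate_data helper.
fn generate_data(offset: f32, n: usize) -> Vec<f32> {
    (0..n).map(|i| 0.1 + 2.0 * (i as f32 + offset).cos()).collect()
}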

@LaurentMazare
Collaborator

Re tests, I think the current ones for matmul are actually pretty good as they generate random values with a fixed seed, see here; that said, I certainly won't object to having additional testing :)
Re improvements, sure, we could update. I tried to keep the code I wrote as close as possible to the original one so that it's easier to follow down the line, but it's also annoying to have code that doesn't look very rust idiomatic. We could also hold off until we have plugged in everything and can play with the quantization, measure its effects, and check that the improvements are actual improvements in the same way folks do it on the llama.cpp side.

@LaurentMazare
Collaborator

Re GGUF, as the spec turned out to be fairly simple, I've added support for it in candle and the quantized example now works with these too (tested with both q4 and f16, #559). We'll switch to use it for the default models once they are available on the hub.

@LLukas22
Contributor

Re GGUF, as the spec turned out to be fairly simple, I've added support for it in candle and the quantized example now works with these too (tested with both q4 and f16, #559). We'll switch to use it for the default models once they are available on the hub.

Nice. Hopefully the hf tokenizer file will be included in the uploaded GGUF files.

Since I struggled a bit with the q2k vec-dot implementation, I decided to port over the ggml unit tests just to ensure that the existing quantization and matmul functions work as expected.

@LaurentMazare
Collaborator

The vocab/tokens are indeed already in the GGUF files produced by ./quantize. That said these are not totally straightforward to inject in the tokenizers library so we may have to do some work if we want to rely on a single config file (currently we just also use the default llama tokenizer config which is small and most likely always cached).

@LLukas22
Contributor

The vocab/tokens are indeed already in the GGUF files produced by ./quantize. That said these are not totally straightforward to inject in the tokenizers library so we may have to do some work if we want to rely on a single config file (currently we just also use the default llama tokenizer config which is small and most likely always cached).

GGUF was designed to contain a tokenizer.huggingface.json field which simply contains the content of a tokenizer.json file, to be directly used with the tokenizers crate. I will add this field for all models I convert, but the current llama.cpp conversion script doesn't seem to add it.
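
Assuming that metadata string has already been read out of the GGUF file, turning it into a tokenizer should be a one-liner with the tokenizers crate (sketch, from memory):

use tokenizers::Tokenizer;

// Sketch: `json` is the raw content of the `tokenizer.huggingface.json` GGUF
// metadata field, i.e. an ordinary tokenizer.json payload.
fn tokenizer_from_gguf_metadata(json: &str) -> Tokenizer {
    Tokenizer::from_bytes(json.as_bytes()).expect("invalid tokenizer.json payload")
}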

@LaurentMazare
Collaborator

Ah that's great, that will make things a lot easier.
Fwiw, you can use the tensor-tools binary to look at metadata in gguf files (or list tensor infos for any supported file format, safetensors/npz/pytorch/...) e.g.:

cargo run --example tensor-tools --release -- ls ../llama.cpp/models/7B/ggml-model-q4_0.gguf --verbose

My hope is to add more to this command line tool, e.g. conversion functions from one format to another, quantization functions, etc.

@abuelnasr0

I would like to contribute to candle by porting quantization and vec_dot from llama.cpp into candle. I can add those methods for q4_1, q5_0, q5_1, q8_1, and then I can also add the avx implementation.
@LaurentMazare @LLukas22 commenting before I start because I don't know if one of you has a plan to implement this. Can I open a pull request for that?

@LaurentMazare
Collaborator

Ah, actually I've started adding these and hope to have a PR ready later today (had to step out for a bit); it will also add a bunch more testing and should make lining things up with the C++ version easier. Sorry for the collision @abuelnasr0

@LLukas22
Contributor

@abuelnasr0 I only looked at the avx implementation for the k-quants but I haven't started porting anything. If you want, you could also give the CUDA implementation a try.

@LLukas22
Contributor

Alright, I played around a bit with the CUDA implementation, but I had some trouble with the CudaDType trait and the CudaStorageSlice enum: I wanted to implement CudaDType for the QTensor, but I guess that's not the right approach.
It's probably easier to work with CudaSlice<T> and CudaView<> directly, as the result will always be an f32 tensor.

@LaurentMazare
Collaborator

Closing this now in favor of more recent issues, e.g. #1250 (and #1754 should bring some preliminary support for quantized models on cuda).
