Quantized much slower than llama.cpp with same model and settings... #1939
Comments
Just adding some more data here as I investigate the root cause of the differences. I should have some results later on showing where we slow down compared to llama.cpp.
CANDLE (metal)
CANDLE (metal,accelerate)
LLAMA.CPP
|
Our mistral.rs achieves llama.cpp speeds with CUDA on the |
On the CUDA backend, performance is now roughly comparable with llama.cpp. Most of the change came from #1978, which was merged earlier this week. My timings on an RTX 2080 (before the change, candle was at ~34 token/s)
So my guess is that there is something specific to metal that we're not getting right at the moment. |
One idea I want to look into is reducing the number of calls to the encoder by using Metal's argument buffers. It seems like a nice starting point to see whether that reduces the pressure on instructions. https://developer.apple.com/documentation/metal/buffers/improving_cpu_performance_by_using_argument_buffers?language=objc |
Thanks for putting these up. |
Let me look into a better way to export data from this. The debugger Xcode has is nice, but it's brutally slow even for a trace I captured with 10 output tokens. It also has no obvious way to query or aggregate information to see which operations take the most time. In the meantime I'll do some more reading on how to use the debugger effectively and try to find an appropriate workflow for identifying bottlenecks.
I went down a rabbit hole trying to find half-decent documentation or sources on optimizations for MPS. Apple did a great job of throwing information at the wall without any kind of map to link it together.
I'll put a PR up later today or this weekend with the changes I made to the quantized model to enable outputting the gputrace information, along with documentation so others can instrument any example, model, or tensor operation they want. Nice resource I found that does more code level optimization review: |
One thing to note is that the only two "counters" (the graphs at the bottom) that were saturated were the ones shown in the photo: the instruction throughput limiter and the compute shader launch limiter. I'm struggling to find out what exactly these two measure and what a 100% reading on them means, or whether it's a red herring. |
Just another little screenshot: I ran with a smaller sample size and the debugger gave me a much friendlier analysis. I imagine it's expected that we spend the majority of our time in this matmul operation. I would be interested to see how this compares to the CUDA results, @LaurentMazare. If you would like to replicate, you can do so on this branch: Note: the sample length needs to be that short, or else opening the gputrace will cause the debugger to enter "lite" mode or crash your computer... 😰 |
I should also note that I see drastically different breakdowns of how the compute is allocated between my M1 machine and my M3 machine |
Quick debugging locally shows that our calls to reshape are taking the non-contiguous branch a fair amount. I'm unsure how we could avoid or optimize this, if anyone has ideas! |
Is this part of the trace from when processing the prompt or for subsequent tokens? I'm a bit surprised to see a copy-strided if it's for the subsequent tokens as the goal is to avoid these. |
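For readers following the reshape discussion: a reshape can reuse the existing buffer only when the tensor's strides describe a dense row-major layout. Otherwise the data has to be copied into a contiguous buffer first, which is the copy-strided seen in the trace. Below is a minimal standalone sketch of that check. It is not candle's actual implementation, and the function name `is_row_major_contiguous` is made up for illustration.

```rust
/// Returns true when `strides` (counted in elements) describe a dense
/// row-major layout for `shape`, i.e. the case where a reshape needs no copy.
/// Hypothetical helper for illustration only, not part of candle.
fn is_row_major_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    if shape.len() != strides.len() {
        return false;
    }
    let mut expected = 1usize;
    // Walk from the innermost dimension outwards.
    for (&dim, &stride) in shape.iter().zip(strides.iter()).rev() {
        // A dimension of size 1 never affects addressing, so any stride is fine.
        if dim != 1 && stride != expected {
            return false;
        }
        expected *= dim;
    }
    true
}

fn main() {
    // A (2, 3) tensor stored densely has strides (3, 1).
    println!("{}", is_row_major_contiguous(&[2, 3], &[3, 1]));
    // Its transposed view has shape (3, 2) and strides (1, 3).
    println!("{}", is_row_major_contiguous(&[3, 2], &[1, 3]));
}
```

The transposed view fails the check, so reshaping it would force a strided copy, while the dense tensor passes.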
Here are some instructions to reproduce the llama.cpp gputrace. Check out this branch, which has some tweaks to make Metal compile in the debugger: https://github.com/tomsanbear/llama.cpp/tree/MetalDebugger Run |
Is it still including the prompt processing? (I suspect it does as the |
I start the capture right before
So this should only be capturing after prompt processing, as such likely a bug... |
Ah I think that's actually a mistral specificity because of mqa. Could you try using the standard llama 7b model and see what you get? You can just do so by specifying no model in the |
This closes the gap a bit more Candle
Llama.cpp
|
I've put together a small hack to get around the strided copy in #2043. On my M2 Pro 16GB it significantly increases the speed (from 25 tokens/s to 31.6 tokens/s) for mistral q4k generating 150 tokens. |
Here are some updated stats on a MacBook Pro M2 Pro (14,9) with the mistral model as of the current main branch. Timings do not include the prompt processing, where llama.cpp is still much better.
So it's still slower but not by that much.
edit: adding |
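To make "timings do not include the prompt processing" concrete, here is a rough sketch of one way to time only the decode phase. The closures stand in for the real prompt and per-token steps, and the names `time_decode` and `tokens_per_second` are hypothetical, not taken from candle or llama.cpp.

```rust
use std::time::{Duration, Instant};

/// Tokens per second for a given number of generated tokens and elapsed time.
fn tokens_per_second(generated: usize, elapsed: Duration) -> f64 {
    generated as f64 / elapsed.as_secs_f64()
}

/// Runs `prompt_step` once without timing it, then times `n` calls of
/// `decode_step`. Both closures are placeholders for real model calls.
fn time_decode<P: FnOnce(), D: FnMut()>(prompt_step: P, mut decode_step: D, n: usize) -> Duration {
    prompt_step();
    // The clock starts only after the prompt has been processed.
    let start = Instant::now();
    for _ in 0..n {
        decode_step();
    }
    start.elapsed()
}

fn main() {
    let mut produced = 0usize;
    let elapsed = time_decode(
        // Stand-in for prompt processing (excluded from the measurement).
        || std::thread::sleep(Duration::from_millis(20)),
        // Stand-in for generating one token.
        || {
            produced += 1;
            std::thread::sleep(Duration::from_millis(1));
        },
        10,
    );
    println!("{:.1} token/s", tokens_per_second(produced, elapsed));
}
```

Measuring this way keeps a slow prompt phase from dragging down the reported generation speed, which matters when the two phases are being compared separately.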
@tomsanbear I saw this issue, I think #2615 is quite relevant! It boosts the prompt speeds to comparable to llama.cpp! |
quantized compiled using --> cargo build --example quantized -r --features metal
Unsure of... how many layers accelerated / how many threads used / clearly different sample stages
..yet I presume the speed should be on par... ?
CANDLE
./quantized --model mistral-7b-instruct-v0.2.Q4_K_S.gguf --which 7b-mistral-instruct-v0.2 --prompt "Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?" --sample-len 2048 --temperature 0.1 --seed 1337 --top-p 0.950 --repeat-penalty 1.100 --repeat-last-n 64
--> 31.83 t/s
avx: false, neon: true, simd128: false, f16c: false / temp: 0.10 repeat-penalty: 1.10 repeat-last-n: 64 / loaded 291 tensors (4.14GB) in 0.09s
LLAMA.CPP
./main -p "Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?" -m mistral-7b-instruct-v0.2.Q4_K_S.gguf -n 128 -ngl 33 --threads 8 --seed 1337
--> 51.30 t/s
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 1
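As a small aid for comparing the two figures above with the per-operation times in a GPU trace, throughput can be converted into an average time per token. This is only arithmetic on the numbers quoted in this issue; the helper name `ms_per_token` is made up for illustration. Note that the two runs used different settings (a sample length of 2048 and temperature 0.1 for candle, 128 tokens and temp = 0.800 for llama.cpp), so the ratio is indicative only.

```rust
/// Converts a throughput in tokens/s into the average time per token in ms.
/// Hypothetical helper for illustration only.
fn ms_per_token(tokens_per_second: f64) -> f64 {
    1000.0 / tokens_per_second
}

fn main() {
    // Figures quoted in this issue.
    let candle = 31.83_f64;
    let llama_cpp = 51.30_f64;
    println!("candle:    {:.2} ms/token", ms_per_token(candle));
    println!("llama.cpp: {:.2} ms/token", ms_per_token(llama_cpp));
    println!("ratio:     {:.2}x", llama_cpp / candle);
}
```

With these numbers llama.cpp comes out roughly 1.6x faster per generated token.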