-
It's because prompt processing on the CPU is a lot slower than on the GPU. This can be mitigated somewhat by using a cuBLAS build of llama-cpp-python. If you're trying to make the GPU faster and you aren't using it already, you can replace GPTQ-for-LLaMa with AutoGPTQ. If you are trying to load 13B models on your GPU, it will always be slower than with GGML because it requires offloading.
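A minimal sketch of the cuBLAS route, assuming llama-cpp-python has been reinstalled with cuBLAS enabled; the model path and `n_gpu_layers` value below are placeholders, not settings from this thread:

```python
# Assumed setup: llama-cpp-python rebuilt with cuBLAS, e.g.
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# Model path and layer count are hypothetical and would need tuning for an 8 GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/vicuna-13b.ggmlv3.q4_0.bin",  # hypothetical GGML file
    n_gpu_layers=24,   # offload part of the model to the GPU; 0 = pure CPU
    n_ctx=2048,        # context window
)

out = llm("Q: Why is prompt processing slower on CPU? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

With a cuBLAS build the prompt batch is evaluated on the GPU, which should remove most of the long "thinking" delay before the first token that you see with a pure CPU build.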
-
Hi
I have been playing with different models for several weeks: HF / 4-bit / 8-bit models that run on the GPU (CUDA), as well as GGML models that run on the CPU.
I've noticed a strange thing: why does a GGML model on the CPU end up running faster than the GPU model? Yes, GPU models start writing immediately (but slowly), while CPU models can think for 1-2 minutes (especially while they are being loaded into RAM), but then they start typing very quickly; GPU models always run at one speed, and they lose on long texts.
Is it just me?
Tested on 7B-13B Vicuna/LLaMA models.
HW:
CPU: i9 9900k
RAM: 32 GB
GPU: 2080 8 GB