-
It's because prompt processing on the CPU is a lot slower than on the GPU. This can be mitigated somewhat by using a cuBLAS build of llama-cpp-python. If you're trying to make the GPU faster and you aren't using it already, you can replace GPTQ-for-LLaMa with AutoGPTQ. If you are trying to load 13B models on your GPU, it will always be slower than with GGML because it requires offloading.
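A minimal sketch of the cuBLAS route, assuming llama-cpp-python has been reinstalled with cuBLAS enabled; the model path and `n_gpu_layers` value below are placeholders, not settings from this thread:

```python
# Assumed setup: llama-cpp-python rebuilt with cuBLAS, e.g.
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# Model path and layer count are hypothetical and would need tuning for an 8 GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/vicuna-13b.ggmlv3.q4_0.bin",  # hypothetical GGML file
    n_gpu_layers=24,   # offload part of the model to the GPU; 0 = pure CPU
    n_ctx=2048,        # context window
)

out = llm("Q: Why is prompt processing slower on CPU? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

With a cuBLAS build the prompt batch is evaluated on the GPU, which should remove most of the long "thinking" delay before the first token that you see with a pure CPU build.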
-
Hi
I have been playing with different models for several weeks: HF / 4-bit / 8-bit models that run on the GPU (CUDA), as well as GGML models that run on the CPU.
I've noticed a strange thing: why does a GGML model on the CPU end up running faster than the GPU model? Yes, GPU models start writing immediately (but slowly), while CPU models can think for 1-2 minutes (especially while they are being loaded into RAM), but then they start typing very quickly; GPU models always run at one speed, and they lose on long texts.
Is it just me?
Tested on 7B-13B Vicuna/LLaMA models.
HW:
CPU: i9 9900k
RAM: 32 GB
GPU: 2080 8 GB