Inability to use capabilities of dGPU, CLBlast(Old CPU) + other suggestions. #1272

Open
Luro223 opened this issue Dec 18, 2024 · 49 comments

@Luro223

Luro223 commented Dec 18, 2024

Hello dear developer of KoboldAI CPP.

I've been using 1.79.1 since its release. My machine has an i3-3120M with HD Graphics 4000 and an HD 8700M. Since 1.79.1 I can finally use Vulkan with the HD 8700M, and it is a bit faster now, but I still can't use the dGPU's capabilities: KoboldAI CPP still runs on the CPU only and uses just a small amount of GPU memory. It also crashes if I use more than 0 GPU layers with Vulkan.

CLBlast worked earlier with HD Graphics 4000, but it stopped working with 1.7X. AMD Accelerated Parallel Processing never actually worked with CLBlast.

Using --sdvaeauto slightly increases performance; I'll show the results later.
Edit: --sdvaeauto slightly increases performance, but only in rare cases. I need to test more.

Also, could you add a way to toggle DRY, XTC, Temperature, Top-P, and Top-K on and off, either from the command-line interface or the GUI? I don't use any of these with some models, and I think some of them, especially XTC, might affect performance even when unused (value 0).

@Luro223
Author

Luro223 commented Dec 18, 2024

Comparison of Vulkan and Old CPU with 1K context.
Some performance boost, yet no dGPU utilization.
.....Vulkan, HD8700M.....
Processing Prompt (924 / 924 tokens)
Generating (100 / 100 tokens)
[00:00:00] CtxLimit:1024/1024, Amt:100/100, Init:0.10s, Process:378.21s (409.3ms/T = 2.44T/s), Generate:50.80s (508.0ms/T = 1.97T/s), Total:429.01s (0.23T/s)
Benchmark Completed - v1.79.1 Results:
.....
Flags: NoAVX2=True Threads=4 HighPriority=False Cublas_Args=None Tensor_Split=None BlasThreads=4 BlasBatchSize=-1 FlashAttention=True KvCache=2
Timestamp: 2024-12-18 00:00:00.000000+00:00
Backend: koboldcpp_vulkan_noavx2.dll
Layers: 0
Model: mistral7b-erebus-v3.Q4_K_M
MaxCtx: 1024
GenAmount: 100
.....
ProcessingTime: 378.215s
ProcessingSpeed: 2.44T/s
GenerationTime: 50.795s
GenerationSpeed: 1.97T/s
TotalTime: 429.010s
Output: 1 1 1 1
.....
.....CPU (Old CPU).....
Processing Prompt (924 / 924 tokens)
Generating (100 / 100 tokens)
[00:00:00] CtxLimit:1024/1024, Amt:100/100, Init:0.09s, Process:387.69s (419.6ms/T = 2.38T/s), Generate:53.23s (532.3ms/T = 1.88T/s), Total:440.93s (0.23T/s)
Benchmark Completed - v1.79.1 Results:
.....
Flags: NoAVX2=True Threads=4 HighPriority=False Cublas_Args=None Tensor_Split=None BlasThreads=4 BlasBatchSize=-1 FlashAttention=True KvCache=2
Timestamp: 2024-12-18 00:00:00.000000+00:00
Backend: koboldcpp_noavx2.dll
Layers: 0
Model: mistral7b-erebus-v3.Q4_K_M
MaxCtx: 1024
GenAmount: 100
.....
ProcessingTime: 387.695s
ProcessingSpeed: 2.38T/s
GenerationTime: 53.235s
GenerationSpeed: 1.88T/s
TotalTime: 440.930s
Output: 1 1 1 1
.....
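
The T/s figures in both logs can be re-derived from the token counts and times, which also makes the gap between the two backends explicit. A minimal sketch, with the numbers copied from the benchmark output above; the function name and the dictionary layout are illustrative only.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput in tokens per second, as shown in the benchmark summary."""
    return tokens / seconds


# Values copied from the two benchmark logs above.
PROMPT_TOKENS = 924
GENERATED_TOKENS = 100
RUNS = {
    "Vulkan (HD 8700M)": {"process_s": 378.215, "generate_s": 50.795},
    "CPU (Old CPU)": {"process_s": 387.695, "generate_s": 53.235},
}

for name, run in RUNS.items():
    processing = tokens_per_second(PROMPT_TOKENS, run["process_s"])
    generation = tokens_per_second(GENERATED_TOKENS, run["generate_s"])
    print(f"{name}: processing {processing:.2f} T/s, generation {generation:.2f} T/s")

# Ratio of prompt-processing times: values above 1.0 mean Vulkan was faster.
speedup = RUNS["CPU (Old CPU)"]["process_s"] / RUNS["Vulkan (HD 8700M)"]["process_s"]
print(f"Vulkan prompt processing is {100 * (speedup - 1):.1f}% faster than CPU")
```

The difference comes out at only a few percent, which is consistent with the dGPU doing little work when 0 layers are offloaded.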

@LostRuins
Owner

CLBlast with 0 layers doesn't work at all?

@Luro223
Author

Luro223 commented Dec 18, 2024

CLBlast with 0 layers doesn't work at all?

Yes, not at all: neither with HD Graphics 4000 nor with AMD Accelerated Parallel Processing (Oland). It worked with versions earlier than 1.7X, but only with HD Graphics 4000.

@Luro223
Author

Luro223 commented Dec 18, 2024

CLBlast(Old CPU) with 0 layers(1.79.1)
Loading model: L:\AI-Models\mistral7b-erebus-v3.Q4_K_M.gguf
Traceback (most recent call last):
  File "koboldcpp.py", line 5009, in <module>
    main(parser.parse_args(),start_server=True)
  File "koboldcpp.py", line 4630, in main
    loadok = load_model(modelname)
  File "koboldcpp.py", line 930, in load_model
    ret = handle.load_model(inputs)
OSError: [WinError -1073741795] Windows Error 0xc000001d
[776] Failed to execute script 'koboldcpp' due to unhandled exception!
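
For reference, the negative number in the OSError above is the signed 32-bit form of the hex code printed next to it. A minimal Python sketch of the conversion follows; the helper name `to_ntstatus` and the small lookup table are illustrative and not part of koboldcpp. 0xC000001D is the Windows status code STATUS_ILLEGAL_INSTRUCTION, which would be consistent with a library built for AVX2 being loaded on a CPU without it, though the traceback alone does not confirm that.

```python
# Two well-known NTSTATUS codes; this table is deliberately tiny.
KNOWN_STATUS = {
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_ntstatus(win_error: int) -> int:
    """Reinterpret a signed 32-bit WinError value as an unsigned NTSTATUS code."""
    return win_error & 0xFFFFFFFF


code = to_ntstatus(-1073741795)  # value taken from the traceback above
print(f"0x{code:08X}", KNOWN_STATUS.get(code, "unknown status"))
```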

@Luro223
Author

Luro223 commented Dec 18, 2024

Same error with Oland.
If WinDbg or System Informer is enough to capture the error, I could use one of them and send you the dump.

@Luro223
Author

Luro223 commented Dec 18, 2024

Also, whether or not it's really the case, I noticed that 1.79.1 is less creative than 1.69.1. I don't know whether it's because of DRY or XTC, but the outputs are always different from 1.69.1, even if I disable DRY and XTC.
For example, 1.69.1 gives more creative responses, while 1.79.1 gives more logical responses, even with DRY/XTC at 0 and with different models. I can reproduce this and show you if you're interested.

@LostRuins
Owner

The 2 versions shouldn't have any difference in creativity. 1.80 was just released; you can try that.

@Luro223
Author

Luro223 commented Dec 20, 2024

After comprehensive testing of 1.80: the same errors, but a noticeable performance boost (especially with longer contexts).
1.79.1 VS 1.80 (Vulkan-HD8700M)
Vulkan-HD8700M-1.79.1.txt
Vulkan-HD8700M-1.80.0.txt
CLBlast 1.69.1 VS 1.80.0 (to prove that 1.80.0 still doesn't work). I've also noticed that 1.7X-1.80 uses koboldcpp_clblast.dll while 1.69.1 uses koboldcpp_clblast_noavx2.dll. 1.80.0 ships this library too, but I never saw it being used:
CLBlast (Old CPU)-I3-3120M.txt
CLBlast (Old CPU)-Oland.txt
CLBlast NoAVX2 (Old CPU)-I3-3120M(1.69.1).txt

@Luro223
Author

Luro223 commented Dec 20, 2024

Also, to clear up any confusion about creativity, I'll provide all my fine-tuned custom UI settings and a saved Character Card, so you'll be able to reproduce all the creativity-related errors.
UI Settings :
set-005.zip
Test Character Cards (GENERATED RESULTS) :
1.69.1,1.80.zip
Test Character Card (CLEAN TO GENERATE) :
Kirby-clean.zip
Main LLM I used for all this :
L3-8B-Stheno-v3.2-NEO-V1-D_AU-Q4_K_M-imat13 Link

@Luro223
Author

Luro223 commented Dec 20, 2024

All the errors happen when you change DRY, even when Mult./Base/A.Len are 0. DRY cannot be disabled from the UI or from command-line arguments, and neither can the others such as XTC, Top-K, etc.
For example, if all the DRY settings are 0, everything generated is weird.

@Luro223
Author

Luro223 commented Dec 20, 2024

Also, GPU utilization is still 0% with 1.80: only about 40MB of GPU memory is used, and the CPU is at 100%:
76667
Other presets also perform very well (Old CPU), and I already showed the comparison for Vulkan:
1.79.1.txt
1.80.0.txt
(Only 256 context, but it works better with bigger contexts.)

@Luro223
Author

Luro223 commented Dec 20, 2024

Also Vulkan with more than 0 layers crashes :
Vulkan-1Layers.txt

@LostRuins
Owner

It will only use noavx2.dll if you selected the "old cpu" option. If you have AVX2 support you should not use that!

@Luro223
Author

Luro223 commented Dec 21, 2024

It will only use noavx2.dll if you selected the "old cpu" option. If you have AVX2 support you should not use that!

Then what other option do I have? CPU (Old CPU) works, but CLBlast (Old CPU) doesn't: it uses koboldcpp_clblast.dll even with the --noavx2 --nommap --usecpu flags, so the koboldcpp_clblast_noavx2.dll library is completely unusable.

@Luro223
Author

Luro223 commented Dec 21, 2024

If I delete koboldcpp_clblast.dll and rename koboldcpp_clblast_noavx2.dll to koboldcpp_clblast.dll, it works surprisingly well (I only used 256 context for testing):
1.80_CLBlast_noavx2-FORCED.txt
1.80_CLBlast_noavx2-FORCED-BENCHMARK.txt (Comparison between CLBlast (Intel OpenCL, HD Graphics 4000) and CPU (Old CPU))
1.80_CLBlast_noavx2-FORCED-BENCHMARK-Intel(R) HD Graphics 4000.txt(Intel(R) HD Graphics 4000 only)
1.80_CLBlast_noavx2-FORCED-BENCHMARK-Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz.txt(i3-3120M only)
(Mostly because I have 2 Intel OpenCL devices: one used directly (called Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz) and one from the driver (Intel(R) HD Graphics 4000).)
Also, Use CPU (Old CPU) is a bit faster than Intel's OpenCL, but OpenCL is faster with longer contexts.
Also, Oland with CLBlast still doesn't work and gives exactly the same error as 1.69.1 (cl_khr_f16 (not supported)):
1.80_CLBlast_noavx2-FORCED-Oland.txt

@Luro223
Author

Luro223 commented Dec 21, 2024

I've tested CLBlast with the forced library (koboldcpp_clblast_noavx2.dll), and it's a bit faster than Vulkan. That proves the 0% GPU utilization and shows how much faster direct OpenCL usage is when Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz is selected:
CLblast-HD Graphics 4000-1.80.0.txt
CLblast-Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz-1.80.0.txt

@Luro223
Author

Luro223 commented Dec 21, 2024

1.80.1 - same errors, and CLblast still works, but only with library replacement:
CLblast-Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz-1.80.1.txt

@Luro223
Author

Luro223 commented Dec 21, 2024

@LostRuins Please tell me which tools I could use to debug these errors so you can see the exact problem, because KoboldAI's own 'Debug Mode' UI option doesn't provide enough information.
Also, 1.80.1 still gives completely different output, with different creativity but slightly better logic, and if I change DRY to 0 it outputs complete nonsense.

@LostRuins
Owner

Looking at your logs, I can see that the noavx2 flag is not being set at all (which is why it's not being used).

image

Also it looks like you set blasbatchsize to -1, which disables batch processing.

You might want to check your launch parameters to make sure --noavx2 has been set, either in CLI, or by selecting it in the launcher.
image

@Luro223
Author

Luro223 commented Dec 22, 2024

@LostRuins So you've completely ignored all my messages, which already answer exactly what I'm about to type again. Even when I launch with --showgui --noavx2, it still uses koboldcpp_clblast.dll instead of koboldcpp_clblast_noavx2.dll.
noavx2 is still False even with --showgui --noavx2 :
CLblast-Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz-1.80.1-NOAVX2.txt
Even with --showgui --noavx2 --nommap --usecpu :
CLblast-Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz-1.80.1-NOAVX2-NOMMAP-USECPU.txt
Even if I launch without my predefined config for UI.

@Luro223
Author

Luro223 commented Dec 22, 2024

If the KoboldCPP GUI uses noavx2=false even with the flag set, then the issue is in the GUI itself. It still works if I replace the library.
But it works from terminal without any library replacement :
CLblast-Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz-1.80.1-NOAVX2-Terminal.txt

@LostRuins
Owner

LostRuins commented Dec 23, 2024

No, I am not ignoring your messages. I'm saying that running with --noavx2 does indeed load the noavx2 library.

image

image

image

The txt file you are sending me does not seem to match the command lines you have sent. Somehow, you seem to be running a benchmark? Are you loading another config file by mistake? That will override the flags you set.

image

@Luro223
Author

Luro223 commented Dec 23, 2024

@LostRuins No. Even with noavx2=false Vulkan (Old CPU) will use koboldcpp_vulkan_noavx2.dll, but with CLBlast (Old CPU), it will use koboldcpp_clblast.dll, even if I use --noavx2. It will ONLY use koboldcpp_clblast_noavx2.dll if I run directly from terminal without actually using any UI.
And NO, even with --noavx2 flag, the UI will still use koboldcpp_clblast.dll, EVEN if I run without my config. But if you've checked my config earlier, my config contains "noavx2": true, but with/without this same config, it still uses koboldcpp_clblast.dll.
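
The library choice described in this thread can be summarized as a small lookup. This is only a sketch of the expected behaviour, using the DLL names that appear in the logs above; the function name `expected_library` and the table are hypothetical and are not koboldcpp's actual selection code.

```python
# Hypothetical lookup: (backend, noavx2 flag) -> library name seen in this thread.
# Combinations that never appear in the thread are intentionally left out.
EXPECTED = {
    ("cpu", True): "koboldcpp_noavx2.dll",
    ("vulkan", True): "koboldcpp_vulkan_noavx2.dll",
    ("clblast", True): "koboldcpp_clblast_noavx2.dll",
    ("clblast", False): "koboldcpp_clblast.dll",
}


def expected_library(backend: str, noavx2: bool) -> str:
    """Return the library one would expect to be loaded for a backend and flag."""
    try:
        return EXPECTED[(backend, noavx2)]
    except KeyError:
        raise ValueError(f"combination not covered here: {backend}, noavx2={noavx2}") from None


# What the GUI launch reportedly loaded with --noavx2 and the CLBlast backend.
observed_from_gui = "koboldcpp_clblast.dll"
expected = expected_library("clblast", noavx2=True)
print(f"expected {expected}, observed {observed_from_gui}, match: {expected == observed_from_gui}")
```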

@Luro223
Author

Luro223 commented Dec 23, 2024

@LostRuins You can directly reproduce this error if you run with these flags : --showgui --noavx2
It'll use koboldcpp_clblast.dll regardless if you run from the UI, even with the --noavx2 flag.

@Luro223
Author

Luro223 commented Dec 23, 2024

@LostRuins WITHOUT config.
Here we go again.txt
Used --usecl 1 0 --show --noavx2(Only loaded model and started, no config) :
AAA0
AAA1

@Luro223
Author

Luro223 commented Dec 23, 2024

Still koboldcpp_clblast.dll, ALWAYS koboldcpp_clblast.dll with UI.

@Luro223
Author

Luro223 commented Dec 23, 2024

@LostRuins Also, can you add a "Break" button to forcefully stop a prompt from the browser frontend?
Sometimes it keeps generating even if I press stop, ignoring everything until "Processing Prompt" has fully completed.

@LostRuins
Owner

Okay I think I see the bug. I will do a new build

@Luro223
Author

Luro223 commented Dec 23, 2024

@LostRuins Awesome. What do you think about adding toggles for functions like DRY, XTC, etc. to the UI and/or as flags? With some models I don't use things like DRY, XTC, Top-K, Top-P, or Temperature, and disabling some of them might increase performance, especially on low-end machines.

Also, what can I do to give you enough info to fix the lack of GPU utilization on my Vulkan device?

@Luro223
Author

Luro223 commented Dec 23, 2024

I still can't use my Vulkan device, even with 1.80.1: it uses only about 20MB of GPU memory and still runs on the CPU only, as I mentioned earlier.

@LostRuins
Owner

@Luro223 fix is up, please try latest version 1.80.3

@LostRuins
Owner

Meanwhile, what error does Vulkan give you when you try to use it with offloaded layers?

@Luro223
Author

Luro223 commented Dec 23, 2024

@LostRuins Thanks, CLBlast works from the terminal, and with custom settings too:
1.80.3-Fix-CLblast.txt
1.80.3-Fix-CLblast-Terminal.txt
And Oland still gives same errors:
1.80.3-CLblast-Oland.txt

@Luro223
Author

Luro223 commented Dec 23, 2024

Meanwhile, what error does Vulkan give you when you try to use it with offloaded layers?

Edit: Sorry, the errors are the same as in the log attached for 1.80.1:
1.80.1-Vulkan-1Layers.txt
access violation writing 0x0000000000001000

@Luro223
Author

Luro223 commented Dec 23, 2024

@LostRuins More layers - same errors.

@Luro223
Author

Luro223 commented Dec 24, 2024

@LostRuins Any news? Or maybe additional tools for me to debug more info from these errors?

@LostRuins
Owner

You need to disable quantized KV cache. It's not supported with Vulkan.

@Luro223
Author

Luro223 commented Dec 25, 2024

@LostRuins After some tests I noticed that the GPU kind of works with quantized KV off, but max GPU utilization was 64%, and CLBlast with KV2 is significantly faster than Vulkan, even when Vulkan has 5 layers offloaded (it crashes with more layers).
Vulkan, 5 layers, FlashAttention off, ContextShift off:
1.80.3-Vulkan-5Layers-NoKV.txt
CLblast (Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz) with 0 layers, FlashAttention on, KvCache=2, ContextShift off:
1.80.3-CLblast-0Layers-KV2.txt
So, is there any way to make KoboldAI use 100% GPU utilization instead of 64%?
Also ~1055MB out of 2048MB is being used from dedicated graphics memory.
Please tell me if I missed out something important.

@LostRuins
Owner

Why are you using BlasBatchSize = -1? That basically negates the prompt processing speedup of the GPU.

@Luro223
Author

Luro223 commented Dec 27, 2024

Why are you using BlasBatchSize = -1? That basically negates the prompt processing speedup of the GPU.

Yeah, it worked: with a test context size of 256 I got under 60 seconds instead of 100. But after experimenting with BLAS it crashed midway at BLAS 512, and now I can't run with any BLAS setting; only no BLAS with 0 layers works (will show later):
1.80.3-Vulkan0-blas512.txt

@Luro223
Author

Luro223 commented Dec 27, 2024

Another error, similar to the first one, but this time it crashed even with BLAS 256. As with the previous one, the Vulkan device becomes completely unusable afterwards, even after a full dGPU driver reload; only a full reboot helps:
1.80.3-Vulkan0-blas256.txt

@Luro223
Author

Luro223 commented Dec 27, 2024

I also tested 1 layer with no BLAS and 0 layers with BLAS 32; after the error (midway crash), only 0 layers with no BLAS works:
1.80.3-Vulkan0-0layers-Blas32.txt
1.80.3-Vulkan0-1layer-noBlas.txt

@Luro223
Author

Luro223 commented Dec 28, 2024

@LostRuins After long testing I've learned that I only have 1GB of VRAM, so such powerful capabilities will be limited by VRAM... Still, I've managed to launch it, and here are the results:
aak.txt
Much better results: ~472 seconds with CLBlast (i3-3120M) vs ~178 seconds with the HD 8700M.
Can you allow a custom BLAS Batch Size, so I could select a more precise value (for example 900MB) for my VRAM instead of the fixed values? Or is that not possible because of a special algorithm?

@LostRuins
Owner

You can select a blasbatchsize from the supported list
[32,64,128,256,512,1024,2048]

Using a custom value apart from the above is not supported at this time. You can try either 128 or 256.
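
Since only the listed values are accepted, a requested number has to be snapped to one of them. A minimal sketch of rounding down to the largest supported value; the function name `nearest_supported` is illustrative and this is not koboldcpp code. The sketch treats the value as a plain number and makes no assumption about how it maps to VRAM use.

```python
# Supported values, copied from the list above.
SUPPORTED_BLAS_BATCH_SIZES = [32, 64, 128, 256, 512, 1024, 2048]


def nearest_supported(requested: int) -> int:
    """Return the largest supported batch size that does not exceed the request.

    Requests below the smallest supported value fall back to that smallest value.
    """
    candidates = [size for size in SUPPORTED_BLAS_BATCH_SIZES if size <= requested]
    return max(candidates) if candidates else SUPPORTED_BLAS_BATCH_SIZES[0]


for requested in (16, 200, 900, 4096):
    print(requested, "->", nearest_supported(requested))
```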

@Luro223
Author

Luro223 commented Dec 29, 2024

@LostRuins OK, I'll wait for custom values then, so I'll be able to use more than 512MB or 768MB.
Also, can you add toggles for DRY, XTC, Temperature, etc. in the UI or via flags? Disabling them might slightly increase performance, especially on lower-end systems, when one or several are not being used.
Also Happy New Year, wish you to have a great time with your relatives and friends.

@Luro223
Author

Luro223 commented Jan 21, 2025

@LostRuins After upgrading to 1.82, CLBlast (Older CPU) is way too slow, even 2X slower than Failsafe:

CLBlast NoAVX2 (Old CPU)-I3-3120M-1.81.1.txt
CLBlast NoAVX2 (Old CPU)-I3-3120M-1.82.2.txt
Failsafe-1.82.2.txt
Can you add checkboxes (AVX/AVX2) to the GUI to merge the several different lists into one, or maybe bring back CLBlast NoAVX2 (Old CPU)?

@Luro223
Author

Luro223 commented Jan 21, 2025

@LostRuins Another issue with 1.82.2:
For example if you select Vulkan (Old CPU) backend and save as a template, then load the saved template, it will always use CLBlast (Older CPU).
(Tested; it happens only with the Vulkan (Old CPU) backend.)
Another issue (in older versions too):
If you select a Lora Base without a Text Lora, the Lora Base adapter box will be empty after loading the saved template.

@Luro223
Author

Luro223 commented Jan 21, 2025

@LostRuins Also with MMAP unchecked it shows:
WARNING: Requested buffer size (4478459936) exceeds device memory allocation limit (2147483648)!
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: No suitable memory type found: ErrorOutOfDeviceMemory
load_tensors: offloading 1 repeating layers to GPU
load_tensors: offloaded 1/33 layers to GPU
load_tensors: Vulkan0 model buffer size = 132.50 MiB
load_tensors: CPU model buffer size = 4270.99 MiB
load_tensors: CPU model buffer size = 281.81 MiB
It still proceeds to load, though. Is this expected behaviour? (Same with older versions too.)
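
To put the warning's numbers in perspective, here is a small sketch that converts the two byte counts from the log to GiB. The helper name `to_gib` is illustrative. The requested buffer is roughly twice the reported allocation limit, and the `load_tensors` lines that follow show most of the model in CPU buffers.

```python
def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


# Values copied from the warning above.
requested_bytes = 4478459936
limit_bytes = 2147483648

print(f"requested: {to_gib(requested_bytes):.2f} GiB")
print(f"limit:     {to_gib(limit_bytes):.2f} GiB")
print(f"ratio:     {requested_bytes / limit_bytes:.2f}x the limit")
```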

@LostRuins
Owner

For example if you select Vulkan (Old CPU) backend and save as a template, then load the saved template, it will always use CLBlast (Older CPU).

This is now fixed in 1.82.3
