rocBLAS support #1060
Since Nvidia's cuBLAS support has been added, would it be possible to implement AMD's rocBLAS support as well? It would make this the first llama project with official support for AMD GPU acceleration.
Comments
I have a PR in draft #1087 for hipBLAS and HIP. If you can test, it would be great. Probably not on Windows, though. |
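For anyone wanting to test: a draft PR can be checked out directly via GitHub's pull refs. A minimal sketch, assuming the upstream llama.cpp repository; the local branch name is arbitrary:

```sh
# Fetch the draft PR #1087 into a local branch and switch to it.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin pull/1087/head:hipblas-test   # "hipblas-test" is an arbitrary local name
git checkout hipblas-test
```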
Is there any specific configuration to make it work? I built it with HIPBLAS and CUBLAS but it doesn't seem to work |
I have a Vega64 and I'm using Arch Linux. I have these packages:
Can you tell me what kind of issue you have? A problem compiling, some kind of error? What device and OS? |
I'm on Arch using a Radeon 5700XT. The problem is that the program doesn't use the GPU at all. The compilation works as expected and I have all the needed HIP and ROCm packages installed |
Does it say "BLAS = 1" when you start it? |
Nope, 0 |
That means it was not compiled with the right options. It will not configure any kind of acceleration automatically; it has to be enabled from the make/cmake command line. How did you compile it? |
Never mind, I rebuilt it correctly with make -j8 LLAMA_HIPBLAS=1 and the program reports BLAS=1; however, the GPU isn't used at all |
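For reference, a sketch of the build invocations discussed here. The make variable comes straight from the thread; the CMake option name is an assumption based on it:

```sh
# Make build with hipBLAS enabled (as used above):
make -j8 LLAMA_HIPBLAS=1

# CMake equivalent -- the option name is assumed to mirror the make variable:
mkdir build && cd build
cmake .. -DLLAMA_HIPBLAS=ON
cmake --build . --config Release
```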
What batch size did you use? The default is 8, which will make it skip any BLAS calls. Try |
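The exact suggestion was cut off above; a hedged sketch of raising the batch size past the BLAS threshold, assuming the -b/--batch_size option, with a placeholder model path and batch value:

```sh
# Batches of 8 (the default) skip BLAS; a larger value such as 512 should trigger it.
./main -m ./models/13B/ggml-model-q4_1.bin -b 512 -f prompt.txt
```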
Still 0 GPU usage |
The only difference between the official master branch and yours is that the load time on the official one is lower (490 ms vs 697 ms) |
Interesting; if it takes longer to load, then it may actually be loading the hipBLAS libraries, which I noticed can take some time. Can you try building and running it in Docker? I have some not-so-exact commands up there in the #1087 description. |
I set up the Docker container and cloned your repo inside, recompiled, and still get the same 0% GPU usage. |
How are you checking the GPU usage? Because if it tried to use ROCm and your card wasn't supported, the program would crash. Can you also check the libraries linked to main with |
Those are the linked libraries: I know about an issue with 5700 GPUs using ROCm that breaks half-precision mode (for example, in Stable Diffusion the card can only work in full precision). Could it be related to this issue? |
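The exact command was elided above; a typical way to check which ROCm/HIP libraries the binary links against might be:

```sh
# List the dynamic libraries linked into main and filter for ROCm/HIP ones.
ldd ./main | grep -Ei 'hip|roc'
```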
Are you sure the card is supported, though? Can you run some other ROCm/HIP test application?
There is no mention of the 5000 series.
It could be; there is at least one GemmEx call that uses f16, but everything else uses f32. |
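To illustrate the kind of call being discussed: a minimal standalone sketch of an f16-input GemmEx with f32 output and accumulation. This is not the actual call from PR #1087; the header and enum names follow hipBLAS as shipped with ROCm 5.x, and the buffers are left uninitialized since only the call shape matters here:

```cpp
// Compile with: hipcc gemmex_f16.cpp -lhipblas -o gemmex_f16
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <hipblas.h>   // newer ROCm installs use <hipblas/hipblas.h>
#include <cstdio>

int main() {
    const int m = 64, n = 64, k = 64;
    hipblasHandle_t handle;
    hipblasCreate(&handle);

    __half *a, *b;   // f16 inputs -- the part reported to misbehave on the 5700XT
    float  *c;       // f32 output
    hipMalloc(&a, m * k * sizeof(__half));
    hipMalloc(&b, k * n * sizeof(__half));
    hipMalloc(&c, m * n * sizeof(float));

    const float alpha = 1.0f, beta = 0.0f;
    hipblasStatus_t st = hipblasGemmEx(
        handle, HIPBLAS_OP_N, HIPBLAS_OP_N, m, n, k,
        &alpha,
        a, HIPBLAS_R_16F, m,   // A: half precision, column-major, lda = m
        b, HIPBLAS_R_16F, k,   // B: half precision, ldb = k
        &beta,
        c, HIPBLAS_R_32F, m,   // C and the compute type stay in f32
        HIPBLAS_R_32F, HIPBLAS_GEMM_DEFAULT);
    printf("GemmEx status: %d\n", (int)st);

    hipFree(a); hipFree(b); hipFree(c);
    hipblasDestroy(handle);
    return 0;
}
```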
As for GPU activity, I can tell that it's working because I can see the activity LEDs on the card and I can hear coil whine. nvtop shows nothing, and KDE System Monitor shows some activity, but the load is not constant: the GPU computes for a very short time, unlike a video game where the load is steady. Anyway, have you measured the prompt evaluation speed with and without this PR? Feed it some longer text like |
Ok, it works. Sorry for not trying this longer prompt first, but I'm quite new to this and still don't understand a lot about it. |
34.64 ms, is that for 13B or 7B? |
Vicuna 13B q4_1 |
Once I get home I'll test more, however that already seems great |
I've noticed that in the perplexity test the hipBLAS version doesn't calculate anything. It hangs at 100% GPU usage and just doesn't do anything. However, it seems related to a bug in this specific ROCm version, since the same bug happens in Stable Diffusion using the 5700XT; only ROCm 5.2 doesn't seem to suffer from this. |
There may also be some kind of bug if you use |
The perplexity test works perfectly with that flag on the 5700XT |
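The flag elided above is presumably the f32 KV-cache option that the thread later calls memory_f32; a hedged invocation sketch with placeholder paths:

```sh
# Force the KV cache to f32 instead of f16 while running the perplexity tool.
./perplexity -m ./models/13B/ggml-model-q4_1.bin -f wiki.test.raw --memory_f32
```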
Vicuna 13B-q4_1 |
I mean the program runs fine, but it is generating garbage. I need to test more, though. |
I see what you mean; I just tested the generation and it outputs gibberish. Also, the perplexity score is different |
I'm just now noticing that, under certain circumstances, the model loaded using hipBLAS generates garbage output compared to the non-hipBLAS version, even without the f32 memory flag. |
OK, that's good to know. I think maybe some of the CUDA kernels are not completely compatible with AMD; I have to check them. It may be possible to use just hipBLAS on its own without any custom GPU code, but that code would diverge more from the CUDA code. |
Try, for example, the chat-with-bob prompt. Using hipBLAS, it seems not to understand the context of the conversation. |
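For anyone reproducing this, an interactive invocation with the chat-with-bob prompt along the lines of the project README; the flags are hedged and the model path is a placeholder:

```sh
# Interactive chat using the chat-with-bob prompt shipped in prompts/.
./main -m ./models/13B/ggml-model-q4_1.bin -f prompts/chat-with-bob.txt \
       --color -i -r "User:" -n 256
```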
I don't see any problems here, with
I think I could create another branch with just hipBLAS for you to test; it would be slower, though |
It seems inconsistent in my case:
In this case, for example, it creates another character. If I let the model go on by itself, the output becomes almost random: |
This behavior only happens with the hipBLAS branch. |
Here is another example with the chat-with-bob prompt. No hipBLAS: |
You should use the perplexity tool to evaluate generation quality; this is not a reliable way of doing it. |
The perplexity tool isn't working for now with this branch. |
Update: running the perplexity tool in memory_f32 mode on both the hipBLAS build and the CPU gives the same result (each pass is 2x faster on GPU), so it should be working as intended; the issue I had before was a mistake, because I compared f32 on GPU with non-f32 on CPU. The perplexity tool on GPU in non-f32 mode, however, still doesn't work at all. |