
rocBLAS support #1060

Closed

daniandtheweb opened this issue Apr 19, 2023 · 36 comments

@daniandtheweb
Contributor

daniandtheweb commented Apr 19, 2023

Since Nvidia's cuBLAS support has been added, would it be possible to implement AMD's rocBLAS support as well? It would make this the first llama project with official support for AMD GPU acceleration.

daniandtheweb changed the title from rocBLAS to rocBLAS support on Apr 19, 2023
@SlyEcho
Collaborator

SlyEcho commented Apr 20, 2023

I have a draft PR, #1087, for hipBLAS and HIP. If you can test it, that would be great.

Probably not on Windows, though.
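
(For anyone wanting to test: a sketch of fetching the draft PR branch via GitHub's pull-head ref convention; the local branch name is arbitrary.)

# Fetch the PR head from the upstream repo and switch to it.
git fetch origin pull/1087/head:hipblas
git checkout hipblas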

@daniandtheweb
Contributor Author

Is there any specific configuration to make it work? I built it with HIPBLAS and CUBLAS, but it doesn't seem to work.

@SlyEcho
Collaborator

SlyEcho commented Apr 20, 2023

I have a Vega64 and I'm using Arch Linux. I have these packages:

[henri@taichi llama.cpp]$ pacman -Q | grep rocm
rocm-clang-ocl 5.4.3-1
rocm-cmake 5.4.3-1
rocm-core 5.4.3-4
rocm-device-libs 5.4.3-1
rocm-hip-libraries 5.4.3-2
rocm-hip-runtime 5.4.3-2
rocm-hip-sdk 5.4.3-2
rocm-language-runtime 5.4.3-2
rocm-llvm 5.4.3-1
rocm-opencl-runtime 5.4.3-1
rocm-opencl-sdk 5.4.3-2
rocm-smi-lib 5.4.3-1
rocminfo 5.4.3-1
[henri@taichi llama.cpp]$ pacman -Q | grep hip
hip-runtime-amd 5.4.3-1
hipblas 5.4.3-1
hipcub 5.4.3-1
hipfft 5.4.3-1
hipsolver 5.4.3-1
hipsparse 5.4.3-1
miopen-hip 5.4.3-1
rocm-hip-libraries 5.4.3-2
rocm-hip-runtime 5.4.3-2
rocm-hip-sdk 5.4.3-2
[henri@taichi llama.cpp]$ pacman -Q | grep amd
hip-runtime-amd 5.4.3-1
hsa-amd-aqlprofile-bin 5.4.3-2

Can you tell me what kind of issue you have? A problem compiling, some kind of error? What device and OS?

@daniandtheweb
Contributor Author

I'm on Arch using a Radeon 5700 XT. The problem is that the program doesn't use the GPU at all. The compilation works as expected and I have every HIP and ROCm package needed installed.

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

Does it say "BLAS = 1" when you start it?
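
(One way to check, assuming the usual llama.cpp startup output, which goes to stderr; the model path is illustrative.)

./main -m ./models/7B/ggml-model-q4_0.bin -p "hi" 2>&1 | grep BLAS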

@daniandtheweb
Contributor Author

Nope, 0

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

That means it was not compiled with the right options. It will not configure any kind of acceleration automatically; it has to be enabled from the make/cmake command line.

How did you compile it?
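
(For reference, a sketch of enabling the hipBLAS path explicitly. The make flag is the one used later in this thread; the CMake option name is an assumption based on the PR.)

# Make build:
make clean
make -j8 LLAMA_HIPBLAS=1

# CMake build (option name assumed to mirror the make flag):
cmake -B build -DLLAMA_HIPBLAS=ON
cmake --build build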

@daniandtheweb
Contributor Author

Never mind, I rebuilt it correctly with make -j8 LLAMA_HIPBLAS=1 and the program now reports BLAS=1; however, the GPU isn't used at all.

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

What batch size did you use? The default is 8, which will make it skip any BLAS calls.

Try -b 512 when running ./main
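
(Something like the following; the model path is illustrative.)

./main -m ./models/13B/ggml-model-q4_1.bin -b 512 -p "Hello"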

@daniandtheweb
Contributor Author

Still 0 GPU usage

@daniandtheweb
Contributor Author

The only difference between the official master branch and yours is that load time on the official one is lower (490 ms vs 697 ms)

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

Interesting. If it is taking longer to load, then it may actually be loading the hipBLAS libraries; I've noticed that can take some time.

Can you try building and running it in Docker? I have some not-so-exact commands up there in the #1087 description.
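
(A rough sketch of that kind of setup, assuming AMD's public rocm/dev image and the standard device flags for ROCm containers; the authoritative commands are in the PR description.)

docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/dev-ubuntu-22.04
# inside the container (hipBLAS dev packages may need to be installed first):
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make -j LLAMA_HIPBLAS=1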

@daniandtheweb
Contributor Author

daniandtheweb commented Apr 21, 2023

I set up the docker and cloned your repo inside, recompiled and still the same 0% GPU usage.

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

How are you checking the GPU usage? Because if it tried to use ROCm and your card wasn't supported, the program would crash.

Can you also check the libraries linked to main with ldd main, and see whether anything there comes from /opt/rocm?
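
(For example, filtering the output:)

ldd ./main | grep -E '/opt/rocm|hip'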

@daniandtheweb
Contributor Author

Those are the linked libraries:
linux-vdso.so.1 (0x00007fff8adee000)
libhipblas.so.0 => /opt/rocm/lib/libhipblas.so.0 (0x00007fd3d087f000)
libamdhip64.so.5 => /opt/rocm/lib/libamdhip64.so.5 (0x00007fd3cf200000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fd3cee00000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007fd3cf118000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007fd3d082a000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007fd3cec19000)
librocsolver.so.0 => /opt/rocm/lib/librocsolver.so.0 (0x00007fd368400000)
librocblas.so.0 => /opt/rocm/lib/librocblas.so.0 (0x00007fd349c00000)
libhsa-runtime64.so.1 => /opt/rocm/lib/libhsa-runtime64.so.1 (0x00007fd349800000)
libnuma.so.1 => /lib/libnuma.so.1 (0x00007fd3d081a000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fd3d09e4000)
libhsakmt.so.1 => /opt/rocm/lib/libhsakmt.so.1 (0x00007fd3d07ee000)
libelf.so.1 => /usr/lib/libelf.so.1 (0x00007fd3cf0fc000)
libdrm.so.2 => /usr/lib/libdrm.so.2 (0x00007fd3cf0e5000)
libdrm_amdgpu.so.1 => /usr/lib/libdrm_amdgpu.so.1 (0x00007fd3cf0d9000)
libz.so.1 => /usr/lib/libz.so.1 (0x00007fd3cf0bf000)
libzstd.so.1 => /usr/lib/libzstd.so.1 (0x00007fd3ceb46000)

I know about an issue with 5700 GPUs using ROCm that breaks half-precision mode (for example, in Stable Diffusion the card can only work in full precision). Could it be related to this issue?

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

Are you sure the card is supported, though? Can you run some other ROCm/HIP test application?

There is no mention of the 5000 series.
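
(Two quick sanity checks with tools that ship with ROCm; the 5700 XT should show up as gfx1010 if the runtime sees it at all.)

/opt/rocm/bin/rocminfo | grep -i gfx
/opt/rocm/bin/rocm_agent_enumerator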

I know about an issue with 5700 GPUs using ROCm that breaks half-precision mode (for example, in Stable Diffusion the card can only work in full precision). Could it be related to this issue?

It could be; there is at least one GemmEx call that uses f16, but everything else uses f32.

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

As for the GPU activity: I can tell it's working because I can see the activity LEDs on the card and hear the coil whine. nvtop shows nothing, and KDE System Monitor shows some activity, but it is not constant: the GPU is calculating for only a very short time, unlike a video game where the load is constant.

Anyway, have you measured the prompt evaluation speed with and without this PR? Feed it some longer text like prompts/dan.txt with -b 512 --no-mmap -n 64 --ignore-eos.
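
(That is, something like the following; the model path is illustrative.)

./main -m ./models/13B/ggml-model-q4_1.bin -f prompts/dan.txt \
  -b 512 --no-mmap -n 64 --ignore-eos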

@daniandtheweb
Contributor Author

OK, it works. Sorry for not trying the longer prompt first, but I'm quite new to this and still don't understand a lot about it.
With the longer prompt, evaluation time goes from 88.6 ms to 34.64 ms.

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

34.64 ms, is that for 13B or 7B?

@daniandtheweb
Contributor Author

daniandtheweb commented Apr 21, 2023

Vicuna 13B q4_1

@daniandtheweb
Contributor Author

Once I get home I'll test more, but that already seems great.

@daniandtheweb
Contributor Author

I've noticed that in the perplexity test the HIPBLAS version doesn't calculate anything: it hangs at 100% GPU usage and makes no progress. It seems related to a bug in this specific ROCm version, since the same bug happens in Stable Diffusion on the 5700 XT; only ROCm 5.2 doesn't seem to suffer from it.

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

There may also be some kind of bug when you use --memory_f32 that doesn't happen with CUDA.
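
(A repro sketch, with the flag spelled as in this thread and an illustrative model path:)

./main -m ./models/13B/ggml-model-q4_1.bin --memory_f32 -b 512 -f prompts/dan.txt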

@daniandtheweb
Contributor Author

daniandtheweb commented Apr 21, 2023

The perplexity test works perfectly with that flag on the 5700 XT.

@daniandtheweb
Contributor Author

Vicuna 13B-q4_1
no blas: 49.66 seconds per pass - ETA 9.04 hours
hipblas: 16.44 seconds per pass - ETA 2.99 hours

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

I mean the program runs fine, but it is generating garbage. I need to test more, though.

@daniandtheweb
Contributor Author

I see what you mean; I just tested generation and it outputs gibberish. The perplexity score is also different.

@daniandtheweb
Contributor Author

daniandtheweb commented Apr 21, 2023

I'm noticing just now that, under certain circumstances, the model loaded with hipBLAS generates garbage output compared to the non-hipBLAS version, even without the f32 memory flag.

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

OK, that's good to know. I think maybe some of the CUDA kernels are not completely compatible with AMD; I have to check them. It may be possible to use just hipBLAS on its own without any custom GPU code, but that code would diverge more from the CUDA code.

@daniandtheweb
Contributor Author

Try, for example, the chat-with-bob prompt. Using hipBLAS, it seems not to understand the context of the conversation.

@SlyEcho
Collaborator

SlyEcho commented Apr 21, 2023

I don't see any problems here, with ./build/bin/main -m ./models/alpaca-native-7b-q4_0.bin -c 2048 -b 512 -f prompts/chat-with-bob.txt -r 'User:'

User: How many people live in London?
Bob: About 8.9 million people live in London. It's the second-largest city in Europe after Moscow.
User: What about Paris?
Bob: Paris is the third-largest city in Europe with over 2.2 million inhabitants.
User: Berlin?
Bob: Yes, Berlin is also a large city in Europe, with over 3.7 million people living there.
User: What about Europe total?
Bob: According to estimates, around 450 million people live in Europe.
User:
User: How big is a giraffe?
Bob: A giraffe is about 18 feet tall, including the long neck.
User: What about an elephant?
Bob: An elephant can grow up to 4 meters tall and weigh up to 7 tons.
User: Who would win in a fight?
Bob: It depends on the situation. Generally speaking, a giraffe has higher odds of winning due to its height and
     long neck, while an elephant is stronger and more powerful.
User:

I think I could create another branch with just hipBLAS for you to test; it would be slower, though.

@daniandtheweb
Contributor Author

daniandtheweb commented Apr 21, 2023

It seems inconsistent in my case:

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:How many people live in New York?
Assistant: About 8 million, the second-largest city in the United States.
User:How big is New Zeland?
Assistant: It's a small country, comparable in size to the American state of New Mexico.
Bob: Good day, Bob.

In this case, for example, it creates another character. If I let the model go on by itself, it becomes almost random:

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: Yes, please.
Bob: Nice to meet you. A fine day in May.
User: Bob, what do I owe you.
\*
Bob: Thank you. You don't owe me anything today

This behavior only happens with the hipblas branch.

@daniandtheweb
Contributor Author

Here is another example of the chat-with-bob
hipblas:

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: Please. The capital of a country named "Moscow."
Answerer: Good morning Bob, the capital of Russia is Moscow, the city of Russia, located on the river Moskovka.

Assistant: Yes, that's correct.

Assistant: You may interact with me by

No hipblas:

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: Is there a time limit for using this AI chatbot?
Bob: No, there is no time limit for using me. I am here to assist you whenever you need help.
User: What is the name of the most populous country in Africa?
Bob: The most populous country in Africa

@slaren
Collaborator

slaren commented Apr 21, 2023

You should use the perplexity tool to evaluate generation quality; this is not a reliable way of doing it.
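
(For reference, a typical invocation; the tool reads any raw text file with -f, and the dataset path is illustrative.)

./perplexity -m ./models/13B/ggml-model-q4_1.bin -f wiki.test.raw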

@daniandtheweb
Contributor Author

The perplexity tool isn't working for now with this branch.

@daniandtheweb
Contributor Author

daniandtheweb commented Apr 22, 2023

Update: running the perplexity tool in memory_f32 mode on both the hipBLAS build and the CPU build gives the same result (the pass time is 2x faster on GPU), so it should be working as intended. The issue I had before was my mistake: I compared the f32 result on GPU with the non-f32 result on CPU. The perplexity tool on GPU in non-f32 mode, however, still doesn't work at all.
