rocBLAS support #1060
Since Nvidia's cuBLAS support has been added, would it be possible to implement AMD's rocBLAS support as well? It would make this the first llama project with official support for AMD GPU acceleration.
Comments
I have a PR in draft #1087 for hipBLAS and HIP. If you can test, it would be great. Probably not on Windows, though. |
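For anyone wanting to test: a draft PR can be checked out directly via GitHub's pull refs. A minimal sketch, assuming the upstream llama.cpp repository; the local branch name is arbitrary:

```sh
# Fetch the draft PR #1087 into a local branch and switch to it.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin pull/1087/head:hipblas-test   # "hipblas-test" is an arbitrary local name
git checkout hipblas-test
```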
Is there any specific configuration to make it work? I built it with HIPBLAS and CUBLAS but it doesn't seem to work |
I have a Vega64 and I'm using Arch Linux. I have these packages:
Can you tell me what kind of issue you have? A problem compiling, some kind of error? What device and OS? |
I'm on Arch using a Radeon 5700XT. The problem is that the program doesn't use the GPU at all. The compilation works as expected and I have all the needed HIP and ROCm packages installed |
Does it say "BLAS = 1" when you start it? |
Nope, 0 |
That means it was not compiled with the right options. It will not configure any kind of acceleration automatically; it has to be enabled from the make/cmake command line. How did you compile it? |
Never mind, I rebuilt it correctly with make -j8 LLAMA_HIPBLAS=1 and the program reports BLAS=1; however, the GPU isn't used at all |
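For reference, a sketch of the build invocations discussed here. The make variable comes straight from the thread; the CMake option name is an assumption based on it:

```sh
# Make build with hipBLAS enabled (as used above):
make -j8 LLAMA_HIPBLAS=1

# CMake equivalent -- the option name is assumed to mirror the make variable:
mkdir build && cd build
cmake .. -DLLAMA_HIPBLAS=ON
cmake --build . --config Release
```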
What batch size did you use? The default is 8, which will make it skip any BLAS calls. Try |
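The exact suggestion was cut off above; a hedged sketch of raising the batch size past the BLAS threshold, assuming the -b/--batch_size option, with a placeholder model path and batch value:

```sh
# Batches of 8 (the default) skip BLAS; a larger value such as 512 should trigger it.
./main -m ./models/13B/ggml-model-q4_1.bin -b 512 -f prompt.txt
```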
Still 0 GPU usage |
The only difference between the official master branch and yours is that the load time on the official one is lower (490 ms vs 697 ms) |
Interesting; if it takes longer to load, then it may actually be loading the hipBLAS libraries, which I noticed can take some time. Can you try building and running it in Docker? I have some not-so-exact commands up there in the #1087 description. |
I set up the Docker container and cloned your repo inside, recompiled, and still get the same 0% GPU usage. |
How are you checking the GPU usage? Because if it tried to use ROCm and your card wasn't supported, the program would crash. Can you also check the libraries linked to main with |
Those are the linked libraries: I know about an issue with 5700 GPUs using ROCm that breaks half-precision mode (for example, in Stable Diffusion the card can only work in full precision). Could it be related to this issue? |
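The exact command was elided above; a typical way to check which ROCm/HIP libraries the binary links against might be:

```sh
# List the dynamic libraries linked into main and filter for ROCm/HIP ones.
ldd ./main | grep -Ei 'hip|roc'
```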
Are you sure the card is supported, though? Can you run some other ROCm/HIP test application?
There is no mention of the 5000 series.
It could be; there is at least one GemmEx call that uses f16, but everything else uses f32. |
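To illustrate the kind of call being discussed: a minimal standalone sketch of an f16-input GemmEx with f32 output and accumulation. This is not the actual call from PR #1087; the header and enum names follow hipBLAS as shipped with ROCm 5.x, and the buffers are left uninitialized since only the call shape matters here:

```cpp
// Compile with: hipcc gemmex_f16.cpp -lhipblas -o gemmex_f16
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <hipblas.h>   // newer ROCm installs use <hipblas/hipblas.h>
#include <cstdio>

int main() {
    const int m = 64, n = 64, k = 64;
    hipblasHandle_t handle;
    hipblasCreate(&handle);

    __half *a, *b;   // f16 inputs -- the part reported to misbehave on the 5700XT
    float  *c;       // f32 output
    hipMalloc(&a, m * k * sizeof(__half));
    hipMalloc(&b, k * n * sizeof(__half));
    hipMalloc(&c, m * n * sizeof(float));

    const float alpha = 1.0f, beta = 0.0f;
    hipblasStatus_t st = hipblasGemmEx(
        handle, HIPBLAS_OP_N, HIPBLAS_OP_N, m, n, k,
        &alpha,
        a, HIPBLAS_R_16F, m,   // A: half precision, column-major, lda = m
        b, HIPBLAS_R_16F, k,   // B: half precision, ldb = k
        &beta,
        c, HIPBLAS_R_32F, m,   // C and the compute type stay in f32
        HIPBLAS_R_32F, HIPBLAS_GEMM_DEFAULT);
    printf("GemmEx status: %d\n", (int)st);

    hipFree(a); hipFree(b); hipFree(c);
    hipblasDestroy(handle);
    return 0;
}
```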
As for GPU activity, I can tell that it's working because I can see the activity LEDs on the card and I can hear coil whine. nvtop shows nothing, and KDE System Monitor shows some activity, but the load is not constant: the GPU computes for a very short time, unlike a video game where the load is steady. Anyway, have you measured the prompt evaluation speed with and without this PR? Feed it some longer text like |
Ok, it works. Sorry for not trying this longer prompt first, but I'm quite new to this and still don't understand a lot about it. |
34.64 ms, is that for 13B or 7B? |
Vicuna 13B q4_1 |
Once I get home I'll test more, however that already seems great |
I've noticed that in the perplexity test the hipBLAS version doesn't calculate anything. It hangs at 100% GPU usage and just doesn't do anything. However, it seems related to a bug in this specific ROCm version, since the same bug happens in Stable Diffusion using the 5700XT; only ROCm 5.2 doesn't seem to suffer from this. |
There may also be some kind of bug if you use |
The perplexity test works perfectly with that flag on the 5700XT |
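The flag elided above is presumably the f32 KV-cache option that the thread later calls memory_f32; a hedged invocation sketch with placeholder paths:

```sh
# Force the KV cache to f32 instead of f16 while running the perplexity tool.
./perplexity -m ./models/13B/ggml-model-q4_1.bin -f wiki.test.raw --memory_f32
```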
Vicuna 13B-q4_1 |
I mean the program runs fine, but it is generating garbage. I need to test more, though. |
I see what you mean; I just tested the generation and it outputs gibberish. Also, the perplexity score is different |
I'm just now noticing that, under certain circumstances, the model loaded using hipBLAS generates garbage output compared to the non-hipBLAS version, even without the f32 memory flag. |
OK, that's good to know. I think maybe some of the CUDA kernels are not completely compatible with AMD; I have to check them. It may be possible to use just hipBLAS on its own without any custom GPU code, but that code would diverge more from the CUDA code. |
Try, for example, the chat-with-bob prompt. Using hipBLAS, it seems not to understand the context of the conversation. |
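For anyone reproducing this, an interactive invocation with the chat-with-bob prompt along the lines of the project README; the flags are hedged and the model path is a placeholder:

```sh
# Interactive chat using the chat-with-bob prompt shipped in prompts/.
./main -m ./models/13B/ggml-model-q4_1.bin -f prompts/chat-with-bob.txt \
       --color -i -r "User:" -n 256
```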
I don't see any problems here, with
I think I could create another branch with just hipBLAS for you to test; it would be slower, though |
It seems inconsistent in my case:
In this case, for example, it creates another character. If I let the model go on by itself, the output becomes almost random: |
This behavior only happens with the hipBLAS branch. |
Here is another example with the chat-with-bob prompt. No hipBLAS: |
You should use the perplexity tool to evaluate generation quality; this is not a reliable way of doing it. |
The perplexity tool isn't working for now with this branch. |
Update: running the perplexity tool in memory_f32 mode on both the hipBLAS build and the CPU gives the same result (each pass is 2x faster on GPU), so it should be working as intended; the issue I had before was a mistake, because I compared f32 on GPU with non-f32 on CPU. The perplexity tool on GPU in non-f32 mode, however, still doesn't work at all. |