Add ROCm support #7
Conversation
As for performance, on my RX 6700 XT, here are my results generating 2048 tokens on a 13B model; note that the Triton-based GPTQ is super slow for some unknown reason:
I've never had good performance from any of the Triton versions either. Just a bunch of time waiting for the kernels to "warm up" followed by some unsatisfying numbers. I must admit I have no experience with ROCm, and I can't even remember the last time I had an AMD GPU. I'm a little surprised this is all it takes to get it running. Had to look up what HIP is, and apparently it's just pretty amazing. And I'm not sure how to compare performance across the two architectures, but you're getting half the speed I'm seeing, so either the 6700 XT is very well suited for this, or there's a lot more potential to unlock on the 4090. I'm not sure what to do about the compiler flags. I think it's kind of important to avoid any complicated install procedure for the C++ extension, since it seems a lot of people get stuck on that, and it might be my own least favorite thing about GPTQ-for-LLaMa, too. Why is
Well, that's because HIP is basically just a clone of CUDA; to make a CUDA project run on ROCm you just have to rename everything to the HIP variant. In our case, since it's a C++ extension for PyTorch, PyTorch automatically runs hipify-torch on all .cu files, which renames everything per https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/cuda_to_hip_mappings.py; surprisingly, it doesn't do that for the .h files. After reading the code a bit, I believe that renaming those .h files to .cuh will allow hipify-python to also run on them, which would reduce the size of this PR and improve maintainability. I will try to do that when I have some time.
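(For context, here is a minimal sketch of the kind of mechanical rewrite hipify-torch applies to .cu sources at build time. This is illustrative, not code from this PR; the "after" version is what a ROCm build actually compiles.)

// --- before hipify (CUDA source in a .cu file) ---
#include <cuda_runtime.h>

void copy_async(void* dst, const void* src, size_t bytes, cudaStream_t stream)
{
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, stream);
}

// --- after hipify-torch (what gets compiled for a ROCm build) ---
// #include <hip/hip_runtime.h>
//
// void copy_async(void* dst, const void* src, size_t bytes, hipStream_t stream)
// {
//     hipMemcpyAsync(dst, src, bytes, hipMemcpyDeviceToDevice, stream);
// }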
I'm not sure; I was a bit fed up because of the issues I had compiling it, so I just stubbed it. It's possible that a simple fix could be made.
I'm not sure, I was rather desperate with getting
I have rebased locally on the latest version. The only thing stopping it from building is HIP missing the half2 atomicAdd. I need some help making an efficient implementation of that; could you help me with it, @turboderp? It requires CUDA compute capability 6.x or higher, so it shouldn't be needed for most NVIDIA users, but HIP doesn't support it. In other news, someone shared their benchmark using 2 MI60s, getting around 8.8 tokens/s generating 128 tokens on a 65B model.
so that really was you, huh?
I'm not convinced the MI60 is a bad purchase at all. It seems to have a lot of bandwidth, FP16 support... I could see it running very well under the right circumstances. And I pushed an update for the atomicAdd. I've tested it as much as I can and the function seems to work as it should. To test it you'll need to run with
If it doesn't work with
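(The exact compat function isn't shown in this thread, but a CAS-based fallback for a half2 atomicAdd generally looks something like the sketch below. The helper name is made up for illustration, and this is not necessarily the code that was pushed; a half2 is 32 bits wide, so the whole value can be swapped atomically through an unsigned int view.)

#include <cuda_fp16.h>   // <hip/hip_fp16.h> under ROCm after hipify

// Sketch of a CAS-loop fallback for atomicAdd on half2: retry until the
// compare-and-swap of the containing 32-bit word succeeds.
__device__ __forceinline__ void atomic_add_half2_compat(half2* address, half2 val)
{
    unsigned int* address_as_ui = reinterpret_cast<unsigned int*>(address);
    unsigned int old_val = *address_as_ui;
    unsigned int assumed;
    do
    {
        assumed = old_val;
        half2 updated = __hadd2(*reinterpret_cast<half2*>(&assumed), val);
        old_val = atomicCAS(address_as_ui, assumed, *reinterpret_cast<unsigned int*>(&updated));
    }
    while (assumed != old_val);
}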
I saw your push by chance, and got real excited. Unfortunately it fails to build.
Also, if you want me to attempt merging and rebasing, you'll have to explain it to me like I'm an idiot.
Hey, I'm not very good at git either. But from the output it looks like it's due to that thing where PyTorch doesn't want to HIPify the .h files. Still, I have no way to test this myself, so who knows what will happen.
Attempted again, but received identical errors below (truncated to the last build since they are all identical):
Looks like it's HIPifying them, but the
Am I reading too much into this?
Pulled a lot of stuff out of ardfork's rework, involving a lot of manual renaming and the classic copy-paste technique. Nearly there, but now only running into math errors.
You could try removing the
I pushed my rebase. But it doesn't work: it compiles, but during inference it SIGABRTs with "Cannot find Symbol". Setting AMD_LOG_LEVEL to try to debug shows:
I don't really understand why it fails while launching the kernel; if we remove this call, it will fail on another kernel (silu_mul_cuda_kernel). Those lines were here before, and I don't see any significant change that would break things. Other projects with a similar codebase work well when hipified, like GPTQ for example, and they also launch kernels the same way. I verified that exllama_ext.so correctly has those symbols. I'm afraid I don't know what the cause of this is, and I don't have contact with anyone knowledgeable about HIP.
It may be the
Same here. Sorry I couldn't help more.
I'm considering pulling the trigger and returning my two MI60s before I'm stuck with them for life... Thinking about buying A4000s to fill the 4 PCIe slots the MI60s currently hold, which would bring me to the same total VRAM.
This seems to be the correct assessment. Trying to build even a trivial extension without this flag causes a SIGABRT. I'm not entirely sure adding this flag initially solved the problem either: I suspect it simply moved compile-time linking errors to runtime, where they manifested as the "Cannot find Symbol" failures. The error itself pointed at the h2rcp call, which I replaced with:

half2 r;
r.x = 1.0 / sum.x;
r.y = 1.0 / sum.y;

This compiles, but chat mode still seems to give gibberish for me. I'm not sure if that's because of this or something else not working.
Ardfork's original pull request had this issue too. I was benchmarking good speeds, but hadn't thought to check perplexity or the chatbot.
Oddly enough, the inference benchmark seems to mostly work for me now. It shows a perplexity of 5.6797, which is about in line with the 13B model I used for it. I'm not sure if this is some coincidence or because of some difference between the two, such as the config.
Could you include your changes in a PR, or maybe just a line number + edit? I'd like to give it a try, for continuity.
The only changes made were undoing any changes to PyTorch regarding
The only other thing is the reciprocal change to q4v2_mlp.cu I noted above:

@@ -29 +29,3 @@
- half2 r = h2rcp(sum);
+ half2 r;
+ r.x = 1.0 / sum.x;
+ r.y = 1.0 / sum.y;

Which I'm not sure is correct, and is probably slower than ideal even if it is.
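(For reference, the same replacement with the half/float conversions spelled out explicitly would look like the sketch below. This assumes a plain per-component division is acceptable where h2rcp's fast approximate reciprocal was used, so it is likely still slower than the original.)

// Per-component reciprocal of a half2 without h2rcp, with explicit conversions.
half2 r;
r.x = __float2half(1.0f / __half2float(sum.x));
r.y = __float2half(1.0f / __half2float(sum.y));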
I'm surprised that compiles. But if there's a problem with the fused MLP you could try running with
Is it that bad? 😄
This seems to have no effect on things; perplexity doesn't change and chatbot outputs are similar.
This doesn't seem to affect perplexity, but in chat mode it makes things a little more "sane"? Still gibberish, but a lot less repetition of the same tokens, and shorter outputs. For reference, it does indeed run at about 2/3 of the usual speed.
I can confirm that setting both at once produces what seems like correct output. Since this disables what sounds like a lot of exllama's main optimizations, the low speeds are not unexpected. I would speculate that the earlier speed of about 3 t/s is a more reasonable estimate of performance if the rest can be made to work.
Actually, I seem to get much higher speeds natively on the ooba webui.
So it was the commit with the most inoffensive name that broke it. It was the fault of 167f601. Please, @turboderp, don't hide code changes in an "Update todo" commit.
Co-authored-by: [ ] <[email protected]>
Alright, I quickly fixed that; I don't know why it was causing an issue. Now the PR should be good. It is ready to be reviewed and merged.
Looks good! Working with your new commit, here's 65B on two MI60s.
Yeah, sorry about that. PyCharm makes it a little too easy to commit and push changes. I'll be more careful.
Well, this is great. Thanks, guys. I'm still a little amazed it works at all with so few changes at the end of the day. Based on some of these benchmarks I suspect there are lots of AMD-specific optimizations to look at, too. The prompt speed on the MI60 is really low, for instance, compared to the speed per token, which is kind of decent for a GPU from 2018. It seems the kernels are 4-5 times slower than on my 4090+3090 setup, while rocBLAS is about 13 times slower. Seems off, but who knows.
By the way, the next big optimization I'm going for will be using CUDA Graphs. And "as of my knowledge cutoff in September 2021" there's no ROCm equivalent. Does anyone know if this has changed since? If not, I'll try to keep the regular, HIP-compatible CUDA stuff as-is and implement graphs within a
It might work; it was added around May 2021 and shipped in ROCm 5.3, which was released in October 2022, and most people are probably using >= 5.4.2. In PyTorch they use a bunch of
I've started work on moving to graphs. It's in the "graphs" branch for now, since I'd like someone to at least try it out on ROCm before I start breaking stuff for people.
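(For readers unfamiliar with the approach, the explicit-graph pattern being discussed roughly follows the sketch below. This is illustrative, not the exllama code; the graph is built and instantiated once, and later calls only refresh the kernel arguments before relaunching the executable graph, which is what cuts per-launch overhead.)

#include <cuda_runtime.h>

__global__ void scale_kernel(float* x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void run_with_graph(float* x, float a, int n, cudaStream_t stream)
{
    static cudaGraph_t graph = nullptr;
    static cudaGraphExec_t graph_exec = nullptr;
    static cudaGraphNode_t node;

    void* args[] = { &x, &a, &n };
    cudaKernelNodeParams params = {};
    params.func = (void*) scale_kernel;
    params.gridDim = dim3((n + 255) / 256);
    params.blockDim = dim3(256);
    params.sharedMemBytes = 0;
    params.kernelParams = args;
    params.extra = nullptr;

    if (!graph_exec)
    {
        // First call: describe the work as graph nodes and instantiate once.
        cudaGraphCreate(&graph, 0);
        cudaGraphAddKernelNode(&node, graph, nullptr, 0, &params);
        cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
    }
    else
    {
        // Later calls: only refresh the arguments on the already-built graph.
        cudaGraphExecKernelNodeSetParams(graph_exec, node, &params);
    }
    cudaGraphLaunch(graph_exec, stream);
}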
Looks like we're just running into some translation errors:
Some are not handled by hipify_python. I made a small patch; do you want me to open a PR for it?

diff --git a/exllama_ext/cuda_func/rms_norm.cu b/exllama_ext/cuda_func/rms_norm.cu
index d327a65..34f643e 100644
--- a/exllama_ext/cuda_func/rms_norm.cu
+++ b/exllama_ext/cuda_func/rms_norm.cu
@@ -3,6 +3,17 @@
#include "../util.cuh"
#include "../matrix.cuh"
+#if defined(USE_ROCM)
+#define cudaGraphAddKernelNode hipGraphAddKernelNode
+#define cudaGraphAddMemsetNode hipGraphAddMemsetNode
+#define cudaGraphCreate hipGraphCreate
+#define cudaGraphExecKernelNodeSetParams hipGraphExecKernelNodeSetParams
+#define cudaGraphExecMemsetNodeSetParams hipGraphExecMemsetNodeSetParams
+#define cudaGraphNode_t hipGraphNode_t
+#define cudaKernelNodeParams hipKernelNodeParams
+#define cudaMemsetParams hipMemsetParams
+#endif
+
const int THREADS_X = 32;
const int THREADS_Y = 8;
const int BLOCKSIZE_X = 16;
@@ -207,8 +218,18 @@ void rms_norm_cuda
void* args1[] = { &x, &temp, &rows, &dim };
void* args2[] = { &x, &w, &out, &temp, &epsilon, &r_dim, &rows, &dim };
- cudaKernelNodeParams params1 = { (void*) rms_norm_kernel_func1, blocks, threads, 0, args1, nullptr };
- cudaKernelNodeParams params2 = { (void*) rms_norm_kernel_func2, blocks, threads, 0, args2, nullptr };
+ cudaKernelNodeParams params1 = { .func = (void *)rms_norm_kernel_func1,
+ .gridDim = blocks,
+ .blockDim = threads,
+ .sharedMemBytes = 0,
+ .kernelParams = args1,
+ .extra = nullptr };
+ cudaKernelNodeParams params2 = { .func = (void *)rms_norm_kernel_func2,
+ .gridDim = blocks,
+ .blockDim = threads,
+ .sharedMemBytes = 0,
+ .kernelParams = args2,
+ .extra = nullptr };
if (!contexts[device_index].rms_norm_graph_init)
{

It is working correctly with a small test, but it seems a bit slower:
Also, I want to detail my patch-making process a bit. First, I try to run it; if there are compilation errors, I note which files have them. In our case, it's
Secondly, I run
I then copy those
Now, I try again to run it and fix the remaining errors. In our case, that was with the difference between
Thanks, I added the changes. As for the speed, I'm hoping it's not because HIP emulates graphs too clumsily, e.g. by destroying and recreating them whenever parameters change, which would defeat the purpose somewhat. But maybe it's just constant overhead and it will start to flip with longer graphs.
Well, the nice thing is that HIP is open source, so you can actually check. The less nice thing is that not many people have time to read such complex code.
These are the only errors I'm still getting; I didn't get a chance to test ardfork's own changes and ended up just pulling the most recent commit.
Yes, he changed my patch a bit, which causes this error. Also, compiling with
Edit: And
Damn it. I'll put it in the .cu file for now, but I really would prefer to have the translation fixes in one place rather than spread throughout the CUDA code, because there's a lot of refactoring coming. Is there a way I can force it to generate HIP code without an AMD GPU, just to check if it still compiles?
Actually, HIP is supposed to run on both NVIDIA and AMD. And it's supposed to do so without any performance loss when running on NVIDIA, as it's just a wrapper for the CUDA functions. I don't know if you need the whole ROCm stack, probably not. Try just installing PyTorch ROCm in a new venv:
Well, in that case, you would just need to add
You're both some pretty damn smart people.
at least it's consistent, right?
Hi @ardfork, I don't get why half2 is disabled by default when compiling for ROCm. Could you explain?
Thank you, missed this one! I hit the same issue.
In this PR, for me
Given this in
Without renaming this file I am getting a
atomicAdd for half isn't supported on RDNA1-3, by the way (not sure about older archs). You should always use atomicAdd for float, or implement your own CAS loop.
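(A minimal sketch of the first suggestion, assuming the accumulator can be kept in float: convert the half inputs on the fly and use the universally supported float atomicAdd, casting back to half only at the end if needed.)

#include <cuda_fp16.h>

// Sum half-precision inputs into a float accumulator using float atomicAdd,
// which is supported everywhere, unlike half atomicAdd on RDNA.
__global__ void sum_half_into_float(const half* in, float* acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(acc, __half2float(in[i]));
}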
This PR adds support for ROCm.
It is currently based on 7d8ca43; I will rebase it on a more recent commit once cuda_compat is added for atomicAdd of float2, since you will probably need to implement that for older NVIDIA GPUs as well, as it requires compute capability 9.x or higher.
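(For older NVIDIA GPUs, such a cuda_compat fallback could look roughly like the sketch below; the helper name is hypothetical, and two scalar atomics are correct for accumulation but are not a single atomic update of the pair.)

#include <cuda_runtime.h>

// Hypothetical compat helper: a vectorized atomicAdd on float2 only exists on
// compute capability 9.x, so fall back to two scalar float atomicAdds.
__device__ __forceinline__ void atomic_add_float2_compat(float2* address, float2 val)
{
    atomicAdd(&address->x, val.x);
    atomicAdd(&address->y, val.y);
}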
Before being ready for merging, it has two parts that need fixing; they are commented with a "FIXME". The first is about compiler flags: PyTorch doesn't give us full control of them, and from what I tried, gpu-rdc is needed to compile, but PyTorch adds -fno-gpu-rdc as the last flag. Secondly, I had to stub _cuda_raise; I didn't investigate it much, but it doesn't work on HIP, so a fix or a workaround needs to be implemented.