Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ROCm support #7

Merged
merged 2 commits into from
Jun 6, 2023
Merged

Add ROCm support #7

merged 2 commits into from
Jun 6, 2023

Conversation

ardfork
Copy link
Contributor

@ardfork ardfork commented May 23, 2023

This PR add support for ROCm.

It is currently based on 7d8ca43, I will base it on a more recent commit when cuda_compat is added for atomicAdd of float2, since you will probably need to also implement that for older nvidia GPU as it require compute capability 9.x or higher.

Before being ready for merging, it have two parts that need fixing, they are commented with a "FIXME". First is about compiler flags, pytorch doesn't give us full control of them, from what I tried, gpu-rdc is needed to compile, but pytorch add -fno-gpu-rdc as the last flag. Secondly I had to stub _cuda_raise, I didn't investigate that much about it, but it doesn't work on HIP, a fix or a workaround need to be implemented.

@turboderp turboderp marked this pull request as ready for review May 23, 2023 21:52
@ardfork
Copy link
Contributor Author

ardfork commented May 23, 2023

As for performance, on my RX 6700 XT, here is my result on generating 2048 tokens on a 13B model, note that triton based GPTQ is super slow for unknown reason:

  • GPTQ-for-LLaMa cuda branch: 17.2 tokens/s
  • GPTQ-for-LLaMa triton branch: 3 tokens/s
  • exllama: 31.5 tokens/s

@turboderp
Copy link
Owner

I've never had good performance from any of the Triton versions either. Just a bunch of time waiting for the kernels to "warm up" followed by some unsatisfying numbers.

I must admit I have no experience with ROCm, and the last time I had an AMD GPU I can't even remember. I'm a little surprised this is all it takes to get it running. Had to look up what HIP is, and apparently it's just pretty amazing. And I'm not sure how to compare performance across the two architectures, but you're getting half the speed I'm seeing, so either the 6700 XT is very well suited for this, or there's a lot more potential to unlock on the 4090.

The _cuda_raise macro isn't important, there are other ways to check a return value. Don't know why it should cause issues, though, but I'll work around it.

I'm not sure what to do about the compiler flags. I think it's kind of important to avoid any complicated install procedure for the C++ extension, since it seems a lot of people get stuck on that and it might be own my least favorite thing about GPTQ-for-LLaMa, too.

Why is -fgpu-rdc needed, exactly? Just because the code is split across multiple files?

@ardfork
Copy link
Contributor Author

ardfork commented May 23, 2023

I'm a little surprised this is all it takes to get it running.

Well that because HIP is basically just a clone of CUDA, to make a CUDA project run on ROCm you just have to rename everything to the HIP variant. In our case since it's a cpp extension for pytorch, pytorch automatically run hipify-torch on all .cu files, which rename everything https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/cuda_to_hip_mappings.py, it doesn't do that for the .h surprisingly. After reading a bit the code, I believe that renaming those .h to .cuh, will allow for hipify-python to also run on them, which would reduce this PR size and increase maintainability. I will try to do that when I have some time.

The _cuda_raise macro isn't important, there are other ways to check a return value. Don't know why it should cause issues, though, but I'll work around it.

I'm not sure, I was a bit fed up because of the issue I had to compile it, so I just stubbed it. It's possible that a simple fix could be made.

Why is -fgpu-rdc needed, exactly? Just because the code is split across multiple files?

I'm not sure, I was rather desperate, with getting lld: error: undefined hidden symbol: __llvm_amdgcn_rcp_f16 while trying to build. I found this similar issue which suggested this fix: ROCm/HIP#2196 (comment). AMD doc say that it's to Generates relocatable device code, also known as separate compilation mode, I don't know why the undefined hidden symbol error happen in the first place or why that fix it.
In general, I don't know why pytorch use those compile flags by default, it's the same with __HIP_NO_HALF_CONVERSIONS__, it's always used, thankfully I could disable it, as it's needed for exllama to work. Git blame only point to the PR that implemented HIP compilation for cpp extension, but that didn't have any useful information.

@ardfork
Copy link
Contributor Author

ardfork commented May 29, 2023

I have rebased locally on latest version. The only thing stopping it being built is HIP missing half2 atomicAdd. I need some help making an efficient implementation that, could you help me with that @turboderp? It require cuda compute capability 6.x or higher so it shouldn't be needed for most nvidia user, but HIP doesn't support it.

In other news, someone shared his benchmark using 2 MI60, getting around 8.8 tokens/s generating 128 tokens on 65B.

@jmoney7823956789378
Copy link

In other news, someone shared his benchmark using 2 MI60, getting around 8.8 tokens/s generating 128 tokens on 65B.

so that really was you, huh?
I'm willing to test every commit you make as long as it justifies my retarded purchase.

@turboderp
Copy link
Owner

I'm not convinced MI60 is a bad purchase at all. It seems to have a lot of bandwidth, FP16 support... I could see it running very well under the right circumstances.

And I pushed an update for the atomicAdd. I've tested it as much as I can and the function seems to work as it should. To test it you'll need to run with -dq 1 or more, to dequantize at least one layer at load time. Cause there's only one function that does half2 atomicAdd (non-cuBLAS half matmul function) right now, and at the moment it only gets called on pre-dequantized layers.

If it doesn't work with -dq 1, that's not a big loss, though. Turns out it's not a useful feature anyway, and I can do a half version of the matmul to fall back on.

@jmoney7823956789378
Copy link

I saw your push by chance, and got real excited. Unfortunately it fails to build.
Regarding the earlier claim of 8+t/s on 65B, it was incorrect. After verifying with test_chatbot.py on ardfork's fork, I only get gibberish, in both single and double GPU mode.
Below is the output from attempting to test inference on your most recent push.

root@a3eef14b80f1:/mnt/textgen/exllama# python test_benchmark_inference.py -d ../models/ausboss_WizardLM-13B-Uncensored-4bit-128g/ -p -ppl -dq 1
Successfully preprocessed all matching files.
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/root/miniconda3/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/textgen/exllama/test_benchmark_inference.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "/mnt/textgen/exllama/model.py", line 5, in <module>
    import cuda_ext
  File "/mnt/textgen/exllama/cuda_ext.py", line 14, in <module>
    exllama_ext = load(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/11] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_buffers.hip -o hip_buffers.cuda.o
FAILED: hip_buffers.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_buffers.hip -o hip_buffers.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
In file included from /mnt/textgen/exllama/exllama_ext/hip_buffers.hip:2:
/mnt/textgen/exllama/exllama_ext/cuda_buffers.h:4:10: fatal error: 'cuda_runtime.h' file not found
#include <cuda_runtime.h>
         ^~~~~~~~~~~~~~~~
1 error generated when compiling for gfx1030.
[2/11] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/half_matmul.hip -o half_matmul.cuda.o
FAILED: half_matmul.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/half_matmul.hip -o half_matmul.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
/mnt/textgen/exllama/exllama_ext/hip_func/half_matmul.hip:3:10: fatal error: 'half_matmul.h' file not found
#include "half_matmul.h"
         ^~~~~~~~~~~~~~~
1 error generated when compiling for gfx1030.
[3/11] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/rope.hip -o rope.cuda.o
FAILED: rope.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/rope.hip -o rope.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
/mnt/textgen/exllama/exllama_ext/hip_func/rope.hip:3:10: fatal error: 'rope.h' file not found
#include "rope.h"
         ^~~~~~~~
1 error generated when compiling for gfx1030.
[4/11] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/q4v2_recons.hip -o q4v2_recons.cuda.o
FAILED: q4v2_recons.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/q4v2_recons.hip -o q4v2_recons.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
/mnt/textgen/exllama/exllama_ext/hip_func/q4v2_recons.hip:3:10: fatal error: 'q4v2_recons.h' file not found
#include "q4v2_recons.h"
         ^~~~~~~~~~~~~~~
1 error generated when compiling for gfx1030.
[5/11] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/q4v2_sequential.hip -o q4v2_sequential.cuda.o
FAILED: q4v2_sequential.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/q4v2_sequential.hip -o q4v2_sequential.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
/mnt/textgen/exllama/exllama_ext/hip_func/q4v2_sequential.hip:3:10: fatal error: 'q4v2_sequential.h' file not found
#include "q4v2_sequential.h"
         ^~~~~~~~~~~~~~~~~~~
1 error generated when compiling for gfx1030.
[6/11] c++ -MMD -MF exllama_ext_hip.o.d -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /mnt/textgen/exllama/exllama_ext/exllama_ext_hip.cpp -o exllama_ext_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1
FAILED: exllama_ext_hip.o
c++ -MMD -MF exllama_ext_hip.o.d -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /mnt/textgen/exllama/exllama_ext/exllama_ext_hip.cpp -o exllama_ext_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1
In file included from /mnt/textgen/exllama/exllama_ext/exllama_ext_hip.cpp:10:
/mnt/textgen/exllama/exllama_ext/cuda_buffers.h:4:10: fatal error: cuda_runtime.h: No such file or directory
    4 | #include <cuda_runtime.h>
      |          ^~~~~~~~~~~~~~~~
compilation terminated.
[7/11] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/column_remap.hip -o column_remap.cuda.o
FAILED: column_remap.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/column_remap.hip -o column_remap.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
/mnt/textgen/exllama/exllama_ext/hip_func/column_remap.hip:3:10: fatal error: 'column_remap.h' file not found
#include "column_remap.h"
         ^~~~~~~~~~~~~~~~
1 error generated when compiling for gfx1030.
[8/11] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip -o rms_norm.cuda.o
FAILED: rms_norm.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip -o rms_norm.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:3:10: fatal error: 'rms_norm.h' file not found
#include "rms_norm.h"
         ^~~~~~~~~~~~
1 error generated when compiling for gfx1030.
[9/11] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/q4v2_matmul.hip -o q4v2_matmul.cuda.o
FAILED: q4v2_matmul.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/q4v2_matmul.hip -o q4v2_matmul.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
/mnt/textgen/exllama/exllama_ext/hip_func/q4v2_matmul.hip:3:10: fatal error: 'q4v2_matmul.h' file not found
#include "q4v2_matmul.h"
         ^~~~~~~~~~~~~~~
1 error generated when compiling for gfx1030.
[10/11] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/q4v2_mlp.hip -o q4v2_mlp.cuda.o
FAILED: q4v2_mlp.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/q4v2_mlp.hip -o q4v2_mlp.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
/mnt/textgen/exllama/exllama_ext/hip_func/q4v2_mlp.hip:3:10: fatal error: 'q4v2_mlp.h' file not found
#include "q4v2_mlp.h"
         ^~~~~~~~~~~~
1 error generated when compiling for gfx1030.
ninja: build stopped: subcommand failed.

(base) root@a3eef14b80f1:/mnt/textgen/exllama#

@jmoney7823956789378
Copy link

Also, if you want me to attempt merging and rebasing you have to explain it to me like I am a retard.
I have never worked with git more than git clone and git pull. :)

@turboderp
Copy link
Owner

Hey, I'm not very good at git either.

But, from the output it looks like it's due to that thing where PyTorch doesn't want to HIPify .h tiles, so I tried renaming them all to .cuh instead. The C++ compiler doesn't care anyway.

Still, I have no way to test this myself, so who knows what will happen.

@jmoney7823956789378
Copy link

jmoney7823956789378 commented May 30, 2023

so I tried renaming them all to .cuh instead.

Attempted again, but received identical errors below (truncated to last build since they are all identical):

[11/12] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/column_remap.hip -o column_remap.cuda.o
FAILED: column_remap.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/column_remap.hip -o column_remap.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
/mnt/textgen/exllama/exllama_ext/hip_func/column_remap.hip:3:10: fatal error: 'column_remap.cuh' file not found
#include "column_remap.cuh"
         ^~~~~~~~~~~~~~~~~~
1 error generated when compiling for gfx1030.
ninja: build stopped: subcommand failed.

(base) root@a3eef14b80f1:/mnt/textgen/exllama#

Looks like it's HIPifying them, but the #include "column_remap.cuh" line is calling to the un-HIPified version.
First 4 lines of column_remap.hip:

// !!! This is a file automatically generated by hipify!!!
#include "hip/hip_runtime.h"
#include "column_remap.cuh"
#include "../util.cuh"

Am I reading too far into this?

@jmoney7823956789378
Copy link

jmoney7823956789378 commented May 30, 2023

Pulled a lot of stuff out of ardfork's rework, involving a lot of manual renaming and the classic copypasta technique. Nearly there but running into only math errors.
I cloned what I have and I'm going to bed before my head explodes.
https://github.com/johnmoney83748932/exllama-idkwhatimdoing
I hope someone smarter than me can do math.

In file included from /mnt/textgen/exllama/exllama_ext/hip_func/rope.hip:5:
/mnt/textgen/exllama/exllama_ext/hip_func/../matrix.cuh:218:10: error: no viable conversion from 'unsigned int' to 'half' (aka '__half')
    half result1_ = __hadd(result1.x, result1.y);

@turboderp
Copy link
Owner

You could try removing the dot_product_8_dual and dot_product_8_dual_buffered as they're not referenced anywhere right now. I can't say why it's producing an error about unsigned ints when adding two fp16 values and storing to another fp16. But those two functions at least you can comment out and disregard.

@ardfork
Copy link
Contributor Author

ardfork commented May 30, 2023

I pushed my rebase.

But it doesn't work, it compile but during inferring, it SIGABRT with "Cannot find Symbol", setting AMD_LOG_LEVEL to try to debug show:

:3:hip_module.cpp           :469 : 4773487481 us: 12874: [tid:0x7f00dce85740]  hipLaunchKernel ( 0x7efe2bfb1a20, {80,480,1}, {16,4,1}, 0x7ffe178f3210, 0, stream:<null> )
:3:devprogram.cpp           :2676: 4773530241 us: 12874: [tid:0x7f00dce85740] Using Code Object V4.
:3:devprogram.cpp           :2979: 4773534804 us: 12874: [tid:0x7f00dce85740] For Init/Fini: Kernel Name: _Z52ncclKernelLL128Debug_SendRecv_RING_SIMPLE_Sum_int8_tP11ncclDevCommmP8ncclWork
:3:devprogram.cpp           :2979: 4773534813 us: 12874: [tid:0x7f00dce85740] For Init/Fini: Kernel Name: _Z47ncclKernelLL128_SendRecv_RING_SIMPLE_Sum_int8_tP11ncclDevCommmP8ncclWork
:3:devprogram.cpp           :2979: 4773534815 us: 12874: [tid:0x7f00dce85740] For Init/Fini: Kernel Name: _Z47ncclKernelDebug_SendRecv_RING_SIMPLE_Sum_int8_tP11ncclDevCommmP8ncclWork
:3:devprogram.cpp           :2979: 4773534817 us: 12874: [tid:0x7f00dce85740] For Init/Fini: Kernel Name: _Z42ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_tP11ncclDevCommmP8ncclWork
:1:hip_global.cpp           :97  : 4773534820 us: 12874: [tid:0x7f00dce85740] Cannot find Symbol with name: _Z27rms_norm_row_product_kernelP6__halfPfii

I don't really understand why it fail while launching kernel, if we remove this call, it will fail on another kernel (silu_mul_cuda_kernel). Those lines were here before, I don't see any significant change that would break things. Other similar project with similar codebase work well when hipified, like GPTQ for example, and it also launch kernel the same way. I verified and exllama_ext.so correctly have those symbols, I'm afraid I don't know what is the cause of this and I don't have contact with any person knowledgeable with HIP.

@turboderp
Copy link
Owner

It may be the -fno-gpu-rdc flag. It could be that the HIPifying code just doesn't know how to deal with linking multiple files. You could try creating a single .cu file that just #includes all the others, and then just have that one file (along with the .cpp files) at the top of cuda_ext.py.

@jmoney7823956789378
Copy link

"Cannot find Symbol"

Same here. Sorry I couldn't help more.

@jmoney7823956789378
Copy link

I'm considering pulling the trigger and returning my two MI60s before I'm stuck with them for life... Thinking about buying A4000s to fill the 4 PCIe slots the MI60s hold currently, and would bring me to the same total VRAM.
Am I being too pussy, or is ROCm just doomed?

@BlankParenthesis
Copy link
Contributor

It may be the -fno-gpu-rdc flag.

This seems to to be the correct assessment. Trying to build even a trivial extension without this flag causes SIGABRT "Cannot find Symbol": seems like it was there for a good reason.

I'm not entirely sure if adding this flag initially solved the problem either: I suspect it's likely that it simply moved compile-time linking errors into runtime. Then, these errors manifested in _cuda_raise which was stubbed. By stubbing the macro, I think it may have been removing all kernel calls, resulting in no errors and unusually high speeds.

The error itself (undefined hidden symbol: __llvm_amdgcn_rcp_f16) is strange, but seems to only be caused the call to h2rcp(sum) in q4v2_mlp.cu. I tried a naïve approach of calculating the reciprocal by hand:

half2 r;
r.x = 1.0 / sum.x;
r.y = 1.0 / sum.y

This compiles, but chat mode still seems to give gibberish for me. I'm not sure if that's because of this or something else not working.

@jmoney7823956789378
Copy link

This compiles, but chat mode still seems to give gibberish for me.

Ardfork's original pull req had this issue too. I was benchmarking good speeds, but had not thought to check perplexity nor chatbot.
Chatbot just dumps trash and perplexity fails.

@BlankParenthesis
Copy link
Contributor

Oddly enough, the inference benchmark seems to mostly work for me now. It shows a perplexity of 5.6797 which is about in line with the 13B model I used for it. I'm not sure if this is some coincidence or because of some difference between the two, such as config.

@jmoney7823956789378
Copy link

Oddly enough, the inference benchmark seems to mostly work for me now. It shows a perplexity of 5.6797 which is about in line with the 13B model I used for it. I'm not sure if this is some coincidence or because of some difference between the two, such as config.

Could you include your changes in a PR or maybe just a line num + edit? I'd like to give it a try, for continuity.

@BlankParenthesis
Copy link
Contributor

BlankParenthesis commented Jun 1, 2023

The only changes made was undoing any changes to pytorch regarding -fno-gpu-rdc and removing -fgpu-rdc from the extra_cuda_cflags list in cuda_exy.py. You can probably just change -fgpu-rdc to -fno-gpu-rdc in there though and skip fixing torch.

The only other thing is the reciprocal change to q4v2_mlp.cu I noted above:

@@ -29 +29,3 @@
-    half2 r = h2rcp(sum);
+    half2 r;
+    r.x = 1.0 / sum.x;
+    r.y = 1.0 / sum.y;

Which I'm not sure is correct, and is probably slower than ideal even if it is.

@turboderp
Copy link
Owner

I'm surprised that compiles. But if there's a problem with the fused MLP you could try running with -mlp normal, then it shouldn't use any code in q4v2_mlp.cu. Also you can try `-mm pytorch_only' which should avoid using q4v2_matmul.cu. It won't run great like that, but there would be less CUDA code.

@BlankParenthesis
Copy link
Contributor

I'm surprised that compiles.

Is it that bad 😄
I'll freely admit this is not my usual domain.

you could try running with -mlp normal

This seems to have no effect on things, perplexity doesn't change and chatbot outputs are similar.

Also you can try -mm pytorch_only

This doesn't seem to affect perplexity, but in chat mode it makes things a little more "sane?" Still gibberish, but a lot less of repeating the same tokens and shorter outputs. For reference, it does indeed run at about 2/3 of the usual speed.

@jmoney7823956789378
Copy link

1685658009_WindowsTerminal_W3svi0yyOP
1685664669_WindowsTerminal_ql6RZ3ymlQ
1685664773_WindowsTerminal_1gBuEDd82i

We got too cocky...

@jmoney7823956789378
Copy link

jmoney7823956789378 commented Jun 2, 2023

1685664982_WindowsTerminal_C51GYZTW6Q

-mlp normal -mm pytorch only

Are we so back???
Sidenote: 65B performs around 1t/s. it's over

@BlankParenthesis
Copy link
Contributor

I can confirm that setting both at once provides what seems like correct output. Since this disables what sounds like a lot of the main optimizations of exllama, the low speeds are not unexpected. I would speculate that the earlier speed of about 3t/s is a more reasonable estimate of performance if the rest can be made to work.

@jmoney7823956789378
Copy link

I can confirm that setting both at once provides what seems like correct output. Since this disables what sounds like a lot of the main optimizations of exllama, the low speeds are not unexpected. I would speculate that the earlier speed of about 3t/s is a more reasonable estimate of performance if the rest can be made to work.

1685666696_chrome_cyGY4oYN5r

Actually, I seem to get much higher speeds natively on the ooba webui.

@ardfork
Copy link
Contributor Author

ardfork commented Jun 6, 2023

So it was the commit with the most inoffensive name that fuck it. It was the fault of 167f601. Please @turboderp don't hide code change in an "Update todo" commit.

@ardfork
Copy link
Contributor Author

ardfork commented Jun 6, 2023

Alright, I quickly fixed that, don't know why it was causing issue.

Now that PR should be good. It is ready to be reviewed and merged.

@jmoney7823956789378
Copy link

jmoney7823956789378 commented Jun 6, 2023

Alright, I quickly fixed that, don't know why it was causing issue.

Now that PR should be good. It is ready to be reviewed and merged.

(base) root@a3eef14b80f1:/mnt/textgen/exllama# python test_benchmark_inference.py -d ../models/Neko-Institute-of-Science_LLaMA-65B-4bit-128g/ -p -ppl -gs 20,20
Successfully preprocessed all matching files.
 -- Tokenizer: ../models/Neko-Institute-of-Science_LLaMA-65B-4bit-128g/tokenizer.model
 -- Model config: ../models/Neko-Institute-of-Science_LLaMA-65B-4bit-128g/config.json
 -- Model: ../models/Neko-Institute-of-Science_LLaMA-65B-4bit-128g/llama-65b-4bit-128g.safetensors
 -- Sequence length: 2048
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- --rmsnorm_no_half2
 -- --rope_no_half2
 -- --matmul_no_half2
 -- --silu_no_half2
 -- Options: ['gpu_split: 20,20', 'perf', 'perplexity']
 ** Time, Load model: 48.31 seconds
 ** Time, Load tokenizer: 0.02 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): yes
 ** VRAM, Model: [cuda:0] 20,901.84 MB - [cuda:1] 12,563.61 MB
 -- Warmup pass 1...
 ** Time, Warmup: 24.68 seconds
 -- Warmup pass 2...
 ** Time, Warmup: 24.23 seconds
 -- Warmup pass 3...
 ** Time, Warmup: 24.24 seconds
 -- Inference, first pass.
 ** Time, Inference: 24.32 seconds
 ** Speed: 78.94 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 4.10 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 4.81 tokens/second
 ** VRAM, Inference: [cuda:0] 4,284.22 MB - [cuda:1] 2,890.08 MB
 ** VRAM, Total: [cuda:0] 25,186.06 MB - [cuda:1] 15,453.68 MB
 -- Loading dataset...
 -- Testing..........
 ** Perplexity: 4.1895
(base) root@a3eef14b80f1:/mnt/textgen/exllama#

Looks good! Working with your new commit, here's 65B on two MI60.
On 33B, I get identical speeds for 2 GPUs vs 1 GPU.

@turboderp
Copy link
Owner

Yeah sorry about that. PyCharm makes it a little too easy to commit and push changes. I'll be more careful.

@turboderp turboderp merged commit 43e3059 into turboderp:master Jun 6, 2023
@turboderp
Copy link
Owner

Well, this is great. Thanks guys. I'm still a little amazed it works at all with so few changes at the end of the day.

Based on some of these benchmarks I suspect there are lots of AMD specific optimizations to look at, too. The prompt speed on the MI60 is really low, for instance, compared to the speed per token which is kind of decent for a GPU from 2018. Seems the kernels are 4-5 times slower than my 4090+3090 setup, while rocBLAS is about 13 times slower. Seems off, but who knows.

@turboderp
Copy link
Owner

turboderp commented Jun 7, 2023

By the way, the next big optimization I'm going for will be using CUDA Graphs. And "as of my knowledge cutoff in September 2021" there's no ROCm equivalent. Does anyone know if this has changed since? If not I'll try to keep the regular, HIP-compatible CUDA stuff as is and implement graphs within a #ifndef USE_ROCM scope from the start so we won't have to keep coming back to it.

@ardfork
Copy link
Contributor Author

ardfork commented Jun 7, 2023

It might work, was added in around May 2021. And was released in ROCm 5.3 which was released in October 2022, most people are probably using >= 5.4.2.

In pytorch they use a bunch of #if !defined(USE_ROCM) || ROCM_VERSION >= 50300

@turboderp
Copy link
Owner

I've started work on moving to graphs. It's in the "graphs" branch for now, since I'd like someone to at least try it out on ROCm before I start breaking stuff for people.

@jmoney7823956789378
Copy link

I've started work on moving to graphs. It's in the "graphs" branch for now, since I'd like someone to at least try it out on ROCm before I start breaking stuff for people.

Looks like we're just coming up with some translation errors:

root@a3eef14b80f1:/mnt/textgen/exllama# python test_benchmark_inference.py -d ../models/TheBloke_WizardLM-30B-GPTQ/ -p -ppl -gs 10,10
Successfully preprocessed all matching files.
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/root/miniconda3/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/textgen/exllama/test_benchmark_inference.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "/mnt/textgen/exllama/model.py", line 5, in <module>
    import cuda_ext
  File "/mnt/textgen/exllama/cuda_ext.py", line 42, in <module>
    exllama_ext = load(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/2] /opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/mnt/textgen/exllama/exllama_ext -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -U__HIP_NO_HALF_CONVERSIONS__ --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip -o rms_norm.cuda.o
FAILED: rms_norm.cuda.o
/opt/rocm-5.4.2/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/mnt/textgen/exllama/exllama_ext -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THC -isystem /root/miniconda3/lib/python3.10/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /root/miniconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -U__HIP_NO_HALF_CONVERSIONS__ --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -c /mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip -o rms_norm.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
In file included from /mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:5:
/mnt/textgen/exllama/exllama_ext/hip_func/../util_hip.cuh:44:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    hipDeviceSynchronize();
    ^~~~~~~~~~~~~~~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/../util_hip.cuh:58:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    hipDeviceSynchronize();
    ^~~~~~~~~~~~~~~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:141:5: error: unknown type name 'cudaGraphNode_t'; did you mean 'hipGraphNode_t'?
    cudaGraphNode_t rms_norm_node0;
    ^~~~~~~~~~~~~~~
    hipGraphNode_t
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:1074:30: note: 'hipGraphNode_t' declared here
typedef struct hipGraphNode* hipGraphNode_t;
                             ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:142:5: error: unknown type name 'cudaGraphNode_t'; did you mean 'hipGraphNode_t'?
    cudaGraphNode_t rms_norm_node1;
    ^~~~~~~~~~~~~~~
    hipGraphNode_t
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:1074:30: note: 'hipGraphNode_t' declared here
typedef struct hipGraphNode* hipGraphNode_t;
                             ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:143:5: error: unknown type name 'cudaGraphNode_t'; did you mean 'hipGraphNode_t'?
    cudaGraphNode_t rms_norm_node2;
    ^~~~~~~~~~~~~~~
    hipGraphNode_t
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:1074:30: note: 'hipGraphNode_t' declared here
typedef struct hipGraphNode* hipGraphNode_t;
                             ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:165:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    hipSetDevice(device_index);
    ^~~~~~~~~~~~ ~~~~~~~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:181:5: error: unknown type name 'cudaMemsetParams'; did you mean 'hipMemsetParams'?
    cudaMemsetParams memsetParams = {0};
    ^~~~~~~~~~~~~~~~
    hipMemsetParams
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:1127:3: note: 'hipMemsetParams' declared here
} hipMemsetParams;
  ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:212:5: error: unknown type name 'cudaKernelNodeParams'; did you mean 'hipKernelNodeParams'?
    cudaKernelNodeParams params1 = { (void*) rms_norm_kernel_func1, blocks, threads, 0, args1, nullptr };
    ^~~~~~~~~~~~~~~~~~~~
    hipKernelNodeParams
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:1119:3: note: 'hipKernelNodeParams' declared here
} hipKernelNodeParams;
  ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:212:38: error: no viable conversion from 'void *' to 'dim3'
    cudaKernelNodeParams params1 = { (void*) rms_norm_kernel_func1, blocks, threads, 0, args1, nullptr };
                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:934:16: note: candidate constructor (the implicit copy constructor) not viable: cannot convert argument of incomplete type 'void *' to 'const dim3 &' for 1st argument
typedef struct dim3 {
               ^
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:934:16: note: candidate constructor (the implicit move constructor) not viable: cannot convert argument of incomplete type 'void *' to 'dim3 &&' for 1st argument
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:939:35: note: candidate constructor not viable: cannot convert argument of incomplete type 'void *' to 'uint32_t' (aka 'unsigned int') for 1st argument
    constexpr __host__ __device__ dim3(uint32_t _x = 1, uint32_t _y = 1, uint32_t _z = 1) : x(_x), y(_y), z(_z){};
                                  ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:212:69: error: no viable conversion from 'dim3' to 'void **'
    cudaKernelNodeParams params1 = { (void*) rms_norm_kernel_func1, blocks, threads, 0, args1, nullptr };
                                                                    ^~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:212:77: error: no viable conversion from 'dim3' to 'void *'
    cudaKernelNodeParams params1 = { (void*) rms_norm_kernel_func1, blocks, threads, 0, args1, nullptr };
                                                                            ^~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:212:96: error: cannot initialize a member subobject of type 'unsigned int' with an rvalue of type 'std::nullptr_t'
    cudaKernelNodeParams params1 = { (void*) rms_norm_kernel_func1, blocks, threads, 0, args1, nullptr };
                                                                                               ^~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:213:5: error: unknown type name 'cudaKernelNodeParams'; did you mean 'hipKernelNodeParams'?
    cudaKernelNodeParams params2 = { (void*) rms_norm_kernel_func2, blocks, threads, 0, args2, nullptr };
    ^~~~~~~~~~~~~~~~~~~~
    hipKernelNodeParams
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:1119:3: note: 'hipKernelNodeParams' declared here
} hipKernelNodeParams;
  ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:213:38: error: no viable conversion from 'void *' to 'dim3'
    cudaKernelNodeParams params2 = { (void*) rms_norm_kernel_func2, blocks, threads, 0, args2, nullptr };
                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:934:16: note: candidate constructor (the implicit copy constructor) not viable: cannot convert argument of incomplete type 'void *' to 'const dim3 &' for 1st argument
typedef struct dim3 {
               ^
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:934:16: note: candidate constructor (the implicit move constructor) not viable: cannot convert argument of incomplete type 'void *' to 'dim3 &&' for 1st argument
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:939:35: note: candidate constructor not viable: cannot convert argument of incomplete type 'void *' to 'uint32_t' (aka 'unsigned int') for 1st argument
    constexpr __host__ __device__ dim3(uint32_t _x = 1, uint32_t _y = 1, uint32_t _z = 1) : x(_x), y(_y), z(_z){};
                                  ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:213:69: error: no viable conversion from 'dim3' to 'void **'
    cudaKernelNodeParams params2 = { (void*) rms_norm_kernel_func2, blocks, threads, 0, args2, nullptr };
                                                                    ^~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:213:77: error: no viable conversion from 'dim3' to 'void *'
    cudaKernelNodeParams params2 = { (void*) rms_norm_kernel_func2, blocks, threads, 0, args2, nullptr };
                                                                            ^~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:213:96: error: cannot initialize a member subobject of type 'unsigned int' with an rvalue of type 'std::nullptr_t'
    cudaKernelNodeParams params2 = { (void*) rms_norm_kernel_func2, blocks, threads, 0, args2, nullptr };
                                                                                               ^~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:217:9: error: use of undeclared identifier 'cudaGraphCreate'; did you mean 'hipGraphCreate'?
        cudaGraphCreate(&contexts[device_index].rms_norm_graph, 0);
        ^~~~~~~~~~~~~~~
        hipGraphCreate
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:5552:12: note: 'hipGraphCreate' declared here
hipError_t hipGraphCreate(hipGraph_t* pGraph, unsigned int flags);
           ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:217:9: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
        cudaGraphCreate(&contexts[device_index].rms_norm_graph, 0);
        ^~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:219:9: error: use of undeclared identifier 'cudaGraphAddMemsetNode'; did you mean 'hipGraphAddMemsetNode'?
        cudaGraphAddMemsetNode(&contexts[device_index].rms_norm_node0, contexts[device_index].rms_norm_graph, nullptr, 0, &memsetParams);
        ^~~~~~~~~~~~~~~~~~~~~~
        hipGraphAddMemsetNode
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:6139:12: note: 'hipGraphAddMemsetNode' declared here
hipError_t hipGraphAddMemsetNode(hipGraphNode_t* pGraphNode, hipGraph_t graph,
           ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:219:9: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
        cudaGraphAddMemsetNode(&contexts[device_index].rms_norm_node0, contexts[device_index].rms_norm_graph, nullptr, 0, &memsetParams);
        ^~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:220:9: error: use of undeclared identifier 'cudaGraphAddKernelNode'; did you mean 'hipGraphAddKernelNode'?
        cudaGraphAddKernelNode(&contexts[device_index].rms_norm_node1, contexts[device_index].rms_norm_graph, &contexts[device_index].rms_norm_node0, 1, &params1);
        ^~~~~~~~~~~~~~~~~~~~~~
        hipGraphAddKernelNode
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:5848:12: note: 'hipGraphAddKernelNode' declared here
hipError_t hipGraphAddKernelNode(hipGraphNode_t* pGraphNode, hipGraph_t graph,
           ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:220:9: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
        cudaGraphAddKernelNode(&contexts[device_index].rms_norm_node1, contexts[device_index].rms_norm_graph, &contexts[device_index].rms_norm_node0, 1, &params1);
        ^~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:221:9: error: use of undeclared identifier 'cudaGraphAddKernelNode'; did you mean 'hipGraphAddKernelNode'?
        cudaGraphAddKernelNode(&contexts[device_index].rms_norm_node2, contexts[device_index].rms_norm_graph, &contexts[device_index].rms_norm_node1, 1, &params2);
        ^~~~~~~~~~~~~~~~~~~~~~
        hipGraphAddKernelNode
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:5848:12: note: 'hipGraphAddKernelNode' declared here
hipError_t hipGraphAddKernelNode(hipGraphNode_t* pGraphNode, hipGraph_t graph,
           ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:221:9: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
        cudaGraphAddKernelNode(&contexts[device_index].rms_norm_node2, contexts[device_index].rms_norm_graph, &contexts[device_index].rms_norm_node1, 1, &params2);
        ^~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:223:9: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
        hipGraphInstantiate(&contexts[device_index].rms_norm_graphExec, contexts[device_index].rms_norm_graph, nullptr, nullptr, 0);
        ^~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:230:13: error: use of undeclared identifier 'cudaGraphExecMemsetNodeSetParams'; did you mean 'hipGraphExecMemsetNodeSetParams'?
            cudaGraphExecMemsetNodeSetParams(contexts[device_index].rms_norm_graphExec, contexts[device_index].rms_norm_node0, &memsetParams);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            hipGraphExecMemsetNodeSetParams
/opt/rocm-5.4.2/include/hip/hip_runtime_api.h:6175:12: note: 'hipGraphExecMemsetNodeSetParams' declared here
hipError_t hipGraphExecMemsetNodeSetParams(hipGraphExec_t hGraphExec, hipGraphNode_t node,
           ^
/mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:230:13: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
            cudaGraphExecMemsetNodeSetParams(contexts[device_index].rms_norm_graphExec, contexts[device_index].rms_norm_node0, &memsetParams);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fatal error: too many errors emitted, stopping now [-ferror-limit=]
9 warnings and 20 errors generated when compiling for gfx1030.
ninja: build stopped: subcommand failed.

@ardfork
Copy link
Contributor Author

ardfork commented Jun 7, 2023

I've started work on moving to graphs. It's in the "graphs" branch for now, since I'd like someone to at least try it out on ROCm before I start breaking stuff for people.

Some are not handled by hipify_python, I made a small patch, do you want me to open a PR for it?

diff --git a/exllama_ext/cuda_func/rms_norm.cu b/exllama_ext/cuda_func/rms_norm.cu
index d327a65..34f643e 100644
--- a/exllama_ext/cuda_func/rms_norm.cu
+++ b/exllama_ext/cuda_func/rms_norm.cu
@@ -3,6 +3,17 @@
 #include "../util.cuh"
 #include "../matrix.cuh"

+#if defined(USE_ROCM)
+#define cudaGraphAddKernelNode hipGraphAddKernelNode
+#define cudaGraphAddMemsetNode hipGraphAddMemsetNode
+#define cudaGraphCreate hipGraphCreate
+#define cudaGraphExecKernelNodeSetParams hipGraphExecKernelNodeSetParams
+#define cudaGraphExecMemsetNodeSetParams hipGraphExecMemsetNodeSetParams
+#define cudaGraphNode_t hipGraphNode_t
+#define cudaKernelNodeParams hipKernelNodeParams
+#define cudaMemsetParams hipMemsetParams
+#endif
+
 const int THREADS_X = 32;
 const int THREADS_Y = 8;
 const int BLOCKSIZE_X = 16;
@@ -207,8 +218,18 @@ void rms_norm_cuda

     void* args1[] = { &x, &temp, &rows, &dim };
     void* args2[] = { &x, &w, &out, &temp, &epsilon, &r_dim, &rows, &dim };
-    cudaKernelNodeParams params1 = { (void*) rms_norm_kernel_func1, blocks, threads, 0, args1, nullptr };
-    cudaKernelNodeParams params2 = { (void*) rms_norm_kernel_func2, blocks, threads, 0, args2, nullptr };
+    cudaKernelNodeParams params1 = { .func           = (void *)rms_norm_kernel_func1,
+                                     .gridDim        = blocks,
+                                     .blockDim       = threads,
+                                     .sharedMemBytes = 0,
+                                     .kernelParams   = args1,
+                                     .extra          = nullptr };
+    cudaKernelNodeParams params2 = { .func           = (void *)rms_norm_kernel_func2,
+                                     .gridDim        = blocks,
+                                     .blockDim       = threads,
+                                     .sharedMemBytes = 0,
+                                     .kernelParams   = args2,
+                                     .extra          = nullptr };

     if (!contexts[device_index].rms_norm_graph_init)
     {

It is working correctly with a small test. But seem a bit slower:

 ** Time, Inference: 3.71 seconds
 ** Speed: 516.85 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 22.12 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 26.88 tokens/second

@ardfork
Copy link
Contributor Author

ardfork commented Jun 7, 2023

Also, I want to detail a bit my patch making process.

First I try to run it, if they are compilation errors, I note the files which have some. In our case, it's rms_norm.hip.

Secondly I run hipify-perl -examine $file on the hipified (by pytorch) file, hipify-perl is a bit more generic instead of being a tool specific for pytorch. In our case, it results in this:

  warning: exllama/exllama_ext/hip_func/rms_norm.hip:79: unsupported device function "__shfl_down_sync": //         acc += __shfl_down_sync(0xffffffff, acc, offset);

[HIPIFY] info: file 'exllama/exllama_ext/hip_func/rms_norm.hip' statistics:
  CONVERTED refs count: 31
  TOTAL lines of code: 270
  WARNINGS: 1
[HIPIFY] info: CONVERTED refs by names:
  cudaGraphAddKernelNode => hipGraphAddKernelNode: 3
  cudaGraphAddMemsetNode => hipGraphAddMemsetNode: 2
  cudaGraphCreate => hipGraphCreate: 2
  cudaGraphExecKernelNodeSetParams => hipGraphExecKernelNodeSetParams: 3
  cudaGraphExecMemsetNodeSetParams => hipGraphExecMemsetNodeSetParams: 2
  cudaGraphNode_t => hipGraphNode_t: 4
  cudaKernelNodeParams => hipKernelNodeParams: 3
  cudaMemsetParams => hipMemsetParams: 2

I then copy those cuda => hip line into the .cu files, put them inside #if defined(USE_ROCM) and use a simple sed s/ \(.\+\) => \(.\+\): [0-9]\+/#define \1 \2.

Now, I try again to run it, and fix the remaining errors. In our case, that was with difference between cudaMemsetParams and hipMemsetParams. They have the same members in their struct but not in the same order. We can easily fix that by using designated initializer style.

@turboderp
Copy link
Owner

Thanks, I added the changes.

As for the speed, I'm hoping it's not because HIP emulates graphs too clumsily, e.g. by destroying and recreating them whenever parameters change, which would defeat the purpose somewhat. But maybe it's just constant overhead and it will start to flip with longer graphs.

@ardfork
Copy link
Contributor Author

ardfork commented Jun 7, 2023

Well the nice thing is that HIP is open source so you can actually check. The less nice thing is that not many people have time to read such complex code.

@jmoney7823956789378
Copy link

These are the only errors I'm still getting, didn't get a chance to test ardfork's own changes and ended up just pulling most recent commit.

In file included from /mnt/textgen/exllama/exllama_ext/hip_func/rms_norm.hip:8:
/mnt/textgen/exllama/exllama_ext/hip_func/../hip_compat.cuh:19:26: error: unknown type name 'hipblasStatus_t'
__host__ __forceinline__ hipblasStatus_t __compat_hipblasHgemm(hipblasHandle_t    handle,
                         ^
/mnt/textgen/exllama/exllama_ext/hip_func/../hip_compat.cuh:19:64: error: unknown type name 'hipblasHandle_t'
__host__ __forceinline__ hipblasStatus_t __compat_hipblasHgemm(hipblasHandle_t    handle,
                                                               ^
/mnt/textgen/exllama/exllama_ext/hip_func/../hip_compat.cuh:20:64: error: unknown type name 'hipblasOperation_t'
                                                               hipblasOperation_t transA,
                                                               ^
/mnt/textgen/exllama/exllama_ext/hip_func/../hip_compat.cuh:21:64: error: unknown type name 'hipblasOperation_t'
                                                               hipblasOperation_t transB,
                                                               ^
/mnt/textgen/exllama/exllama_ext/hip_func/../hip_compat.cuh:34:48: error: unknown type name 'hipblasHalf'
                        reinterpret_cast<const hipblasHalf *>(alpha),
                                               ^
/mnt/textgen/exllama/exllama_ext/hip_func/../hip_compat.cuh:35:48: error: unknown type name 'hipblasHalf'
                        reinterpret_cast<const hipblasHalf *>(AP), lda,
                                               ^
/mnt/textgen/exllama/exllama_ext/hip_func/../hip_compat.cuh:36:48: error: unknown type name 'hipblasHalf'
                        reinterpret_cast<const hipblasHalf *>(BP), ldb,
                                               ^
/mnt/textgen/exllama/exllama_ext/hip_func/../hip_compat.cuh:37:48: error: unknown type name 'hipblasHalf'
                        reinterpret_cast<const hipblasHalf *>(beta),
                                               ^
/mnt/textgen/exllama/exllama_ext/hip_func/../hip_compat.cuh:38:42: error: unknown type name 'hipblasHalf'
                        reinterpret_cast<hipblasHalf *>(CP), ldc);
                                         ^

@ardfork
Copy link
Contributor Author

ardfork commented Jun 7, 2023

Yes, he changed my patch a bit, which cause this error.

Also, compiling with -O3 give me a ~2.5% speed boost.

Edit: And -O2 have similar speed boost but that somehow make my GPU do some coil whine.

@turboderp
Copy link
Owner

Damn it. I'll put it in the .cu file for now, but I really would prefer to have the translation fixes in one place rather than spread throughout the CUDA code, because there's a lot of refactoring coming. Is there a way I can force it to generate HIP code without an AMD GPU just to check if it still compiles?

@ardfork
Copy link
Contributor Author

ardfork commented Jun 8, 2023

Is there a way I can force it to generate HIP code without an AMD GPU just to check if it still compiles?

Actually HIP is supposed to run on both NVIDIA and AMD. And it's supposed to do so without any performance loss when running NVIDIA, as it's just a wrapper for CUDA function. I don't know if you need the whole ROCm stack, probably not. Try just installing pytorch ROCm in a new venv: pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.4.2. If that isn't enough, either install ROCm stuff with your distro or try the docker route.

but I really would prefer to have the translation fixes in one place rather than spread throughout the CUDA code

Well in that case, you would just need to add #include <hipblas/hipblas.h> in hip_compat.cuh I guess.

@jmoney7823956789378
Copy link

You're both some pretty damn smart people.
Here's 65B again, on new "graph" branch:

-- Inference, first pass.
 ** Time, Inference: 24.43 seconds
 ** Speed: 78.59 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 3.99 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 4.64 tokens/second
 ** VRAM, Inference: [cuda:0] 4,284.22 MB - [cuda:1] 2,890.08 MB
 ** VRAM, Total: [cuda:0] 25,185.82 MB - [cuda:1] 15,453.44 MB
 -- Loading dataset...
 -- Testing..........
 ** Perplexity: 4.1894

at least it's consistent, right?

@fxmarty
Copy link

fxmarty commented Jul 28, 2023

Hi @ardfork , I don't get why half2 is disabled by default when compiling for rocm. Could you explain?

@ardfork
Copy link
Contributor Author

ardfork commented Jul 28, 2023

Hi @ardfork , I don't get why half2 is disabled by default when compiling for rocm. Could you explain?

Because it is broken. Search "gibberish" in this thread for more information. Also check out #146 for a fix.

@fxmarty
Copy link

fxmarty commented Jul 28, 2023

Thank you, missed this one! I hit the same issue.

@fxmarty
Copy link

fxmarty commented Aug 4, 2023

In this PR for me cuda_compat.cuh gets renamed by the hipifier to hip_compat.cuh which conflicts with the file of the same name. I guess this is not intended, is it?

Given this in cuda_compat.cuh:

#if defined(__CUDA_ARCH__) || defined(USE_ROCM)
#if __CUDA_ARCH__ < 700 || defined(USE_ROCM)

__device__ __forceinline__ void atomicAdd(half* address, half val) { atomicAdd_half(address, val); }

#if __CUDA_ARCH__ < 600 || defined(USE_ROCM)
__device__ __forceinline__ void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
#endif

#endif
#endif

Without renaming this file I am getting a no matching function for call to 'atomicAdd' error.

@jammm
Copy link

jammm commented Sep 5, 2023

In this PR for me cuda_compat.cuh gets renamed by the hipifier to hip_compat.cuh which conflicts with the file of the same name. I guess this is not intended, is it?

Given this in cuda_compat.cuh:

#if defined(__CUDA_ARCH__) || defined(USE_ROCM)
#if __CUDA_ARCH__ < 700 || defined(USE_ROCM)

__device__ __forceinline__ void atomicAdd(half* address, half val) { atomicAdd_half(address, val); }

#if __CUDA_ARCH__ < 600 || defined(USE_ROCM)
__device__ __forceinline__ void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
#endif

#endif
#endif

Without renaming this file I am getting a no matching function for call to 'atomicAdd' error.

atomicAdd for half isn't supported on RDNA1-3 by the way (not sure about older archs). You should always use atomicAdd for float, or implement your own CAS loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants