CUDA Arch Error #18
Hi @HakubunLuo, thanks for reporting this issue. Could you please share the complete log you get from both the CMake configure and build commands? It looks like the …
My CMakeLists.txt:

```cmake
cmake_minimum_required(VERSION 3.25)
set(CMAKE_CUDA_ARCHITECTURES 80 86)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)
project(demo_test CUDA CXX)
find_package(CUDA REQUIRED)
find_package(CUDAToolkit REQUIRED)
include_directories(/usr/local/cuda/include)
include_directories(/usr/include)
aux_source_directory(src SOURCES)
include(cmake/CPM.cmake)
CPMAddPackage(
  NAME bght
  URL "https://github.com/owensgroup/BGHT/archive/refs/heads/main.zip"
  OPTIONS
    "build_tests OFF"
    "build_benchmarks OFF"
)
set(CUDA_ARCHS 86)
cuda_add_library(demo SHARED demo.cu demo.cuh)
target_link_libraries(demo bght)
set_target_properties(demo PROPERTIES CUDA_ARCHITECTURES ${CUDA_ARCHS})
add_executable(demo_test main.cpp)
target_link_libraries(demo_test PRIVATE demo)
target_link_libraries(demo_test PRIVATE CUDA::cudart)
set_target_properties(demo_test PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
```

I just solved this by replacing `cuda_add_library` with `add_library`. Now, when I use the map in my kernel function, there is another problem:

```cuda
template<class HashMap>
__global__ void createLookupHashKernel(HashMap lookupTable, const int *coordinates, int N, int max_x, int max_y)
```

The size of the table is about 100,000 entries. It was very slow when I passed the `HashMap` directly, so I used `HashMap *lookupTable` and passed a pointer into the kernel instead. However, that produces a memory error like this:
My operations in the kernel function are:

```cuda
int key = ...;
int idx = ...;
using key_type = int;
using value_type = int;
using pair_type = bght::pair<key_type, value_type>;
auto block = cooperative_groups::this_thread_block();
auto tile = cooperative_groups::tiled_partition<HashMap::bucket_size>(block);
pair_type pair{key, idx};
lookupTable->insert(pair, tile);
```

The hash map was created by:

```cuda
std::size_t capacity = num_size * 2;
auto invalid_key = std::numeric_limits<key_type>::max();
auto invalid_value = std::numeric_limits<value_type>::max();
bght::bcht<key_type, value_type> hash_lookupTable(capacity, invalid_key, invalid_value);
```
You should not pass the hash tables by reference or as pointers to kernels; pass them by value. It looks like you are dealing with a pointer here: `lookupTable->insert(pair, tile);`. Here is an example: Lines 176 to 210 in 140b80f.
Also, if every thread is trying to insert a key, you will need to serialize the insertions within a tile. See how we do it here: BGHT/include/detail/kernels.cuh, Lines 26 to 64 in 140b80f.
Yes, there are lots of threads trying to insert into the hash map. I referenced that code to edit my kernel:

```cuda
template<class HashMap>
__global__ void createLookupHashKernel(HashMap lookupTable, const int *coordinates, int N, int max_x, int max_y) {
    auto idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        int x = coordinates[idx * 4 + 1];
        int y = coordinates[idx * 4 + 2];
        int z = coordinates[idx * 4 + 3];
        int key = getIndex(x, y, z, max_x, max_y);
        using key_type = int;
        using value_type = int;
        using pair_type = bght::pair<key_type, value_type>;
        auto block = cooperative_groups::this_thread_block();
        auto tile = cooperative_groups::tiled_partition<HashMap::bucket_size>(block);
        bool do_op = true;
        pair_type insertion_pair{key, (int) idx};
        bool success = true;
        // Do the insertion
        auto work_queue = tile.ballot(do_op);
        while (work_queue) {
            auto cur_rank = __ffs(work_queue) - 1;
            auto cur_pair = tile.shfl(insertion_pair, cur_rank);
            bool insertion_success = lookupTable.insert(cur_pair, tile);
            if (tile.thread_rank() == cur_rank) {
                do_op = false;
                success = insertion_success;
            }
            work_queue = tile.ballot(do_op);
        }
    }
}
```

I did not add:

```cuda
if (!tile.all(success)) {
    *map.d_build_success_ = false;
}
```

because I get: `'d_build_success_' is a private member of 'bght::bcht<int, int>'`.
If N is not a multiple of the bucket size you will run into issues: the insert function expects all threads in the tile to call it. This if statement is problematic:

```cuda
if (idx < N) {
}
```

You can replace it with a couple of lines to address this issue:

```cuda
auto thread_id = threadIdx.x + blockIdx.x * blockDim.x;
// tile
auto block = cooperative_groups::this_thread_block();
auto tile = cooperative_groups::tiled_partition<HashMap::bucket_size>(block);
if ((thread_id - tile.thread_rank()) >= N) {
    return;
}
bool do_op = false;
// load the input
if (thread_id < N) {
    int x = coordinates[thread_id * 4 + 1];
    int y = coordinates[thread_id * 4 + 2];
    int z = coordinates[thread_id * 4 + 3];
    int key = getIndex(x, y, z, max_x, max_y);
    do_op = true;
}
// the insertion loop goes here.
```

Correct, that variable is hidden. What you could do is just add another boolean argument to your kernel and set it to false if any of the insertions failed. In general, if you have a non-skewed distribution then insertion should succeed; otherwise you may need to decrease the load factor. Let me know if the modifications here work for you.
It works for the insert process. So if I also have a kernel function that reads the hash map from multiple threads, do I need to use the same operations?
Great! See: BGHT/include/detail/kernels.cuh, Lines 67 to 109 in 140b80f.
I use this to find values in the tables we created before:
You need to follow the insertion/find code. Again, this if statement is problematic:

```cuda
if (idx < N) {
}
```

and if values of …
I am confused by `InputIt first` and `InputIt last` in the examples; I only need to perform one search at a time. I noticed `if (thread_id < count) {` in kernels.cuh. However, I cannot just use one parameter `find_key` to search.
I understand that the keys you are using are different. You need to follow the same strategy you followed for insertion, which is similar for find as well. The two things you need to make sure happen are (1) all threads in the tile call the find function, and (2) within a tile you do the finds serially. See the comments here:

```cuda
template<class HashMap>
__global__
void createRulesTableByHashKernel(HashMap lookupTable, const int *coordinates,
                                  int N, int *rulesTable, int max_x, int max_y, int max_z, int kernel_size) {
    auto thread_id = threadIdx.x + blockIdx.x * blockDim.x;
    int half_kernel = kernel_size / 2;
    auto block = cooperative_groups::this_thread_block();
    auto tile = cooperative_groups::tiled_partition<HashMap::bucket_size>(block);
    if ((thread_id - tile.thread_rank()) >= N) { // this if statement makes sure (1) happens
        return;
    }
    bool do_op = false;
    typename HashMap::key_type find_key;
    typename HashMap::mapped_type result;
    auto idx = thread_id;
    // if (idx < N) { // violates (1)
    int x_start = ...
    int y_start = ...
    int z_start = ...
    int x_end = ...
    int y_end = ...
    int z_end = ...
    for (int x = x_start; x <= x_end; x++)
        for (int y = y_start; y <= y_end; y++)
            for (int z = z_start; z <= z_end; z++) {
                int lookup_idx = getIndex(x, y, z, max_x, max_y);
                // since we removed the if (idx < N) we may need to make sure the index is valid
                bool do_op = is_valid_index(lookup_idx);
                // is lookup_idx different per thread in the tile of size bucket_size?
                // if yes, you need to do the following so that (2) is satisfied:
                auto work_queue = tile.ballot(do_op);
                while (work_queue) {
                    auto cur_rank = __ffs(work_queue) - 1;
                    auto cur_key = tile.shfl(lookup_idx, cur_rank);
                    int target = lookupTable.find(cur_key, tile);
                    if (tile.thread_rank() == cur_rank) {
                        do_op = false;
                    }
                    work_queue = tile.ballot(do_op);
                }
                ...
            }
    // }
}
```
When I use it in my project, I get an error.
My GPU is an RTX 3060, with CUDA 11.4 on Ubuntu 18.04.
My CMakeLists.txt configuration is shown above.