[BUG] Sanitizer reports misaligned error when doing reduction on short type values in cuda12 ENV #14192

Closed
res-life opened this issue Sep 26, 2023 · 9 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working

Comments

@res-life
Contributor

Describe the bug
compute-sanitizer reports a misaligned-address error when running a reduction on int16 (short) values in a CUDA 12 environment.

Steps/Code to reproduce bug
Code:

#include <cudf/aggregation.hpp>
#include <cudf/reduction.hpp>
#include <cudf/types.hpp>
#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <cstdint>
#include <memory>
#include <type_traits>
#include <vector>

// Select the strings wrapper for string_view and the fixed-width wrapper otherwise.
template <typename T, typename SourceElementT = T>
using column_wrapper =
  std::conditional_t<std::is_same_v<T, cudf::string_view>,
                     cudf::test::strings_column_wrapper,
                     cudf::test::fixed_width_column_wrapper<T, SourceElementT>>;
using int16_col = column_wrapper<int16_t>;

struct MyReductionTest : public cudf::test::BaseFixture {};

TEST_F(MyReductionTest, AlignmentIssue)
{
  std::vector<int16_t> v({1, 2, 3});
  int16_col col(v.begin(), v.end());

  // First reduction: min over the INT16 column.
  auto const output_dtype = cudf::data_type{cudf::type_id::INT16};
  auto const min_agg      = cudf::make_min_aggregation<cudf::reduce_aggregation>();
  std::unique_ptr<cudf::scalar> reduction1 = cudf::reduce(col, *min_agg, output_dtype);

  // Second reduction: any over the same column, producing a BOOL8 result.
  auto const output_dtype2 = cudf::data_type{cudf::type_id::BOOL8};
  auto const any_agg       = cudf::make_any_aggregation<cudf::reduce_aggregation>();
  std::unique_ptr<cudf::scalar> reduction2 = cudf::reduce(col, *any_agg, output_dtype2);
}

Compile and run with the sanitizer:

compute-sanitizer --tool memcheck \
    --launch-timeout 600 \
    --error-exitcode -2 \
    --log-file "./sanitizer_for_pid_%p.log" \
    ./my-exe

Print sanitizer log:

head sanitizer_for_pid_42.log 
========= COMPUTE-SANITIZER
========= Invalid __shared__ read of size 16 bytes
=========     at 0x38c0 in void cub::CUB_101702_600_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_101702_600_700_750_800_860_900_NS::DeviceReducePolicy<short, short, int, cudf::DeviceMin>::Policy600, thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short *, int, cudf::DeviceMin, short>(T2, T3, T4, T5, T6)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x8 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2d18f2]
=========                in /usr/lib64/libcuda.so.1
=========     Host Frame:__cudart1049 [0xd9bd3b]
=========                in /home/chongg/code/spark-rapids-jni/target/cmake-build/gtests/./my-exe
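When a log like the one above runs long, it can help to pull out only the lines that name the error and the faulting address. The following is a minimal sketch, assuming the log uses the `=========` line prefix shown above; `summarize_sanitizer_log` is a hypothetical helper, not part of compute-sanitizer or cuDF.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the log lines that describe the error
// ("Invalid ...") or the faulting address ("Address ... misaligned"),
// with the "=========" prefix and leading spaces stripped.
inline std::vector<std::string> summarize_sanitizer_log(std::string const& log_text)
{
  std::string const prefix = "=========";
  std::vector<std::string> summary;
  std::istringstream stream(log_text);
  std::string line;
  while (std::getline(stream, line)) {
    // Skip lines that do not start with the sanitizer prefix.
    if (line.compare(0, prefix.size(), prefix) != 0) { continue; }
    auto const start = line.find_first_not_of(' ', prefix.size());
    if (start == std::string::npos) { continue; }
    std::string const body = line.substr(start);
    bool const is_error    = body.rfind("Invalid ", 0) == 0;
    bool const is_address =
      body.rfind("Address ", 0) == 0 && body.find("misaligned") != std::string::npos;
    if (is_error || is_address) { summary.push_back(body); }
  }
  return summary;
}
```

The helper only filters text; it makes no judgment about whether a reported error is genuine or a false positive from the tool.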

The main errors are:

Invalid __shared__ read of size 16 bytes
Address 0x8 is misaligned
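The two lines fit together: a 16-byte access is expected to be naturally aligned, meaning its address should be a multiple of 16, while 0x8 is only a multiple of 8. A minimal sketch of that arithmetic follows; the name `is_aligned` is hypothetical and not part of CUDA, CUB, or cuDF.

```cpp
#include <cstdint>

// Hypothetical helper: true when `address` is a multiple of `access_size`.
// Assumes `access_size` is a power of two.
constexpr bool is_aligned(std::uint64_t address, std::uint64_t access_size)
{
  return (address & (access_size - 1)) == 0;
}

// Values taken from the sanitizer log: a 16-byte read at address 0x8.
static_assert(!is_aligned(0x8, 16), "0x8 is not a multiple of 16");
// The same address would be acceptable for an 8-byte access.
static_assert(is_aligned(0x8, 8), "0x8 is a multiple of 8");
// The next 16-byte boundary after 0x8 is 0x10.
static_assert(is_aligned(0x10, 16), "0x10 is a multiple of 16");
```

This only illustrates why the sanitizer flags the address as misaligned for that access size. It says nothing about whether the kernel actually issues such a read.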

Other observations:

The code performs two reductions.
The errors occur when another reduction follows a min reduction.

Expected behavior
The reductions should run under compute-sanitizer without reported errors.

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: from source

Environment details
Docker image: urm.nvidia.com/sw-spark-docker/plugin-jni:centos7-cuda12.0.1-blossom
CUDA 12. For more details, see NVIDIA/spark-rapids-jni#1349

Additional context
Refer to NVIDIA/spark-rapids-jni#1349

@davidwendt
Contributor

I'm not able to reproduce the error on my local libcudf build.

$ /usr/local/cuda/bin/compute-sanitizer --version
NVIDIA (R) Compute Sanitizer
Copyright (c) 2020-2023 NVIDIA Corporation
Version 2022.4.1
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0

@davidwendt
Contributor

davidwendt commented Sep 26, 2023

We have a few reduction tests in libcudf that use a min aggregation followed by a max aggregation. Here is one that runs with int16 types:

EXPECT_EQ(
  this->template reduction_test<T>(col, *cudf::make_min_aggregation<reduce_aggregation>()).first,
  expected_min_result);
EXPECT_EQ(
  this->template reduction_test<T>(col, *cudf::make_max_aggregation<reduce_aggregation>()).first,
  expected_max_result);

All of the tests are run with compute-sanitizer in our nightly builds.
Curious if these tests also fail in your environment.

@GregoryKimball GregoryKimball moved this to Burndown PRs in libcudf Sep 26, 2023
@GregoryKimball GregoryKimball moved this from Burndown PRs to Slip in libcudf Sep 26, 2023
@ttnghia
Contributor

ttnghia commented Sep 27, 2023

I suspect that this is the same as my previously reported issue: #13685

@res-life
Contributor Author

> Curious if these tests also fail for you in your environment as well.

Yes, they also fail.
I first tried to compile directly against the cuDF code, but the build failed. So I copied reduction_tests.cpp into the spark-rapids-jni project and compiled it into an executable there.

@davidwendt
Contributor

This seems specific to your test environment, since our nightly compute-sanitizer run does not fail on reduction_tests.cpp and I cannot reproduce the error locally either.

Perhaps you can provide some details on the environment.
Is libcudf.so being built from source or downloaded from conda, etc?
Is there a way to duplicate an environment where we could debug libcudf (e.g. compile and link specific modules)?

I see mention of a docker image in the description. Does the error occur only on centos7?
This kind of error would not be silent, so it should always occur. Does it show up in your normal testing?
In other words, does the error occur without running compute-sanitizer?
If not, perhaps this could be a compute-sanitizer bug on CUDA-12 on centos7 instead.

@jlowe
Member

jlowe commented Oct 2, 2023

> Is libcudf.so being built from source or downloaded from conda, etc?

This is a custom build of libcudf for the RAPIDS Accelerator, where we are compiling libcudf as a PIC static library that is ultimately linked into a shared library and used by the JVM.

> Is there a way to duplicate an environment where we could debug libcudf (e.g. compile and link specific modules)?
> I see mention of a docker image in the description. Does the error occur only on centos7?

The Docker image is used to produce this build; see https://github.com/NVIDIA/spark-rapids-jni/blob/branch-23.10/CONTRIBUTING.md#building-in-the-docker-container. After pulling the spark-rapids-jni repository and running the build-in-docker script, the libcudf install will be in spark-rapids-jni/target/libcudf-install/. If desired, you can use the run-in-docker script to get an interactive shell in the same environment as the build.

The result of the build can be run on any supported OS (e.g. Ubuntu). I don't know whether the error has been reproduced on other OSes.

> This kind of error would not be silent and so should be occurring always. Does it show up in your normal testing?

No, it only occurs when running under compute-sanitizer, and specifically when compiling with CUDA 12. I agree that at this point it appears to be a compute-sanitizer bug specific to CUDA 12.

@GregoryKimball GregoryKimball removed the status in libcudf Oct 26, 2023
@GregoryKimball GregoryKimball removed this from libcudf Oct 26, 2023
@GregoryKimball
Contributor

> I agree that at this point it appears to be a compute-sanitizer bug specific to CUDA 12.
If true, then the action here will be to file an nvbug with the compute-sanitizer team.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment and removed Needs Triage Need team to review and classify labels Nov 9, 2023
@davidwendt
Contributor

Curious whether this is resolved by the fixes for NVIDIA/spark-rapids-jni#1567.

@davidwendt
Contributor

This is likely the same compute-sanitizer bug mentioned in #15258, which was fixed in 12.5.
