[BUG] Sanitizer reports misaligned error when doing reduction on short type values in cuda12 ENV #14192

res-life · 2023-09-26T09:42:13Z

Describe the bug
compute-sanitizer reports a misaligned-address error when running reductions on short (int16) values in a CUDA 12 environment.

Steps/Code to reproduce bug
Code:

#include <cudf/aggregation.hpp>
#include <cudf/reduction.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/types.hpp>
#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <cstdint>
#include <memory>
#include <vector>

// Same type the original column_wrapper<int16_t> alias resolved to.
using int16_col = cudf::test::fixed_width_column_wrapper<int16_t>;

struct MyReductionTest : public cudf::test::BaseFixture {};

TEST_F(MyReductionTest, AlignmentIssue)
{
  std::vector<int16_t> v({1, 2, 3});
  int16_col col(v.begin(), v.end());

  // First reduction: min over the INT16 column.
  auto const output_dtype = cudf::data_type{cudf::type_id::INT16};
  auto const min_agg      = cudf::make_min_aggregation<cudf::reduce_aggregation>();
  std::unique_ptr<cudf::scalar> reduction1 = cudf::reduce(col, *min_agg, output_dtype);

  // Second reduction: per this report, the error appears when another reduction follows the min.
  auto const output_dtype2 = cudf::data_type{cudf::type_id::BOOL8};
  auto const any_agg       = cudf::make_any_aggregation<cudf::reduce_aggregation>();
  std::unique_ptr<cudf::scalar> reduction2 = cudf::reduce(col, *any_agg, output_dtype2);
}

Compile and Run with sanitizer:

compute-sanitizer --tool memcheck \
    --launch-timeout 600 \
    --error-exitcode -2 \
    --log-file "./sanitizer_for_pid_%p.log" \
    ./my-exe

Print sanitizer log:

head sanitizer_for_pid_42.log 
========= COMPUTE-SANITIZER
========= Invalid __shared__ read of size 16 bytes
=========     at 0x38c0 in void cub::CUB_101702_600_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_101702_600_700_750_800_860_900_NS::DeviceReducePolicy<short, short, int, cudf::DeviceMin>::Policy600, thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short *, int, cudf::DeviceMin, short>(T2, T3, T4, T5, T6)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x8 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2d18f2]
=========                in /usr/lib64/libcuda.so.1
=========     Host Frame:__cudart1049 [0xd9bd3b]
=========                in /home/chongg/code/spark-rapids-jni/target/cmake-build/gtests/./my-exe

The main errors are:

Invalid __shared__ read of size 16 bytes
Address 0x8 is misaligned
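
To make these two messages concrete: a 16-byte access is aligned only when its address is a multiple of 16, and 0x8 is not. Below is a minimal host-side sketch of that arithmetic. `is_aligned` is a hypothetical helper written for illustration; it is not part of CUB, cuDF, or compute-sanitizer, and it says nothing about why the kernel was reported as reading from that address.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `address` is a multiple of `access_size`.
// Assumes `access_size` is a power of two (2, 4, 8, 16, ...).
constexpr bool is_aligned(std::uintptr_t address, std::size_t access_size)
{
  return (address & (access_size - 1)) == 0;
}

// Values from the sanitizer log: a 16-byte read at shared-memory address 0x8.
static_assert(!is_aligned(0x8, 16), "0x8 is not a multiple of 16");
// The same address would be acceptable for an 8-byte access.
static_assert(is_aligned(0x8, 8), "0x8 is a multiple of 8");
// The next 16-byte boundary after 0x0 is 0x10.
static_assert(is_aligned(0x10, 16), "0x10 is a multiple of 16");
```

The checks are evaluated at compile time, so the snippet only needs to compile to confirm the arithmetic.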

Other observations:

The code performs two reductions.
The errors occur when another reduction follows a min reduction.

Expected behavior
The reductions should run under compute-sanitizer without reporting errors.

Environment overview

Environment location: Docker
Method of cuDF install: from source

Environment details
Docker image: urm.nvidia.com/sw-spark-docker/plugin-jni:centos7-cuda12.0.1-blossom
CUDA 12. For more details, refer to NVIDIA/spark-rapids-jni#1349

Additional context
Refer to NVIDIA/spark-rapids-jni#1349

davidwendt · 2023-09-26T16:54:42Z

I'm not able to reproduce the error on my local libcudf build.

$ /usr/local/cuda/bin/compute-sanitizer --version
NVIDIA (R) Compute Sanitizer
Copyright (c) 2020-2023 NVIDIA Corporation
Version 2022.4.1

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0

davidwendt · 2023-09-26T17:00:44Z

We have a few reduction tests in libcudf that use a min aggregation followed by a call to a max aggregation. Here is one that executes with int16 types:

cudf/cpp/tests/reductions/reduction_tests.cpp

Lines 152 to 157 in 3196f6c

  EXPECT_EQ(
    this->template reduction_test<T>(col, *cudf::make_min_aggregation<reduce_aggregation>()).first,
    expected_min_result);
  EXPECT_EQ(
    this->template reduction_test<T>(col, *cudf::make_max_aggregation<reduce_aggregation>()).first,
    expected_max_result);

All of the tests are run with compute-sanitizer in our nightly builds.
Curious if these tests also fail for you in your environment as well.

ttnghia · 2023-09-27T13:50:21Z

I suspect that this is the same as my previously reported issue: #13685

res-life · 2023-09-27T13:55:56Z

> Curious if these tests also fail for you in your environment as well.

Yes, they also fail.
I first tried to compile the tests directly against the cuDF code, but the compilation failed. So I copied reduction_tests.cpp into the spark-rapids-jni project and compiled it into an executable there.

davidwendt · 2023-09-28T23:08:58Z

This seems specific to your test environment since our nightly compute-sanitizer does not fail running reduction_tests.cpp and I cannot reproduce the error locally either.

Perhaps you can provide some details on the environment.
Is libcudf.so being built from source or downloaded from conda, etc?
Is there a way to duplicate an environment where we could debug libcudf (e.g. compile and link specific modules)?

I see mention of a docker image in the description. Does the error occur only on centos7?
This kind of error would not be silent and so should be occurring always. Does it show up in your normal testing?
In other words, does the error occur without running compute-sanitizer?
If not, perhaps this could be a compute-sanitizer bug on CUDA-12 on centos7 instead.

jlowe · 2023-10-02T14:24:57Z

> Is libcudf.so being built from source or downloaded from conda, etc?

This is a custom build of libcudf for the RAPIDS Accelerator, where we are compiling libcudf as a PIC static library that is ultimately linked into a shared library and used by the JVM.

> Is there a way to duplicate an environment where we could debug libcudf (e.g. compile and link specific modules)?
> I see mention of a docker image in the description. Does the error occur only on centos7?

The Docker image is used to produce this build, see https://github.com/NVIDIA/spark-rapids-jni/blob/branch-23.10/CONTRIBUTING.md#building-in-the-docker-container. After pulling the spark-rapids-jni repository and executing the build-in-docker script, the libcudf install will be in spark-rapids-jni/target/libcudf-install/. You can use the run-in-docker script to get an interactive shell within the same environment as the build environment if desired.

The result of the build can be run on any supported OS (e.g. Ubuntu). I don't know whether the error has been reproduced on other operating systems.

> This kind of error would not be silent and so should be occurring always. Does it show up in your normal testing?

No, it only occurs when running under compute-sanitizer, and specifically when compiling with CUDA 12. I agree that at this point it appears to be a compute-sanitizer bug specific to CUDA 12.

GregoryKimball · 2023-11-09T17:46:55Z

> I agree that at this point it appears to be a compute-sanitizer bug specific to CUDA 12.

If true, then the action here will be to file an nvbug with the compute-sanitizer team.

davidwendt · 2024-02-13T10:38:05Z

Curious if this is perhaps resolved by the fixes for NVIDIA/spark-rapids-jni#1567.

davidwendt · 2024-12-13T20:14:02Z

This is likely the same compute-sanitizer bug mentioned in #15258 and fixed in 12.5.
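
If a test harness needs to decide whether a sanitizer report like this one might come from the suspected tool bug, one option is to compare the toolkit version against the range discussed in this thread. The sketch below assumes the affected range is CUDA 12.0 up to, but not including, 12.5. That range is an inference from the comments here (reported on 12.0, believed fixed in 12.5), not a documented fact. `cuda_version`, `decode_cuda_version`, and `in_suspect_range` are hypothetical names. The decoding follows the integer encoding used by `CUDART_VERSION` and `cudaRuntimeGetVersion`, which is major * 1000 + minor * 10.

```cpp
// Hypothetical value type for a decoded CUDA toolkit version.
struct cuda_version {
  int major;
  int minor;
};

// Decodes the integer form used by CUDART_VERSION / cudaRuntimeGetVersion,
// e.g. 12000 -> 12.0 and 12050 -> 12.5.
constexpr cuda_version decode_cuda_version(int encoded)
{
  return cuda_version{encoded / 1000, (encoded % 1000) / 10};
}

// Assumption taken from this thread: versions 12.0 through 12.4 are suspect.
constexpr bool in_suspect_range(int encoded)
{
  return decode_cuda_version(encoded).major == 12 &&
         decode_cuda_version(encoded).minor < 5;
}

static_assert(in_suspect_range(12000), "12.0 is where the error was reported");
static_assert(!in_suspect_range(12050), "12.5 is assumed to contain the fix");
```

In a real build the argument would come from `CUDART_VERSION` at compile time or from `cudaRuntimeGetVersion` at run time; the literals above are only examples.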

res-life added the bug and Needs Triage labels Sep 26, 2023

res-life mentioned this issue Sep 26, 2023

[BUG] nightly ai.rapids.cudf.ReductionTest failed in cuda12 ENV after enable sanitizer NVIDIA/spark-rapids-jni#1349

Open

GregoryKimball added this to libcudf Sep 26, 2023

GregoryKimball moved this to Burndown PRs in libcudf Sep 26, 2023

GregoryKimball moved this from Burndown PRs to Slip in libcudf Sep 26, 2023

GregoryKimball removed the status in libcudf Oct 26, 2023

GregoryKimball removed this from libcudf Oct 26, 2023

GregoryKimball added the 0 - Backlog label and removed the Needs Triage label Nov 9, 2023

davidwendt closed this as completed Dec 13, 2024
