17:42:00 24/12/09 09:42:00 INFO RapidsPluginUtils: RAPIDS Accelerator build: Map(url -> https://github.com/NVIDIA/spark-rapids.git, branch -> HEAD,
revision -> fb2f72df881582855393135d6e574111716ec7bb, version -> 24.12.0-SNAPSHOT, date -> 2024-12-08T10:18:05Z, cudf_version -> 24.12.0-SNAPSHOT, user -> root)
17:42:00 24/12/09 09:42:00 INFO RapidsPluginUtils: RAPIDS Accelerator JNI build: Map(url -> https://github.com/NVIDIA/spark-rapids-jni.git, branch -> HEAD, gpu_architectures -> 70;75;80;86;90,
revision -> 7842da04bd6486f2389c441f0e1aa094c5eef469, version -> 24.12.0-SNAPSHOT, date -> 2024-12-08T05:18:03Z, user -> root)
17:42:00 24/12/09 09:42:00 INFO RapidsPluginUtils: cudf build: Map(url -> https://github.com/rapidsai/cudf.git, branch -> HEAD, gpu_architectures -> 70;75;80;86;90,
revision -> 439321edb43082fb75f195b6be2049c925279089, version -> 24.12.0-SNAPSHOT, date -> 2024-12-08T05:17:59Z, user -> root)
17:42:00 24/12/09 09:42:00 INFO RapidsPluginUtils: RAPIDS Accelerator Private Map(url -> https://gitlab-master.nvidia.com/nvspark/spark-rapids-private.git, branch -> HEAD,
revision -> 2f08e20170b66621d1f14ee0fb351ef5630ea811, version -> 24.12.0-SNAPSHOT, date -> 2024-12-08T03:33:41Z, user -> root)
...
...
...
17:42:11 ============================= test session starts ==============================
17:42:11 platform linux -- Python 3.10.16, pytest-7.4.4, pluggy-1.5.0 -- /opt/conda/bin/python3
17:42:11 cachedir: .pytest_cache
17:42:11 rootdir: /home/jenkins/agent/workspace/jenkins-examples-udf-examples-native-179/examples/UDF-Examples/RAPIDS-accelerated-UDFs
17:42:11 configfile: pytest.ini
17:42:11 plugins: order-1.3.0, xdist-3.6.1
17:42:11 collecting ... collected 8 items
17:42:11
17:42:16 src/main/python/rapids_udf_test.py::test_hive_simple_udf 24/12/09 09:42:16 WARN GpuOverrides:
17:42:16 ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
17:42:16 @Expression <AttributeReference> i#7 could run on GPU
17:42:16 @Expression <AttributeReference> s#8 could run on GPU
17:42:16
17:42:17 PASSED [ 12%]
17:42:17 src/main/python/rapids_udf_test.py::test_hive_generic_udf 24/12/09 09:42:17 WARN GpuOverrides:
17:42:17 ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
17:42:17 @Expression <AttributeReference> s#18 could run on GPU
17:42:17
17:42:18 24/12/09 09:42:17 WARN GpuOverrides:
17:42:18 ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
17:42:18 @Expression <AttributeReference> dec#26 could run on GPU
17:42:18
17:42:18 PASSED [ 25%]
17:42:18 src/main/python/rapids_udf_test.py::test_hive_simple_udf_native 24/12/09 09:42:18 WARN GpuOverrides:
17:42:18 ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
17:42:18 @Expression <AttributeReference> s#34 could run on GPU
17:42:18
17:42:18 #
17:42:18 # A fatal error has been detected by the Java Runtime Environment:
17:42:18 #
17:42:18 # SIGSEGV (0xb) at pc=0x00007f02da529598, pid=177, tid=0x00007f02c9bff700
17:42:18 #
17:42:18 # JRE version: OpenJDK Runtime Environment (8.0_432) (build 1.8.0_432-8u432-ga~us1-0ubuntu2~20.04-ga)
17:42:18 # Java VM: OpenJDK 64-Bit Server VM (25.432-bga mixed mode linux-amd64 compressed oops)
17:42:18 # Problematic frame:
17:42:18 # C [libcuda.so.1+0x186598]
I can reproduce a very similar failure locally. The stack trace and error message are not identical, but the same test fails inside the same Docker container that the Jenkins job uses. Here is the error from my local run. Note that the pc is null below, whereas it is non-null in the log above.
...
PASSED [ 25%]
src/main/python/rapids_udf_test.py::test_hive_simple_udf_native 24/12/28 02:31:47 WARN GpuOverrides:
! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
@Expression <AttributeReference> s#34 could run on GPU
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000000000000000, pid=5045, tid=0x0000764457fff700
#
# JRE version: OpenJDK Runtime Environment (8.0_432) (build 1.8.0_432-8u432-ga~us1-0ubuntu2~20.04-ga)
# Java VM: OpenJDK 64-Bit Server VM (25.432-bga mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C 0x0000000000000000
...
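To compare the two crashes side by side, the header fields can be pulled out of each log with a small script. This is only a sketch: `summarize_crash`, the two regular expressions, and the sample constants are hypothetical helpers written for this comparison, not part of spark-rapids or the JVM tooling, and the patterns assume the header layout seen in the two excerpts in this thread.

```python
import re

# Assumed layout of the JVM fatal-error header, based only on the excerpts above.
SIGNAL_RE = re.compile(
    r"#\s+(?P<signal>SIG[A-Z]+) \(0x[0-9a-fA-F]+\) "
    r"at pc=(?P<pc>0x[0-9a-fA-F]+), pid=(?P<pid>\d+)"
)
# The problematic frame is either "[library+offset]" or a bare address.
FRAME_RE = re.compile(
    r"#\s+(?P<kind>[A-Za-z])\s+"
    r"(?:\[(?P<lib>[^\]+]+)\+(?P<off>0x[0-9a-fA-F]+)\]|(?P<addr>0x[0-9a-fA-F]+))"
)


def summarize_crash(log_text):
    """Extract signal, pc, pid and problematic frame from a crash header."""
    summary = {"signal": None, "pc": None, "pid": None,
               "null_pc": None, "library": None, "offset": None}
    sig = SIGNAL_RE.search(log_text)
    if sig:
        summary["signal"] = sig.group("signal")
        summary["pc"] = int(sig.group("pc"), 16)
        summary["pid"] = int(sig.group("pid"))
        summary["null_pc"] = summary["pc"] == 0
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        # The frame itself is printed on the line after this marker.
        if "Problematic frame:" in line and i + 1 < len(lines):
            frame = FRAME_RE.search(lines[i + 1])
            if frame and frame.group("lib"):
                summary["library"] = frame.group("lib")
                summary["offset"] = int(frame.group("off"), 16)
            break
    return summary


# Header lines copied from the Jenkins log and from the local run.
JENKINS_SAMPLE = """\
17:42:18 # SIGSEGV (0xb) at pc=0x00007f02da529598, pid=177, tid=0x00007f02c9bff700
17:42:18 # Problematic frame:
17:42:18 # C [libcuda.so.1+0x186598]
"""
LOCAL_SAMPLE = """\
# SIGSEGV (0xb) at pc=0x0000000000000000, pid=5045, tid=0x0000764457fff700
# Problematic frame:
# C 0x0000000000000000
"""

if __name__ == "__main__":
    print(summarize_crash(JENKINS_SAMPLE))
    print(summarize_crash(LOCAL_SAMPLE))
```

Both runs report SIGSEGV in a native frame; the Jenkins run crashes at an offset inside `libcuda.so.1`, while the local run has a null pc and no library attached to the frame.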
I managed to narrow down when this error started. It appears to have been introduced by NVIDIA/cccl#2266. With the exact same versions of CUDA, cudf, spark-rapids-jni, and the plugin, the same test passes with cccl older than commit NVIDIA/cccl@f53e72555, but it fails with cccl at or after that commit. I am now trying to reproduce the issue in a cudf C++ unit test.
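For anyone who wants to repeat this kind of narrowing, the search over commits can be sketched as a plain binary search. This is an illustration under assumptions, not a record of how the commit above was actually found: `first_bad_commit` is a hypothetical helper, the commit labels other than `f53e72555` are placeholders, and the `is_bad` predicate here is a stub standing in for "rebuild against that cccl commit and run `test_hive_simple_udf_native`". It also assumes the failure is monotonic, that is, every commit at or after the first bad one fails.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> str:
    """Return the oldest failing commit.

    `commits` must be ordered oldest to newest, and the newest commit is
    assumed to be bad. Each call to `is_bad` stands for one full
    build-and-test cycle, so the number of calls is kept logarithmic.
    """
    if not commits:
        raise ValueError("commits must not be empty")
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


if __name__ == "__main__":
    # Placeholder history; only f53e72555 comes from this thread.
    history = ["older-1", "older-2", "older-3", "f53e72555", "newer-1", "newer-2"]
    boundary = history.index("f53e72555")

    def stub_is_bad(commit: str) -> bool:
        # Stub: a real predicate would rebuild and run the failing test.
        return history.index(commit) >= boundary

    print(first_bad_commit(history, stub_is_bad))
```

With real builds, `git bisect` does the same bookkeeping directly on the repository; the function above only shows the logic.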
Describe the bug
first seen in examples-udf-examples-native run:179
https://github.com/NVIDIA/spark-rapids-examples/tree/branch-24.12/examples/UDF-Examples/RAPIDS-accelerated-UDFs
core dump: (complete file hs_err_pid177.log)
Steps/Code to reproduce bug
build and test case at: https://github.com/NVIDIA/spark-rapids-examples/blob/branch-24.12/examples/UDF-Examples/RAPIDS-accelerated-UDFs/README.md#building-and-run-the-tests-without-native-code-examples
https://github.com/NVIDIA/spark-rapids-examples/blob/branch-24.12/examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/rapids_udf_test.py
Expected behavior
A clear and concise description of what you expected to happen.
Environment details (please complete the following information)
Additional context
Add any other context about the problem here.