
[BUG] CUDA illegal access error is triggering split and retry logic #9364

Closed
jlowe opened this issue Oct 2, 2023 · 3 comments · Fixed by NVIDIA/spark-rapids-jni#1477

jlowe (Contributor) commented on Oct 2, 2023

While working on GPU core dump functionality, I manually triggered a GPU exception and then noticed odd behavior with spark.range: it triggered the OOM split-and-retry logic rather than the CUDA fatal exception I expected. Here are the repro steps from a spark-shell with GPU memory pooling disabled:

import ai.rapids.cudf._
val data = DeviceMemoryBuffer.allocate(16)
Cuda.memset(data.getAddress, 0xa5.toByte, 16L)
val cv = new ColumnView(DType.INT64, 2147483000L, java.util.Optional.empty[java.lang.Long](), data, data, data)

At this point it triggered a CudaFatalException like this:

ai.rapids.cudf.CudaFatalException: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-183-cuda11/target/libcudf-install/include/rmm/device_uvector.hpp:318: cudaErrorIllegalAddress an illegal memory access was encountered
  at ai.rapids.cudf.ColumnView.makeCudfColumnView(Native Method)
  at ai.rapids.cudf.ColumnVector.initViewHandle(ColumnVector.java:217)
  at ai.rapids.cudf.ColumnView.<init>(ColumnView.java:175)
  at ai.rapids.cudf.ColumnView.<init>(ColumnView.java:165)
  ... 49 elided

With a "sticky" illegal access error triggered, I then tried doing a simple spark.range query, expecting the task listener to shut down the entire process due to the expected CudaFatalException that should have occurred. Instead, I got this:

scala> spark.range(10).show
23/10/02 19:45:25 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: the number of rows generated is too small to be split 5!
	at com.nvidia.spark.rapids.GpuRangeIterator.$anonfun$reduceRowsNumberByHalf$2(basicPhysicalOperators.scala:1081)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuRangeIterator.$anonfun$reduceRowsNumberByHalf$1(basicPhysicalOperators.scala:1078)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:441)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:556)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:494)
	at com.nvidia.spark.rapids.GpuRangeIterator.$anonfun$next$17(basicPhysicalOperators.scala:1047)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuRangeIterator.next(basicPhysicalOperators.scala:1029)
	at com.nvidia.spark.rapids.GpuRangeIterator.next(basicPhysicalOperators.scala:988)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:285)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:282)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:255)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:299)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
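
For reference, the expected path is roughly the following kind of task failure handling. This is only a minimal sketch assuming Spark's TaskContext/TaskFailureListener API, not the plugin's actual listener code:

import org.apache.spark.TaskContext
import org.apache.spark.util.TaskFailureListener
import ai.rapids.cudf.CudaFatalException

// Sketch only: treat a fatal CUDA error as unrecoverable and bring the
// executor process down instead of letting any retry logic run.
def installFatalCudaListener(tc: TaskContext): Unit = {
  tc.addTaskFailureListener(new TaskFailureListener {
    override def onTaskFailure(context: TaskContext, error: Throwable): Unit = {
      error match {
        case _: CudaFatalException =>
          // The CUDA context is permanently broken; no task on this executor
          // can make progress, so exiting is the only safe option.
          System.exit(1)
        case _ =>
          // Non-fatal failures are left to normal Spark error handling.
      }
    }
  })
}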

When doing a spark.read.parquet instead, I got what I expected:

Caused by: ai.rapids.cudf.CudaFatalException: Fatal CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-183-cuda11/thirdparty/cudf/cpp/include/cudf/detail/utilities/pinned_host_vector.hpp:157: 700 cudaErrorIllegalAddress an illegal memory access was encountered

The OOM behavior leads me to believe we're accidentally interpreting the GPU illegal access error (or the resulting CudaFatalException) as a retryable OOM, at least in the context of GpuRange.
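
A minimal sketch of the distinction the retry logic presumably needs to make. This is a hypothetical wrapper, not the actual RmmRapidsRetryIterator code; the exception classes are the ones shown in the traces above:

import scala.annotation.tailrec
import ai.rapids.cudf.CudaFatalException
import com.nvidia.spark.rapids.jni.SplitAndRetryOOM

object RetrySketch {
  // Hypothetical retry wrapper: the only point is that the two exception
  // types must take different paths.
  @tailrec
  def runWithSplitRetry[T](attempt: () => T, splitInput: () => Unit): T = {
    val attemptResult: Option[T] =
      try {
        Some(attempt())
      } catch {
        case fatal: CudaFatalException =>
          // Sticky CUDA error: splitting and retrying cannot succeed, so let
          // it propagate and trigger the fatal-error shutdown path instead.
          throw fatal
        case _: SplitAndRetryOOM =>
          // A genuine OOM: shrink the work and go around again.
          splitInput()
          None
      }
    attemptResult match {
      case Some(value) => value
      case None        => runWithSplitRetry(attempt, splitInput)
    }
  }
}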

jlowe added the "bug" and "? - Needs Triage" labels on Oct 2, 2023
jlowe transferred this issue from NVIDIA/spark-rapids-jni on Oct 2, 2023
mattahrens removed the "? - Needs Triage" label on Oct 3, 2023
revans2 (Collaborator) commented on Oct 5, 2023

I was able to repro this so now I will dig into what is happening.

revans2 (Collaborator) commented on Oct 5, 2023

Why RMM Why????

CAUGHT BAD ALLOC std::bad_alloc: CUDA error at: /...libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:71: cudaErrorIllegalAddress an illegal memory access was encountered

Will file an issue for RMM.

revans2 (Collaborator) commented on Oct 5, 2023

Actually it looks like we might be handling errors incorrectly.

https://github.com/rapidsai/rmm/blob/da3ed7b9f987d729cb4f4003acc242ce3e830ca6/include/rmm/detail/error.hpp#L196-L200
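
If that is the problem, the mapping presumably needs to key off the CUDA error code behind the allocation failure rather than the C++ exception type alone. A hedged sketch of that classification (illustrative only, not the actual spark-rapids-jni fix; the numeric codes are the standard CUDA runtime values):

// Illustrative classification only: an allocation failure is retryable only
// when the underlying CUDA error is a genuine out-of-memory. Anything else,
// e.g. cudaErrorIllegalAddress (700), leaves the context in a sticky failed
// state and must be surfaced as fatal rather than retried.
sealed trait AllocFailure
case object RetryableOom   extends AllocFailure
case object FatalCudaError extends AllocFailure

def classifyAllocFailure(cudaErrorCode: Int): AllocFailure = {
  val cudaErrorMemoryAllocation = 2 // CUDA runtime code for "out of memory"
  if (cudaErrorCode == cudaErrorMemoryAllocation) RetryableOom else FatalCudaError
}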
