
[BUG] CUDA illegal access error is triggering split and retry logic #9364

Closed
jlowe opened this issue Oct 2, 2023 · 3 comments · Fixed by NVIDIA/spark-rapids-jni#1477

jlowe (Contributor) commented on Oct 2, 2023

While working on GPU core dump functionality, I manually triggered a GPU exception and then noticed odd behavior with spark.range: it triggered the OOM split-and-retry logic rather than the CUDA fatal exception I expected. Here are the repro steps from a spark-shell with GPU memory pooling disabled:

import ai.rapids.cudf._
val data = DeviceMemoryBuffer.allocate(16)
Cuda.memset(data.getAddress, 0xa5.toByte, 16L)
val cv = new ColumnView(DType.INT64, 2147483000L, java.util.Optional.empty[java.lang.Long](), data, data, data)

At this point it triggered a CudaFatalException like this:

ai.rapids.cudf.CudaFatalException: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-183-cuda11/target/libcudf-install/include/rmm/device_uvector.hpp:318: cudaErrorIllegalAddress an illegal memory access was encountered
  at ai.rapids.cudf.ColumnView.makeCudfColumnView(Native Method)
  at ai.rapids.cudf.ColumnVector.initViewHandle(ColumnVector.java:217)
  at ai.rapids.cudf.ColumnView.<init>(ColumnView.java:175)
  at ai.rapids.cudf.ColumnView.<init>(ColumnView.java:165)
  ... 49 elided

With a "sticky" illegal access error triggered, I then tried doing a simple spark.range query, expecting the task listener to shut down the entire process due to the expected CudaFatalException that should have occurred. Instead, I got this:

scala> spark.range(10).show
23/10/02 19:45:25 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: the number of rows generated is too small to be split 5!
	at com.nvidia.spark.rapids.GpuRangeIterator.$anonfun$reduceRowsNumberByHalf$2(basicPhysicalOperators.scala:1081)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuRangeIterator.$anonfun$reduceRowsNumberByHalf$1(basicPhysicalOperators.scala:1078)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:441)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:556)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:494)
	at com.nvidia.spark.rapids.GpuRangeIterator.$anonfun$next$17(basicPhysicalOperators.scala:1047)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuRangeIterator.next(basicPhysicalOperators.scala:1029)
	at com.nvidia.spark.rapids.GpuRangeIterator.next(basicPhysicalOperators.scala:988)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:285)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:282)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:255)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:299)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
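
For reference, the expected path is roughly the following kind of task failure handling. This is only a minimal sketch assuming Spark's TaskContext/TaskFailureListener API, not the plugin's actual listener code:

import org.apache.spark.TaskContext
import org.apache.spark.util.TaskFailureListener
import ai.rapids.cudf.CudaFatalException

// Sketch only: treat a fatal CUDA error as unrecoverable and bring the
// executor process down instead of letting any retry logic run.
def installFatalCudaListener(tc: TaskContext): Unit = {
  tc.addTaskFailureListener(new TaskFailureListener {
    override def onTaskFailure(context: TaskContext, error: Throwable): Unit = {
      error match {
        case _: CudaFatalException =>
          // The CUDA context is permanently broken; no task on this executor
          // can make progress, so exiting is the only safe option.
          System.exit(1)
        case _ =>
          // Non-fatal failures are left to normal Spark error handling.
      }
    }
  })
}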

When doing a spark.read.parquet instead, I got what I expected:

Caused by: ai.rapids.cudf.CudaFatalException: Fatal CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-183-cuda11/thirdparty/cudf/cpp/include/cudf/detail/utilities/pinned_host_vector.hpp:157: 700 cudaErrorIllegalAddress an illegal memory access was encountered

The OOM behavior leads me to believe we're accidentally interpreting the GPU illegal access error (or the resulting CudaFatalException) as a retryable OOM, at least in the context of GpuRange.
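
A minimal sketch of the distinction the retry logic presumably needs to make. This is a hypothetical wrapper, not the actual RmmRapidsRetryIterator code; the exception classes are the ones shown in the traces above:

import scala.annotation.tailrec
import ai.rapids.cudf.CudaFatalException
import com.nvidia.spark.rapids.jni.SplitAndRetryOOM

object RetrySketch {
  // Hypothetical retry wrapper: the only point is that the two exception
  // types must take different paths.
  @tailrec
  def runWithSplitRetry[T](attempt: () => T, splitInput: () => Unit): T = {
    val attemptResult: Option[T] =
      try {
        Some(attempt())
      } catch {
        case fatal: CudaFatalException =>
          // Sticky CUDA error: splitting and retrying cannot succeed, so let
          // it propagate and trigger the fatal-error shutdown path instead.
          throw fatal
        case _: SplitAndRetryOOM =>
          // A genuine OOM: shrink the work and go around again.
          splitInput()
          None
      }
    attemptResult match {
      case Some(value) => value
      case None        => runWithSplitRetry(attempt, splitInput)
    }
  }
}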

jlowe added the "bug" and "? - Needs Triage" labels on Oct 2, 2023
jlowe transferred this issue from NVIDIA/spark-rapids-jni on Oct 2, 2023
mattahrens removed the "? - Needs Triage" label on Oct 3, 2023
revans2 (Collaborator) commented on Oct 5, 2023

I was able to repro this so now I will dig into what is happening.

revans2 (Collaborator) commented on Oct 5, 2023

Why RMM Why????

CAUGHT BAD ALLOC std::bad_alloc: CUDA error at: /...libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:71: cudaErrorIllegalAddress an illegal memory access was encountered

Will file an issue for RMM.

revans2 (Collaborator) commented on Oct 5, 2023

Actually it looks like we might be handling errors incorrectly.

https://github.com/rapidsai/rmm/blob/da3ed7b9f987d729cb4f4003acc242ce3e830ca6/include/rmm/detail/error.hpp#L196-L200
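
If that is the problem, the mapping presumably needs to key off the CUDA error code behind the allocation failure rather than the C++ exception type alone. A hedged sketch of that classification (illustrative only, not the actual spark-rapids-jni fix; the numeric codes are the standard CUDA runtime values):

// Illustrative classification only: an allocation failure is retryable only
// when the underlying CUDA error is a genuine out-of-memory. Anything else,
// e.g. cudaErrorIllegalAddress (700), leaves the context in a sticky failed
// state and must be surfaced as fatal rather than retried.
sealed trait AllocFailure
case object RetryableOom   extends AllocFailure
case object FatalCudaError extends AllocFailure

def classifyAllocFailure(cudaErrorCode: Int): AllocFailure = {
  val cudaErrorMemoryAllocation = 2 // CUDA runtime code for "out of memory"
  if (cudaErrorCode == cudaErrorMemoryAllocation) RetryableOom else FatalCudaError
}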
