While working on GPU core dump functionality, I manually triggered a GPU exception and then noticed odd behavior with spark.range: it went through the OOM split-and-retry logic rather than surfacing the CUDA fatal exception I expected. Here are the repro steps from a spark-shell with GPU memory pooling disabled:
import ai.rapids.cudf._
// 16-byte device buffer filled with a recognizable pattern
val data = DeviceMemoryBuffer.allocate(16)
Cuda.memset(data.getAddress, 0xa5.toByte, 16L)
// Claim ~2.1 billion INT64 rows backed by that 16-byte buffer; the huge row count
// makes the native view construction touch device memory it doesn't own
val cv = new ColumnView(DType.INT64, 2147483000L, java.util.Optional.empty[java.lang.Long](), data, data, data)
At this point it triggered a CudaFatalException like this:
ai.rapids.cudf.CudaFatalException: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-183-cuda11/target/libcudf-install/include/rmm/device_uvector.hpp:318: cudaErrorIllegalAddress an illegal memory access was encountered
at ai.rapids.cudf.ColumnView.makeCudfColumnView(Native Method)
at ai.rapids.cudf.ColumnVector.initViewHandle(ColumnVector.java:217)
at ai.rapids.cudf.ColumnView.<init>(ColumnView.java:175)
at ai.rapids.cudf.ColumnView.<init>(ColumnView.java:165)
... 49 elided
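The resulting illegal-address error is sticky: every subsequent CUDA call on this context fails as well. As an illustration only (this hypothetical follow-up is not part of the original repro), any further cudf allocation confirms it:

// Hypothetical follow-up, for illustration: after a sticky cudaErrorIllegalAddress,
// any later device allocation or kernel launch on this context fails too.
try {
  ai.rapids.cudf.ColumnVector.fromLongs(1L, 2L, 3L).close()
} catch {
  // Ideally this surfaces as a CudaFatalException; as this issue shows, it can
  // instead come back looking like an allocation failure.
  case t: Throwable =>
    println(s"post-error CUDA call failed: ${t.getClass.getName}: ${t.getMessage}")
}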
With a "sticky" illegal access error triggered, I then tried doing a simple spark.range query, expecting the task listener to shut down the entire process due to the expected CudaFatalException that should have occurred. Instead, I got this:
scala> spark.range(10).show
23/10/02 19:45:25 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: the number of rows generated is too small to be split 5!
at com.nvidia.spark.rapids.GpuRangeIterator.$anonfun$reduceRowsNumberByHalf$2(basicPhysicalOperators.scala:1081)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
at com.nvidia.spark.rapids.GpuRangeIterator.$anonfun$reduceRowsNumberByHalf$1(basicPhysicalOperators.scala:1078)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:441)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:556)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:494)
at com.nvidia.spark.rapids.GpuRangeIterator.$anonfun$next$17(basicPhysicalOperators.scala:1047)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
at com.nvidia.spark.rapids.GpuRangeIterator.next(basicPhysicalOperators.scala:1029)
at com.nvidia.spark.rapids.GpuRangeIterator.next(basicPhysicalOperators.scala:988)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:285)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:282)
at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:255)
at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:299)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
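For context, the trace above is the plugin's split-and-retry machinery at work: an attempt that fails with a retryable GPU OOM gets its pending work split in half and retried, and SplitAndRetryOOM escapes as the task failure once the rows can no longer be split (the "too small to be split" message). A rough sketch of that control flow, with hypothetical names rather than the actual RmmRapidsRetryIterator code:

// Rough sketch of the control flow implied by the trace above; names and
// structure are illustrative, not the plugin's actual RmmRapidsRetryIterator.
def runWithSplitAndRetry[T, R](input: T, attempt: T => R, splitInHalf: T => T): R = {
  var current = input
  while (true) {
    try {
      return attempt(current)          // try to produce the next batch
    } catch {
      // A retryable GPU OOM is signaled with SplitAndRetryOOM: halve the work
      // and try again. splitInHalf itself throws (as in the trace) once the
      // rows are too few to split further.
      case _: com.nvidia.spark.rapids.jni.SplitAndRetryOOM =>
        current = splitInHalf(current)
    }
  }
  throw new IllegalStateException("unreachable")
}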
When doing a spark.read.parquet instead, I got what I expected:
Caused by: ai.rapids.cudf.CudaFatalException: Fatal CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-183-cuda11/thirdparty/cudf/cpp/include/cudf/detail/utilities/pinned_host_vector.hpp:157: 700 cudaErrorIllegalAddress an illegal memory access was encountered
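That matches the expectation: a CudaFatalException means the CUDA context has a sticky error, so the only safe option is to fail the task and bring the executor down so it restarts with a clean context. A minimal sketch of that kind of handling, as a hypothetical shape rather than the plugin's actual listener:

import org.apache.spark.TaskContext
import org.apache.spark.util.TaskFailureListener
import ai.rapids.cudf.CudaFatalException

// Hypothetical illustration of the expected behavior, not the plugin's actual
// code: after a fatal (sticky) CUDA error the executor JVM is shut down so it
// can come back with a clean CUDA context. Registered from inside a running task.
TaskContext.get().addTaskFailureListener(new TaskFailureListener {
  override def onTaskFailure(context: TaskContext, error: Throwable): Unit = {
    val fatal = error.isInstanceOf[CudaFatalException] ||
      (error.getCause != null && error.getCause.isInstanceOf[CudaFatalException])
    if (fatal) {
      System.exit(1)  // the context is unusable; retrying work on it is pointless
    }
  }
})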
The OOM behavior leads me to believe that we're accidentally interpreting the GPU illegal-access error (or the resulting CudaFatalException) as a retryable OOM, at least in the context of GpuRange.
A related log line points at where the confusion could start: RMM's cuda_memory_resource reports the illegal-access error as a std::bad_alloc, i.e. as an allocation failure:

CAUGHT BAD ALLOC std::bad_alloc: CUDA error at: /...libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:71: cudaErrorIllegalAddress an illegal memory access was encountered
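If a fatal, sticky error can masquerade as an allocation failure on that path, the retry code would need to check what it actually caught before splitting and retrying. A hedged sketch of such a check, illustrative only and not the plugin's actual code:

import ai.rapids.cudf.{CudaException, CudaFatalException}
import com.nvidia.spark.rapids.jni.SplitAndRetryOOM

// Illustrative-only classification, not the plugin's actual code: only an
// explicit retryable GPU OOM should drive the split-and-retry path; fatal
// (sticky) CUDA errors must be rethrown so the executor can shut down.
def shouldSplitAndRetry(t: Throwable): Boolean = t match {
  case _: CudaFatalException => false // sticky error: never retry
  case _: CudaException      => false // other CUDA errors are not allocation failures
  case _: SplitAndRetryOOM   => true  // the retry framework's explicit signal
  case _                     => false
}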