
[BUG] GPU illegal access detected during delta_byte_array.parquet read #10149

Status: Closed
Opened by jlowe (Member) on Jan 3, 2024 · 4 comments
Labels: bug (Something isn't working)

Comments

jlowe (Member) commented Jan 3, 2024

From a recent Databricks nightly test:

2024-01-03T17:16:51.884Z] ../../src/main/python/parquet_testing_test.py::test_parquet_testing_valid_files[confs0-/home/ubuntu/integration_tests/src/test/resources/parquet-testing/data/delta_byte_array.parquet][DATAGEN_SEED=1704302104] 24/01/03 17:16:51 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.int96RebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.int96RebaseModeInRead' instead.
[2024-01-03T17:16:51.884Z] 24/01/03 17:16:51 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.
[2024-01-03T17:16:51.884Z] 24/01/03 17:16:51 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.
[2024-01-03T17:16:51.884Z] 24/01/03 17:16:51 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.int96RebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.int96RebaseModeInRead' instead.
[2024-01-03T17:16:52.139Z] 24/01/03 17:16:51 WARN CloudStoreSpecificConf: Unknown cloud store file
[2024-01-03T17:16:52.139Z] 24/01/03 17:16:51 WARN CloudStoreSpecificConf: Unknown cloud store file
[2024-01-03T17:16:52.139Z] 24/01/03 17:16:52 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.int96RebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.int96RebaseModeInRead' instead.
[2024-01-03T17:16:52.139Z] 24/01/03 17:16:52 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.
[2024-01-03T17:16:52.139Z] 24/01/03 17:16:52 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.
[2024-01-03T17:16:52.139Z] 24/01/03 17:16:52 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.int96RebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.int96RebaseModeInRead' instead.
[2024-01-03T17:16:52.698Z] 24/01/03 17:16:52 ERROR Executor: Exception in task 0.0 in stage 125.0 (TID 125)
[2024-01-03T17:16:52.699Z] java.io.IOException: Error when processing path: file:/home/ubuntu/integration_tests/src/test/resources/parquet-testing/data/delta_byte_array.parquet, range: 0-68353, partition values: [empty row], modificationTime: 1704278585000
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ParquetTableReader.$anonfun$next$1(GpuParquetScan.scala:2695)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ParquetTableReader.next(GpuParquetScan.scala:2686)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ParquetTableReader.next(GpuParquetScan.scala:2660)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.$anonfun$apply$1(GpuDataProducer.scala:159)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1075)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
[2024-01-03T17:16:52.699Z] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[2024-01-03T17:16:52.699Z] 	at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:521)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
[2024-01-03T17:16:52.699Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1(GpuDataSourceRDD.scala:71)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(GpuDataSourceRDD.scala:71)
[2024-01-03T17:16:52.699Z] 	at scala.Option.exists(Option.scala:376)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:71)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.advanceToNextIter(GpuDataSourceRDD.scala:95)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:71)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
[2024-01-03T17:16:52.699Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
[2024-01-03T17:16:52.699Z] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2024-01-03T17:16:52.699Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
[2024-01-03T17:16:52.699Z] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2024-01-03T17:16:52.699Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$1(UnsafeRowBatchUtils.scala:68)
[2024-01-03T17:16:52.699Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$2(Collector.scala:197)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
[2024-01-03T17:16:52.699Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
[2024-01-03T17:16:52.699Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:196)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:181)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:146)
[2024-01-03T17:16:52.699Z] 	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:125)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:146)
[2024-01-03T17:16:52.699Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$8(Executor.scala:897)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1709)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:900)
[2024-01-03T17:16:52.699Z] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2024-01-03T17:16:52.699Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.699Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:795)
[2024-01-03T17:16:52.699Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2024-01-03T17:16:52.699Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2024-01-03T17:16:52.699Z] 	at java.lang.Thread.run(Thread.java:750)
[2024-01-03T17:16:52.699Z] Caused by: ai.rapids.cudf.CudaFatalException: exclusive_scan_by_key failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
[2024-01-03T17:16:52.699Z] 	at ai.rapids.cudf.ParquetChunkedReader.readChunk(Native Method)
[2024-01-03T17:16:52.699Z] 	at ai.rapids.cudf.ParquetChunkedReader.readChunk(ParquetChunkedReader.java:170)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ParquetTableReader.$anonfun$next$1(GpuParquetScan.scala:2688)
[2024-01-03T17:16:52.699Z] 	... 62 more
[2024-01-03T17:16:52.699Z] 24/01/03 17:16:52 ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: java.io.IOException: Error when processing path: file:/home/ubuntu/integration_tests/src/test/resources/parquet-testing/data/delta_byte_array.parquet, range: 0-68353, partition values: [empty row], modificationTime: 1704278585000
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ParquetTableReader.$anonfun$next$1(GpuParquetScan.scala:2695)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ParquetTableReader.next(GpuParquetScan.scala:2686)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.ParquetTableReader.next(GpuParquetScan.scala:2660)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.$anonfun$apply$1(GpuDataProducer.scala:159)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2024-01-03T17:16:52.699Z] 	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$5(GpuMultiFileReader.scala:1075)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
[2024-01-03T17:16:52.700Z] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[2024-01-03T17:16:52.700Z] 	at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:521)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
[2024-01-03T17:16:52.700Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1(GpuDataSourceRDD.scala:71)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(GpuDataSourceRDD.scala:71)
[2024-01-03T17:16:52.700Z] 	at scala.Option.exists(Option.scala:376)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:71)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.advanceToNextIter(GpuDataSourceRDD.scala:95)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:71)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
[2024-01-03T17:16:52.700Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
[2024-01-03T17:16:52.700Z] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2024-01-03T17:16:52.700Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
[2024-01-03T17:16:52.700Z] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2024-01-03T17:16:52.700Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$1(UnsafeRowBatchUtils.scala:68)
[2024-01-03T17:16:52.700Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$2(Collector.scala:197)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
[2024-01-03T17:16:52.700Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
[2024-01-03T17:16:52.700Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:196)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:181)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:146)
[2024-01-03T17:16:52.700Z] 	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:125)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:146)
[2024-01-03T17:16:52.700Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$8(Executor.scala:897)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1709)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:900)
[2024-01-03T17:16:52.700Z] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2024-01-03T17:16:52.700Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2024-01-03T17:16:52.700Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:795)
[2024-01-03T17:16:52.700Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2024-01-03T17:16:52.700Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2024-01-03T17:16:52.700Z] 	at java.lang.Thread.run(Thread.java:750)
[2024-01-03T17:16:52.700Z] Caused by: ai.rapids.cudf.CudaFatalException: exclusive_scan_by_key failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
[2024-01-03T17:16:52.700Z] 	at ai.rapids.cudf.ParquetChunkedReader.readChunk(Native Method)
[2024-01-03T17:16:52.700Z] 	at ai.rapids.cudf.ParquetChunkedReader.readChunk(ParquetChunkedReader.java:170)
[2024-01-03T17:16:52.700Z] 	at com.nvidia.spark.rapids.ParquetTableReader.$anonfun$next$1(GpuParquetScan.scala:2688)
[2024-01-03T17:16:52.700Z] 	... 62 more
[2024-01-03T17:16:52.700Z] 
[2024-01-03T17:16:52.966Z] 24/01/03 17:16:52 WARN RapidsExecutorPlugin: nvidia-smi:
[2024-01-03T17:16:52.967Z] Wed Jan  3 17:16:52 2024       
[2024-01-03T17:16:52.967Z] +---------------------------------------------------------------------------------------+
[2024-01-03T17:16:52.967Z] | NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
[2024-01-03T17:16:52.967Z] |-----------------------------------------+----------------------+----------------------+
[2024-01-03T17:16:52.967Z] | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
[2024-01-03T17:16:52.967Z] | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
[2024-01-03T17:16:52.967Z] |                                         |                      |               MIG M. |
[2024-01-03T17:16:52.967Z] |=========================================+======================+======================|
[2024-01-03T17:16:52.967Z] |   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
[2024-01-03T17:16:52.967Z] | N/A   29C    P0              26W /  70W |   9942MiB / 15360MiB |     34%      Default |
[2024-01-03T17:16:52.967Z] |                                         |                      |                  N/A |
[2024-01-03T17:16:52.967Z] +-----------------------------------------+----------------------+----------------------+
[2024-01-03T17:16:52.967Z]                                                                                          
[2024-01-03T17:16:52.967Z] +---------------------------------------------------------------------------------------+
[2024-01-03T17:16:52.967Z] | Processes:                                                                            |
[2024-01-03T17:16:52.967Z] |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
[2024-01-03T17:16:52.967Z] |        ID   ID                                                             Usage      |
[2024-01-03T17:16:52.967Z] |=======================================================================================|
[2024-01-03T17:16:52.967Z] +---------------------------------------------------------------------------------------+
jlowe added the labels bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) on Jan 3, 2024
razajafri (Collaborator) commented:

Which Databricks version was this failing on?

jlowe (Member, Author) commented Jan 4, 2024

This was from a test run on Databricks 13.3.

razajafri (Collaborator) commented:

Ran the test 100 times but was unable to reproduce the exception.

mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Jan 9, 2024
mattahrens (Collaborator) commented:

The issue has not been reproduced; closing for now.

mattahrens closed this as not planned (won't fix, can't repro, duplicate, stale) on Jan 9, 2024