[BUG] test "convert large InternalRow iterator to cached batch single col" failed with arena pool #9982

Closed
thirtiseven opened this issue Dec 7, 2023 · 6 comments · Fixed by #9985
Labels: bug (Something isn't working)

Comments

@thirtiseven
Collaborator

Describe the bug
The test "convert large InternalRow iterator to cached batch single col" failed in pipeline rapids_premerge-github, build ID 8657.

Steps/Code to reproduce bug
No local repro.
CI log:

[2023-12-06T11:28:08.960Z] - convert large InternalRow iterator to cached batch single col *** FAILED ***
[2023-12-06T11:28:08.961Z]   java.lang.RuntimeException: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-224-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:256: allocation not found
[2023-12-06T11:28:08.961Z]   at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1131)
[2023-12-06T11:28:08.961Z]   at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
[2023-12-06T11:28:08.961Z]   at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:268)
[2023-12-06T11:28:08.961Z]   at com.nvidia.spark.rapids.CachedBatchWriterSuite$TestResources.close(CachedBatchWriterSuite.scala:56)
[2023-12-06T11:28:08.961Z]   at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
[2023-12-06T11:28:08.961Z]   at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
[2023-12-06T11:28:08.961Z]   at com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$new$7(CachedBatchWriterSuite.scala:96)
[2023-12-06T11:28:08.961Z]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2023-12-06T11:28:08.961Z]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[2023-12-06T11:28:08.961Z]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[2023-12-06T11:28:08.961Z]   ...
[2023-12-06T11:28:08.961Z]   Cause: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-224-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:256: allocation not found
[2023-12-06T11:28:08.961Z]   at ai.rapids.cudf.Rmm.free(Native Method)
[2023-12-06T11:28:08.961Z]   at ai.rapids.cudf.DeviceMemoryBuffer$DeviceBufferCleaner.cleanImpl(DeviceMemoryBuffer.java:50)
[2023-12-06T11:28:08.961Z]   at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
[2023-12-06T11:28:08.961Z]   at ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:247)
[2023-12-06T11:28:08.961Z]   at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1115)
[2023-12-06T11:28:08.961Z]   at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
[2023-12-06T11:28:08.961Z]   at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:268)
[2023-12-06T11:28:08.961Z]   at com.nvidia.spark.rapids.CachedBatchWriterSuite$TestResources.close(CachedBatchWriterSuite.scala:56)
[2023-12-06T11:28:08.961Z]   at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
[2023-12-06T11:28:08.961Z]   at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
[2023-12-06T11:28:08.961Z]   ...

Expected behavior
It should pass.

Environment details (please complete the following information)

Premerge CI, arena pool; the test was assigned to a 470 driver instance.

@thirtiseven added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels Dec 7, 2023
@pxLi
Collaborator

pxLi commented Dec 7, 2023

We have nightly 470-driver regression integration tests, but they did not hit this issue.
A job that was randomly scheduled onto a compatibility-test node (470 driver) in the pre-merge pool exposed the issue with the unit test case above.

The run used a JNI build compiled with the latest rmm fix, rapidsai/rmm#1396.

@pxLi
Collaborator

pxLi commented Dec 7, 2023

I set up a simple internal pipeline at Jobs-for-developers/job/REPRO-9982 to help reproduce this with internal resources.

@thirtiseven
Collaborator Author

If we force the arena pool in rmmPool:
https://github.com/NVIDIA/spark-rapids/blob/branch-23.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala#L2320-L2338
the issue can always be reproduced with: -DwildcardSuites=com.nvidia.spark.rapids.CachedBatchWriterSuite

@pxLi changed the title from "[BUG] test "convert large InternalRow iterator to cached batch single col" failed due to RMM failure in premerge" to "[BUG] test "convert large InternalRow iterator to cached batch single col" failed with arena pool" Dec 7, 2023
@thirtiseven
Collaborator Author

thirtiseven commented Dec 7, 2023

Log with spark.rapids.memory.gpu.debug set and ai.rapids.refcount.debug=true:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running com.nvidia.spark.rapids.shuffle.TestShuffleMetricsUpdater
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running com.nvidia.spark.rapids.unit.GpuScalarUnitTest
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.047 sec
Running com.nvidia.spark.rapids.unit.StringRepeatUnitTest
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec
Running com.nvidia.spark.rapids.unit.DateTimeUnitTest
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running com.nvidia.spark.rapids.unit.DecimalUnitTest
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running com.nvidia.spark.rapids.GpuIntervalUtilsTest
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running com.nvidia.spark.rapids.TestHashedPriorityQueue
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
Running org.apache.spark.sql.rapids.TestPrimitiveUDT
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.apache.spark.sql.rapids.TestNestedStructUDT
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
Running org.apache.spark.sql.rapids.TestArrayUDT
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec

Results :

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] --- scalatest-maven-plugin:2.0.2:test (test) @ rapids-4-spark-tests_2.12 ---
Discovery starting.
Discovery completed in 1 second, 176 milliseconds.
Run starting. Expected test count is: 7
CachedBatchWriterSuite:
- convert columnar batch to cached batch on single col table with 0 rows in a batch
- convert large columnar batch to cached batch on single col table
- convert large columnar batch to cached batch on multi-col table
Thread,Time,Action,Pointer,Size,Stream
4069267,14:03:26.671288,allocate,0x7f0f0e000000,134217728,0x2
4069267,14:03:27.752660,free,0x7f152b200000,256,0x2
4069267,14:03:27.754116,free,0x7f152b200200,256,0x2
4069267,14:03:27.758929,allocate,0x7f0f16000000,256,0x2
4069267,14:03:27.759391,allocate,0x7f0f16000100,256,0x2
4069267,14:03:27.760024,allocate,0x7f0f16000200,256,0x2
4069267,14:03:27.761734,allocate,0x7f0f16000300,256,0x2
4069267,14:03:27.762161,allocate,0x7f0f16000400,256,0x2
4069267,14:03:27.762550,allocate,0x7f0f16000500,256,0x2
- convert large InternalRow iterator to cached batch single col *** FAILED ***
  java.lang.RuntimeException: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-224-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:256: allocation not found
  at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1131)
  at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
  at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:268)
  at com.nvidia.spark.rapids.CachedBatchWriterSuite$TestResources.close(CachedBatchWriterSuite.scala:56)
  at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
  at com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$new$7(CachedBatchWriterSuite.scala:96)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  ...
  Cause: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-224-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:256: allocation not found
  at ai.rapids.cudf.Rmm.free(Native Method)
  at ai.rapids.cudf.DeviceMemoryBuffer$DeviceBufferCleaner.cleanImpl(DeviceMemoryBuffer.java:50)
  at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
  at ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:247)
  at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1115)
  at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
  at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:268)
  at com.nvidia.spark.rapids.CachedBatchWriterSuite$TestResources.close(CachedBatchWriterSuite.scala:56)
  at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
  ...
4069267,14:03:27.763962,allocate,0x7f0f16000600,256,0x2
4069267,14:03:27.764388,allocate,0x7f0f16000700,256,0x2
4069267,14:03:27.764781,allocate,0x7f0f16000800,256,0x2
4069267,14:03:27.777773,free,0x7f0f16000000,256,0x2
4069267,14:03:27.778099,free,0x7f0f16000300,256,0x2
4069267,14:03:27.778400,free,0x7f0f16000600,256,0x2
4069449,14:03:27.873217,free,0x7f152b200400,256,0x2
23/12/07 14:03:27.873 Cleaner Thread ERROR MemoryCleaner: CAUGHT EXCEPTION WHILE TRYING TO CLEAN ai.rapids.cudf.MemoryCleaner$CleanerWeakReference@8b619b1
java.lang.RuntimeException: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-224-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:256: allocation not found
	at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1131) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.MemoryCleaner$CleanerWeakReference.clean(MemoryCleaner.java:167) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.MemoryCleaner.lambda$static$0(MemoryCleaner.java:200) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392]
Caused by: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-224-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:256: allocation not found
	at ai.rapids.cudf.Rmm.free(Native Method) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.DeviceMemoryBuffer$DeviceBufferCleaner.cleanImpl(DeviceMemoryBuffer.java:50) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:247) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1115) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	... 4 more
23/12/07 14:03:27.881 Cleaner Thread ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 53)
23/12/07 14:03:27.893 Cleaner Thread ERROR MemoryCleaner: Leaked vector (ID: 53): 2023-12-07 06:03:27.0752 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.HostColumnVector.incRefCountInternal(HostColumnVector.java:190)
ai.rapids.cudf.HostColumnVector.<init>(HostColumnVector.java:112)
ai.rapids.cudf.ColumnView.copyToHost(ColumnView.java:5111)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.mockito.internal.util.reflection.ReflectionMemberAccessor.invoke(ReflectionMemberAccessor.java:48)
org.mockito.internal.util.reflection.ModuleMemberAccessor.invoke(ModuleMemberAccessor.java:55)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.tryInvoke(MockMethodAdvice.java:333)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.access$500(MockMethodAdvice.java:60)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice$RealMethodCall.invoke(MockMethodAdvice.java:253)
org.mockito.internal.invocation.InterceptedInvocation.callRealMethod(InterceptedInvocation.java:142)
org.mockito.internal.stubbing.answers.CallsRealMethods.answer(CallsRealMethods.java:45)
org.mockito.Answers.answer(Answers.java:99)
org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:110)
org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29)
org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:34)
org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:82)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.handle(MockMethodAdvice.java:151)
ai.rapids.cudf.ColumnView.copyToHost(ColumnView.java:5080)
ai.rapids.cudf.ColumnView.copyToHost(ColumnView.java:5166)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.mockito.internal.util.reflection.ReflectionMemberAccessor.invoke(ReflectionMemberAccessor.java:48)
org.mockito.internal.util.reflection.ModuleMemberAccessor.invoke(ModuleMemberAccessor.java:55)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.tryInvoke(MockMethodAdvice.java:333)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.access$500(MockMethodAdvice.java:60)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice$RealMethodCall.invoke(MockMethodAdvice.java:253)
org.mockito.internal.invocation.InterceptedInvocation.callRealMethod(InterceptedInvocation.java:142)
org.mockito.internal.stubbing.answers.CallsRealMethods.answer(CallsRealMethods.java:45)
org.mockito.Answers.answer(Answers.java:99)
org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:110)
org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29)
org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:34)
org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:82)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.handle(MockMethodAdvice.java:151)
ai.rapids.cudf.ColumnView.copyToHost(ColumnView.java:5166)
com.nvidia.spark.rapids.GpuColumnVector.copyToHost(GpuColumnVector.java:1077)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.mockito.internal.util.reflection.ReflectionMemberAccessor.invoke(ReflectionMemberAccessor.java:48)
org.mockito.internal.util.reflection.ModuleMemberAccessor.invoke(ModuleMemberAccessor.java:55)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.tryInvoke(MockMethodAdvice.java:333)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.access$500(MockMethodAdvice.java:60)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice$RealMethodCall.invoke(MockMethodAdvice.java:253)
org.mockito.internal.invocation.InterceptedInvocation.callRealMethod(InterceptedInvocation.java:142)
org.mockito.internal.stubbing.answers.CallsRealMethods.answer(CallsRealMethods.java:45)
org.mockito.Answers.answer(Answers.java:99)
org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:110)
org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29)
org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:34)
org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:82)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.handle(MockMethodAdvice.java:151)
com.nvidia.spark.rapids.GpuColumnVector.copyToHost(GpuColumnVector.java:1077)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$ColumnarBatchToCachedBatchIterator.$anonfun$myIter$3(ParquetCachedBatchSerializer.scala:1221)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableFromBatchColumns.$anonfun$safeMap$2(implicits.scala:287)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableFromBatchColumns.$anonfun$safeMap$2$adapted(implicits.scala:287)
com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:221)
com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:218)
scala.collection.immutable.Range.foreach(Range.scala:158)
com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:218)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableFromBatchColumns.safeMap(implicits.scala:287)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$ColumnarBatchToCachedBatchIterator.$anonfun$myIter$2(ParquetCachedBatchSerializer.scala:1221)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$ColumnarBatchToCachedBatchIterator.$anonfun$myIter$1(ParquetCachedBatchSerializer.scala:1220)
scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$ColumnarBatchToCachedBatchIterator.hasNext(ParquetCachedBatchSerializer.scala:1216)
scala.collection.Iterator.foreach(Iterator.scala:943)
scala.collection.Iterator.foreach$(Iterator.scala:943)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$InternalRowToCachedBatchIterator.foreach(ParquetCachedBatchSerializer.scala:999)
scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$InternalRowToCachedBatchIterator.foldLeft(ParquetCachedBatchSerializer.scala:999)
com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$testColumnarBatchToCachedBatchIterator$2(CachedBatchWriterSuite.scala:337)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.CachedBatchWriterSuite.testColumnarBatchToCachedBatchIterator(CachedBatchWriterSuite.scala:335)
com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$new$8(CachedBatchWriterSuite.scala:102)
com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$new$8$adapted(CachedBatchWriterSuite.scala:96)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$new$7(CachedBatchWriterSuite.scala:96)
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:431)
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
org.scalatest.Suite.run(Suite.scala:1114)
org.scalatest.Suite.run$(Suite.scala:1096)
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
org.scalatest.SuperEngine.runImpl(Engine.scala:535)
org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
com.nvidia.spark.rapids.CachedBatchWriterSuite.org$scalatest$BeforeAndAfterAll$$super$run(CachedBatchWriterSuite.scala:47)
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
com.nvidia.spark.rapids.CachedBatchWriterSuite.run(CachedBatchWriterSuite.scala:47)
org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1178)
org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1225)
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
org.scalatest.Suite.runNestedSuites(Suite.scala:1223)
org.scalatest.Suite.runNestedSuites$(Suite.scala:1156)
org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:30)
org.scalatest.Suite.run(Suite.scala:1111)
org.scalatest.Suite.run$(Suite.scala:1096)
org.scalatest.tools.DiscoverySuite.run(DiscoverySuite.scala:30)
org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:47)
org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1321)
org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1315)
scala.collection.immutable.List.foreach(List.scala:431)
org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1315)
org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:992)
org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:970)
org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1481)
org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:970)
org.scalatest.tools.Runner$.main(Runner.scala:775)
org.scalatest.tools.Runner.main(Runner.scala)

4069267,14:03:28.030814,free,0x7f0f16000700,256,0x2
4069267,14:03:28.031178,free,0x7f0f16000800,256,0x2
4069267,14:03:28.031581,free,0x7f0f16000400,256,0x2
4069267,14:03:28.031907,free,0x7f0f16000500,256,0x2
4069267,14:03:28.032093,free,0x7f0f16000100,256,0x2
4069267,14:03:28.032456,free,0x7f0f16000200,256,0x2
- convert large InternalRow iterator to cached batch multi-col
- test useCompression conf is honored
4069267,14:03:28.247503,allocate,0x7f0f16000000,256,0x2
4069267,14:03:28.247822,allocate,0x7f0f16000100,256,0x2
4069267,14:03:28.747750,allocate,0x7f0f16000200,256,0x2
4069267,14:03:28.748609,allocate,0x7f0f16000300,256,0x2
4069267,14:03:28.748624,allocate,0x7f0f16000400,256,0x2
4069267,14:03:28.748648,allocate,0x7f0f16000500,256,0x2
4069267,14:03:28.748655,allocate,0x7f0f16000600,256,0x2
4069267,14:03:28.748675,allocate,0x7f0f16000700,256,0x2
4069267,14:03:28.749912,allocate,0x7f0f16000800,256,0x2
4069267,14:03:28.750639,free,0x7f0f16000800,256,0x2
4069267,14:03:28.750672,allocate,0x7f0f16000800,256,0x2
4069267,14:03:28.750687,allocate,0x7f0f16000900,256,0x2
4069267,14:03:28.750960,allocate,0x7f0f16000a00,256,0x2
4069267,14:03:28.750963,allocate,0x7f0f16000b00,256,0x2
4069267,14:03:28.751035,free,0x7f0f16000900,256,0x2
4069267,14:03:28.751050,allocate,0x7f0f16000900,256,0x2
4069267,14:03:28.751053,allocate,0x7f0f16000c00,256,0x2
4069267,14:03:28.751064,allocate,0x7f0f16000d00,256,0x2
4069267,14:03:28.751090,free,0x7f0f16000d00,256,0x2
4069267,14:03:28.751095,allocate,0x7f0f16000d00,256,0x2
4069267,14:03:28.751560,free,0x7f0f16000d00,256,0x2
4069267,14:03:28.751633,allocate,0x7f0f16000d00,256,0x2
4069267,14:03:28.751661,allocate,0x7f0f16000e00,256,0x2
4069267,14:03:28.751716,free,0x7f0f16000d00,256,0x2
4069267,14:03:28.751729,allocate,0x7f0f16000d00,256,0x2
4069267,14:03:28.751731,allocate,0x7f0f16000f00,512,0x2
4069267,14:03:28.751732,allocate,0x7f0f16001100,256,0x2
4069267,14:03:28.751733,allocate,0x7f0f16001200,256,0x2
4069267,14:03:28.751736,allocate,0x7f0f16001300,256,0x2
4069267,14:03:28.752013,free,0x7f0f16001300,256,0x2
4069267,14:03:28.752019,allocate,0x7f0f16001300,256,0x2
4069267,14:03:28.752020,allocate,0x7f0f16001400,256,0x2
4069267,14:03:28.752022,allocate,0x7f0f16001500,256,0x2
4069267,14:03:28.752043,allocate,0x7f0f16001600,256,0x2
4069267,14:03:28.752076,free,0x7f0f16001600,256,0x2
4069267,14:03:28.752629,allocate,0x7f0f16001600,256,0x2
4069267,14:03:28.752634,allocate,0x7f0f16001700,256,0x2
4069267,14:03:28.752636,allocate,0x7f0f16001800,256,0x2
4069267,14:03:28.752637,allocate,0x7f0f16001900,256,0x2
4069267,14:03:28.753269,allocate,0x7f0f16001a00,256,0x2
4069267,14:03:28.753320,free,0x7f0f16001a00,256,0x2
4069267,14:03:28.753326,allocate,0x7f0f16001a00,256,0x2
4069267,14:03:28.753360,free,0x7f0f16001a00,256,0x2
4069267,14:03:28.753373,allocate,0x7f0f16001a00,256,0x2
4069267,14:03:28.754078,free,0x7f0f16001a00,256,0x2
4069267,14:03:28.754085,free,0x7f0f16001900,256,0x2
4069267,14:03:28.754087,free,0x7f0f16001800,256,0x2
4069267,14:03:28.754088,free,0x7f0f16001700,256,0x2
4069267,14:03:28.754090,free,0x7f0f16001600,256,0x2
4069267,14:03:28.754195,allocate,0x7f0f16001600,256,0x2
4069267,14:03:28.754644,free,0x7f0f16001600,256,0x2
4069267,14:03:28.754651,allocate,0x7f0f16001600,256,0x2
4069267,14:03:28.754683,free,0x7f0f16001600,256,0x2
4069267,14:03:28.754687,allocate,0x7f0f16001600,256,0x2
4069267,14:03:28.754702,free,0x7f0f16001600,256,0x2
4069267,14:03:28.754705,allocate,0x7f0f16001600,256,0x2
4069267,14:03:28.754718,free,0x7f0f16001600,256,0x2
4069267,14:03:28.754721,free,0x7f0f16001500,256,0x2
4069267,14:03:28.754722,free,0x7f0f16001400,256,0x2
4069267,14:03:28.754724,free,0x7f0f16001300,256,0x2
4069267,14:03:28.754746,free,0x7f0f16001200,256,0x2
4069267,14:03:28.754748,free,0x7f0f16000e00,256,0x2
4069267,14:03:28.754760,free,0x7f0f16000900,256,0x2
4069267,14:03:28.754766,free,0x7f0f16000c00,256,0x2
4069267,14:03:28.754774,free,0x7f0f16000b00,256,0x2
4069267,14:03:28.754775,free,0x7f0f16000a00,256,0x2
4069267,14:03:28.754777,free,0x7f0f16000700,256,0x2
4069267,14:03:28.754780,free,0x7f0f16000600,256,0x2
4069267,14:03:28.754782,free,0x7f0f16000500,256,0x2
4069267,14:03:28.754789,free,0x7f0f16000400,256,0x2
4069267,14:03:28.754790,free,0x7f0f16000300,256,0x2
4069267,14:03:28.754798,free,0x7f0f16000200,256,0x2
4069267,14:03:28.754819,free,0x7f0f16001100,256,0x2
4069267,14:03:28.754821,free,0x7f0f16000800,256,0x2
4069267,14:03:28.755286,free,0x7f0f16000d00,256,0x2
4069267,14:03:28.755292,free,0x7f0f16000f00,512,0x2
4069267,14:03:28.758287,free,0x7f0f16000000,256,0x2
4069267,14:03:28.758375,free,0x7f0f16000100,256,0x2
4069267,14:03:28.761409,allocate,0x7f0f16000000,256,0x2
4069267,14:03:28.761433,allocate,0x7f0f16000100,256,0x2
4069267,14:03:28.761456,allocate,0x7f0f16000200,256,0x2
4069267,14:03:28.761724,free,0x7f0f16000200,256,0x2
4069267,14:03:28.761741,allocate,0x7f0f16000200,512,0x2
4069267,14:03:28.761745,allocate,0x7f0f16000400,256,0x2
4069267,14:03:28.761775,allocate,0x7f0f16000500,256,0x2
4069267,14:03:28.763735,free,0x7f0f16000500,256,0x2
4069267,14:03:28.763748,free,0x7f0f16000400,256,0x2
4069267,14:03:28.763772,allocate,0x7f0f16000400,256,0x2
4069267,14:03:28.763781,allocate,0x7f0f16000500,256,0x2
4069267,14:03:28.763793,allocate,0x7f0f16000600,8192,0x2
4069267,14:03:28.763801,allocate,0x7f0f16002600,256,0x2
4069267,14:03:28.763803,allocate,0x7f0f16002700,256,0x2
4069267,14:03:28.763855,allocate,0x7f0f16002800,512,0x2
4069267,14:03:28.763915,free,0x7f0f16002800,512,0x2
4069267,14:03:28.763921,allocate,0x7f0f16002800,256,0x2
4069267,14:03:28.763923,allocate,0x7f0f16002900,1536,0x2
4069267,14:03:28.764009,free,0x7f0f16002900,1536,0x2
4069267,14:03:28.764013,allocate,0x7f0f16002900,256,0x2
4069267,14:03:28.764015,allocate,0x7f0f16002a00,1280,0x2
4069267,14:03:28.764074,free,0x7f0f16002a00,1280,0x2
4069267,14:03:28.764084,allocate,0x7f0f16002a00,256,0x2
4069267,14:03:28.764117,free,0x7f0f16002a00,256,0x2
4069267,14:03:28.764120,free,0x7f0f16002900,256,0x2
4069267,14:03:28.764121,free,0x7f0f16002800,256,0x2
4069267,14:03:28.764134,allocate,0x7f0f16002800,256,0x2
4069267,14:03:28.764141,allocate,0x7f0f16002900,256,0x2
4069267,14:03:28.764147,allocate,0x7f0f16002a00,256,0x2
4069267,14:03:28.764471,free,0x7f0f16002a00,256,0x2
4069267,14:03:28.764487,allocate,0x7f0f16002a00,256,0x2
4069267,14:03:28.764497,allocate,0x7f0f16002b00,256,0x2
4069267,14:03:28.764514,allocate,0x7f0f16002c00,256,0x2
4069267,14:03:28.764635,free,0x7f0f16002c00,256,0x2
4069267,14:03:28.764639,free,0x7f0f16002b00,256,0x2
4069267,14:03:28.764649,free,0x7f0f16002a00,256,0x2
4069267,14:03:28.764673,free,0x7f0f16000600,8192,0x2
4069267,14:03:28.764675,free,0x7f0f16002700,256,0x2
4069267,14:03:28.764677,free,0x7f0f16002600,256,0x2
4069267,14:03:28.764679,free,0x7f0f16000500,256,0x2
4069267,14:03:28.764685,free,0x7f0f16000400,256,0x2
4069267,14:03:28.764690,free,0x7f0f16000200,512,0x2
4069267,14:03:28.764695,free,0x7f0f16000000,256,0x2
4069267,14:03:28.765149,free,0x7f0f16000100,256,0x2
4069267,14:03:28.769716,free,0x7f0f16002900,256,0x2
4069267,14:03:28.769737,free,0x7f0f16002800,256,0x2
- cache empty columnar batch on GPU
4069636,14:03:28.825767,free,0x7f0f0e000000,134217728,0x2
Run completed in 15 seconds, 187 milliseconds.
Total number of tests run: 7
Suites: completed 2, aborted 0
Tests: succeeded 6, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  56.268 s
[INFO] Finished at: 2023-12-07T14:03:30+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project rapids-4-spark-tests_2.12: There are test failures -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

@thirtiseven
Collaborator Author

Log of the failed case only:

[INFO]
[INFO] --- scalatest-maven-plugin:2.0.2:test (test) @ rapids-4-spark-tests_2.12 ---
Discovery starting.
Discovery completed in 1 second, 41 milliseconds.
Run starting. Expected test count is: 7
CachedBatchWriterSuite:
- convert columnar batch to cached batch on single col table with 0 rows in a batch !!! CANCELED !!!
  org.scalatest.exceptions.TestCanceledException was thrown. (CachedBatchWriterSuite.scala:62)
- convert large columnar batch to cached batch on single col table !!! CANCELED !!!
  org.scalatest.exceptions.TestCanceledException was thrown. (CachedBatchWriterSuite.scala:78)
- convert large columnar batch to cached batch on multi-col table !!! CANCELED !!!
  org.scalatest.exceptions.TestCanceledException was thrown. (CachedBatchWriterSuite.scala:88)
Thread,Time,Action,Pointer,Size,Stream
4180932,14:59:35.447726,allocate,0x7f21ca000000,134217728,0x2
4180932,14:59:36.410549,free,0x7f27df200000,256,0x2
4180932,14:59:36.412028,free,0x7f27df200200,256,0x2
- convert large InternalRow iterator to cached batch single col *** FAILED ***
  java.lang.RuntimeException: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-224-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:256: allocation not found
  at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1131)
  at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
  at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:268)
  at com.nvidia.spark.rapids.CachedBatchWriterSuite$TestResources.close(CachedBatchWriterSuite.scala:56)
  at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
  at com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$new$7(CachedBatchWriterSuite.scala:99)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  ...
  Cause: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-224-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:256: allocation not found
  at ai.rapids.cudf.Rmm.free(Native Method)
  at ai.rapids.cudf.DeviceMemoryBuffer$DeviceBufferCleaner.cleanImpl(DeviceMemoryBuffer.java:50)
  at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
  at ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:247)
  at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1115)
  at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
  at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:268)
  at com.nvidia.spark.rapids.CachedBatchWriterSuite$TestResources.close(CachedBatchWriterSuite.scala:56)
  at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
  ...
- convert large InternalRow iterator to cached batch multi-col !!! CANCELED !!!
  org.scalatest.exceptions.TestCanceledException was thrown. (CachedBatchWriterSuite.scala:110)
- test useCompression conf is honored !!! CANCELED !!!
  org.scalatest.exceptions.TestCanceledException was thrown. (CachedBatchWriterSuite.scala:131)
- cache empty columnar batch on GPU !!! CANCELED !!!
  org.scalatest.exceptions.TestCanceledException was thrown. (CachedBatchWriterSuite.scala:146)
4181307,14:59:36.446173,free,0x7f21ca000000,134217728,0x2
4181284,14:59:36.570215,free,0x7f27df200400,256,0x2
23/12/07 14:59:36.570 Cleaner Thread ERROR MemoryCleaner: CAUGHT EXCEPTION WHILE TRYING TO CLEAN ai.rapids.cudf.MemoryCleaner$CleanerWeakReference@1b545d17
java.lang.RuntimeException: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-224-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:256: allocation not found
	at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1131) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.MemoryCleaner$CleanerWeakReference.clean(MemoryCleaner.java:167) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.MemoryCleaner.lambda$static$0(MemoryCleaner.java:200) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392]
Caused by: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-224-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:256: allocation not found
	at ai.rapids.cudf.Rmm.free(Native Method) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.DeviceMemoryBuffer$DeviceBufferCleaner.cleanImpl(DeviceMemoryBuffer.java:50) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:247) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1115) ~[spark-rapids-jni-23.12.0-SNAPSHOT-cuda11.jar:?]
	... 4 more
23/12/07 14:59:36.581 Cleaner Thread ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 17)
23/12/07 14:59:36.587 Cleaner Thread ERROR MemoryCleaner: Leaked vector (ID: 17): 2023-12-07 06:59:36.0409 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.HostColumnVector.incRefCountInternal(HostColumnVector.java:190)
ai.rapids.cudf.HostColumnVector.<init>(HostColumnVector.java:112)
ai.rapids.cudf.ColumnView.copyToHost(ColumnView.java:5111)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.mockito.internal.util.reflection.ReflectionMemberAccessor.invoke(ReflectionMemberAccessor.java:48)
org.mockito.internal.util.reflection.ModuleMemberAccessor.invoke(ModuleMemberAccessor.java:55)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.tryInvoke(MockMethodAdvice.java:333)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.access$500(MockMethodAdvice.java:60)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice$RealMethodCall.invoke(MockMethodAdvice.java:253)
org.mockito.internal.invocation.InterceptedInvocation.callRealMethod(InterceptedInvocation.java:142)
org.mockito.internal.stubbing.answers.CallsRealMethods.answer(CallsRealMethods.java:45)
org.mockito.Answers.answer(Answers.java:99)
org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:110)
org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29)
org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:34)
org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:82)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.handle(MockMethodAdvice.java:151)
ai.rapids.cudf.ColumnView.copyToHost(ColumnView.java:5080)
ai.rapids.cudf.ColumnView.copyToHost(ColumnView.java:5166)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.mockito.internal.util.reflection.ReflectionMemberAccessor.invoke(ReflectionMemberAccessor.java:48)
org.mockito.internal.util.reflection.ModuleMemberAccessor.invoke(ModuleMemberAccessor.java:55)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.tryInvoke(MockMethodAdvice.java:333)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.access$500(MockMethodAdvice.java:60)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice$RealMethodCall.invoke(MockMethodAdvice.java:253)
org.mockito.internal.invocation.InterceptedInvocation.callRealMethod(InterceptedInvocation.java:142)
org.mockito.internal.stubbing.answers.CallsRealMethods.answer(CallsRealMethods.java:45)
org.mockito.Answers.answer(Answers.java:99)
org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:110)
org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29)
org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:34)
org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:82)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.handle(MockMethodAdvice.java:151)
ai.rapids.cudf.ColumnView.copyToHost(ColumnView.java:5166)
com.nvidia.spark.rapids.GpuColumnVector.copyToHost(GpuColumnVector.java:1077)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.mockito.internal.util.reflection.ReflectionMemberAccessor.invoke(ReflectionMemberAccessor.java:48)
org.mockito.internal.util.reflection.ModuleMemberAccessor.invoke(ModuleMemberAccessor.java:55)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.tryInvoke(MockMethodAdvice.java:333)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.access$500(MockMethodAdvice.java:60)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice$RealMethodCall.invoke(MockMethodAdvice.java:253)
org.mockito.internal.invocation.InterceptedInvocation.callRealMethod(InterceptedInvocation.java:142)
org.mockito.internal.stubbing.answers.CallsRealMethods.answer(CallsRealMethods.java:45)
org.mockito.Answers.answer(Answers.java:99)
org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:110)
org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29)
org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:34)
org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:82)
org.mockito.internal.creation.bytebuddy.MockMethodAdvice.handle(MockMethodAdvice.java:151)
com.nvidia.spark.rapids.GpuColumnVector.copyToHost(GpuColumnVector.java:1077)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$ColumnarBatchToCachedBatchIterator.$anonfun$myIter$3(ParquetCachedBatchSerializer.scala:1221)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableFromBatchColumns.$anonfun$safeMap$2(implicits.scala:287)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableFromBatchColumns.$anonfun$safeMap$2$adapted(implicits.scala:287)
com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:221)
com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:218)
scala.collection.immutable.Range.foreach(Range.scala:158)
com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:218)
com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableFromBatchColumns.safeMap(implicits.scala:287)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$ColumnarBatchToCachedBatchIterator.$anonfun$myIter$2(ParquetCachedBatchSerializer.scala:1221)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$ColumnarBatchToCachedBatchIterator.$anonfun$myIter$1(ParquetCachedBatchSerializer.scala:1220)
scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$ColumnarBatchToCachedBatchIterator.hasNext(ParquetCachedBatchSerializer.scala:1216)
scala.collection.Iterator.foreach(Iterator.scala:943)
scala.collection.Iterator.foreach$(Iterator.scala:943)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$InternalRowToCachedBatchIterator.foreach(ParquetCachedBatchSerializer.scala:999)
scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
com.nvidia.spark.rapids.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$InternalRowToCachedBatchIterator.foldLeft(ParquetCachedBatchSerializer.scala:999)
com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$testColumnarBatchToCachedBatchIterator$2(CachedBatchWriterSuite.scala:343)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.CachedBatchWriterSuite.testColumnarBatchToCachedBatchIterator(CachedBatchWriterSuite.scala:341)
com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$new$8(CachedBatchWriterSuite.scala:105)
com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$new$8$adapted(CachedBatchWriterSuite.scala:99)
com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$new$7(CachedBatchWriterSuite.scala:99)
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:431)
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
org.scalatest.Suite.run(Suite.scala:1114)
org.scalatest.Suite.run$(Suite.scala:1096)
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
org.scalatest.SuperEngine.runImpl(Engine.scala:535)
org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
com.nvidia.spark.rapids.CachedBatchWriterSuite.org$scalatest$BeforeAndAfterAll$$super$run(CachedBatchWriterSuite.scala:47)
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
com.nvidia.spark.rapids.CachedBatchWriterSuite.run(CachedBatchWriterSuite.scala:47)
org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1178)
org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1225)
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
org.scalatest.Suite.runNestedSuites(Suite.scala:1223)
org.scalatest.Suite.runNestedSuites$(Suite.scala:1156)
org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:30)
org.scalatest.Suite.run(Suite.scala:1111)
org.scalatest.Suite.run$(Suite.scala:1096)
org.scalatest.tools.DiscoverySuite.run(DiscoverySuite.scala:30)
org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:47)
org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1321)
org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1315)
scala.collection.immutable.List.foreach(List.scala:431)
org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1315)
org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:992)
org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:970)
org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1481)
org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:970)
org.scalatest.tools.Runner$.main(Runner.scala:775)
org.scalatest.tools.Runner.main(Runner.scala)

Run completed in 13 seconds, 379 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 1, canceled 6, ignored 0, pending 0
*** 1 TEST FAILED ***
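
One way to read the RMM event log in the run above (the `Thread,Time,Action,Pointer,Size,Stream` rows) is to pair each `free` with an earlier `allocate` of the same pointer and list the frees that have no match. The following is a minimal Python sketch of that check, not part of the original report; `unmatched_frees` is a hypothetical helper name, and the sample rows are copied from the log above. In this excerpt the three 256-byte frees have no matching `allocate` row, but the sketch only reports what the log shows and does not by itself establish the root cause.

```python
import csv
import io


def unmatched_frees(log_text):
    """Return (pointer, size) for every free that has no live matching allocate."""
    live = {}        # pointer -> size of the allocation currently considered live
    unmatched = []
    for row in csv.reader(io.StringIO(log_text)):
        # Skip test-runner output and the CSV header; keep only 6-field event rows.
        if len(row) != 6 or row[2] not in ("allocate", "free"):
            continue
        _thread, _time, action, pointer, size, _stream = row
        if action == "allocate":
            live[pointer] = int(size)
        elif live.pop(pointer, None) is None:
            # A free for a pointer that was never allocated in this log.
            unmatched.append((pointer, int(size)))
    return unmatched


# Event rows copied from the failing run above.
sample = """Thread,Time,Action,Pointer,Size,Stream
4180932,14:59:35.447726,allocate,0x7f21ca000000,134217728,0x2
4180932,14:59:36.410549,free,0x7f27df200000,256,0x2
4180932,14:59:36.412028,free,0x7f27df200200,256,0x2
4181307,14:59:36.446173,free,0x7f21ca000000,134217728,0x2
4181284,14:59:36.570215,free,0x7f27df200400,256,0x2
"""

for pointer, size in unmatched_frees(sample):
    print(f"free without allocate: {pointer} ({size} bytes)")
```

The same function can be run over a full saved log, since lines that are not six-field `allocate`/`free` rows (test names, stack traces, Maven output) are skipped.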

@pxLi
Collaborator

pxLi commented Dec 8, 2023

Closing; fixed by #9985.

Verified with the fix: the test shows no errors on drivers 470 and 535 with the RMM arena pool.

@pxLi closed this as completed on Dec 8, 2023
@sameerz removed the "? - Needs Triage" (Need team to review and classify) label on Dec 16, 2023