Update for new retry state machine JNI APIs #9656

Merged: 11 commits into NVIDIA:branch-24.02 on Nov 29, 2023

Conversation

@revans2 (Collaborator) commented Nov 7, 2023

This depends on NVIDIA/spark-rapids-jni#1543, which is also a breaking change, so both of these need to go in at about the same time.

This mostly just updates the existing code to use the new APIs that are equivalent to what we used before. A separate PR will come later to add support for CPU allocation/retry, along with optional configs to try to limit CPU memory usage.
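
To make the rename concrete, here is a minimal, hypothetical sketch (not code from this PR) of the kind of change callers make when moving to the new exception types. It assumes the `GpuSplitAndRetryOOM` and `CpuSplitAndRetryOOM` classes referenced in the review below live in the `com.nvidia.spark.rapids.jni` package; the helper `runWithSplitHandling` is invented for illustration.

```scala
import com.nvidia.spark.rapids.jni.{CpuSplitAndRetryOOM, GpuSplitAndRetryOOM}

// Hypothetical helper: run `attempt`, and if the retry state machine signals that
// the work must be split, shrink the unit of work and try once more. Real retry
// code loops and bounds the number of splits; this sketch keeps only the shape.
def runWithSplitHandling[T](attempt: () => T, split: () => Unit): T = {
  try {
    attempt()
  } catch {
    // Before this change a single SplitAndRetryOOM type was caught here; the new
    // JNI APIs distinguish GPU and CPU out-of-memory retry signals.
    case _: GpuSplitAndRetryOOM | _: CpuSplitAndRetryOOM =>
      split()   // make the inputs smaller
      attempt() // retry with the smaller inputs
  }
}
```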

@mattahrens added the reliability label (Features to improve reliability or bugs that severely impact the reliability of the plugin) Nov 9, 2023
@revans2 changed the base branch from branch-23.12 to branch-24.02 on November 17, 2023 at 14:16
@abellina self-requested a review on November 21, 2023 at 14:46
@jbrennan333 (Contributor) left a comment

First pass - looks good.
Will take another look at both PRs.

@@ -342,7 +342,8 @@ object RmmRapidsRetryIterator extends Logging {
     override def hasNext: Boolean
 
     /**
-     * Split is a function that is invoked by `RmmRapidsRetryIterator` when `SplitAndRetryOOM`
+     * Split is a function that is invoked by `RmmRapidsRetryIterator` when `GpuSplitAndRetryOOM`
+     * of `CpuSplitAndRetryOOM`
Contributor commented with a suggested change:

-     * of `CpuSplitAndRetryOOM`
+     * or `CpuSplitAndRetryOOM`
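
For context on the doc comment being edited, here is a minimal hedged sketch of what a split function can look like: it takes the input that failed and returns smaller pieces for the iterator to retry. `SpillableLike` and `splitInHalfByRows` are illustrative stand-ins, not the plugin's real types.

```scala
// Illustrative stand-in for a spillable input that can be sliced into smaller pieces.
trait SpillableLike {
  def numRows: Int
  def slice(start: Int, end: Int): SpillableLike
}

// A simple split policy: halve the input by row count. If nothing can be split
// further, the OOM has to propagate, so fail loudly instead of looping forever.
def splitInHalfByRows(input: SpillableLike): Seq[SpillableLike] = {
  require(input.numRows > 1, "cannot split a single-row input any further")
  val mid = input.numRows / 2
  Seq(input.slice(0, mid), input.slice(mid, input.numRows))
}
```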

@@ -437,6 +437,14 @@ object GpuDeviceManager extends Logging {
         s"dropping pinned memory to ${ret / 1024 / 1024.0} MiB")
       ret
     }
+    val nonPinnedLimit = finalMemoryLimit - totalOverhead - pinnedLimit
+    // TODO make this debug???
Collaborator commented:

remove TODO, I quite like this log :D
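
For readers following along, here is a small worked sketch of the accounting in the hunk above, with made-up numbers; only the final subtraction mirrors the diff, and the values and log wording are illustrative.

```scala
// Illustrative values only: an 8 GiB host budget with 1 GiB reserved overhead
// and a 2 GiB pinned pool leaves 5 GiB for non-pinned host allocations.
val finalMemoryLimit = 8L * 1024 * 1024 * 1024
val totalOverhead    = 1L * 1024 * 1024 * 1024
val pinnedLimit      = 2L * 1024 * 1024 * 1024

val nonPinnedLimit = finalMemoryLimit - totalOverhead - pinnedLimit
println(s"non-pinned host memory limit is ${nonPinnedLimit / 1024 / 1024.0} MiB") // 5120.0 MiB
```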

Comment on lines +260 to +261
withResource(makeChunkedPacker) { cp =>
(cp.getMeta, cp.getTotalContiguousSize)
Contributor commented:

I'm not sure I understand what is happening here. Doesn't this close the ChunkedPacker? Are the metadata and total contiguous size all we need here?
Previously we held the chunked packer open until free().

Collaborator replied:

I know @revans2 had issues where the chunked packer was closed because of a failed host allocation and a re-attempt to spill to disk. I'll also review the changes around here today.

Collaborator replied:

OK, yes, this is a way to get around the fact that the chunked packer and its metadata are tied together, and that a failed host alloc in the past would have closed the instance chunked packer in the RapidsTable (https://github.com/NVIDIA/spark-rapids/pull/9656/files#diff-8816e7aa7f45c4fd7b0e557d63f30d3b3538e4b71633eb7ffa8d98713bb58dc7L122) via the copy iterator.

With the changes here, we create the packer only when we need it: at copy time, and to get the metadata.

We should clean this up in cuDF so we can get the metadata and contiguous size without instantiating a packer.
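
A minimal hedged sketch of the pattern being discussed: the packer is created only long enough to read its metadata and total contiguous size, then closed, so nothing long-lived is left open if a later host allocation fails. `ChunkedPackerLike`, its member types, and the local `withResource` are stand-ins for the real plugin and cuDF types.

```scala
// Stand-in for the real ChunkedPacker: exposes only what this sketch needs.
trait ChunkedPackerLike extends AutoCloseable {
  def getMeta: Array[Byte]
  def getTotalContiguousSize: Long
}

// Local loan-pattern helper so the sketch is self-contained.
def withResource[R <: AutoCloseable, T](r: R)(body: R => T): T =
  try body(r) finally r.close()

// Create the packer, read the two values we need, and close it immediately.
def metaAndSize(makeChunkedPacker: => ChunkedPackerLike): (Array[Byte], Long) =
  withResource(makeChunkedPacker) { cp =>
    (cp.getMeta, cp.getTotalContiguousSize)
  }
```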

@revans2 (Collaborator, Author) commented Nov 29, 2023

build

@revans2 merged commit 33bd589 into NVIDIA:branch-24.02 on Nov 29, 2023
36 checks passed
Labels
reliability: Features to improve reliability or bugs that severely impact the reliability of the plugin
Projects
None yet
4 participants