Update for new retry state machine JNI APIs #9656

Merged: 11 commits into NVIDIA:branch-24.02 on Nov 29, 2023

Conversation

@revans2 (Collaborator) commented Nov 7, 2023

This depends on NVIDIA/spark-rapids-jni#1543, which is also a breaking change, so both of these need to go in at about the same time.

This mostly just updates the existing code to use the new APIs that are equivalent to what we used before. A separate PR will come later to add support for CPU allocation/retry, along with optional configs to try to limit CPU memory usage.
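
To make the rename concrete, here is a minimal, hypothetical sketch (not code from this PR) of the kind of change callers make when moving to the new exception types. It assumes the `GpuSplitAndRetryOOM` and `CpuSplitAndRetryOOM` classes referenced in the review below live in the `com.nvidia.spark.rapids.jni` package; the helper `runWithSplitHandling` is invented for illustration.

```scala
import com.nvidia.spark.rapids.jni.{CpuSplitAndRetryOOM, GpuSplitAndRetryOOM}

// Hypothetical helper: run `attempt`, and if the retry state machine signals that
// the work must be split, shrink the unit of work and try once more. Real retry
// code loops and bounds the number of splits; this sketch keeps only the shape.
def runWithSplitHandling[T](attempt: () => T, split: () => Unit): T = {
  try {
    attempt()
  } catch {
    // Before this change a single SplitAndRetryOOM type was caught here; the new
    // JNI APIs distinguish GPU and CPU out-of-memory retry signals.
    case _: GpuSplitAndRetryOOM | _: CpuSplitAndRetryOOM =>
      split()   // make the inputs smaller
      attempt() // retry with the smaller inputs
  }
}
```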

@mattahrens added the reliability label (Features to improve reliability or bugs that severely impact the reliability of the plugin) Nov 9, 2023
@revans2 changed the base branch from branch-23.12 to branch-24.02 on November 17, 2023 at 14:16
@abellina self-requested a review on November 21, 2023 at 14:46
@jbrennan333 (Contributor) left a comment

First pass - looks good.
Will take another look at both PRs.

@@ -342,7 +342,8 @@ object RmmRapidsRetryIterator extends Logging {
     override def hasNext: Boolean
 
     /**
-     * Split is a function that is invoked by `RmmRapidsRetryIterator` when `SplitAndRetryOOM`
+     * Split is a function that is invoked by `RmmRapidsRetryIterator` when `GpuSplitAndRetryOOM`
+     * of `CpuSplitAndRetryOOM`
Contributor commented with a suggested change:

-     * of `CpuSplitAndRetryOOM`
+     * or `CpuSplitAndRetryOOM`
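
For context on the doc comment being edited, here is a minimal hedged sketch of what a split function can look like: it takes the input that failed and returns smaller pieces for the iterator to retry. `SpillableLike` and `splitInHalfByRows` are illustrative stand-ins, not the plugin's real types.

```scala
// Illustrative stand-in for a spillable input that can be sliced into smaller pieces.
trait SpillableLike {
  def numRows: Int
  def slice(start: Int, end: Int): SpillableLike
}

// A simple split policy: halve the input by row count. If nothing can be split
// further, the OOM has to propagate, so fail loudly instead of looping forever.
def splitInHalfByRows(input: SpillableLike): Seq[SpillableLike] = {
  require(input.numRows > 1, "cannot split a single-row input any further")
  val mid = input.numRows / 2
  Seq(input.slice(0, mid), input.slice(mid, input.numRows))
}
```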

@@ -437,6 +437,14 @@ object GpuDeviceManager extends Logging {
         s"dropping pinned memory to ${ret / 1024 / 1024.0} MiB")
       ret
     }
+    val nonPinnedLimit = finalMemoryLimit - totalOverhead - pinnedLimit
+    // TODO make this debug???
Collaborator commented:

remove TODO, I quite like this log :D
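
For readers following along, here is a small worked sketch of the accounting in the hunk above, with made-up numbers; only the final subtraction mirrors the diff, and the values and log wording are illustrative.

```scala
// Illustrative values only: an 8 GiB host budget with 1 GiB reserved overhead
// and a 2 GiB pinned pool leaves 5 GiB for non-pinned host allocations.
val finalMemoryLimit = 8L * 1024 * 1024 * 1024
val totalOverhead    = 1L * 1024 * 1024 * 1024
val pinnedLimit      = 2L * 1024 * 1024 * 1024

val nonPinnedLimit = finalMemoryLimit - totalOverhead - pinnedLimit
println(s"non-pinned host memory limit is ${nonPinnedLimit / 1024 / 1024.0} MiB") // 5120.0 MiB
```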

Comment on lines +260 to +261
withResource(makeChunkedPacker) { cp =>
(cp.getMeta, cp.getTotalContiguousSize)
Contributor commented:

I'm not sure I understand what is happening here. Doesn't this close the ChunkedPacker? Are the metadata and total contiguous size all we need here?
Previously we held the chunked packer open until free().

Collaborator replied:

I know @revans2 had issues where the chunked packer was closed because of a failed host allocation and a re-attempt to spill to disk. I'll also review the changes around here today.

Collaborator replied:

OK, yes, this is a way to get around the fact that the chunked packer and its metadata are tied together, and that a failed host alloc in the past would have closed the instance chunked packer in the RapidsTable (https://github.com/NVIDIA/spark-rapids/pull/9656/files#diff-8816e7aa7f45c4fd7b0e557d63f30d3b3538e4b71633eb7ffa8d98713bb58dc7L122) via the copy iterator.

With the changes here, we create the packer only when we need it: at copy time, and to get the metadata.

We should clean this up in cuDF so we can get the metadata and contiguous size without instantiating a packer.
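
A minimal hedged sketch of the pattern being discussed: the packer is created only long enough to read its metadata and total contiguous size, then closed, so nothing long-lived is left open if a later host allocation fails. `ChunkedPackerLike`, its member types, and the local `withResource` are stand-ins for the real plugin and cuDF types.

```scala
// Stand-in for the real ChunkedPacker: exposes only what this sketch needs.
trait ChunkedPackerLike extends AutoCloseable {
  def getMeta: Array[Byte]
  def getTotalContiguousSize: Long
}

// Local loan-pattern helper so the sketch is self-contained.
def withResource[R <: AutoCloseable, T](r: R)(body: R => T): T =
  try body(r) finally r.close()

// Create the packer, read the two values we need, and close it immediately.
def metaAndSize(makeChunkedPacker: => ChunkedPackerLike): (Array[Byte], Long) =
  withResource(makeChunkedPacker) { cp =>
    (cp.getMeta, cp.getTotalContiguousSize)
  }
```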

@revans2 (Collaborator, Author) commented Nov 29, 2023

build

@revans2 merged commit 33bd589 into NVIDIA:branch-24.02 on Nov 29, 2023
36 checks passed
Labels
reliability: Features to improve reliability or bugs that severely impact the reliability of the plugin
Projects
None yet
4 participants