-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update for new retry state machine JNI APIs #9656
Update for new retry state machine JNI APIs #9656
Conversation
Signed-off-by: Robert (Bobby) Evans <[email protected]>
Signed-off-by: Robert (Bobby) Evans <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass - looks good.
Will take another look at both prs.
@@ -342,7 +342,8 @@ object RmmRapidsRetryIterator extends Logging { | |||
override def hasNext: Boolean | |||
|
|||
/** | |||
* Split is a function that is invoked by `RmmRapidsRetryIterator` when `SplitAndRetryOOM` | |||
* Split is a function that is invoked by `RmmRapidsRetryIterator` when `GpuSplitAndRetryOOM` | |||
* of `CpuSplitAndRetryOOM` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* of `CpuSplitAndRetryOOM` | |
* or `CpuSplitAndRetryOOM` |
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala
Outdated
Show resolved
Hide resolved
@@ -437,6 +437,14 @@ object GpuDeviceManager extends Logging { | |||
s"dropping pinned memory to ${ret / 1024 / 1024.0} MiB") | |||
ret | |||
} | |||
val nonPinnedLimit = finalMemoryLimit - totalOverhead - pinnedLimit | |||
// TODO make this debug??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove TODO, I quite like this log :D
withResource(makeChunkedPacker) { cp => | ||
(cp.getMeta, cp.getTotalContiguousSize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what is happening here. Doesn't this close the ChunkedPacker? Is the metadata and total contiguous size all we need here?
Previously we held the chunked packer open until free()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know @revans2 had issues where the chunked packer was closed because of a failed host allocation, and a re-attempt to spill to disk. I'll review the changes around here also today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok yes this is a way to get around the fact that the chunked packer, and its metadata, are tied together. And because a failed host alloc in the past would have closed the instance chunked packer in the RapidsTable
(https://github.com/NVIDIA/spark-rapids/pull/9656/files#diff-8816e7aa7f45c4fd7b0e557d63f30d3b3538e4b71633eb7ffa8d98713bb58dc7L122) via the copy iterator.
With the changes here, we are creating the packer only when we need it => at copy time, and to get the metadata.
We should clean this up in cuDF so we can get the metadata and contiguous size without instantiating a packer.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala
Outdated
Show resolved
Hide resolved
build |
This depends on NVIDIA/spark-rapids-jni#1543. NVIDIA/spark-rapids-jni#1543 is also a breaking change so both of these need to go in at about the same time.
This mostly just updates the existing code to use the new APIs that are equivalent to what we used before. A separate PR will come later to add in support for CPU allocation/retry along with optional configs to try and limit the CPU memory usage.