
host watermark metric #11725

Merged
@zpuller merged 7 commits into NVIDIA:branch-24.12 on Nov 22, 2024
Conversation

@zpuller (Collaborator) commented Nov 15, 2024

Adds a watermark metric to track the maximum amount of memory allocated on the host per task.

Tested manually and verified the values look sane in the UI and in logs.

Signed-off-by: Zach Puller <[email protected]>
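
As a rough illustration for readers skimming the thread, the per-task watermark described above amounts to keeping a running total of host bytes that is bumped on allocation, reduced on free, and whose peak is recorded as the metric. The sketch below is hypothetical (the class and method names are not the PR's actual API), assuming a simple synchronized counter:

// Minimal per-task watermark sketch; hypothetical names, not the PR's code.
class TaskHostMemoryWatermark {
  private var allocated: Long = 0L
  private var peak: Long = 0L

  def onAlloc(bytes: Long): Unit = synchronized {
    allocated += bytes
    peak = math.max(peak, allocated)
  }

  def onFree(bytes: Long): Unit = synchronized {
    // Clamp at zero in case a free from a previous task is observed first.
    allocated = math.max(allocated - bytes, 0L)
  }

  // The high-water mark reported as the task metric.
  def watermark: Long = synchronized { peak }
}
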
@sameerz added the "task" label (Work required that improves the product but is not user facing) on Nov 16, 2024
@zpuller (Collaborator, Author) commented Nov 18, 2024

build

@kuhushukla previously approved these changes Nov 19, 2024

@kuhushukla (Collaborator) left a comment

overall lgtm, could we do a UT for this?

@jihoonson (Collaborator) left a comment

Thanks @zpuller. Looks good to me overall, left some minor comments. Let me know what you think.

hostBytesAllocated -= bytes
// For some reason it's possible for the task to start out by releasing resources,
// possibly from a previous task; in that case we should probably just ignore it.
hostBytesAllocated = hostBytesAllocated.max(0)
(Collaborator) left a comment

Would it be worth logging a warning if this happens?

(Collaborator) left a comment

This is a weird behavior.

Could there be allocations here from threads that are not task threads? I don't think there is a case where a task could begin by freeing a bunch of memory, unless we didn't tie in correctly at the beginning and are missing the allocations, or the bytes we are getting here are somehow padded and different from the allocated amount.

@zpuller (Collaborator, Author) commented Nov 19, 2024

overall lgtm, could we do a UT for this?

It's possible, and normally I would be in favor of doing so, but in this case I opted not to for the following reasons:

  • HostAlloc depends on a TaskContext for reporting metrics, but in the existing unit tests there is no such context, so it just does nothing under test.
  • There are no existing unit tests for any of the other GpuTaskMetrics.
  • Similarly, when I implemented the equivalent feature for disk spill (Disk spill metric #11564), I did not add tests.
  • The overall functionality is not as critical; it doesn't impact the core data processing.

If you or others feel strongly, I can still find a way to add tests regardless.

@abellina (Collaborator) left a comment

I'd like for us to investigate the case where we can go negative on deallocation; that seems odd.

@zpuller (Collaborator, Author) commented Nov 19, 2024

I'd like for us to investigate the case where we can go negative on deallocation; that seems odd.

FYI, I didn't actually observe this here; it was a carryover from having observed it in the disk spill metric impl. In any case, I can still double-check if/why that may happen.

Signed-off-by: Zach Puller <[email protected]>
@jihoonson (Collaborator) commented

Thanks @zpuller for addressing the comments. The code change looks good to me. As for the metric being negative, I agree that it would be better to understand when and why it happens. But since that issue doesn't seem to be introduced by this change, I would be fine with merging this PR and continuing the investigation if it takes long.

Signed-off-by: Zach Puller <[email protected]>
Signed-off-by: Zach Puller <[email protected]>
@revans2 (Collaborator) left a comment

Looks good. Just a small comment about the logging level.

@@ -186,6 +203,9 @@ private class HostAlloc(nonPinnedLimit: Long) extends HostMemoryAllocator with L
      allocAttemptFinishedWithoutException = true
    } finally {
      if (ret.isDefined) {
        val metrics = GpuTaskMetrics.get
        metrics.incHostBytesAllocated(amount)
        logDebug(getHostAllocMetricsLogStr(metrics))
(Collaborator) left a comment

Can we make these logTrace instead of debug? These happen frequently enough that I would prefer to have them at an even lower logging level.

Signed-off-by: Zach Puller <[email protected]>
@abellina (Collaborator) commented Nov 21, 2024

I don't think the way the accumulators here are getting combined (max) works in this case, unless I am missing something. Because a pinned allocation (or the disk case, or the device case) could be allocated in thread 1 and freed in thread 2, we could end up with a negative value in thread 2 and overcount thread 1's max.

I thought the idea behind the metric was to get the "max used memory", and so that means to me that the task should really sample the tracker (device/host/disk) for the max used memory during the lifecycle of the task. Yes, that means the task is accounting for other tasks' memory, but the max is what we are after. So I think when we allocate/free, instead of sending the delta to the task-specific counter to inc/dec, we set the watermark and recompute the max.

@revans2 previously approved these changes Nov 21, 2024
@revans2 (Collaborator) commented Nov 21, 2024

I thought the idea behind the metric was to get the "max used memory", and so that means to me that the task should really sample the tracker (device/host/disk) for the max used memory during the lifecycle of the task. Yes, that means the task is accounting for other tasks' memory, but the max is what we are after. So I think when we allocate/free, instead of sending the delta to the task-specific counter to inc/dec, we set the watermark and recompute the max.

Thanks @abellina, I missed that. Yes, the max and the memory-allocated values need to be stored in a singleton/static location and protected with some kind of lock/atomic. I think that would be enough to make this all work.
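
Roughly, combining the two suggestions (one shared, process-wide tracker, with tasks sampling its peak), a sketch could look like the following. All names here are hypothetical and only illustrate the idea; this is not the final code:

import java.util.concurrent.atomic.AtomicLong

// Hypothetical process-wide tracker: allocations and frees from any thread
// update a single shared total, and the high-water mark is maintained atomically.
object HostMemoryTracker {
  private val allocated = new AtomicLong(0L)
  private val peak = new AtomicLong(0L)

  def onAlloc(bytes: Long): Unit = {
    val current = allocated.addAndGet(bytes)
    // CAS loop so concurrent updates never lose a higher watermark.
    var prev = peak.get()
    while (current > prev && !peak.compareAndSet(prev, current)) {
      prev = peak.get()
    }
  }

  def onFree(bytes: Long): Unit = {
    allocated.addAndGet(-bytes)
  }

  // Tasks sample this during their lifetime (e.g. on each alloc/free, or at
  // task completion) instead of keeping a per-thread delta that can go negative.
  def currentPeak: Long = peak.get()
}

With something along these lines, a free that lands on a different thread than its allocation only lowers the shared total; no per-task counter ever goes negative, and each task's metric is simply the maximum shared total it observed.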

@zpuller (Collaborator, Author) commented Nov 21, 2024

build

@revans2 (Collaborator) commented Nov 22, 2024

build

@zpuller merged commit cacc3ae into NVIDIA:branch-24.12 on Nov 22, 2024
49 checks passed