
host watermark metric #11725

Merged
@zpuller merged 7 commits into NVIDIA:branch-24.12 on Nov 22, 2024
Conversation

@zpuller (Collaborator) commented Nov 15, 2024

Adds a watermark metric to track the maximum amount of memory allocated on the host per task.

Tested manually and verified the values look sane in the UI and in logs.

Signed-off-by: Zach Puller <[email protected]>
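
As a rough illustration for readers skimming the thread, the per-task watermark described above amounts to keeping a running total of host bytes that is bumped on allocation, reduced on free, and whose peak is recorded as the metric. The sketch below is hypothetical (the class and method names are not the PR's actual API), assuming a simple synchronized counter:

// Minimal per-task watermark sketch; hypothetical names, not the PR's code.
class TaskHostMemoryWatermark {
  private var allocated: Long = 0L
  private var peak: Long = 0L

  def onAlloc(bytes: Long): Unit = synchronized {
    allocated += bytes
    peak = math.max(peak, allocated)
  }

  def onFree(bytes: Long): Unit = synchronized {
    // Clamp at zero in case a free from a previous task is observed first.
    allocated = math.max(allocated - bytes, 0L)
  }

  // The high-water mark reported as the task metric.
  def watermark: Long = synchronized { peak }
}
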
@sameerz added the "task" label (Work required that improves the product but is not user facing) on Nov 16, 2024
@zpuller (Collaborator, Author) commented Nov 18, 2024

build

@kuhushukla previously approved these changes Nov 19, 2024

@kuhushukla (Collaborator) left a comment

overall lgtm, could we do a UT for this?

@jihoonson (Collaborator) left a comment

Thanks @zpuller. Looks good to me overall, left some minor comments. Let me know what you think.

hostBytesAllocated -= bytes
// For some reason it's possible for the task to start out by releasing resources,
// possibly from a previous task; in that case we should probably just ignore it.
hostBytesAllocated = hostBytesAllocated.max(0)
(Collaborator) left a comment

Would it be worth logging a warning if this happens?

(Collaborator) left a comment

This is a weird behavior.

Could there be allocations here from threads that are not task threads? I don't think there is a case where a task could begin by freeing a bunch of memory, unless we didn't tie in correctly at the beginning and are missing the allocations, or the bytes we are getting here are somehow padded and different from the allocated amount.

@zpuller (Collaborator, Author) commented Nov 19, 2024

overall lgtm, could we do a UT for this?

It's possible, and normally I would be in favor of doing so, but in this case I opted not to for the following reasons:

  • HostAlloc depends on a TaskContext for reporting metrics, but in the existing unit tests there is no such context, so it just does nothing under test.
  • There are no existing unit tests for any of the other GpuTaskMetrics.
  • Similarly, when I implemented the equivalent feature for disk spill (Disk spill metric #11564), I did not add tests.
  • The overall functionality is not as critical; it doesn't impact the core data processing.

If you or others feel strongly, I can still find a way to add tests regardless.

@abellina (Collaborator) left a comment

I'd like for us to investigate the case where we can go negative on deallocation; that seems odd.

@zpuller (Collaborator, Author) commented Nov 19, 2024

I'd like for us to investigate the case where we can go negative on deallocation; that seems odd.

FYI, I didn't actually observe this here; it was a carryover from having observed it in the disk spill metric impl. In any case, I can still double-check if/why that may happen.

Signed-off-by: Zach Puller <[email protected]>
@jihoonson (Collaborator) commented

Thanks @zpuller for addressing the comments. The code change looks good to me. As for the metric being negative, I agree that it would be better to understand when and why it happens. But since that issue doesn't seem to be introduced by this change, I would be fine with merging this PR and continuing the investigation if it takes long.

Signed-off-by: Zach Puller <[email protected]>
Signed-off-by: Zach Puller <[email protected]>
@revans2 (Collaborator) left a comment

Looks good. Just a small comment about the logging level.

@@ -186,6 +203,9 @@ private class HostAlloc(nonPinnedLimit: Long) extends HostMemoryAllocator with L
      allocAttemptFinishedWithoutException = true
    } finally {
      if (ret.isDefined) {
        val metrics = GpuTaskMetrics.get
        metrics.incHostBytesAllocated(amount)
        logDebug(getHostAllocMetricsLogStr(metrics))
(Collaborator) left a comment

Can we make these logTrace instead of debug? These happen frequently enough that I would prefer to have them at an even lower logging level.

Signed-off-by: Zach Puller <[email protected]>
@abellina (Collaborator) commented Nov 21, 2024

I don't think the way the accumulators here are getting combined (max) works in this case, unless I am missing something. Because a pinned allocation (or the disk case, or the device case) could be allocated in thread 1 and freed in thread 2, we could end up with a negative value in thread 2 and overcount thread 1's max.

I thought the idea behind the metric was to get the "max used memory", and so that means to me that the task should really sample the tracker (device/host/disk) for the max used memory during the lifecycle of the task. Yes, that means the task is accounting for other tasks' memory, but the max is what we are after. So I think when we allocate/free, instead of sending the delta to the task-specific counter to inc/dec, we set the watermark and recompute the max.

@revans2 previously approved these changes Nov 21, 2024
@revans2 (Collaborator) commented Nov 21, 2024

I thought the idea behind the metric was to get the "max used memory", and so that means to me that the task should really sample the tracker (device/host/disk) for the max used memory during the lifecycle of the task. Yes, that means the task is accounting for other tasks' memory, but the max is what we are after. So I think when we allocate/free, instead of sending the delta to the task-specific counter to inc/dec, we set the watermark and recompute the max.

Thanks @abellina, I missed that. Yes, the max and the memory-allocated values need to be stored in a singleton/static location and protected with some kind of lock/atomic. I think that would be enough to make this all work.
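
Roughly, combining the two suggestions (one shared, process-wide tracker, with tasks sampling its peak), a sketch could look like the following. All names here are hypothetical and only illustrate the idea; this is not the final code:

import java.util.concurrent.atomic.AtomicLong

// Hypothetical process-wide tracker: allocations and frees from any thread
// update a single shared total, and the high-water mark is maintained atomically.
object HostMemoryTracker {
  private val allocated = new AtomicLong(0L)
  private val peak = new AtomicLong(0L)

  def onAlloc(bytes: Long): Unit = {
    val current = allocated.addAndGet(bytes)
    // CAS loop so concurrent updates never lose a higher watermark.
    var prev = peak.get()
    while (current > prev && !peak.compareAndSet(prev, current)) {
      prev = peak.get()
    }
  }

  def onFree(bytes: Long): Unit = {
    allocated.addAndGet(-bytes)
  }

  // Tasks sample this during their lifetime (e.g. on each alloc/free, or at
  // task completion) instead of keeping a per-thread delta that can go negative.
  def currentPeak: Long = peak.get()
}

With something along these lines, a free that lands on a different thread than its allocation only lowers the shared total; no per-task counter ever goes negative, and each task's metric is simply the maximum shared total it observed.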

@zpuller (Collaborator, Author) commented Nov 21, 2024

build

@revans2 (Collaborator) commented Nov 22, 2024

build

@zpuller merged commit cacc3ae into NVIDIA:branch-24.12 on Nov 22, 2024
49 checks passed