Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fine-grained spill metrics #9509

Merged
merged 19 commits into from
Nov 14, 2023

Conversation

gerashegalov
Copy link
Collaborator

@gerashegalov gerashegalov commented Oct 22, 2023

This PR fixes #8880. it removes task metrics

  • gpuSpillBlockTime
  • gpuReadSpillTime

and adds the following metrics instead:

  • gpuSpillToHostTime
  • gpuSpillToDiskTime
  • gpuReadSpillFromHostTime
  • gpuReadSpillFromDiskTime

Signed-off-by: Gera Shegalov [email protected]

Signed-off-by: Gera Shegalov <[email protected]>
@sameerz sameerz added the task Work required that improves the product but is not user facing label Oct 23, 2023
@gerashegalov gerashegalov changed the title [WIP] Fine-grained spilled metrics [WIP] Fine-grained spill metrics Oct 24, 2023
@gerashegalov gerashegalov self-assigned this Nov 8, 2023
@gerashegalov gerashegalov changed the title [WIP] Fine-grained spill metrics Fine-grained spill metrics Nov 8, 2023
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov gerashegalov marked this pull request as ready for review November 8, 2023 07:27
@gerashegalov
Copy link
Collaborator Author

build

@gerashegalov gerashegalov requested a review from revans2 November 8, 2023 07:31
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also look at updating the tuning docs for these changes. I think that doc changed to be in an internal NVIDIA repo so @mattahrens might need to help you get it done right.

Also @jlowe with me initially trying to call out project rapids4spark, so I want to check with him that he is okay with the names of the metrics here.

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates look good, happy to see fewer metrics being added. Still wondering about host alloc time. Do we have a GPU alloc time metric or plans to add one?

@revans2
Copy link
Collaborator

revans2 commented Nov 9, 2023

Updates look good, happy to see fewer metrics being added. Still wondering about host alloc time. Do we have a GPU alloc time metric or plans to add one?

I think there may have been some miscommunication here. Or at least I didn't do a good job of this. In #8880 I asked for "metric for the amount of time a task was blocked on host memory allocation" This was intended to be analogous to gpuRetryBlockTime. But it was before we decided to use the same state machine for both. I will handle that part. I need to split up the metric so we can get GPU and CPU numbers for it. But it should be doable.

Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov
Copy link
Collaborator Author

build

@gerashegalov gerashegalov added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Nov 10, 2023
@gerashegalov gerashegalov requested a review from jlowe November 13, 2023 19:48
@gerashegalov
Copy link
Collaborator Author

build

@jlowe jlowe merged commit e3ad47b into NVIDIA:branch-23.12 Nov 14, 2023
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reliability Features to improve reliability or bugs that severly impact the reliability of the plugin task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add host memory task metrics and explore a host memory allocation watch dog.
5 participants