Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a few more stage level metrics #11821

Merged

Conversation

binmahone
Copy link
Collaborator

This PR closes #11820 by adding the three metrics pointed by green arrow:

image

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
@binmahone
Copy link
Collaborator Author

build

Copy link
Collaborator

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @binmahone. The new metrics seem useful. I left some minor comments.

@@ -316,7 +319,8 @@ private final class SemaphoreTaskInfo(val stageId: Int, val taskAttemptId: Long)
if (hasSemaphore) {
semaphore.release(numPermits)
hasSemaphore = false
lastHeld = System.currentTimeMillis()
lastReleased = System.nanoTime()
GpuTaskMetrics.get.addGpuTime(lastReleased - lastAcquired)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call it semaphoreTime instead? Because what we measure is how long the task has held the semaphore.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

semaphoreTime could also be ambiguous because it could either mean "time acquiring semaphore" or "time having the semaphore". In fact, if you take a looks at the equation: "total_core_time = total_gpu_time + total_acquire_gpu_time + other_time_spent_on_cpu", it is somehow intuitive to call it "gpuTime"?

Copy link
Collaborator

@jihoonson jihoonson Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can call it semaphoreHoldingTime if it's not clear. The reason I prefer semaphore to gpu is semaphore is less misleading. We may not always use the gpu for the whole time while holding the semaphore.

Though, I realized that the naming cannot solve the ambiguity problem completely. I would suggest adding some comment here that explains exactly how this metric is measured. The naming will be less important once documentation is done.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to semaphoreHoldingTime. since semaphoreHoldingTime is quite self explainatory, I skipped the comment

@binmahone
Copy link
Collaborator Author

Currently CI is blocked by #11822 as @NvTimLiu mentioned

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

import org.apache.spark.sql.rapids.GpuTaskMetrics

class PrioritySemaphore[T](val maxPermits: Int, val priorityForNonStarted: T)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is there any case where we want to use a different value of priorityForNonStarted than GpuSemaphore.DEFAULT_PRIORITY? If not, we can probably just use GpuSemaphore.DEFAULT_PRIORITY directly in this class instead of passing it to the constructor. The reason I prefer using the static variable directly is that this constructor parameter seems to tell me that it can be other values than GpuSemaphore.DEFAULT_PRIORITY in some cases, which I'm not sure if it's true. This is more for code readability and easy to change later if needed, so I'm OK with the current code as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the major reason for val priorityForNonStarted: T is that PrioritySemaphore has a template class T (I have no idea why template class is needed here.)

@@ -316,7 +319,8 @@ private final class SemaphoreTaskInfo(val stageId: Int, val taskAttemptId: Long)
if (hasSemaphore) {
semaphore.release(numPermits)
hasSemaphore = false
lastHeld = System.currentTimeMillis()
lastReleased = System.nanoTime()
GpuTaskMetrics.get.addGpuTime(lastReleased - lastAcquired)
Copy link
Collaborator

@jihoonson jihoonson Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can call it semaphoreHoldingTime if it's not clear. The reason I prefer semaphore to gpu is semaphore is less misleading. We may not always use the gpu for the whole time while holding the semaphore.

Though, I realized that the naming cannot solve the ambiguity problem completely. I would suggest adding some comment here that explains exactly how this metric is measured. The naming will be less important once documentation is done.

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
@binmahone
Copy link
Collaborator Author

build

Copy link
Collaborator

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @binmahone for addressing my comments. LGTM!

@binmahone
Copy link
Collaborator Author

build

@binmahone binmahone merged commit c0fe534 into NVIDIA:branch-25.02 Dec 13, 2024
50 checks passed
@sameerz sameerz added the task Work required that improves the product but is not user facing label Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] add a few more metrics at stage level
3 participants