add a few more stage level metrics #11821

binmahone · 2024-12-04T08:36:28Z

This PR closes #11820 by adding the three metrics pointed to by the green arrows:

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

binmahone · 2024-12-04T08:36:52Z

build

jihoonson

Thanks @binmahone. The new metrics seem useful. I left some minor comments.

jihoonson · 2024-12-04T22:09:21Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSemaphore.scala

@@ -316,7 +319,8 @@ private final class SemaphoreTaskInfo(val stageId: Int, val taskAttemptId: Long)
    if (hasSemaphore) {
      semaphore.release(numPermits)
      hasSemaphore = false
-      lastHeld = System.currentTimeMillis()
+      lastReleased = System.nanoTime()
+      GpuTaskMetrics.get.addGpuTime(lastReleased - lastAcquired)


Should we call it semaphoreTime instead? Because what we measure is how long the task has held the semaphore.

binmahone · 2024-12-05

semaphoreTime could also be ambiguous because it could mean either "time spent acquiring the semaphore" or "time spent holding the semaphore". In fact, if you take a look at the equation "total_core_time = total_gpu_time + total_acquire_gpu_time + other_time_spent_on_cpu", isn't it rather intuitive to call it "gpuTime"?

jihoonson · 2024-12-05T21:58:18Z

We can call it semaphoreHoldingTime if it's not clear. The reason I prefer "semaphore" to "gpu" is that "semaphore" is less misleading: we may not use the GPU for the whole time we hold the semaphore.

Though I realized that naming cannot completely solve the ambiguity problem. I would suggest adding a comment here that explains exactly how this metric is measured. The naming will matter less once it is documented.

binmahone · 2024-12-10

Renamed to semaphoreHoldingTime. Since semaphoreHoldingTime is quite self-explanatory, I skipped the comment.
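For readers following the thread, here is a minimal sketch of what "semaphore holding time" measures: the elapsed `System.nanoTime()` between acquiring and releasing the permit, accumulated per task, regardless of whether the GPU was busy in between. `HoldTimeMetric` and `HoldTracker` are hypothetical names used only for illustration, and the sketch is single-threaded; it is not the plugin's `GpuTaskMetrics` or `SemaphoreTaskInfo`.

```scala
// Hypothetical stand-in for a per-task metric accumulator (illustration only).
final class HoldTimeMetric {
  private var totalNs: Long = 0L
  def add(ns: Long): Unit = totalNs += ns
  def value: Long = totalNs
}

// Hypothetical, single-threaded simplification of the acquire/release bookkeeping.
// The clock is injectable so the behaviour can be checked deterministically.
final class HoldTracker(metric: HoldTimeMetric,
                        clock: () => Long = () => System.nanoTime()) {
  private var hasSemaphore = false
  private var lastAcquired = 0L

  def acquired(): Unit = {
    if (!hasSemaphore) {
      hasSemaphore = true
      lastAcquired = clock()
    }
  }

  def released(): Unit = {
    if (hasSemaphore) {
      hasSemaphore = false
      // Time the permit was held, whether or not the GPU was actually in use.
      metric.add(clock() - lastAcquired)
    }
  }
}
```

Under this reading, the equation quoted above splits a task's core time into time holding the semaphore, time waiting to acquire it, and everything else spent on the CPU.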

binmahone · 2024-12-05T00:33:03Z

Currently CI is blocked by #11822 as @NvTimLiu mentioned

jihoonson · 2024-12-05T20:16:54Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/PrioritySemaphore.scala

+
+import org.apache.spark.sql.rapids.GpuTaskMetrics
+
+class PrioritySemaphore[T](val maxPermits: Int, val priorityForNonStarted: T)


nit: is there any case where we want to use a different value of priorityForNonStarted than GpuSemaphore.DEFAULT_PRIORITY? If not, we can probably just use GpuSemaphore.DEFAULT_PRIORITY directly in this class instead of passing it to the constructor. The reason I prefer using the static variable directly is that this constructor parameter seems to tell me that it can be other values than GpuSemaphore.DEFAULT_PRIORITY in some cases, which I'm not sure if it's true. This is more for code readability and easy to change later if needed, so I'm OK with the current code as well.

binmahone · 2024-12-10

The main reason for `val priorityForNonStarted: T` is that PrioritySemaphore has a type parameter T. (I have no idea why a type parameter is needed here.)
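To illustrate the point about the type parameter, here is a minimal sketch, assuming a simplified priority queue rather than the real `PrioritySemaphore`. A class that is generic in `T` cannot name a concrete default of type `T` itself, so the caller supplies one. `DefaultingQueue` and its methods are hypothetical names.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in; not the plugin's PrioritySemaphore.
// Because the class is generic in T, it cannot refer to a concrete constant
// such as a Long default, so the default priority is a constructor argument.
class DefaultingQueue[T](val priorityForNonStarted: T)(implicit ord: Ordering[T]) {
  // Max-heap ordered by priority only.
  private val waiting =
    mutable.PriorityQueue.empty[(T, String)](Ordering.by[(T, String), T](_._1))

  // Waiters without an explicit priority fall back to priorityForNonStarted.
  def enqueue(name: String, priority: Option[T]): Unit =
    waiting.enqueue((priority.getOrElse(priorityForNonStarted), name))

  // Returns the waiter with the highest priority, if any.
  def next(): Option[String] =
    if (waiting.isEmpty) None else Some(waiting.dequeue()._2)
}
```

If the class were fixed to a single priority type, the default could be referenced directly, as the review comment suggests.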

binmahone · 2024-12-10T06:49:01Z

build

jihoonson

Thanks @binmahone for addressing my comments. LGTM!

binmahone · 2024-12-12T03:52:27Z

build

add a few more stage level metrics

57e8bfd

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

NvTimLiu mentioned this pull request Dec 4, 2024

[BUG] [Spark 4] Type mismatch Exceptions from DFUDFShims.scala with Spark-4.0.0 expressions.Expression #11822

Closed

jihoonson reviewed Dec 4, 2024

address comments

783cd27

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

jihoonson reviewed Dec 5, 2024

address comments

af093a6

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

jihoonson approved these changes Dec 10, 2024

Merge remote-tracking branch 'origin/branch-25.02' into 241204_metric…

f76f6fe

…s_on_gpu_contention

binmahone merged commit c0fe534 into NVIDIA:branch-25.02 Dec 13, 2024
50 checks passed

sameerz added the task Work required that improves the product but is not user facing label Dec 16, 2024
