[FEA] Add IO diagnostic output for GPU slowness in Profiler tool #1451

cindyyuanjiang · 2024-12-06T01:40:36Z

Contributes to #1374

Changes

Added an IO diagnostic view in Profiler output: io_diagnostic_metrics.csv
- Added class IOAccumDiagnosticMetrics to store selected IO related metric names and methods
- Added class IODiagnosticResult to represent each IO diagnostic result
- In core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala, cache results from generateSQLAccums and use them to compute IO diagnostic metrics in function generateIODiagnosticAccums
- Added IODiagnostics in class DiagnosticSummaryInfo
- Reorganized AccumProfileResults and SQLAccumProfileResults presentation for better readability

Testing

Added unit test "test IO diagnostic metrics" in core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala

Example Output

appIndex,appName,appId,sqlId,stageId,stageDurationMs,nodeId,nodeName,outputRowsMin,outputRowsMedian,outputRowsMax,outputRowsTotal,scanTimeMin,scanTimeMedian,scanTimeMax,scanTimeTotal,outputBatchesMin,outputBatchesMedian,outputBatchesMax,outputBatchesTotal,bufferTimeMin,bufferTimeMedian,bufferTimeMax,bufferTimeTotal,shuffleWriteTimeMin,shuffleWriteTimeMedian,shuffleWriteTimeMax,shuffleWriteTimeTotal,fetchWaitTimeMin,fetchWaitTimeMedian,fetchWaitTimeMax,fetchWaitTimeTotal,gpuDecodeTimeMin,gpuDecodeTimeMedian,gpuDecodeTimeMax,gpuDecodeTimeTotal
1,Spark shell,local-1622814619968,0,0,1743,16,"GpuColumnarExchange",1666666,1666667,1666667,10000000,0,0,0,0,200,200,200,1200,0,0,0,0,41434653,60830365,100858775,400284505,0,0,0,0,0,0,0,0
1,Spark shell,local-1622814619968,0,0,1743,21,"Scan",1666666,1666667,1666667,10000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Spark shell,local-1622814619968,0,1,1631,8,"GpuColumnarExchange",1666666,1666667,1666667,10000000,0,0,0,0,200,200,200,1200,0,0,0,0,37444140,92128351,108992798,508750471,0,0,0,0,0,0,0,0
1,Spark shell,local-1622814619968,0,1,1631,13,"Scan",1666666,1666667,1666667,10000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Spark shell,local-1622814619968,0,2,688,3,"GpuColumnarExchange",1,1,1,200,0,0,0,0,1,1,1,200,0,0,0,0,139875,230038,9747416,93193331,0,0,0,0,0,0,0,0

Follow-up Issue

#1454

Signed-off-by: cindyyuanjiang <[email protected]>

nartal1

Thanks @cindyyuanjiang! Overall looks good. Were you able to test this PR with large eventlogs? Is there any impact on the runtime?

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AnalysisUtils.scala

nartal1 · 2024-12-09T18:08:09Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+        val stageDiagnosticInfo = HashMap.empty[String, StatisticsMetrics]
+
+        sqlAccums.foreach { sqlAccum =>
+          val stageTaskIds = app.taskManager.getAllTasksStageAttempt(stageId).map(_.taskId).toSet


Nit: We can move this before sqlAccums.foreach, i.e., compute and cache stageTaskIds once per stage rather than recomputing it inside the loop for each accumulator.

thanks @nartal1, updated this.
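
The suggestion above amounts to hoisting a loop-invariant computation. A minimal sketch of the idea, assuming hypothetical stand-in types (`Task`, `updatesInStage`, and the shape of `accumUpdates` are illustrative and not the tool's actual API):

```scala
object HoistSketch {
  // Hypothetical stand-in for a task record; only the fields needed here.
  final case class Task(taskId: Long, stageId: Int)

  // For each accumulator, count how many of its task updates belong to the stage.
  // accumUpdates maps accumulatorId -> (taskId -> value); this shape is an assumption.
  def updatesInStage(
      tasks: Seq[Task],
      stageId: Int,
      accumUpdates: Map[Long, Map[Long, Long]]): Map[Long, Int] = {
    // Computed once per stage, outside the per-accumulator loop.
    val stageTaskIds: Set[Long] =
      tasks.filter(_.stageId == stageId).map(_.taskId).toSet
    accumUpdates.map { case (accumId, taskUpdates) =>
      accumId -> taskUpdates.keys.count(id => stageTaskIds.contains(id))
    }
  }
}
```

With the set built once, each accumulator only pays for the membership checks rather than for rebuilding the task-ID set.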

parthosa

Thanks @cindyyuanjiang. Made some minor comments.

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

Signed-off-by: cindyyuanjiang <[email protected]>

parthosa · 2024-12-10T18:37:32Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+            new AccumInfo(AccumMetaRef(0L, AccumNameRef("")))
+          )
+          // Compute the metric's statistics (min, median, max, sum) for the given stage
+          val accumInfoStatistics = getAccumInfoStatisticsInStage(accumInfo, stageTaskIds)


This block seems to deconstruct the tuple returned from getAccumInfoStatisticsInStage() and then create an instance of StatisticsMetrics from the individual values (min, med, max, sum).

Can getAccumInfoStatisticsInStage() directly return an Option[StatisticsMetrics] ?

getAccumInfoStatisticsInStage is also used in generateStageLevelAccums. I didn't want to create an extra object in this function. WDYT?

A tuple (Long, Long, Long, Long) is also an object.

refactored this part, please see new changes.
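
For illustration, here is a sketch of a helper that returns an Option of a statistics object directly instead of a tuple. `StatsSketch` and `StatsOps.fromTaskUpdates` are hypothetical names standing in for StatisticsMetrics and the refactored function, and the upper-middle median is an assumption rather than the tool's confirmed definition:

```scala
// Hypothetical stand-in for the tool's StatisticsMetrics class.
final case class StatsSketch(min: Long, med: Long, max: Long, total: Long)

object StatsOps {
  // Returns None when no task in the stage updated the accumulator, so callers
  // do not need to unpack and re-wrap a (Long, Long, Long, Long) tuple.
  def fromTaskUpdates(
      taskUpdates: Map[Long, Long],
      stageTaskIds: Set[Long]): Option[StatsSketch] = {
    // Convert to Seq before looking up values so equal values from
    // different tasks are not collapsed by Set semantics.
    val sorted = stageTaskIds.toSeq.flatMap(id => taskUpdates.get(id)).sorted
    if (sorted.isEmpty) {
      None
    } else {
      // Assumption: use the upper-middle element for even-sized inputs.
      Some(StatsSketch(sorted.head, sorted(sorted.length / 2), sorted.last, sorted.sum))
    }
  }
}
```

Either return type allocates an object per call; the Option form simply keeps the empty case explicit for the caller.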

parthosa · 2024-12-10T18:49:47Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+   * @return A sequence of `IODiagnosticResult` objects containing diagnostic metrics.
+   */
+  def generateIODiagnosticAccums(): Seq[IODiagnosticResult] = {
+    val zeroRecord = StatisticsMetrics.ZERO_RECORD


nit: Can we move zeroRecord to the else block where it is being used?

thanks, updated.

parthosa · 2024-12-10T18:51:50Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AnalysisUtils.scala

+  }
+
+  /**
+   * Normalize a metric name to its IO diagnostic metric constant


This comment seems too generic. I believe we perform this normalization because we want to support variations of the output rows metric name, such as "join output rows" and "number of output rows".

Can we specify the reason for this normalization in the comment?

thanks, updated.
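
A minimal sketch of the kind of name normalization discussed above, assuming a plain string-to-string lookup. The constant `OutputRowsKey`, the map contents, and `normalize` are hypothetical; only the two variant names come from the review comment:

```scala
object MetricNameSketch {
  // Hypothetical canonical key for the output rows metric.
  val OutputRowsKey = "output rows"

  // Several operator-specific spellings map to one canonical key, so that
  // values reported under different names land in the same output column.
  private val metricNamesToKey: Map[String, String] = Map(
    "output rows" -> OutputRowsKey,
    "join output rows" -> OutputRowsKey,
    "number of output rows" -> OutputRowsKey)

  // Returns None for names that are not IO diagnostic metrics.
  def normalize(metricName: String): Option[String] =
    metricNamesToKey.get(metricName)
}
```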

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

amahussein · 2024-12-11T15:58:12Z

core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala

+    outputBatchesMed: Long,
+    outputBatchesMax: Long,
+    outputBatchesSum: Long,
+    buffeTimeMin: Long,


bufferTime

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AnalysisUtils.scala

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang · 2024-12-18T07:15:44Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

-    Seq(appIndex.toString, sqlID.toString, nodeID.toString,
-      StringUtils.reformatCSVString(nodeName), accumulatorId.toString,
-      StringUtils.reformatCSVString(name), min.toString, median.toString, max.toString,
-      total.toString, StringUtils.reformatCSVString(metricType),


Updated format only for better readability.

cindyyuanjiang · 2024-12-18T07:15:51Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

-case class AccumProfileResults(appIndex: Int, stageId: Int, accMetaRef: AccumMetaRef,
-    min: Long, median: Long, max: Long, total: Long) extends ProfileResult {
-  override val outputHeaders = Seq("appIndex", "stageId", "accumulatorId", "name", "min",
-    "median", "max", "total")


Updated format only for better readability.

cindyyuanjiang · 2024-12-18T07:15:58Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala


  override def convertToSeq: Seq[String] = {
-    Seq(appIndex.toString, stageId.toString, accMetaRef.id.toString, accMetaRef.getName(),
-      min.toString, median.toString, max.toString, total.toString)


Updated format only for better readability.

cindyyuanjiang · 2024-12-18T07:16:03Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

  }

  override def convertToCSVSeq: Seq[String] = {
-    Seq(appIndex.toString, stageId.toString, accMetaRef.id.toString,
-      accMetaRef.name.csvValue, min.toString,
-      median.toString, max.toString, total.toString)


Updated format only for better readability.

amahussein · 2024-12-18T16:36:34Z

@nartal1 thanks! Yes, I ran the performance benchmark on a ~350MB zstd eventlog, the performance is less than 1% slower than the current dev branch.

Thanks @cindyyuanjiang for running performance evaluation.
A 350MB is a little bit small eventlog to start noticing an impact of the code change.
We need to look into the CPUTime of the aggregation methods before and after the code change.

amahussein

Thanks @cindyyuanjiang
While I was working on #1461, I found some potential performance optimizations in AggregateStage which will conflict with the current PR.

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

amahussein · 2024-12-18T16:28:03Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+            new AccumInfo(AccumMetaRef(0L, AccumNameRef("")))
+          )
+          // Compute the metric's statistics (min, median, max, sum) for the given stage
+          val accumInfoStatistics = getAccumInfoStatisticsInStage(accumInfo, stageTaskIds)


A tuple (Long, Long, Long, Long) is also an object.

amahussein · 2024-12-18T16:32:01Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+        sqlAccums.foreach { sqlAccum =>
+          val accumInfo = app.accumManager.accumInfoMap.getOrElse(
+            sqlAccum.accumulatorId,
+            new AccumInfo(AccumMetaRef(0L, AccumNameRef("")))


this could cause potential problems.
I made a comment about that in a different PR #1468 (comment)

thanks @amahussein , updated this. Also refactored usage of filterKeys in this PR.

cindyyuanjiang · 2024-12-18T21:46:52Z

@nartal1 thanks! Yes, I ran the performance benchmark on a ~350MB zstd eventlog, the performance is less than 1% slower than the current dev branch.

Thanks @cindyyuanjiang for running performance evaluation. A 350MB is a little bit small eventlog to start noticing an impact of the code change. We need to look into the CPUTime of the aggregation methods before and after the code change.

Thanks @amahussein! This is the eventlog you shared earlier with me for performance testing. Profiler tool takes about 1.2 hours to run on this eventlog on my local desktop. I will follow up with the aggregation methods CPU time offline.

Signed-off-by: cindyyuanjiang <[email protected]>

…when unnecessary Signed-off-by: cindyyuanjiang <[email protected]>

amahussein · 2024-12-19T15:53:30Z

@nartal1 thanks! Yes, I ran the performance benchmark on a ~350MB zstd eventlog, the performance is less than 1% slower than the current dev branch.

Thanks @cindyyuanjiang for running performance evaluation. A 350MB is a little bit small eventlog to start noticing an impact of the code change. We need to look into the CPUTime of the aggregation methods before and after the code change.

Thanks @amahussein! This is the eventlog you shared earlier with me for performance testing. Profiler tool takes about 1.2 hours to run on this eventlog on my local desktop. I will follow up with the aggregation methods CPU time offline.

The eventlog should not take that long even before we merged #1468

If you sync up with the latest dev, the eventlog should be processed within a single-digit number of minutes.
If you haven't synced up yet, then the eventlog should be processed within 16-30 minutes, depending on the -Xms and -Xmx flags you pass to the java command. This was mentioned in a previous issue: [BUG] Profiler takes 60 minutes to process and generate output on integration tests eventlog #1382 (comment)

We can re-iterate further offline.

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang · 2024-12-23T23:34:02Z

core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala

-        nanoToMilliSec(result.srFetchWaitTime.max), nanoToMilliSec(result.srFetchWaitTime.total),
-        nanoToMilliSec(result.swWriteTime.min), nanoToMilliSec(result.swWriteTime.median),
-        nanoToMilliSec(result.swWriteTime.max), nanoToMilliSec(result.swWriteTime.total),
-        result.gpuSemaphoreWait.total, result.nodeNames)


reformat for better readability.

cindyyuanjiang · 2024-12-23T23:35:04Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

@@ -74,16 +76,33 @@ case class JobInfoProfileResult(
    sqlID: Option[Long],
    startTime: Long,
    endTime: Option[Long]) extends ProfileResult {
-  override val outputHeaders = Seq("appIndex", "jobID", "stageIds", "sqlID", "startTime", "endTime")
+


updated format for better readability.

cindyyuanjiang · 2024-12-23T23:35:11Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

  override def convertToSeq: Seq[String] = {
    val stageIdStr = s"[${stageIds.mkString(",")}]"
-    Seq(appIndex.toString, jobID.toString, stageIdStr, sqlID.map(_.toString).getOrElse(null),
-      startTime.toString, endTime.map(_.toString).getOrElse(null))


updated format for better readability.

cindyyuanjiang · 2024-12-23T23:35:15Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

  override def convertToCSVSeq: Seq[String] = {
    val stageIdStr = s"[${stageIds.mkString(",")}]"
-    Seq(appIndex.toString, jobID.toString, StringUtils.reformatCSVString(stageIdStr),
-      sqlID.map(_.toString).getOrElse(null), startTime.toString,


updated format for better readability.

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang · 2024-12-24T00:18:21Z

New iteration changes:

- Moved the AnalysisUtils.scala file under the newly added util folder and renamed it to DiagnosticMetrics.scala
- Removed stageIds from the IODiagnosticMetricsMap keys because they are unnecessary (the same node ID always has the same stage IDs)
- Changed the metricNamesToKeyMap schema from "set[string] to string" to "string to string"
- Changed stageIds to Set[Int] in SQLAccumProfileResults to avoid unnecessary conversions

Ideas that need further discussion:

- How to deal with accumulators from driverAccumMap in generateIODiagnosticAccums
- The IODiagnosticMetricsMap key is currently a tuple (Long, Long). Is it necessary to change this to a composite map, or to create a dummy key, to save memory?

cc: @amahussein

Signed-off-by: cindyyuanjiang <[email protected]>

parthosa

Thanks @cindyyuanjiang. Few comments.

parthosa · 2024-12-31T00:36:51Z

core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumNameRef.scala

@@ -36,7 +36,7 @@ case class AccumNameRef(value: String) {
  // create a new CSV string even though they represent the same AccumulatorName.
  val csvValue: String = StringUtils.reformatCSVString(value)

-  def isDiagnosticMetrics(): Boolean = getAllDiagnosticMetrics.contains(value)
+  def isDiagnosticMetrics(): Boolean = allDiagnosticMetrics.contains(value)


nit: () can be removed for accessor like method
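
As a small illustration of this convention, a parameterless method without side effects is usually declared and called without parentheses in Scala. The names below (`AccessorSketch`, `NameRef`, and the metric names) are hypothetical:

```scala
object AccessorSketch {
  // Hypothetical metric names, used only for illustration.
  private val knownDiagnosticMetrics: Set[String] = Set("metricA", "metricB")

  final case class NameRef(value: String) {
    // Pure, parameterless accessor: declared and called without ().
    def isDiagnosticMetric: Boolean = knownDiagnosticMetrics.contains(value)
  }
}
```

A call site then reads like a field access, for example `AccessorSketch.NameRef("metricA").isDiagnosticMetric`.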

parthosa · 2024-12-31T00:40:19Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

@@ -245,7 +262,7 @@ case class SQLAccumProfileResults(
      max.toString,
      total.toString,
      metricType,
-      stageIds)
+      stageIds.mkString(","))


We are computing stageIds.mkString(",") twice in this case class. Can we avoid this redundant computation by storing it in this case class?
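
One way to avoid building the string twice is to store it in a private val on the case class. This is a sketch under assumptions: `ResultSketch` is a hypothetical, much smaller stand-in for SQLAccumProfileResults, the manual quoting stands in for the tool's CSV helper, and sorting the IDs is an added choice to make the output deterministic:

```scala
final case class ResultSketch(nodeName: String, stageIds: Set[Int]) {
  // Built once per instance and shared by both conversions.
  // Sorting is an assumption made here so the output order is stable.
  private val stageIdsStr: String = stageIds.toSeq.sorted.mkString(",")

  def convertToSeq: Seq[String] = Seq(nodeName, stageIdsStr)

  // Quoting is done inline here; the real code would use its own CSV helper.
  def convertToCSVSeq: Seq[String] =
    Seq("\"" + nodeName + "\"", "\"" + stageIdsStr + "\"")
}
```

A `lazy val` would work as well if many instances are never converted to output rows.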

parthosa · 2024-12-31T00:49:05Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/util/DiagnosticMetrics.scala

+  /**
+   * Check if a metric name belongs to IO diagnostic metrics
+   */
+  def isIODiagnosticMetricName(metric: String): Boolean = {


I saw a similar method, isDiagnosticMetrics(), in class AccumNameRef, whereas isIODiagnosticMetricName() is part of object IOAccumDiagnosticMetrics.

Can we improve the consistency (maybe include isDiagnosticMetrics in its object StageAccumDiagnosticMetrics)?

spark-rapids-tools/core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumNameRef.scala

Lines 37 to 39 in cc80b48

val csvValue: String = StringUtils.reformatCSVString(value)

def isDiagnosticMetrics(): Boolean = allDiagnosticMetrics.contains(value)

parthosa · 2024-12-31T00:51:09Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+            }
+
+          // Compute the metric's statistics and store the results if available
+          metricStats match {


nit: Can we use .map() instead of match{} for a more functional approach?

parthosa · 2024-12-31T01:00:06Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+          // TODO: check if accumulator ID is in driverAccumMap, currently skipped
+          val accumInfo = app.accumManager.accumInfoMap.get(sqlAccum.accumulatorId)
+
+          val metricStats: Option[StatisticsMetrics] =


nit: Can we use accumInfoOpt.flatMap() for a functional approach?
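
The two nits above (using .map() and .flatMap() rather than match) can be sketched together as a chain of Option combinators. `AccumSketch` and `totalInStage` are hypothetical stand-ins, and the computed statistic is reduced to a sum to keep the example short:

```scala
object OptionChainSketch {
  // Hypothetical stand-in for an accumulator's per-task updates.
  final case class AccumSketch(taskUpdates: Map[Long, Long])

  // Looks up the accumulator and computes a value only when both the
  // accumulator and at least one task update for the stage exist.
  def totalInStage(
      accums: Map[Long, AccumSketch],
      accumId: Long,
      stageTaskIds: Set[Long]): Option[Long] =
    accums.get(accumId).flatMap { accum =>
      val values = stageTaskIds.toSeq.flatMap(id => accum.taskUpdates.get(id))
      if (values.isEmpty) None else Some(values.sum)
    }
}
```

A caller that only needs a side effect in the defined case can follow with `.foreach { total => ... }` instead of a match with an empty `case None` branch.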

parthosa · 2024-12-31T01:00:33Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+    stageTaskIds.collect {
+      case taskId if accumInfo.taskUpdatesMap.contains(taskId) =>
+        accumInfo.taskUpdatesMap(taskId)
+    }(breakOut)


Good use of breakOut here
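
Note that breakOut exists only in Scala 2.12 and earlier; it was removed in Scala 2.13. A version-neutral sketch of the same idea, collecting from a Set directly into a non-Set collection so that no intermediate Set is built and equal values are not deduplicated, could look like the following. The target type Array[Long] and the names are assumptions, since the target collection in the PR is not shown here:

```scala
object CollectSketch {
  // Gathers the update values for the stage's tasks without first building
  // an intermediate Set of values (which would drop duplicate values).
  def valuesForTasks(
      stageTaskIds: Set[Long],
      taskUpdates: Map[Long, Long]): Array[Long] =
    stageTaskIds.iterator.collect {
      case taskId if taskUpdates.contains(taskId) => taskUpdates(taskId)
    }.toArray
}
```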

cindyyuanjiang added 8 commits November 14, 2024 18:00

diagnostic view 2 start

5d36743

Signed-off-by: cindyyuanjiang <[email protected]>

Merge branch 'dev' into diagnostic-view-2

5a62749

update IODiagnosticProfileResult

6978f79

Signed-off-by: cindyyuanjiang <[email protected]>

working implementation of view 2

e0fe692

Signed-off-by: cindyyuanjiang <[email protected]>

add node name in view 2 with some print statements

b2a8c0e

Signed-off-by: cindyyuanjiang <[email protected]>

fixe unit tests

e0d6a0c

Signed-off-by: cindyyuanjiang <[email protected]>

Merge branch 'dev' into diagnostic-view-2

03087c0

fix merge conflict

c0e44b2

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang self-assigned this Dec 6, 2024

add unit test and clean up

66c9762

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang changed the title ~~WIP: [FEA] Add IO diagnostic output for GPU slowness in Profiler tool~~ [FEA] Add IO diagnostic output for GPU slowness in Profiler tool Dec 6, 2024

cindyyuanjiang marked this pull request as ready for review December 6, 2024 23:42

cindyyuanjiang requested review from amahussein, nartal1 and parthosa and removed request for amahussein December 6, 2024 23:42

cindyyuanjiang added the feature request, core_tools, and affect-output labels Dec 6, 2024

cindyyuanjiang requested a review from kuhushukla December 6, 2024 23:42

nartal1 reviewed Dec 9, 2024

parthosa reviewed Dec 9, 2024

cindyyuanjiang added 2 commits December 9, 2024 17:12

address review feedback

45712e2

Signed-off-by: cindyyuanjiang <[email protected]>

add comments and rename variables/functions

3786fd9

Signed-off-by: cindyyuanjiang <[email protected]>

parthosa reviewed Dec 10, 2024

amahussein reviewed Dec 11, 2024

cindyyuanjiang added 3 commits December 12, 2024 14:19

Merge branch 'dev' into diagnostic-view-2

f2131ee

address review feedback

5dabc28

Signed-off-by: cindyyuanjiang <[email protected]>

new output batches name

f668dda

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang commented Dec 18, 2024

amahussein reviewed Dec 18, 2024

cindyyuanjiang added 3 commits December 18, 2024 14:50

remove function getAccumInfoStatisticsInStage

0bd1e7f

Signed-off-by: cindyyuanjiang <[email protected]>

Merge branch 'dev' into diagnostic-view-2

f8346dd

refactor due to new optimizations from dev

550406c

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang requested review from parthosa, amahussein and nartal1 December 19, 2024 00:46

add back getStageTaskIds to avoid computing stage ids multiple times …

6d0d798

…when unnecessary Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang added 5 commits December 19, 2024 18:11

Merge branch 'dev' into diagnostic-view-2

1ae6143

merged dev

be428e9

Signed-off-by: cindyyuanjiang <[email protected]>

optimized computation of metric stats

e882518

Signed-off-by: cindyyuanjiang <[email protected]>

change metricNamesToKeyMap from string to string

735546a

Signed-off-by: cindyyuanjiang <[email protected]>

move AnalysisUtil file

43f5c84

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang commented Dec 23, 2024

cindyyuanjiang added 2 commits December 23, 2024 16:01

minor style updates

e8c2814

Signed-off-by: cindyyuanjiang <[email protected]>

expected file update

b97e2db

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang added 2 commits December 23, 2024 16:38

minor updates between seq conversion

631b9c0

Signed-off-by: cindyyuanjiang <[email protected]>

comment on empty stage ids

cc80b48

Signed-off-by: cindyyuanjiang <[email protected]>

parthosa reviewed Dec 31, 2024
