-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool #1375
[FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool #1375
Conversation
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
On a second thought, we should combine the two tables shown as an example here. My original intent was to keep the first view simple but the latter table is not too bad for that |
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
swWriteTimeMax, | ||
swWriteTimeSum, | ||
gpuSemaphoreWaitSum, | ||
nodeNames) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make an encapsulating object for this so that we dont have large arg list as well as a single place to hold the metrics we care about -- easier to update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kuhushukla! Can you elaborate a bit more on this? I thought StageDiagnosticResult
is the encapsulating object. It has similar structure as other profiler results, for example - https://github.com/NVIDIA/spark-rapids-tools/blob/dev/core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala#L417.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The arg list is very large and that on its own would be nice to abstract away in a case class etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @kuhushukla! I experimented with a few things like encapsulating part of the arg list into a separate case class, but overall I think this presentation has the best readability. It also aligns with other classes in this file and current unit tests. We can discuss more offline if there is something else we should try.
/** | ||
* Given an input iterable, returns its min, median, max and sum. | ||
*/ | ||
def getStatistics(arr: Iterable[Long]): (Long, Long, Long, Long) = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? I thought this information is available and can be simply pulled? Please correct me if I am wrong -- for eg, in the existing profiler o/p where does the median value come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the implementation to reuse/pull existing metrics results from ProfStageMetricView
. I cannot do this for shuffle read total bytes
because in ProfStageMetricView
there are 2 metrics associated with this: internal.metrics.shuffle.read.localBytesRead
and internal.metrics.shuffle.read.remoteBytesRead
. I cannot get the min/med/max of shuffle read total bytes
by adding the min/med/max of the 2 metrics. I am keeping this function for now, but if it looks too unnecessary I can remove it.
@@ -608,6 +572,183 @@ case class StageAggTaskMetricsProfileResult( | |||
override def idHeader = "stageId" | |||
} | |||
|
|||
case class StageDiagnosticMetricsProfileResult( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: rename to a smaller string : DiagnosticResult
, StageDiagnosticResult
etc. are some options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Short and concise are good as long as it doesn't lose context. For instance I think DiagnosticResult may be to generic as it doesn't tell you what its applied or diagnostic of. so I would lean towards StageDiagnosticResult
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, changed to StageDiagnosticResult
...sources/ProfilingExpectations/rapids_join_eventlog_stagediagnosticmetricsagg_expectation.csv
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala
Outdated
Show resolved
Hide resolved
is the example output real here? The first stage that took the longest has no input data but has a Scan with it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cindyyuanjiang. Minor nits and questions.
core/src/main/scala/com/nvidia/spark/rapids/tool/views/package.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAggTrait.scala
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAggTrait.scala
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala
Show resolved
Hide resolved
Signed-off-by: cindyyuanjiang <[email protected]>
thanks @tgravescs! This is the example output from existing testing event log: https://github.com/NVIDIA/spark-rapids-tools/blob/dev/core/src/test/resources/spark-events-profiling/rapids_join_eventlog.zstd. |
Signed-off-by: cindyyuanjiang <[email protected]>
@@ -47,7 +47,8 @@ case class ApplicationSummaryInfo( | |||
ioMetrics: Seq[IOAnalysisProfileResult], | |||
sysProps: Seq[RapidsPropertyProfileResult], | |||
sqlCleanedAlignedIds: Seq[SQLCleanAndAlignIdsProfileResult], | |||
sparkRapidsBuildInfo: Seq[SparkRapidsBuildInfoEvent]) | |||
sparkRapidsBuildInfo: Seq[SparkRapidsBuildInfoEvent], | |||
stageDiagnostics: Seq[StageDiagnosticResult]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that we can generate those records for the sake of generating the report, and we do not have to store them in the ApplicationSummaryInfo.
ApplicationSummaryInfo fields are the ones that provide some information about the application and then it can be consumed by other modules inside scala (i.e., AutoTuner).
core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala
Show resolved
Hide resolved
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cindyyuanjiang
I tried to run this branch as qualification CMd and I do not see diagnostics CSV file generated in the raw_metrics.
I remember that you mentioned that the diagnsotics output will be generated from Qualification as well. Is there a change in the requirement?
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @amahussein! Yes, I updated the PR, we do want diagnostic output in Qualification as well. PTAL. |
Signed-off-by: cindyyuanjiang <[email protected]>
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala
Show resolved
Hide resolved
Signed-off-by: cindyyuanjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a benchmarkSuite for the Profiler similar to what we have for SingleThreadedQualToolBenchmark
?
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala
Outdated
Show resolved
Hide resolved
Thanks @amahussein. I have a local copy of Profile Benchmark class, do we want to include that in this PR as well? I am planning to run some Profiler benchmarks later based on our offline discussion. |
Thanks @cindyyuanjiang ! |
Signed-off-by: cindyyuanjiang <[email protected]>
Thank you @amahussein! I added the Profile benchmark class in this PR. |
// Currently the input arguments are assumed to be common across cases | ||
// This will be improved in a follow up PR to enable passing as a config | ||
// file with argument support for different cases | ||
runBenchmark("Benchmark_Per_SQL_Arg_Profiling") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no "PER_SQL" argument for Profiling. that prefix was used in the qualification's benchmark because we were running the benchmark with the perSql argument enabled/disabled.
Suggestion is:
- to enable CSV and call it something like Benchmark_Profiling_CSV
- if this is supposed to run a single thread, then the number of threads should be specified. Otherwise, the benchmark will have non-deterministic behavior for multiple eventlogs.
Signed-off-by: cindyyuanjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTME.
Thanks @cindyyuanjiang
Contributes to #1374
Changes
stage_level_diagnostic_metrics.csv
AggRawMetricsResult
AppSQLPlanAnalyzer
to store mappings between stage IDs to node names and GPU semaphore wait timecore/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala
to verify diagnostic csv file outputcore/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala
Testing CMD
spark_rapids profiling -e <my_event_log> -t <my_tools_jar>
Output Example