
[BUG] Investigate long execution time of eventlogs #1461

Closed
3 tasks done
Tracked by #367
amahussein opened this issue Dec 13, 2024 · 2 comments · Fixed by #1474
Assignees
Labels
bug Something isn't working core_tools Scope the core module (scala)

Comments

@amahussein
Collaborator

amahussein commented Dec 13, 2024

Describe the bug

After addressing the memory configuration in #1382, we need to narrow down where the core-tools spend most of their time analyzing the eventlogs.

Analysis result for an eventlog that takes up to 28 minutes:

JVM args:

-Xms32G -Xmx64G -XX:+UseG1GC

total CPU: 1,695,427 ms
total time: 11,919,747 ms
total allocation: 6.14 TB

Methods with the highest resource utilization (ProfileMain_2024_12_13_162335.zip):

  • getAggRawMetrics: 1,396,931 ms (82% of all) | 5.78 TB (94% of all)
    • aggregateSparkMetricsByJob: 498,101 ms (36% of parent, 29% of all) | 2.06 TB (36% of parent, 34% of all).
    • aggregateSparkMetricsBySql: 448,030 ms (32% of parent, 27% of all) | 1.86 TB (32% of parent, 30% of all).
      • It appears we call this method twice by mistake. Fixing that would reclaim about 27% of the CPU time; in total, the two calls take 898,580 ms (53% of all) | 1.86 TB (32% of parent, 30% of all).
      • In addition, a single line inside the function takes 492,010 ms (29% of all):
        val cachedResBySQL = stageLevelSparkMetrics(index).filterKeys(stagesInSQL.contains).value
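The cost profile of that line is consistent with `filterKeys` testing membership for every entry of the stage-level map. A hypothetical alternative, with illustrative names and types (not the actual core-tools signatures), is to probe the map once per SQL stage id instead:

```scala
// Hypothetical sketch, not the actual core-tools code: build the
// per-SQL result by probing the stage-metrics map once per stage id
// in stagesInSQL, instead of filterKeys scanning every map entry.
object FilterSketch {
  def cachedResBySql(
      stageLevelMetrics: Map[Int, Long],
      stagesInSQL: Set[Int]): Map[Int, Long] = {
    stagesInSQL.iterator
      .flatMap(id => stageLevelMetrics.get(id).map(id -> _))
      .toMap
  }
}
```

With tens of thousands of stages but only a few stages per SQL, this does |stagesInSQL| hash probes per call rather than |stages| membership tests.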
        

Tasks

  1. bug core_tools (amahussein)
  2. core_tools (amahussein)
  3. core_tools (amahussein)
@amahussein amahussein added ? - Needs Triage bug Something isn't working core_tools Scope the core module (scala) labels Dec 13, 2024
@amahussein amahussein self-assigned this Dec 13, 2024
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Dec 13, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#1461

AppSparkMetricsAnalyzer was calling `aggregateSparkMetricsBySql` twice.
This code change eliminates this redundancy to save CPU time and memory
allocations.
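The redundancy fix can be sketched as a memoized aggregation. This is a minimal sketch under assumed names and data shape; only the method name `aggregateSparkMetricsBySql` comes from the issue, and `SqlMetricsAggregator` is not the real AppSparkMetricsAnalyzer structure:

```scala
// Hypothetical sketch of the fix: memoize the per-SQL aggregation so
// repeated callers reuse the first result instead of recomputing it.
class SqlMetricsAggregator(taskRows: Seq[(Long, Long)]) { // (sqlId, durationMs)
  var computeCount = 0 // exposed only to demonstrate single evaluation

  // lazy val guarantees the expensive aggregation runs at most once
  private lazy val aggBySql: Map[Long, Long] = {
    computeCount += 1
    taskRows.groupBy(_._1).map { case (sqlId, rows) =>
      sqlId -> rows.map(_._2).sum
    }
  }

  def aggregateSparkMetricsBySql(): Map[Long, Long] = aggBySql
}
```

Calling `aggregateSparkMetricsBySql()` twice now performs the aggregation once, which is what reclaims the duplicated ~27% of CPU time reported above.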
@cindyyuanjiang
Collaborator

cindyyuanjiang commented Dec 13, 2024

Thanks @amahussein for investigating this!

QQ: `aggregateSparkMetricsBySql` takes 448,030 ms, but the line
`val cachedResBySQL = stageLevelSparkMetrics(index).filterKeys(stagesInSQL.contains).value`
inside it takes 492,010 ms. How can the line take more time than the function call itself?

@amahussein
Collaborator Author

> Thanks @amahussein for investigating this!
>
> QQ: `aggregateSparkMetricsBySql` takes 448,030 ms, but the line `val cachedResBySQL = stageLevelSparkMetrics(index).filterKeys(stagesInSQL.contains).value` inside it takes 492,010 ms. How can the line take more time than the function call itself?

Good question!
The 492,010 ms is the total cost of that line summed across both invocations, since we were calling the method twice.
It implies that this line accounts for roughly half of the method's cost.
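The arithmetic behind that answer, using only the numbers from the thread:

```scala
// Sanity-check of the thread's numbers: the profiler reports the
// line's cost summed over both calls of the method.
object CostCheck {
  val lineTotalMs = 492010L     // line cost across both invocations
  val methodPerCallMs = 448030L // one invocation of aggregateSparkMetricsBySql
  val perCallLineMs = lineTotalMs / 2                          // 246,005 ms
  val shareOfMethod = perCallLineMs.toDouble / methodPerCallMs // ~0.55
}
```

Per invocation the line costs about 246,005 ms, i.e. roughly half of each 448,030 ms call, which is why its two-call total exceeds a single call's time.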

amahussein added a commit that referenced this issue Dec 16, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to #1461

AppSparkMetricsAnalyzer was calling `aggregateSparkMetricsBySql` twice.
This code change eliminates this redundancy to save CPU time and memory
allocations.
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Dec 16, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#1461

This commit improves the implementation of aggregation across raw
metrics by replacing the built-in Scala collections with accumulators.
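The accumulator approach can be sketched as follows; the class and field names are illustrative, not the actual core-tools types:

```scala
// Hypothetical sketch: fold task durations into mutable fields in a
// single pass, instead of building intermediate collections with
// groupBy/map/sum pipelines that allocate heavily on large eventlogs.
final class TaskMetricsAccumulator {
  private var count = 0L
  private var sum = 0L
  private var max = Long.MinValue

  def add(durationMs: Long): Unit = {
    count += 1
    sum += durationMs
    max = math.max(max, durationMs)
  }

  def taskCount: Long = count
  def durationSum: Long = sum
  def durationMax: Long = max
}
```

Each task event updates a few longs in place, so aggregating n tasks is O(n) time with O(1) extra allocation per accumulator.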
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Dec 17, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#1461

This commit improves the implementation of aggregation across raw
metrics by replacing the built-in Scala collections with accumulators.
amahussein added a commit that referenced this issue Dec 18, 2024
* Optimize implementation of getAggregateRawMetrics in core-tools
* address reviews and fix issues in aggregateDiagnostic

Contributes to #1461

This commit improves the implementation of aggregation across raw
metrics by replacing the built-in Scala collections with accumulators.

---------

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Dec 18, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#1461

Adds in-place median finding to improve the performance of the metric
aggregates.
We used to sort a sequence to create StatisticsMetrics, which turned out
to be very expensive for large eventlogs.
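The in-place median idea can be sketched with a standard quickselect; this is a generic illustration of the technique, not the commit's actual code:

```scala
import scala.util.Random

// Hypothetical sketch of in-place median finding (quickselect) to
// replace a full sort when building StatisticsMetrics-style summaries:
// average O(n) instead of O(n log n), with no sorted copy allocated.
object InPlaceMedian {
  // Returns the upper median for even-length input.
  def median(arr: Array[Long]): Long = select(arr, arr.length / 2)

  // Returns the k-th smallest element, partially reordering arr in place.
  private def select(arr: Array[Long], k: Int): Long = {
    var lo = 0
    var hi = arr.length - 1
    while (lo < hi) {
      val p = partition(arr, lo, hi)
      if (p == k) return arr(p)
      else if (p < k) lo = p + 1
      else hi = p - 1
    }
    arr(k)
  }

  private def partition(arr: Array[Long], lo: Int, hi: Int): Int = {
    // Random pivot avoids the sorted-input worst case.
    swap(arr, lo + Random.nextInt(hi - lo + 1), hi)
    val pivot = arr(hi)
    var store = lo
    var i = lo
    while (i < hi) {
      if (arr(i) < pivot) { swap(arr, i, store); store += 1 }
      i += 1
    }
    swap(arr, store, hi)
    store
  }

  private def swap(arr: Array[Long], i: Int, j: Int): Unit = {
    val t = arr(i); arr(i) = arr(j); arr(j) = t
  }
}
```

Because the array is only partially reordered rather than fully sorted, this also touches less memory, which matters at the multi-TB allocation rates reported above.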
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Dec 19, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Fixes NVIDIA#1461

Adds in-place median finding to improve the performance of the metric
aggregates.
We used to sort a sequence to create StatisticsMetrics, which turned out
to be very expensive for large eventlogs.

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>