
[BUG] Investigate long execution time of eventlogs #1461

Closed
3 tasks done
Tracked by #367
amahussein opened this issue Dec 13, 2024 · 2 comments · Fixed by #1474
Assignees
Labels
bug Something isn't working core_tools Scope the core module (scala)

Comments

@amahussein
Collaborator

amahussein commented Dec 13, 2024

Describe the bug

After addressing the memory configuration in #1382, we need to narrow down where the core-tools spend most of their time analyzing the eventlogs.

Analysis result for an eventlog that takes up to 28 minutes:

JVM args:

-Xms32G -Xmx64G -XX:+UseG1GC

total CPU: 1,695,427 ms
total time: 11,919,747 ms
total allocation: 6.14 TB

Methods with the highest resource utilization (ProfileMain_2024_12_13_162335.zip):

  • getAggRawMetrics: 1,396,931 ms (82% of all) | 5.78 TB (94% of all)
    • aggregateSparkMetricsByJob: 498,101 ms (36% of parent, 29% of all) | 2.06 TB (36% of parent, 34% of all).
    • aggregateSparkMetricsBySql: 448,030 ms (32% of parent, 27% of all) | 1.86 TB (32% of parent, 30% of all).
      • It appears we call this method twice by mistake. Fixing that would reclaim about 27% of the CPU time; in total, the two calls take 898,580 ms (53% of all) | 1.86 TB (32% of parent, 30% of all).
      • In addition, a single line inside the function takes 492,010 ms (29% of all):
        val cachedResBySQL = stageLevelSparkMetrics(index).filterKeys(stagesInSQL.contains).value
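The cost profile of that line is consistent with `filterKeys` testing membership for every entry of the stage-level map. A hypothetical alternative, with illustrative names and types (not the actual core-tools signatures), is to probe the map once per SQL stage id instead:

```scala
// Hypothetical sketch, not the actual core-tools code: build the
// per-SQL result by probing the stage-metrics map once per stage id
// in stagesInSQL, instead of filterKeys scanning every map entry.
object FilterSketch {
  def cachedResBySql(
      stageLevelMetrics: Map[Int, Long],
      stagesInSQL: Set[Int]): Map[Int, Long] = {
    stagesInSQL.iterator
      .flatMap(id => stageLevelMetrics.get(id).map(id -> _))
      .toMap
  }
}
```

With tens of thousands of stages but only a few stages per SQL, this does |stagesInSQL| hash probes per call rather than |stages| membership tests.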
        

Tasks

  1. bug core_tools (amahussein)
  2. core_tools (amahussein)
  3. core_tools (amahussein)
@amahussein amahussein added ? - Needs Triage bug Something isn't working core_tools Scope the core module (scala) labels Dec 13, 2024
@amahussein amahussein self-assigned this Dec 13, 2024
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Dec 13, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#1461

AppSparkMetricsAnalyzer was calling `aggregateSparkMetricsBySql` twice.
This code change eliminates this redundancy to save CPU time and memory
allocations.
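The redundancy fix can be sketched as a memoized aggregation. This is a minimal sketch under assumed names and data shape; only the method name `aggregateSparkMetricsBySql` comes from the issue, and `SqlMetricsAggregator` is not the real AppSparkMetricsAnalyzer structure:

```scala
// Hypothetical sketch of the fix: memoize the per-SQL aggregation so
// repeated callers reuse the first result instead of recomputing it.
class SqlMetricsAggregator(taskRows: Seq[(Long, Long)]) { // (sqlId, durationMs)
  var computeCount = 0 // exposed only to demonstrate single evaluation

  // lazy val guarantees the expensive aggregation runs at most once
  private lazy val aggBySql: Map[Long, Long] = {
    computeCount += 1
    taskRows.groupBy(_._1).map { case (sqlId, rows) =>
      sqlId -> rows.map(_._2).sum
    }
  }

  def aggregateSparkMetricsBySql(): Map[Long, Long] = aggBySql
}
```

Calling `aggregateSparkMetricsBySql()` twice now performs the aggregation once, which is what reclaims the duplicated ~27% of CPU time reported above.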
@cindyyuanjiang
Collaborator

cindyyuanjiang commented Dec 13, 2024

Thanks @amahussein for investigating this!

QQ: `aggregateSparkMetricsBySql` takes 448,030 ms, but the line
`val cachedResBySQL = stageLevelSparkMetrics(index).filterKeys(stagesInSQL.contains).value`
inside it takes 492,010 ms. How can the line take more time than the function call itself?

@amahussein
Collaborator Author

> Thanks @amahussein for investigating this!
>
> QQ: `aggregateSparkMetricsBySql` takes 448,030 ms, but the line `val cachedResBySQL = stageLevelSparkMetrics(index).filterKeys(stagesInSQL.contains).value` inside it takes 492,010 ms. How can the line take more time than the function call itself?

Good question!
The 492,010 ms is the total cost of that line summed across both invocations, since we were calling the method twice.
It implies that this line accounts for roughly half of the method's cost.
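The arithmetic behind that answer, using only the numbers from the thread:

```scala
// Sanity-check of the thread's numbers: the profiler reports the
// line's cost summed over both calls of the method.
object CostCheck {
  val lineTotalMs = 492010L     // line cost across both invocations
  val methodPerCallMs = 448030L // one invocation of aggregateSparkMetricsBySql
  val perCallLineMs = lineTotalMs / 2                          // 246,005 ms
  val shareOfMethod = perCallLineMs.toDouble / methodPerCallMs // ~0.55
}
```

Per invocation the line costs about 246,005 ms, i.e. roughly half of each 448,030 ms call, which is why its two-call total exceeds a single call's time.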

amahussein added a commit that referenced this issue Dec 16, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to #1461

AppSparkMetricsAnalyzer was calling `aggregateSparkMetricsBySql` twice.
This code change eliminates this redundancy to save CPU time and memory
allocations.
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Dec 16, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#1461

This commit improves the implementation of aggregation across raw
metrics by replacing the built-in Scala collections with accumulators.
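The accumulator approach can be sketched as follows; the class and field names are illustrative, not the actual core-tools types:

```scala
// Hypothetical sketch: fold task durations into mutable fields in a
// single pass, instead of building intermediate collections with
// groupBy/map/sum pipelines that allocate heavily on large eventlogs.
final class TaskMetricsAccumulator {
  private var count = 0L
  private var sum = 0L
  private var max = Long.MinValue

  def add(durationMs: Long): Unit = {
    count += 1
    sum += durationMs
    max = math.max(max, durationMs)
  }

  def taskCount: Long = count
  def durationSum: Long = sum
  def durationMax: Long = max
}
```

Each task event updates a few longs in place, so aggregating n tasks is O(n) time with O(1) extra allocation per accumulator.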
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Dec 17, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#1461

This commit improves the implementation of aggregation across raw
metrics by replacing the built-in Scala collections with accumulators.
amahussein added a commit that referenced this issue Dec 18, 2024
* Optimize implementation of getAggregateRawMetrics in core-tools
* address reviews and fix issues in aggregateDiagnostic

Contributes to #1461

This commit improves the implementation of aggregation across raw
metrics by replacing the built-in Scala collections with accumulators.

---------

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Dec 18, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#1461

Adds in-place median finding to improve the performance of the metric
aggregates.
We used to sort a sequence to create StatisticsMetrics, which turned out
to be very expensive for large eventlogs.
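The in-place median idea can be sketched with a standard quickselect; this is a generic illustration of the technique, not the commit's actual code:

```scala
import scala.util.Random

// Hypothetical sketch of in-place median finding (quickselect) to
// replace a full sort when building StatisticsMetrics-style summaries:
// average O(n) instead of O(n log n), with no sorted copy allocated.
object InPlaceMedian {
  // Returns the upper median for even-length input.
  def median(arr: Array[Long]): Long = select(arr, arr.length / 2)

  // Returns the k-th smallest element, partially reordering arr in place.
  private def select(arr: Array[Long], k: Int): Long = {
    var lo = 0
    var hi = arr.length - 1
    while (lo < hi) {
      val p = partition(arr, lo, hi)
      if (p == k) return arr(p)
      else if (p < k) lo = p + 1
      else hi = p - 1
    }
    arr(k)
  }

  private def partition(arr: Array[Long], lo: Int, hi: Int): Int = {
    // Random pivot avoids the sorted-input worst case.
    swap(arr, lo + Random.nextInt(hi - lo + 1), hi)
    val pivot = arr(hi)
    var store = lo
    var i = lo
    while (i < hi) {
      if (arr(i) < pivot) { swap(arr, i, store); store += 1 }
      i += 1
    }
    swap(arr, store, hi)
    store
  }

  private def swap(arr: Array[Long], i: Int, j: Int): Unit = {
    val t = arr(i); arr(i) = arr(j); arr(j) = t
  }
}
```

Because the array is only partially reordered rather than fully sorted, this also touches less memory, which matters at the multi-TB allocation rates reported above.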
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Dec 19, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Fixes NVIDIA#1461

Adds in-place median finding to improve the performance of the metric
aggregates.
We used to sort a sequence to create StatisticsMetrics, which turned out
to be very expensive for large eventlogs.

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>