
[BUG] Profiler output shows inconsistent shuffleWriteTime results #1408

Open
cindyyuanjiang opened this issue Oct 31, 2024 · 1 comment · May be fixed by #1450
Labels
bug Something isn't working core_tools Scope the core module (scala)

Comments

@cindyyuanjiang
Collaborator

Describe the bug
Profiler output shows inconsistent shuffle write time results in profiler.log.

Under Stage Level All Metrics, the total is 93,193,331 ns (~93 ms).

+--------+-------+-------------+---------------------------------------------------+---------+---------+---------+----------+
|appIndex|stageId|accumulatorId|name                                               |min      |median   |max      |total     |
+--------+-------+-------------+---------------------------------------------------+---------+---------+---------+----------+
|1       |2      |201          |internal.metrics.shuffle.write.writeTime           |139875   |230038   |9747416  |93193331  |
+--------+-------+-------------+---------------------------------------------------+---------+---------+---------+----------+

Under Stage level aggregated task metrics, the total (sw_writeTime_sum) is 42 ms.

+--------+-------+--------+--------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+
|appIndex|stageId|numTasks|Duration|diskBytesSpilled_sum|duration_sum|duration_max|duration_min|duration_avg|executorCPUTime_sum|executorDeserializeCPUTime_sum|executorDeserializeTime_sum|executorRunTime_sum|input_bytesRead_sum|input_recordsRead_sum|jvmGCTime_sum|memoryBytesSpilled_sum|output_bytesWritten_sum|output_recordsWritten_sum|peakExecutionMemory_max|resultSerializationTime_sum|resultSize_max|sr_fetchWaitTime_sum|sr_localBlocksFetched_sum|sr_localBytesRead_sum|sr_remoteBlocksFetched_sum|sr_remoteBytesRead_sum|sr_remoteBytesReadToDisk_sum|sr_totalBytesRead_sum|sw_bytesWritten_sum|sw_recordsWritten_sum|sw_writeTime_sum|
+--------+-------+--------+--------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+
|1       |2      |200     |688     |0                   |7705        |237         |10          |38.5        |1556               |940                           |1474                       |5337               |0                  |0                    |0            |0                     |0                      |0                        |0                      |0                          |7359          |0                   |2400                     |80264508             |0                         |0                     |0                           |80264508             |15400              |200                  |42              |

We suspect this is due to precision loss when converting nanoseconds to milliseconds. Stage Level All Metrics reports results in nanoseconds, while Stage level aggregated task metrics reports them in milliseconds. Most tasks in this stage have a shuffle write time under 1,000,000 ns (1 ms), which truncates to 0 ms per task before the per-task values are summed.
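The truncation effect can be reproduced in isolation. The sketch below uses a hypothetical distribution of per-task write times (199 fast tasks plus one slow one, loosely modeled on the min/max values in the table above; the actual event log values differ), and compares converting each task to ms before summing against summing in ns and converting once:

```java
import java.util.concurrent.TimeUnit;

public class PrecisionLossDemo {
    public static void main(String[] args) {
        // Hypothetical per-task shuffle write times in nanoseconds:
        // 199 sub-millisecond tasks plus one slow task (the max seen above).
        long[] writeTimesNs = new long[200];
        for (int i = 0; i < 199; i++) writeTimesNs[i] = 400_000L; // < 1 ms each
        writeTimesNs[199] = 9_747_416L;

        long sumOfMs = 0, sumNs = 0;
        for (long ns : writeTimesNs) {
            // Per-task conversion truncates every sub-millisecond value to 0.
            sumOfMs += TimeUnit.NANOSECONDS.toMillis(ns);
            sumNs += ns;
        }
        // Summing in ns first keeps the sub-millisecond contributions.
        long msOfSum = TimeUnit.NANOSECONDS.toMillis(sumNs);

        System.out.println("sum of per-task ms: " + sumOfMs); // 9
        System.out.println("ms of ns sum:       " + msOfSum); // 89
    }
}
```

With these numbers, per-task conversion only counts the one slow task (9 ms), while converting the nanosecond sum yields 89 ms — the same kind of gap seen between the two profiler views.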

Steps/Code to reproduce bug
spark_rapids profiling -v -e <my_tools_repo>/core/src/test/resources/spark-events-profiling/rapids_join_eventlog.zstd

@cindyyuanjiang cindyyuanjiang added ? - Needs Triage bug Something isn't working labels Oct 31, 2024
@amahussein amahussein added the core_tools Scope the core module (scala) label Nov 1, 2024
@amahussein amahussein changed the title [BUG] Profiler output shows inconsistent shuffle write write time results [BUG] Profiler output shows inconsistent shuffleWriteTime results Nov 1, 2024
@cindyyuanjiang
Collaborator Author

Found the nanoseconds-to-milliseconds conversion in the TaskModel construction:

TimeUnit.NANOSECONDS.toMillis(event.taskMetrics.shuffleWriteMetrics.writeTime),
