-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Qualification tool: Add output stats file for Execs(operators) #1225
Conversation
Signed-off-by: Niranjan Artal <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nartal1 for this feature.
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nartal1 !
Do we create this as a separate tool similar to what we did for "Predict"?
If we treat it as independent tool, then it can have its own yaml file and it can be easier to configure and extend without affecting the qualification cmd.
We can also allow this post-phase processing to be enabled/disabled.
It will be nice if we can get an initial quick feedback from @viadea to incorporate his input in the tuning of this PR.
Discussed with @viadea offline. His feedback is that having it as a part of Qualification tool should be sufficient so that the stats are produced along with the qual tool results. Asking users to run CLI just for generating the statistics file (given the qual tool is run) wouldn't be a good experience. |
Yes, I meant that it runs as part of the Qualification tool (enabled by default). One last thing. How to keep compatibility? for example, it is possible that the stats is scanning a folder that was generated by older tools version. in that case, it might crash. |
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Niranjan Artal <[email protected]>
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Niranjan Artal <[email protected]>
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
Thanks @parthosa for the review. I have addressed all the review comments. PTAL. |
Signed-off-by: Niranjan Artal <[email protected]>
user_tools/src/spark_rapids_pytools/rapids/qualification_stats.py
Outdated
Show resolved
Hide resolved
Stage durations here seem like they would be very hard to get accurate. I think we should think about just doing task time and potentially % total task time per sql |
Looking into this if we can get % total task time per sql |
@tgravescs - Updated the PR to report % total task time. |
can you update sample output in description? |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nartal1. The final output columns are quite meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nartal1 for this feature!
Can you please describe each of these columns:
I want to make sure the columns are clear to the user. from just reading it I would have expected the total SQL duration to just be the wall clock time but from previous discussions that isn't what we were talking about so want to make sure and then maybe come up with a name that is more obvious to the user. |
ok if we want to have separate cli for users to run after the fact I guess I'm ok with it. it adds more things - documentation and testing mainly, but more commands always potential for user confusion. please make sure we have followups to add docs/tests. I thought this requirement was mostly from Felix and would just get this with the normal qualification runs. Let me talk to Felix and Hao |
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
user_tools/src/spark_rapids_tools/tools/qualification_stats_report.py
Outdated
Show resolved
Hide resolved
merged_df = merged_df.merge(total_duration_df, on=['App ID', 'SQL ID'], how='left') | ||
|
||
# Mark unique stage task durations | ||
merged_df['Unique StageTaskDuration'] = ~merged_df.duplicated( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so for each operator we will get all stages and then possibly separate rows if have both supported and unsupported and we just want to make sure we have those unique so we aren't double counting on time, correct?
So does this mean if you have an two operators in a stage, one that is supported and one that is not supported that we the time will be in there twice, correct? which is fine I just want to make sure I'm understanding properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I updated it to handle this case. Until now, it was taking into account same operators within a SQL ID and the assumption was that the operator will be either supported or unsupported within a stage. But there could be a scenario where the same operator can be supported and unsupported within a stage(due to underlying Expression not supported).
Sample output: There are 4 entries of Project for SQLID=1, StageID(5,6) in execs.csv and 3 are unsupported as below:
unsupported_operators_df
App ID SQL ID Stage ID Operator
0 1 3 7 HashAggregate
1 1 1 6 Project
2 2 1 -1 Project
3 1 1 -1 Filter
4 1 1 5 Project
5 1 1 5 Project
stages_df:
stages_df
App ID Stage ID StageTaskDuration
0 1 5 50
2 1 4 20
3 1 3 100
4 1 6 60
5 1 7 100
final dataframe output(before renaming columns):
App ID SQL ID Operator Count StageTaskDuration TotalSQLDuration % of Total SQL Duration Supported
0 1 1 Filter 2 70 230 30.434783 True
1 1 1 Project 3 110 230 47.826087 False
2 1 1 Project 1 50 230 21.739130 True
3 1 1 Sort 3 170 230 73.913043 True
4 1 3 HashAggregate 1 100 100 100.000000 False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I assume the sql id 1 above example has some execs - likely the Project that are unsupposed in the same stages as another exec that is supported, correct? Because you hadd up the 4 of those 70 + 110 + 50 + 170 and its more then the 230 for total.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct. In the above example For SQLID=1, there are 4 Projects in total.
StageID=5, Unsupported=2, Supported=1, StageTaskDuration=50
StageID=6, Unsupported=1, StageTaskDuration=60
So we have 2 rows in output column for Project.
StageTaskDuration=110 ( 50 + 60) for Unsupported
StageTaskDuration=50 for Supported
That's a good point @tgravescs |
yeah I'm good with a dev command for it. Talked to Hao and Felix and they agreed not needed the public/external command |
Signed-off-by: Niranjan Artal <[email protected]>
This PR contributes to #1157 .
This PR has the following changes:
Changes made:
SparkQualificationStats
object passing the context. The logic of creating statistics and generating a csv out file is inSparkQualificationStats
.WholeStageCodegen
and exploding theExec Stage
column.For a given Exec(operator), determine all the stages where this Exec is present in that SQL. Get StageTaskDurations of
all the stages this Exec is present to get the StageTaskExecDuration. We can also determine the total StageTaskDuration corresponding to that SQLID from execs.csv. Note that this stats is based on the stage to SQL mapping based on execs.csv file. We know that there could be some Execs to which we cannot map stages and will miss those out in the calculation/output. This is consistent with the current csv output files.
Output:
Sample output:
###Output Columns description:
Follow on
zero