[FEA] Enable AQE related recommendations in Profiler Auto-tuner #688

Merged: 8 commits, Jan 11, 2024

@cindyyuanjiang cindyyuanjiang commented Dec 12, 2023

Fixes #576

This PR adds the following AQE-related recommendations to the auto-tuner:

| Spark Property | Recommendation |
| --- | --- |
| `spark.rapids.sql.batchSizeBytes` | Set to just under 2GB (default is 1GB) |
| `spark.sql.adaptive.autoBroadcastJoinThreshold` | If the setting is above 100MB (default is 10MB), recommend user to set to a lower number. |
| `spark.sql.adaptive.advisoryPartitionSizeInBytes` | If Input Size > 35KB and Shuffle Read > 50KB:<br>- For A100, set to 64MB<br>- For T4, set to 32MB<br>Otherwise, set to 128MB. |
| `spark.sql.adaptive.coalescePartitions.initialPartitionNum` | If Input Size > 35KB, Shuffle Read > 50KB and value < 200 (low value):<br>- For A100, set to 400<br>- For T4, set to 800 |
| `spark.sql.adaptive.coalescePartitions.parallelismFirst` | If Input Size > 35KB and Shuffle Read > 50KB:<br>- Set to 'false' to prioritize 'advisoryPartitionSizeInBytes' over 'minPartitionSize' for better performance |
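
For readers who want to see how the recommendations listed above fit together, here is a minimal Python sketch of the rules. It is not the auto-tuner's actual implementation: the function name `recommend_aqe`, its parameters, and the constant names are hypothetical. It also makes three assumptions that the list does not spell out: "just under 2GB" is taken as 2^31 - 1 bytes, "Otherwise, set to 128MB" is read as covering GPUs other than A100 and T4 as well, and an unset `initialPartitionNum` is not treated as a low value.

```python
from typing import Dict, List, Optional, Tuple

# Thresholds taken from the recommendation list; constant names are hypothetical.
INPUT_SIZE_THRESHOLD_KB = 35
SHUFFLE_READ_THRESHOLD_KB = 50
BROADCAST_THRESHOLD_LIMIT_MB = 100
LOW_INITIAL_PARTITION_NUM = 200

# Per-GPU targets from the recommendation list.
ADVISORY_SIZE_BY_GPU = {"A100": "64m", "T4": "32m"}
INITIAL_PARTITIONS_BY_GPU = {"A100": 400, "T4": 800}


def recommend_aqe(
    gpu: str,
    input_size_kb: float,
    shuffle_read_kb: float,
    broadcast_threshold_mb: Optional[float] = None,
    initial_partition_num: Optional[int] = None,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (recommended properties, free-text notes) for one application."""
    recs: Dict[str, str] = {}
    notes: List[str] = []

    # Assumption: "just under 2GB" is interpreted as 2^31 - 1 bytes.
    recs["spark.rapids.sql.batchSizeBytes"] = str(2**31 - 1)

    # The list gives no target value here, so only a note is emitted.
    if (broadcast_threshold_mb is not None
            and broadcast_threshold_mb > BROADCAST_THRESHOLD_LIMIT_MB):
        notes.append(
            "spark.sql.adaptive.autoBroadcastJoinThreshold is above 100MB; "
            "consider a lower value"
        )

    meets_size_condition = (
        input_size_kb > INPUT_SIZE_THRESHOLD_KB
        and shuffle_read_kb > SHUFFLE_READ_THRESHOLD_KB
    )

    # Assumption: the 128MB fallback also applies to GPUs not in the list.
    advisory_key = "spark.sql.adaptive.advisoryPartitionSizeInBytes"
    if meets_size_condition and gpu in ADVISORY_SIZE_BY_GPU:
        recs[advisory_key] = ADVISORY_SIZE_BY_GPU[gpu]
    else:
        recs[advisory_key] = "128m"

    # Assumption: an unset initialPartitionNum is not considered "low".
    if (meets_size_condition
            and gpu in INITIAL_PARTITIONS_BY_GPU
            and initial_partition_num is not None
            and initial_partition_num < LOW_INITIAL_PARTITION_NUM):
        recs["spark.sql.adaptive.coalescePartitions.initialPartitionNum"] = str(
            INITIAL_PARTITIONS_BY_GPU[gpu]
        )

    if meets_size_condition:
        recs["spark.sql.adaptive.coalescePartitions.parallelismFirst"] = "false"

    return recs, notes
```

For example, `recommend_aqe("T4", 100, 100, broadcast_threshold_mb=200, initial_partition_num=100)` exercises every rule at once. Returning notes separately from property values keeps the broadcast-threshold advice as a comment, since the list recommends "a lower number" without naming one.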

@amahussein amahussein left a comment

Thanks @parthosa and @cindyyuanjiang!
I left a few comments.

Can you please also check how the AQE recommendations look in the user-tools?
One thing to keep an eye on is that the tables in the profiler stdout stay readable and match the expected output from the profiler core.

@cindyyuanjiang
Collaborator Author

Filed follow up issue, tracked here: #719

revans2 previously approved these changes Jan 9, 2024

@revans2 revans2 left a comment

Switching my review to approved. I saw the benchmark results and I am okay with the change now. I am a little concerned about what happens if the number of shuffle partitions is very small, or the AQE target shuffle size is very large, or the maxPartitionBytes is very large. But as long as we have benchmarks that we are running, we can improve them over time as more corner cases from customers show up.

@amahussein amahussein left a comment

The PR has conflicts and needs an upmerge.

@cindyyuanjiang
Collaborator Author

cindyyuanjiang commented Jan 10, 2024

> Switching my review to approved. I saw the benchmark results and I am okay with the change now. I am a little concerned ...

Thank you for the feedback @revans2!

Signed-off-by: cindyyuanjiang <[email protected]>
@amahussein amahussein merged commit c0b4ddf into NVIDIA:dev Jan 11, 2024
13 checks passed
Successfully merging this pull request may close these issues.

[FEA] Enable AQE-related settings to be recommended as part of the profiler auto-tuner
5 participants