Add a TPC-DS SF 10 Notebook #448

Merged 2 commits on Oct 29, 2024

Conversation

gerashegalov
Collaborator

This notebook demonstrates GPU acceleration of TPC-DS queries. It is portable across:

  • Local Jupyter
  • VS Code
  • Google Colab

@gerashegalov gerashegalov self-assigned this Oct 19, 2024
@viadea viadea requested a review from nvliyuan October 22, 2024 15:11
@viadea
Collaborator

viadea commented Oct 23, 2024

@gerashegalov Could you help add:

  1. A CPU notebook
  2. The performance results for the CPU and GPU notebooks, along with the corresponding environment,
    as on the page https://github.com/NVIDIA/spark-rapids-examples/tree/main/examples/XGBoost-Examples?

@gerashegalov
Collaborator Author

@viadea the notebook in this PR runs the same workload on CPU and then on GPU. It then visualizes the metrics of those runs as a chart:

https://github.com/NVIDIA/spark-rapids-examples/blob/3acce4535b15c1436be5b5efe6fcdf9a044b60ba/examples/SQL%2BDF-Examples/tpcds/notebooks/TPCDS-SF10.ipynb

If the CPU and GPU notebooks are separate, we have to exchange metrics via a file instead of dataframes. So I am not sure what the ask is in

  1. A CPU notebook
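
For illustration, here is a minimal sketch of what exchanging metrics via a file between two separate notebooks could look like. It is not taken from the notebook in this PR: the column names, file names, helper functions, and timings are all hypothetical placeholders.

```python
import os
import tempfile

import pandas as pd

# Hypothetical schema for per-query timings; not taken from the PR's notebook.
COLUMNS = ["query", "mode", "seconds"]


def save_metrics(records, path):
    """Write (query, mode, seconds) records to a CSV file."""
    pd.DataFrame(records, columns=COLUMNS).to_csv(path, index=False)


def load_metrics(*paths):
    """Read one CSV per run and stack them into a single dataframe."""
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    cpu_path = os.path.join(tmp, "cpu_metrics.csv")
    gpu_path = os.path.join(tmp, "gpu_metrics.csv")
    # Placeholder timings, not measured results.
    save_metrics([("q1", "cpu", 4.0), ("q2", "cpu", 9.0)], cpu_path)  # "CPU notebook"
    save_metrics([("q1", "gpu", 2.0), ("q2", "gpu", 3.0)], gpu_path)  # "GPU notebook"
    merged = load_metrics(cpu_path, gpu_path)  # "comparison" step

print(merged)
```

With a single notebook, the two runs can share one in-memory dataframe and the file round trip is unnecessary.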

@nvliyuan
Collaborator

nvliyuan commented Oct 23, 2024

Hi @gerashegalov, thanks for the contribution. Could you add more background? Normally, the examples repo showcases the plugin's capability and performance. For example, example/MIG-Support showcases support for GPU scalability, ML+DL-Example showcases the plugin's integration with machine learning and deep learning, and SQL+DF-Examples showcases acceleration of data processing, especially the strong performance we can observe in the microbenchmark. The Intersection demo is modeled on TPC-DS Query14a, so it seems we already have a similar case, which shows about a 5x speedup.

Collaborator Author

@gerashegalov gerashegalov left a comment


Hi @nvliyuan

Value proposition 1:

This notebook is an example of a self-contained notebook that, unlike most of the notebooks in this repository, requires no extra installation steps on most IPython kernels. This allows it to be opened and run directly in Google Colab, Jupyter, or VS Code without any modifications.

Value proposition 2:

It runs TPC-DS queries both on CPU and GPU and plots metrics of these very runs (via pandas dataframe) in this notebook as charts, side by side.

It also shows how to see both the initial CPU Plan and the final GPU Plan.
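
As a rough sketch of the side-by-side comparison described above, per-query timings held in a pandas dataframe can be pivoted into one column per mode. The column names, the `compare_runs` helper, and the timings below are hypothetical placeholders, not the notebook's actual code or results.

```python
import pandas as pd


def compare_runs(metrics):
    """Pivot long-form timings into one row per query with cpu/gpu columns."""
    wide = metrics.pivot(index="query", columns="mode", values="seconds")
    # Speedup is CPU time divided by GPU time for the same query.
    wide["speedup"] = wide["cpu"] / wide["gpu"]
    return wide


# Placeholder timings, not measured results.
metrics = pd.DataFrame(
    {
        "query": ["q1", "q1", "q2", "q2"],
        "mode": ["cpu", "gpu", "cpu", "gpu"],
        "seconds": [4.0, 2.0, 9.0, 3.0],
    }
)

table = compare_runs(metrics)
print(table)
```

If matplotlib is installed, `table[["cpu", "gpu"]].plot.bar()` would draw the two runs as grouped bars per query.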

@viadea
Collaborator

viadea commented Oct 23, 2024

@gerashegalov could you add a chart comparing CPU vs GPU perf to a README for this example? We normally update the perf chart for each example every release.
An example is in https://github.com/NVIDIA/spark-rapids-examples/tree/main/examples/SQL%2BDF-Examples/micro-benchmarks

@gerashegalov
Collaborator Author

viadea
viadea previously approved these changes Oct 25, 2024
@gerashegalov
Collaborator Author

@nvliyuan is it ok to merge this PR?

@nvliyuan
Collaborator

Hi @gerashegalov, I tested it and it works well. Just one nit: could you update the Spark version from 3.5.1 to 3.5.0 (to keep it in sync with the notebook)? Thanks.

@gerashegalov
Collaborator Author

@nvliyuan Thanks for the review. I fixed the Spark version in the README.

nvliyuan
nvliyuan previously approved these changes Oct 28, 2024
@nvliyuan
Collaborator

nvliyuan commented Oct 28, 2024

can you help target to branch 24.12? @gerashegalov

@gerashegalov gerashegalov changed the base branch from main to branch-24.12 October 29, 2024 03:48
@gerashegalov gerashegalov dismissed nvliyuan’s stale review October 29, 2024 03:48

The base branch was changed.

or Google Colab

Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov
Collaborator Author

can you help target to branch 24.12? @gerashegalov

@nvliyuan done! I missed that because this repo does not follow the same convention as the others that switch the default branch every release.

@nvliyuan nvliyuan self-requested a review October 29, 2024 09:18
Collaborator

@nvliyuan nvliyuan left a comment


LGTM

@nvliyuan nvliyuan merged commit ca1555a into NVIDIA:branch-24.12 Oct 29, 2024
2 checks passed
parthosa pushed a commit to parthosa/spark-rapids-examples that referenced this pull request Oct 30, 2024
* Add a TPC-DS SF 10 Notebook for local Jupyter or Google Colab

Signed-off-by: Gera Shegalov <[email protected]>

* Update link to the current blob

Signed-off-by: Gera Shegalov <[email protected]>

---------

Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov gerashegalov deleted the tpcds branch October 31, 2024 00:28