Fix cluster recommendation when CPU cluster cannot be determined #13

@parthosa parthosa commented Jul 26, 2024

This PR fixes an issue where no cluster recommendation is generated when the CPU cluster cannot be created (e.g., no matching executor instance is found for the required number of cores).

Approach

The Scala tool now generates a recommended GPU cluster on a per-app basis (NVIDIA/spark-rapids-tools#1188). When a CPU cluster is not provided, we should use the values from the Scala tool output for our GPU cluster recommendation instead of Python's CPU<->GPU core matching.
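
A minimal sketch of the fallback described above, not the code from this PR. `ClusterShape`, `CPU_TO_GPU`, and `recommend_gpu_cluster` are hypothetical names, and the instance pairs in the mapping are simply the ones that appear in the sample output of this PR:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClusterShape:
    """Hypothetical container for a cluster layout (not the tool's real class)."""
    driver: str
    executor: str
    num_executors: int

    def __str__(self) -> str:
        return f"<Driver: {self.driver}, Executor: {self.num_executors} X {self.executor}>"


# Assumed CPU -> GPU instance pairs, copied from the sample logs in this PR.
CPU_TO_GPU = {"m6gd.4xlarge": "g5.4xlarge", "m6gd.xlarge": "g5.xlarge"}


def recommend_gpu_cluster(cpu_cluster: Optional[ClusterShape],
                          scala_recommendation: Optional[ClusterShape]) -> Optional[ClusterShape]:
    """Prefer CPU<->GPU matching; fall back to the Scala tool's per-app recommendation."""
    if cpu_cluster is not None:
        gpu_node = CPU_TO_GPU.get(cpu_cluster.executor)
        if gpu_node is not None:
            # Keep the driver and executor count, swap only the executor instance type.
            return ClusterShape(cpu_cluster.driver, gpu_node, cpu_cluster.num_executors)
    # No CPU cluster (or no match): use whatever the Scala tool produced for this app.
    return scala_recommendation
```

With this shape, an app whose CPU cluster could not be inferred still gets a recommendation as long as the Scala output contains one.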

Output

Case 1: CPU cluster is not passed and we infer CPU cluster for each app

Logs (for each app):

INFO rapids.tools.cluster_inference: For App ID: app-20200423033538-0000, Unable to infer CPU cluster. Reason - No matching executor instance found for num cores = 80
INFO rapids.tools.cluster_recommender: For App ID: app-20200423033538-0000, CPU cluster: N/A; Recommended GPU cluster: <Driver: m6gd.xlarge, Executor: 16 X g5.4xlarge>
INFO rapids.tools.cluster_recommender: For App ID: app-20210509200722-0001, Inferred CPU cluster: <Driver: m6gd.xlarge, Executor: 1 X m6gd.4xlarge>; Recommended GPU cluster: <Driver: m6gd.xlarge, Executor: 1 X g5.4xlarge>

Final Result:

+----+---------------------+-------------------------+-----------------+----------------------------+------------------------------+-----------------------------+
|    | App Name            | App ID                  | Estimated GPU   | Qualified Node             | Full Cluster                 | GPU Config                  |
|    |                     |                         | Speedup         | Recommendation             | Config                       | Recommendation              |
|    |                     |                         | Category**      |                            | Recommendations*             | Breakdown*                  |
|----+---------------------+-------------------------+-----------------+----------------------------+------------------------------+-----------------------------|
|  1 | spark_test_apps.py  | app-20200423033538-0000 | Large           | g5.4xlarge                 | app-20200423033538-0000.conf | app-20200423033538-0000.log |
|  2 | Spark shell         | app-20210509200722-0001 | Small           | m6gd.4xlarge to g5.4xlarge | app-20210509200722-0001.conf | app-20210509200722-0001.log |
+----+---------------------+-------------------------+-----------------+----------------------------+------------------------------+-----------------------------+
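
The "Qualified Node Recommendation" column above shows only the GPU node when the CPU cluster is N/A, and "CPU node to GPU node" otherwise. A small sketch of that formatting rule, assuming a hypothetical helper `format_node_recommendation` (the name and signature are illustrative, not taken from the tool):

```python
from typing import Optional


def format_node_recommendation(cpu_node: Optional[str], gpu_node: str) -> str:
    """Build the column value; cpu_node is None when the CPU cluster is unknown."""
    if cpu_node is None:
        # Nothing to convert from, so report only the recommended GPU node.
        return gpu_node
    return f"{cpu_node} to {gpu_node}"


print(format_node_recommendation(None, "g5.4xlarge"))
print(format_node_recommendation("m6gd.4xlarge", "g5.4xlarge"))
```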

Case 2: CPU cluster is passed as input (--cluster <cluster>)

Logs (for all apps):

INFO rapids.tools.cluster_recommender: CPU cluster: <Driver: m6gd.xlarge, Executor: 2 X m6gd.xlarge>; Recommended GPU cluster: <Driver: m6gd.xlarge, Executor: 2 X g5.xlarge>

Final Result:

+----+---------------------+-------------------------+-----------------+--------------------------+------------------------------+-----------------------------+
|    | App Name            | App ID                  | Estimated GPU   | Qualified Node           | Full Cluster                 | GPU Config                  |
|    |                     |                         | Speedup         | Recommendation           | Config                       | Recommendation              |
|    |                     |                         | Category**      |                          | Recommendations*             | Breakdown*                  |
|----+---------------------+-------------------------+-----------------+--------------------------+------------------------------+-----------------------------|
|  1 | spark_test_apps.py  | app-20200423033538-0000 | Large           | m6gd.xlarge to g5.xlarge | app-20200423033538-0000.conf | app-20200423033538-0000.log |
|  2 | Spark shell         | app-20210509200722-0001 | Small           | m6gd.xlarge to g5.xlarge | app-20210509200722-0001.conf | app-20210509200722-0001.log |
+----+---------------------+-------------------------+-----------------+--------------------------+------------------------------+-----------------------------+

@parthosa parthosa added the bug ("Something isn't working") label Jul 26, 2024
@parthosa parthosa requested a review from amahussein July 26, 2024 17:18
@parthosa parthosa self-assigned this Jul 26, 2024
@parthosa parthosa changed the title from "Fix node recommendation when CPU cluster cannot be determined" to "Fix cluster recommendation when CPU cluster cannot be determined" Jul 26, 2024
@parthosa parthosa merged commit 9904fc9 into amahussein:spark-rapids-tools-1221-cost-args Jul 26, 2024
13 of 14 checks passed
@parthosa parthosa deleted the spark-rapids-tools-1221-fix-cluster-recommendation branch July 26, 2024 17:51
amahussein added a commit that referenced this pull request Jul 27, 2024
* Remove arguments related to cost-savings

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Fixes NVIDIA#1229

- remove the legacy `spark_rapids_user_tools` cmd
- remove qualification arguments related to cost-savings

* remove gpu-cluster-reshape and grouping of apps

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Fixes NVIDIA#1099

- disable grouping of results by row_name
- the file `qualification_summary_full.csv` is omitted

* Fix cluster recommendation when CPU cluster cannot be determined (#13)
* Fix node recommendation when CPU cluster cannot be determined
* Move cluster cols to config file

Signed-off-by: Partho Sarthi <[email protected]>

---------

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Signed-off-by: Partho Sarthi <[email protected]>
Co-authored-by: Partho Sarthi <[email protected]>