[BUG] qualification tool can error out from a divide by zero #637

Closed
kuhushukla opened this issue Oct 27, 2023 · 4 comments
Labels: bug (Something isn't working), core_tools (Scope the core module (scala)), usability (track issues related to the Tools's user experience)

Comments

@kuhushukla
Collaborator

Describe the bug
The CPU cost division used for the savings estimate can cause a divide-by-zero error.

Steps/Code to reproduce bug
Use an event log where costs are forced to zero:

2023-10-26 15:52:14,823 INFO rapids.tools.savings: Force costs to 0 because the original cost is 0.000000
2023-10-26 15:52:14,823 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 108, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 239, in _collect_result
    self._process_output()
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 760, in _process_output
    report_gen = self.__build_global_report_summary(df, csv_summary_file)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 662, in __build_global_report_summary
    apps_working_set = self.__calc_apps_cost(apps_reshaped_df,
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 620, in __calc_apps_cost
    app_df_set[cost_cols] = app_df_set.apply(
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 8845, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 733, in apply
    return self.apply_standard()
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 857, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 873, in apply_series_generator
    results[i] = self.f(v)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 621, in <lambda>
    lambda row: get_costs_for_single_app(row, estimator=savings_estimator), axis=1)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 577, in get_costs_for_single_app
    est_savings = 100.0 - ((100.0 * gpu_cost) / cpu_cost)
ZeroDivisionError: float division by zero

Expected behavior
Default the savings estimate to 0 instead of raising a divide-by-zero error.
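
A minimal sketch of the kind of guard that would satisfy this, using the formula from the traceback above; the function name and signature are illustrative, not the tool's actual code:

    def estimate_savings(cpu_cost: float, gpu_cost: float) -> float:
        # Percentage cost savings of the GPU run relative to the CPU run.
        # Guard against a zero CPU cost (e.g. when rapids.tools.savings
        # forces costs to 0) instead of raising ZeroDivisionError.
        if cpu_cost <= 0.0:
            return 0.0  # no meaningful baseline; default the estimate to 0
        return 100.0 - ((100.0 * gpu_cost) / cpu_cost)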

@amahussein
Collaborator

Our approach to deal with this is to:

  • First, reproduce the issue so we know exactly what causes the CPU cost to be unavailable.
  • Once there, fix the root cause. For example, if we are missing a device or a mapping, we should fix it.
  • Finally, make the Qualification tool smarter when it fails to get the CPU cost, so the failure cannot propagate.

The divide-by-zero itself is only the symptom.

@amahussein
Collaborator

We need @kuhushukla's help to reproduce it.

@cindyyuanjiang
Collaborator

cindyyuanjiang commented Jan 4, 2024

I found a scenario that crashes the Qualification tool, although it did not reproduce the divide-by-zero error as I expected.

Repro:

  1. Remove instance type Standard_DS3_v2 from user_tools/src/spark_rapids_pytools/resources/premium-databricks-azure-catalog.json
  2. Run: spark_rapids_user_tools databricks-azure qualification -e <my-event-log> --cpu_cluster <my-cpu-cluster> --verbose, where <my-cpu-cluster> has worker_node type Standard_DS3_v2

Stack trace:

2024-01-03 14:31:40,683 ERROR rapids.tools.price.Databricks-Azure: Could not find price for instance type 'Standard_DS3_v2': 'NoneType' object has no attribute 'get'
2024-01-03 14:31:40,683 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 110, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 242, in _collect_result
    self._process_output()
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 760, in _process_output
    report_gen = self.__build_global_report_summary(df, csv_summary_file)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 662, in __build_global_report_summary
    apps_working_set = self.__calc_apps_cost(apps_reshaped_df,
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 616, in __calc_apps_cost
    savings_estimator = self.ctxt.platform.create_saving_estimator(self.ctxt.get_ctxt('cpuClusterProxy'),
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py", line 82, in create_saving_estimator
    saving_estimator = DBAzureSavingsEstimator(price_provider=db_azure_price_provider,
  File "<string>", line 9, in __init__
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/price_provider.py", line 148, in __post_init__
    self._setup_costs()
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/price_provider.py", line 143, in _setup_costs
    self.source_cost = self._get_cost_per_cluster(self.source_cluster)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py", line 410, in _get_cost_per_cluster
    cost = self.price_provider.get_instance_price(instance=instance_type)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/databricks_azure_pricing.py", line 83, in get_instance_price
    raise ex
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/databricks_azure_pricing.py", line 79, in get_instance_price
    rate_per_hour = instance_conf.get('TotalPricePerHour')
AttributeError: 'NoneType' object has no attribute 'get'
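
A sketch of the kind of guard that would turn this crash into an actionable error message; the catalog layout ('Instances' -> instance -> 'TotalPricePerHour') is an assumption drawn from the traceback, not the provider's real schema:

    def get_instance_price(catalog: dict, instance: str) -> float:
        # Hypothetical defensive version of the lookup that crashed above;
        # the catalog access pattern here is assumed, not the actual code.
        instance_conf = catalog.get('Instances', {}).get(instance)
        if instance_conf is None:
            raise RuntimeError(
                f'Instance type {instance!r} is missing from the price catalog; '
                'cannot estimate the cluster cost.')
        return instance_conf.get('TotalPricePerHour')

With a check like this, a missing catalog entry surfaces as a clear configuration error instead of "'NoneType' object has no attribute 'get'".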

@amahussein amahussein added the core_tools Scope the core module (scala) label Jan 23, 2024
@amahussein amahussein added the usability track issues related to the Tools's user experience label May 22, 2024
@amahussein
Collaborator

Could not reproduce it.

@amahussein amahussein closed this as not planned Jul 30, 2024