[BUG] qualification tool can error out from a divide by zero #637

Closed
kuhushukla opened this issue Oct 27, 2023 · 4 comments
Labels: bug (Something isn't working), core_tools (Scope the core module (scala)), usability (track issues related to the Tools's user experience)

Comments

@kuhushukla
Collaborator

Describe the bug
The CPU cost division used for the savings estimate can cause a divide-by-zero error.

Steps/Code to reproduce bug
Use an event log where costs are forced to zero:

2023-10-26 15:52:14,823 INFO rapids.tools.savings: Force costs to 0 because the original cost is 0.000000
2023-10-26 15:52:14,823 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 108, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 239, in _collect_result
    self._process_output()
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 760, in _process_output
    report_gen = self.__build_global_report_summary(df, csv_summary_file)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 662, in __build_global_report_summary
    apps_working_set = self.__calc_apps_cost(apps_reshaped_df,
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 620, in __calc_apps_cost
    app_df_set[cost_cols] = app_df_set.apply(
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 8845, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 733, in apply
    return self.apply_standard()
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 857, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 873, in apply_series_generator
    results[i] = self.f(v)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 621, in <lambda>
    lambda row: get_costs_for_single_app(row, estimator=savings_estimator), axis=1)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 577, in get_costs_for_single_app
    est_savings = 100.0 - ((100.0 * gpu_cost) / cpu_cost)
ZeroDivisionError: float division by zero

Expected behavior
Default the savings estimate to 0 instead of raising a divide-by-zero error.
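
A minimal sketch of the kind of guard that would satisfy this, using the formula from the traceback above; the function name and signature are illustrative, not the tool's actual code:

    def estimate_savings(cpu_cost: float, gpu_cost: float) -> float:
        # Percentage cost savings of the GPU run relative to the CPU run.
        # Guard against a zero CPU cost (e.g. when rapids.tools.savings
        # forces costs to 0) instead of raising ZeroDivisionError.
        if cpu_cost <= 0.0:
            return 0.0  # no meaningful baseline; default the estimate to 0
        return 100.0 - ((100.0 * gpu_cost) / cpu_cost)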

@amahussein
Collaborator

Our approach to deal with this is to:

  • First, reproduce the issue so we know exactly what causes the CPU cost to be unavailable.
  • Once there, fix the root cause. For example, if we are missing a device or a mapping, we should fix it.
  • Finally, make the Qualification tool smarter when it fails to get the CPU cost, so the failure cannot propagate.

The divide-by-zero itself is only the symptom.

@amahussein
Collaborator

We need @kuhushukla's help to reproduce it.

@cindyyuanjiang
Collaborator

cindyyuanjiang commented Jan 4, 2024

I found a scenario that crashes the Qualification tool, although it did not reproduce the divide-by-zero error as I expected.

Repro:

  1. Remove instance type Standard_DS3_v2 from user_tools/src/spark_rapids_pytools/resources/premium-databricks-azure-catalog.json
  2. Run: spark_rapids_user_tools databricks-azure qualification -e <my-event-log> --cpu_cluster <my-cpu-cluster> --verbose, where <my-cpu-cluster> has worker_node type Standard_DS3_v2

Stack trace:

2024-01-03 14:31:40,683 ERROR rapids.tools.price.Databricks-Azure: Could not find price for instance type 'Standard_DS3_v2': 'NoneType' object has no attribute 'get'
2024-01-03 14:31:40,683 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 110, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 242, in _collect_result
    self._process_output()
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 760, in _process_output
    report_gen = self.__build_global_report_summary(df, csv_summary_file)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 662, in __build_global_report_summary
    apps_working_set = self.__calc_apps_cost(apps_reshaped_df,
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 616, in __calc_apps_cost
    savings_estimator = self.ctxt.platform.create_saving_estimator(self.ctxt.get_ctxt('cpuClusterProxy'),
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py", line 82, in create_saving_estimator
    saving_estimator = DBAzureSavingsEstimator(price_provider=db_azure_price_provider,
  File "<string>", line 9, in __init__
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/price_provider.py", line 148, in __post_init__
    self._setup_costs()
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/price_provider.py", line 143, in _setup_costs
    self.source_cost = self._get_cost_per_cluster(self.source_cluster)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py", line 410, in _get_cost_per_cluster
    cost = self.price_provider.get_instance_price(instance=instance_type)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/databricks_azure_pricing.py", line 83, in get_instance_price
    raise ex
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/databricks_azure_pricing.py", line 79, in get_instance_price
    rate_per_hour = instance_conf.get('TotalPricePerHour')
AttributeError: 'NoneType' object has no attribute 'get'
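
A sketch of the kind of guard that would turn this crash into an actionable error message; the catalog layout ('Instances' -> instance -> 'TotalPricePerHour') is an assumption drawn from the traceback, not the provider's real schema:

    def get_instance_price(catalog: dict, instance: str) -> float:
        # Hypothetical defensive version of the lookup that crashed above;
        # the catalog access pattern here is assumed, not the actual code.
        instance_conf = catalog.get('Instances', {}).get(instance)
        if instance_conf is None:
            raise RuntimeError(
                f'Instance type {instance!r} is missing from the price catalog; '
                'cannot estimate the cluster cost.')
        return instance_conf.get('TotalPricePerHour')

With a check like this, a missing catalog entry surfaces as a clear configuration error instead of "'NoneType' object has no attribute 'get'".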

@amahussein amahussein added the core_tools Scope the core module (scala) label Jan 23, 2024
@amahussein amahussein added the usability track issues related to the Tools's user experience label May 22, 2024
@amahussein
Collaborator

Could not reproduce it.

@amahussein amahussein closed this as not planned Jul 30, 2024