Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fill nulls for multi-metric queries #850

Merged
merged 9 commits into from
Nov 8, 2023

Conversation

courtneyholcomb
Copy link
Contributor

@courtneyholcomb courtneyholcomb commented Nov 6, 2023

Resolves #SL-1173

Description

  • Updates the logic when filling nulls for input measures. When joining multiple metrics, you may end up with null rows post-join if 1) there are dimensions in the query and 2) one of the metrics has dimension values that the other metric does not have. In that case, if the metric's input measure fills nulls, we want to fill those nulls with the requested value. This PR implements that behavior.
  • Tangentially, this updates MetricInstance.defined_from to contain a single MetricModelReference instead of multiple. It doesn't seem like there is ever a time we would want multiple (multiple are never used in the codebase), and we need to know just one to get the correct element name.

@cla-bot cla-bot bot added the cla:yes label Nov 6, 2023
Copy link

github-actions bot commented Nov 6, 2023

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@courtneyholcomb courtneyholcomb force-pushed the court/fill-nulls-multi-metric branch from da2653a to e8e1b17 Compare November 6, 2023 22:57
@courtneyholcomb courtneyholcomb changed the title Update MetricInstance to take a singular defined_from object instead … Fill nulls for multi-metric queries Nov 6, 2023
@courtneyholcomb courtneyholcomb marked this pull request as ready for review November 6, 2023 23:01
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this works, but if it does, can we either:

  1. Have an accessor on the metric_lookup object that only returns the immediate measure inputs for the metric in question?
  2. Have an is_derived or similar property that is based on the metric type rather than the number of measure inputs?

I'm concerned about edge case bugs creeping into the rendering here, and it'll be hard to reason about them because the typechecker can't help out at all. Like if I add a new metric type and we have logic that does an assert_values_exhausted if/else block switching over all metric types, that'll fail the typechecker.

@@ -706,7 +706,7 @@ def visit_compute_metrics_node(self, node: ComputeMetricsNode) -> SqlDataSet:
metric_instances.append(
MetricInstance(
associated_columns=(output_column_association,),
defined_from=(MetricModelReference(metric_name=metric_spec.element_name),),
defined_from=MetricModelReference(metric_name=metric_spec.element_name),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL, thanks. We have so many of these kind of "let's allow a sequence of things and then have exactly one callsite that manually enforces there's only one" interfaces lying around...

@@ -436,7 +436,7 @@ def from_reference(reference: MetricReference) -> MetricSpec:

@property
def alias_spec(self) -> MetricSpec:
"""Returns a MetricSpec represneting the alias state."""
"""Returns a MetricSpec representing the alias state."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😸

group_by_objs: [{"name": "metric_time"}]
check_query: |
SELECT
COALESCE(subq_7.metric_time__day, subq_12.metric_time__day) AS metric_time__day
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there missing values in this date sequence for the bookings_fill_nulls_with metric? If so that's great. An output test would be nice but at least we can inspect the data frame from here and see it.

If not, it might be worth adding a dimension that has a gap for a given metric_time, dimension_value pair.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new test case & updated the source data to make sure there were some nulls filled for this case.
Will plan to add some output tests tomorrow!

metricflow/plan_conversion/dataflow_to_sql.py Outdated Show resolved Hide resolved
Comment on lines 901 to 904
input_measures = self._metric_lookup.measures_for_metric(
metric_reference=MetricReference(metric_name),
column_association_resolver=self._column_association_resolver,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to deal with the issue below is to add a different, internal API here that only gives the measures that are direct inputs into the metric. That way, if we add another metric type that takes measures as input anything derived from it won't have anything to do here, and we ensure we only do the coalesce once within a derived metric chain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea! I'll add that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added that!

@courtneyholcomb
Copy link
Contributor Author

@tlento This should be good to go, with the exception of updating Databricks snapshots. Wrapping that up now - repopulating the source schema is just taking forever.

@courtneyholcomb
Copy link
Contributor Author

@tlento This should be good to go, with the exception of updating Databricks snapshots. Wrapping that up now - repopulating the source schema is just taking forever.

And now done! Kicking off SQL engine tests now

@courtneyholcomb courtneyholcomb added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Nov 7, 2023
@courtneyholcomb courtneyholcomb added the Reload Test Data in SQL Engines Should be run when test data changes label Nov 7, 2023
@courtneyholcomb courtneyholcomb removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Nov 7, 2023
@github-actions github-actions bot removed the Reload Test Data in SQL Engines Should be run when test data changes label Nov 7, 2023
@courtneyholcomb courtneyholcomb added Reload Test Data in SQL Engines Should be run when test data changes Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment labels Nov 8, 2023
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Nov 8, 2023
@courtneyholcomb courtneyholcomb added Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment and removed Reload Test Data in SQL Engines Should be run when test data changes labels Nov 8, 2023
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Nov 8, 2023
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go! Thank you!

@@ -105,6 +106,23 @@ def add_metric(self, metric: Metric) -> None:
)
self._metrics[metric_reference] = metric

def yaml_input_measure_for_metric(self, metric_reference: MetricReference) -> Optional[MetricInputMeasure]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Super nitty but can we rename this away from yaml? YAML makes me grumpy. More importantly, there's no guarantee that the input is defined in YAML (although in practice that is how it works today).

Maybe something like direct_input_measure_for_metric or configured_input_measure_for_metric?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

Comment on lines 110 to 113
"""Get input measure defined in the metric YAML, if exists.

When SemanticModel is constructed, input measures from input metrics are added to the list of input measures
for a metric. Here, use rules about metric types to determine which input measures were defined in the YAML:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then here we can replace YAML with config or something.

@@ -275,7 +275,7 @@ integration_test:
check_query: |
SELECT
CAST(bookings AS {{ double_data_type_name }}) / NULLIF(views, 0) AS bookings_per_view
, groupby_8cbdaa28.ds AS metric_time__day
, COALESCE(groupby_8cbdaa28.ds, groupby_68058b0b.ds) AS metric_time__day
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those aliases though 🤣

from metricflow.test.snapshot_utils import assert_object_snapshot_equal


@pytest.mark.sql_engine_snapshot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These snapshot tests are great, thanks for doing this!

@courtneyholcomb courtneyholcomb merged commit dc076a1 into main Nov 8, 2023
8 checks passed
@courtneyholcomb courtneyholcomb deleted the court/fill-nulls-multi-metric branch November 8, 2023 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants