Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join to Time Spine & Fill Nulls #832

Merged
merged 13 commits into from
Nov 2, 2023
Merged

Join to Time Spine & Fill Nulls #832

merged 13 commits into from
Nov 2, 2023

Conversation

courtneyholcomb
Copy link
Contributor

@courtneyholcomb courtneyholcomb commented Oct 31, 2023

Resolves #SL-783

Description

We have two new attributes on MetricInputMeasures - join_to_timespine and fill_nulls_with. This PR implements the expected behavior for those features.

  1. join_to_timespine - when true for an input measure, we should join the already-aggregated measure to the time spine table, ensuring that any missing date periods in the underlying data are filled in with null rows. We do this by adding our existing JoinToTimeSpineNode to the Dataflow plan.
  2. fill_nulls_with accepts an optional integer input. When not null, any rows that are null for the post-aggregated measure should be filled with the requested integer. We implement this with a COALESCE statement.

@cla-bot cla-bot bot added the cla:yes label Oct 31, 2023
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl,dr; I think we should do the following:

  1. Make sure the code runs and test cases pass for this PR as it is, and document it as allowing filling nulls with whatever for time spine values only (since that is what it does).
  2. Add a follow up PR to allow filling nulls with zero (or whatever) for derived metrics/multi-metric queries. That will require a change in our compute metrics join semantics and tests for the multi-metric scenario.
  3. Add some documentation including expected input and output.... somewhere. The test cases don't exactly do this because they're just SQL queries.

Long commentary follows:

Whew. This feature is super gnarly. In hindsight I think we should've built the coalesce without time spine first and then added the time spine in after.

This PR does not appear to address cases where there are joined in dimensions and multiple metrics, either as input to derived metrics or as general metric output. Right now, that is an INNER JOIN and so we will still filter out measure values from those input sets.

I think there's a bit of an open question about what the user should expect in these cases, so we can chat about that separately, but for derived metrics in particular it's a problem that we throw away entire rows of output just because something that should be represented as a 0 doesn't produce any join output values.

I also think addressing them might be a pretty gnarly PR in and of itself, since we need to change our join semantics on the final compute metrics join steps. This also begins to get into the LEFT vs RIGHT outer join type on our dimension joins, which currently all originate from the measure aggregation input rather than allowing for dimension values to have NULL (replaced with zero) values.

metricflow/test/generate_snapshots.py Show resolved Hide resolved
metricflow/dataflow/builder/dataflow_plan_builder.py Outdated Show resolved Hide resolved
metricflow/plan_conversion/dataflow_to_sql.py Show resolved Hide resolved
metricflow/plan_conversion/dataflow_to_sql.py Show resolved Hide resolved
Comment on lines 3 to 4
booking__is_instant
, COALESCE(bookings, 0) AS bookings_fill_0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we haven't talked about at all I don't think, but I was thinking about it while reading the code. I believe this is the correct approach, but we need to document it carefully.

This does not fill values for missing dimension values. If, for some reason, there were 0 bookings with is_instant = true we would render one row - false, <number>.

In cases where users are doing some kind of full scope group by, or using those rows as inputs into a derived metric, they would want the full value set represented. However, in cases where they were filtering out those values, they would not.

Therefore, we cannot fill 0s for all dimension combinations, we can only ensure that any case where an actual NULL shows up (rather than a total absence of values) for a measure the NULL is coalesced to the specified fill value. This seems appropriate given the name of the property, but it's good for us to be clear about it in docs and so forth.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we did discuss this at standup yesterday! i.e., the fact that we obviously don't have a "dimension values spine" table to join to so we won't be filling those missing rows. So product is aware of this limitation (at least Nick is). But definitely agree that we need to document this clearly!!

metricflow/test/integration/test_cases/itest_metrics.yaml Outdated Show resolved Hide resolved
@tlento tlento added Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment and removed run_mf_sql_engine_tests labels Nov 2, 2023
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS November 2, 2023 00:43 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS November 2, 2023 00:43 — with GitHub Actions Inactive
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS November 2, 2023 00:43 — with GitHub Actions Failure
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS November 2, 2023 00:43 — with GitHub Actions Failure
@courtneyholcomb
Copy link
Contributor Author

document it as allowing filling nulls with whatever for time spine values only (since that is what it does).

I'll add here that this isn't ALL it does. If there are nulls in the underlying data, it fills those too. Not sure how common that is, though.

  1. Add some documentation including expected input and output.... somewhere. The test cases don't exactly do this because they're just SQL queries.

Do we currently do this anywhere ?

@courtneyholcomb courtneyholcomb added Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment and removed Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment labels Nov 2, 2023
@courtneyholcomb courtneyholcomb added Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment and removed Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment labels Nov 2, 2023
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS November 2, 2023 19:14 — with GitHub Actions Inactive
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS November 2, 2023 19:14 — with GitHub Actions Inactive
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS November 2, 2023 19:14 — with GitHub Actions Inactive
@courtneyholcomb
Copy link
Contributor Author

Snowflake test failures are unrelated to this PR. I put a fix up for those here: #835

Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add here that this isn't ALL it does. If there are nulls in the underlying data, it fills those too. Not sure how common that is, though.

True! My bad, too much sweeping generalization in there.

As for documenting things in other places.... I don't think we do, but now that I think about it I feel like the official dbt docs is the place to do it, so maybe we just open a PR over there after the next update is ready to roll out.

Thanks for doing the renaming, saves me some time!

@courtneyholcomb courtneyholcomb merged commit 7bbb704 into main Nov 2, 2023
@courtneyholcomb courtneyholcomb deleted the court/fill-nulls branch November 2, 2023 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants