-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Join to Time Spine & Fill Nulls #832
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tl,dr; I think we should do the following:
- Make sure the code runs and test cases pass for this PR as it is, and document it as allowing filling nulls with whatever for time spine values only (since that is what it does).
- Add a follow up PR to allow filling nulls with zero (or whatever) for derived metrics/multi-metric queries. That will require a change in our compute metrics join semantics and tests for the multi-metric scenario.
- Add some documentation including expected input and output.... somewhere. The test cases don't exactly do this because they're just SQL queries.
Long commentary follows:
Whew. This feature is super gnarly. In hindsight I think we should've built the coalesce without time spine first and then added the time spine in after.
This PR does not appear to address cases where there are joined in dimensions and multiple metrics, either as input to derived metrics or as general metric output. Right now, that is an INNER JOIN and so we will still filter out measure values from those input sets.
I think there's a bit of an open question about what the user should expect in these cases, so we can chat about that separately, but for derived metrics in particular it's a problem that we throw away entire rows of output just because something that should be represented as a 0 doesn't produce any join output values.
I also think addressing them might be a pretty gnarly PR in and of itself, since we need to change our join semantics on the final compute metrics join steps. This also begins to get into the LEFT vs RIGHT outer join type on our dimension joins, which currently all originate from the measure aggregation input rather than allowing for dimension values to have NULL (replaced with zero) values.
metricflow/test/fixtures/semantic_manifest_yamls/simple_manifest/metrics.yaml
Show resolved
Hide resolved
booking__is_instant | ||
, COALESCE(bookings, 0) AS bookings_fill_0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something we haven't talked about at all I don't think, but I was thinking about it while reading the code. I believe this is the correct approach, but we need to document it carefully.
This does not fill values for missing dimension values. If, for some reason, there were 0 bookings with is_instant = true we would render one row - false, <number>
.
In cases where users are doing some kind of full scope group by, or using those rows as inputs into a derived metric, they would want the full value set represented. However, in cases where they were filtering out those values, they would not.
Therefore, we cannot fill 0s for all dimension combinations, we can only ensure that any case where an actual NULL shows up (rather than a total absence of values) for a measure the NULL is coalesced to the specified fill value. This seems appropriate given the name of the property, but it's good for us to be clear about it in docs and so forth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we did discuss this at standup yesterday! i.e., the fact that we obviously don't have a "dimension values spine" table to join to so we won't be filling those missing rows. So product is aware of this limitation (at least Nick is). But definitely agree that we need to document this clearly!!
I'll add here that this isn't ALL it does. If there are nulls in the underlying data, it fills those too. Not sure how common that is, though.
Do we currently do this anywhere ? |
Snowflake test failures are unrelated to this PR. I put a fix up for those here: #835 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add here that this isn't ALL it does. If there are nulls in the underlying data, it fills those too. Not sure how common that is, though.
True! My bad, too much sweeping generalization in there.
As for documenting things in other places.... I don't think we do, but now that I think about it I feel like the official dbt docs is the place to do it, so maybe we just open a PR over there after the next update is ready to roll out.
Thanks for doing the renaming, saves me some time!
Resolves #SL-783
Description
We have two new attributes on
MetricInputMeasures
-join_to_timespine
andfill_nulls_with
. This PR implements the expected behavior for those features.join_to_timespine
- whentrue
for an input measure, we should join the already-aggregated measure to the time spine table, ensuring that any missing date periods in the underlying data are filled in withnull
rows. We do this by adding our existingJoinToTimeSpineNode
to the Dataflow plan.fill_nulls_with
accepts an optional integer input. When notnull
, any rows that arenull
for the post-aggregated measure should be filled with the requested integer. We implement this with aCOALESCE
statement.