Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix: querying multiple granularities with offset metrics #1054

Merged
merged 10 commits into from
Feb 29, 2024

Conversation

courtneyholcomb
Copy link
Contributor

@courtneyholcomb courtneyholcomb commented Feb 28, 2024

Resolves SL-1777
Resolves SL-1810
Resolves #1053
Resolves #1052

Description

Resolves 2 related issues:

  1. If you query a time offset metric with multiple instances of metric_time/agg_time_dimension, only one of the instances will be offset. All of them should be.
  2. If you query a time offset metric with one instance of metric_time/agg_time_dimension but filter by a different one, the query will fail

These issues were caused by issues in the JoinToTimeSpineNode dataflow to SQL code. The node was only designed to offset one time dimension. It was also filtering out columns prematurely, when that filtering should instead be happening in the FilterElementsNode. This PR fixes both issues.

I would recommend reviewing this PR by commit.

  • The first commit (after changelog) adds tests that show what the SQL previously looked like, as well as the scenarios that trigger an error before SQL can be generated.
  • Next commit updates the dataflow to SQL logic.
  • Third commit updates the snapshots so you can clearly see how the logic change impacts the SQL.

@cla-bot cla-bot bot added the cla:yes label Feb 28, 2024
@courtneyholcomb courtneyholcomb marked this pull request as ready for review February 28, 2024 05:43
@courtneyholcomb courtneyholcomb added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Feb 28, 2024
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Feb 28, 2024
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like what we want to happen instead of whatever was happening before, but I do wonder about the invariant handling I mentioned inline - there's an implicit expectation of a granularity relationship between the time spine and the selected columns but right now if that expectation isn't met we'll render a date trunc to a coarser granularity, which doesn't seem like what we really want to do.

Should we just put an assertion in there? Or is there some other mechanism enforcing this?

Comment on lines 1361 to 1363
if time_dimension_spec.time_granularity == time_spine_dim_instance.spec.time_granularity
else SqlDateTruncExpression(
time_granularity=time_dimension_spec.time_granularity, arg=time_spine_column_select_expr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies the time_spine_dim_instance.spec.time_granularity will always be equal to or finer (<= I think) than the time_dimension_spec.time_granularity.

I'm pretty sure that's true at this point, but is it enforced anywhere? I'm not sure what'll happen if we allow different granularities in the time spine, or if we allow the time spine to be DAY while the base granularity moves to MILLISECOND or whatever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I feel like the time spine SHOULD be set with the smallest possible grain in order to fill all granularity periods we support.
In practice, though, it might not be - especially when we transition to supporting smaller grains. Users might need to update their time spines.
I'll add an assertion for now and a comment to explain!

Comment on lines +11 to +33
metric_time__month
, SUM(bookings) AS bookings_start_of_month
FROM (
-- Join to Time Spine Dataset
-- Pass Only Elements: ['bookings', 'metric_time__month', 'metric_time__day']
SELECT
subq_13.ds AS metric_time__day
, DATE_TRUNC('month', subq_13.ds) AS metric_time__month
, subq_11.bookings AS bookings
FROM ***************************.mf_time_spine subq_13
INNER JOIN (
-- Read Elements From Semantic Model 'bookings_source'
-- Metric Time Dimension 'ds'
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, 1 AS bookings
FROM ***************************.fct_bookings bookings_source_src_28000
) subq_11
ON
DATE_TRUNC('month', subq_13.ds) = subq_11.metric_time__day
WHERE DATE_TRUNC('month', subq_13.ds) = subq_13.ds
) subq_15
WHERE metric_time__day = '2020-01-01'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all feels very convoluted but maybe that's just how we have to do offset to grain calculations? Like if this grouped by metric_time__day I guess we'd drop the where filter and just have a repeated bookings_start_of_month value collected alongside each day.

I'm actually not sure how I'd write the SQL for this to make it less confusing. For this case I'd probably do the filter + aggregation on the month grains and then join to time spine to fill out other granularities, but that doesn't generalize.

Anyway, something for us to think about later - it really feels like there has to be a cleaner way to render this, but maybe there isn't and that's another reason why MetricFlow is useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offset_to_grain ends up with the weirdest edge cases. We might just want to be more restrictive with what we allow for those metrics. It's already so weird when there are multiple instances of metric_time queried with an offset_to_grain metric because the where filter that we add by default doesn't totally do its job anymore. And I have no idea why someone would want to make that query. But alas, for now, we can just give the people what they asked for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But alas, for now, we can just give the people what they asked for.

LOL, this is me every time I have to add a feature of any kind.

@@ -21,6 +21,7 @@ def test_offset_to_grain_with_single_granularity( # noqa: D
MetricFlowQueryRequest.create_with_random_request_id(
metric_names=["bookings_at_start_of_month"],
group_by_names=["metric_time__day"],
order_by_names=["metric_time__day"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay! Thank you!

@courtneyholcomb
Copy link
Contributor Author

courtneyholcomb commented Feb 29, 2024

@tlento When I added that assertion, it revealed a sneaky bug related to a duplicate variable name. I added an additional commit to fix that bug so you might want to take a look at that before I merge. It's this commit which also updates basic snapshots.

Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sneaky sneaky variable naming bugs. We probably need a refactor in there to make us less dependent on how we name things, but I don't think it's worth the time to do that today.

Looks good!

@@ -30,7 +30,7 @@ FROM (
, DATE_TRUNC('month', subq_5.metric_time__day) AS metric_time__month
, DATE_TRUNC('quarter', subq_5.metric_time__day) AS metric_time__quarter
, DATE_TRUNC('year', subq_5.metric_time__day) AS metric_time__year
, EXTRACT(year FROM DATE_TRUNC('day', subq_5.metric_time__day)) AS metric_time__extract_year
, EXTRACT(year FROM subq_5.metric_time__day) AS metric_time__extract_year
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow I didn't even notice that.

)
time_spine_dim_instance = TimeDimensionInstance(
defined_from=time_spine_dim_instance.defined_from,
defined_from=original_time_spine_dim_instance.defined_from,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes, this was inside the loop!

@courtneyholcomb courtneyholcomb merged commit ce7f194 into main Feb 29, 2024
9 checks passed
@courtneyholcomb courtneyholcomb deleted the court/offset-multiple-grain branch February 29, 2024 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants