Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix: selective where filters after join to time spine #1120

Merged
merged 6 commits into from
May 23, 2024

Conversation

courtneyholcomb
Copy link
Contributor

@courtneyholcomb courtneyholcomb commented Apr 10, 2024

Resolves SL-1987

Description

When an input metric specifies join_to_timespine: true in its YAML definition, we apply special behavior for filters. Before this PR, we were applying all filters twice - before and after the time spine join. This is because you might end up with new rows after the time spine join that should be filtered out. This resulted in a bug sometimes if you filtered by a spec that was not in the group by, since the filter column no longer existed at that level of the query.
This fixes that bug by only applying the post-join filter if the filtered spec is also in the group by. The specs in the group by should be the only ones changed after the time spine join, anyway.

@cla-bot cla-bot bot added the cla:yes label Apr 10, 2024
@courtneyholcomb courtneyholcomb force-pushed the court/join-to-time-spine-with-filters branch from c17025f to ad98912 Compare May 21, 2024 23:57
@dbt-labs dbt-labs deleted a comment from github-actions bot May 21, 2024
@courtneyholcomb courtneyholcomb added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label May 22, 2024
@courtneyholcomb courtneyholcomb marked this pull request as ready for review May 22, 2024 00:04
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS May 22, 2024 00:04 — with GitHub Actions Inactive
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS May 22, 2024 00:04 — with GitHub Actions Inactive
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS May 22, 2024 00:04 — with GitHub Actions Inactive
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS May 22, 2024 00:04 — with GitHub Actions Inactive
@courtneyholcomb courtneyholcomb requested review from tlento and plypaul May 22, 2024 00:05
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label May 22, 2024
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is interesting, so our current expected behavior on filters and time spine joins is, basically, the query filter applies before the time spine join for fill nulls, and we only do a secondary restriction to filter out nulls. I think that's right. Two questions:

  1. What happens with a more deeply nested metric with a filter? Like if I have a filter on bookings__is_instant in the definition of an input metric to something that has fill nulls with, what happens? I'm assuming this fits into the group by case for the inner metric, then we fill nulls, then they get collapsed away? It's sort of interesting if the default is 1 instead of zero, or if the outer metric is an average.
  2. What happens to time filters if I do where metric_time__day = '2024-01-01' and bookings__is_instant but I only group by metric_time__day? I think that will still over-select, correct? Meaning we'll get all the metric_times but there will be null metric values in the output. Is that right?

---
integration_test:
name: join_to_time_spine_with_queried_time_constraint
description: Test join to time spine with time constraint that isn't in group by. Should apply constraint twice.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: that is in the group by

@courtneyholcomb
Copy link
Contributor Author

@tlento

  1. What happens with a more deeply nested metric with a filter? Like if I have a filter on booking__is_instant in the definition of an input metric to something that has fill nulls with, what happens?

I think the current behavior in this scenario is expected, too. If the input metric filters to booking__is_instant, we should filter that subquery accordingly. But there might intentionally be another input metric in the parent metric that does not filter to booking__is_instant, so we can't assume that the parent metric wants that filter. For example, say you have a ratio metric like percent_instant_bookings which is bookings WHERE booking__is_instant / bookings.
If you query that ratio metric with booking__is_instant, you might end up with some null rows for one of the input metrics, and those would get filled in the parent metric calculation. In this example you would likely fill with zero in that case, which gets a reasonable result.

I'm assuming this fits into the group by case for the inner metric, then we fill nulls, then they get collapsed away? It's sort of interesting if the default is 1 instead of zero, or if the outer metric is an average.

I wouldn't say they get collapsed away. The nulls get filled, and that gets used in the next stage of the query. It is weird if the default is 1, but I'm struggling to think of a scenario where you would want that 🤔 If the outer metric is an average, I think this is pretty normal. In the ratio metric example I mentioned above, the input metric filling nulls would result in 0 on some days, and that's what you want for the average calculation.

  1. What happens to time filters if I do where metric_time__day = '2024-01-01' and bookings__is_instant but I only group by metric_time__day? I think that will still over-select, correct? Meaning we'll get all the metric_times but there will be null metric values in the output. Is that right?

There might be null metric values in the output, yes, unless they apply a fill_nulls_with value. I think that's expected for the join_to_timespine feature - they have the option to fix that by using fill_nulls_with.
The fix here is really fixing behavior regarding nulls for the group bys, not the metrics!

Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this all sounds reasonable and the behavior you describe aligns with what I was expecting, I just wasn't sure if this PR might affect that.

Thanks for making this fix so readable! It's an awful lot of complicated SQL for such a small adjustment. 😭

@courtneyholcomb courtneyholcomb enabled auto-merge (squash) May 23, 2024 17:33
@courtneyholcomb courtneyholcomb merged commit 6758d03 into main May 23, 2024
15 checks passed
@courtneyholcomb courtneyholcomb deleted the court/join-to-time-spine-with-filters branch May 23, 2024 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SL-1987] Post-join filter application on time spine queries results in invalid query exceptions
2 participants