Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dedupe qualified name for GroupByMetricSpecs #1171

Merged
merged 5 commits into from
May 3, 2024

Conversation

courtneyholcomb
Copy link
Contributor

@courtneyholcomb courtneyholcomb commented Apr 30, 2024

Description

Typically, a spec's qualified name is <entity_links>__<element_name>. In the case of GroupByMetricSpecs, this name isn't guaranteed to be unique. That's because the metric_subquery_entity_links are not accounted for. This will also be a problem in SQL generation, since you could end up with a query that references two different group by metrics by the same column name. To avoid that, here I added some deduping logic for the qualified name.

  1. If entity_links matches metric_subquery_entity_links, use the standard qualified name.
  2. If they don't match, include both, e.g., <entity_links>__<metric_subquery_entity_links>__<element_name>. You can still tell where one path ends and the next begins because they will both end with the same entity link, which is the last link before the element name.

This is definitely ugly!! I am suggesting any alternate suggestions to make this work.

Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have three concerns:

  1. This naming expansion only works if the link paths are different, but if they're the same it won't help. I don't know how often they'll be different but it seems like a bunch of cases will slip through here.
  2. The optimized SQL isn't applying the updated aliases. Does this indicate that we'll still have problems if there are duplicate metric names?
  3. This is outside the scope of this PR specifically, but it seems to me that we need a robust aliasing approach for these group by metrics. I don't have great ideas here - the group by item resolver is probably too fine-grained to the metric element in the filter/group by list (so if it's the same metric in group by and filter, or used twice in the filter, we may end up with multiple aliases when those should be collapsed).

@courtneyholcomb
Copy link
Contributor Author

  1. This naming expansion only works if the link paths are different, but if they're the same it won't help. I don't know how often they'll be different but it seems like a bunch of cases will slip through here.

🤔 I don't think I'm understanding this concern. In what scenario would we have duplicates?
If the link paths are the same, we won't expand the name (though we could if consistency is preferable to sometimes-readability). But that un-expanded name still represents a unique join path. e.g., listing__country__bookings vs. listing__country__listing__country__bookings - both would be unique.

  1. The optimized SQL isn't applying the updated aliases. Does this indicate that we'll still have problems if there are duplicate metric names?

🤦‍♀️🤦‍♀️🤦‍♀️ Ok, so the snapshots that looked weird in #1166 were just things that got added to the wrong PR. I think I must have made a commit at some point while in the middle of generating snapshots, so a few of the snapshots got updated in the wrong commit. So you don't see those updates here because they're lower in the stack. I need to rebase this one now that I've removed those snapshots, and then update the snapshots again here.

@tlento
Copy link
Contributor

tlento commented May 2, 2024

🤔 I don't think I'm understanding this concern. In what scenario would we have duplicates?
If the link paths are the same, we won't expand the name (though we could if consistency is preferable to sometimes-readability). But that un-expanded name still represents a unique join path. e.g., listing__country__bookings vs. listing__country__listing__country__bookings - both would be unique.

It's entirely possible I'm just completely confused here, but if I have, say, a ratio metric that uses the same metric filter reference we'd end up with duplicates in the query output because they are actually the same, and expanding the names like this doesn't help that unless the join paths are different between the group by metric and each of those ratio metrics.

I don't know how common it will be for those join paths to be different, but it's not clear to me that making these long chain sequence names is worth it.

That said, it just occurred to me that we only allow this in the where filter. As long as we don't expose this naming scheme in the output table schemas it doesn't matter much one way or the other - but we'd need to make sure we remove it before allowing group by usage.

🤦‍♀️🤦‍♀️🤦‍♀️ Ok, so the snapshots that looked weird in #1166 were just things that got added to the wrong PR. I think I must have made a commit at some point while in the middle of generating snapshots, so a few of the snapshots got updated in the wrong commit. So you don't see those updates here because they're lower in the stack. I need to rebase this one now that I've removed those snapshots, and then update the snapshots again here.

The snapshots are still showing the same issue. The comment has the expanded alias, but the SQL itself does not. Is that expected?

@@ -24,7 +24,7 @@ FROM (
FROM ***************************.dim_lux_listing_id_mapping lux_listing_mapping_src_28000
) subq_4
FULL OUTER JOIN (
-- Pass Only Elements: ['listing', 'bookings']
-- Pass Only Elements: ['listing', 'listing__bookings']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like here we have the link, but on line 30 we do not. I don't think that's common with FilterElementsNode accesses - it typically matches, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sorry I haven't been able to update snapshots yet! Slowly rebasing up the stack 😓

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok good catch on this!! I wasn't using the qualified name as the column name for group by metrics. Wasn't breaking anything, except our typical column naming convention 😉 I added a commit to fix that.

@courtneyholcomb courtneyholcomb added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label May 3, 2024
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS May 3, 2024 18:41 — with GitHub Actions Inactive
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS May 3, 2024 18:41 — with GitHub Actions Inactive
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS May 3, 2024 18:41 — with GitHub Actions Inactive
@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS May 3, 2024 18:41 — with GitHub Actions Inactive
@courtneyholcomb courtneyholcomb requested a review from tlento May 3, 2024 18:42
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label May 3, 2024
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, LGTM, and nice find on the weird entity link omission pieces.

Let's run with this updated naming for now while we sort out the aliasing situation and whatever all else might be needed to enable using these things in group by statements. I don't think we want the nomenclature here exposed to end users on the output side, but that isn't going to happen just yet.

entity_link_names=tuple(x.element_name for x in dimension_spec.entity_links),
element_name=dimension_spec.element_name,
).qualified_name,
column_name=dimension_spec.qualified_name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I had just ripped this whole class out when I first noticed it none of this would've happened. Lesson learned. Thanks for making the property access nature of this class even more obvious. :)

Base automatically changed from court/gb-metrics13 to main May 3, 2024 19:07
Previously we were duplicating this logic, which made it easy to overlook updating the column name to use the qualified name. I updated this for all specs so that it will update automatically in the future. You can see in the snapshots that this changed nothing for most specs, except group by metrics (as expected).
@courtneyholcomb courtneyholcomb enabled auto-merge (squash) May 3, 2024 19:36
@courtneyholcomb courtneyholcomb merged commit 62db0e8 into main May 3, 2024
15 checks passed
@courtneyholcomb courtneyholcomb deleted the court/gb-metrics14 branch May 3, 2024 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants