Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group by resolution for group by metrics #1139

Closed
wants to merge 21 commits into from

Conversation

courtneyholcomb
Copy link
Contributor

@courtneyholcomb courtneyholcomb commented Apr 15, 2024

Description

Refactor group by resolution for LinkableMetrics. This gets more explicit about what the join paths will should look like for LinkableMetrics and updates that logic accordingly. LinkableMetrics are different from other linkable elements for a couple of reasons because there is no such thing as a "local" LinkableMetric. There is always a join required to get to the metric subquery. To make this clear, here we reclassify LinkableMetrics as either single-hop or muti-hop, where previously they were considered either local or single-hop.

There was also a bug here related to how we resolved what linkable metrics are available. Previously, the code assumed that if a metric could be queried with one entity in a semantic model, it could be queried with any metric in that semantic model, but that's not true because (1) some entities might cause potential fan-out and (2) derived metrics need all input measures to be joinable to the entity. Here, we fix that bug by updating the logic to be based on entities, not semantic models.

Those are the main changes in this PR, but reviewing by commit would be helpful to get more granular context.

Coming after this: Query tests for metric filters with single hop and multi-hop joins + ensure DFP is using expected join paths.

@cla-bot cla-bot bot added the cla:yes label Apr 15, 2024
Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@courtneyholcomb courtneyholcomb force-pushed the court/gb-reso-metric branch 2 times, most recently from b9e443f to 04a90a6 Compare April 17, 2024 03:15
@courtneyholcomb courtneyholcomb marked this pull request as ready for review April 17, 2024 03:27
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this change seems reasonable but I don't understand why the snapshot outputs are changing so drastically. It looks like we were cutting off a lot of multi-hop joins as well. (Edit to add - actually it looks like maybe we're allowing fanout implicitly in the first step of a multi-hop entity join leading to a metric join.)

What worries me about this, apart from my general driving need to understand how things work, is that the multi-hop case might be overly broad. For example, do we have any way of asserting that we are, in fact, excluding intermediate fanout join steps?

Comment on lines 1 to 2
kind: Fixes
body: Refactor group by resolution for group by metrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost absurdly nitty, but maybe this should be Under the Hood? Or else have some indication of what this fixes?

) -> Sequence[SemanticModel]:
# May switch to non-cached implementation.
semantic_models = self._entity_to_semantic_model[entity_reference.element_name]
semantic_models = self._entity_to_semantic_model[join_by_entity_reference.element_name]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to your changes, but I'm baffled by that comment. Why on earth would we do that? I don't know, but clearly it was on our minds at some point!

self, semantic_model: SemanticModel, using_join_path: Optional[SemanticModelJoinPath] = None
) -> LinkableElementSet:
"""Get the set of linkable metrics that can be joined to this semantic model."""
properties = frozenset({LinkableElementProperty.METRIC, LinkableElementProperty.JOINED})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JOINED is new, right? Seems appropriate.

@@ -1064,12 +1062,12 @@ def create_linkable_element_set_from_join_path(

for entity in semantic_model.entities:
# Avoid creating "booking_id__booking_id"
if entity.reference != entity_links[-1]:
if entity.reference != join_path.last_entity_link:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL, love this change.

# Avoid creating "booking_id__booking_id"
if entity.reference != join_path.last_entity_link:
# Avoid creating a cycle
if entity.reference not in join_path.entity_links:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. Did this work before because we have a MAX_JOIN_HOPS of 2 at the moment? Or are we cleaning up some edge cases around existing multi-hop joins here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first option! I don't think there was any behavior change here due to our join hop limit.

Comment on lines 911 to 935
for _ in range(self._max_entity_links):
next_semantic_models_and_join_paths: List[Tuple[SemanticModel, Optional[SemanticModelJoinPath]]] = []
for semantic_model, join_path in semantic_models_and_join_paths:
new_linkable_metrics = self.get_metrics_directly_joinable_to_semantic_model(
semantic_model=semantic_model, using_join_path=join_path
)
linkable_metrics = LinkableElementSet.merge_by_path_key([linkable_metrics, new_linkable_metrics])

# Get next set of semantic models & join paths to iterate through.
for entity in semantic_model.entities:
for next_semantic_model in self._entity_to_semantic_model[entity.name]:
if next_semantic_model.name == semantic_model.name or (
join_path and entity in join_path.entity_links
):
continue
join_path_elements = join_path.path_elements if join_path else ()
new_join_path_element = SemanticModelJoinPathElement(
semantic_model_reference=next_semantic_model.reference, join_on_entity=entity.reference
)
next_semantic_models_and_join_paths.append(
(next_semantic_model, SemanticModelJoinPath(join_path_elements + (new_join_path_element,)))
)
semantic_models_and_join_paths = next_semantic_models_and_join_paths

return linkable_metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My head hurts. Why do joins gotta be so complicaaaaaated........

"""Given the current path, generate the respective linkable elements from the last semantic model in the path."""
"""Given the current path, generate the respective linkable elements from the last semantic model in the path.

Does not include LinkableMetrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? Would be good to explain that here.

Comment on lines 235 to 265
expected_single_hop_metrics = [
linkable_metric
for linkable_metrics in metric_lookup._linkable_spec_resolver.get_metrics_directly_joinable_to_semantic_model(
semantic_model=measure_semantic_model
).path_key_to_linkable_metrics.values()
for linkable_metric in linkable_metrics
]

for expected_metric in expected_single_hop_metrics:
assert expected_metric in actual_metrics
assert expected_metric not in already_seen
already_seen.add(expected_metric)
actual_metrics.remove(expected_metric)

expected_multi_hop_metrics = []
for entity in measure_semantic_model.entities:
next_semantic_models = metric_lookup._linkable_spec_resolver._entity_to_semantic_model[entity.name]
for next_semantic_model in next_semantic_models:
if next_semantic_model.name == measure_semantic_model.name:
continue
join_path = SemanticModelJoinPath.from_single_element(
semantic_model_reference=next_semantic_model.reference, join_on_entity=entity.reference
)
for (
linkable_metrics
) in metric_lookup._linkable_spec_resolver.get_metrics_directly_joinable_to_semantic_model(
semantic_model=next_semantic_model,
using_join_path=join_path,
).path_key_to_linkable_metrics.values():
for linkable_metric in linkable_metrics:
expected_multi_hop_metrics.append(linkable_metric)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is too closely coupled to the existing implementation of the LinkableSpecResolver for my liking.

I don't have a great answer for how to go about asserting these boundaries in a more implementation-agnostic way. The only thing I can come up with is to create a semantic model just for this (or to find one that is suitably simple, the multi-hop model might work) and essentially hand-craft the expected output and assert against that.

This would be more robust to the inevitable refactoring (or replacement) awaiting the LinkableSpecResolver, so I think ti's probably worth doing something at this point.

Alternatively we can probably remove this test or replace it with yet another snapshot entry.

Comment on lines 918 to 930
for entity in semantic_model.entities:
for next_semantic_model in self._entity_to_semantic_model[entity.name]:
if next_semantic_model.name == semantic_model.name or (
join_path and entity in join_path.entity_links
):
continue
join_path_elements = join_path.path_elements if join_path else ()
new_join_path_element = SemanticModelJoinPathElement(
semantic_model_reference=next_semantic_model.reference, join_on_entity=entity.reference
)
next_semantic_models_and_join_paths.append(
(next_semantic_model, SemanticModelJoinPath(join_path_elements + (new_join_path_element,)))
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this open the door to fanout on the intermediate step? Like I have a measure and that joins to the foreign listing entity via account in some other semantic model, and then we end up with account__listing__some_listing_metric but there's fanout between account and listing so our metrics aren't necessarily correct.

I don't see any checks for join validity within the entity block here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right - I'll go back and fix that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My brain realized this problem just when I was trying to fall asleep last night 🙃

@courtneyholcomb courtneyholcomb force-pushed the court/gb-reso-metric branch 3 times, most recently from 8108f1c to 96408a8 Compare April 23, 2024 03:20
@courtneyholcomb
Copy link
Contributor Author

@tlento this is ready for re-review!
You already reviewed some of the commits - new commits start at Move setup dict to avoid circular logic error if you want to start there.

@courtneyholcomb courtneyholcomb requested a review from tlento April 23, 2024 03:43
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this seems reasonable but I know you've got a monster update coming.

It'd be good to have the final PR with the change explain what we expect to see in the snapshot updates based on what's changing in the logic. It's really hard for me to visualize what's happening inside of the LinkableSpecResolver in general, so it'll be good to have some documentation in the PR description around what we're expecting to see and why this change is happening.

@@ -544,6 +544,7 @@ def __init__(
self._entity_to_semantic_model[entity.reference.element_name].append(semantic_model)

self._metric_to_linkable_element_sets: Dict[str, List[LinkableElementSet]] = {}
self._joinable_metrics_for_entities: Dict[EntityReference, Set[MetricReference]] = defaultdict(set)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why the location of this matters at all but if this is causing circular logic errors that's a good sign that we've put waaaaaay too much code in this initializer. Maybe we'll get rid of this class before we need to refactor this....

Comment on lines 642 to 643
if using_join_path and entity_reference in using_join_path.entity_links:
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because the there's an extra hop and this entity points through that path, so we skip the direct join to the metric? Might be worth a clarification, or maybe it's possible to illustrate this behavior in a test somehow.

We could do this with a variable name, like:

is_multi_hop_entity = using_join_path and entity_reference in using_join_path.entity_links
if is_multi_hop_entity:
    continue

@@ -521,17 +500,11 @@ def _get_linkable_element_set_for_measure(
measure_semantic_model = self._get_semantic_model_for_measure(measure_reference)

elements_in_semantic_model = self._get_elements_in_semantic_model(measure_semantic_model)
metrics_linked_to_semantic_model = self._get_joinable_metrics_for_semantic_model(measure_semantic_model)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok this has been moved into get_joined_elements. That makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants