Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Resolver for Saved-Query Dependencies #1152

Merged
merged 14 commits into from
Apr 29, 2024

Conversation

plypaul
Copy link
Contributor

@plypaul plypaul commented Apr 24, 2024

Description

This PR adds a resolver to get the dependencies of a saved query.

@cla-bot cla-bot bot added the cla:yes label Apr 24, 2024
Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@plypaul plypaul force-pushed the plypaul--12--saved-query-resolver branch 2 times, most recently from 11787e3 to ebfe930 Compare April 24, 2024 22:24
@plypaul plypaul force-pushed the plypaul--11--repackage branch from 0cea41f to e626208 Compare April 24, 2024 22:43
@plypaul plypaul force-pushed the plypaul--12--saved-query-resolver branch 2 times, most recently from 1a36bf8 to 3018cdf Compare April 24, 2024 22:51
@plypaul plypaul force-pushed the plypaul--11--repackage branch from e626208 to 4afbd09 Compare April 24, 2024 22:51
@plypaul plypaul force-pushed the plypaul--12--saved-query-resolver branch from 3018cdf to e14c2b2 Compare April 25, 2024 17:09
@plypaul plypaul force-pushed the plypaul--11--repackage branch 3 times, most recently from d62f62c to 47d7057 Compare April 26, 2024 09:59
Base automatically changed from plypaul--11--repackage to main April 26, 2024 10:07
@plypaul plypaul force-pushed the plypaul--12--saved-query-resolver branch 3 times, most recently from 257fda7 to d9749ad Compare April 26, 2024 10:29
This was moved to address circular-dependency issues.
@plypaul plypaul force-pushed the plypaul--12--saved-query-resolver branch from 31937ea to 46fe4ac Compare April 26, 2024 21:26
plypaul added 10 commits April 26, 2024 14:29
This was moved to address circular-dependency issues.
Note: `GroupByMetricSpec` had to be moved back in
a later commit.
This was moved to address circular-dependency issues. Some methods were
removed / changed to help support the split.

The general issue was that because the set classes had bidirectional
dependences with the spec classes, circular dependencies could be easily
formed when imported by other modules.
`LinkableElementSet` will be used in group-by resolution to keep track
of the semantic models that a query would read.
… classes.

Using `LinkableElementSet` allows retrieval of the semantic models that
are needed for computation.
@plypaul plypaul force-pushed the plypaul--12--saved-query-resolver branch from 46fe4ac to 76a11f9 Compare April 26, 2024 21:30
@plypaul plypaul marked this pull request as ready for review April 26, 2024 21:45
@plypaul plypaul changed the title Add an Interface to Resolve Saved-Query Dependencies Add a Resolver for Saved-Query Dependencies Apr 26, 2024
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable! I can make use of the LinkableELementSet return value - I was going to make a similar set of changes before all this started anyway.

Note: this approval assumes the large blocks of green in cases where things existed before are copy/paste. I did not read those sections carefully.

Lots of small things inline. I think the main things are some added test coverage and the usual "I don't like visitors" thing. I'd like the tests done either separately as a follow-up or prior to merge, it's up to you.

@@ -812,6 +813,14 @@ def _build_measure_spec_properties(self, measure_specs: Sequence[MeasureSpec]) -
non_additive_dimension_spec=non_additive_dimension_spec,
)

def _query_spec_for_source_node(self, group_by_metric_spec: GroupByMetricSpec) -> MetricFlowQuerySpec:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attn @courtneyholcomb, just FYI, this function moved into the DataflowPlanBuilder.


@log_runtime()
# @log_runtime()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this come back later or be deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a typo - updated.

@@ -115,15 +116,19 @@ def parse_and_validate_saved_query(
if where_filter is not None:
where_filters.append(where_filter)

return self.parse_and_validate_query(
return self._parse_and_validate_query(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so this isn't applied to parse_and_validate_query more broadly just yet.

Comment on lines +154 to +191
class _GroupSpecByTypeVisitor(InstanceSpecVisitor[None]):
"""Groups a spec by type into an `InstanceSpecSet`."""

metric_specs: List[MetricSpec] = dataclasses.field(default_factory=list)
measure_specs: List[MeasureSpec] = dataclasses.field(default_factory=list)
dimension_specs: List[DimensionSpec] = dataclasses.field(default_factory=list)
entity_specs: List[EntitySpec] = dataclasses.field(default_factory=list)
time_dimension_specs: List[TimeDimensionSpec] = dataclasses.field(default_factory=list)
group_by_metric_specs: List[GroupByMetricSpec] = dataclasses.field(default_factory=list)
metadata_specs: List[MetadataSpec] = dataclasses.field(default_factory=list)

@override
def visit_measure_spec(self, measure_spec: MeasureSpec) -> None:
self.measure_specs.append(measure_spec)

@override
def visit_dimension_spec(self, dimension_spec: DimensionSpec) -> None:
self.dimension_specs.append(dimension_spec)

@override
def visit_time_dimension_spec(self, time_dimension_spec: TimeDimensionSpec) -> None:
self.time_dimension_specs.append(time_dimension_spec)

@override
def visit_entity_spec(self, entity_spec: EntitySpec) -> None:
self.entity_specs.append(entity_spec)

@override
def visit_group_by_metric_spec(self, group_by_metric_spec: GroupByMetricSpec) -> None:
self.group_by_metric_specs.append(group_by_metric_spec)

@override
def visit_metric_spec(self, metric_spec: MetricSpec) -> None:
self.metric_specs.append(metric_spec)

@override
def visit_metadata_spec(self, metadata_spec: MetadataSpec) -> None:
self.metadata_specs.append(metadata_spec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤮

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seriously though, the type refinement standard in python is the isinstance check. If that's unpalatable then that indicates we should avoid the need to perform this operation. Looking above I think the real issue is we've been disguising this for a very long time, so it'll be challenging to undo it.

Given that, I'd rather we follow the language standard here than add a visitor implementation. The drawback is if someone forgets to add a new spec type it's possible they'll forget to update the instance check logic, but that is an exceedingly rare case and it's something anybody adding a new spec has to deal with everywhere else in the InstanceSpecSet class anyway.

It's up to you if you want to change this here, or if you want me to do it when I rip out the InstanceSpecVisitor as a construct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rule I've been following is: no isinstance calls unless it's in a equals or comparison method. It's a code-smell, and when it's in the code base, it tends to spread.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rule I've been following is: no isinstance calls unless it's in a equals or comparison method. It's a code-smell, and when it's in the code base, it tends to spread.

This rule is too rigid for a gradually typed language like Python where type refinement is typically done via isinstance(). Assertions to hint to the typechecker are a thing, as are instance checks to classify an object by type in the (hopefully exceedingly rare) cases where you need to do so. As such, this approach leads to a proliferation of code that does basically nothing.

In this specific case, you could've put ~10 lines of logic into the existing method with a comment indicating that we are aware of the type deficiencies and do not want these checks to proliferate through the codebase at large, and an assertion to cover completeness in runtime tests. Instead you've added 2 new module-scope functions, a module-private helper class, and ~100 lines of boilerplate to accomplish the same end.

I don't think this is a good trade, especially given that the code smell I keep running into in our codebase is calcified generic abstractions that serve no functional purpose other than to hide the fact that the custom type foundation of the system is inadequate. The InstanceSpecVisitor is a perfect example of this, with the SinkNodeVisitor (and the existence of the DataflowPlanSinkNodes or whatever they're called themselves) representing another one.

You are correct that isinstance calls tend to spread. They must, if the custom type hierarchy we're relying on isn't working well. The thing is, these visitor implementations that make the codebase fundamentally harder to reason about and more awkward to work with also tend to spread in these situations. I don't find the latter more palatable than the former - rather the opposite, in the specific case of MetricFlow, because we routinely tie ourselves into knots to satisfy the constraints imposed by these generic abstractions and they provide absolutely nothing of value in return.

I feel the correct way to avoid spread is through continuous improvement of type specifications, and rigor and encapsulation around the isinstance checks themselves. Relying on inappropriate object oriented design patterns that hide the mess behind generics and piles of boilerplate represents a huge step backwards in terms of readability in the codebase and communication around what we intend vs what exists today and requires change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we had had this discussion a few times and to avoid having this thread be the place where we debate, let me just say that I see your points and I'm up for continued discussion on the issue.


@dataclass(frozen=True)
class LinkableEntity:
class LinkableEntity(LinkableElement, SemanticModelDerivation):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need both of these, right, just LinkableElement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - updated.


@dataclass(frozen=True)
class LinkableMetric:
class LinkableMetric(LinkableElement, SemanticModelDerivation):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here, just LinkableElement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

entity_links=path_key.entity_links,
)
elif path_key.element_type is LinkableElementType.TIME_DIMENSION:
assert path_key.time_granularity is not None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: error message, maybe something like "type refinement, should have been checked in dataclass validation"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.


@property
@override
def derived_from_semantic_models(self) -> Sequence[SemanticModelReference]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some test cases for this and the filter by spec pattern operations? They're quite a bit more complex than the direct to spec conversion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some test cases.

try:
return self._resolve_dependencies(saved_query_name)
except Exception:
logger.exception(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. We might need to do something about this log level for deployment - if it turns out to be common we'll want it to show up with level WARN or INFO instead of EXCEPTION - but I think we can do that via datadog instead of here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fail-safe and shouldn't really be hit - if it is, it means there's a bug that we need to fix.

@plypaul
Copy link
Contributor Author

plypaul commented Apr 27, 2024

Note: this approval assumes the large blocks of green in cases where things existed before are copy/paste. I did not read those sections carefully.

That's correct. There shouldn't be significant logic changes in those sections.

Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updates look good, thank you, let's get this merged as soon as it's ready. The only update you might want to make before merge is the more_itertools change but even that I can do later as I grind my way through this thing I'm trying to get done. The useless visitor chafes but I'm just going to start removing stuff like that so one extra diff in that stack won't matter much to me one way or the other.

Comment on lines +154 to +191
class _GroupSpecByTypeVisitor(InstanceSpecVisitor[None]):
"""Groups a spec by type into an `InstanceSpecSet`."""

metric_specs: List[MetricSpec] = dataclasses.field(default_factory=list)
measure_specs: List[MeasureSpec] = dataclasses.field(default_factory=list)
dimension_specs: List[DimensionSpec] = dataclasses.field(default_factory=list)
entity_specs: List[EntitySpec] = dataclasses.field(default_factory=list)
time_dimension_specs: List[TimeDimensionSpec] = dataclasses.field(default_factory=list)
group_by_metric_specs: List[GroupByMetricSpec] = dataclasses.field(default_factory=list)
metadata_specs: List[MetadataSpec] = dataclasses.field(default_factory=list)

@override
def visit_measure_spec(self, measure_spec: MeasureSpec) -> None:
self.measure_specs.append(measure_spec)

@override
def visit_dimension_spec(self, dimension_spec: DimensionSpec) -> None:
self.dimension_specs.append(dimension_spec)

@override
def visit_time_dimension_spec(self, time_dimension_spec: TimeDimensionSpec) -> None:
self.time_dimension_specs.append(time_dimension_spec)

@override
def visit_entity_spec(self, entity_spec: EntitySpec) -> None:
self.entity_specs.append(entity_spec)

@override
def visit_group_by_metric_spec(self, group_by_metric_spec: GroupByMetricSpec) -> None:
self.group_by_metric_specs.append(group_by_metric_spec)

@override
def visit_metric_spec(self, metric_spec: MetricSpec) -> None:
self.metric_specs.append(metric_spec)

@override
def visit_metadata_spec(self, metadata_spec: MetadataSpec) -> None:
self.metadata_specs.append(metadata_spec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rule I've been following is: no isinstance calls unless it's in a equals or comparison method. It's a code-smell, and when it's in the code base, it tends to spread.

This rule is too rigid for a gradually typed language like Python where type refinement is typically done via isinstance(). Assertions to hint to the typechecker are a thing, as are instance checks to classify an object by type in the (hopefully exceedingly rare) cases where you need to do so. As such, this approach leads to a proliferation of code that does basically nothing.

In this specific case, you could've put ~10 lines of logic into the existing method with a comment indicating that we are aware of the type deficiencies and do not want these checks to proliferate through the codebase at large, and an assertion to cover completeness in runtime tests. Instead you've added 2 new module-scope functions, a module-private helper class, and ~100 lines of boilerplate to accomplish the same end.

I don't think this is a good trade, especially given that the code smell I keep running into in our codebase is calcified generic abstractions that serve no functional purpose other than to hide the fact that the custom type foundation of the system is inadequate. The InstanceSpecVisitor is a perfect example of this, with the SinkNodeVisitor (and the existence of the DataflowPlanSinkNodes or whatever they're called themselves) representing another one.

You are correct that isinstance calls tend to spread. They must, if the custom type hierarchy we're relying on isn't working well. The thing is, these visitor implementations that make the codebase fundamentally harder to reason about and more awkward to work with also tend to spread in these situations. I don't find the latter more palatable than the former - rather the opposite, in the specific case of MetricFlow, because we routinely tie ourselves into knots to satisfy the constraints imposed by these generic abstractions and they provide absolutely nothing of value in return.

I feel the correct way to avoid spread is through continuous improvement of type specifications, and rigor and encapsulation around the isinstance checks themselves. Relying on inappropriate object oriented design patterns that hide the mess behind generics and piles of boilerplate represents a huge step backwards in terms of readability in the codebase and communication around what we intend vs what exists today and requires change.

Comment on lines +8 to +15
def ordered_dedupe(*iterables: Iterable[IterableT]) -> Tuple[IterableT, ...]:
"""De-duplicates the items in the iterables while preserving the order."""
ordered_results: Dict[IterableT, None] = {}
for iterable in iterables:
for item in iterable:
ordered_results[item] = None

return tuple(ordered_results.keys())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a less robust custom implementation of the (admittedly poorly named) unique_everseen function in more-itertools.

Since dbt-semantic-interfaces depends on more-itertools we should add the dependency with the same version range and use it here as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I just noticed this takes *iterables, so it's a less robust implementation of unique_everseen(itertools.chain(iterables)). The point still stands, we should use the libraries, especially where iterator collection helpers are concerned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can make that change, but let me check about the "minimize dependencies" request and check how that method works with types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already looked at the dependencies, that's not an issue. The typing is another matter, more-itertools is an old library and these methods date back a long way.

@plypaul plypaul merged commit f54eeed into main Apr 29, 2024
15 checks passed
@plypaul plypaul deleted the plypaul--12--saved-query-resolver branch April 29, 2024 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants