Add a Resolver for Saved-Query Dependencies #1152
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.
This was moved to address circular-dependency issues.
This was moved to address circular-dependency issues. Note: `GroupByMetricSpec` had to be moved back in a later commit.
…turn a result object.
This was moved to address circular-dependency issues. Some methods were removed or changed to help support the split. The general issue was that because the set classes had bidirectional dependencies with the spec classes, circular dependencies could easily form when they were imported by other modules.
`LinkableElementSet` will be used in group-by resolution to keep track of the semantic models that a query would read.
… classes. Using `LinkableElementSet` allows retrieval of the semantic models that are needed for computation.
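To illustrate the intent, here is a hedged sketch of how a set of linkable elements could yield the semantic models a query would read. Only `derived_from_semantic_models` and `SemanticModelReference` come from the diffs in this PR; the stub types and the `semantic_models_to_read` helper are invented for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Tuple


@dataclass(frozen=True)
class SemanticModelReference:
    """Stand-in for the reference type that appears in this PR's diffs."""

    semantic_model_name: str


@dataclass(frozen=True)
class LinkableElementStub:
    """Stand-in for a linkable element that records its source semantic models."""

    derived_from_semantic_models: Tuple[SemanticModelReference, ...]


def semantic_models_to_read(elements: Sequence[LinkableElementStub]) -> FrozenSet[SemanticModelReference]:
    """Union of the semantic models that the given elements would read."""
    result = set()
    for element in elements:
        result.update(element.derived_from_semantic_models)
    return frozenset(result)
```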
Seems reasonable! I can make use of the `LinkableElementSet` return value - I was going to make a similar set of changes before all this started anyway.
Note: this approval assumes the large blocks of green, in cases where things existed before, are copy/paste. I did not read those sections carefully.
Lots of small things inline. I think the main things are some added test coverage and the usual "I don't like visitors" thing. I'd like the tests done either separately as a follow-up or prior to merge; it's up to you.
```diff
@@ -812,6 +813,14 @@ def _build_measure_spec_properties(self, measure_specs: Sequence[MeasureSpec]) -
     non_additive_dimension_spec=non_additive_dimension_spec,
 )

+    def _query_spec_for_source_node(self, group_by_metric_spec: GroupByMetricSpec) -> MetricFlowQuerySpec:
```
attn @courtneyholcomb, just FYI, this function moved into the DataflowPlanBuilder.
```diff
-    @log_runtime()
+    # @log_runtime()
```
Will this come back later or be deleted?
That's a typo - updated.
```diff
@@ -115,15 +116,19 @@ def parse_and_validate_saved_query(
     if where_filter is not None:
         where_filters.append(where_filter)

-    return self.parse_and_validate_query(
+    return self._parse_and_validate_query(
```
Ok, so this isn't applied to parse_and_validate_query more broadly just yet.
```python
class _GroupSpecByTypeVisitor(InstanceSpecVisitor[None]):
    """Groups a spec by type into an `InstanceSpecSet`."""

    metric_specs: List[MetricSpec] = dataclasses.field(default_factory=list)
    measure_specs: List[MeasureSpec] = dataclasses.field(default_factory=list)
    dimension_specs: List[DimensionSpec] = dataclasses.field(default_factory=list)
    entity_specs: List[EntitySpec] = dataclasses.field(default_factory=list)
    time_dimension_specs: List[TimeDimensionSpec] = dataclasses.field(default_factory=list)
    group_by_metric_specs: List[GroupByMetricSpec] = dataclasses.field(default_factory=list)
    metadata_specs: List[MetadataSpec] = dataclasses.field(default_factory=list)

    @override
    def visit_measure_spec(self, measure_spec: MeasureSpec) -> None:
        self.measure_specs.append(measure_spec)

    @override
    def visit_dimension_spec(self, dimension_spec: DimensionSpec) -> None:
        self.dimension_specs.append(dimension_spec)

    @override
    def visit_time_dimension_spec(self, time_dimension_spec: TimeDimensionSpec) -> None:
        self.time_dimension_specs.append(time_dimension_spec)

    @override
    def visit_entity_spec(self, entity_spec: EntitySpec) -> None:
        self.entity_specs.append(entity_spec)

    @override
    def visit_group_by_metric_spec(self, group_by_metric_spec: GroupByMetricSpec) -> None:
        self.group_by_metric_specs.append(group_by_metric_spec)

    @override
    def visit_metric_spec(self, metric_spec: MetricSpec) -> None:
        self.metric_specs.append(metric_spec)

    @override
    def visit_metadata_spec(self, metadata_spec: MetadataSpec) -> None:
        self.metadata_specs.append(metadata_spec)
```
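(For orientation, a hedged sketch of how a visitor like this is presumably driven: each spec double-dispatches into the matching `visit_*` method, and the accumulated lists are then packed into an `InstanceSpecSet`. The `group_specs_by_type` wrapper name, the `accept` method, and the `InstanceSpecSet` constructor arguments are assumptions based on the diff above, not verified against the merged code.)

```python
def group_specs_by_type(specs: Sequence[InstanceSpec]) -> InstanceSpecSet:
    """Sketch: dispatch each spec into the visitor, then pack the typed lists."""
    visitor = _GroupSpecByTypeVisitor()
    for spec in specs:
        spec.accept(visitor)  # double dispatch into the matching visit_* method
    return InstanceSpecSet(
        metric_specs=tuple(visitor.metric_specs),
        measure_specs=tuple(visitor.measure_specs),
        dimension_specs=tuple(visitor.dimension_specs),
        entity_specs=tuple(visitor.entity_specs),
        time_dimension_specs=tuple(visitor.time_dimension_specs),
        group_by_metric_specs=tuple(visitor.group_by_metric_specs),
        metadata_specs=tuple(visitor.metadata_specs),
    )
```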
🤮
Seriously though, the type refinement standard in Python is the `isinstance` check. If that's unpalatable, then that indicates we should avoid the need to perform this operation. Looking above, I think the real issue is that we've been disguising this for a very long time, so it'll be challenging to undo it.
Given that, I'd rather we follow the language standard here than add a visitor implementation. The drawback is that if someone adds a new spec type it's possible they'll forget to update the instance-check logic, but that is an exceedingly rare case and it's something anybody adding a new spec has to deal with everywhere else in the InstanceSpecSet class anyway.
It's up to you if you want to change this here, or if you want me to do it when I rip out the InstanceSpecVisitor as a construct.
The rule I've been following is: no `isinstance` calls unless it's in an equals or comparison method. It's a code smell, and when it's in the code base, it tends to spread.
> The rule I've been following is: no `isinstance` calls unless it's in an equals or comparison method. It's a code smell, and when it's in the code base, it tends to spread.
This rule is too rigid for a gradually typed language like Python where type refinement is typically done via isinstance(). Assertions to hint to the typechecker are a thing, as are instance checks to classify an object by type in the (hopefully exceedingly rare) cases where you need to do so. As such, this approach leads to a proliferation of code that does basically nothing.
In this specific case, you could've put ~10 lines of logic into the existing method with a comment indicating that we are aware of the type deficiencies and do not want these checks to proliferate through the codebase at large, and an assertion to cover completeness in runtime tests. Instead you've added 2 new module-scope functions, a module-private helper class, and ~100 lines of boilerplate to accomplish the same end.
I don't think this is a good trade, especially given that the code smell I keep running into in our codebase is calcified generic abstractions that serve no functional purpose other than to hide the fact that the custom type foundation of the system is inadequate. The InstanceSpecVisitor is a perfect example of this, with the SinkNodeVisitor (and the existence of the DataflowPlanSinkNodes or whatever they're called themselves) representing another one.
You are correct that isinstance calls tend to spread. They must, if the custom type hierarchy we're relying on isn't working well. The thing is, these visitor implementations that make the codebase fundamentally harder to reason about and more awkward to work with also tend to spread in these situations. I don't find the latter more palatable than the former - rather the opposite, in the specific case of MetricFlow, because we routinely tie ourselves into knots to satisfy the constraints imposed by these generic abstractions and they provide absolutely nothing of value in return.
I feel the correct way to avoid spread is through continuous improvement of type specifications, and rigor and encapsulation around the isinstance checks themselves. Relying on inappropriate object oriented design patterns that hide the mess behind generics and piles of boilerplate represents a huge step backwards in terms of readability in the codebase and communication around what we intend vs what exists today and requires change.
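For concreteness, a minimal sketch of the `isinstance`-based alternative being argued for here. The function name, spec type names, and `InstanceSpecSet` constructor mirror the visitor diff above, but this is illustrative, not the reviewer's actual code; whether any of these spec types subclass one another is not shown in this PR, so the ordering comment is a general caveat.

```python
def group_specs_by_type(specs: Sequence[InstanceSpec]) -> InstanceSpecSet:
    """Sketch of the isinstance-based grouping: one pass, explicit checks."""
    metric_specs: List[MetricSpec] = []
    measure_specs: List[MeasureSpec] = []
    dimension_specs: List[DimensionSpec] = []
    entity_specs: List[EntitySpec] = []
    time_dimension_specs: List[TimeDimensionSpec] = []
    group_by_metric_specs: List[GroupByMetricSpec] = []
    metadata_specs: List[MetadataSpec] = []

    for spec in specs:
        # If any spec type subclasses another, the more specific isinstance
        # check must come before the more general one.
        if isinstance(spec, TimeDimensionSpec):
            time_dimension_specs.append(spec)
        elif isinstance(spec, DimensionSpec):
            dimension_specs.append(spec)
        elif isinstance(spec, MeasureSpec):
            measure_specs.append(spec)
        elif isinstance(spec, GroupByMetricSpec):
            group_by_metric_specs.append(spec)
        elif isinstance(spec, MetricSpec):
            metric_specs.append(spec)
        elif isinstance(spec, EntitySpec):
            entity_specs.append(spec)
        elif isinstance(spec, MetadataSpec):
            metadata_specs.append(spec)
        else:
            # Completeness guard: fails loudly in runtime tests if a new spec
            # type is added without updating this function.
            raise AssertionError(f"Unhandled spec type: {type(spec).__name__}")

    return InstanceSpecSet(
        metric_specs=tuple(metric_specs),
        measure_specs=tuple(measure_specs),
        dimension_specs=tuple(dimension_specs),
        entity_specs=tuple(entity_specs),
        time_dimension_specs=tuple(time_dimension_specs),
        group_by_metric_specs=tuple(group_by_metric_specs),
        metadata_specs=tuple(metadata_specs),
    )
```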
Since we've had this discussion a few times, and to avoid having this thread be the place where we debate it, let me just say that I see your points and I'm up for continued discussion on the issue.
```diff
 @dataclass(frozen=True)
-class LinkableEntity:
+class LinkableEntity(LinkableElement, SemanticModelDerivation):
```
You don't need both of these, right, just LinkableElement?
Right - updated.
```diff
 @dataclass(frozen=True)
-class LinkableMetric:
+class LinkableMetric(LinkableElement, SemanticModelDerivation):
```
Same thing here, just LinkableElement?
Updated.
```python
            entity_links=path_key.entity_links,
        )
    elif path_key.element_type is LinkableElementType.TIME_DIMENSION:
        assert path_key.time_granularity is not None
```
nit: error message, maybe something like "type refinement, should have been checked in dataclass validation"
Updated.
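(Presumably the updated assertion reads something like the following; the exact message is a guess based on the suggestion above.)

```python
assert path_key.time_granularity is not None, (
    "Type refinement: a TIME_DIMENSION path key should have a time granularity "
    "set; this should have been checked in dataclass validation."
)
```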
```python
    @property
    @override
    def derived_from_semantic_models(self) -> Sequence[SemanticModelReference]:
```
Could we add some test cases for this and the filter by spec pattern operations? They're quite a bit more complex than the direct to spec conversion.
Added some test cases.
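A hedged sketch of the shape such a test might take; the fixture helper and the expected model names below are illustrative placeholders, not taken from the PR.

```python
def test_derived_from_semantic_models() -> None:
    # `make_example_linkable_element_set` is a hypothetical fixture helper.
    element_set = make_example_linkable_element_set()
    assert set(element_set.derived_from_semantic_models) == {
        SemanticModelReference(semantic_model_name="bookings_source"),
        SemanticModelReference(semantic_model_name="listings_source"),
    }
```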
```python
try:
    return self._resolve_dependencies(saved_query_name)
except Exception:
    logger.exception(
```
Ok. We might need to do something about this log level for deployment - if it turns out to be common we'll want it to show up with level WARN or INFO instead of EXCEPTION - but I think we can do that via datadog instead of here.
This is a fail-safe and shouldn't really be hit - if it is, it means there's a bug that we need to fix.
That's correct. There shouldn't be significant logic changes in those sections.
The updates look good, thank you - let's get this merged as soon as it's ready. The only update you might want to make before merge is the more_itertools change, but even that I can do later as I grind my way through this thing I'm trying to get done. The useless visitor chafes, but I'm just going to start removing stuff like that, so one extra diff in that stack won't matter much to me one way or the other.
```python
def ordered_dedupe(*iterables: Iterable[IterableT]) -> Tuple[IterableT, ...]:
    """De-duplicates the items in the iterables while preserving the order."""
    ordered_results: Dict[IterableT, None] = {}
    for iterable in iterables:
        for item in iterable:
            ordered_results[item] = None

    return tuple(ordered_results.keys())
```
This is a less robust custom implementation of the (admittedly poorly named) unique_everseen function in more-itertools.
Since dbt-semantic-interfaces depends on more-itertools we should add the dependency with the same version range and use it here as well.
Actually I just noticed this takes `*iterables`, so it's a less robust implementation of `unique_everseen(itertools.chain(*iterables))`. The point still stands: we should use the libraries, especially where iterator collection helpers are concerned.
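For reference, the suggested replacement would look roughly like this - a sketch, assuming more-itertools is added as a dependency with the same version range as in dbt-semantic-interfaces:

```python
import itertools
from typing import Iterable, Tuple, TypeVar

from more_itertools import unique_everseen

IterableT = TypeVar("IterableT")


def ordered_dedupe(*iterables: Iterable[IterableT]) -> Tuple[IterableT, ...]:
    """De-duplicates items across the iterables, preserving first-seen order."""
    return tuple(unique_everseen(itertools.chain(*iterables)))
```

One robustness difference: `unique_everseen` falls back to a list-based seen-set for unhashable items, whereas the dict-based version above requires every item to be hashable.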
Sure, we can make that change, but let me check on the "minimize dependencies" request and on how that method works with types.
I already looked at the dependencies; that's not an issue. The typing is another matter: more-itertools is an old library and these methods date back a long way.
Description
This PR adds a resolver to get the dependencies of a saved query.