Dataflow Plan for Min & Max of Distinct Values Query #854
Conversation
Oh I sure did forget all about this one, will take a look tomorrow if nobody else gets to it first.
    for agg_type in (AggregationType.MIN, AggregationType.MAX)
]
return SqlDataSet(
    instance_set=parent_data_set.instance_set,
We haven't fleshed out the MetadataSpec too much, but the min / max value might be better associated with that.
Are you thinking - instead of adding a MinMaxNode, use the MetadataSpec and add a new InstanceSpecVisitor to visit it?
Oh please don't add more InstanceSpecVisitors.....
Ok, I think what @plypaul is suggesting is we use the MetadataSpec to hold the element names associated with the select columns. That way we ensure every column in the SqlSelectStatementNode maps directly to a spec, which we implicitly need for subquery handling since the ColumnAssociationResolver manages the name resolution via spec objects.
There's an example in the Conversion Metrics PR: https://github.com/dbt-labs/metricflow/pull/352/files#diff-488be53d5a33d76e334c75668b62f8c961f169996b1fba1cca5daaf177e1d81bR1424-R1444
No new visitors needed, so you can do this and I can proceed with removing that interface.
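To make the suggestion concrete, here is a minimal sketch of the idea (class and method names here are hypothetical stand-ins, not MetricFlow's actual API): every select column is paired with the spec it came from, and column naming goes through a resolver rather than through hard-coded aliases, so a wrapping subquery can resolve the same names.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for MetricFlow's spec classes; the names and
# shapes are illustrative only.
@dataclass(frozen=True)
class MetadataSpec:
    element_name: str

    @property
    def qualified_name(self) -> str:
        return self.element_name

class ColumnAssociationResolver:
    """Resolves a spec object to the column name used in generated SQL."""

    def resolve(self, spec: MetadataSpec) -> str:
        return spec.qualified_name

# Every select column carries its spec, so an outer query can look up
# the same name via the resolver instead of guessing string aliases.
resolver = ColumnAssociationResolver()
select_columns = [(MetadataSpec("min"), "MIN(ds)"), (MetadataSpec("max"), "MAX(ds)")]
rendered = [f"{expr} AS {resolver.resolve(spec)}" for spec, expr in select_columns]
```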
FWIW, this would be right in line with the documented purpose of the MetadataSpec, although it's a bit fuzzier since the column is actually part of the output.
It would be nice if the contract around instance spec and SqlSelectStatementNode.select_columns was more clear and easier to inspect, but it's complicated.
> Ok, I think what @plypaul is suggesting is we use the MetadataSpec to hold the element names associated with the select columns.
Ahh ok that makes sense!
Updated this!
Overall this seems reasonable but I think it'll cause problems if we ever end up wrapping this in a subquery for some reason. Updating the naming and spec should help with this.
Separately, I know @plypaul is working on cleaning up our query param handling. With the current state I don't think there's a great alternative to having this be part of the spec and then doing runtime validation to make sure we don't allow it when it shouldn't be included, but hopefully you all can sort out the smoothest transition.
    sql_function=SqlFunction.from_aggregation_type(aggregation_type=agg_type),
    sql_function_args=[parent_column_alias_expr],
),
column_alias=agg_type.value,
min and max are reserved keywords in a lot of dialects, aren't they? They're also kind of nondescript. I wonder if we want to do min_<name> and max_<name> instead.
This is how it's currently implemented in MFS for the Tableau integration (with just min and max as column names). Seems to be working as expected. I wanted to make sure SLG would be able to know what the column name would be to select it - but maybe in SLG we can do a startswith check on the column name instead to see if it starts with min_ or max_. Let me sync with @aiguofer on this.
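The startswith idea above could look something like this (the helper name and return shape are made up for the sake of the sketch):

```python
from typing import Optional, Tuple

def split_min_max_column(column_name: str) -> Optional[Tuple[str, str]]:
    """Illustrative helper: return (agg, element_name) when the column
    follows the min_/max_ prefix convention, otherwise None."""
    for prefix in ("min_", "max_"):
        if column_name.startswith(prefix):
            return (prefix.rstrip("_"), column_name[len(prefix):])
    return None
```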
Diego's on board. @tlento would it make more sense to do <name>__min and <name>__max to follow our usual dunder pattern? Not sure where we're at in terms of getting rid of dunders these days...
I implemented this with min_<name> and max_<name> for now - LMK if you think that should change!
I have so many review-related things to do today.....
See inline. Short answer is I don't know what we should do regarding to dunder or not to dunder, but I'll have more space for that on Monday.
I suspect we'll end up with a more general way of putting the min/max as dunder suffixes on the output columns. If so that doesn't necessarily need to happen before this PR goes in, but we'd at least want the output columns to be formatted accordingly.
metricflow/specs/specs.py (Outdated)

@property
def qualified_name(self) -> str:  # noqa: D
-    return self.element_name
+    return f"{self.agg_type.value}_{self.element_name}" if self.agg_type else self.element_name
I have a thought here.
We do something similar elsewhere in a temporary way for semi-additive metrics (and we overload aggregation state for it). We likely need to expand that to cover things like auto-aliasing derived metric offsets, since offset metrics have the same name, fundamentally, as non-offset metrics.
I think we'd be better off having a structured representation of these added bits of information. I'll see if I can get a PR up with what I've been thinking about on Monday. If not maybe we put this in and then I'll fast-follow with an update.
We can have the DUNDER conversation over there, since that PR would centralize all of these glued on bits of internal state information that we need to communicate across subqueries and, at least in this case, in the final output column names.
Sounds good - will leave this PR until next week!
Following up on this - have you thought about the above yet?
I have! Just chatted with Paul and we think it makes sense to move to name__max and name__min for this PR.
The idea is, eventually, to take this thing:
and generalize it to a standard property. What that does in the resolver is glue on the aggregation state with a DUNDER to the end of the column name.
So I think we'll move to that model more broadly. If you use name__max and name__min here I can update the naming logic and consolidate it when I fix up the rest of it.
Sounds good! Just updated to use those names.
Anything else missing for this PR? Would love to get it merged if not!
@classmethod
def id_prefix(cls) -> str:  # noqa: D
    return DATAFLOW_NODE_MIN_MAX_ID_PREFIX
@plypaul is this still ok given your other changes to an enumerated ID prefix type?
@@ -397,7 +399,17 @@ def _parse_and_validate_query(
    where_constraint_str: Optional[str] = None,
    order_by_names: Optional[Sequence[str]] = None,
    order_by: Optional[Sequence[OrderByQueryParameter]] = None,
    min_max_only: bool = False,
@courtneyholcomb please coordinate with @plypaul, this file changed dramatically in his pending stack of changes.
@property
def qualified_name(self) -> str:  # noqa: D
-    return self.element_name
+    return f"{self.element_name}{DUNDER}{self.agg_type.value}" if self.agg_type else self.element_name
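As a self-contained sketch of how that property behaves (the class below is a simplified stand-in for the real spec class, and DUNDER is assumed to be the double-underscore delimiter discussed above):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

DUNDER = "__"  # assumed delimiter, per the dunder convention in this thread

class AggregationType(Enum):
    MIN = "min"
    MAX = "max"

@dataclass(frozen=True)
class SpecSketch:  # simplified stand-in for the real spec class
    element_name: str
    agg_type: Optional[AggregationType] = None

    @property
    def qualified_name(self) -> str:
        # With an agg_type set, the qualified name gets a dunder suffix.
        if self.agg_type:
            return f"{self.element_name}{DUNDER}{self.agg_type.value}"
        return self.element_name
```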
@plypaul - as we discussed, this is an example of where we want to have structured annotations. The plan is to add a better abstraction to the spec interface so we can unify this with the similar thing we're doing with non-additive time dimensions, and then build out support for auto-aliasing metric offsets on top of the same set of spec interfaces.
Resolves #SL-1080
Description
Adds a new Dataflow Plan node to get the min & max of a single-column distinct values query. This node serves a very limited use case: it is only designed to get metadata about group-by inputs for integrations.
I opted not to add a changelog entry for this because I don't think we should document it as something for users.
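For context, the SQL shape the node is meant to produce looks roughly like the following sketch (table and column names are illustrative, and the output aliases follow the name__min / name__max convention settled on in the review above):

```python
def render_min_max_sql(column: str, distinct_values_sql: str) -> str:
    """Illustrative only: wrap a single-column distinct-values subquery
    with MIN/MAX aggregates using dunder-suffixed output aliases."""
    return (
        f"SELECT MIN({column}) AS {column}__min, MAX({column}) AS {column}__max "
        f"FROM ({distinct_values_sql}) subq"
    )

sql = render_min_max_sql("ds", "SELECT DISTINCT ds FROM bookings_source")
```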