
Dataflow Plan for Min & Max of Distinct Values Query #854

Merged
courtneyholcomb merged 17 commits into main from court/distinct-values-range on Jan 2, 2024

Conversation

courtneyholcomb (Contributor):
Resolves #SL-1080

Description

Adds a new Dataflow Plan Node to get the min & max of a single-column distinct values query. This is a very limited use case node, only designed to get metadata about group by inputs for integrations.
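For context, a minimal sketch of the SQL shape such a node renders; the function and alias names here are illustrative, not MetricFlow's actual API:

```python
# Illustrative sketch only: render_min_max_sql is a hypothetical helper,
# not part of MetricFlow's actual interface.
def render_min_max_sql(column: str, distinct_values_sql: str) -> str:
    """Wrap a single-column distinct-values query in a MIN/MAX aggregation."""
    return (
        f"SELECT MIN({column}) AS {column}__min, MAX({column}) AS {column}__max "
        f"FROM ({distinct_values_sql}) subq"
    )

sql = render_min_max_sql("country", "SELECT DISTINCT country FROM listings")
print(sql)
```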
I opted not to add a changelog entry for this because I don't think we should document it as something for users.

@courtneyholcomb courtneyholcomb added the "Run Tests With Other SQL Engines" label on Nov 8, 2023
@github-actions github-actions bot removed the "Run Tests With Other SQL Engines" label on Nov 8, 2023
tlento (Contributor) commented on Nov 17, 2023:

Oh I sure did forget all about this one, will take a look tomorrow if nobody else gets to it first.

for agg_type in (AggregationType.MIN, AggregationType.MAX)
]
return SqlDataSet(
instance_set=parent_data_set.instance_set,
Contributor:

We haven't fleshed out the MetadataSpec too much, but the min / max value might be better associated with that.

courtneyholcomb (author):

Are you thinking - instead of adding a MinMaxNode, use the MetadataSpec and add a new InstanceSpecVisitor to visit it?

Contributor:

Oh please don't add more InstanceSpecVisitors.....

Contributor:

Ok, I think what @plypaul is suggesting is we use the MetadataSpec to hold the element names associated with the select columns. That way we ensure every column in the SqlSelectStatementNode maps directly to a spec, which we implicitly need for subquery handling since the ColumnAssociationResolver manages the name resolution via spec objects.

There's an example in the Conversion Metrics PR: https://github.com/dbt-labs/metricflow/pull/352/files#diff-488be53d5a33d76e334c75668b62f8c961f169996b1fba1cca5daaf177e1d81bR1424-R1444

No new visitors needed, so you can do this and I can proceed with removing that interface.
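Illustratively, the shape being suggested looks something like the following; these are simplified stand-ins for the idea, not MetricFlow's real class definitions:

```python
# Simplified stand-ins for the idea above; not MetricFlow's real classes.
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataSpec:
    """Spec that names a metadata-style output column."""

    element_name: str

    @property
    def qualified_name(self) -> str:
        return self.element_name


@dataclass(frozen=True)
class SelectColumn:
    """Every select column carries a spec so name resolution via spec
    objects keeps working when the query is wrapped in a subquery."""

    expr: str
    spec: MetadataSpec


columns = [
    SelectColumn(expr="MIN(country)", spec=MetadataSpec("country__min")),
    SelectColumn(expr="MAX(country)", spec=MetadataSpec("country__max")),
]
```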

Contributor:

FWIW, this would be right in line with the documented purpose of the MetadataSpec, although it's a bit fuzzier since the column is actually part of the output.

It would be nice if the contract around instance spec and SqlSelectStatementNode.select_columns was more clear and easier to inspect, but it's complicated.

courtneyholcomb (author):

> Ok, I think what @plypaul is suggesting is we use the MetadataSpec to hold the element names associated with the select columns.

Ahh ok that makes sense!

courtneyholcomb (author):

Updated this!

tlento (Contributor) left a review:

Overall this seems reasonable but I think it'll cause problems if we ever end up wrapping this in a subquery for some reason. Updating the naming and spec should help with this.

Separately, I know @plypaul is working on cleaning up our query param handling. With the current state I don't think there's a great alternative to having this be part of the spec and then doing runtime validation to make sure we don't allow it when it shouldn't be included, but hopefully you all can sort out the smoothest transition.



sql_function=SqlFunction.from_aggregation_type(aggregation_type=agg_type),
sql_function_args=[parent_column_alias_expr],
),
column_alias=agg_type.value,
Contributor:

min and max are reserved keywords in a lot of dialects, aren't they? They're also kind of nondescript. I wonder if we want to do min_<name> and max_<name> instead.
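To illustrate the concern, here is a hedged sketch of the kind of guard this implies; the reserved-word set is abridged and illustrative, since each dialect keeps its own list:

```python
# Abridged, illustrative reserved-word set; real SQL dialects differ.
RESERVED_WORDS = {"min", "max", "select", "from", "group", "order"}


def descriptive_alias(agg_name: str, element_name: str) -> str:
    """Prefer min_<name> / max_<name> over a bare 'min' / 'max' alias,
    which is nondescript and may collide with reserved words."""
    if agg_name.lower() in RESERVED_WORDS:
        return f"{agg_name}_{element_name}"
    return agg_name


print(descriptive_alias("min", "country"))  # min_country
```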

courtneyholcomb (author):

This is how it's currently implemented in MFS for the Tableau integration (with just min and max as column names). Seems to be working as expected. I wanted to make sure SLG would be able to know what the column name would be to select it - but maybe in SLG we can do a startswith check on the column name instead to see if it starts with min_ or max_. Let me sync with @aiguofer on this.

courtneyholcomb (author):

Diego's on board. @tlento would it make more sense to do <name>__min and <name>__max to follow our usual dunder pattern? Not sure where we're at in terms of getting rid of dunders these days...

courtneyholcomb (author):

I implemented this with min_<name> and max_<name> for now - LMK if you think that should change!

tlento (Contributor) left a review:

I have so many review-related things to do today.....

See inline. Short answer is I don't know what we should do regarding to dunder or not to dunder, but I'll have more space for that on Monday.

I suspect we'll end up with a more general way of putting the min/max as dunder suffixes on the output columns. If so that doesn't necessarily need to happen before this PR goes in, but we'd at least want the output columns to be formatted accordingly.


@property
def qualified_name(self) -> str:  # noqa: D
-    return self.element_name
+    return f"{self.agg_type.value}_{self.element_name}" if self.agg_type else self.element_name
Contributor:

I have a thought here.

We do something similar elsewhere in a temporary way for semi-additive metrics (and we overload aggregation state for it). We likely need to expand that to cover things like auto-aliasing derived metric offsets, since offset metrics have the same name, fundamentally, as non-offset metrics.

I think we'd be better off having a structured representation of these added bits of information. I'll see if I can get a PR up with what I've been thinking about on Monday. If not maybe we put this in and then I'll fast-follow with an update.

We can have the DUNDER conversation over there, since that PR would centralize all of these glued on bits of internal state information that we need to communicate across subqueries and, at least in this case, in the final output column names.

courtneyholcomb (author):

Sounds good - will leave this PR until next week!

courtneyholcomb (author):

Following up on this - have you thought about the above yet?

Contributor:

I have! Just chatted with Paul and we think it makes sense to move to name__max and name__min for this PR.

The idea is, eventually, to take this thing:

https://github.com/dbt-labs/metricflow/blob/main/metricflow/plan_conversion/dataflow_to_sql.py#L1103-L1104

and generalize it to a standard property. What that does in the resolver is glue on the aggregation state with a DUNDER to the end of the column name.

So I think we'll move to that model more broadly. If you use name__max and name__min here I can update the naming logic and consolidate it when I fix up the rest of it.

courtneyholcomb (author):

Sounds good! Just updated to use those names.

Anything else missing for this PR? Would love to get it merged if not!

@courtneyholcomb courtneyholcomb added the "Run Tests With Other SQL Engines" label on Dec 7, 2023

@classmethod
def id_prefix(cls) -> str: # noqa: D
return DATAFLOW_NODE_MIN_MAX_ID_PREFIX
Contributor:

@plypaul is this still ok given your other changes to an enumerated ID prefix type?

@@ -397,7 +399,17 @@ def _parse_and_validate_query(
where_constraint_str: Optional[str] = None,
order_by_names: Optional[Sequence[str]] = None,
order_by: Optional[Sequence[OrderByQueryParameter]] = None,
min_max_only: bool = False,
Contributor:

@courtneyholcomb please coordinate with @plypaul , this file changed dramatically in his pending stack of changes.


@property
def qualified_name(self) -> str:  # noqa: D
-    return self.element_name
+    return f"{self.element_name}{DUNDER}{self.agg_type.value}" if self.agg_type else self.element_name
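As a standalone sketch, the naming rule that was settled on (assuming DUNDER is the double-underscore separator) behaves like this:

```python
from typing import Optional

DUNDER = "__"  # double-underscore element-name separator


def qualified_name(element_name: str, agg_type_value: Optional[str] = None) -> str:
    # Suffix the aggregation type with a dunder when one is set,
    # e.g. "country" + "max" -> "country__max".
    if agg_type_value:
        return f"{element_name}{DUNDER}{agg_type_value}"
    return element_name


print(qualified_name("country", "max"))  # country__max
```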
Contributor:

@plypaul - as we discussed, this is an example of where we want to have structured annotations. The plan is to add a better abstraction to the spec interface so we can unify this with the similar thing we're doing with non-additive time dimensions, and then build out support for auto-aliasing metric offsets on top of the same set of spec interfaces.

@courtneyholcomb courtneyholcomb added and then removed the "Run Tests With Other SQL Engines" label on Jan 2, 2024
@courtneyholcomb courtneyholcomb merged commit afbb4b1 into main Jan 2, 2024
9 checks passed
@courtneyholcomb courtneyholcomb deleted the court/distinct-values-range branch January 2, 2024 20:20
Labels: cla:yes, Run Tests With Other SQL Engines, Skip Changelog