
Dataflow Plan for Min & Max of Distinct Values Query #854

Merged
courtneyholcomb merged 17 commits into main from court/distinct-values-range on Jan 2, 2024

Conversation

courtneyholcomb (Contributor):
Resolves #SL-1080

Description

Adds a new Dataflow Plan Node to get the min & max of a single-column distinct values query. This is a very limited use case node, only designed to get metadata about group by inputs for integrations.
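For context, a minimal sketch of the SQL shape such a node renders; the function and alias names here are illustrative, not MetricFlow's actual API:

```python
# Illustrative sketch only: render_min_max_sql is a hypothetical helper,
# not part of MetricFlow's actual interface.
def render_min_max_sql(column: str, distinct_values_sql: str) -> str:
    """Wrap a single-column distinct-values query in a MIN/MAX aggregation."""
    return (
        f"SELECT MIN({column}) AS {column}__min, MAX({column}) AS {column}__max "
        f"FROM ({distinct_values_sql}) subq"
    )

sql = render_min_max_sql("country", "SELECT DISTINCT country FROM listings")
print(sql)
```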
I opted not to add a changelog entry for this because I don't think we should document it as something for users.

@courtneyholcomb courtneyholcomb added the "Run Tests With Other SQL Engines" label on Nov 8, 2023
@github-actions github-actions bot removed the "Run Tests With Other SQL Engines" label on Nov 8, 2023
tlento (Contributor) commented on Nov 17, 2023:

Oh I sure did forget all about this one, will take a look tomorrow if nobody else gets to it first.

for agg_type in (AggregationType.MIN, AggregationType.MAX)
]
return SqlDataSet(
instance_set=parent_data_set.instance_set,
Contributor:

We haven't fleshed out the MetadataSpec too much, but the min / max value might be better associated with that.

courtneyholcomb (author):

Are you thinking - instead of adding a MinMaxNode, use the MetadataSpec and add a new InstanceSpecVisitor to visit it?

Contributor:

Oh please don't add more InstanceSpecVisitors.....

Contributor:

Ok, I think what @plypaul is suggesting is we use the MetadataSpec to hold the element names associated with the select columns. That way we ensure every column in the SqlSelectStatementNode maps directly to a spec, which we implicitly need for subquery handling since the ColumnAssociationResolver manages the name resolution via spec objects.

There's an example in the Conversion Metrics PR: https://github.com/dbt-labs/metricflow/pull/352/files#diff-488be53d5a33d76e334c75668b62f8c961f169996b1fba1cca5daaf177e1d81bR1424-R1444

No new visitors needed, so you can do this and I can proceed with removing that interface.
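Illustratively, the shape being suggested looks something like the following; these are simplified stand-ins for the idea, not MetricFlow's real class definitions:

```python
# Simplified stand-ins for the idea above; not MetricFlow's real classes.
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataSpec:
    """Spec that names a metadata-style output column."""

    element_name: str

    @property
    def qualified_name(self) -> str:
        return self.element_name


@dataclass(frozen=True)
class SelectColumn:
    """Every select column carries a spec so name resolution via spec
    objects keeps working when the query is wrapped in a subquery."""

    expr: str
    spec: MetadataSpec


columns = [
    SelectColumn(expr="MIN(country)", spec=MetadataSpec("country__min")),
    SelectColumn(expr="MAX(country)", spec=MetadataSpec("country__max")),
]
```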

Contributor:

FWIW, this would be right in line with the documented purpose of the MetadataSpec, although it's a bit fuzzier since the column is actually part of the output.

It would be nice if the contract around instance spec and SqlSelectStatementNode.select_columns was more clear and easier to inspect, but it's complicated.

courtneyholcomb (author):

> Ok, I think what @plypaul is suggesting is we use the MetadataSpec to hold the element names associated with the select columns.

Ahh ok that makes sense!

courtneyholcomb (author):

Updated this!

tlento (Contributor) left a review:

Overall this seems reasonable but I think it'll cause problems if we ever end up wrapping this in a subquery for some reason. Updating the naming and spec should help with this.

Separately, I know @plypaul is working on cleaning up our query param handling. With the current state I don't think there's a great alternative to having this be part of the spec and then doing runtime validation to make sure we don't allow it when it shouldn't be included, but hopefully you all can sort out the smoothest transition.



sql_function=SqlFunction.from_aggregation_type(aggregation_type=agg_type),
sql_function_args=[parent_column_alias_expr],
),
column_alias=agg_type.value,
Contributor:

min and max are reserved keywords in a lot of dialects, aren't they? They're also kind of nondescript. I wonder if we want to do min_<name> and max_<name> instead.
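To illustrate the concern, here is a hedged sketch of the kind of guard this implies; the reserved-word set is abridged and illustrative, since each dialect keeps its own list:

```python
# Abridged, illustrative reserved-word set; real SQL dialects differ.
RESERVED_WORDS = {"min", "max", "select", "from", "group", "order"}


def descriptive_alias(agg_name: str, element_name: str) -> str:
    """Prefer min_<name> / max_<name> over a bare 'min' / 'max' alias,
    which is nondescript and may collide with reserved words."""
    if agg_name.lower() in RESERVED_WORDS:
        return f"{agg_name}_{element_name}"
    return agg_name


print(descriptive_alias("min", "country"))  # min_country
```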

courtneyholcomb (author):

This is how it's currently implemented in MFS for the Tableau integration (with just min and max as column names). Seems to be working as expected. I wanted to make sure SLG would be able to know what the column name would be to select it - but maybe in SLG we can do a startswith check on the column name instead to see if it starts with min_ or max_. Let me sync with @aiguofer on this.

courtneyholcomb (author):

Diego's on board. @tlento would it make more sense to do <name>__min and <name>__max to follow our usual dunder pattern? Not sure where we're at in terms of getting rid of dunders these days...

courtneyholcomb (author):

I implemented this with min_<name> and max_<name> for now - LMK if you think that should change!

tlento (Contributor) left a review:

I have so many review-related things to do today.....

See inline. Short answer is I don't know what we should do regarding to dunder or not to dunder, but I'll have more space for that on Monday.

I suspect we'll end up with a more general way of putting the min/max as dunder suffixes on the output columns. If so that doesn't necessarily need to happen before this PR goes in, but we'd at least want the output columns to be formatted accordingly.


@property
def qualified_name(self) -> str:  # noqa: D
-    return self.element_name
+    return f"{self.agg_type.value}_{self.element_name}" if self.agg_type else self.element_name
Contributor:

I have a thought here.

We do something similar elsewhere in a temporary way for semi-additive metrics (and we overload aggregation state for it). We likely need to expand that to cover things like auto-aliasing derived metric offsets, since offset metrics have the same name, fundamentally, as non-offset metrics.

I think we'd be better off having a structured representation of these added bits of information. I'll see if I can get a PR up with what I've been thinking about on Monday. If not maybe we put this in and then I'll fast-follow with an update.

We can have the DUNDER conversation over there, since that PR would centralize all of these glued on bits of internal state information that we need to communicate across subqueries and, at least in this case, in the final output column names.

courtneyholcomb (author):

Sounds good - will leave this PR until next week!

courtneyholcomb (author):

Following up on this - have you thought about the above yet?

Contributor:

I have! Just chatted with Paul and we think it makes sense to move to name__max and name__min for this PR.

The idea is, eventually, to take this thing:

https://github.com/dbt-labs/metricflow/blob/main/metricflow/plan_conversion/dataflow_to_sql.py#L1103-L1104

and generalize it to a standard property. What that does in the resolver is glue on the aggregation state with a DUNDER to the end of the column name.

So I think we'll move to that model more broadly. If you use name__max and name__min here I can update the naming logic and consolidate it when I fix up the rest of it.

courtneyholcomb (author):

Sounds good! Just updated to use those names.

Anything else missing for this PR? Would love to get it merged if not!

@courtneyholcomb courtneyholcomb added the "Run Tests With Other SQL Engines" label on Dec 7, 2023

@classmethod
def id_prefix(cls) -> str: # noqa: D
return DATAFLOW_NODE_MIN_MAX_ID_PREFIX
Contributor:

@plypaul is this still ok given your other changes to an enumerated ID prefix type?

@@ -397,7 +399,17 @@ def _parse_and_validate_query(
where_constraint_str: Optional[str] = None,
order_by_names: Optional[Sequence[str]] = None,
order_by: Optional[Sequence[OrderByQueryParameter]] = None,
min_max_only: bool = False,
Contributor:

@courtneyholcomb please coordinate with @plypaul , this file changed dramatically in his pending stack of changes.


@property
def qualified_name(self) -> str:  # noqa: D
-    return self.element_name
+    return f"{self.element_name}{DUNDER}{self.agg_type.value}" if self.agg_type else self.element_name
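As a standalone sketch, the naming rule that was settled on (assuming DUNDER is the double-underscore separator) behaves like this:

```python
from typing import Optional

DUNDER = "__"  # double-underscore element-name separator


def qualified_name(element_name: str, agg_type_value: Optional[str] = None) -> str:
    # Suffix the aggregation type with a dunder when one is set,
    # e.g. "country" + "max" -> "country__max".
    if agg_type_value:
        return f"{element_name}{DUNDER}{agg_type_value}"
    return element_name


print(qualified_name("country", "max"))  # country__max
```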
Contributor:

@plypaul - as we discussed, this is an example of where we want to have structured annotations. The plan is to add a better abstraction to the spec interface so we can unify this with the similar thing we're doing with non-additive time dimensions, and then build out support for auto-aliasing metric offsets on top of the same set of spec interfaces.

@courtneyholcomb courtneyholcomb added and then removed the "Run Tests With Other SQL Engines" label on Jan 2, 2024
@courtneyholcomb courtneyholcomb merged commit afbb4b1 into main Jan 2, 2024
9 checks passed
@courtneyholcomb courtneyholcomb deleted the court/distinct-values-range branch January 2, 2024 20:20
Labels: cla:yes, Run Tests With Other SQL Engines, Skip Changelog