Add a method to figure out common nodes in a dataflow plan #1520

plypaul · 2024-11-10T04:53:04Z

This PR adds a method to find nodes in a dataflow plan that appear more than once, i.e. nodes that are the parent of multiple nodes. These common nodes indicate operations where a computation is reused, e.g. a metric that is used in the computation of multiple derived metrics in a query. These nodes will later be used to generate CTEs.

github-actions · 2024-11-10T04:53:20Z

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

courtneyholcomb

Left a few small comments inline, but overall looks great! 🚀 🚀 🚀

courtneyholcomb · 2024-11-13T00:09:58Z

metricflow/dataflow/dataflow_plan.py

@@ -81,6 +86,12 @@ def aggregated_to_elements(self) -> Set[LinkableInstanceSpec]:
        """Indicates that the node has been aggregated to these specs, guaranteeing uniqueness in all combinations."""
        return set()

+    def __lt__(self, other: ComparisonAnyType) -> bool:  # noqa: D105
+        if not isinstance(other, DataflowPlanNode):
+            raise NotImplementedError


Should we add an error message here in case this somehow gets hit? Seems similar to a bare AssertionError.

This is actually a special value that is supposed to be returned for __lt__ when you try to compare two objects that aren't comparable:

https://docs.python.org/3/library/constants.html
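
For context, the linked page documents the `NotImplemented` constant, which rich-comparison methods return (rather than raise) to signal that the operand type is unsupported. It is distinct from the `NotImplementedError` exception. A minimal sketch of the pattern follows; `PlanNode` and its `node_id` field are hypothetical stand-ins, not MetricFlow's `DataflowPlanNode`:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PlanNode:
    """Hypothetical stand-in for a plan node, ordered by its ID string."""

    node_id: str

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, PlanNode):
            # Return (do not raise) NotImplemented so that Python can try the
            # reflected comparison on `other` and raise TypeError if that fails too.
            return NotImplemented
        return self.node_id < other.node_id


# Defining __lt__ is enough for sorted() to work on a collection of nodes.
nodes = sorted([PlanNode("rss_2"), PlanNode("cm_0"), PlanNode("rss_1")])
sorted_ids = [node.node_id for node in nodes]

# Comparing with an unrelated type ends in a TypeError raised by the interpreter.
try:
    _ = PlanNode("cm_0") < 42
    comparison_error = None
except TypeError as e:
    comparison_error = type(e).__name__
```

With this pattern no custom error message is needed: the interpreter produces a `TypeError` that names both operand types.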

courtneyholcomb · 2024-11-13T00:13:35Z

metricflow/dataflow/dataflow_plan_analyzer.py

+
+
+class DataflowPlanAnalyzer:
+    """CLass to determine more complex properties of the dataflow plan.


nit: Class*

courtneyholcomb · 2024-11-13T00:24:57Z

metricflow/dataflow/dataflow_plan_analyzer.py

+        return tuple(sorted(dataflow_plan.sink_node.accept(common_branches_visitor)))
+
+
+class _CountCommonDataflowNodeVisitor(DataflowPlanNodeVisitorWithDefaultHandler[None]):


nit: This isn't counting just common nodes, it's counting all nodes (i.e., you'll get some nodes that are just used once, so they're not common). Could change the name of the class and related variables to reflect that - e.g., _CountDataflowNodeVisitor
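
The distinction can be sketched as two steps: count every node reached from the sink, then filter for counts greater than one. This is only an illustration; `Node`, `count_nodes`, and the example plan are hypothetical and do not reflect the actual visitor in this PR:

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical plan node; `parents` are the upstream nodes it reads from."""

    name: str
    parents: Tuple["Node", ...] = ()


def count_nodes(sink: Node) -> Counter:
    """Count how many times each node is reached when walking up from the sink."""
    counts: Counter = Counter()

    def visit(node: Node) -> None:
        counts[node] += 1  # Every node is counted, common or not.
        for parent in node.parents:
            visit(parent)

    visit(sink)
    return counts


# A metric computation that feeds two derived metrics.
source = Node("read_bookings_source")
bookings = Node("compute_bookings", (source,))
derived_a = Node("derived_a", (bookings,))
derived_b = Node("derived_b", (bookings,))
sink = Node("combine", (derived_a, derived_b))

counts = count_nodes(sink)
# "Common" is a filter applied afterward: nodes reached more than once.
common = sorted(node.name for node, count in counts.items() if count > 1)
```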

courtneyholcomb · 2024-11-13T00:25:42Z

metricflow/dataflow/dataflow_plan_analyzer.py

+
+
+class _CountCommonDataflowNodeVisitor(DataflowPlanNodeVisitorWithDefaultHandler[None]):
+    """Helper visitor to build a dict from a node in the plan to the number of times it appears in the plans."""


This only traverses one dataflow plan at a time, right? If so, plan* instead of plans

courtneyholcomb · 2024-11-13T00:27:58Z

metricflow/dataflow/dataflow_plan_visitor.py

+        raise NotImplementedError
+
+
+class DataflowPlanNodeVisitorWithDefaultHandler(DataflowPlanNodeVisitor[VisitorOutputT], Generic[VisitorOutputT]):


Appreciate this reduction in boilerplate! 🙏
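
As a rough sketch of where the boilerplate goes: a plain visitor needs one `visit_*` method per node type, while a default-handler base class routes all of them to a single method that subclasses override. The two node types and all names below except `VisitorOutputT` and `_default_handler` are hypothetical; the real visitor covers many more node types and its signatures may differ:

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Generic, TypeVar

VisitorOutputT = TypeVar("VisitorOutputT")


class SourceNode:
    """Hypothetical node type."""

    def accept(self, visitor: NodeVisitor[VisitorOutputT]) -> VisitorOutputT:
        return visitor.visit_source_node(self)


class FilterNode:
    """Hypothetical node type."""

    def accept(self, visitor: NodeVisitor[VisitorOutputT]) -> VisitorOutputT:
        return visitor.visit_filter_node(self)


class NodeVisitor(Generic[VisitorOutputT], ABC):
    """Base visitor: every node type needs its own method."""

    @abstractmethod
    def visit_source_node(self, node: SourceNode) -> VisitorOutputT:
        raise NotImplementedError

    @abstractmethod
    def visit_filter_node(self, node: FilterNode) -> VisitorOutputT:
        raise NotImplementedError


class NodeVisitorWithDefaultHandler(NodeVisitor[VisitorOutputT], Generic[VisitorOutputT]):
    """Routes every visit method to one handler unless a subclass overrides it."""

    @abstractmethod
    def _default_handler(self, node: object) -> VisitorOutputT:
        raise NotImplementedError

    def visit_source_node(self, node: SourceNode) -> VisitorOutputT:
        return self._default_handler(node)

    def visit_filter_node(self, node: FilterNode) -> VisitorOutputT:
        return self._default_handler(node)


class NodeTypeNameVisitor(NodeVisitorWithDefaultHandler[str]):
    """A concrete visitor only has to implement the default handler."""

    def _default_handler(self, node: object) -> str:
        return type(node).__name__


visitor = NodeTypeNameVisitor()
names = [node.accept(visitor) for node in (SourceNode(), FilterNode())]
```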

courtneyholcomb · 2024-11-13T00:29:42Z

metricflow/dataflow/dataflow_plan_analyzer.py

+
+    @override
+    def _default_handler(self, node: DataflowPlanNode) -> FrozenSet[DataflowPlanNode]:
+        if node in self._common_nodes:


It took me a minute to understand why this gets the largest common nodes! Mind adding a comment or a docstring to explain that this early return works because we're traversing from largest to smallest?
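
One way to picture the early return, as a sketch under assumptions: the traversal starts at the sink and moves toward the sources, so the first common node met on a path is the largest one on that path, and stopping there keeps smaller common nodes nested inside it out of the result. Only the `if node in self._common_nodes:` check comes from the diff; `Node`, `count_nodes`, and `largest_common_branches` are hypothetical:

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical plan node; `parents` are the upstream nodes it reads from."""

    name: str
    parents: Tuple["Node", ...] = ()


def count_nodes(sink: Node) -> Counter:
    """Count how many times each node is reached when walking up from the sink."""
    counts: Counter = Counter()
    stack = [sink]
    while stack:
        node = stack.pop()
        counts[node] += 1
        stack.extend(node.parents)
    return counts


def largest_common_branches(node: Node, common: FrozenSet[Node]) -> FrozenSet[Node]:
    if node in common:
        # Walking from the sink toward the sources, the first common node on a
        # path is the largest one, so return without descending into its parents.
        return frozenset({node})
    found: FrozenSet[Node] = frozenset()
    for parent in node.parents:
        found |= largest_common_branches(parent, common)
    return found


source = Node("read_source")
bookings = Node("compute_bookings", (source,))
derived_a = Node("derived_a", (bookings,))
derived_b = Node("derived_b", (bookings,))
sink = Node("combine", (derived_a, derived_b))

counts = count_nodes(sink)
common = frozenset(node for node, count in counts.items() if count > 1)
largest = sorted(node.name for node in largest_common_branches(sink, common))
```

In this sketch `read_source` is reached twice, but only through `compute_bookings`, so it is not reported separately. A node that is also reachable along a path with no common node above it would still be returned.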

courtneyholcomb · 2024-11-13T00:38:13Z

...tricflow/snapshots/test_common_dataflow_branches.py/str/test_shared_metric_query__result.txt

+docstring:
+  For a known case, test that a metric computation node is identified as a common branch.
+
+      A query for `bookings` and `bookings_per_booker` should have the computation for `bookings` as a common branch in


This is an interesting case! The ComputeMetricsNode for bookings is common, but within that node there is another common node that could also be used as a CTE - the ReadSqlSourceNode for the bookings_source table, which is also used in the bookers branch. Curious if you plan to add that optimization later?

That's correct, and the ReadSqlSourceNode is included in the result snapshot under common_branch_1 below. I should probably add newlines because the different items are easy to miss.

Ahh nice! Yep I missed that

This PR:

* Adds `SqlGenerationOptionSet` to encapsulate the options for how SQL should be generated from the dataflow plan.
* Adds an `O5` level that uses all previous optimizers and generates CTEs.
* Updates `DataflowToSqlQueryPlanConverter` to use `SqlGenerationOptionSet`. When the `allow_cte` option is set, it converts the common nodes (as implemented in #1520) in a dataflow plan to map to a CTE instead of a subquery.

Since CTEs are not generated by default, the generated SQL is the same for test cases and there are no snapshot changes (aside from the ones that specifically test this feature).

cla-bot bot added the cla:yes label Nov 10, 2024

plypaul added the Skip Changelog label Nov 10, 2024

plypaul mentioned this pull request Nov 10, 2024

Support generation of CTEs in DataflowToSqlQueryPlanConverter #1521

Merged

plypaul force-pushed the p--cte--14 branch from b18d395 to e288428 Compare November 10, 2024 05:44

plypaul force-pushed the p--cte--15 branch from 1f4237d to 4e81daf Compare November 10, 2024 05:44

plypaul force-pushed the p--cte--14 branch from e288428 to b06d011 Compare November 11, 2024 21:41

plypaul force-pushed the p--cte--15 branch from 4e81daf to fd15a10 Compare November 11, 2024 21:41

plypaul force-pushed the p--cte--14 branch from b06d011 to 69c4d38 Compare November 11, 2024 23:23

plypaul force-pushed the p--cte--15 branch from fd15a10 to c970117 Compare November 11, 2024 23:23

plypaul force-pushed the p--cte--14 branch from 69c4d38 to d95faa6 Compare November 12, 2024 01:11

plypaul force-pushed the p--cte--15 branch from c970117 to 929dc52 Compare November 12, 2024 01:11

plypaul force-pushed the p--cte--14 branch from d95faa6 to ecc96e2 Compare November 12, 2024 01:26

plypaul force-pushed the p--cte--15 branch 3 times, most recently from 5e3fbd4 to 030886e Compare November 12, 2024 01:41

plypaul marked this pull request as ready for review November 12, 2024 01:41

courtneyholcomb approved these changes Nov 13, 2024


plypaul force-pushed the p--cte--14 branch from ecc96e2 to f046e4e Compare November 13, 2024 05:01

plypaul force-pushed the p--cte--15 branch from 030886e to 595e10d Compare November 13, 2024 05:01

plypaul force-pushed the p--cte--14 branch from f046e4e to fc3c505 Compare November 13, 2024 05:05

plypaul force-pushed the p--cte--15 branch from 595e10d to dc5b3bd Compare November 13, 2024 05:05

plypaul force-pushed the p--cte--14 branch from fc3c505 to 85321be Compare November 13, 2024 05:11

plypaul force-pushed the p--cte--15 branch from dc5b3bd to baeded8 Compare November 13, 2024 05:11

Base automatically changed from p--cte--14 to main November 13, 2024 05:16

plypaul added 3 commits November 12, 2024 21:16

/* PR_START p--cte 15 */ Add dataflow visitor with default.

7af1431

Make dataflow plan nodes sortable.

d8d4f51

Add method to figure out the common branches in a dataflow plan.

901bfac

plypaul force-pushed the p--cte--15 branch from baeded8 to 3ea7d29 Compare November 13, 2024 05:16

Add padding option to mf_pformat_dict.

3da06c0

Update snapshots.

9fbdffb

plypaul force-pushed the p--cte--15 branch from 3ea7d29 to 9fbdffb Compare November 13, 2024 05:39

plypaul merged commit 668c5f8 into main Nov 13, 2024
15 checks passed

plypaul deleted the p--cte--15 branch November 13, 2024 05:50




