Split column pruner into two phases #1501

plypaul · 2024-11-04T16:56:12Z

Currently, the column pruner checks the columns that are needed in each SELECT statement and generates the pruned SQL in a single pass. For better readability and easier modification, this splits the column pruner into two phases.

First, the SQL nodes are traversed to figure out which columns are required and which can be pruned. Then, the SQL nodes are rewritten with the pruned columns.

The logic in SqlTagRequiredColumnAliasesVisitor has been copied from the original implementation.
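As a rough illustration of the two phases described above, here is a minimal sketch. It does not use the real MetricFlow classes: `SelectColumn`, `SelectNode`, `tag_required_aliases`, `rewrite_with_required_aliases`, and `prune_columns` are hypothetical stand-ins, and the sketch assumes each SELECT has at most one parent and that a WHERE clause only reads columns from that parent.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, Optional, Set, Tuple


@dataclass(frozen=True)
class SelectColumn:
    """Hypothetical SELECT column: its alias plus the parent aliases its expression reads."""

    alias: str
    parent_aliases_used: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class SelectNode:
    """Hypothetical SELECT node with at most one parent (its FROM source)."""

    node_id: str
    columns: Tuple[SelectColumn, ...]
    where_aliases_used: FrozenSet[str] = frozenset()
    parent: Optional[SelectNode] = None


def tag_required_aliases(node: SelectNode, required: Set[str], tagged: Dict[str, Set[str]]) -> None:
    """Phase 1: record which aliases each node must keep, without rewriting anything."""
    tagged.setdefault(node.node_id, set()).update(required)
    if node.parent is None:
        return
    # The parent must supply whatever the WHERE clause and the kept columns read.
    parent_required = set(node.where_aliases_used)
    for column in node.columns:
        if column.alias in required:
            parent_required |= column.parent_aliases_used
    tag_required_aliases(node.parent, parent_required, tagged)


def rewrite_with_required_aliases(node: SelectNode, tagged: Dict[str, Set[str]]) -> SelectNode:
    """Phase 2: rebuild the nodes, keeping only the columns tagged in phase 1."""
    required = tagged.get(node.node_id, set())
    kept = tuple(column for column in node.columns if column.alias in required)
    if not kept:
        # A SELECT with no columns is not valid SQL, so leave this node unchanged.
        return node
    new_parent = rewrite_with_required_aliases(node.parent, tagged) if node.parent else None
    return replace(node, columns=kept, parent=new_parent)


def prune_columns(node: SelectNode, required_aliases: Set[str]) -> SelectNode:
    """Run both phases: tag first, then rewrite."""
    tagged: Dict[str, Set[str]] = {}
    tag_required_aliases(node, set(required_aliases), tagged)
    return rewrite_with_required_aliases(node, tagged)
```

Keeping the tagging pass free of rewrites means the "which columns are needed" logic can be read and tested on its own, and the rewrite pass reduces to a filter over the tagged aliases.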

courtneyholcomb

Overall, this logic looks great!
I had a bit of trouble reading the code (sorry for the slow review - that's why!), but I think this was only due to the naming of some of the classes / variables / etc. I've left some suggestions to improve readability, and almost all of them are just related to naming.

courtneyholcomb · 2024-11-05T05:02:35Z

metricflow/sql/optimizer/column_pruner.py

+                f"SQL, but this is a bug and should be investigated."
+            )
+            return node
+
        pruned_select_columns = tuple(


This is tangential, but I've frequently read this code and found this variable name (pruned_select_columns) confusing. We frequently refer to "pruned columns" when we mean the ones that have been removed, but in this case we mean the columns that have been kept. I think the word "pruned" can technically be used both ways, but it typically is used to refer to what has been removed. Can we change this to a clearer variable name?

Sure, this could be renamed. To understand where you're coming from, you mention that [pruned] typically is used to refer to what has been removed. What examples were you thinking?

When I think of pruned, I think about an overgrown tree. Once I prune it, I would call it a pruned tree.

I think that's true - the tree has been pruned. But if you're referring to the branches, I think the "pruned" branches would typically refer to the ones removed. In this case the tree is the SQL node and the columns are the branches.

I just did a quick search through the code to see how we use this word, and there are a couple of places where we use the opposite meaning of prune:

metricflow/metricflow/sql/optimizer/column_pruner.py

Line 37 in d01cbea

required_alias_mapping: Describes columns aliases that should be kept / not pruned for each node.

metricflow/tests_metricflow/sql/optimizer/test_column_pruner.py

Lines 349 to 350 in d01cbea

def test_dont_prune_if_in_where(

request: FixtureRequest,

metricflow/tests_metricflow/sql/optimizer/test_column_pruner.py

Lines 394 to 395 in d01cbea

def test_dont_prune_with_str_expr(

request: FixtureRequest,

And this is silly, but I just did a quick Google search for a gut check here, and it does look like the pruned leaves are the ones that have been removed.

Not totally related to this PR so this isn't blocking! But if you don't update the naming here I probably will the next time I come across it.

In that case, I'm thinking we use different terms like "removed" and "retained", but that will have to be handled in a follow-up.
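To make the naming question concrete, here is a small sketch of how the two terms could label the two groups of columns. The function and variable names are hypothetical and only illustrate the "retained" / "removed" wording; they are not names taken from the codebase.

```python
from typing import Sequence, Set, Tuple


def split_select_columns(
    column_aliases: Sequence[str], required_aliases: Set[str]
) -> Tuple[Tuple[str, ...], Tuple[str, ...]]:
    """Split column aliases into the ones kept and the ones dropped."""
    # "retained": columns that stay in the rewritten SELECT.
    retained_column_aliases = tuple(alias for alias in column_aliases if alias in required_aliases)
    # "removed": columns that are dropped from the rewritten SELECT.
    removed_column_aliases = tuple(alias for alias in column_aliases if alias not in required_aliases)
    return retained_column_aliases, removed_column_aliases
```

With names like these, neither group is called "pruned", so a reader does not have to guess which of the two meanings is intended.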

courtneyholcomb · 2024-11-05T22:43:26Z

metricflow/sql/optimizer/tag_required_column_aliases.py

+        self._column_alias_tagger = tagged_column_alias_set
+
+    def _search_for_expressions(
+        self, select_node: SqlSelectStatementNode, pruned_select_columns: Tuple[SqlSelectColumn, ...]


Same concern re: the name pruned_select_columns here

plypaul · 2024-11-08T17:42:55Z

@courtneyholcomb Updated with naming changes + other alterations. I'm going to rename the files later as they cause conflicts in the stack.

courtneyholcomb

New names look great, overall code looks great! Thank you!

courtneyholcomb · 2024-11-08T21:11:55Z

plypaul added the Skip Changelog label Nov 4, 2024

cla-bot bot added the cla:yes label Nov 4, 2024

plypaul marked this pull request as ready for review November 4, 2024 17:07

courtneyholcomb reviewed Nov 6, 2024

courtneyholcomb approved these changes Nov 8, 2024

plypaul force-pushed the p--cte--05 branch from e3c7e41 to 72e7221 Compare November 9, 2024 01:53

plypaul force-pushed the p--cte--06 branch from 093f292 to bb4ce01 Compare November 9, 2024 01:53

Base automatically changed from p--cte--05 to main November 9, 2024 02:00

plypaul force-pushed the p--cte--06 branch from bb4ce01 to ee31b49 Compare November 9, 2024 07:13

plypaul added 5 commits November 9, 2024 16:18

Rename to NodeToColumnAliasMapping.

51abaf6

Create a copy when reusing SELECT nodes.

c41b7bb

Update snapshots due to re-created nodes.

b7e7ee1

Change log call from debug to error.

c1fe86c

plypaul force-pushed the p--cte--06 branch from ee31b49 to c1fe86c Compare November 10, 2024 00:47

plypaul merged commit dd090a2 into main Nov 10, 2024
15 checks passed

plypaul deleted the p--cte--06 branch November 10, 2024 02:20
