Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix: source node evaluation #801

Merged
merged 3 commits into from
Oct 11, 2023
Merged

Conversation

courtneyholcomb
Copy link
Contributor

@courtneyholcomb courtneyholcomb commented Oct 10, 2023

Description

Fixes a bug in how we determine which source node to use. Previously, we were choosing a plan based on source node cost. Source nodes can only be ReadSqlSourceNode or MetricTimeDimensionTransformNode, so they all have DefaultCost(num_joins=0, num_aggregations=0). That means we were just arbitrarily choosing a source node that could satisfy the query. Here, the logic changes to use the number of joins in each node's LinkableInstanceSatisfiabilityEvaluation, so we choose the dataflow plan with the lowest number of joins.
Note that this fixes a bug that came up in the process of enabling querying dimensions without metrics.

Also: we have a log that says Not evaluating other nodes since we found one that doesn't require joins, but we don't stop evaluating nodes at that point. I added a break to the loop to fix that, too.

@cla-bot cla-bot bot added the cla:yes label Oct 10, 2023
@github-actions
Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@courtneyholcomb courtneyholcomb marked this pull request as ready for review October 10, 2023 03:37
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm not sure how you figured this out, most impressive. It makes sense to me, though! You might want to get @plypaul to take a quick look just to make sure I'm not missing something, this is an incredibly dense and unruly part of the codebase.

If this does go through I think it might be time to remove this whole DefaultCost visitor (in a separate PR, of course), as I believe it has outlived its usefulness.

@@ -553,6 +553,7 @@ def _find_measure_recipe(
# this is going to be the lowest cost solution.
if len(evaluation.join_recipes) == 0:
logger.info("Not evaluating other nodes since we found one that doesn't require joins")
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

homer mishap

Comment on lines +560 to +570
# All source nodes cost the same. Find evaluation with lowest number of joins.
node_with_lowest_cost_plan = min(
node_to_evaluation, key=lambda node: len(node_to_evaluation[node].join_recipes)
)
evaluation = node_to_evaluation[node_with_lowest_cost_plan]
logger.info(
"Lowest cost node is:\n"
"Lowest cost plan is:\n"
+ pformat_big_objects(
lowest_cost_node=dataflow_dag_as_text(node_with_lowest_cost),
node=dataflow_dag_as_text(node_with_lowest_cost_plan),
evaluation=evaluation,
cost=cost_function.calculate_cost(node_with_lowest_cost),
joins=len(node_to_evaluation[node_with_lowest_cost_plan].join_recipes),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find!

We have Yet Another Inscrutable Visitor doing this complex cost calculation, but it can only go from leaf node to root node, not the other way around. Since measure nodes are currently ONLY source nodes the cost comparison is pointless.

Using minimum join count is probably the right answer here. Later on it'll matter more what the user requests, because eventually the measure nodes could be on the right. In theory, this kind of graph walk cost computation will be useful then, but my worry about this block of logic was always that it the results might diverge as the join layout changes. However, since we have now committed to always picking the shortest join paths I think this gets us closer to where we need to be.

@plypaul
Copy link
Contributor

plypaul commented Oct 10, 2023

@courtneyholcomb It would be preferable to encapsulate all costing using a centralized class, but that might be a more involved fix. If this is blocking you, we could merge this and do a follow up later.

@tlento
Copy link
Contributor

tlento commented Oct 11, 2023

@plypaul at the moment we don't do any costing at all. I say we just remove _sort_by_suitability and all of the DataflowPlanCost stuff and drop the pretense. If we find a use for a more involved centralized cost computation setup we can add one then.

@courtneyholcomb
Copy link
Contributor Author

@plypaul at the moment we don't do any costing at all. I say we just remove _sort_by_suitability and all of the DataflowPlanCost stuff and drop the pretense. If we find a use for a more involved centralized cost computation setup we can add one then.

@tlento we do use _sort_by_suitability, but only to determine which node can satisfy the most linkable specs. All other costing appears to be unused.

@courtneyholcomb
Copy link
Contributor Author

courtneyholcomb commented Oct 11, 2023

@courtneyholcomb It would be preferable to encapsulate all costing using a centralized class, but that might be a more involved fix. If this is blocking you, we could merge this and do a follow up later.

@plypaul _sort_by_suitability is the only costing function that's used currently, and it is actually on DataFlowPlanBuilder instead of a costing class.
With that in mind, maybe we merge this as-is and then put up a separate task to remove all other costing code?

@courtneyholcomb courtneyholcomb merged commit df0ae70 into main Oct 11, 2023
9 checks passed
@courtneyholcomb courtneyholcomb deleted the court/source-node-sort branch October 11, 2023 00:44
@tlento
Copy link
Contributor

tlento commented Oct 11, 2023

@tlento we do use _sort_by_suitability, but only to determine which node can satisfy the most linkable specs. All other costing appears to be unused.

I should learn to read, missed the tuple return in there.....

@plypaul
Copy link
Contributor

plypaul commented Oct 11, 2023

@plypaul at the moment we don't do any costing at all. I say we just remove _sort_by_suitability and all of the DataflowPlanCost stuff and drop the pretense. If we find a use for a more involved centralized cost computation setup we can add one then.

Yeah, fine to remove as well actually. The original use case was for the internal caching implementation, but that's no longer relevant.

This was referenced Oct 30, 2023
sarbmeetka pushed a commit to sarbmeetka/metricflow that referenced this pull request Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants