[ADAP-900] [Bug] Nested loop in query planner execution of relations.sql #609
Comments
Wow, incredible write-up @slin30 🏆 We'd be open to a potential tweak to push the planner towards a more efficient path. As long as it yields the same results either way and the only side effect is more consistent performance, we should be good to go 👍 Would you be interested in raising a PR with your modifications? The diff:

```diff
22a23
> where coalesce(pg_rewrite.ev_class, pg_depend.objid) != pg_depend.refobjid
35d35
< where ref.relation_name != dep.relation_name
```
Thanks, @dbeatty10 -- happy to raise a PR; I should be able to do so within 24 hours. While I don't see any difference in the results, nor any reason (looking at the logic) for there to be one, I can't directly speak to the much wider range of scenarios in the wild. I'm keen to see how this does during testing/QA on your end.
@dbeatty10 While my PR is in the queue, is there a mechanism for us to temporarily override the adapter macro for our projects with the adjusted one I am proposing? Would it be as simple as, e.g., creating a macro named `redshift__get_relations` in our own project? Asking because this has again shown up on our end.
@slin30 Thanks for opening that PR! 🤩 Yes, it would be exactly as simple as you described 👍 i.e., adding this file to your project (sketched below) will make it take priority over the definition in dbt-redshift.
You can see it take effect for yourself by adding a logging statement that writes to both the log file and stdout, like this: `{{ log("Using my version of redshift__get_relations", True) }}`
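A minimal sketch of what that override file could look like -- the `macros/` path and the `statement`/`load_result` scaffolding are assumptions based on how dbt's built-in `get_relations` macros are commonly structured, not the exact file from this thread:

```sql
-- macros/redshift__get_relations.sql  (illustrative file name/path)
{% macro redshift__get_relations() %}
  {# confirm the override is in effect, in both the log file and stdout #}
  {{ log("Using my version of redshift__get_relations", True) }}
  {%- call statement('relations', fetch_result=True) -%}
    -- paste the adjusted relations.sql query here
    select 1 as placeholder
  {%- endcall -%}
  {{ return(load_result('relations').table) }}
{% endmacro %}
```

Because project macros take precedence over adapter macros during dispatch, no further configuration should be needed.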
Thank you @dbeatty10! That worked perfectly and I learned something new. Will deploy this for our environments, as this should give us some real-world evidence of how this works and whether we see the anomaly again.

```
11:13:23.599816 [debug] [MainThread]: On master: /* {"app": "dbt", "dbt_version": "1.6.1", "profile_name": "convertkit", "target_name": "dev", "connection_name": "master"} */
with
relation as (
select
pg_class.oid as relation_id,
pg_class.relname as relation_name,
pg_class.relnamespace as schema_id,
pg_namespace.nspname as schema_name,
pg_class.relkind as relation_type
from pg_class
join pg_namespace
on pg_class.relnamespace = pg_namespace.oid
where pg_namespace.nspname != 'information_schema'
and pg_namespace.nspname not like 'pg\_%'
),
dependency as (
select distinct
coalesce(pg_rewrite.ev_class, pg_depend.objid) as dep_relation_id,
pg_depend.refobjid as ref_relation_id,
pg_depend.refclassid as ref_class_id
from pg_depend
left join pg_rewrite
on pg_depend.objid = pg_rewrite.oid
where coalesce(pg_rewrite.ev_class, pg_depend.objid) != pg_depend.refobjid
)
select distinct
dep.schema_name as dependent_schema,
dep.relation_name as dependent_name,
ref.schema_name as referenced_schema,
ref.relation_name as referenced_name
from dependency
join relation ref
on dependency.ref_relation_id = ref.relation_id
join relation dep
on dependency.dep_relation_id = dep.relation_id
11:13:23.834033 [debug] [MainThread]: SQL status: SUCCESS in 0.0 seconds
11:13:23.837065 [info ] [MainThread]: Using local experimental version of redshift__get_relations
```
Confirming this works. While I don't have an A/B test, I can at least confirm the sequence below.

First time we see the anomaly today:

Last run before pushing the local override version:

First run after pushing the local override version:
Is this a new bug in dbt-redshift?
Current Behavior
When running the statement in relations.sql, our Redshift cluster's query planner will, seemingly at random, choose a nested loop join. This shows up as extended execution times, adding anywhere from 1-6 minutes to the start of a run/step.
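For anyone trying to catch this in the act, the plan Redshift picks can be inspected with `EXPLAIN`; a nested loop typically appears as an `XN Nested Loop` step in the output. A minimal illustrative example (the table choice here is arbitrary -- substitute the full relations.sql statement):

```sql
-- Prefix a statement with EXPLAIN to see the chosen plan without running it.
explain
select r1.relname, r2.relname
from pg_class r1
join pg_class r2
  on r1.relnamespace = r2.relnamespace
where r1.relname != r2.relname;
```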
Expected Behavior
Under normal circumstances, this query takes almost no time to run.
Steps To Reproduce
This appears to be a Redshift query planner decision, and I unfortunately have no idea how to trigger it, nor how to resolve it. I do have some thoughts on what might be causing it, and a potential tweak to push the planner towards a more efficient path under more conditions (see the Additional Context section).
Relevant log output
Environment
Additional Context
This is a fairly difficult anomaly to document because it seems to be related to the Redshift query planner. Our first encounter was about a week ago. The issue lasted for about 14 hours, and then went away. It just manifested again, and this time I was ready to log some of the results as well as test an alternative query I'd prepped in case this recurred.
My best assessment is that the original query is almost always optimized by the query planner to avoid a nested loop, but that edge cases remain where the planner chooses a less optimal path. My approach was to understand the core intent, and then see if I could tweak the statement a bit to hopefully push the query planner towards a more efficient path under a wider range of conditions.
The original logic that threw the planner for a loop (no pun intended) was the `WHERE !=` in the outer statement. Removing this, or setting it to `=`, or pushing it into the `select` as an `=` flag, works fine. For the latter, if I subsequently attempt to filter on the flag (with `NOT`), this predictably ends up with the same nested loop execution plan -- it's the attempt to filter on not-equals in the outer `select` that triggers the nested loop (when the anomaly manifests). A sketch of these variants follows below.
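The fragments below illustrate those variants against the outer select of the full statement (fragments only, relying on the `relation` and `dependency` CTEs shown elsewhere in this thread; `is_eq` is the flag name I used):

```sql
-- Variant A: original outer filter -- the form that can trigger the nested loop:
select distinct
    dep.relation_name as dependent_name,
    ref.relation_name as referenced_name
from dependency
join relation ref on dependency.ref_relation_id = ref.relation_id
join relation dep on dependency.dep_relation_id = dep.relation_id
where ref.relation_name != dep.relation_name;

-- Variant B: the same comparison computed as a select flag -- planner behaves:
select distinct
    dep.relation_name as dependent_name,
    ref.relation_name as referenced_name,
    ref.relation_name = dep.relation_name as is_eq
from dependency
join relation ref on dependency.ref_relation_id = ref.relation_id
join relation dep on dependency.dep_relation_id = dep.relation_id;
-- ...but subsequently filtering on that flag with NOT (e.g. from a wrapping
-- select) reintroduces the nested loop plan when the anomaly manifests.
```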
My adjustment simply pushes the evaluation up to the `dependency` CTE. My thinking was that in doing so, the planner should choose a more efficient path under more conditions.

Because the conditions needed to test the difference between these two approaches are entirely unclear to me/a black box (i.e. I have no idea how to make the query planner decide to make an inefficient choice on demand), I just made a note to wait until the next time this happened, and, if so, to capture the `explain` output as well as the actual results, comparing runtimes and results.

The alternative CTE is simply:
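(The block below is reconstructed from the query as it appears in the debug log earlier in this thread; the original attachment was not captured here.)

```sql
dependency as (
    select distinct
        coalesce(pg_rewrite.ev_class, pg_depend.objid) as dep_relation_id,
        pg_depend.refobjid as ref_relation_id,
        pg_depend.refclassid as ref_class_id
    from pg_depend
    left join pg_rewrite
        on pg_depend.objid = pg_rewrite.oid
    -- the not-equals comparison now happens here, on integer ids,
    -- rather than on relation names in the outer select
    where coalesce(pg_rewrite.ev_class, pg_depend.objid) != pg_depend.refobjid
)
```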
...with the only additional change being removal of the outer-statement `WHERE` (`where ref.relation_name != dep.relation_name`). Thus, the restated full statement would be exactly the query shown in the debug log output earlier in this thread.
I've included the orig_query_with_explain.txt and alt_query_with_explain.txt. Note: the `FALSE AS is_eq` in the outer select of the alt version text file is a testing artifact I forgot to delete -- please ignore it.

Finally, I've tested the outputs on our end and can confirm: