Maximally parallelize dbt clone #10129

MichelleArk · 2024-05-10T18:36:21Z

resolves #7914

Problem

For GraphRunnableTasks such as the CloneTask, the run queue always enforces the topological order of dependencies during execution, even when the task does not strictly require it.

Solution

Allow GraphRunnableTask subclasses to provide a PRESERVE_EDGES class attribute that directs graph queue generation to remove edges from the graph before constructing the priority queue. A class attribute was chosen because this does not need to be user-configurable and should be a constant queue mechanism for all invocations of the clone command.
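To illustrate why dropping edges changes scheduling, here is a minimal sketch using only the Python standard library. It is not dbt's GraphQueue implementation; `ready_batches` and its `preserve_edges` parameter are hypothetical names chosen to mirror the idea described above.

```python
from graphlib import TopologicalSorter


def ready_batches(nodes, edges, preserve_edges=True):
    """Return the successive batches of nodes that become runnable together.

    Hypothetical helper for illustration only, not part of dbt.
    """
    # Map each node to the set of nodes it depends on.
    predecessors = {node: set() for node in nodes}
    if preserve_edges:
        for upstream, downstream in edges:
            predecessors[downstream].add(upstream)
    # With preserve_edges=False the graph keeps its nodes but has no edges.

    sorter = TopologicalSorter(predecessors)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        batch = sorted(sorter.get_ready())
        batches.append(batch)
        sorter.done(*batch)
    return batches


chain = ["m1", "m2", "m3", "m4", "m5"]
edges = list(zip(chain, chain[1:]))  # m1 -> m2 -> m3 -> m4 -> m5

print(ready_batches(chain, edges, preserve_edges=True))   # five batches of one node
print(ready_batches(chain, edges, preserve_edges=False))  # one batch of five nodes
```

With edges preserved, a linear chain can only ever offer one node to the worker threads at a time. With edges removed, every node is ready immediately, so all threads can be kept busy.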

Checklist

I have read the contributing guide and understand what's expected of me
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
This PR includes type annotations for new and modified functions

github-actions · 2024-05-10T18:36:36Z

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

codecov · 2024-05-10T18:41:16Z

Codecov Report

Attention: Patch coverage is 94.44%, with 1 line in your changes missing coverage. Please review.

Project coverage is 88.68%. Comparing base (ecf9436) to head (8ed1a42).
Report is 15 commits behind head on main.

Files	Patch %	Lines
core/dbt/task/runnable.py	90.90%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10129      +/-   ##
==========================================
+ Coverage   88.19%   88.68%   +0.49%     
==========================================
  Files         181      180       -1     
  Lines       22786    22446     -340     
==========================================
- Hits        20096    19907     -189     
+ Misses       2690     2539     -151

Flag	Coverage Δ
integration	`85.96% <94.44%> (+0.50%)`	⬆️
unit	`63.35% <66.66%> (+0.67%)`	⬆️

MichelleArk · 2024-05-22T22:09:41Z

I've done some local 🎩 benchmarking of dbt clone against bigquery (since it supports zero-copy cloning) to compare performance on this branch vs main.

🎩 Setup:

dbt run --target prod
mkdir state
mv target/manifest.json state/

On a project with a chain of 5 models (i.e. 1 -> 2 -> 3 -> 4 -> 5):

Recording average execution time of dbt clone --state state --full-refresh --threads 5 across 5 runs:

main: 15.38s, 34% cpu utilization
this branch: 10.17s, 48% cpu utilization

On a project with a chain of 10 models:

Recording average execution time of dbt clone --state state --full-refresh --threads 10 across 5 runs:

main: 20.41s, 20% cpu utilization
this branch: 9.02s, 57% cpu utilization
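
As a quick worked check of the figures above, the following sketch computes the relative speedup and the seconds saved for each scenario. The timings are copied from this comment; the `speedup` helper is a hypothetical name introduced only for this calculation.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Ratio of baseline runtime to candidate runtime (higher is better)."""
    return baseline_seconds / candidate_seconds


# (main, this branch) average wall-clock seconds, taken from the benchmark above.
runs = {
    "5 models, 5 threads": (15.38, 10.17),
    "10 models, 10 threads": (20.41, 9.02),
}

for label, (main_s, branch_s) in runs.items():
    ratio = speedup(main_s, branch_s)
    saved = main_s - branch_s
    print(f"{label}: {ratio:.2f}x faster, {saved:.2f}s saved")
```

That works out to roughly 1.5x on the 5-model chain and roughly 2.3x on the 10-model chain, which is consistent with the gain growing as the chain gets longer.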

gshank

Really nice that this was fairly simple. Would like just an additional comment as noted on the line.

core/dbt/graph/queue.py

ChenyuLInx

Late to the party, and sorry for nitpicking here.
Let's discuss a bit, as this is probably going to be used as an example for the next set of tests.

ChenyuLInx · 2024-05-24T21:47:50Z

tests/unit/task/test_runnable.py

        self.forced_exception_class = exception_class
        self.did_cancel: bool = False
        super().__init__(args=MockArgs(), config=MockConfig(), manifest=None)
+        self.manifest = make_manifest(nodes=nodes)


This feels like we are reimplementing the task logic in the test again. I view unit tests as documenting behavior in this case, so the only thing we need to test at the runnable task level is: are we passing the correct arguments when calling get_graph_queue?
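
One way to read this suggestion is sketched below: replace the selector with a mock and assert only on the arguments the task passes to get_graph_queue. `FakeTask`, `build_queue`, and the `preserve_edges` keyword are assumptions made for illustration; the real task and selector signatures in dbt-core may differ.

```python
from unittest import mock


class FakeTask:
    """Hypothetical stand-in for a GraphRunnableTask subclass."""

    def __init__(self, selector, independent):
        self.selector = selector
        self.independent = independent

    def build_queue(self, spec):
        # Assumed shape: the task's only decision is whether edges are preserved.
        return self.selector.get_graph_queue(
            spec, preserve_edges=not self.independent
        )


def check_queue_arguments(independent):
    """Run the task against a mock selector and return the flag it received."""
    selector = mock.Mock()
    FakeTask(selector, independent).build_queue("spec")
    # The test documents behavior: which arguments reach get_graph_queue.
    selector.get_graph_queue.assert_called_once_with(
        "spec", preserve_edges=not independent
    )
    return selector.get_graph_queue.call_args.kwargs["preserve_edges"]


print(check_queue_arguments(independent=True))
print(check_queue_arguments(independent=False))
```

A test in this style needs neither a manifest nor a real selector, which addresses the concern about reimplementing task logic inside the test.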

ChenyuLInx · 2024-05-24T21:48:26Z

tests/unit/task/test_runnable.py

@@ -40,13 +63,25 @@ def _cancel_connections(self, pool):

    def get_node_selector(self):
        """This is an `abstract_method` on `GraphRunnableTask`, thus we must implement it"""
-        return None
+        selector = ResourceTypeSelector(


ChenyuLInx · 2024-05-24T21:49:28Z

tests/unit/task/test_runnable.py


    def defer_to_manifest(self, adapter, selected_uids: AbstractSet[str]):
        """This is an `abstract_method` on `GraphRunnableTask`, thus we must implement it"""
        return None


+class MockRunnableTaskIndependent(MockRunnableTask):
+    def get_run_mode(self) -> GraphRunnableMode:
+        return GraphRunnableMode.Independent


I think we are using inheritance where we should be using patch
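
A minimal sketch of the patch-based alternative follows, using `unittest.mock.patch.object` to override `get_run_mode` for the duration of a test instead of defining a subclass. The classes here are simplified stand-ins, not the ones from tests/unit/task/test_runnable.py, and the `Topological` member is an assumption; only `Independent` appears in the diff above.

```python
from enum import Enum
from unittest import mock


class GraphRunnableMode(Enum):
    """Stand-in enum; the Topological member name is assumed."""

    Topological = "topological"
    Independent = "independent"


class MockRunnableTask:
    """Simplified stand-in for the test double in the diff above."""

    def get_run_mode(self):
        return GraphRunnableMode.Topological


def run_mode_under_patch():
    task = MockRunnableTask()
    # Override the method only inside the with-block; no subclass needed.
    with mock.patch.object(
        MockRunnableTask,
        "get_run_mode",
        return_value=GraphRunnableMode.Independent,
    ):
        patched = task.get_run_mode()
    # The original method is restored once the block exits.
    restored = task.get_run_mode()
    return patched, restored


print(run_mode_under_patch())
```

The trade-off is that a patch keeps the override local to the test that needs it, whereas a subclass such as MockRunnableTaskIndependent adds another test double to maintain.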

ChenyuLInx · 2024-05-24T21:52:10Z

tests/unit/utils/__init__.py

@@ -387,3 +388,17 @@ def replace_config(n, **kwargs):
        config=n.config.replace(**kwargs),
        unrendered_config=dict_replace(n.unrendered_config, **kwargs),
    )
+
+
+def make_manifest(nodes=[], sources=[], macros=[], docs=[]) -> Manifest:


I think this duplicates the existing helper at dbt-core/tests/unit/utils/manifest.py, line 986 in 84456f5:

def manifest(

Thoughts on how we can make that one easier to use or more visible?

first pass

8a491dd

cla-bot bot added the cla:yes label May 10, 2024

MichelleArk added 3 commits May 21, 2024 23:06

add unit testing for GraphQueue initialization

03a5582

add unit testing for GraphRunnableTask.PRESERVE_EDGES

a769437

changelog entry

6456b02

refactor: PRESERVE_EDGES -> GraphRunnableMode

ac7ae3b

MichelleArk marked this pull request as ready for review May 22, 2024 22:21

MichelleArk requested a review from a team as a code owner May 22, 2024 22:21

gshank approved these changes May 23, 2024

add comment

8ed1a42

MichelleArk merged commit fb10bb4 into main May 23, 2024
63 checks passed

MichelleArk deleted the graph-runnable-task-no-edges branch May 23, 2024 15:06

ChenyuLInx reviewed May 24, 2024

ChenyuLInx mentioned this pull request Jun 4, 2024

Unit test GraphQueue class #9872

Closed

1 task
