Improve dbt seed times by removing type lookup and casting during INSERT #493

Closed
wants to merge 1 commit into from

nrichards17
Contributor

Resolves #476

Description

Several users have noticed slow run times when loading dbt seeds with >1k records using the dbt-databricks adapter, with run times becoming prohibitively slow for seeds with >10k records. This PR substantially reduces that run time by removing the unnecessary type lookup and casting in the INSERT statement.

dbt seeds are essentially built in two steps:

  1. The table is created with the appropriate column types (explicit or inferred) with CREATE TABLE AS ...
  2. The values are loaded into that table with INSERT OVERWRITE INTO schema.table VALUES ...

For some reason, the second step performs both a type lookup and a subsequent CAST(x AS type) for every single value (rows × columns) in the seed. This is redundant, since the type information was already used to define the column types during table creation.
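To make the difference concrete, here is a minimal Python sketch of the two shapes the generated INSERT can take. It is not the adapter's actual macro (which is written in Jinja); `render_insert`, the table name `my_schema.my_seed`, and the column types are hypothetical names chosen for illustration.

```python
from typing import Optional, Sequence


def render_insert(
    table: str,
    rows: Sequence[Sequence[object]],
    column_types: Optional[Sequence[str]] = None,
) -> str:
    """Render a parameterized INSERT for one batch of seed rows.

    If column_types is given, every placeholder is wrapped in a cast
    (one cast per value). Otherwise plain placeholders are emitted and
    the table's existing column types are relied on instead.
    """
    n_cols = len(rows[0])
    if column_types is None:
        cells = ["%s"] * n_cols
    else:
        cells = [f"cast(%s as {t})" for t in column_types]
    row_sql = "(" + ", ".join(cells) + ")"
    values_sql = ",\n  ".join(row_sql for _ in rows)
    return f"insert into {table} values\n  {values_sql}"


# Hypothetical two-row, three-column seed.
rows = [(1, "a", 1.5), (2, "b", 2.5)]
with_casts = render_insert("my_schema.my_seed", rows, ["bigint", "string", "double"])
without_casts = render_insert("my_schema.my_seed", rows)

print(with_casts)
print(without_casts)
```

The first statement contains one cast per value (rows × columns), while the second leaves type handling to the column definitions from table creation.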

Removing these steps significantly speeds up the seed run times. For example, I was able to load a seed with 47k records and 9 columns in about 1 minute with this change, whereas previously the seed hadn't even finished after 10+ minutes of loading.
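For a sense of scale, the number of per-value casts in that example can be estimated as rows × columns. The snippet below assumes exactly 47,000 rows, although the figure above is only approximate, and `cast_expression_count` is a hypothetical helper.

```python
def cast_expression_count(n_rows: int, n_cols: int) -> int:
    """Number of per-value casts when every cell is cast individually."""
    return n_rows * n_cols


# Assumed size: 47,000 rows (approximate) and 9 columns.
total_casts = cast_expression_count(47_000, 9)
print(total_casts)  # 423000
```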

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@benc-db
Collaborator

benc-db commented Oct 27, 2023

@nrichards17 I'm on vacation next week, but I'll run this change against our test suite when I get back, and assuming it passes, will try to sneak it in before I release 1.7.0.

@benc-db
Collaborator

benc-db commented Oct 27, 2023

Thanks for the PR, appreciate it.

@nrichards17
Contributor Author

Happy to help, thank you @benc-db!

@benc-db
Collaborator

benc-db commented Nov 7, 2023

Closing in favor of #498, which can run against our infra. Thanks @nrichards17!

Successfully merging this pull request may close these issues.

dbt seed never completing