Improve dbt seed times by removing type lookup and casting during INSERT #493

nrichards17 · 2023-10-27T22:50:51Z

Resolves #476

Description

Several users have noticed slow run times for loading dbt seeds with >1k records when using the dbt-databricks adapter, with run times becoming prohibitively slow for seeds with >10k records. This PR speeds up that run time substantially by removing unnecessary type lookup and casting during the INSERT statement.

dbt seeds are essentially built in two steps:

1. The table is created with the appropriate column types (explicit or inferred) with CREATE TABLE AS ...
2. The values are loaded into that table with INSERT OVERWRITE schema.table VALUES ...

For some reason, the second step performs both a type lookup and a subsequent CAST(x AS type) for every single value (rows x columns) in the seed. This is redundant, since the type information was already used to define the column types during table creation.
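To make the difference concrete, here is a rough Python sketch of the two statement shapes. It is not the adapter's actual macro: the `build_insert` helper, the `%s` placeholder style, and the `my_schema.my_seed` table name are all made up for illustration. It only shows how per-value casts scale with rows x columns, which works out to roughly 423,000 CAST expressions for a seed of about 47k rows and 9 columns.

```python
from typing import Sequence


def build_insert(table: str, column_types: Sequence[str], n_rows: int,
                 cast_values: bool) -> str:
    """Build a parameterized INSERT statement for n_rows rows.

    Hypothetical helper for illustration only; it is not part of dbt or
    dbt-databricks. "%s" stands in for a bound value.
    """
    def placeholder(col_type: str) -> str:
        # Old shape: wrap every single value in a cast to its column type.
        # New shape: bare placeholder, relying on the column types that were
        # already fixed when the table was created.
        return f"cast(%s as {col_type})" if cast_values else "%s"

    row = "(" + ", ".join(placeholder(t) for t in column_types) + ")"
    return f"insert overwrite {table} values " + ", ".join([row] * n_rows)


# Assumed column types for a tiny three-column seed.
column_types = ["bigint", "string", "date"]

with_casts = build_insert("my_schema.my_seed", column_types,
                          n_rows=2, cast_values=True)
without_casts = build_insert("my_schema.my_seed", column_types,
                             n_rows=2, cast_values=False)

print(with_casts)
print(without_casts)

# Number of CAST expressions grows as rows x columns.
casts_for_large_seed = 47_000 * 9
print(casts_for_large_seed)
```

With casting enabled, the two-row example already contains six CAST expressions; without it, the statement is just the bare value placeholders.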

Removing these steps significantly speeds up the seed run times. For example, I was able to load a seed with 47k records and 9 columns in about 1 minute with this change, whereas previously the seed hadn't even finished after 10+ minutes of loading.

Checklist

I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

benc-db · 2023-10-27T22:52:51Z

@nrichards17 I'm on vacation next week, but I'll run this change against our test suite when I get back, and assuming it passes, will try to sneak it in before I release 1.7.0.

benc-db · 2023-10-27T22:53:21Z

Thanks for the PR, appreciate it.

nrichards17 · 2023-10-27T22:54:43Z

happy to help, thank you @benc-db !

benc-db · 2023-11-07T21:43:09Z

Closing in favor of #498, which can run against our infra. Thanks @nrichards17!

Remove type lookup and casting in double loop during INSERT statement…

a68509b

… when loading dbt seeds.

nrichards17 requested review from andrefurlan-db, susodapop, benc-db and rcypher-databricks as code owners October 27, 2023 22:50

benc-db temporarily deployed to azure-prod November 7, 2023 20:51 — with GitHub Actions Inactive

benc-db had a problem deploying to azure-prod November 7, 2023 20:51 — with GitHub Actions Failure

benc-db temporarily deployed to azure-prod November 7, 2023 21:22 — with GitHub Actions Inactive

benc-db mentioned this pull request Nov 7, 2023

Faster dbt seeds #498

Merged

3 tasks

benc-db closed this Nov 7, 2023

nrichards17 mentioned this pull request Nov 8, 2023

Seed is slow with big CSV files with many rows #500

Closed
