
Seed is slow with big CSV files with many rows #500

Closed

leo-schick opened this issue Nov 8, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@leo-schick

Describe the bug

Using dbt seed with the databricks adapter is quite slow when processing big CSV files with many rows. I have two larger CSV files with more than 100k rows each, and they take far too long:

| Model | Rows | File Size | Runtime | Runtime in minutes |
| --- | --- | --- | --- | --- |
| CompanyLinkCustomerAccount | 863,146 | 21 MB | 1336.36s | 22.3 minutes |
| MasterCustomerAccountLink | 389,598 | 16 MB | 820.04s | 13.6 minutes |

Steps To Reproduce

  1. Create a CSV file with more than 100k rows
  2. Add the CSV as a seed
  3. Execute dbt seed -s <your_file>

Expected behavior

I would have expected a runtime of about 1-5 minutes at most.

Screenshots and log output

(screenshot of the dbt log output)

System information

Running with dbt=1.6.2
Registered adapter: databricks=1.6.4

Databricks Cluster configuration

(screenshot of the Databricks cluster configuration)

@leo-schick leo-schick added the bug Something isn't working label Nov 8, 2023
@nrichards17
Contributor

Hey @leo-schick, I helped resolve this slow seed problem in #493! I believe that fix is now in main but has not been released yet; it will ship as part of the 1.7.0 release. I haven't had a seed with >100k rows, but I was able to get seeds with 50k records to load in ~1 minute each.

(Duplicate of #476)

@benc-db
Collaborator

benc-db commented Nov 9, 2023

What @nrichards17 said. The one caveat is with parquet as the landed file format, where we still have to do the slow thing. 1.7.0 will be out shortly (hopefully today or tomorrow).

@benc-db benc-db closed this as completed Nov 9, 2023
@leo-schick
Author

The time improved, but I would say it is still not good enough. After upgrading to 1.7.1, I now have a total runtime of around 15 minutes.

The models shown above now have these runtimes:

| Model | Rows | File Size | Runtime | Runtime in minutes |
| --- | --- | --- | --- | --- |
| CompanyLinkCustomerAccount | 863,146 | 21 MB | 911.19s | 15.2 minutes |
| MasterCustomerAccountLink | 389,598 | 16 MB | 556.46s | 9.3 minutes |

This leads to the following comparison:

| Model | Runtime before | Runtime after | Delta | Delta in % |
| --- | --- | --- | --- | --- |
| CompanyLinkCustomerAccount | 22.3 minutes | 15.2 minutes | -7.1 minutes | -31.8 % |
| MasterCustomerAccountLink | 13.6 minutes | 9.3 minutes | -4.3 minutes | -31.6 % |

So, on average a performance improvement of 32% was gained by #493.

Since we work here with big data technology, I think this is still far from acceptable.

I don't know if this has been tested, but I guess running a Python model with spark.read.option("delimiter", ",").csv(path) would perform better. But I guess that in a Python model one cannot use CSV files from the code base.
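
For illustration, here is a minimal, untested sketch of what such a Python model could look like, assuming the CSV is first uploaded to a storage path the cluster can read (the volume path below is hypothetical), since a Python model cannot read a file that only exists inside the dbt repository:

```python
# models/company_link_customer_account.py
# Minimal sketch, not a tested implementation. The path below is a placeholder
# for wherever the CSV would actually be uploaded.
def model(dbt, session):
    dbt.config(materialized="table")

    # Hypothetical Unity Catalog volume path holding the uploaded CSV.
    path = "/Volumes/my_catalog/my_schema/seeds/CompanyLinkCustomerAccount.csv"

    # session is a SparkSession, so the suggestion above maps directly:
    return (
        session.read
        .option("header", "true")
        .option("delimiter", ",")
        .csv(path)
    )
```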

@leo-schick leo-schick mentioned this issue Nov 17, 2023
@nrichards17
Contributor

@leo-schick I think you're starting to hit the limit of what dbt seeds were designed for; if you're getting into the hundreds of thousands of records, I'd recommend starting to look at a different form of ingestion into your warehouse.

The dbt seed command isn't a very efficient way to ingest large files because it essentially has to construct one big INSERT INTO SQL statement containing all of the values in the local CSV.

If you still need to use seeds, you could take inspiration from The Tuva Project: they host their large files externally in cloud storage (e.g. S3) but still use the dbt seed command, together with a database-specific load__seed macro (customized per database type) that leverages the COPY INTO statement and is called as a post-hook defined in the dbt_project.yml file.
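
This is not the Tuva Project's actual load__seed macro; purely as a hedged illustration, the COPY INTO statement such a macro would wrap could look roughly like the following, run here through a Databricks SparkSession (the table name and storage path are hypothetical placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The target table must exist before COPY INTO; an empty Delta table is
# enough when schema evolution ('mergeSchema') is enabled below.
spark.sql("CREATE TABLE IF NOT EXISTS my_schema.company_link_customer_account")

# Load the externally hosted CSV in one bulk operation instead of one huge
# INSERT INTO ... VALUES statement.
spark.sql("""
    COPY INTO my_schema.company_link_customer_account
    FROM 's3://my-bucket/seeds/CompanyLinkCustomerAccount.csv'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```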

@benc-db
Collaborator

benc-db commented Nov 17, 2023

@leo-schick what @nrichards17 says is correct. Seeds are not intended for tables with hundreds of thousands of rows. I'm not even sure that seeds should be supported in dbt at all (personal opinion, obviously not shared by dbt-core), since dbt aims to be your transform layer, not your ingest layer. They exist as a convenience for when you have to, for example, manage a small exclusion list that otherwise doesn't exist in your system. From the dbt docs:

Seeds are best suited to static data which changes infrequently.

Good use-cases for seeds:

  • A list of mappings of country codes to country names
  • A list of test emails to exclude from analysis
  • A list of employee account IDs

Poor use-cases of dbt seeds:

  • Loading raw data that has been exported to CSVs
  • Any kind of production data containing sensitive information, for example personally identifiable information (PII) and passwords.

You may find https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/ a more appropriate tool for this case.
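
For reference, a minimal Auto Loader sketch along those lines, assuming the CSV files are first landed in a cloud storage or volume path; every path and the target table name below are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(
    spark.readStream.format("cloudFiles")      # Auto Loader source
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation",
            "/Volumes/my_catalog/my_schema/_schemas/company_link")
    .option("header", "true")
    .load("/Volumes/my_catalog/my_schema/landing/company_link/")
    .writeStream
    .option("checkpointLocation",
            "/Volumes/my_catalog/my_schema/_checkpoints/company_link")
    .trigger(availableNow=True)                # process all pending files, then stop
    .toTable("my_catalog.my_schema.company_link_customer_account")
)
```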

@leo-schick
Author

In my use case, it is neither of the above: the data is static, frozen data that will never change, since the old system it came from no longer exists. It is something like a customer account ID mapping between the old and new system. It felt right to keep this data with the code, since it is not part of any source system, and opening a new data silo (e.g. a separate standalone database) felt inconvenient. That brought me to the idea of putting it into dbt seed CSV files in the repository where the data logic lives.
