Seed is slow with big CSV files with many rows #500
Comments
Hey @leo-schick, I helped resolve this slow seed problem in #493! I believe that fix is now in. (Duplicate of #476)
What @nrichards17 said. The one caveat is with parquet as the landed file format, where we still have to do the slow thing. 1.7.0 will be out shortly (hopefully today or tomorrow).
The runtime improved, but I would say it is still not good enough. After upgrading to 1.7.1, my total runtime is now around 15 minutes. The models shown above now have these runtimes:
This leads to the following comparison:
So, on average, a performance improvement of 32% was gained by #493. Since we work with big data technology here, I think this is still far from acceptable. I don't know if that has been tested, but I guess running a python model with …
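(The comment above is cut off, but for context, a dbt Python model on the Databricks adapter would read the whole file with Spark in one pass instead of going through the seed path. A minimal sketch, assuming the CSV is staged at a made-up volume path; the file location and column handling are placeholders, not something from this issue:

```python
# models/company_link_customer_account.py -- hypothetical dbt Python model
def model(dbt, session):
    dbt.config(materialized="table")

    # `session` is the SparkSession provided by the Databricks adapter.
    # Read the CSV in a single pass rather than inserting it row by row.
    df = (
        session.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/Volumes/main/static_data/CompanyLinkCustomerAccount.csv")
    )
    return df
```
)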
@leo-schick I think you're starting to hit the limit of what dbt seeds were designed for; if you're getting into the hundreds of thousands of records, I'd recommend starting to look at a different form of ingestion into your warehouse. If you still need to use seeds, you could take inspiration from The Tuva Project, where they host their large files externally in cloud storage (e.g. S3) but still use the …
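(The Tuva Project's actual loading macros aren't reproduced here, but the general pattern of hosting large files in cloud storage and loading them outside of `dbt seed` could look roughly like this one-off PySpark job; the bucket, path, and table name are invented for the sketch:

```python
# Run in a Databricks notebook or job; `spark` is the ambient SparkSession.
# All paths and names below are placeholders.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-bucket/static-data/MasterCustomerAccountLink.csv")
)

df.write.mode("overwrite").saveAsTable("raw.master_customer_account_link")
```

Downstream dbt models could then reference `raw.master_customer_account_link` as a source rather than a seed.)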
@leo-schick what @nrichards17 says is correct. Seeds are not intended for tables with hundreds of thousands of rows. I'm not even sure that seeds should be supported in dbt at all (personal opinion, obviously not shared by dbt-core), since dbt aims to be your transform layer, not your ingest layer. They exist as a convenience for when you have to, for example, manage a small exclusion list that otherwise doesn't exist in your system. From the dbt docs: seeds are best suited to static data which changes infrequently.
Good use-cases for seeds:
- A list of mappings of country codes to country names
- A list of test emails to exclude from analysis
- A list of employee account IDs
Poor use-cases of dbt seeds:
- Loading raw data that has been exported to CSVs
- Any kind of production data containing sensitive information (e.g. PII)
You may find https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/ a more appropriate tool for this case.
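(For completeness, a hedged sketch of what an Auto Loader ingestion of such CSV files might look like; the paths, schema location, and table name are placeholders, not taken from this issue:

```python
# Databricks notebook/job; `spark` is the ambient SparkSession.
stream = (
    spark.readStream
    .format("cloudFiles")                       # Auto Loader
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/customer_links/")
    .option("header", "true")
    .load("s3://my-bucket/landing/customer_links/")
)

(
    stream.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/customer_links/")
    .trigger(availableNow=True)                 # process available files, then stop
    .toTable("raw.customer_account_link")
)
```
)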
In my use case, it is neither of the above: the data is "frozen" static data that will never change, since the old system does not exist anymore. It is something like a mapping of customer account IDs between the old and the new system. It felt right to keep this data with the code, since it is not part of any source system and opening a new data silo (e.g. a separate standalone database) felt inconvenient. That brought me to the idea of putting it into dbt seed CSV files in the repository where the data logic lives.
Describe the bug
Using dbt seed with the databricks adapter is quite slow when processing big CSV files with many rows. I have two larger CSV files with more than 100k rows each; these take too long:
CompanyLinkCustomerAccount
MasterCustomerAccountLink
Steps To Reproduce
dbt seed -s <your_file>
Expected behavior
I would have expected a runtime of about 1-5 minutes at most.
Screenshots and log output
System information
Running with dbt=1.6.2
Registered adapter: databricks=1.6.4
Databricks Cluster configuration