diff --git a/contributing/adding-page-components.md b/contributing/adding-page-components.md index 751f7c1f6c1..a07d0ff02e4 100644 --- a/contributing/adding-page-components.md +++ b/contributing/adding-page-components.md @@ -1,6 +1,6 @@ ## Using warehouse components -You can use the following components to provide code snippets for each supported warehouse. You can see a real-life example in the docs page [Initialize your project](/quickstarts/databricks?step=6). +You can use the following components to provide code snippets for each supported warehouse. You can see a real-life example in the docs page [Initialize your project](/guides/databricks?step=6). Identify code by labeling with the warehouse names: diff --git a/contributing/content-style-guide.md b/contributing/content-style-guide.md index 688a6d21175..0d2bf243d45 100644 --- a/contributing/content-style-guide.md +++ b/contributing/content-style-guide.md @@ -360,7 +360,7 @@ Otherwise, the text will appear squished and provide users with a bad experience - ``: creates 5 columns cards (use sparingly) - You can't create cards with 6 or more columns as that would provide users a poor experience. -Refer to [dbt Cloud features](/docs/cloud/about-cloud/dbt-cloud-features) and [Quickstarts](/docs/quickstarts/overview) as examples. +Refer to [dbt Cloud features](/docs/cloud/about-cloud/dbt-cloud-features) and [Quickstarts](/docs/guides) as examples. ### Create cards diff --git a/website/blog/2021-02-05-dbt-project-checklist.md b/website/blog/2021-02-05-dbt-project-checklist.md index dbe2c10f408..9820c279b0f 100644 --- a/website/blog/2021-02-05-dbt-project-checklist.md +++ b/website/blog/2021-02-05-dbt-project-checklist.md @@ -139,7 +139,7 @@ This post is the checklist I created to guide our internal work, and I’m shari * [Sources](/docs/build/sources/) * [Refs](/reference/dbt-jinja-functions/ref/) * [tags](/reference/resource-configs/tags/) -* [Jinja docs](/guides/advanced/using-jinja) +* [Jinja docs](/guides/using-jinja) ## ✅ Testing & Continuous Integration ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- @@ -156,7 +156,7 @@ This post is the checklist I created to guide our internal work, and I’m shari **Useful links** -* [Version control](/guides/legacy/best-practices#version-control-your-dbt-project) +* [Version control](/best-practices/best-practice-workflows#version-control-your-dbt-project) * [dbt Labs' PR Template](/blog/analytics-pull-request-template) ## ✅ Documentation @@ -252,7 +252,7 @@ Thanks to Christine Berger for her DAG diagrams! 
**Useful links** -* [How we structure our dbt Project](/guides/best-practices/how-we-structure/1-guide-overview) +* [How we structure our dbt Project](/best-practices/how-we-structure/1-guide-overview) * [Coalesce DAG Audit Talk](https://www.youtube.com/watch?v=5W6VrnHVkCA&t=2s) * [Modular Data Modeling Technique](https://getdbt.com/analytics-engineering/modular-data-modeling-technique/) * [Understanding Threads](/docs/running-a-dbt-project/using-threads) diff --git a/website/blog/2021-02-09-how-to-configure-your-dbt-repository-one-or-many.md b/website/blog/2021-02-09-how-to-configure-your-dbt-repository-one-or-many.md index 50d09625436..8a986a12f27 100644 --- a/website/blog/2021-02-09-how-to-configure-your-dbt-repository-one-or-many.md +++ b/website/blog/2021-02-09-how-to-configure-your-dbt-repository-one-or-many.md @@ -159,4 +159,4 @@ All of the above configurations “work”. And as detailed, they each solve for 2. Figure out what may be a pain point in the future and try to plan for it from the beginning. 3. Don’t over-complicate things until you have the right reason. As I said in my Coalesce talk: **don’t drag your skeletons from one closet to another** 💀! -**Note:** Our attempt in writing guides like this and [How we structure our dbt projects](/guides/best-practices/how-we-structure/1-guide-overview) aren’t to try to convince you that our way is right; it is to hopefully save you the hundreds of hours it has taken us to form those opinions! +**Note:** Our attempt in writing guides like this and [How we structure our dbt projects](/best-practices/how-we-structure/1-guide-overview) aren’t to try to convince you that our way is right; it is to hopefully save you the hundreds of hours it has taken us to form those opinions! diff --git a/website/blog/2021-11-23-how-to-upgrade-dbt-versions.md b/website/blog/2021-11-23-how-to-upgrade-dbt-versions.md index 69ca0b2522c..3aa9368a2ca 100644 --- a/website/blog/2021-11-23-how-to-upgrade-dbt-versions.md +++ b/website/blog/2021-11-23-how-to-upgrade-dbt-versions.md @@ -156,7 +156,7 @@ Once your compilation issues are resolved, it's time to run your job for real, t After that, make sure that your CI environment in dbt Cloud or your orchestrator is on the right dbt version, then open a PR. -If you're using [Slim CI](https://docs.getdbt.com/docs/guides/best-practices#run-only-modified-models-to-test-changes-slim-ci), keep in mind that artifacts aren't necessarily compatible from one version to another, so you won't be able to use it until the job you defer to has completed a run with the upgraded dbt version. This doesn't impact our example because support for Slim CI didn't come out until 0.18.0. +If you're using [Slim CI](https://docs.getdbt.com/docs/best-practices#run-only-modified-models-to-test-changes-slim-ci), keep in mind that artifacts aren't necessarily compatible from one version to another, so you won't be able to use it until the job you defer to has completed a run with the upgraded dbt version. This doesn't impact our example because support for Slim CI didn't come out until 0.18.0. ## Step 7. 
Merge and communicate diff --git a/website/blog/2021-11-26-welcome-to-the-dbt-developer-blog.md b/website/blog/2021-11-26-welcome-to-the-dbt-developer-blog.md index c6fff54b465..8db2407afdb 100644 --- a/website/blog/2021-11-26-welcome-to-the-dbt-developer-blog.md +++ b/website/blog/2021-11-26-welcome-to-the-dbt-developer-blog.md @@ -26,7 +26,7 @@ So let’s all commit to sharing our hard won knowledge with each other—and in The purpose of this blog is to double down on our long running commitment to contributing to the knowledge loop. -From early posts like ‘[The Startup Founders Guide to Analytics’](https://thinkgrowth.org/the-startup-founders-guide-to-analytics-1d2176f20ac1) to foundational guides like [‘How We Structure Our dbt Projects](/guides/best-practices/how-we-structure/1-guide-overview)’, we’ve had a long standing goal of working with the community to create practical, hands-on tutorials and guides which distill the knowledge we’ve been able to collectively gather. +From early posts like ‘[The Startup Founders Guide to Analytics’](https://thinkgrowth.org/the-startup-founders-guide-to-analytics-1d2176f20ac1) to foundational guides like [‘How We Structure Our dbt Projects](/best-practices/how-we-structure/1-guide-overview)’, we’ve had a long standing goal of working with the community to create practical, hands-on tutorials and guides which distill the knowledge we’ve been able to collectively gather. dbt as a product is based around the philosophy that even the most complicated problems can be broken down into modular, reusable components, then mixed and matched to create something novel. diff --git a/website/blog/2021-11-29-dbt-airflow-spiritual-alignment.md b/website/blog/2021-11-29-dbt-airflow-spiritual-alignment.md index fd1a11c41cf..b179c0f5c7c 100644 --- a/website/blog/2021-11-29-dbt-airflow-spiritual-alignment.md +++ b/website/blog/2021-11-29-dbt-airflow-spiritual-alignment.md @@ -91,7 +91,7 @@ The common skills needed for implementing any flavor of dbt (Core or Cloud) are: * SQL: ‘nuff said * YAML: required to generate config files for [writing tests on data models](/docs/build/tests) -* [Jinja](/guides/advanced/using-jinja): allows you to write DRY code (using [macros](/docs/build/jinja-macros), for loops, if statements, etc) +* [Jinja](/guides/using-jinja): allows you to write DRY code (using [macros](/docs/build/jinja-macros), for loops, if statements, etc) YAML + Jinja can be learned pretty quickly, but SQL is the non-negotiable you’ll need to get started. @@ -176,7 +176,7 @@ Instead you can now use the following command: `dbt build –select result:error+ –defer –state ` … and that’s it! -You can see more examples [here](https://docs.getdbt.com/docs/guides/best-practices#run-only-modified-models-to-test-changes-slim-ci). +You can see more examples [here](https://docs.getdbt.com/docs/best-practices#run-only-modified-models-to-test-changes-slim-ci). This means that whether you’re actively developing or you simply want to rerun a scheduled job (because of, say, permission errors or timeouts in your database), you now have a unified approach to doing both. 
diff --git a/website/blog/2021-12-05-how-to-build-a-mature-dbt-project-from-scratch.md b/website/blog/2021-12-05-how-to-build-a-mature-dbt-project-from-scratch.md index c4de04a48c3..8ea387cf00c 100644 --- a/website/blog/2021-12-05-how-to-build-a-mature-dbt-project-from-scratch.md +++ b/website/blog/2021-12-05-how-to-build-a-mature-dbt-project-from-scratch.md @@ -69,7 +69,7 @@ In addition to learning the basic pieces of dbt, we're familiarizing ourselves w If we decide not to do this, we end up missing out on what the dbt workflow has to offer. If you want to learn more about why we think analytics engineering with dbt is the way to go, I encourage you to read the [dbt Viewpoint](/community/resources/viewpoint#analytics-is-collaborative)! -In order to learn the basics, we’re going to [port over the SQL file](/guides/migration/tools/refactoring-legacy-sql) that powers our existing "patient_claim_summary" report that we use in our KPI dashboard in parallel to our old transformation process. We’re not ripping out the old plumbing just yet. In doing so, we're going to try dbt on for size and get used to interfacing with a dbt project. +In order to learn the basics, we’re going to [port over the SQL file](/guides/refactoring-legacy-sql) that powers our existing "patient_claim_summary" report that we use in our KPI dashboard in parallel to our old transformation process. We’re not ripping out the old plumbing just yet. In doing so, we're going to try dbt on for size and get used to interfacing with a dbt project. **Project Appearance** diff --git a/website/blog/2022-02-07-customer-360-view-census-playbook.md b/website/blog/2022-02-07-customer-360-view-census-playbook.md index 01bea4b09c5..71acb32fe94 100644 --- a/website/blog/2022-02-07-customer-360-view-census-playbook.md +++ b/website/blog/2022-02-07-customer-360-view-census-playbook.md @@ -30,7 +30,7 @@ In short, a jaffle is: *See above: Tasty, tasty jaffles.* -Jaffle Shop is a demo repo referenced in [dbt’s Getting Started Guide](/quickstarts), and its jaffles hold a special place in the dbt community’s hearts, as well as on Data Twitter™. +Jaffle Shop is a demo repo referenced in [dbt’s Getting Started Guide](/guides), and its jaffles hold a special place in the dbt community’s hearts, as well as on Data Twitter™. ![jaffles on data twitter](/img/blog/2022-02-08-customer-360-view/image_1.png) diff --git a/website/blog/2022-05-17-stakeholder-friendly-model-names.md b/website/blog/2022-05-17-stakeholder-friendly-model-names.md index 0e0ccad5c96..39107035465 100644 --- a/website/blog/2022-05-17-stakeholder-friendly-model-names.md +++ b/website/blog/2022-05-17-stakeholder-friendly-model-names.md @@ -157,7 +157,7 @@ These 3 parts go from least granular (general) to most granular (specific) so yo ### Coming up... -In this part of the series, we talked about why the model name is the center of understanding for the purpose and content within a model. In the in the upcoming ["How We Structure Our dbt Projects"](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) guide, you can explore how to use this naming pattern with more specific examples in different parts of your dbt DAG that cover regular use cases: +In this part of the series, we talked about why the model name is the center of understanding for the purpose and content within a model. 
In the in the upcoming ["How We Structure Our dbt Projects"](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview) guide, you can explore how to use this naming pattern with more specific examples in different parts of your dbt DAG that cover regular use cases: - How would you name a model that is filtered on some columns - Do we recommend naming snapshots in a specific way diff --git a/website/blog/2022-06-30-lower-sql-function.md b/website/blog/2022-06-30-lower-sql-function.md index c50af5f3fb3..3f7cff44ccb 100644 --- a/website/blog/2022-06-30-lower-sql-function.md +++ b/website/blog/2022-06-30-lower-sql-function.md @@ -75,7 +75,7 @@ After running this query, the `customers` table will look a little something lik Now, all characters in the `first_name` and `last_name` columns are lowercase. > **Where do you lower?** -> Changing all string columns to lowercase to create uniformity across data sources typically happens in our dbt project’s [staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging). There are a few reasons for that: data cleanup and standardization, such as aliasing, casting, and lowercasing, should ideally happen in staging models to create downstream uniformity. It’s also more performant in downstream models that join on string values to join on strings that are of all the same casing versus having to join and perform lowercasing at the same time. +> Changing all string columns to lowercase to create uniformity across data sources typically happens in our dbt project’s [staging models](https://docs.getdbt.com/best-practices/how-we-structure/2-staging). There are a few reasons for that: data cleanup and standardization, such as aliasing, casting, and lowercasing, should ideally happen in staging models to create downstream uniformity. It’s also more performant in downstream models that join on string values to join on strings that are of all the same casing versus having to join and perform lowercasing at the same time. ## Why we love it diff --git a/website/blog/2022-07-19-migrating-from-stored-procs.md b/website/blog/2022-07-19-migrating-from-stored-procs.md index 691284a49e9..e2afdbfcd66 100644 --- a/website/blog/2022-07-19-migrating-from-stored-procs.md +++ b/website/blog/2022-07-19-migrating-from-stored-procs.md @@ -54,7 +54,7 @@ With dbt, we work towards creating simpler, more transparent data pipelines like ![Diagram of what data flows look like with dbt. It's easier to trace lineage in this setup.](/img/blog/2022-07-19-migrating-from-stored-procs/dbt-diagram.png) -Tight [version control integration](https://docs.getdbt.com/docs/guides/best-practices#version-control-your-dbt-project) is an added benefit of working with dbt. By leveraging the power of git-based tools, dbt enables you to integrate and test changes to transformation pipelines much faster than you can with other approaches. We often see teams who work in stored procedures making changes to their code without any notion of tracking those changes over time. While that’s more of an issue with the team’s chosen workflow than a problem with stored procedures per se, it does reflect how legacy tooling makes analytics work harder than necessary. +Tight [version control integration](https://docs.getdbt.com/docs/best-practices#version-control-your-dbt-project) is an added benefit of working with dbt. 
By leveraging the power of git-based tools, dbt enables you to integrate and test changes to transformation pipelines much faster than you can with other approaches. We often see teams who work in stored procedures making changes to their code without any notion of tracking those changes over time. While that’s more of an issue with the team’s chosen workflow than a problem with stored procedures per se, it does reflect how legacy tooling makes analytics work harder than necessary. ## Methodologies for migrating from stored procedures to dbt diff --git a/website/blog/2022-07-26-pre-commit-dbt.md b/website/blog/2022-07-26-pre-commit-dbt.md index fc100897ff0..e75bd622293 100644 --- a/website/blog/2022-07-26-pre-commit-dbt.md +++ b/website/blog/2022-07-26-pre-commit-dbt.md @@ -12,7 +12,7 @@ is_featured: true *Editor's note — since the creation of this post, the package pre-commit-dbt's ownership has moved to another team and it has been renamed to [dbt-checkpoint](https://github.com/dbt-checkpoint/dbt-checkpoint). A redirect has been set up, meaning that the code example below will still work. It is also possible to replace `repo: https://github.com/offbi/pre-commit-dbt` with `repo: https://github.com/dbt-checkpoint/dbt-checkpoint` in your `.pre-commit-config.yaml` file.* -At dbt Labs, we have [best practices](https://docs.getdbt.com/docs/guides/best-practices) we like to follow for the development of dbt projects. One of them, for example, is that all models should have at least `unique` and `not_null` tests on their primary key. But how can we enforce rules like this? +At dbt Labs, we have [best practices](https://docs.getdbt.com/docs/best-practices) we like to follow for the development of dbt projects. One of them, for example, is that all models should have at least `unique` and `not_null` tests on their primary key. But how can we enforce rules like this? That question becomes difficult to answer in large dbt projects. Developers might not follow the same conventions. They might not be aware of past decisions, and reviewing pull requests in git can become more complex. When dbt projects have hundreds of models, it's hard to know which models do not have any tests defined and aren't enforcing your conventions. diff --git a/website/blog/2022-08-12-how-we-shaved-90-minutes-off-long-running-model.md b/website/blog/2022-08-12-how-we-shaved-90-minutes-off-long-running-model.md index 020a48c763f..e6a8b943051 100644 --- a/website/blog/2022-08-12-how-we-shaved-90-minutes-off-long-running-model.md +++ b/website/blog/2022-08-12-how-we-shaved-90-minutes-off-long-running-model.md @@ -286,7 +286,7 @@ Developing an analytic code base is an ever-evolving process. What worked well w 4. **Test on representative data** - Testing on a [subset of data](https://docs.getdbt.com/guides/legacy/best-practices#limit-the-data-processed-when-in-development) is a great general practice. It allows you to iterate quickly, and doesn’t waste resources. However, there are times when you need to test on a larger dataset for problems like disk spillage to come to the fore. Testing on large data is hard and expensive, so make sure you have a good idea of the solution before you commit to this step. + Testing on a [subset of data](https://docs.getdbt.com/best-practices/best-practice-workflows#limit-the-data-processed-when-in-development) is a great general practice. It allows you to iterate quickly, and doesn’t waste resources. 
However, there are times when you need to test on a larger dataset for problems like disk spillage to come to the fore. Testing on large data is hard and expensive, so make sure you have a good idea of the solution before you commit to this step. 5. **Repeat** diff --git a/website/blog/2022-08-22-narrative-modeling.md b/website/blog/2022-08-22-narrative-modeling.md index a5418ccded1..a74c73fdbd1 100644 --- a/website/blog/2022-08-22-narrative-modeling.md +++ b/website/blog/2022-08-22-narrative-modeling.md @@ -177,7 +177,7 @@ To that final point, if presented with the DAG from the narrative modeling appro ### Users can tie business concepts to source data -- While the schema structure above is focused on business entities, there are still ample use cases for [staging and intermediate tables](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview). +- While the schema structure above is focused on business entities, there are still ample use cases for [staging and intermediate tables](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview). - After cleaning up source data with staging tables, use the same “what happened” approach to more technical events, creating a three-node dependency from `stg_snowplow_events` to `int_page_click_captured` to `user_refreshed_cart` and thus answering the question “where do we get online user behavior information?” in a quick visit to the DAG in dbt docs. # Should your team use it? diff --git a/website/blog/2022-09-08-konmari-your-query-migration.md b/website/blog/2022-09-08-konmari-your-query-migration.md index f7d7cc74ead..c1472058150 100644 --- a/website/blog/2022-09-08-konmari-your-query-migration.md +++ b/website/blog/2022-09-08-konmari-your-query-migration.md @@ -108,7 +108,7 @@ Here are a few things to look for: ## Steps 4 & 5: Tidy by category and follow the right order—upstream to downstream -We are ready to unpack our kitchen. Use your design as a guideline for [modularization](/guides/best-practices/how-we-structure/1-guide-overview). +We are ready to unpack our kitchen. Use your design as a guideline for [modularization](/best-practices/how-we-structure/1-guide-overview). - Build your staging tables first, and then your intermediate tables in your pre-planned buckets. - Important, reusable joins that are performed in the final query should be moved upstream into their own modular models, as well as any joins that are repeated in your query. diff --git a/website/blog/2022-11-22-move-spreadsheets-to-your-dwh.md b/website/blog/2022-11-22-move-spreadsheets-to-your-dwh.md index ba5dddcae19..93cf91efeed 100644 --- a/website/blog/2022-11-22-move-spreadsheets-to-your-dwh.md +++ b/website/blog/2022-11-22-move-spreadsheets-to-your-dwh.md @@ -102,7 +102,7 @@ Instead of syncing all cells in a sheet, you create a [named range](https://five -Beware of inconsistent data types though—if someone types text into a column that was originally numeric, Fivetran will automatically convert the column to a string type which might cause issues in your downstream transformations. [The recommended workaround](https://fivetran.com/docs/files/google-sheets#typetransformationsandmapping) is to explicitly cast your types in [staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging) to ensure that any undesirable records are converted to null. 
+Beware of inconsistent data types though—if someone types text into a column that was originally numeric, Fivetran will automatically convert the column to a string type which might cause issues in your downstream transformations. [The recommended workaround](https://fivetran.com/docs/files/google-sheets#typetransformationsandmapping) is to explicitly cast your types in [staging models](https://docs.getdbt.com/best-practices/how-we-structure/2-staging) to ensure that any undesirable records are converted to null. #### Good fit for: @@ -192,4 +192,4 @@ Databricks also supports [pulling in data, such as spreadsheets, from external c Beyond the options we’ve already covered, there’s an entire world of other tools that can load data from your spreadsheets into your data warehouse. This is a living document, so if your preferred method isn't listed then please [open a PR](https://github.com/dbt-labs/docs.getdbt.com) and I'll check it out. -The most important things to consider are your files’ origins and formats—if you need your colleagues to upload files on a regular basis then try to provide them with a more user-friendly process; but if you just need two computers to talk to each other, or it’s a one-off file that will hardly ever change, then a more technical integration is totally appropriate. \ No newline at end of file +The most important things to consider are your files’ origins and formats—if you need your colleagues to upload files on a regular basis then try to provide them with a more user-friendly process; but if you just need two computers to talk to each other, or it’s a one-off file that will hardly ever change, then a more technical integration is totally appropriate. diff --git a/website/blog/2022-11-30-dbt-project-evaluator.md b/website/blog/2022-11-30-dbt-project-evaluator.md index 558d8877d72..3ea7a459c35 100644 --- a/website/blog/2022-11-30-dbt-project-evaluator.md +++ b/website/blog/2022-11-30-dbt-project-evaluator.md @@ -34,7 +34,7 @@ Throughout these engagements, we began to take note of the common issues many an Maybe your team is facing some of these issues right now 👀 And that’s okay! We know that building an effective, scalable dbt project takes a lot of effort and brain power. Maybe you’ve inherited a legacy dbt project with a mountain of tech debt. Maybe you’re starting from scratch. Either way it can be difficult to know the best way to set your team up for success. Don’t worry, you’re in the right place! -Through solving these problems over and over, the Professional Services team began to hone our best practices for working with dbt and how analytics engineers could improve their dbt project. We added “solutions reviews'' to our list of service offerings — client engagements in which we evaluate a given dbt project and provide specific recommendations to improve performance, save developer time, and prevent misuse of dbt’s features. And in an effort to share these best practices with the wider dbt community, we developed a *lot* of content. We wrote articles on the Developer Blog (see [1](https://docs.getdbt.com/blog/on-the-importance-of-naming), [2](https://discourse.getdbt.com/t/your-essential-dbt-project-checklist/1377), and [3](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview)), gave [Coalesce talks](https://www.getdbt.com/coalesce-2020/auditing-model-layers-and-modularity-with-your-dag/), and created [training courses](https://courses.getdbt.com/courses/refactoring-sql-for-modularity). 
+Through solving these problems over and over, the Professional Services team began to hone our best practices for working with dbt and how analytics engineers could improve their dbt project. We added “solutions reviews'' to our list of service offerings — client engagements in which we evaluate a given dbt project and provide specific recommendations to improve performance, save developer time, and prevent misuse of dbt’s features. And in an effort to share these best practices with the wider dbt community, we developed a *lot* of content. We wrote articles on the Developer Blog (see [1](https://docs.getdbt.com/blog/on-the-importance-of-naming), [2](https://discourse.getdbt.com/t/your-essential-dbt-project-checklist/1377), and [3](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview)), gave [Coalesce talks](https://www.getdbt.com/coalesce-2020/auditing-model-layers-and-modularity-with-your-dag/), and created [training courses](https://courses.getdbt.com/courses/refactoring-sql-for-modularity). TIme and time again, we found that when teams are aligned with these best practices, their projects are more: @@ -63,10 +63,10 @@ Currently, the dbt_project_evaluator package covers five main categories: | Category | Example Best Practices | | --- | --- | -| Modeling | - Every [raw source](https://docs.getdbt.com/docs/build/sources) has a one-to-one relationship with a [staging model](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) to centralize data cleanup.
- Every model can be traced back to a declared source in the dbt project (i.e. no "root" models).
- End-of-DAG fanout remains under a specified threshold. | +| Modeling | - Every [raw source](https://docs.getdbt.com/docs/build/sources) has a one-to-one relationship with a [staging model](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview) to centralize data cleanup.
- Every model can be traced back to a declared source in the dbt project (i.e. no "root" models).
- End-of-DAG fanout remains under a specified threshold. | | Testing | - Every model has a primary key that is appropriately tested.
- The percentage of models that have a minimum of 1 test applied is greater than or equal to a specified threshold. | | Documentation | - Every model has a [description](https://docs.getdbt.com/reference/resource-properties/description).
- The percentage of models that have a description is greater than or equal to a specified threshold. | -| Structure | - All models are named with the appropriate prefix aligned according to their model types (e.g. staging models are prefixed with `stg_`).
- The sql file for each model is in the subdirectory aligned with the model type (e.g. intermediate models are in an [intermediate subdirectory](https://docs.getdbt.com/guides/best-practices/how-we-structure/3-intermediate)).
- Each models subdirectory contains one .yml file that includes tests and documentation for all models within the given subdirectory. | +| Structure | - All models are named with the appropriate prefix aligned according to their model types (e.g. staging models are prefixed with `stg_`).
- The sql file for each model is in the subdirectory aligned with the model type (e.g. intermediate models are in an [intermediate subdirectory](https://docs.getdbt.com/best-practices/how-we-structure/3-intermediate)).
- Each models subdirectory contains one .yml file that includes tests and documentation for all models within the given subdirectory. | | Performance | - Every model that directly feeds into an [exposure](https://docs.getdbt.com/docs/build/exposures) is materialized as a table.
- No models are dependent on chains of "non-physically-materialized" models greater than a specified threshold. | For the full up-to-date list of covered rules, check out the package’s [README](https://github.com/dbt-labs/dbt-project-evaluator#rules-1), which outlines for each misalignment of a best practice: diff --git a/website/blog/2023-04-18-building-a-kimball-dimensional-model-with-dbt.md b/website/blog/2023-04-18-building-a-kimball-dimensional-model-with-dbt.md index ffc0369a908..3ca1f6ac2a9 100644 --- a/website/blog/2023-04-18-building-a-kimball-dimensional-model-with-dbt.md +++ b/website/blog/2023-04-18-building-a-kimball-dimensional-model-with-dbt.md @@ -62,7 +62,7 @@ Before you can get started: - You must have Python 3.8 or above installed - You must have dbt version 1.3.0 or above installed - You should have a basic understanding of [SQL](https://www.sqltutorial.org/) -- You should have a basic understanding of [dbt](https://docs.getdbt.com/quickstarts) +- You should have a basic understanding of [dbt](https://docs.getdbt.com/guides) ### Step 2: Clone the repository diff --git a/website/blog/2023-04-24-framework-refactor-alteryx-dbt.md b/website/blog/2023-04-24-framework-refactor-alteryx-dbt.md index c5b677f7f3e..46cfcb58cdd 100644 --- a/website/blog/2023-04-24-framework-refactor-alteryx-dbt.md +++ b/website/blog/2023-04-24-framework-refactor-alteryx-dbt.md @@ -94,7 +94,7 @@ It is essential to click on each data source (the green book icons on the leftmo For this step, we identified which operators were used in the data source (for example, joining data, order columns, group by, etc). Usually the Alteryx operators are pretty self-explanatory and all the information needed for understanding appears on the left side of the menu. We also checked the documentation to understand how each Alteryx operator works behind the scenes. -We followed dbt Labs' guide on how to refactor legacy SQL queries in dbt and some [best practices](https://docs.getdbt.com/guides/migration/tools/refactoring-legacy-sql). After we finished refactoring all the Alteryx workflows, we checked if the Alteryx output matched the output of the refactored model built on dbt. +We followed dbt Labs' guide on how to refactor legacy SQL queries in dbt and some [best practices](/guides/refactoring-legacy-sql). After we finished refactoring all the Alteryx workflows, we checked if the Alteryx output matched the output of the refactored model built on dbt. 
#### Step 3: Use the `audit_helper` package to audit refactored data models @@ -131,4 +131,4 @@ As we can see, refactoring Alteryx to dbt was an important step in the direction > > [Audit_helper in dbt: Bringing data auditing to a higher level](https://docs.getdbt.com/blog/audit-helper-for-migration) > -> [Refactoring legacy SQL to dbt](https://docs.getdbt.com/guides/migration/tools/refactoring-legacy-sql) +> [Refactoring legacy SQL to dbt](/guides/refactoring-legacy-sql) diff --git a/website/docs/guides/legacy/best-practices.md b/website/docs/best-practices/best-practice-workflows.md similarity index 99% rename from website/docs/guides/legacy/best-practices.md rename to website/docs/best-practices/best-practice-workflows.md index 1fbcbc72cc1..f06e785c6db 100644 --- a/website/docs/guides/legacy/best-practices.md +++ b/website/docs/best-practices/best-practice-workflows.md @@ -1,11 +1,12 @@ --- -title: "Best practices" -id: "best-practices" +title: "Best practices for workflows" +id: "best-practice-workflows" --- This page contains the collective wisdom of experienced users of dbt on how to best use it in your analytics work. Observing these best practices will help your analytics team work as effectively as possible, while implementing the pro-tips will add some polish to your dbt projects! ## Best practice workflows + ### Version control your dbt project All dbt projects should be managed in version control. Git branches should be created to manage development of new features and bug fixes. All code changes should be reviewed by a colleague (or yourself) in a Pull Request prior to merging into `master`. @@ -57,7 +58,7 @@ All subsequent data models should be built on top of these models, reducing the Earlier versions of this documentation recommended implementing “base models” as the first layer of transformation, and gave advice on the SQL within these models. We realized that while the reasons behind this convention were valid, the specific advice around "base models" represented an opinion, so we moved it out of the official documentation. -You can instead find our opinions on [how we structure our dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview). +You can instead find our opinions on [how we structure our dbt projects](/best-practices/how-we-structure/1-guide-overview). ::: diff --git a/website/docs/guides/best-practices/custom-generic-tests.md b/website/docs/best-practices/custom-generic-tests.md similarity index 100% rename from website/docs/guides/best-practices/custom-generic-tests.md rename to website/docs/best-practices/custom-generic-tests.md diff --git a/website/docs/guides/dbt-ecosystem/databricks-guides/dbt-unity-catalog-best-practices.md b/website/docs/best-practices/dbt-unity-catalog-best-practices.md similarity index 85% rename from website/docs/guides/dbt-ecosystem/databricks-guides/dbt-unity-catalog-best-practices.md rename to website/docs/best-practices/dbt-unity-catalog-best-practices.md index 8713938db86..89153fe1b86 100644 --- a/website/docs/guides/dbt-ecosystem/databricks-guides/dbt-unity-catalog-best-practices.md +++ b/website/docs/best-practices/dbt-unity-catalog-best-practices.md @@ -1,6 +1,13 @@ -# Best practices for dbt and Unity Catalog +--- +title: "Best practices for dbt and Unity Catalog" +id: "dbt-unity-catalog-best-practices" +description: Learn how to configure your. +displayText: Writing custom generic tests +hoverSnippet: Learn how to define your own custom generic tests. 
+--- -Your Databricks dbt project should be configured after following the ["How to set up your databricks dbt project guide"](how-to-set-up-your-databricks-dbt-project). Now we’re ready to start building a dbt project using Unity Catalog. However, we should first consider how we want to allow dbt users to interact with our different catalogs. We recommend the following best practices to ensure the integrity of your production data: + +Your Databricks dbt project should be configured after following the ["How to set up your databricks dbt project guide"](/guides/set-up-your-databricks-dbt-project). Now we’re ready to start building a dbt project using Unity Catalog. However, we should first consider how we want to allow dbt users to interact with our different catalogs. We recommend the following best practices to ensure the integrity of your production data: ## Isolate your Bronze (aka source) data @@ -53,9 +60,9 @@ Ready to start transforming your Unity Catalog datasets with dbt? Check out the resources below for guides, tips, and best practices: -- [How we structure our dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) +- [How we structure our dbt projects](/best-practices/how-we-structure/1-guide-overview) - [Self-paced dbt fundamentals training videos](https://courses.getdbt.com/courses/fundamentals) -- [Customizing CI/CD](https://docs.getdbt.com/guides/orchestration/custom-cicd-pipelines/1-cicd-background) & [SQL linting](https://docs.getdbt.com/guides/orchestration/custom-cicd-pipelines/2-lint-on-push) -- [Debugging errors](https://docs.getdbt.com/guides/best-practices/debugging-errors) -- [Writing custom generic tests](https://docs.getdbt.com/guides/best-practices/writing-custom-generic-tests) -- [dbt packages hub](https://hub.getdbt.com/) \ No newline at end of file +- [Customizing CI/CD](/guides/custom-cicd-pipelines) +- [Debugging errors](/guides/debug-errors) +- [Writing custom generic tests](/best-practices/writing-custom-generic-tests) +- [dbt packages hub](https://hub.getdbt.com/) diff --git a/website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-1-intro.md b/website/docs/best-practices/how-we-build-our-metrics/semantic-layer-1-intro.md similarity index 100% rename from website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-1-intro.md rename to website/docs/best-practices/how-we-build-our-metrics/semantic-layer-1-intro.md diff --git a/website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-2-setup.md b/website/docs/best-practices/how-we-build-our-metrics/semantic-layer-2-setup.md similarity index 95% rename from website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-2-setup.md rename to website/docs/best-practices/how-we-build-our-metrics/semantic-layer-2-setup.md index 801227924dd..ffbd78b939c 100644 --- a/website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-2-setup.md +++ b/website/docs/best-practices/how-we-build-our-metrics/semantic-layer-2-setup.md @@ -33,7 +33,7 @@ Lastly, to get to the pre-Semantic Layer starting state, checkout the `start-her git checkout start-here ``` -For more information, refer to the [MetricFlow commands](/docs/build/metricflow-commands) or a [quickstart](/quickstarts) to get more familiar with setting up a dbt project. +For more information, refer to the [MetricFlow commands](/docs/build/metricflow-commands) or a [quickstart](/guides) to get more familiar with setting up a dbt project. 
## Basic commands diff --git a/website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-3-build-semantic-models.md b/website/docs/best-practices/how-we-build-our-metrics/semantic-layer-3-build-semantic-models.md similarity index 100% rename from website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-3-build-semantic-models.md rename to website/docs/best-practices/how-we-build-our-metrics/semantic-layer-3-build-semantic-models.md diff --git a/website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-4-build-metrics.md b/website/docs/best-practices/how-we-build-our-metrics/semantic-layer-4-build-metrics.md similarity index 100% rename from website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-4-build-metrics.md rename to website/docs/best-practices/how-we-build-our-metrics/semantic-layer-4-build-metrics.md diff --git a/website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-5-refactor-a-mart.md b/website/docs/best-practices/how-we-build-our-metrics/semantic-layer-5-refactor-a-mart.md similarity index 99% rename from website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-5-refactor-a-mart.md rename to website/docs/best-practices/how-we-build-our-metrics/semantic-layer-5-refactor-a-mart.md index b2efb39e9fc..dfdba2941e9 100644 --- a/website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-5-refactor-a-mart.md +++ b/website/docs/best-practices/how-we-build-our-metrics/semantic-layer-5-refactor-a-mart.md @@ -72,7 +72,7 @@ So far we've been working in new pointing at a staging model to simplify things Now, let's tackle a thornier situation. Products and supplies both have dimensions and measures but no time dimension. Products has a one-to-one relationship with `order_items`, enriching that table, which is itself just a mapping table of products to orders. Additionally, products have a one-to-many relationship with supplies. The high-level ERD looks like the diagram below. - + So to calculate, for instance, the cost of ingredients and supplies for a given order, we'll need to do some joining and aggregating, but again we **lack a time dimension for products and supplies**. This is the signal to us that we'll **need to build a logical mart** and point our semantic model at that. 
diff --git a/website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-6-advanced-metrics.md b/website/docs/best-practices/how-we-build-our-metrics/semantic-layer-6-advanced-metrics.md similarity index 100% rename from website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-6-advanced-metrics.md rename to website/docs/best-practices/how-we-build-our-metrics/semantic-layer-6-advanced-metrics.md diff --git a/website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-7-conclusion.md b/website/docs/best-practices/how-we-build-our-metrics/semantic-layer-7-conclusion.md similarity index 100% rename from website/docs/guides/best-practices/how-we-build-our-metrics/semantic-layer-7-conclusion.md rename to website/docs/best-practices/how-we-build-our-metrics/semantic-layer-7-conclusion.md diff --git a/website/docs/guides/best-practices/how-we-mesh/mesh-1-intro.md b/website/docs/best-practices/how-we-mesh/mesh-1-intro.md similarity index 100% rename from website/docs/guides/best-practices/how-we-mesh/mesh-1-intro.md rename to website/docs/best-practices/how-we-mesh/mesh-1-intro.md diff --git a/website/docs/guides/best-practices/how-we-mesh/mesh-2-structures.md b/website/docs/best-practices/how-we-mesh/mesh-2-structures.md similarity index 100% rename from website/docs/guides/best-practices/how-we-mesh/mesh-2-structures.md rename to website/docs/best-practices/how-we-mesh/mesh-2-structures.md diff --git a/website/docs/guides/best-practices/how-we-mesh/mesh-3-implementation.md b/website/docs/best-practices/how-we-mesh/mesh-3-implementation.md similarity index 100% rename from website/docs/guides/best-practices/how-we-mesh/mesh-3-implementation.md rename to website/docs/best-practices/how-we-mesh/mesh-3-implementation.md diff --git a/website/docs/guides/best-practices/how-we-structure/1-guide-overview.md b/website/docs/best-practices/how-we-structure/1-guide-overview.md similarity index 100% rename from website/docs/guides/best-practices/how-we-structure/1-guide-overview.md rename to website/docs/best-practices/how-we-structure/1-guide-overview.md diff --git a/website/docs/guides/best-practices/how-we-structure/2-staging.md b/website/docs/best-practices/how-we-structure/2-staging.md similarity index 97% rename from website/docs/guides/best-practices/how-we-structure/2-staging.md rename to website/docs/best-practices/how-we-structure/2-staging.md index bcb589508e5..8eb91ff5b7b 100644 --- a/website/docs/guides/best-practices/how-we-structure/2-staging.md +++ b/website/docs/best-practices/how-we-structure/2-staging.md @@ -12,7 +12,7 @@ We'll use an analogy for working with dbt throughout this guide: thinking modula ### Staging: Files and folders -Let's zoom into the staging directory from our `models` file tree [in the overview](/guides/best-practices/how-we-structure/1-guide-overview) and walk through what's going on here. +Let's zoom into the staging directory from our `models` file tree [in the overview](/best-practices/how-we-structure/1-guide-overview) and walk through what's going on here. ```shell models/staging @@ -106,7 +106,7 @@ select * from renamed - ❌ **Aggregations** — aggregations entail grouping, and we're not doing that at this stage. Remember - staging models are your place to create the building blocks you’ll use all throughout the rest of your project — if we start changing the grain of our tables by grouping in this layer, we’ll lose access to source data that we’ll likely need at some point. 
We just want to get our individual concepts cleaned and ready for use, and will handle aggregating values downstream. - ✅ **Materialized as views.** Looking at a partial view of our `dbt_project.yml` below, we can see that we’ve configured the entire staging directory to be materialized as views. As they’re not intended to be final artifacts themselves, but rather building blocks for later models, staging models should typically be materialized as views for two key reasons: - - Any downstream model (discussed more in [marts](/guides/best-practices/how-we-structure/4-marts)) referencing our staging models will always get the freshest data possible from all of the component views it’s pulling together and materializing + - Any downstream model (discussed more in [marts](/best-practices/how-we-structure/4-marts)) referencing our staging models will always get the freshest data possible from all of the component views it’s pulling together and materializing - It avoids wasting space in the warehouse on models that are not intended to be queried by data consumers, and thus do not need to perform as quickly or efficiently ```yaml diff --git a/website/docs/guides/best-practices/how-we-structure/3-intermediate.md b/website/docs/best-practices/how-we-structure/3-intermediate.md similarity index 100% rename from website/docs/guides/best-practices/how-we-structure/3-intermediate.md rename to website/docs/best-practices/how-we-structure/3-intermediate.md diff --git a/website/docs/guides/best-practices/how-we-structure/4-marts.md b/website/docs/best-practices/how-we-structure/4-marts.md similarity index 100% rename from website/docs/guides/best-practices/how-we-structure/4-marts.md rename to website/docs/best-practices/how-we-structure/4-marts.md diff --git a/website/docs/guides/best-practices/how-we-structure/5-semantic-layer-marts.md b/website/docs/best-practices/how-we-structure/5-semantic-layer-marts.md similarity index 86% rename from website/docs/guides/best-practices/how-we-structure/5-semantic-layer-marts.md rename to website/docs/best-practices/how-we-structure/5-semantic-layer-marts.md index adebc4a63c7..62e07a72e36 100644 --- a/website/docs/guides/best-practices/how-we-structure/5-semantic-layer-marts.md +++ b/website/docs/best-practices/how-we-structure/5-semantic-layer-marts.md @@ -3,7 +3,7 @@ title: "Marts for the Semantic Layer" id: "5-semantic-layer-marts" --- -The Semantic Layer alters some fundamental principles of how you organize your project. Using dbt without the Semantic Layer necessitates creating the most useful combinations of your building block components into wide, denormalized marts. On the other hand, the Semantic Layer leverages MetricFlow to denormalize every possible combination of components we've encoded dynamically. As such we're better served to bring more normalized models through from the logical layer into the Semantic Layer to maximize flexibility. This section will assume familiarity with the best practices laid out in the [How we build our metrics](https://docs.getdbt.com/guides/best-practices/how-we-build-our-metrics/semantic-layer-1-intro) guide, so check that out first for a more hands-on introduction to the Semantic Layer. +The Semantic Layer alters some fundamental principles of how you organize your project. Using dbt without the Semantic Layer necessitates creating the most useful combinations of your building block components into wide, denormalized marts. 
On the other hand, the Semantic Layer leverages MetricFlow to denormalize every possible combination of components we've encoded dynamically. As such we're better served to bring more normalized models through from the logical layer into the Semantic Layer to maximize flexibility. This section will assume familiarity with the best practices laid out in the [How we build our metrics](/best-practices/how-we-build-our-metrics/semantic-layer-1-intro) guide, so check that out first for a more hands-on introduction to the Semantic Layer. ## Semantic Layer: Files and folders @@ -39,7 +39,7 @@ models ## When to make a mart - ❓ If we can go directly to staging models and it's better to serve normalized models to the Semantic Layer, then when, where, and why would we make a mart? - - 🕰️ We have models that have measures but no time dimension to aggregate against. The details of this are laid out in the [Semantic Layer guide](https://docs.getdbt.com/guides/best-practices/how-we-build-our-metrics/semantic-layer-1-intro) but in short, we need a time dimension to aggregate against in MetricFlow. Dimensional tables that + - 🕰️ We have models that have measures but no time dimension to aggregate against. The details of this are laid out in the [Semantic Layer guide](/best-practices/how-we-build-our-metrics/semantic-layer-1-intro) but in short, we need a time dimension to aggregate against in MetricFlow. Dimensional tables that - 🧱 We want to **materialize** our model in various ways. - 👯 We want to **version** our model. - 🛒 We have various related models that make more sense as **one wider component**. diff --git a/website/docs/guides/best-practices/how-we-structure/6-the-rest-of-the-project.md b/website/docs/best-practices/how-we-structure/6-the-rest-of-the-project.md similarity index 100% rename from website/docs/guides/best-practices/how-we-structure/6-the-rest-of-the-project.md rename to website/docs/best-practices/how-we-structure/6-the-rest-of-the-project.md diff --git a/website/docs/guides/best-practices/how-we-style/0-how-we-style-our-dbt-projects.md b/website/docs/best-practices/how-we-style/0-how-we-style-our-dbt-projects.md similarity index 100% rename from website/docs/guides/best-practices/how-we-style/0-how-we-style-our-dbt-projects.md rename to website/docs/best-practices/how-we-style/0-how-we-style-our-dbt-projects.md diff --git a/website/docs/guides/best-practices/how-we-style/1-how-we-style-our-dbt-models.md b/website/docs/best-practices/how-we-style/1-how-we-style-our-dbt-models.md similarity index 100% rename from website/docs/guides/best-practices/how-we-style/1-how-we-style-our-dbt-models.md rename to website/docs/best-practices/how-we-style/1-how-we-style-our-dbt-models.md diff --git a/website/docs/guides/best-practices/how-we-style/2-how-we-style-our-sql.md b/website/docs/best-practices/how-we-style/2-how-we-style-our-sql.md similarity index 100% rename from website/docs/guides/best-practices/how-we-style/2-how-we-style-our-sql.md rename to website/docs/best-practices/how-we-style/2-how-we-style-our-sql.md diff --git a/website/docs/guides/best-practices/how-we-style/3-how-we-style-our-python.md b/website/docs/best-practices/how-we-style/3-how-we-style-our-python.md similarity index 100% rename from website/docs/guides/best-practices/how-we-style/3-how-we-style-our-python.md rename to website/docs/best-practices/how-we-style/3-how-we-style-our-python.md diff --git a/website/docs/guides/best-practices/how-we-style/4-how-we-style-our-jinja.md 
b/website/docs/best-practices/how-we-style/4-how-we-style-our-jinja.md similarity index 100% rename from website/docs/guides/best-practices/how-we-style/4-how-we-style-our-jinja.md rename to website/docs/best-practices/how-we-style/4-how-we-style-our-jinja.md diff --git a/website/docs/guides/best-practices/how-we-style/5-how-we-style-our-yaml.md b/website/docs/best-practices/how-we-style/5-how-we-style-our-yaml.md similarity index 100% rename from website/docs/guides/best-practices/how-we-style/5-how-we-style-our-yaml.md rename to website/docs/best-practices/how-we-style/5-how-we-style-our-yaml.md diff --git a/website/docs/guides/best-practices/how-we-style/6-how-we-style-conclusion.md b/website/docs/best-practices/how-we-style/6-how-we-style-conclusion.md similarity index 97% rename from website/docs/guides/best-practices/how-we-style/6-how-we-style-conclusion.md rename to website/docs/best-practices/how-we-style/6-how-we-style-conclusion.md index a6402e46870..24103861b97 100644 --- a/website/docs/guides/best-practices/how-we-style/6-how-we-style-conclusion.md +++ b/website/docs/best-practices/how-we-style/6-how-we-style-conclusion.md @@ -31,7 +31,7 @@ Our models (typically) fit into two main categories:\ Things to note: - There are different types of models that typically exist in each of the above categories. See [Model Layers](#model-layers) for more information. -- Read [How we structure our dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) for an example and more details around organization. +- Read [How we structure our dbt projects](/best-practices/how-we-structure/1-guide-overview) for an example and more details around organization. ## Model Layers diff --git a/website/docs/guides/best-practices/materializations/materializations-guide-1-guide-overview.md b/website/docs/best-practices/materializations/materializations-guide-1-guide-overview.md similarity index 89% rename from website/docs/guides/best-practices/materializations/materializations-guide-1-guide-overview.md rename to website/docs/best-practices/materializations/materializations-guide-1-guide-overview.md index 209041b1df5..248b4c4749b 100644 --- a/website/docs/guides/best-practices/materializations/materializations-guide-1-guide-overview.md +++ b/website/docs/best-practices/materializations/materializations-guide-1-guide-overview.md @@ -26,9 +26,9 @@ By the end of this guide you should have a solid understanding of: ### Prerequisites -- 📒 You’ll want to have worked through the [quickstart guide](/quickstarts) and have a project setup to work through these concepts. +- 📒 You’ll want to have worked through the [quickstart guide](/guides) and have a project setup to work through these concepts. - 🏃🏻‍♀️ Concepts like dbt runs, `ref()` statements, and models should be familiar to you. -- 🔧 [**Optional**] Reading through the [How we structure our dbt projects](guides/best-practices/how-we-structure/1-guide-overview) Guide will be beneficial for the last section of this guide, when we review best practices for materializations using the dbt project approach of staging models and marts. +- 🔧 [**Optional**] Reading through the [How we structure our dbt projects](/best-practices/how-we-structure/1-guide-overview) Guide will be beneficial for the last section of this guide, when we review best practices for materializations using the dbt project approach of staging models and marts. 
### Guiding principle diff --git a/website/docs/guides/best-practices/materializations/materializations-guide-2-available-materializations.md b/website/docs/best-practices/materializations/materializations-guide-2-available-materializations.md similarity index 98% rename from website/docs/guides/best-practices/materializations/materializations-guide-2-available-materializations.md rename to website/docs/best-practices/materializations/materializations-guide-2-available-materializations.md index 54110b46385..9910e5f8269 100644 --- a/website/docs/guides/best-practices/materializations/materializations-guide-2-available-materializations.md +++ b/website/docs/best-practices/materializations/materializations-guide-2-available-materializations.md @@ -19,7 +19,7 @@ Views and tables and incremental models, oh my! In this section we’ll start ge **Views and Tables are the two basic categories** of object that we can create across warehouses. They exist natively as types of objects in the warehouse, as you can see from this screenshot of Snowflake (depending on your warehouse the interface will look a little different). **Incremental models** and other materializations types are a little bit different. They tell dbt to **construct tables in a special way**. -![Tables and views in the browser on Snowflake.](/img/guides/best-practices/materializations/tables-and-views.png) +![Tables and views in the browser on Snowflake.](/img/best-practices/materializations/tables-and-views.png) ### Views diff --git a/website/docs/guides/best-practices/materializations/materializations-guide-3-configuring-materializations.md b/website/docs/best-practices/materializations/materializations-guide-3-configuring-materializations.md similarity index 100% rename from website/docs/guides/best-practices/materializations/materializations-guide-3-configuring-materializations.md rename to website/docs/best-practices/materializations/materializations-guide-3-configuring-materializations.md diff --git a/website/docs/guides/best-practices/materializations/materializations-guide-4-incremental-models.md b/website/docs/best-practices/materializations/materializations-guide-4-incremental-models.md similarity index 99% rename from website/docs/guides/best-practices/materializations/materializations-guide-4-incremental-models.md rename to website/docs/best-practices/materializations/materializations-guide-4-incremental-models.md index cd4264bafd3..71b24ef58f2 100644 --- a/website/docs/guides/best-practices/materializations/materializations-guide-4-incremental-models.md +++ b/website/docs/best-practices/materializations/materializations-guide-4-incremental-models.md @@ -76,7 +76,7 @@ So we’ve found a way to isolate the new rows we need to process. How then do w - 🌍  Lastly, if we’re building into a new environment and there’s **no previous run to reference**, or we need to **build the model from scratch.** Put another way, we’ll want a means to skip the incremental logic and transform all of our input data like a regular table if needed. - 😎 **Visualized below**, we’ve figured out how to get the red ‘new records’ portion selected, but we need to sort out the step to the right, where we stick those on to our model. 
-![Diagram visualizing how incremental models work](/img/guides/best-practices/materializations/incremental-diagram.png) +![Diagram visualizing how incremental models work](/img/best-practices/materializations/incremental-diagram.png) :::info 😌 Incremental models can be confusing at first, **take your time reviewing** this visual and the previous steps until you have a **clear mental model.** Be patient with yourself. This materialization will become second nature soon, but it’s tough at first. If you’re feeling confused the [dbt Community is here for you on the Forum and Slack](community/join). diff --git a/website/docs/guides/best-practices/materializations/materializations-guide-5-best-practices.md b/website/docs/best-practices/materializations/materializations-guide-5-best-practices.md similarity index 98% rename from website/docs/guides/best-practices/materializations/materializations-guide-5-best-practices.md rename to website/docs/best-practices/materializations/materializations-guide-5-best-practices.md index a2cb22d5755..268a326eed0 100644 --- a/website/docs/guides/best-practices/materializations/materializations-guide-5-best-practices.md +++ b/website/docs/best-practices/materializations/materializations-guide-5-best-practices.md @@ -58,7 +58,7 @@ models: As we’ve learned, views store only the logic of the transformation in the warehouse, so our runs take only a couple seconds per model (or less). What happens when we go to query the data though? -![Long query time from Snowflake](/img/guides/best-practices/materializations/snowflake-query-timing.png) +![Long query time from Snowflake](/img/best-practices/materializations/snowflake-query-timing.png) Our marts are slow to query! diff --git a/website/docs/guides/best-practices/materializations/materializations-guide-6-examining-builds.md b/website/docs/best-practices/materializations/materializations-guide-6-examining-builds.md similarity index 90% rename from website/docs/guides/best-practices/materializations/materializations-guide-6-examining-builds.md rename to website/docs/best-practices/materializations/materializations-guide-6-examining-builds.md index 909618ef8a5..0b18518d0bd 100644 --- a/website/docs/guides/best-practices/materializations/materializations-guide-6-examining-builds.md +++ b/website/docs/best-practices/materializations/materializations-guide-6-examining-builds.md @@ -16,9 +16,9 @@ hoverSnippet: Read this guide to understand how to examine your builds in dbt. ### Model Timing -That’s where dbt Cloud’s Model Timing visualization comes in extremely handy. If we’ve set up a [Job](/quickstarts/bigquery) in dbt Cloud to run our models, we can use the Model Timing tab to pinpoint our longest-running models. +That’s where dbt Cloud’s Model Timing visualization comes in extremely handy. If we’ve set up a [Job](/guides/bigquery) in dbt Cloud to run our models, we can use the Model Timing tab to pinpoint our longest-running models. -![dbt Cloud's Model Timing diagram](/img/guides/best-practices/materializations/model-timing-diagram.png) +![dbt Cloud's Model Timing diagram](/img/best-practices/materializations/model-timing-diagram.png) - 🧵 This view lets us see our **mapped out in threads** (up to 64 threads, we’re currently running with 4, so we get 4 tracks) over time. You can think of **each thread as a lane on a highway**. - ⌛ We can see above that `customer_status_histories` is **taking by far the most time**, so we may want to go ahead and **make that incremental**. 
@@ -29,7 +29,7 @@ If you aren’t using dbt Cloud, that’s okay! We don’t get a fancy visualiza If you’ve ever run dbt, whether `build`, `test`, `run` or something else, you’ve seen some output like below. Let’s take a closer look at how to read this. -![CLI output from a dbt build command](/img/guides/best-practices/materializations/dbt-build-output.png) +![CLI output from a dbt build command](/img/best-practices/materializations/dbt-build-output.png) - There are two entries per model, the **start** of a model’s build and the **completion**, which will include **how long** the model took to run. The **type** of model is included as well. For example: diff --git a/website/docs/guides/best-practices/materializations/materializations-guide-7-conclusion.md b/website/docs/best-practices/materializations/materializations-guide-7-conclusion.md similarity index 89% rename from website/docs/guides/best-practices/materializations/materializations-guide-7-conclusion.md rename to website/docs/best-practices/materializations/materializations-guide-7-conclusion.md index 119563b9a50..cd561716fe4 100644 --- a/website/docs/guides/best-practices/materializations/materializations-guide-7-conclusion.md +++ b/website/docs/best-practices/materializations/materializations-guide-7-conclusion.md @@ -9,6 +9,6 @@ hoverSnippet: Read this conclusion to our guide on using materializations in dbt You're now following best practices in your project, and have optimized the materializations of your DAG. You’re equipped with the 3 main materializations that cover almost any analytics engineering situation! -There are more configs and materializations available, as well as specific materializations for certain platforms and adapters — and like everything with dbt, materializations are extensible, meaning you can create your own [custom materializations](/guides/advanced/creating-new-materializations) for your needs. So this is just the beginning of what you can do with these powerful configurations. +There are more configs and materializations available, as well as specific materializations for certain platforms and adapters — and like everything with dbt, materializations are extensible, meaning you can create your own [custom materializations](/guides/create-new-materializations) for your needs. So this is just the beginning of what you can do with these powerful configurations. For the vast majority of users and companies though, tables, views, and incremental models will handle everything you can throw at them. Develop your intuition and expertise for these materializations, and you’ll be well on your way to tackling advanced analytics engineering problems. diff --git a/website/docs/community/resources/getting-help.md b/website/docs/community/resources/getting-help.md index 5f423683014..2f30644186e 100644 --- a/website/docs/community/resources/getting-help.md +++ b/website/docs/community/resources/getting-help.md @@ -7,9 +7,9 @@ dbt is open source, and has a generous community behind it. Asking questions wel ### 1. Try to solve your problem first before asking for help #### Search the existing documentation -The docs site you're on is highly searchable, make sure to explore for the answer here as a first step. If you're new to dbt, try working through the [quickstart guide](/quickstarts) first to get a firm foundation on the essential concepts. +The docs site you're on is highly searchable, make sure to explore for the answer here as a first step. 
If you're new to dbt, try working through the [quickstart guide](/guides) first to get a firm foundation on the essential concepts. #### Try to debug the issue yourself -We have a handy guide on [debugging errors](/guides/best-practices/debugging-errors) to help out! This guide also helps explain why errors occur, and which docs you might need to search for help. +We have a handy guide on [debugging errors](/guides/debug-errors) to help out! This guide also helps explain why errors occur, and which docs you might need to search for help. #### Search for answers using your favorite search engine We're committed to making more errors searchable, so it's worth checking if there's a solution already out there! Further, some errors related to installing dbt, the SQL in your models, or getting YAML right, are errors that are not-specific to dbt, so there may be other resources to check. @@ -60,4 +60,4 @@ If you want to receive dbt training, check out our [dbt Learn](https://learn.get - Billing - Bug reports related to the web interface -As a rule of thumb, if you are using dbt Cloud, but your problem is related to code within your dbt project, then please follow the above process rather than reaching out to support. \ No newline at end of file +As a rule of thumb, if you are using dbt Cloud, but your problem is related to code within your dbt project, then please follow the above process rather than reaching out to support. diff --git a/website/docs/docs/build/jinja-macros.md b/website/docs/docs/build/jinja-macros.md index c5fd6b2e111..135db740f75 100644 --- a/website/docs/docs/build/jinja-macros.md +++ b/website/docs/docs/build/jinja-macros.md @@ -27,7 +27,7 @@ Jinja can be used in any SQL in a dbt project, including [models](/docs/build/sq :::info Ready to get started with Jinja and macros? -Check out the [tutorial on using Jinja](/guides/advanced/using-jinja) for a step-by-step example of using Jinja in a model, and turning it into a macro! +Check out the [tutorial on using Jinja](/guides/using-jinja) for a step-by-step example of using Jinja in a model, and turning it into a macro! ::: diff --git a/website/docs/docs/build/metrics.md b/website/docs/docs/build/metrics.md index 7a505fdad14..b75c4bfb502 100644 --- a/website/docs/docs/build/metrics.md +++ b/website/docs/docs/build/metrics.md @@ -11,7 +11,7 @@ tags: [Metrics] The dbt_metrics package has been deprecated and replaced with [MetricFlow](/docs/build/about-metricflow?version=1.6). If you're using the dbt_metrics package or the legacy Semantic Layer (available on v1.5 or lower), we **highly** recommend [upgrading your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher to access MetricFlow and the new [dbt Semantic Layer](/docs/use-dbt-semantic-layer/dbt-sl?version=1.6). -To migrate to the new Semantic Layer, refer to the dedicated [migration guide](/guides/migration/sl-migration) for more info. +To migrate to the new Semantic Layer, refer to the dedicated [migration guide](/guides/sl-migration) for more info. ::: @@ -26,7 +26,7 @@ The dbt_metrics package has been [deprecated](https://docs.getdbt.com/blog/depre Anyone who uses the dbt_metrics package or is integrated with the legacy Semantic Layer. The new Semantic Layer is available to [Team or Enterprise](https://www.getdbt.com/pricing/) multi-tenant dbt Cloud plans [hosted in North America](/docs/cloud/about-cloud/regions-ip-addresses). You must be on dbt v1.6 or higher to access it. All users can define metrics using MetricFlow. 
Users on dbt Cloud Developer plans or dbt Core can only use it to define and test metrics locally, but can't dynamically query them with integrated tools. **What should you do?**

-If you've defined metrics using dbt_metrics or integrated with the legacy Semantic Layer, we **highly** recommend you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher to use MetricFlow or the new dbt Semantic Layer. To migrate to the new Semantic Layer, refer to the dedicated [migration guide](/guides/migration/sl-migration) for more info. +If you've defined metrics using dbt_metrics or integrated with the legacy Semantic Layer, we **highly** recommend you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher to use MetricFlow or the new dbt Semantic Layer. To migrate to the new Semantic Layer, refer to the dedicated [migration guide](/guides/sl-migration) for more info. diff --git a/website/docs/docs/build/models.md b/website/docs/docs/build/models.md index d10eb5ed01a..1cf2fbafeda 100644 --- a/website/docs/docs/build/models.md +++ b/website/docs/docs/build/models.md @@ -20,4 +20,4 @@ The top level of a dbt workflow is the project. A project is a directory of a `. Your organization may need only a few models, but more likely you’ll need a complex structure of nested models to transform the required data. A model is a single file containing a final `select` statement, and a project can have multiple models, and models can even reference each other. Add to that, numerous projects and the level of effort required for transforming complex data sets can improve drastically compared to older methods. -Learn more about models in [SQL models](/docs/build/sql-models) and [Python models](/docs/build/python-models) pages. If you'd like to begin with a bit of practice, visit our [Getting Started Guide](/quickstarts) for instructions on setting up the Jaffle_Shop sample data so you can get hands-on with the power of dbt. +Learn more about models in [SQL models](/docs/build/sql-models) and [Python models](/docs/build/python-models) pages. If you'd like to begin with a bit of practice, visit our [Getting Started Guide](/guides) for instructions on setting up the Jaffle_Shop sample data so you can get hands-on with the power of dbt. diff --git a/website/docs/docs/build/projects.md b/website/docs/docs/build/projects.md index b4b04e3334d..a54f6042cce 100644 --- a/website/docs/docs/build/projects.md +++ b/website/docs/docs/build/projects.md @@ -79,7 +79,7 @@ After configuring the Project subdirectory option, dbt Cloud will use it as the You can create new projects and [share them](/docs/collaborate/git-version-control) with other people by making them available on a hosted git repository like GitHub, GitLab, and BitBucket. -After you set up a connection with your data platform, you can [initialize your new project in dbt Cloud](/quickstarts) and start developing. Or, run [dbt init from the command line](/reference/commands/init) to set up your new project. +After you set up a connection with your data platform, you can [initialize your new project in dbt Cloud](/guides) and start developing. Or, run [dbt init from the command line](/reference/commands/init) to set up your new project. During project initialization, dbt creates sample model files in your project directory to help you start developing quickly. 
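As a rough sketch of what such a model file contains — a single `select`, optionally building on another model via `ref()` (all names here are illustrative):

```sql
-- models/customer_orders.sql — a model is just a select statement in a .sql file.
-- ref() points at another model, which is how dbt wires models together into a DAG.
select
    customer_id,
    count(order_id) as order_count
from {{ ref('stg_orders') }}
group by customer_id
```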
@@ -91,6 +91,6 @@ If you want to see what a mature, production project looks like, check out the [ ## Related docs -* [Best practices: How we structure our dbt projects](/guides/best-practices/how-we-structure/1-guide-overview) -* [Quickstarts for dbt Cloud](/quickstarts) -* [Quickstart for dbt Core](/quickstarts/manual-install) +* [Best practices: How we structure our dbt projects](/best-practices/how-we-structure/1-guide-overview) +* [Quickstarts for dbt Cloud](/guides) +* [Quickstart for dbt Core](/guides/manual-install) diff --git a/website/docs/docs/build/python-models.md b/website/docs/docs/build/python-models.md index bff65362d06..3fe194a4cb7 100644 --- a/website/docs/docs/build/python-models.md +++ b/website/docs/docs/build/python-models.md @@ -67,7 +67,7 @@ models: - not_null tests: # Write your own validation logic (in SQL) for Python results - - [custom_generic_test](/guides/best-practices/writing-custom-generic-tests) + - [custom_generic_test](/best-practices/writing-custom-generic-tests) ``` @@ -716,4 +716,4 @@ You can also install packages at cluster creation time by [defining cluster prop - \ No newline at end of file + diff --git a/website/docs/docs/build/sl-getting-started.md b/website/docs/docs/build/sl-getting-started.md index 11453dde578..d5a59c33ec2 100644 --- a/website/docs/docs/build/sl-getting-started.md +++ b/website/docs/docs/build/sl-getting-started.md @@ -77,7 +77,7 @@ If you're encountering some issues when defining your metrics or setting up the
How do I migrate from the legacy Semantic Layer to the new one?
-If you're using the legacy Semantic Layer, we highly recommend you upgrade your dbt version to dbt v1.6 or higher to use the new dbt Semantic Layer. Refer to the dedicated [migration guide](/guides/migration/sl-migration) for more info.
+If you're using the legacy Semantic Layer, we highly recommend you upgrade your dbt version to dbt v1.6 or higher to use the new dbt Semantic Layer. Refer to the dedicated [migration guide](/guides/sl-migration) for more info.
diff --git a/website/docs/docs/build/sql-models.md b/website/docs/docs/build/sql-models.md index 65fdd58adf0..237ac84c0c2 100644 --- a/website/docs/docs/build/sql-models.md +++ b/website/docs/docs/build/sql-models.md @@ -14,7 +14,7 @@ id: "sql-models" :::info Building your first models -If you're new to dbt, we recommend that you read a [quickstart guide](/quickstarts) to build your first dbt project with models. +If you're new to dbt, we recommend that you read a [quickstart guide](/guides) to build your first dbt project with models. ::: diff --git a/website/docs/docs/build/tests.md b/website/docs/docs/build/tests.md index 75c358155b2..3d86dc6a81b 100644 --- a/website/docs/docs/build/tests.md +++ b/website/docs/docs/build/tests.md @@ -30,7 +30,7 @@ There are two ways of defining tests in dbt: Defining tests is a great way to confirm that your code is working correctly, and helps prevent regressions when your code changes. Because you can use them over and over again, making similar assertions with minor variations, generic tests tend to be much more common—they should make up the bulk of your dbt testing suite. That said, both ways of defining tests have their time and place. :::tip Creating your first tests -If you're new to dbt, we recommend that you check out our [quickstart guide](/quickstarts) to build your first dbt project with models and tests. +If you're new to dbt, we recommend that you check out our [quickstart guide](/guides) to build your first dbt project with models and tests. ::: ## Singular tests @@ -112,7 +112,7 @@ You can find more information about these tests, and additional configurations ( ### More generic tests -Those four tests are enough to get you started. You'll quickly find you want to use a wider variety of tests—a good thing! You can also install generic tests from a package, or write your own, to use (and reuse) across your dbt project. Check out the [guide on custom generic tests](/guides/best-practices/writing-custom-generic-tests) for more information. +Those four tests are enough to get you started. You'll quickly find you want to use a wider variety of tests—a good thing! You can also install generic tests from a package, or write your own, to use (and reuse) across your dbt project. Check out the [guide on custom generic tests](/best-practices/writing-custom-generic-tests) for more information. :::info There are generic tests defined in some open source packages, such as [dbt-utils](https://hub.getdbt.com/dbt-labs/dbt_utils/latest/) and [dbt-expectations](https://hub.getdbt.com/calogica/dbt_expectations/latest/) — skip ahead to the docs on [packages](/docs/build/packages) to learn more! diff --git a/website/docs/docs/cloud/about-cloud-develop.md b/website/docs/docs/cloud/about-cloud-develop.md index 9f864ede5ca..90abbb98bf4 100644 --- a/website/docs/docs/cloud/about-cloud-develop.md +++ b/website/docs/docs/cloud/about-cloud-develop.md @@ -25,7 +25,7 @@ dbt Cloud offers a fast and reliable way to work on your dbt project. It runs db
-The following sections provide detailed instructions on setting up the dbt Cloud CLI and dbt Cloud IDE. To get started with dbt development, you'll need a [developer](/docs/cloud/manage-access/seats-and-users) account. For a more comprehensive guide about developing in dbt, refer to our [quickstart guides](/quickstarts). +The following sections provide detailed instructions on setting up the dbt Cloud CLI and dbt Cloud IDE. To get started with dbt development, you'll need a [developer](/docs/cloud/manage-access/seats-and-users) account. For a more comprehensive guide about developing in dbt, refer to our [quickstart guides](/guides). --------- diff --git a/website/docs/docs/cloud/about-cloud-setup.md b/website/docs/docs/cloud/about-cloud-setup.md index 7b68b52a45a..5c8e5525bf1 100644 --- a/website/docs/docs/cloud/about-cloud-setup.md +++ b/website/docs/docs/cloud/about-cloud-setup.md @@ -16,7 +16,7 @@ dbt Cloud is the fastest and most reliable way to deploy your dbt jobs. It conta - Configuring the [dbt Cloud IDE](/docs/cloud/about-cloud-develop) - Installing and configuring the [dbt Cloud CLI](/docs/cloud/cloud-cli-installation) -These settings are intended for dbt Cloud administrators. If you need a more detailed first-time setup guide for specific data platforms, read our [quickstart guides](/quickstarts). +These settings are intended for dbt Cloud administrators. If you need a more detailed first-time setup guide for specific data platforms, read our [quickstart guides](/guides). If you want a more in-depth learning experience, we recommend taking the dbt Fundamentals on our [dbt Learn online courses site](https://courses.getdbt.com/). diff --git a/website/docs/docs/cloud/about-cloud/about-dbt-cloud.md b/website/docs/docs/cloud/about-cloud/about-dbt-cloud.md index 71f3175a108..518efe56a8b 100644 --- a/website/docs/docs/cloud/about-cloud/about-dbt-cloud.md +++ b/website/docs/docs/cloud/about-cloud/about-dbt-cloud.md @@ -99,6 +99,6 @@ dbt Cloud's [flexible plans](https://www.getdbt.com/pricing/) and features make ## Related docs - [dbt Cloud plans and pricing](https://www.getdbt.com/pricing/) -- [Quickstart guides](/quickstarts) +- [Quickstart guides](/guides) - [dbt Cloud IDE](/docs/cloud/dbt-cloud-ide/develop-in-the-cloud) diff --git a/website/docs/docs/cloud/billing.md b/website/docs/docs/cloud/billing.md index ef3eb00a3c6..31b7689ceb9 100644 --- a/website/docs/docs/cloud/billing.md +++ b/website/docs/docs/cloud/billing.md @@ -215,7 +215,7 @@ If you want to ensure that you're building views whenever the logic is changed, Executing `dbt build` in this context is unnecessary because the CI job was used to both run and test the code that just got merged into main. 5. Under the **Execution Settings**, select the default production job to compare changes against: - **Defer to a previous run state** — Select the “Merge Job” you created so the job compares and identifies what has changed since the last merge. -6. In your dbt project, follow the steps in [Run a dbt Cloud job on merge](/guides/orchestration/custom-cicd-pipelines/3-dbt-cloud-job-on-merge) to create a script to trigger the dbt Cloud API to run your job after a merge happens within your git repository or watch this [video](https://www.loom.com/share/e7035c61dbed47d2b9b36b5effd5ee78?sid=bcf4dd2e-b249-4e5d-b173-8ca204d9becb). +6. 
In your dbt project, follow the steps in Run a dbt Cloud job on merge in the [Customizing CI/CD with custom pipelines](/guides/custom-cicd-pipelines) guide to create a script to trigger the dbt Cloud API to run your job after a merge happens within your git repository or watch this [video](https://www.loom.com/share/e7035c61dbed47d2b9b36b5effd5ee78?sid=bcf4dd2e-b249-4e5d-b173-8ca204d9becb). The purpose of the merge job is to: @@ -237,7 +237,7 @@ To understand better how long each model takes to run within the context of a sp Once you've identified which models could be optimized, check out these other resources that walk through how to optimize your work: * [Build scalable and trustworthy data pipelines with dbt and BigQuery](https://services.google.com/fh/files/misc/dbt_bigquery_whitepaper.pdf) * [Best Practices for Optimizing Your dbt and Snowflake Deployment](https://www.snowflake.com/wp-content/uploads/2021/10/Best-Practices-for-Optimizing-Your-dbt-and-Snowflake-Deployment.pdf) -* [How to optimize and troubleshoot dbt models on Databricks](/guides/dbt-ecosystem/databricks-guides/how_to_optimize_dbt_models_on_databricks) +* [How to optimize and troubleshoot dbt models on Databricks](/guides/optimize-dbt-models-on-databricks) ## FAQs diff --git a/website/docs/docs/cloud/connect-data-platform/about-connections.md b/website/docs/docs/cloud/connect-data-platform/about-connections.md index 1fe89c7273c..1329d179900 100644 --- a/website/docs/docs/cloud/connect-data-platform/about-connections.md +++ b/website/docs/docs/cloud/connect-data-platform/about-connections.md @@ -23,7 +23,7 @@ You can connect to your database in dbt Cloud by clicking the gear in the top ri -These connection instructions provide the basic fields required for configuring a data platform connection in dbt Cloud. For more detailed guides, which include demo project data, read our [Quickstart guides](https://docs.getdbt.com/quickstarts) +These connection instructions provide the basic fields required for configuring a data platform connection in dbt Cloud. For more detailed guides, which include demo project data, read our [Quickstart guides](https://docs.getdbt.com/guides) ## IP Restrictions diff --git a/website/docs/docs/cloud/dbt-cloud-ide/dbt-cloud-tips.md b/website/docs/docs/cloud/dbt-cloud-ide/dbt-cloud-tips.md index 39db7832d79..0ceb4929530 100644 --- a/website/docs/docs/cloud/dbt-cloud-ide/dbt-cloud-tips.md +++ b/website/docs/docs/cloud/dbt-cloud-ide/dbt-cloud-tips.md @@ -46,7 +46,7 @@ There are default keyboard shortcuts that can help make development more product - Use [severity](/reference/resource-configs/severity) thresholds to set an acceptable number of failures for a test. - Use [incremental_strategy](/docs/build/incremental-models#about-incremental_strategy) in your incremental model config to implement the most effective behavior depending on the volume of your data and reliability of your unique keys. - Set `vars` in your `dbt_project.yml` to define global defaults for certain conditions, which you can then override using the `--vars` flag in your commands. -- Use [for loops](/guides/advanced/using-jinja#use-a-for-loop-in-models-for-repeated-sql) in Jinja to [DRY](https://docs.getdbt.com/terms/dry) up repetitive logic, such as selecting a series of columns that all require the same transformations and naming patterns to be applied. 
+- Use [for loops](/guides/using-jinja?step=3) in Jinja to DRY up repetitive logic, such as selecting a series of columns that all require the same transformations and naming patterns to be applied. - Instead of relying on post-hooks, use the [grants config](/reference/resource-configs/grants) to apply permission grants in the warehouse resiliently. - Define [source-freshness](/docs/build/sources#snapshotting-source-data-freshness) thresholds on your sources to avoid running transformations on data that has already been processed. - Use the `+` operator on the left of a model `dbt build --select +model_name` to run a model and all of its upstream dependencies. Use the `+` operator on the right of the model `dbt build --select model_name+` to run a model and everything downstream that depends on it. @@ -59,6 +59,6 @@ There are default keyboard shortcuts that can help make development more product ## Related docs -- [Quickstart guide](/quickstarts) +- [Quickstart guide](/guides) - [About dbt Cloud](/docs/cloud/about-cloud/dbt-cloud-features) - [Develop in the Cloud](/docs/cloud/dbt-cloud-ide/develop-in-the-cloud) diff --git a/website/docs/docs/cloud/dbt-cloud-ide/lint-format.md b/website/docs/docs/cloud/dbt-cloud-ide/lint-format.md index 6a86f1aa14b..f145e76df11 100644 --- a/website/docs/docs/cloud/dbt-cloud-ide/lint-format.md +++ b/website/docs/docs/cloud/dbt-cloud-ide/lint-format.md @@ -127,7 +127,7 @@ group_by_and_order_by_style = implicit ```
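As a quick illustration of that last setting, `group_by_and_order_by_style = implicit` tells the linter to expect positional references rather than repeated column names — roughly this shape (table and column names are made up):

```sql
-- Implicit style: reference grouped columns by position rather than by name.
select
    customer_id,
    order_date,
    count(*) as order_count
from orders
group by 1, 2
order by 1, 2
```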
-For more info on styling best practices, refer to [How we style our SQL](/guides/best-practices/how-we-style/2-how-we-style-our-sql). +For more info on styling best practices, refer to [How we style our SQL](/best-practices/how-we-style/2-how-we-style-our-sql). ::: diff --git a/website/docs/docs/collaborate/documentation.md b/website/docs/docs/collaborate/documentation.md index 0fa00c7cca2..16a4e610c70 100644 --- a/website/docs/docs/collaborate/documentation.md +++ b/website/docs/docs/collaborate/documentation.md @@ -11,7 +11,7 @@ pagination_prev: null * [Declaring properties](/reference/configs-and-properties) * [`dbt docs` command](/reference/commands/cmd-docs) * [`doc` Jinja function](/reference/dbt-jinja-functions) -* If you're new to dbt, we recommend that you check out our [quickstart guide](/quickstarts) to build your first dbt project, complete with documentation. +* If you're new to dbt, we recommend that you check out our [quickstart guide](/guides) to build your first dbt project, complete with documentation. ## Assumed knowledge diff --git a/website/docs/docs/collaborate/govern/model-access.md b/website/docs/docs/collaborate/govern/model-access.md index 765e833ac0c..76eb8bd6f6d 100644 --- a/website/docs/docs/collaborate/govern/model-access.md +++ b/website/docs/docs/collaborate/govern/model-access.md @@ -35,7 +35,7 @@ Why define model `groups`? There are two reasons: - It turns implicit relationships into an explicit grouping, with a defined owner. By thinking about the interface boundaries _between_ groups, you can have a cleaner (less entangled) DAG. In the future, those interface boundaries could be appropriate as the interfaces between separate projects. - It enables you to designate certain models as having "private" access—for use exclusively within that group. Other models will be restricted from referencing (taking a dependency on) those models. In the future, they won't be visible to other teams taking a dependency on your project—only "public" models will be. -If you follow our [best practices for structuring a dbt project](/guides/best-practices/how-we-structure/1-guide-overview), you're probably already using subdirectories to organize your dbt project. It's easy to apply a `group` label to an entire subdirectory at once: +If you follow our [best practices for structuring a dbt project](/best-practices/how-we-structure/1-guide-overview), you're probably already using subdirectories to organize your dbt project. It's easy to apply a `group` label to an entire subdirectory at once: diff --git a/website/docs/docs/collaborate/govern/project-dependencies.md b/website/docs/docs/collaborate/govern/project-dependencies.md index d873d8883d6..174e4572890 100644 --- a/website/docs/docs/collaborate/govern/project-dependencies.md +++ b/website/docs/docs/collaborate/govern/project-dependencies.md @@ -113,7 +113,7 @@ with monthly_revenue as ( **Cycle detection:** Currently, "project" dependencies can only go in one direction, meaning that the `jaffle_finance` project could not add a new model that depends, in turn, on `jaffle_marketing.roi_by_channel`. dbt will check for cycles across projects and raise errors if any are detected. We are considering support for this pattern in the future, whereby dbt would still check for node-level cycles while allowing cycles at the project level. -For more guidance on how to use dbt Mesh, refer to the dedicated [dbt Mesh guide](/guides/best-practices/how-we-mesh/mesh-1-intro). 
+For more guidance on how to use dbt Mesh, refer to the dedicated [dbt Mesh guide](/best-practices/how-we-mesh/mesh-1-intro). ### Comparison @@ -139,4 +139,4 @@ If you're using private packages with the [git token method](/docs/build/package ## Related docs -- Refer to the [dbt Mesh](/guides/best-practices/how-we-mesh/mesh-1-intro) guide for more guidance on how to use dbt Mesh. +- Refer to the [dbt Mesh](/best-practices/how-we-mesh/mesh-1-intro) guide for more guidance on how to use dbt Mesh. diff --git a/website/docs/docs/connect-adapters.md b/website/docs/docs/connect-adapters.md index 77ead34e51d..e301cfc237e 100644 --- a/website/docs/docs/connect-adapters.md +++ b/website/docs/docs/connect-adapters.md @@ -3,7 +3,7 @@ title: "How to connect to adapters" id: "connect-adapters" --- -Adapters are an essential component of dbt. At their most basic level, they are how dbt connects with the various supported data platforms. At a higher-level, adapters strive to give analytics engineers more transferrable skills as well as standardize how analytics projects are structured. Gone are the days where you have to learn a new language or flavor of SQL when you move to a new job that has a different data platform. That is the power of adapters in dbt — for more detail, read the [What are adapters](/guides/dbt-ecosystem/adapter-development/1-what-are-adapters) guide. +Adapters are an essential component of dbt. At their most basic level, they are how dbt connects with the various supported data platforms. At a higher-level, adapters strive to give analytics engineers more transferrable skills as well as standardize how analytics projects are structured. Gone are the days where you have to learn a new language or flavor of SQL when you move to a new job that has a different data platform. That is the power of adapters in dbt — for more detail, refer to the [Build, test, document, and promote adapters](/guides/adapter-creation) guide. This section provides more details on different ways you can connect dbt to an adapter, and explains what a maintainer is. diff --git a/website/docs/docs/contribute-core-adapters.md b/website/docs/docs/contribute-core-adapters.md index 553361ee1a2..d3b1edf2a38 100644 --- a/website/docs/docs/contribute-core-adapters.md +++ b/website/docs/docs/contribute-core-adapters.md @@ -17,6 +17,6 @@ Community-supported plugins are works in progress, and anyone is welcome to cont ### Create a new adapter -If you see something missing from the lists above, and you're interested in developing an integration, read more about adapters and how they're developed in the [Adapter Development](/guides/dbt-ecosystem/adapter-development/1-what-are-adapters) section. +If you see something missing from the lists above, and you're interested in developing an integration, read more about adapters and how they're developed in the [Build, test, document, and promote adapters](/guides/adapter-creation). -If you have a new adapter, please add it to this list using a pull request! See [Documenting your adapter](/guides/dbt-ecosystem/adapter-development/5-documenting-a-new-adapter) for more information. +If you have a new adapter, please add it to this list using a pull request! You can refer to [Build, test, document, and promote adapters](/guides/adapter-creation) for more information on documenting your adapter. 
diff --git a/website/docs/docs/core/about-core-setup.md b/website/docs/docs/core/about-core-setup.md index a4d5ff09ee3..64e7694b793 100644 --- a/website/docs/docs/core/about-core-setup.md +++ b/website/docs/docs/core/about-core-setup.md @@ -16,4 +16,4 @@ dbt Core is an [open-source](https://github.com/dbt-labs/dbt-core) tool that ena - [Connecting to a data platform](/docs/core/connect-data-platform/profiles.yml) - [How to run your dbt projects](/docs/running-a-dbt-project/run-your-dbt-projects) -If you need a more detailed first-time setup guide for specific data platforms, read our [quickstart guides](https://docs.getdbt.com/quickstarts). +If you need a more detailed first-time setup guide for specific data platforms, read our [quickstart guides](https://docs.getdbt.com/guides). diff --git a/website/docs/docs/core/connect-data-platform/about-core-connections.md b/website/docs/docs/core/connect-data-platform/about-core-connections.md index a85a32cc031..492e5ae878a 100644 --- a/website/docs/docs/core/connect-data-platform/about-core-connections.md +++ b/website/docs/docs/core/connect-data-platform/about-core-connections.md @@ -22,7 +22,7 @@ dbt communicates with a number of different data platforms by using a dedicated Data platforms supported in dbt Core may be verified or unverified, and maintained by dbt Labs, partners, or community members. -These connection instructions provide the basic fields required for configuring a data platform connection in dbt Cloud. For more detailed guides, which include demo project data, read our [Quickstart guides](https://docs.getdbt.com/docs/quickstarts/overview) +These connection instructions provide the basic fields required for configuring a data platform connection in dbt Cloud. For more detailed guides, which include demo project data, read our [Quickstart guides](https://docs.getdbt.com/docs/guides) ## Connection profiles diff --git a/website/docs/docs/dbt-cloud-apis/sl-jdbc.md b/website/docs/docs/dbt-cloud-apis/sl-jdbc.md index e10d057dc75..931666dd10c 100644 --- a/website/docs/docs/dbt-cloud-apis/sl-jdbc.md +++ b/website/docs/docs/dbt-cloud-apis/sl-jdbc.md @@ -363,5 +363,5 @@ semantic_layer.query(metrics=['food_order_amount', 'order_gross_profit'], ## Related docs -- [dbt Semantic Layer integration best practices](/guides/dbt-ecosystem/sl-partner-integration-guide) +- [dbt Semantic Layer integration best practices](/guides/sl-partner-integration-guide) diff --git a/website/docs/docs/dbt-cloud-environments.md b/website/docs/docs/dbt-cloud-environments.md index cd0d7f6858f..522a354be97 100644 --- a/website/docs/docs/dbt-cloud-environments.md +++ b/website/docs/docs/dbt-cloud-environments.md @@ -45,4 +45,4 @@ To use the dbt Cloud IDE or dbt Cloud CLI, each developer will need to set up [p Deployment environments in dbt Cloud are necessary to execute scheduled jobs and use other features. A dbt Cloud project can have multiple deployment environments, allowing for flexibility and customization. However, a dbt Cloud project can only have one deployment environment that represents the production source of truth. -To learn more about dbt Cloud deployment environments and how to configure them, refer to the [Deployment environments](/docs/deploy/deploy-environments) page. For our best practices guide, read [dbt Cloud environment best practices](/guides/orchestration/set-up-ci/overview) for more info. 
+To learn more about dbt Cloud deployment environments and how to configure them, refer to the [Deployment environments](/docs/deploy/deploy-environments) page. For our best practices guide, read [dbt Cloud environment best practices](/guides/set-up-ci) for more info. diff --git a/website/docs/docs/dbt-versions/core-upgrade/01-upgrading-to-v1.6.md b/website/docs/docs/dbt-versions/core-upgrade/01-upgrading-to-v1.6.md index f62b6308ce6..d36cc544814 100644 --- a/website/docs/docs/dbt-versions/core-upgrade/01-upgrading-to-v1.6.md +++ b/website/docs/docs/dbt-versions/core-upgrade/01-upgrading-to-v1.6.md @@ -36,7 +36,7 @@ The [spec for metrics](https://github.com/dbt-labs/dbt-core/discussions/7456) ha If your dbt project defines metrics, you must migrate to dbt v1.6 because the YAML spec has moved from dbt_metrics to MetricFlow. Any tests you have won't compile on v1.5 or older. - dbt Core v1.6 does not support Python 3.7, which reached End Of Life on June 23. Support Python versions are 3.8, 3.9, 3.10, and 3.11. -- As part of the [dbt Semantic layer](/docs/use-dbt-semantic-layer/dbt-sl) re-launch (in beta), the spec for `metrics` has changed significantly. Refer to the [migration guide](/guides/migration/sl-migration) for more info on how to migrate to the re-launched dbt Semantic Layer. +- As part of the [dbt Semantic layer](/docs/use-dbt-semantic-layer/dbt-sl) re-launch (in beta), the spec for `metrics` has changed significantly. Refer to the [migration guide](/guides/sl-migration) for more info on how to migrate to the re-launched dbt Semantic Layer. - The manifest schema version is now v10. - dbt Labs is ending support for Homebrew installation of dbt-core and adapters. See [the discussion](https://github.com/dbt-labs/dbt-core/discussions/8277) for more details. diff --git a/website/docs/docs/dbt-versions/core-upgrade/07-upgrading-to-v1.1.md b/website/docs/docs/dbt-versions/core-upgrade/07-upgrading-to-v1.1.md index 7819709558e..403264a46e6 100644 --- a/website/docs/docs/dbt-versions/core-upgrade/07-upgrading-to-v1.1.md +++ b/website/docs/docs/dbt-versions/core-upgrade/07-upgrading-to-v1.1.md @@ -21,7 +21,7 @@ There are no breaking changes for code in dbt projects and packages. We are comm ### For maintainers of adapter plugins -We have reworked the testing suite for adapter plugin functionality. For details on the new testing suite, see: [Testing a new adapter](/guides/dbt-ecosystem/adapter-development/4-testing-a-new-adapter). +We have reworked the testing suite for adapter plugin functionality. For details on the new testing suite, refer to the "Test your adapter" step in the [Build, test, document, and promote adapters](/guides/adapter-creation) guide. The abstract methods `get_response` and `execute` now only return `connection.AdapterReponse` in type hints. Previously, they could return a string. We encourage you to update your methods to return an object of class `AdapterResponse`, or implement a subclass specific to your adapter. This also gives you the opportunity to add fields specific to your adapter's query execution, such as `rows_affected` or `bytes_processed`. 
diff --git a/website/docs/docs/dbt-versions/core-upgrade/08-upgrading-to-v1.0.md b/website/docs/docs/dbt-versions/core-upgrade/08-upgrading-to-v1.0.md index 7c67a1849a1..3f45e44076c 100644 --- a/website/docs/docs/dbt-versions/core-upgrade/08-upgrading-to-v1.0.md +++ b/website/docs/docs/dbt-versions/core-upgrade/08-upgrading-to-v1.0.md @@ -51,7 +51,7 @@ Global project macros have been reorganized, and some old unused macros have bee ### For users of adapter plugins -- **BigQuery:** Support for [ingestion-time-partitioned tables](/guides/legacy/creating-date-partitioned-tables) has been officially deprecated in favor of modern approaches. Use `partition_by` and incremental modeling strategies instead. +- **BigQuery:** Support for ingestion-time-partitioned tables has been officially deprecated in favor of modern approaches. Use `partition_by` and incremental modeling strategies instead. For more information, refer to [Incremental models](/docs/build/incremental-models). ### For maintainers of plugins + other integrations @@ -71,9 +71,9 @@ Several under-the-hood changes from past minor versions, tagged with deprecation ## New features and changed documentation - Add [metrics](/docs/build/metrics), a new node type -- [Generic tests](/guides/best-practices/writing-custom-generic-tests) can be defined in `tests/generic` (new), in addition to `macros/` (as before) +- [Generic tests](/best-practices/writing-custom-generic-tests) can be defined in `tests/generic` (new), in addition to `macros/` (as before) - [Parsing](/reference/parsing): partial parsing and static parsing have been turned on by default. - [Global configs](/reference/global-configs/about-global-configs) have been standardized. Related updates to [global CLI flags](/reference/global-cli-flags) and [`profiles.yml`](/docs/core/connect-data-platform/profiles.yml). - [The `init` command](/reference/commands/init) has a whole new look and feel. It's no longer just for first-time users. -- Add `result:` subselectors for smarter reruns when dbt models have errors and tests fail. See examples: [Pro-tips for Workflows](/guides/legacy/best-practices#pro-tips-for-workflows) +- Add `result:` subselectors for smarter reruns when dbt models have errors and tests fail. See examples: [Pro-tips for Workflows](/best-practices/best-practice-workflows#pro-tips-for-workflows) - Secret-prefixed [env vars](/reference/dbt-jinja-functions/env_var) are now allowed only in `profiles.yml` + `packages.yml` diff --git a/website/docs/docs/dbt-versions/core-upgrade/10-upgrading-to-v0.20.md b/website/docs/docs/dbt-versions/core-upgrade/10-upgrading-to-v0.20.md index 61a7120370a..9ff5695d5dc 100644 --- a/website/docs/docs/dbt-versions/core-upgrade/10-upgrading-to-v0.20.md +++ b/website/docs/docs/dbt-versions/core-upgrade/10-upgrading-to-v0.20.md @@ -33,7 +33,7 @@ dbt Core v0.20 has reached the end of critical support. 
No new patch versions wi - [Test Configs](/reference/test-configs) - [Test properties](/reference/resource-properties/tests) - [Node Selection](/reference/node-selection/syntax) (with updated [test selection examples](/reference/node-selection/test-selection-examples)) -- [Writing custom generic tests](/guides/best-practices/writing-custom-generic-tests) +- [Writing custom generic tests](/best-practices/writing-custom-generic-tests) ### Elsewhere in Core - [Parsing](/reference/parsing): rework of partial parsing, introduction of experimental parser diff --git a/website/docs/docs/dbt-versions/core-upgrade/11-Older versions/upgrading-to-0-15-0.md b/website/docs/docs/dbt-versions/core-upgrade/11-Older versions/upgrading-to-0-15-0.md index 6dd2b6fb9eb..5eba212590f 100644 --- a/website/docs/docs/dbt-versions/core-upgrade/11-Older versions/upgrading-to-0-15-0.md +++ b/website/docs/docs/dbt-versions/core-upgrade/11-Older versions/upgrading-to-0-15-0.md @@ -26,7 +26,7 @@ expect this field will now return errors. See the latest ### Custom materializations -All materializations must now [manage dbt's Relation cache](/guides/advanced/creating-new-materializations#update-the-relation-cache). +All materializations must now manage dbt's Relation cache. For more information, refer to [Create new materializations](/guides/create-new-materializations). ### dbt Server diff --git a/website/docs/docs/dbt-versions/core-versions.md b/website/docs/docs/dbt-versions/core-versions.md index 5e8e437f0b1..2467f3c946b 100644 --- a/website/docs/docs/dbt-versions/core-versions.md +++ b/website/docs/docs/dbt-versions/core-versions.md @@ -84,7 +84,7 @@ Like many software projects, dbt Core releases follow [semantic versioning](http We are committed to avoiding breaking changes in minor versions for end users of dbt. There are two types of breaking changes that may be included in minor versions: -- Changes to the [Python interface for adapter plugins](/guides/dbt-ecosystem/adapter-development/3-building-a-new-adapter). These changes are relevant _only_ to adapter maintainers, and they will be clearly communicated in documentation and release notes. +- Changes to the Python interface for adapter plugins. These changes are relevant _only_ to adapter maintainers, and they will be clearly communicated in documentation and release notes. For more information, refer to [Build, test, document, and promote adapters](/guides/adapter-creation) guide. - Changes to metadata interfaces, including [artifacts](/docs/deploy/artifacts) and [logging](/reference/events-logging), signalled by a version bump. Those version upgrades may require you to update external code that depends on these interfaces, or to coordinate upgrades between dbt orchestrations that share metadata, such as [state-powered selection](/reference/node-selection/syntax#about-node-selection). ### How we version adapter plugins diff --git a/website/docs/docs/dbt-versions/release-notes/03-Oct-2023/product-docs-sept-rn.md b/website/docs/docs/dbt-versions/release-notes/03-Oct-2023/product-docs-sept-rn.md index e669b037d17..3fdaa0eafe8 100644 --- a/website/docs/docs/dbt-versions/release-notes/03-Oct-2023/product-docs-sept-rn.md +++ b/website/docs/docs/dbt-versions/release-notes/03-Oct-2023/product-docs-sept-rn.md @@ -27,11 +27,11 @@ Here's what's new to [docs.getdbt.com](http://docs.getdbt.com/): - Deprecated dbt Core v1.0 and v1.1 from the docs. - Added configuration instructions for the [AWS Glue](/docs/core/connect-data-platform/glue-setup) community plugin. 
-- Revised the dbt Core quickstart, making it easier to follow. Divided this guide into steps that align with the [other guides](/quickstarts/manual-install?step=1). +- Revised the dbt Core quickstart, making it easier to follow. Divided this guide into steps that align with the [other guides](/guides/manual-install?step=1). ## New 📚 Guides, ✏️ blog posts, and FAQs -Added a [style guide template](/guides/best-practices/how-we-style/6-how-we-style-conclusion#style-guide-template) that you can copy & paste to make sure you adhere to best practices when styling dbt projects! +Added a [style guide template](/best-practices/how-we-style/6-how-we-style-conclusion#style-guide-template) that you can copy & paste to make sure you adhere to best practices when styling dbt projects! ## Upcoming changes diff --git a/website/docs/docs/dbt-versions/release-notes/03-Oct-2023/sl-ga.md b/website/docs/docs/dbt-versions/release-notes/03-Oct-2023/sl-ga.md index 8ba71e9d825..a81abec5d42 100644 --- a/website/docs/docs/dbt-versions/release-notes/03-Oct-2023/sl-ga.md +++ b/website/docs/docs/dbt-versions/release-notes/03-Oct-2023/sl-ga.md @@ -8,7 +8,7 @@ tags: [Oct-2023] --- :::important -If you're using the legacy Semantic Layer, we **highly** recommend you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher and [migrate](/guides/migration/sl-migration) to the latest Semantic Layer. +If you're using the legacy Semantic Layer, we **highly** recommend you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher and [migrate](/guides/sl-migration) to the latest Semantic Layer. ::: dbt Labs is thrilled to announce that the [dbt Semantic Layer](/docs/use-dbt-semantic-layer/dbt-sl) is now generally available. It offers consistent data organization, improved governance, reduced costs, enhanced efficiency, and accessible data for better decision-making and collaboration across organizations. diff --git a/website/docs/docs/dbt-versions/release-notes/04-Sept-2023/ci-updates-phase2-rn.md b/website/docs/docs/dbt-versions/release-notes/04-Sept-2023/ci-updates-phase2-rn.md index fd2d163b748..a8ae1ade65b 100644 --- a/website/docs/docs/dbt-versions/release-notes/04-Sept-2023/ci-updates-phase2-rn.md +++ b/website/docs/docs/dbt-versions/release-notes/04-Sept-2023/ci-updates-phase2-rn.md @@ -29,7 +29,7 @@ Below is a comparison table that describes how deploy jobs and CI jobs behave di ## What you need to update -- If you want to set up a CI environment for your jobs, dbt Labs recommends that you create your CI job in a dedicated [deployment environment](/docs/deploy/deploy-environments#create-a-deployment-environment) that's connected to a staging database. To learn more about these environment best practices, refer to the guide [Get started with continuous integration tests](/guides/orchestration/set-up-ci/overview). +- If you want to set up a CI environment for your jobs, dbt Labs recommends that you create your CI job in a dedicated [deployment environment](/docs/deploy/deploy-environments#create-a-deployment-environment) that's connected to a staging database. To learn more about these environment best practices, refer to the guide [Get started with continuous integration tests](/guides/set-up-ci). - If you had set up a CI job before October 2, 2023, the job might've been misclassified as a deploy job with this update. 
Below describes how to fix the job type: diff --git a/website/docs/docs/dbt-versions/release-notes/04-Sept-2023/product-docs-summer-rn.md b/website/docs/docs/dbt-versions/release-notes/04-Sept-2023/product-docs-summer-rn.md index d8148542eef..e8fb9539c50 100644 --- a/website/docs/docs/dbt-versions/release-notes/04-Sept-2023/product-docs-summer-rn.md +++ b/website/docs/docs/dbt-versions/release-notes/04-Sept-2023/product-docs-summer-rn.md @@ -40,4 +40,4 @@ You can provide feedback by opening a pull request or issue in [our repo](https: ## New 📚 Guides, ✏️ blog posts, and FAQs * Check out how these community members use the dbt community in the [Community spotlight](/community/spotlight). * Blog posts published this summer include [Optimizing Materialized Views with dbt](/blog/announcing-materialized-views), [Data Vault 2.0 with dbt Cloud](/blog/data-vault-with-dbt-cloud), and [Create dbt Documentation and Tests 10x faster with ChatGPT](/blog/create-dbt-documentation-10x-faster-with-chatgpt) -* We now have two new best practice guides: [How we build our metrics](/guides/best-practices/how-we-build-our-metrics/semantic-layer-1-intro) and [Set up Continuous Integration](/guides/orchestration/set-up-ci/overview). +- We now have two new best practice guides: [How we build our metrics](/best-practices/how-we-build-our-metrics/semantic-layer-1-intro) and [Set up Continuous Integration](/guides/set-up-ci). diff --git a/website/docs/docs/dbt-versions/release-notes/05-Aug-2023/sl-revamp-beta.md b/website/docs/docs/dbt-versions/release-notes/05-Aug-2023/sl-revamp-beta.md index 921ed6dcd79..f44fd57aa4a 100644 --- a/website/docs/docs/dbt-versions/release-notes/05-Aug-2023/sl-revamp-beta.md +++ b/website/docs/docs/dbt-versions/release-notes/05-Aug-2023/sl-revamp-beta.md @@ -8,14 +8,14 @@ sidebar_position: 7 --- :::important -If you're using the legacy Semantic Layer, we **highly** recommend you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher to use the new dbt Semantic Layer. To migrate to the new Semantic Layer, refer to the dedicated [migration guide](/guides/migration/sl-migration) for more info. +If you're using the legacy Semantic Layer, we **highly** recommend you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher to use the new dbt Semantic Layer. To migrate to the new Semantic Layer, refer to the dedicated [migration guide](/guides/sl-migration) for more info. ::: dbt Labs are thrilled to announce the re-release of the [dbt Semantic Layer](/docs/use-dbt-semantic-layer/dbt-sl), now available in [public beta](#public-beta). It aims to bring the best of modeling and semantics to downstream applications by introducing: - [MetricFlow](/docs/build/about-metricflow) is a framework for constructing performant and legible SQL from an all new set of semantic constructs which include semantic models, entities, and metrics. - New Semantic Layer infrastructure that enables support for more data platforms (Snowflake, Databricks, BigQuery, Redshift, and soon more), along with improved performance. -- New and improved [developer workflows](/guides/migration/sl-migration), governance, and collaboration features. +- New and improved [developer workflows](/guides/sl-migration), governance, and collaboration features. - New [Semantic Layer API](/docs/dbt-cloud-apis/sl-api-overview) using JDBC to query metrics and build integrations. 
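For a sense of what querying metrics over that JDBC interface looks like, the rough shape of a query is below — it mirrors the `semantic_layer.query(...)` example in the Semantic Layer JDBC docs, with the metric names and `group_by` dimension as illustrative placeholders:

```sql
-- Sketch of a dbt Semantic Layer JDBC query (metric and dimension names are illustrative).
select *
from {{
    semantic_layer.query(
        metrics=['food_order_amount', 'order_gross_profit'],
        group_by=['metric_time']
    )
}}
```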
With semantics at its core, the dbt Semantic Layer marks a crucial milestone towards a new era of centralized logic and data applications. diff --git a/website/docs/docs/dbt-versions/release-notes/07-June-2023/product-docs-jun.md b/website/docs/docs/dbt-versions/release-notes/07-June-2023/product-docs-jun.md index 469d2ac362b..db73597cd63 100644 --- a/website/docs/docs/dbt-versions/release-notes/07-June-2023/product-docs-jun.md +++ b/website/docs/docs/dbt-versions/release-notes/07-June-2023/product-docs-jun.md @@ -32,4 +32,4 @@ Here's what's new to [docs.getdbt.com](http://docs.getdbt.com/) in June: ## New 📚 Guides, ✏️ blog posts, and FAQs -- Add an Azure DevOps example to the [Customizing CI/CD guide](/guides/orchestration/custom-cicd-pipelines/3-dbt-cloud-job-on-merge). +- Add an Azure DevOps example in the [Customizing CI/CD with custom pipelines](/guides/custom-cicd-pipelines) guide. diff --git a/website/docs/docs/dbt-versions/release-notes/08-May-2023/product-docs-may.md b/website/docs/docs/dbt-versions/release-notes/08-May-2023/product-docs-may.md index 762a6a723f8..a692c901a80 100644 --- a/website/docs/docs/dbt-versions/release-notes/08-May-2023/product-docs-may.md +++ b/website/docs/docs/dbt-versions/release-notes/08-May-2023/product-docs-may.md @@ -16,7 +16,7 @@ Here's what's new to [docs.getdbt.com](http://docs.getdbt.com/) in May: - We made sure everyone knows that Cloud-users don’t need a [profiles.yml file](/docs/core/connect-data-platform/profiles.yml) by adding a callout on several key pages. - Fleshed out the [model jinja variable page](/reference/dbt-jinja-functions/model), which originally lacked conceptual info and didn’t link to the schema page. -- Added a new [Quickstarts landing page](/quickstarts). This new format sets up for future iterations that will include filtering! But for now, we are excited you can step through quickstarts in a focused way. +- Added a new [Quickstarts landing page](/guides). This new format sets up for future iterations that will include filtering! But for now, we are excited you can step through quickstarts in a focused way. ## ☁ Cloud projects diff --git a/website/docs/docs/dbt-versions/release-notes/09-April-2023/product-docs.md b/website/docs/docs/dbt-versions/release-notes/09-April-2023/product-docs.md index d78040ea7e4..3de29b605ce 100644 --- a/website/docs/docs/dbt-versions/release-notes/09-April-2023/product-docs.md +++ b/website/docs/docs/dbt-versions/release-notes/09-April-2023/product-docs.md @@ -17,7 +17,7 @@ Hello from the dbt Docs team: @mirnawong1, @matthewshaver, @nghi-ly, and @runleo ## ☁ Cloud projects - Added Starburst/Trino adapter docs, including: - * [dbt Cloud quickstart guide](/quickstarts/starburst-galaxy),  + * [dbt Cloud quickstart guide](/guides/starburst-galaxy),  * [connection page](/docs/cloud/connect-data-platform/connect-starburst-trino),  * [set up page](/docs/core/connect-data-platform/trino-setup), and [config page](/reference/resource-configs/trino-configs). - Enhanced [dbt Cloud jobs page](/docs/deploy/jobs) and section to include conceptual info on the queue time, improvements made around it, and about failed jobs. 
@@ -31,10 +31,10 @@ Hello from the dbt Docs team: @mirnawong1, @matthewshaver, @nghi-ly, and @runleo ## New 📚 Guides and ✏️ blog posts -- [Use Databricks workflows to run dbt Cloud jobs](/guides/orchestration/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs) -- [Refresh Tableau workbook with extracts after a job finishes](/guides/orchestration/webhooks/zapier-refresh-tableau-workbook) -- [dbt Python Snowpark workshop/tutorial](/guides/dbt-ecosystem/dbt-python-snowpark/1-overview-dbt-python-snowpark) -- [How to optimize and troubleshoot dbt Models on Databricks](/guides/dbt-ecosystem/databricks-guides/how_to_optimize_dbt_models_on_databricks) -- [The missing guide to debug() in dbt](https://docs.getdbt.com/blog/guide-to-jinja-debug) -- [dbt Squared: Leveraging dbt Core and dbt Cloud together at scale](https://docs.getdbt.com/blog/dbt-squared) -- [Audit_helper in dbt: Bringing data auditing to a higher level](https://docs.getdbt.com/blog/audit-helper-for-migration) +- [Use Databricks workflows to run dbt Cloud jobs](/guides/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs) +- [Refresh Tableau workbook with extracts after a job finishes](/guides/zapier-refresh-tableau-workbook) +- [dbt Python Snowpark workshop/tutorial](/guides/dbt-python-snowpark) +- [How to optimize and troubleshoot dbt Models on Databricks](/guides/optimize-dbt-models-on-databricks) +- [The missing guide to debug() in dbt](/blog/guide-to-jinja-debug) +- [dbt Squared: Leveraging dbt Core and dbt Cloud together at scale](/blog/dbt-squared) +- [Audit_helper in dbt: Bringing data auditing to a higher level](/blog/audit-helper-for-migration) diff --git a/website/docs/docs/dbt-versions/release-notes/09-April-2023/starburst-trino-ga.md b/website/docs/docs/dbt-versions/release-notes/09-April-2023/starburst-trino-ga.md index 613a0c02432..708d51f0a44 100644 --- a/website/docs/docs/dbt-versions/release-notes/09-April-2023/starburst-trino-ga.md +++ b/website/docs/docs/dbt-versions/release-notes/09-April-2023/starburst-trino-ga.md @@ -8,5 +8,5 @@ tags: [Apr-2023] The Starburst (Trino compatible) connection is now generally available in dbt Cloud. This means you can now use dbt Cloud to connect with Starburst Galaxy, Starburst Enterprise, and self-hosted Trino. This feature is powered by the [`dbt-trino`](https://github.com/starburstdata/dbt-trino) adapter. -To learn more, check out our Quickstart guide for [dbt Cloud and Starburst Galaxy](https://docs.getdbt.com/quickstarts/starburst-galaxy). +To learn more, check out our Quickstart guide for [dbt Cloud and Starburst Galaxy](https://docs.getdbt.com/guides/starburst-galaxy). diff --git a/website/docs/docs/dbt-versions/release-notes/10-Mar-2023/public-preview-trino-in-dbt-cloud.md b/website/docs/docs/dbt-versions/release-notes/10-Mar-2023/public-preview-trino-in-dbt-cloud.md index bf3840a8b02..06abf178b8a 100644 --- a/website/docs/docs/dbt-versions/release-notes/10-Mar-2023/public-preview-trino-in-dbt-cloud.md +++ b/website/docs/docs/dbt-versions/release-notes/10-Mar-2023/public-preview-trino-in-dbt-cloud.md @@ -8,7 +8,7 @@ tags: [Mar-2023] dbt Labs is introducing the newest connection option in dbt Cloud: the `dbt-trino` adapter is now available in Public Preview. This allows you to connect to Starburst Galaxy, Starburst Enterprise, and self-hosted Trino from dbt Cloud. -Check out our [Quickstart for dbt Cloud and Starburst Galaxy](/quickstarts/starburst-galaxy) to explore more. 
+Check out our [Quickstart for dbt Cloud and Starburst Galaxy](/guides/starburst-galaxy) to explore more. ## What’s the reason users should be excited about this? diff --git a/website/docs/docs/dbt-versions/release-notes/24-Nov-2022/dbt-databricks-unity-catalog-support.md b/website/docs/docs/dbt-versions/release-notes/24-Nov-2022/dbt-databricks-unity-catalog-support.md index 25d5ca5205f..012615e1e4e 100644 --- a/website/docs/docs/dbt-versions/release-notes/24-Nov-2022/dbt-databricks-unity-catalog-support.md +++ b/website/docs/docs/dbt-versions/release-notes/24-Nov-2022/dbt-databricks-unity-catalog-support.md @@ -8,6 +8,6 @@ tags: [Nov-2022, v1.1.66.15] dbt Cloud is the easiest and most reliable way to develop and deploy a dbt project. It helps remove complexity while also giving you more features and better performance. A simpler Databricks connection experience with support for Databricks’ Unity Catalog and better modeling defaults is now available for your use. -For all the Databricks customers already using dbt Cloud with the dbt-spark adapter, you can now [migrate](https://docs.getdbt.com/guides/migration/tools/migrating-from-spark-to-databricks#migration) your connection to the [dbt-databricks adapter](https://docs.getdbt.com/reference/warehouse-setups/databricks-setup) to get the benefits. [Databricks](https://www.databricks.com/blog/2022/11/17/introducing-native-high-performance-integration-dbt-cloud.html) is committed to maintaining and improving the adapter, so this integrated experience will continue to provide the best of dbt and Databricks. +For all the Databricks customers already using dbt Cloud with the dbt-spark adapter, you can now [migrate](/guides/migrate-from-spark-to-databricks) your connection to the [dbt-databricks adapter](/docs/core/connect-data-platform/databricks-setup) to get the benefits. [Databricks](https://www.databricks.com/blog/2022/11/17/introducing-native-high-performance-integration-dbt-cloud.html) is committed to maintaining and improving the adapter, so this integrated experience will continue to provide the best of dbt and Databricks. Check out our [live blog post](https://www.getdbt.com/blog/dbt-cloud-databricks-experience/) to learn more. diff --git a/website/docs/docs/deploy/ci-jobs.md b/website/docs/docs/deploy/ci-jobs.md index d10bc780fc2..6114ed1ca14 100644 --- a/website/docs/docs/deploy/ci-jobs.md +++ b/website/docs/docs/deploy/ci-jobs.md @@ -9,7 +9,7 @@ You can set up [continuous integration](/docs/deploy/continuous-integration) (CI ## Set up CI jobs {#set-up-ci-jobs} -dbt Labs recommends that you create your CI job in a dedicated dbt Cloud [deployment environment](/docs/deploy/deploy-environments#create-a-deployment-environment) that's connected to a staging database. Having a separate environment dedicated for CI will provide better isolation between your temporary CI schema builds and your production data builds. Additionally, sometimes teams need their CI jobs to be triggered when a PR is made to a branch other than main. If your team maintains a staging branch as part of your release process, having a separate environment will allow you to set a [custom branch](/faqs/environments/custom-branch-settings) and, accordingly, the CI job in that dedicated environment will be triggered only when PRs are made to the specified custom branch. To learn more, refer to [Get started with CI tests](/guides/orchestration/set-up-ci/overview). 
+dbt Labs recommends that you create your CI job in a dedicated dbt Cloud [deployment environment](/docs/deploy/deploy-environments#create-a-deployment-environment) that's connected to a staging database. Having a separate environment dedicated for CI will provide better isolation between your temporary CI schema builds and your production data builds. Additionally, sometimes teams need their CI jobs to be triggered when a PR is made to a branch other than main. If your team maintains a staging branch as part of your release process, having a separate environment will allow you to set a [custom branch](/faqs/environments/custom-branch-settings) and, accordingly, the CI job in that dedicated environment will be triggered only when PRs are made to the specified custom branch. To learn more, refer to [Get started with CI tests](/guides/set-up-ci). ### Prerequisites - You have a dbt Cloud account. diff --git a/website/docs/docs/deploy/deploy-environments.md b/website/docs/docs/deploy/deploy-environments.md index 83231e2d66d..650fdb1c28a 100644 --- a/website/docs/docs/deploy/deploy-environments.md +++ b/website/docs/docs/deploy/deploy-environments.md @@ -13,7 +13,7 @@ Deployment environments in dbt Cloud are crucial for deploying dbt jobs in produ A dbt Cloud project can have multiple deployment environments, providing you the flexibility and customization to tailor the execution of dbt jobs. You can use deployment environments to [create and schedule jobs](/docs/deploy/deploy-jobs#create-and-schedule-jobs), [enable continuous integration](/docs/deploy/continuous-integration), or more based on your specific needs or requirements. :::tip Learn how to manage dbt Cloud environments -To learn different approaches to managing dbt Cloud environments and recommendations for your organization's unique needs, read [dbt Cloud environment best practices](https://docs.getdbt.com/guides/best-practices/environment-setup/1-env-guide-overview). +To learn different approaches to managing dbt Cloud environments and recommendations for your organization's unique needs, read [dbt Cloud environment best practices](/guides/set-up-ci). ::: This page reviews the different types of environments and how to configure your deployment environment in dbt Cloud. @@ -186,7 +186,7 @@ This section allows you to determine the credentials that should be used when co ## Related docs -- [dbt Cloud environment best practices](https://docs.getdbt.com/guides/best-practices/environment-setup/1-env-guide-overview) +- [dbt Cloud environment best practices](/guides/set-up-ci) - [Deploy jobs](/docs/deploy/deploy-jobs) - [CI jobs](/docs/deploy/continuous-integration) - [Delete a job or environment in dbt Cloud](/faqs/Environments/delete-environment-job) diff --git a/website/docs/docs/deploy/deployment-tools.md b/website/docs/docs/deploy/deployment-tools.md index baa9d6c4a01..cca2368f38a 100644 --- a/website/docs/docs/deploy/deployment-tools.md +++ b/website/docs/docs/deploy/deployment-tools.md @@ -126,14 +126,14 @@ Cron is a decent way to schedule bash commands. However, while it may seem like Use Databricks workflows to call the dbt Cloud job API, which has several benefits such as integration with other ETL processes, utilizing dbt Cloud job features, separation of concerns, and custom job triggering based on custom conditions or logic. These advantages lead to more modularity, efficient debugging, and flexibility in scheduling dbt Cloud jobs. 
-For more info, refer to the guide on [Databricks workflows and dbt Cloud jobs](/guides/orchestration/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs). +For more info, refer to the guide on [Databricks workflows and dbt Cloud jobs](/guides/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs). ## Related docs - [dbt Cloud plans and pricing](https://www.getdbt.com/pricing/) -- [Quickstart guides](/quickstarts) +- [Quickstart guides](/guides) - [Webhooks for your jobs](/docs/deploy/webhooks) - [Orchestration guides](https://docs.getdbt.com/guides/orchestration) - [Commands for your production deployment](https://discourse.getdbt.com/t/what-are-the-dbt-commands-you-run-in-your-production-deployment-of-dbt/366) diff --git a/website/docs/docs/deploy/webhooks.md b/website/docs/docs/deploy/webhooks.md index 123cb6ef39f..f6c766ab201 100644 --- a/website/docs/docs/deploy/webhooks.md +++ b/website/docs/docs/deploy/webhooks.md @@ -8,7 +8,7 @@ With dbt Cloud, you can create outbound webhooks to send events (notifications) A webhook is an HTTP-based callback function that allows event-driven communication between two different web applications. This allows you to get the latest information on your dbt jobs in real time. Without it, you would need to make API calls repeatedly to check if there are any updates that you need to account for (polling). Because of this, webhooks are also called _push APIs_ or _reverse APIs_ and are often used for infrastructure development. -dbt Cloud sends a JSON payload to your application's endpoint URL when your webhook is triggered. You can send a [Slack](/guides/orchestration/webhooks/zapier-slack) notification, a [Microsoft Teams](/guides/orchestration/webhooks/zapier-ms-teams) notification, [open a PagerDuty incident](/guides/orchestration/webhooks/serverless-pagerduty) when a dbt job fails, [and more](/guides/orchestration/webhooks). +dbt Cloud sends a JSON payload to your application's endpoint URL when your webhook is triggered. You can send a [Slack](/guides/zapier-slack) notification, a [Microsoft Teams](/guides/zapier-ms-teams) notification, [open a PagerDuty incident](/guides/serverless-pagerduty) when a dbt job fails. 
You can create webhooks for these events from the [dbt Cloud web-based UI](#create-a-webhook-subscription) and by using the [dbt Cloud API](#api-for-webhooks): @@ -549,5 +549,5 @@ DELETE https://{your access URL}/api/v3/accounts/{account_id}/webhooks/subscript ## Related docs - [dbt Cloud CI](/docs/deploy/continuous-integration) -- [Use dbt Cloud's webhooks with other SaaS apps](/guides/orchestration/webhooks) +- [Use dbt Cloud's webhooks with other SaaS apps](/guides) diff --git a/website/docs/docs/environments-in-dbt.md b/website/docs/docs/environments-in-dbt.md index 70bc096cf4f..f0691761dd6 100644 --- a/website/docs/docs/environments-in-dbt.md +++ b/website/docs/docs/environments-in-dbt.md @@ -33,7 +33,7 @@ Configure environments to tell dbt Cloud or dbt Core how to build and execute yo ## Related docs -- [dbt Cloud environment best practices](https://docs.getdbt.com/guides/best-practices/environment-setup/1-env-guide-overview) +- [dbt Cloud environment best practices](/guides/set-up-ci) - [Deployment environments](/docs/deploy/deploy-environments) - [About dbt Core versions](/docs/dbt-versions/core) - [Set Environment variables in dbt Cloud](/docs/build/environment-variables#special-environment-variables) diff --git a/website/docs/docs/introduction.md b/website/docs/docs/introduction.md index 0aeef0201cb..61cda6e1d3e 100644 --- a/website/docs/docs/introduction.md +++ b/website/docs/docs/introduction.md @@ -39,11 +39,11 @@ You can learn about plans and pricing on [www.getdbt.com](https://www.getdbt.com ### dbt Cloud dbt Cloud is the fastest and most reliable way to deploy dbt. Develop, test, schedule, and investigate data models all in one web-based UI. It also natively supports developing using a command line with the [dbt Cloud CLI](/docs/cloud/cloud-cli-installation). -Learn more about [dbt Cloud features](/docs/cloud/about-cloud/dbt-cloud-features) and try one of the [dbt Cloud quickstarts](/quickstarts). +Learn more about [dbt Cloud features](/docs/cloud/about-cloud/dbt-cloud-features) and try one of the [dbt Cloud quickstarts](/guides). ### dbt Core -dbt Core is an open-source tool that enables data teams to transform data using analytics engineering best practices. You can install and use dbt Core on the command line. Learn more with the [quickstart for dbt Core](/quickstarts/codespace). +dbt Core is an open-source tool that enables data teams to transform data using analytics engineering best practices. You can install and use dbt Core on the command line. Learn more with the [quickstart for dbt Core](/guides/codespace). ## The power of dbt @@ -62,7 +62,7 @@ As a dbt user, your main focus will be on writing models (i.e. select queries) t ### Related docs -- [Quickstarts for dbt](/quickstarts) -- [Best practice guides](/guides/best-practices) +- [Quickstarts for dbt](/guides) +- [Best practice guides](/best-practices) - [What is a dbt Project?](/docs/build/projects) - [dbt run](/docs/running-a-dbt-project/run-your-dbt-projects) diff --git a/website/docs/docs/supported-data-platforms.md b/website/docs/docs/supported-data-platforms.md index a8e146f49d0..c0c9a30db36 100644 --- a/website/docs/docs/supported-data-platforms.md +++ b/website/docs/docs/supported-data-platforms.md @@ -8,7 +8,7 @@ pagination_next: "docs/connect-adapters" pagination_prev: null --- -dbt connects to and runs SQL against your database, warehouse, lake, or query engine. These SQL-speaking platforms are collectively referred to as _data platforms_. 
dbt connects with data platforms by using a dedicated adapter plugin for each. Plugins are built as Python modules that dbt Core discovers if they are installed on your system. Read [What are Adapters](/guides/dbt-ecosystem/adapter-development/1-what-are-adapters) for more info. +dbt connects to and runs SQL against your database, warehouse, lake, or query engine. These SQL-speaking platforms are collectively referred to as _data platforms_. dbt connects with data platforms by using a dedicated adapter plugin for each. Plugins are built as Python modules that dbt Core discovers if they are installed on your system. Refer to the [Build, test, document, and promote adapters](/guides/adapter-creation) guide. for more info. You can [connect](/docs/connect-adapters) to adapters and data platforms natively in dbt Cloud or install them manually using dbt Core. diff --git a/website/docs/docs/trusted-adapters.md b/website/docs/docs/trusted-adapters.md index 08191e8ea42..20d61f69575 100644 --- a/website/docs/docs/trusted-adapters.md +++ b/website/docs/docs/trusted-adapters.md @@ -21,7 +21,7 @@ pendency on this library? ### Trusted adapter specifications -See [Building a Trusted Adapter](/guides/dbt-ecosystem/adapter-development/8-building-a-trusted-adapter) for more information, particularly if you are an adapter maintainer considering having your adapter be added to the trusted list. +Refer to the [Build, test, document, and promote adapters](/guides/adapter-creation) guide for more information, particularly if you are an adapter maintainer considering having your adapter be added to the trusted list. ### Trusted vs Verified diff --git a/website/docs/docs/use-dbt-semantic-layer/dbt-sl.md b/website/docs/docs/use-dbt-semantic-layer/dbt-sl.md index 8c78d556a67..4e3caa3eb21 100644 --- a/website/docs/docs/use-dbt-semantic-layer/dbt-sl.md +++ b/website/docs/docs/use-dbt-semantic-layer/dbt-sl.md @@ -99,7 +99,7 @@ The dbt Semantic Layer reduces code duplication and inconsistency regarding your :::info 📌 -New to dbt or metrics? Check out our [quickstart guide](/quickstarts) to build your first dbt project! If you'd like to define your first metrics, try our [Jaffle Shop](https://github.com/dbt-labs/jaffle_shop_metrics) example project. +New to dbt or metrics? Check out our [quickstart guide](/guides) to build your first dbt project! If you'd like to define your first metrics, try our [Jaffle Shop](https://github.com/dbt-labs/jaffle_shop_metrics) example project. ::: @@ -130,7 +130,7 @@ You can design and define your metrics in `.yml` files nested under a metrics ke
How do I migrate from the legacy Semantic Layer to the new one?
-
If you're using the legacy Semantic Layer, we highly recommend you upgrade your dbt version to dbt v1.6 or higher to use the new dbt Semantic Layer. Refer to the dedicated migration guide for more info.
+
If you're using the legacy Semantic Layer, we highly recommend you upgrade your dbt version to dbt v1.6 or higher to use the new dbt Semantic Layer. Refer to the dedicated migration guide for more info.
diff --git a/website/docs/docs/use-dbt-semantic-layer/quickstart-sl.md b/website/docs/docs/use-dbt-semantic-layer/quickstart-sl.md index 42f08a90401..7961294c38e 100644 --- a/website/docs/docs/use-dbt-semantic-layer/quickstart-sl.md +++ b/website/docs/docs/use-dbt-semantic-layer/quickstart-sl.md @@ -91,7 +91,7 @@ If you're encountering some issues when defining your metrics or setting up the
How do I migrate from the legacy Semantic Layer to the new one?
-
If you're using the legacy Semantic Layer, we highly recommend you upgrade your dbt version to dbt v1.6 or higher to use the new dbt Semantic Layer. Refer to the dedicated migration guide for more info.
+
If you're using the legacy Semantic Layer, we highly recommend you upgrade your dbt version to dbt v1.6 or higher to use the new dbt Semantic Layer. Refer to the dedicated migration guide for more info.
@@ -137,7 +137,7 @@ To use the dbt Semantic Layer, you’ll need to meet the following: :::info 📌 -New to dbt or metrics? Check out our [quickstart guide](/quickstarts) to build your first dbt project! If you'd like to define your first metrics, try our [Jaffle Shop](https://github.com/dbt-labs/jaffle_shop_metrics) example project. +New to dbt or metrics? Check out our [quickstart guide](/guides) to build your first dbt project! If you'd like to define your first metrics, try our [Jaffle Shop](https://github.com/dbt-labs/jaffle_shop_metrics) example project. ::: diff --git a/website/docs/docs/use-dbt-semantic-layer/setup-sl.md b/website/docs/docs/use-dbt-semantic-layer/setup-sl.md index 4c88ee50b25..5f793142bdc 100644 --- a/website/docs/docs/use-dbt-semantic-layer/setup-sl.md +++ b/website/docs/docs/use-dbt-semantic-layer/setup-sl.md @@ -53,7 +53,7 @@ With the dbt Semantic Layer, you can define business metrics, reduce code duplic ## Set up dbt Semantic Layer :::tip -If you're using the legacy Semantic Layer, dbt Labs strongly recommends that you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher to use the latest dbt Semantic Layer. Refer to the dedicated [migration guide](/guides/migration/sl-migration) for more info. +If you're using the legacy Semantic Layer, dbt Labs strongly recommends that you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher to use the latest dbt Semantic Layer. Refer to the dedicated [migration guide](/guides/sl-migration) for more info. ::: @@ -95,5 +95,5 @@ It is _not_ recommended that you use your dbt Cloud credentials due to elevated - [Build your metrics](/docs/build/build-metrics-intro) - [Available integrations](/docs/use-dbt-semantic-layer/avail-sl-integrations) - [Semantic Layer APIs](/docs/dbt-cloud-apis/sl-api-overview) -- [Migrate your legacy Semantic Layer](/guides/migration/sl-migration) +- [Migrate your legacy Semantic Layer](/guides/sl-migration) - [Get started with the dbt Semantic Layer](/docs/use-dbt-semantic-layer/quickstart-sl) diff --git a/website/docs/docs/use-dbt-semantic-layer/sl-architecture.md b/website/docs/docs/use-dbt-semantic-layer/sl-architecture.md index 9e8737c68d3..0a195aa8ecb 100644 --- a/website/docs/docs/use-dbt-semantic-layer/sl-architecture.md +++ b/website/docs/docs/use-dbt-semantic-layer/sl-architecture.md @@ -32,7 +32,7 @@ The dbt Semantic Layer includes the following components:
How do I migrate from the legacy Semantic Layer to the new one?
-
If you're using the legacy Semantic Layer, we highly recommend you upgrade your dbt version to dbt v1.6 or higher to use the new dbt Semantic Layer. Refer to the dedicated migration guide for more info.
+
If you're using the legacy Semantic Layer, we highly recommend you upgrade your dbt version to dbt v1.6 or higher to use the new dbt Semantic Layer. Refer to the dedicated migration guide for more info.
diff --git a/website/docs/docs/verified-adapters.md b/website/docs/docs/verified-adapters.md index 170bc8f885b..75c7529c247 100644 --- a/website/docs/docs/verified-adapters.md +++ b/website/docs/docs/verified-adapters.md @@ -11,7 +11,7 @@ These adapters then earn a "Verified" status so that users can have a certain le The verification process serves as the on-ramp to integration with dbt Cloud. As such, we restrict applicants to data platform vendors with whom we are already engaged. -To learn more, see [Verifying a new adapter](/guides/dbt-ecosystem/adapter-development/7-verifying-a-new-adapter). +To learn more, refer to the [Build, test, document, and promote adapters](/guides/adapter-creation) guide. import MSCallout from '/snippets/_microsoft-adapters-soon.md'; diff --git a/website/docs/faqs/Jinja/jinja-whitespace.md b/website/docs/faqs/Jinja/jinja-whitespace.md index 49ced7183b7..5e1ec3dc7ac 100644 --- a/website/docs/faqs/Jinja/jinja-whitespace.md +++ b/website/docs/faqs/Jinja/jinja-whitespace.md @@ -7,6 +7,6 @@ id: jinja-whitespace This is known as "whitespace control". -Use a minus sign (`-`, e.g. `{{- ... -}}`, `{%- ... %}`, `{#- ... -#}`) at the start or end of a block to strip whitespace before or after the block (more docs [here](https://jinja.palletsprojects.com/page/templates/#whitespace-control)). Check out the [tutorial on using Jinja](/guides/advanced/using-jinja#use-whitespace-control-to-tidy-up-compiled-code) for an example. +Use a minus sign (`-`, e.g. `{{- ... -}}`, `{%- ... %}`, `{#- ... -#}`) at the start or end of a block to strip whitespace before or after the block (more docs [here](https://jinja.palletsprojects.com/page/templates/#whitespace-control)). Check out the [tutorial on using Jinja](/guides/using-jinja#use-whitespace-control-to-tidy-up-compiled-code) for an example. Take caution: it's easy to fall down a rabbit hole when it comes to whitespace control! diff --git a/website/docs/faqs/Models/available-materializations.md b/website/docs/faqs/Models/available-materializations.md index 011d3ba3fb0..bf11c92b595 100644 --- a/website/docs/faqs/Models/available-materializations.md +++ b/website/docs/faqs/Models/available-materializations.md @@ -8,4 +8,4 @@ id: available-materializations dbt ships with five materializations: `view`, `table`, `incremental`, `ephemeral` and `materialized_view`. Check out the documentation on [materializations](/docs/build/materializations) for more information on each of these options. -You can also create your own [custom materializations](/guides/advanced/creating-new-materializations), if required however this is an advanced feature of dbt. +You can also create your own [custom materializations](/guides/create-new-materializations), if required however this is an advanced feature of dbt. diff --git a/website/docs/faqs/Models/create-dependencies.md b/website/docs/faqs/Models/create-dependencies.md index 6a01aa18dca..e902d93b018 100644 --- a/website/docs/faqs/Models/create-dependencies.md +++ b/website/docs/faqs/Models/create-dependencies.md @@ -44,4 +44,4 @@ Found 2 models, 28 tests, 0 snapshots, 0 analyses, 130 macros, 0 operations, 0 s Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2 ``` -To learn more about building a dbt project, we recommend you complete the [quickstart guide](/quickstarts). +To learn more about building a dbt project, we recommend you complete the [quickstart guide](/guides). 
diff --git a/website/docs/faqs/Project/example-projects.md b/website/docs/faqs/Project/example-projects.md index f59d6e56e78..cd58c8832e2 100644 --- a/website/docs/faqs/Project/example-projects.md +++ b/website/docs/faqs/Project/example-projects.md @@ -8,7 +8,7 @@ id: example-projects Yes! -* **Quickstart Tutorial:** You can build your own example dbt project in the [quickstart guide](/quickstarts) +* **Quickstart Tutorial:** You can build your own example dbt project in the [quickstart guide](/guides) * **Jaffle Shop:** A demonstration project (closely related to the tutorial) for a fictional ecommerce store ([source code](https://github.com/dbt-labs/jaffle_shop)) * **MRR Playbook:** A demonstration project that models subscription revenue ([source code](https://github.com/dbt-labs/mrr-playbook), [docs](https://www.getdbt.com/mrr-playbook/#!/overview)) * **Attribution Playbook:** A demonstration project that models marketing attribution ([source code](https://github.com/dbt-labs/attribution-playbook), [docs](https://www.getdbt.com/attribution-playbook/#!/overview)) diff --git a/website/docs/faqs/Project/multiple-resource-yml-files.md b/website/docs/faqs/Project/multiple-resource-yml-files.md index 422b7beb702..04e1702a162 100644 --- a/website/docs/faqs/Project/multiple-resource-yml-files.md +++ b/website/docs/faqs/Project/multiple-resource-yml-files.md @@ -9,4 +9,4 @@ It's up to you: - Some folks find it useful to have one file per model (or source / snapshot / seed etc) - Some find it useful to have one per directory, documenting and testing multiple models in one file -Choose what works for your team. We have more recommendations in our guide on [structuring dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview). +Choose what works for your team. We have more recommendations in our guide on [structuring dbt projects](/best-practices/how-we-structure/1-guide-overview). diff --git a/website/docs/faqs/Project/resource-yml-name.md b/website/docs/faqs/Project/resource-yml-name.md index 8a6ebe96134..c26cff26474 100644 --- a/website/docs/faqs/Project/resource-yml-name.md +++ b/website/docs/faqs/Project/resource-yml-name.md @@ -10,4 +10,4 @@ It's up to you! Here's a few options: - Use the same name as your directory (assuming you're using sensible names for your directories) - If you test and document one model (or seed, snapshot, macro etc.) per file, you can give it the same name as the model (or seed, snapshot, macro etc.) -Choose what works for your team. We have more recommendations in our guide on [structuring dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview). +Choose what works for your team. We have more recommendations in our guide on [structuring dbt projects](/best-practices/how-we-structure/1-guide-overview). diff --git a/website/docs/faqs/Project/structure-a-project.md b/website/docs/faqs/Project/structure-a-project.md index 5d73f9f25ba..a9ef53f5c8f 100644 --- a/website/docs/faqs/Project/structure-a-project.md +++ b/website/docs/faqs/Project/structure-a-project.md @@ -8,4 +8,4 @@ id: structure-a-project There's no one best way to structure a project! Every organization is unique. -If you're just getting started, check out how we (dbt Labs) [structure our dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview). +If you're just getting started, check out how we (dbt Labs) [structure our dbt projects](/best-practices/how-we-structure/1-guide-overview). 
diff --git a/website/docs/faqs/Project/why-not-write-dml.md b/website/docs/faqs/Project/why-not-write-dml.md index fd2cea7d3ad..210ef4a916d 100644 --- a/website/docs/faqs/Project/why-not-write-dml.md +++ b/website/docs/faqs/Project/why-not-write-dml.md @@ -30,4 +30,4 @@ You can test your models, generate documentation, create snapshots, and more! SQL dialects tend to diverge the most in DML and DDL (rather than in `select` statements) — check out the example [here](/faqs/models/sql-dialect). By writing less SQL, it can make a migration to a new database technology easier. -If you do need to write custom DML, there are ways to do this in dbt using [custom materializations](/guides/advanced/creating-new-materializations). +If you do need to write custom DML, there are ways to do this in dbt using [custom materializations](/guides/create-new-materializations). diff --git a/website/docs/faqs/Tests/custom-test-thresholds.md b/website/docs/faqs/Tests/custom-test-thresholds.md index 7155b39d25e..34d2eec7494 100644 --- a/website/docs/faqs/Tests/custom-test-thresholds.md +++ b/website/docs/faqs/Tests/custom-test-thresholds.md @@ -11,4 +11,4 @@ As of `v0.20.0`, you can use the `error_if` and `warn_if` configs to set custom For dbt `v0.19.0` and earlier, you could try these possible solutions: * Setting the [severity](/reference/resource-properties/tests#severity) to `warn`, or: -* Writing a [custom generic test](/guides/best-practices/writing-custom-generic-tests) that accepts a threshold argument ([example](https://discourse.getdbt.com/t/creating-an-error-threshold-for-schema-tests/966)) +* Writing a [custom generic test](/best-practices/writing-custom-generic-tests) that accepts a threshold argument ([example](https://discourse.getdbt.com/t/creating-an-error-threshold-for-schema-tests/966)) diff --git a/website/docs/faqs/Warehouse/db-connection-dbt-compile.md b/website/docs/faqs/Warehouse/db-connection-dbt-compile.md index d8e58155b10..8017da4545b 100644 --- a/website/docs/faqs/Warehouse/db-connection-dbt-compile.md +++ b/website/docs/faqs/Warehouse/db-connection-dbt-compile.md @@ -22,7 +22,7 @@ To generate the compiled SQL for many models, dbt needs to run introspective que These introspective queries include: -- Populating the [relation cache](/guides/advanced/creating-new-materializations#update-the-relation-cache). Caching speeds up the metadata checks, including whether an [incremental model](/docs/build/incremental-models) already exists in the data platform. +- Populating the relation cache. For more information, refer to the [Create new materializations](/guides/create-new-materializations) guide. Caching speeds up the metadata checks, including whether an [incremental model](/docs/build/incremental-models) already exists in the data platform. - Resolving [macros](/docs/build/jinja-macros#macros), such as `run_query` or `dbt_utils.get_column_values` that you're using to template out your SQL. This is because dbt needs to run those queries during model SQL compilation. Without a data platform connection, dbt can't perform these introspective queries and won't be able to generate the compiled SQL needed for the next steps in the dbt workflow. You can [`parse`](/reference/commands/parse) a project and use the [`list`](/reference/commands/list) resources in the project, without an internet or data platform connection. Parsing a project is enough to produce a [manifest](/reference/artifacts/manifest-json), however, keep in mind that the written-out manifest won't include compiled SQL. 
diff --git a/website/docs/guides/adapter-creation.md b/website/docs/guides/adapter-creation.md new file mode 100644 index 00000000000..8a9145f0258 --- /dev/null +++ b/website/docs/guides/adapter-creation.md @@ -0,0 +1,1352 @@ +--- +title: Build, test, document, and promote adapters +id: adapter-creation +description: "Create an adapter that connects dbt to you platform, and learn how to maintain and version that adapter." +hoverSnippet: "Learn how to build, test, document, and promote adapters as well as maintaining and versioning an adapter." +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Adapter creation'] +level: 'Advanced' +recently_updated: true +--- + +## Introduction + +Adapters are an essential component of dbt. At their most basic level, they are how dbt connects with the various supported data platforms. At a higher-level, dbt Core adapters strive to give analytics engineers more transferrable skills as well as standardize how analytics projects are structured. Gone are the days where you have to learn a new language or flavor of SQL when you move to a new job that has a different data platform. That is the power of adapters in dbt Core. + + Navigating and developing around the nuances of different databases can be daunting, but you are not alone. Visit [#adapter-ecosystem](https://getdbt.slack.com/archives/C030A0UF5LM) Slack channel for additional help beyond the documentation. + +### All databases are not the same + +There's a tremendous amount of work that goes into creating a database. Here is a high-level list of typical database layers (from the outermost layer moving inwards): +- SQL API +- Client Library / Driver +- Server Connection Manager +- Query parser +- Query optimizer +- Runtime +- Storage Access Layer +- Storage + +There's a lot more there than just SQL as a language. Databases (and data warehouses) are so popular because you can abstract away a great deal of the complexity from your brain to the database itself. This enables you to focus more on the data. + +dbt allows for further abstraction and standardization of the outermost layers of a database (SQL API, client library, connection manager) into a framework that both: + - Opens database technology to less technical users (a large swath of a DBA's role has been automated, similar to how the vast majority of folks with websites today no longer have to be "[webmasters](https://en.wikipedia.org/wiki/Webmaster)"). + - Enables more meaningful conversations about how data warehousing should be done. + +This is where dbt adapters become critical. + +### What needs to be adapted? + +dbt adapters are responsible for _adapting_ dbt's standard functionality to a particular database. Our prototypical database and adapter are PostgreSQL and dbt-postgres, and most of our adapters are somewhat based on the functionality described in dbt-postgres. + +Connecting dbt to a new database will require a new adapter to be built or an existing adapter to be extended. + +The outermost layers of a database map roughly to the areas in which the dbt adapter framework encapsulates inter-database differences. + +### SQL API + +Even amongst ANSI-compliant databases, there are differences in the SQL grammar. 
+Here are some categories and examples of SQL statements that can be constructed differently:
+
+| Category | Area of differences | Examples |
+|----------|---------------------|----------|
+| Statement syntax | The use of `IF EXISTS` | `IF EXISTS, DROP TABLE` <br /> `DROP TABLE IF EXISTS` |
+| Workflow definition & semantics | Incremental updates | `MERGE` <br /> `DELETE; INSERT` |
+| Relation and column attributes/configuration | Database-specific materialization configs | `DIST = ROUND_ROBIN` (Synapse) <br /> `DIST = EVEN` (Redshift) |
+| Permissioning | Grant statements that can only take one grantee at a time vs. those that accept lists of grantees | `grant SELECT on table dinner.corn to corn_kid, everyone` <br /> `grant SELECT on table dinner.corn to corn_kid; grant SELECT on table dinner.corn to everyone` |
+
+### Python Client Library & Connection Manager
+
+The other big category of inter-database differences comes with how the client connects to the database and executes queries against the connection. To integrate with dbt, a data platform must have a pre-existing Python client library or support ODBC, using a generic Python library like pyodbc.
+
+| Category | Area of differences | Examples |
+|----------|---------------------|----------|
+| Credentials & authentication | Authentication | Username & password <br /> MFA with `boto3` or Okta token |
+| Connection opening/closing | Create a new connection to db | `psycopg2.connect(connection_string)` <br /> `google.cloud.bigquery.Client(...)` |
+| Inserting local data | Load seed `.csv` files into Python memory | `google.cloud.bigquery.Client.load_table_from_file(...)` (BigQuery) <br /> `INSERT ... INTO VALUES ...` prepared statement (most other databases) |
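+To make the difference concrete, here is a rough, illustrative sketch (not taken from any adapter; the connection parameters and project name are placeholders) of how differently two real client libraries expose "open a connection and run a query". This is exactly the kind of variation a connection manager has to hide:
+
+```python
+# Postgres-style DB API 2.0 driver (psycopg2): explicit connection + cursor.
+import psycopg2
+
+pg_conn = psycopg2.connect(host="localhost", dbname="analytics", user="dbt", password="...")
+pg_cur = pg_conn.cursor()
+pg_cur.execute("select 1")
+print(pg_cur.fetchone())
+pg_conn.close()
+
+# BigQuery client library: a job-based API with no user-managed cursor.
+from google.cloud import bigquery
+
+bq_client = bigquery.Client(project="my-gcp-project")  # placeholder project
+for row in bq_client.query("select 1 as answer").result():
+    print(row.answer)
+```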
+
+### How dbt encapsulates and abstracts these differences
+
+Differences between databases are encoded into discrete areas:
+
+| Components | Code Path | Function |
+|------------------|---------------------------------------------------|-------------------------------------------------------------------------------|
+| Python Classes | `adapters/<adapter_name>` | Configuration (see [Python Classes](#python-classes) below) |
+| Macros | `include/<adapter_name>/macros/adapters/` | SQL API & statement syntax (for example, how to create a schema or how to get table info) |
+| Materializations | `include/<adapter_name>/macros/materializations/` | Table/view/snapshot workflow definitions |
+
+#### Python Classes
+
+These classes implement all the methods responsible for:
+- Connecting to a database and issuing queries.
+- Providing dbt with database-specific configuration information.
+
+| Class | Description |
+|--------------------------|---------------------------------------------------------------------------------------------|
+| AdapterClass | High-level configuration type conversion and any database-specific Python methods needed |
+| AdapterCredentials | Typed dictionary of possible profiles and associated methods |
+| AdapterConnectionManager | All the methods responsible for connecting to a database and issuing queries |
+| AdapterRelation | How relation names should be rendered, printed, and quoted. Do relation names use all three parts? `catalog.model_name` (two-part name) or `database.schema.model_name` (three-part name) |
+| AdapterColumn | How names should be rendered, and database-specific properties |
+
+#### Macros
+
+A set of *macros* responsible for generating SQL that is compliant with the target database.
+
+#### Materializations
+
+A set of *materializations* and their corresponding helper macros defined in dbt using Jinja and SQL. They codify for dbt how model files should be persisted into the database.
+
+### Adapter Architecture
+
+Below is a diagram of how dbt-postgres, the adapter at the center of dbt-core, works.
+
+## Prerequisites
+
+It is very important that you have the right skills and understand the level of difficulty required to make an adapter for your data platform.
+
+The more of the questions below you can answer "yes" to, the easier your adapter development (and user) experience will be. See the [New Adapter Information Sheet wiki](https://github.com/dbt-labs/dbt-core/wiki/New-Adapter-Information-Sheet) for even more specific questions.
+
+### Training
+
+- The developer (and any product managers) ideally will have substantial experience as an end user of dbt. If not, it is highly advised that you at least take the [dbt Fundamentals](https://courses.getdbt.com/courses/fundamentals) and [Advanced Materializations](https://courses.getdbt.com/courses/advanced-materializations) courses.
+
+### Database
+
+- Does the database complete transactions fast enough for interactive development?
+- Can you execute SQL against the data platform?
+- Is there a concept of schemas?
+- Does the data platform support ANSI SQL, or at least a subset?
+
+### Driver / Connection Library
+
+- Is there a Python-based driver for interacting with the database that is DB API 2.0 compliant (for example, psycopg2 for Postgres or pyodbc for SQL Server)?
+- Does it support prepared statements, multiple statements, or single sign-on token authorization to the data platform? (A quick way to check is sketched after this list.)
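+One rough way to answer the driver questions above is a quick DB API 2.0 smoke test. This is an illustrative sketch only: it assumes a psycopg2-style driver and placeholder credentials, so substitute your platform's driver and connection parameters:
+
+```python
+import psycopg2  # stand-in for your platform's DB API 2.0 driver
+
+# Placeholder connection parameters -- not real credentials.
+conn = psycopg2.connect(host="localhost", port=5432, dbname="analytics", user="dbt", password="...")
+cur = conn.cursor()
+
+# Parameterized statement support (the DB API "prepared-style" pattern).
+cur.execute("select %s::int as answer", (42,))
+print(cur.fetchone())
+
+# Basic schema/metadata awareness -- useful for dbt's catalog and caching queries.
+cur.execute("select schema_name from information_schema.schemata")
+print(cur.fetchall()[:5])
+
+conn.close()
+```
+
+If the driver can't do the equivalent of the above comfortably, expect to spend extra effort on the connection manager and metadata macros later in this guide.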
+
+### Open source software
+
+- Does your organization have an established process for publishing open source software?
+
+It is easiest to build an adapter for dbt when the data platform in question has the following:
+
+- a conventional ANSI-SQL interface (or as close to it as possible),
+- a mature connection library/SDK that uses ODBC or Python DB API 2.0, and
+- a way to enable developers to iterate rapidly with both quick reads and writes
+
+### Maintaining your new adapter
+
+When your adapter becomes more popular, and people start using it, you may quickly become the maintainer of an increasingly popular open source project. With this new role come some unexpected responsibilities that not only include code maintenance, but also working with a community of users and contributors. To help people understand what to expect of your project, you should communicate your intentions early and often in your adapter documentation or README. Answer questions like: Is this experimental work that people should use at their own risk? Or is this production-grade code that you're committed to maintaining into the future?
+
+#### Keeping the code compatible with dbt Core
+
+New minor version releases of `dbt-core` may include changes to the Python interface for adapter plugins, as well as new or updated test cases. The maintainers of `dbt-core` will clearly communicate these changes in documentation and release notes, and they will aim for backwards compatibility whenever possible.
+
+Patch releases of `dbt-core` will _not_ include breaking changes to adapter-facing code. For more details, see ["About dbt Core versions"](/docs/dbt-versions/core).
+
+#### Versioning and releasing your adapter
+
+We strongly encourage you to adopt the following approach when versioning and releasing your plugin:
+
+- The minor version of your plugin should match the minor version in `dbt-core` (e.g. 1.1.x).
+- Aim to release a new version of your plugin for each new minor version of `dbt-core` (once every three months).
+- While your plugin is new, and you're iterating on features, aim to offer backwards compatibility and deprecation notices for at least one minor version. As your plugin matures, aim to leave backwards compatibility and deprecation notices in place until the next major version (dbt Core v2).
+- Release patch versions of your plugins whenever needed. These patch releases should contain fixes _only_.
+
+## Build a new adapter
+
+This step will walk you through creating the necessary adapter classes and macros, and provide some resources to help you validate that your new adapter is working correctly. Make sure you've familiarized yourself with the previous steps in this guide.
+
+Once the adapter is passing most of the functional tests in the previous "Testing a new adapter" step, please let the community know that it is available to use by adding the adapter to the ["Supported Data Platforms"](/docs/supported-data-platforms) page, following the steps given in "Documenting your adapter".
+
+For any questions you may have, don't hesitate to ask in the [#adapter-ecosystem](https://getdbt.slack.com/archives/C030A0UF5LM) Slack channel. The community is very helpful and has likely run into a similar issue before.
+
+### Scaffolding a new adapter
+
+To create a new adapter plugin from scratch, you can use the [dbt-database-adapter-scaffold](https://github.com/dbt-labs/dbt-database-adapter-scaffold) to trigger an interactive session which will generate a scaffolding for you to build upon.
+ + Example usage: + + ``` + $ cookiecutter gh:dbt-labs/dbt-database-adapter-scaffold + ``` + +The generated boilerplate starting project will include a basic adapter plugin file structure, examples of macros, high level method descriptions, etc. + +One of the most important choices you will make during the cookiecutter generation will revolve around the field for `is_sql_adapter` which is a boolean used to correctly apply imports for either a `SQLAdapter` or `BaseAdapter`. Knowing which you will need requires a deeper knowledge of your selected database but a few good guides for the choice are. + +- Does your database have a complete SQL API? Can it perform tasks using SQL such as creating schemas, dropping schemas, querying an `information_schema` for metadata calls? If so, it is more likely to be a SQLAdapter where you set `is_sql_adapter` to `True`. +- Most adapters do fall under SQL adapters which is why we chose it as the default `True` value. +- It is very possible to build out a fully functional `BaseAdapter`. This will require a little more ground work as it doesn't come with some prebuilt methods the `SQLAdapter` class provides. See `dbt-bigquery` as a good guide. + +### Implementation Details + +Regardless if you decide to use the cookiecutter template or manually create the plugin, this section will go over each method that is required to be implemented. The table below provides a high-level overview of the classes, methods, and macros you may have to define for your data platform. + +| file | component | purpose | +|---------------------------------------------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `./setup.py` | `setup()` function | adapter meta-data (package name, version, author, homepage, etc) | +| `myadapter/dbt/adapters/myadapter/__init__.py` | `AdapterPlugin` | bundle all the information below into a dbt plugin | +| `myadapter/dbt/adapters/myadapter/connections.py` | `MyAdapterCredentials` class | parameters to connect to and configure the database, via a the chosen Python driver | +| `myadapter/dbt/adapters/myadapter/connections.py` | `MyAdapterConnectionManager` class | telling dbt how to interact with the database w.r.t opening/closing connections, executing queries, and fetching data. Effectively a wrapper around the db API or driver. | +| `myadapter/dbt/include/bigquery/` | a dbt project of macro "overrides" in the format of "myadapter__" | any differences in SQL syntax for regular db operations will be modified here from the global_project (e.g. "Create Table As Select", "Get all relations in the current schema", etc) | +| `myadapter/dbt/adapters/myadapter/impl.py` | `MyAdapterConfig` | database- and relation-level configs and | +| `myadapter/dbt/adapters/myadapter/impl.py` | `MyAdapterAdapter` | for changing _how_ dbt performs operations like macros and other needed Python functionality | +| `myadapter/dbt/adapters/myadapter/column.py` | `MyAdapterColumn` | for defining database-specific column such as datatype mappings | + +### Editing `setup.py` + +Edit the file at `myadapter/setup.py` and fill in the missing information. + +You can skip this step if you passed the arguments for `email`, `url`, `author`, and `dependencies` to the cookiecutter template script. 
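+If you're filling it in by hand, a minimal `setup.py` might look roughly like the sketch below. This is a hypothetical example: the package name `dbt-myadapter`, the author details, and the version pins are placeholders rather than values from this guide:
+
+```python
+from setuptools import find_namespace_packages, setup
+
+setup(
+    name="dbt-myadapter",  # placeholder package name
+    version="1.6.0",  # placeholder; match your supported dbt-core minor version
+    description="The MyAdapter plugin for dbt",
+    author="Your Name",
+    author_email="you@example.com",
+    url="https://github.com/your-org/dbt-myadapter",
+    # Adapter plugins live in the `dbt` namespace package.
+    packages=find_namespace_packages(include=["dbt", "dbt.*"]),
+    # Ship the adapter's dbt project and macros alongside the Python code.
+    package_data={
+        "dbt": [
+            "include/myadapter/dbt_project.yml",
+            "include/myadapter/macros/*.sql",
+        ]
+    },
+    install_requires=[
+        "dbt-core~=1.6",
+        # plus your platform's Python driver
+    ],
+)
+```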
If you plan on having nested macro folder structures, you may need to add entries to `package_data` so your macro source files get installed. + +### Editing the connection manager + +Edit the connection manager at `myadapter/dbt/adapters/myadapter/connections.py`. This file is defined in the sections below. + +#### The Credentials class + +The credentials class defines all of the database-specific credentials (e.g. `username` and `password`) that users will need in the [connection profile](/docs/supported-data-platforms) for your new adapter. Each credentials contract should subclass dbt.adapters.base.Credentials, and be implemented as a python dataclass. + +Note that the base class includes required database and schema fields, as dbt uses those values internally. + +For example, if your adapter requires a host, integer port, username string, and password string, but host is the only required field, you'd add definitions for those new properties to the class as types, like this: + + + +```python + +from dataclasses import dataclass +from typing import Optional + +from dbt.adapters.base import Credentials + + +@dataclass +class MyAdapterCredentials(Credentials): + host: str + port: int = 1337 + username: Optional[str] = None + password: Optional[str] = None + + @property + def type(self): + return 'myadapter' + + @property + def unique_field(self): + """ + Hashed and included in anonymous telemetry to track adapter adoption. + Pick a field that can uniquely identify one team/organization building with this adapter + """ + return self.host + + def _connection_keys(self): + """ + List of keys to display in the `dbt debug` output. + """ + return ('host', 'port', 'database', 'username') +``` + + + +There are a few things you can do to make it easier for users when connecting to your database: + +- Be sure to implement the Credentials' `_connection_keys` method shown above. This method will return the keys that should be displayed in the output of the `dbt debug` command. As a general rule, it's good to return all the arguments used in connecting to the actual database except the password (even optional arguments). +- Create a `profile_template.yml` to enable configuration prompts for a brand-new user setting up a connection profile via the [`dbt init` command](/reference/commands/init). You will find more details in the following steps. +- You may also want to define an `ALIASES` mapping on your Credentials class to include any config names you want users to be able to use in place of 'database' or 'schema'. For example if everyone using the MyAdapter database calls their databases "collections", you might do: + + + +```python +@dataclass +class MyAdapterCredentials(Credentials): + host: str + port: int = 1337 + username: Optional[str] = None + password: Optional[str] = None + + ALIASES = { + 'collection': 'database', + } +``` + + + +Then users can use `collection` OR `database` in their `profiles.yml`, `dbt_project.yml`, or `config()` calls to set the database. + +#### `ConnectionManager` class methods + +Once credentials are configured, you'll need to implement some connection-oriented methods. They are enumerated in the SQLConnectionManager docstring, but an overview will also be provided here. 
+ +**Methods to implement:** + +- `open` +- `get_response` +- `cancel` +- `exception_handler` +- `standardize_grants_dict` + +##### `open(cls, connection)` + +`open()` is a classmethod that gets a connection object (which could be in any state, but will have a `Credentials` object with the attributes you defined above) and moves it to the 'open' state. + +Generally this means doing the following: + - if the connection is open already, log and return it. + - If a database needed changes to the underlying connection before re-use, that would happen here + - create a connection handle using the underlying database library using the credentials + - on success: + - set connection.state to `'open'` + - set connection.handle to the handle object + - this is what must have a `cursor()` method that returns a cursor! + - on error: + - set connection.state to `'fail'` + - set connection.handle to `None` + - raise a `dbt.exceptions.FailedToConnectException` with the error and any other relevant information + +For example: + + + +```python + @classmethod + def open(cls, connection): + if connection.state == 'open': + logger.debug('Connection is already open, skipping open.') + return connection + + credentials = connection.credentials + + try: + handle = myadapter_library.connect( + host=credentials.host, + port=credentials.port, + username=credentials.username, + password=credentials.password, + catalog=credentials.database + ) + connection.state = 'open' + connection.handle = handle + return connection +``` + + + +##### `get_response(cls, cursor)` + +`get_response` is a classmethod that gets a cursor object and returns adapter-specific information about the last executed command. The return value should be an `AdapterResponse` object that includes items such as `code`, `rows_affected`, `bytes_processed`, and a summary `_message` for logging to stdout. + + + +```python + @classmethod + def get_response(cls, cursor) -> AdapterResponse: + code = cursor.sqlstate or "OK" + rows = cursor.rowcount + status_message = f"{code} {rows}" + return AdapterResponse( + _message=status_message, + code=code, + rows_affected=rows + ) +``` + + + +##### `cancel(self, connection)` + +`cancel` is an instance method that gets a connection object and attempts to cancel any ongoing queries, which is database dependent. Some databases don't support the concept of cancellation, they can simply implement it via 'pass' and their adapter classes should implement an `is_cancelable` that returns False - On ctrl+c connections may remain running. This method must be implemented carefully, as the affected connection will likely be in use in a different thread. + + + +```python + def cancel(self, connection): + tid = connection.handle.transaction_id() + sql = 'select cancel_transaction({})'.format(tid) + logger.debug("Cancelling query '{}' ({})".format(connection_name, pid)) + _, cursor = self.add_query(sql, 'master') + res = cursor.fetchone() + logger.debug("Canceled query '{}': {}".format(connection_name, res)) +``` + + + +##### `exception_handler(self, sql, connection_name='master')` + +`exception_handler` is an instance method that returns a context manager that will handle exceptions raised by running queries, catch them, log appropriately, and then raise exceptions dbt knows how to handle. 
+ +If you use the (highly recommended) `@contextmanager` decorator, you only have to wrap a `yield` inside a `try` block, like so: + + + +```python + @contextmanager + def exception_handler(self, sql: str): + try: + yield + except myadapter_library.DatabaseError as exc: + self.release(connection_name) + + logger.debug('myadapter error: {}'.format(str(e))) + raise dbt.exceptions.DatabaseException(str(exc)) + except Exception as exc: + logger.debug("Error running SQL: {}".format(sql)) + logger.debug("Rolling back transaction.") + self.release(connection_name) + raise dbt.exceptions.RuntimeException(str(exc)) +``` + + + +##### `standardize_grants_dict(self, grants_table: agate.Table) -> dict` + +`standardize_grants_dict` is an method that returns the dbt-standardized grants dictionary that matches how users configure grants now in dbt. The input is the result of `SHOW GRANTS ON {{model}}` call loaded into an agate table. + +If there's any massaging of agate table containing the results, of `SHOW GRANTS ON {{model}}`, that can't easily be accomplished in SQL, it can be done here. For example, the SQL to show grants _should_ filter OUT any grants TO the current user/role (e.g. OWNERSHIP). If that's not possible in SQL, it can be done in this method instead. + + + +```python + @available + def standardize_grants_dict(self, grants_table: agate.Table) -> dict: + """ + :param grants_table: An agate table containing the query result of + the SQL returned by get_show_grant_sql + :return: A standardized dictionary matching the `grants` config + :rtype: dict + """ + grants_dict: Dict[str, List[str]] = {} + for row in grants_table: + grantee = row["grantee"] + privilege = row["privilege_type"] + if privilege in grants_dict.keys(): + grants_dict[privilege].append(grantee) + else: + grants_dict.update({privilege: [grantee]}) + return grants_dict +``` + + + +### Editing the adapter implementation + +Edit the connection manager at `myadapter/dbt/adapters/myadapter/impl.py` + +Very little is required to implement the adapter itself. On some adapters, you will not need to override anything. On others, you'll likely need to override some of the ``convert_*`` classmethods, or override the `is_cancelable` classmethod on others to return `False`. + +#### `datenow()` + +This classmethod provides the adapter's canonical date function. This is not used but is required– anyway on all adapters. + + + +```python + @classmethod + def date_function(cls): + return 'datenow()' +``` + + + +### Editing SQL logic + +dbt implements specific SQL operations using jinja macros. While reasonable defaults are provided for many such operations (like `create_schema`, `drop_schema`, `create_table`, etc), you may need to override one or more of macros when building a new adapter. + +#### Required macros + +The following macros must be implemented, but you can override their behavior for your adapter using the "dispatch" pattern described below. Macros marked (required) do not have a valid default implementation, and are required for dbt to operate. 
+ +- `alter_column_type` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/columns.sql#L37-L55)) +- `check_schema_exists` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/metadata.sql#L43-L55)) +- `create_schema` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/schema.sql#L1-L9)) +- `drop_relation` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/relation.sql#L34-L42)) +- `drop_schema` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/schema.sql#L12-L20)) +- `get_columns_in_relation` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/columns.sql#L1-L8)) (required) +- `list_relations_without_caching` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/metadata.sql#L58-L65)) (required) +- `list_schemas` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/metadata.sql#L29-L40)) +- `rename_relation` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/relation.sql#L56-L65)) +- `truncate_relation` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/relation.sql#L45-L53)) +- `current_timestamp` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/freshness.sql#L1-L8)) (required) +- `copy_grants` + +#### Adapter dispatch + +Most modern databases support a majority of the standard SQL spec. There are some databases that _do not_ support critical aspects of the SQL spec however, or they provide their own nonstandard mechanisms for implementing the same functionality. To account for these variations in SQL support, dbt provides a mechanism called [multiple dispatch](https://en.wikipedia.org/wiki/Multiple_dispatch) for macros. With this feature, macros can be overridden for specific adapters. This makes it possible to implement high-level methods (like "create ") in a database-specific way. + + + +```jinja2 + +{# dbt will call this macro by name, providing any arguments #} +{% macro create_table_as(temporary, relation, sql) -%} + + {# dbt will dispatch the macro call to the relevant macro #} + {{ return( + adapter.dispatch('create_table_as')(temporary, relation, sql) + ) }} +{%- endmacro %} + + + +{# If no macro matches the specified adapter, "default" will be used #} +{% macro default__create_table_as(temporary, relation, sql) -%} + ... +{%- endmacro %} + + + +{# Example which defines special logic for Redshift #} +{% macro redshift__create_table_as(temporary, relation, sql) -%} + ... +{%- endmacro %} + + + +{# Example which defines special logic for BigQuery #} +{% macro bigquery__create_table_as(temporary, relation, sql) -%} + ... 
+{%- endmacro %}
+```
+
+
+
+The `adapter.dispatch()` macro takes a second argument, `packages`, which represents a set of "search namespaces" in which to find potential implementations of a dispatched macro. This allows users of community-supported adapters to extend or "shim" dispatched macros from common packages, such as `dbt-utils`, with adapter-specific versions in their own project or other installed packages. See:
+
+- "Shim" package examples: [`spark-utils`](https://github.com/dbt-labs/spark-utils), [`tsql-utils`](https://github.com/dbt-msft/tsql-utils)
+- [`adapter.dispatch` docs](/reference/dbt-jinja-functions/dispatch)
+
+#### Overriding adapter methods
+
+While much of dbt's adapter-specific functionality can be modified in adapter macros, it can also make sense to override adapter methods directly. In this example, assume that a database does not support a `cascade` parameter to `drop schema`. Instead, we can implement an approximation where we drop each relation and then drop the schema.
+
+
+
+```python
+    def drop_schema(self, relation: BaseRelation):
+        # drop every relation in the schema first, since `cascade` isn't supported
+        relations = self.list_relations(
+            database=relation.database,
+            schema=relation.schema
+        )
+        for schema_relation in relations:
+            self.drop_relation(schema_relation)
+        # then drop the (now empty) schema itself
+        super().drop_schema(relation)
+```
+
+
+
+#### Grants Macros
+
+See [this GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/5468) for information on the macros required for `GRANT` statements.
+
+### Other files
+
+#### `profile_template.yml`
+
+In order to enable the [`dbt init` command](/reference/commands/init) to prompt users when setting up a new project and connection profile, you should include a **profile template**. The filepath needs to be `dbt/include/<adapter_name>/profile_template.yml`. It's possible to provide hints, default values, and conditional prompts based on connection methods that require different supporting attributes. Users will also be able to include custom versions of this file in their own projects, with fixed values specific to their organization, to support their colleagues when using your dbt adapter for the first time.
+
+See examples:
+
+- [dbt-postgres](https://github.com/dbt-labs/dbt-core/blob/main/plugins/postgres/dbt/include/postgres/profile_template.yml)
+- [dbt-redshift](https://github.com/dbt-labs/dbt-redshift/blob/main/dbt/include/redshift/profile_template.yml)
+- [dbt-snowflake](https://github.com/dbt-labs/dbt-snowflake/blob/main/dbt/include/snowflake/profile_template.yml)
+- [dbt-bigquery](https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/include/bigquery/profile_template.yml)
+
+#### `__version__.py`
+
+To ensure that `dbt --version` reports the latest dbt Core version the adapter supports, be sure to include a `__version__.py` file at `dbt/adapters/<adapter_name>/__version__.py`. We recommend starting with the latest dbt Core version; as the adapter is made compatible with later versions, this file will need to be updated. For a sample file, check out this [example](https://github.com/dbt-labs/dbt-snowflake/blob/main/dbt/adapters/snowflake/__version__.py).
+
+Note that both of these files are included in the bootstrapped output of the `dbt-database-adapter-scaffold`, so if you used the scaffolding, they are already in place.
+
+## Test your adapter
+
+:::info
+
+Previously, we offered a packaged suite of tests for dbt adapter functionality: [`pytest-dbt-adapter`](https://github.com/dbt-labs/dbt-adapter-tests). We are deprecating that suite, in favor of the newer testing framework outlined in this document.
+ +::: + +This document has two sections: + +1. Refer to "About the testing framework" for a description of the standard framework that we maintain for using pytest together with dbt. It includes an example that shows the anatomy of a simple test case. +2. Refer to "Testing your adapter" for a step-by-step guide for using our out-of-the-box suite of "basic" tests, which will validate that your adapter meets a baseline of dbt functionality. + +### Testing prerequisites + +- Your adapter must be compatible with dbt-core **v1.1** or newer +- You should be familiar with **pytest**: + +### About the testing framework + +dbt-core offers a standard framework for running pre-built functional tests, and for defining your own tests. The core testing framework is built using `pytest`, a mature and standard library for testing Python projects. + +The **[`tests` module](https://github.com/dbt-labs/dbt-core/tree/HEAD/core/dbt/tests)** within `dbt-core` includes basic utilities for setting up pytest + dbt. These are used by all "pre-built" functional tests, and make it possible to quickly write your own tests. + +Those utilities allow you to do three basic things: + +1. **Quickly set up a dbt "project."** Define project resources via methods such as `models()` and `seeds()`. Use `project_config_update()` to pass configurations into `dbt_project.yml`. +2. **Define a sequence of dbt commands.** The most important utility is `run_dbt()`, which returns the [results](/reference/dbt-classes#result-objects) of each dbt command. It takes a list of CLI specifiers (subcommand + flags), as well as an optional second argument, `expect_pass=False`, for cases where you expect the command to fail. +3. **Validate the results of those dbt commands.** For example, `check_relations_equal()` asserts that two database objects have the same structure and content. You can also write your own `assert` statements, by inspecting the results of a dbt command, or querying arbitrary database objects with `project.run_sql()`. + +You can see the full suite of utilities, with arguments and annotations, in [`util.py`](https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/tests/util.py). You'll also see them crop up across a number of test cases. While all utilities are intended to be reusable, you won't need all of them for every test. In the example below, we'll show a simple test case that uses only a few utilities. + +#### Example: a simple test case + +This example will show you the anatomy of a test case using dbt + pytest. We will create reusable components, combine them to form a dbt "project", and define a sequence of dbt commands. Then, we'll use Python `assert` statements to ensure those commands succeed (or fail) as we expect. + +In ["Getting started running basic tests,"](#getting-started-running-basic-tests) we'll offer step-by-step instructions for installing and configuring `pytest`, so that you can run it on your own machine. For now, it's more important to see how the pieces of a test case fit together. + +This example includes a seed, a model, and two tests—one of which will fail. + +1. Define Python strings that will represent the file contents in your dbt project. Defining these in a separate file enables you to reuse the same components across different test cases. The pytest name for this type of reusable component is "fixture." 
+ + + +```python +# seeds/my_seed.csv +my_seed_csv = """ +id,name,some_date +1,Easton,1981-05-20T06:46:51 +2,Lillian,1978-09-03T18:10:33 +3,Jeremiah,1982-03-11T03:59:51 +4,Nolan,1976-05-06T20:21:35 +""".lstrip() + +# models/my_model.sql +my_model_sql = """ +select * from {{ ref('my_seed') }} +union all +select null as id, null as name, null as some_date +""" + +# models/my_model.yml +my_model_yml = """ +version: 2 +models: + - name: my_model + columns: + - name: id + tests: + - unique + - not_null # this test will fail +""" +``` + + + +2. Use the "fixtures" to define the project for your test case. These fixtures are always scoped to the **class**, where the class represents one test case—that is, one dbt project or scenario. (The same test case can be used for one or more actual tests, which we'll see in step 3.) Following the default pytest configurations, the file name must begin with `test_`, and the class name must begin with `Test`. + + + +```python +import pytest +from dbt.tests.util import run_dbt + +# our file contents +from tests.functional.example.fixtures import ( + my_seed_csv, + my_model_sql, + my_model_yml, +) + +# class must begin with 'Test' +class TestExample: + """ + Methods in this class will be of two types: + 1. Fixtures defining the dbt "project" for this test case. + These are scoped to the class, and reused for all tests in the class. + 2. Actual tests, whose names begin with 'test_'. + These define sequences of dbt commands and 'assert' statements. + """ + + # configuration in dbt_project.yml + @pytest.fixture(scope="class") + def project_config_update(self): + return { + "name": "example", + "models": {"+materialized": "view"} + } + + # everything that goes in the "seeds" directory + @pytest.fixture(scope="class") + def seeds(self): + return { + "my_seed.csv": my_seed_csv, + } + + # everything that goes in the "models" directory + @pytest.fixture(scope="class") + def models(self): + return { + "my_model.sql": my_model_sql, + "my_model.yml": my_model_yml, + } + + # continues below +``` + + + +3. Now that we've set up our project, it's time to define a sequence of dbt commands and assertions. We define one or more methods in the same file, on the same class (`TestExampleFailingTest`), whose names begin with `test_`. These methods share the same setup (project scenario) from above, but they can be run independently by pytest—so they shouldn't depend on each other in any way. + + + +```python + # continued from above + + # The actual sequence of dbt commands and assertions + # pytest will take care of all "setup" + "teardown" + def test_run_seed_test(self, project): + """ + Seed, then run, then test. We expect one of the tests to fail + An alternative pattern is to use pytest "xfail" (see below) + """ + # seed seeds + results = run_dbt(["seed"]) + assert len(results) == 1 + # run models + results = run_dbt(["run"]) + assert len(results) == 1 + # test tests + results = run_dbt(["test"], expect_pass = False) # expect failing test + assert len(results) == 2 + # validate that the results include one pass and one failure + result_statuses = sorted(r.status for r in results) + assert result_statuses == ["fail", "pass"] + + @pytest.mark.xfail + def test_build(self, project): + """Expect a failing test""" + # do it all + results = run_dbt(["build"]) +``` + + + +3. Our test is ready to run! The last step is to invoke `pytest` from your command line. We'll walk through the actual setup and configuration of `pytest` in the next section. 
+
+
+
+```sh
+$ python3 -m pytest tests/functional/test_example.py
+=========================== test session starts ============================
+platform ... -- Python ..., pytest-..., pluggy-...
+rootdir: ...
+plugins: ...
+
+tests/functional/test_example.py .X [100%]
+
+======================= 1 passed, 1 xpassed in 1.38s =======================
+```
+
+
+
+You can find more ways to run tests, along with a full command reference, in the [pytest usage docs](https://docs.pytest.org/how-to/usage.html).
+
+We've found the `-s` flag (or `--capture=no`) helpful to print logs from the underlying dbt invocations, and to step into an interactive debugger if you've added one. You can also use environment variables to set [global dbt configs](/reference/global-configs/about-global-configs), such as `DBT_DEBUG` (to show debug-level logs).
+
+### Testing your adapter
+
+Anyone who installs `dbt-core`, and wishes to define their own test cases, can use the framework presented in the first section. The framework is especially useful for testing standard dbt behavior across different databases.
+
+To that end, we have built and made available a [package of reusable adapter test cases](https://github.com/dbt-labs/dbt-core/tree/HEAD/tests/adapter), for creators and maintainers of adapter plugins. These test cases cover basic expected functionality, as well as functionality that frequently requires different implementations across databases.
+
+For the time being, this package is also located within the `dbt-core` repository, but separate from the `dbt-core` Python package.
+
+### Categories of tests
+
+In the course of creating and maintaining your adapter, it's likely that you will end up implementing tests that fall into three broad categories:
+
+1. **Basic tests** that every adapter plugin is expected to pass. These are defined in `tests.adapter.basic`. Given differences across data platforms, these may require slight modification or reimplementation. Significantly overriding or disabling these tests should be done only with good reason, since each represents basic functionality expected by dbt users. For example, if your adapter does not support incremental models, you should disable the test [by marking it with `skip` or `xfail`](https://docs.pytest.org/en/latest/how-to/skipping.html) (see the sketch at the end of this section), as well as noting that limitation in any documentation, READMEs, and usage guides that accompany your adapter.
+
+2. **Optional tests**, for second-order functionality that is common across plugins, but not required for basic use. Your plugin can opt into these test cases by inheriting existing ones, or reimplementing them with adjustments. For now, this category includes all tests located outside the `basic` subdirectory. More tests will be added as we convert older tests defined on dbt-core and mature plugins to use the standard framework.
+
+3. **Custom tests**, for behavior that is specific to your adapter / data platform. Each has its own specialties and idiosyncrasies. We encourage you to use the same `pytest`-based framework, utilities, and fixtures to write your own custom tests for functionality that is unique to your adapter.
+
+If you run into an issue with the core framework, or the basic/optional test cases—or if you've written a custom test that you believe would be relevant and useful for other adapter plugin developers—please open an issue or PR in the `dbt-core` repository on GitHub.
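+
+For instance, here is a minimal sketch of disabling one of the basic test cases for a hypothetical `dbt-myadapter` plugin that lacks incremental support. `BaseIncremental` is one of the pre-built basic test classes; the adapter name and skip reason are placeholders:
+
+```python
+import pytest
+
+# One of the pre-built "basic" test cases shipped with dbt-tests-adapter
+from dbt.tests.adapter.basic.test_incremental import BaseIncremental
+
+
+# Hypothetical example: dbt-myadapter does not support incremental
+# materializations, so we skip this basic test and document the limitation
+# instead of silently dropping it.
+@pytest.mark.skip(reason="Incremental materializations are not supported by dbt-myadapter")
+class TestIncrementalMyAdapter(BaseIncremental):
+    pass
+```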
+ +### Getting started running basic tests + +In this section, we'll walk through the three steps to start running our basic test cases on your adapter plugin: + +1. Install dependencies +2. Set up and configure pytest +3. Define test cases + +### Install dependencies + +You should already have a virtual environment with `dbt-core` and your adapter plugin installed. You'll also need to install: + +- [`pytest`](https://pypi.org/project/pytest/) +- [`dbt-tests-adapter`](https://pypi.org/project/dbt-tests-adapter/), the set of common test cases +- (optional) [`pytest` plugins](https://docs.pytest.org/en/7.0.x/reference/plugin_list.html)--we'll use `pytest-dotenv` below + +Or specify all dependencies in a requirements file like: + + +```txt +pytest +pytest-dotenv +dbt-tests-adapter +``` + + + +```sh +pip install -r dev_requirements.txt +``` + +### Set up and configure pytest + +First, set yourself up to run `pytest` by creating a file named `pytest.ini` at the root of your repository: + + + +```python +[pytest] +filterwarnings = + ignore:.*'soft_unicode' has been renamed to 'soft_str'*:DeprecationWarning + ignore:unclosed file .*:ResourceWarning +env_files = + test.env # uses pytest-dotenv plugin + # this allows you to store env vars for database connection in a file named test.env + # rather than passing them in every CLI command, or setting in `PYTEST_ADDOPTS` + # be sure to add "test.env" to .gitignore as well! +testpaths = + tests/functional # name per convention +``` + + + +Then, create a configuration file within your tests directory. In it, you'll want to define all necessary profile configuration for connecting to your data platform in local development and continuous integration. We recommend setting these values with environment variables, since this file will be checked into version control. + + + +```python +import pytest +import os + +# Import the standard functional fixtures as a plugin +# Note: fixtures with session scope need to be local +pytest_plugins = ["dbt.tests.fixtures.project"] + +# The profile dictionary, used to write out profiles.yml +# dbt will supply a unique schema per test, so we do not specify 'schema' here +@pytest.fixture(scope="class") +def dbt_profile_target(): + return { + 'type': '', + 'threads': 1, + 'host': os.getenv('HOST_ENV_VAR_NAME'), + 'user': os.getenv('USER_ENV_VAR_NAME'), + ... + } +``` + + + +### Define test cases + +As in the example above, each test case is defined as a class, and has its own "project" setup. To get started, you can import all basic test cases and try running them without changes. 
+ + + +```python +import pytest + +from dbt.tests.adapter.basic.test_base import BaseSimpleMaterializations +from dbt.tests.adapter.basic.test_singular_tests import BaseSingularTests +from dbt.tests.adapter.basic.test_singular_tests_ephemeral import BaseSingularTestsEphemeral +from dbt.tests.adapter.basic.test_empty import BaseEmpty +from dbt.tests.adapter.basic.test_ephemeral import BaseEphemeral +from dbt.tests.adapter.basic.test_incremental import BaseIncremental +from dbt.tests.adapter.basic.test_generic_tests import BaseGenericTests +from dbt.tests.adapter.basic.test_snapshot_check_cols import BaseSnapshotCheckCols +from dbt.tests.adapter.basic.test_snapshot_timestamp import BaseSnapshotTimestamp +from dbt.tests.adapter.basic.test_adapter_methods import BaseAdapterMethod + +class TestSimpleMaterializationsMyAdapter(BaseSimpleMaterializations): + pass + + +class TestSingularTestsMyAdapter(BaseSingularTests): + pass + + +class TestSingularTestsEphemeralMyAdapter(BaseSingularTestsEphemeral): + pass + + +class TestEmptyMyAdapter(BaseEmpty): + pass + + +class TestEphemeralMyAdapter(BaseEphemeral): + pass + + +class TestIncrementalMyAdapter(BaseIncremental): + pass + + +class TestGenericTestsMyAdapter(BaseGenericTests): + pass + + +class TestSnapshotCheckColsMyAdapter(BaseSnapshotCheckCols): + pass + + +class TestSnapshotTimestampMyAdapter(BaseSnapshotTimestamp): + pass + + +class TestBaseAdapterMethod(BaseAdapterMethod): + pass +``` + + + +Finally, run pytest: + +```sh +python3 -m pytest tests/functional +``` + +### Modifying test cases + +You may need to make slight modifications in a specific test case to get it passing on your adapter. The mechanism to do this is simple: rather than simply inheriting the "base" test with `pass`, you can redefine any of its fixtures or test methods. + +For instance, on Redshift, we need to explicitly cast a column in the fixture input seed to use data type `varchar(64)`: + + + +```python +import pytest +from dbt.tests.adapter.basic.files import seeds_base_csv, seeds_added_csv, seeds_newcolumns_csv +from dbt.tests.adapter.basic.test_snapshot_check_cols import BaseSnapshotCheckCols + +# set the datatype of the name column in the 'added' seed so it +# can hold the '_update' that's added +schema_seed_added_yml = """ +version: 2 +seeds: + - name: added + config: + column_types: + name: varchar(64) +""" + +class TestSnapshotCheckColsRedshift(BaseSnapshotCheckCols): + # Redshift defines the 'name' column such that it's not big enough + # to hold the '_update' added in the test. + @pytest.fixture(scope="class") + def models(self): + return { + "base.csv": seeds_base_csv, + "added.csv": seeds_added_csv, + "seeds.yml": schema_seed_added_yml, + } +``` + + + +As another example, the `dbt-bigquery` adapter asks users to "authorize" replacing a with a by supplying the `--full-refresh` flag. The reason: In the table logic, a view by the same name must first be dropped; if the table query fails, the model will be missing. + +Knowing this possibility, the "base" test case offers a `require_full_refresh` switch on the `test_config` fixture class. 
For BigQuery, we'll switch it on:
+
+
+
+```python
+import pytest
+from dbt.tests.adapter.basic.test_base import BaseSimpleMaterializations
+
+class TestSimpleMaterializationsBigQuery(BaseSimpleMaterializations):
+    @pytest.fixture(scope="class")
+    def test_config(self):
+        # effect: add '--full-refresh' flag in requisite 'dbt run' step
+        return {"require_full_refresh": True}
+```
+
+
+
+It's always worth asking whether the required modifications represent gaps in perceived or expected dbt functionality. Are these simple implementation details, which any user of this database would understand? Are they limitations worth documenting?
+
+If, on the other hand, they represent poor assumptions in the "basic" test cases, which fail to account for a common pattern in other types of databases, please open an issue or PR in the `dbt-core` repository on GitHub.
+
+### Running with multiple profiles
+
+Some databases support multiple connection methods, which map to genuinely different functionality behind the scenes. For instance, the `dbt-spark` adapter supports connections to Apache Spark clusters _and_ Databricks runtimes, which support additional functionality out of the box, enabled by the Delta file format.
+
+
+
+```python
+def pytest_addoption(parser):
+    parser.addoption("--profile", action="store", default="apache_spark", type=str)
+
+
+# Using @pytest.mark.skip_profile('apache_spark') uses the 'skip_by_profile_type'
+# autouse fixture below
+def pytest_configure(config):
+    config.addinivalue_line(
+        "markers",
+        "skip_profile(profile): skip test for the given profile",
+    )
+
+@pytest.fixture(scope="session")
+def dbt_profile_target(request):
+    profile_type = request.config.getoption("--profile")
+    if profile_type == "databricks_sql_endpoint":
+        target = databricks_sql_endpoint_target()
+    elif profile_type == "apache_spark":
+        target = apache_spark_target()
+    else:
+        raise ValueError(f"Invalid profile type '{profile_type}'")
+    return target
+
+def apache_spark_target():
+    return {
+        "type": "spark",
+        "host": "localhost",
+        ...
+    }
+
+def databricks_sql_endpoint_target():
+    return {
+        "type": "spark",
+        "host": os.getenv("DBT_DATABRICKS_HOST_NAME"),
+        ...
+    }
+
+@pytest.fixture(autouse=True)
+def skip_by_profile_type(request):
+    profile_type = request.config.getoption("--profile")
+    if request.node.get_closest_marker("skip_profile"):
+        for skip_profile_type in request.node.get_closest_marker("skip_profile").args:
+            if skip_profile_type == profile_type:
+                pytest.skip(f"skipped on '{profile_type}' profile")
+```
+
+
+
+If there are tests that _shouldn't_ run for a given profile:
+
+
+
+```python
+# Snapshots require access to the Delta file format, available on our Databricks connection,
+# so let's skip on Apache Spark
+@pytest.mark.skip_profile('apache_spark')
+class TestSnapshotCheckColsSpark(BaseSnapshotCheckCols):
+    @pytest.fixture(scope="class")
+    def project_config_update(self):
+        return {
+            "seeds": {
+                "+file_format": "delta",
+            },
+            "snapshots": {
+                "+file_format": "delta",
+            }
+        }
+```
+
+
+
+Finally:
+
+```sh
+python3 -m pytest tests/functional --profile apache_spark
+python3 -m pytest tests/functional --profile databricks_sql_endpoint
+```
+
+## Document a new adapter
+
+If you've already built and tested your adapter, it's time to document it so the dbt community will know that it exists and how to use it.
+
+### Making your adapter available
+
+Many community members maintain their adapter plugins under open source licenses.
If you're interested in doing this, we recommend:
+
+- Hosting on a public git provider (for example, GitHub or GitLab)
+- Publishing to [PyPI](https://pypi.org/)
+- Adding to the list of ["Supported Data Platforms"](/docs/supported-data-platforms#community-supported) (more info below)
+
+### General Guidelines
+
+To best inform the dbt community of the new adapter, you should contribute to dbt's open-source documentation site, which uses the [Docusaurus project](https://docusaurus.io/). This is the site you're currently on!
+
+### Conventions
+
+Each `.md` file you create needs a header as shown below. The document id will also need to be added to the config file: `website/sidebars.js`.
+
+```md
+---
+title: "Documenting a new adapter"
+id: "documenting-a-new-adapter"
+---
+```
+
+### Single Source of Truth
+
+We ask our adapter maintainers to use the [docs.getdbt.com repo](https://github.com/dbt-labs/docs.getdbt.com) (i.e. this site) as the single source of truth for documentation rather than having to maintain the same set of information in three different places. The adapter repo's `README.md` and the data platform's documentation pages should simply link to the corresponding page on this docs site. Keep reading for more information on what should and shouldn't be included on the dbt docs site.
+
+### Assumed Knowledge
+
+To simplify things, assume the reader of this documentation already knows how both dbt and your data platform work. There's already great material out there for learning dbt and the data platform. The documentation we're asking you to add should be what a user who is already proficient in both dbt and your data platform would need to know in order to use both. Effectively that boils down to two things: how to connect, and how to configure.
+
+### Topics and Pages to Cover
+
+The following subjects need to be addressed across three pages of this docs site to have your data platform be listed on our documentation. After the corresponding pull request is merged, we ask that you link to these pages from your adapter repo's `README` as well as from your product documentation.
+
+ To contribute, all you have to do is make the changes listed in the table below.
+
+| How To... | File to change within `/website/docs/` | Action | Info to Include |
+|----------------------|--------------------------------------------------------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Connect | `/docs/core/connect-data-platform/{MY-DATA-PLATFORM}-setup.md` | Create | Give all information needed to define a target in `~/.dbt/profiles.yml` and get `dbt debug` to connect to the database successfully. All possible configurations should be mentioned. |
+| Configure | `reference/resource-configs/{MY-DATA-PLATFORM}-configs.md` | Create | What options and configuration specific to your data platform do users need to know? e.g. table distribution and indexing options, column_quoting policy, which incremental strategies are supported |
+| Discover and Install | `docs/supported-data-platforms.md` | Modify | Is it a vendor- or community- supported adapter? How to install Python adapter package?
Ideally with pip and PyPI hosted package, but can also use `git+` link to GitHub Repo | +| Add link to sidebar | `website/sidebars.js` | Modify | Add the document id to the correct location in the sidebar menu | + +For example say I want to document my new adapter: `dbt-ders`. For the "Connect" page, I will make a new Markdown file, `ders-setup.md` and add it to the `/website/docs/core/connect-data-platform/` directory. + +### Example PRs to add new adapter documentation + +Below are some recent pull requests made by partners to document their data platform's adapter: + +- [TiDB](https://github.com/dbt-labs/docs.getdbt.com/pull/1309) +- [SingleStore](https://github.com/dbt-labs/docs.getdbt.com/pull/1044) +- [Firebolt](https://github.com/dbt-labs/docs.getdbt.com/pull/941) + +## Promote a new adapter + +The most important thing here is recognizing that people are successful in the community when they join, first and foremost, to engage authentically. + +What does authentic engagement look like? It’s challenging to define explicit rules. One good rule of thumb is to treat people with dignity and respect. + +Contributors to the community should think of contribution _as the end itself,_ not a means toward other business KPIs (leads, community members, etc.). [We are a mission-driven company.](https://www.getdbt.com/dbt-labs/values/) Some ways to know if you’re authentically engaging: + +- Is an engagement’s _primary_ purpose of sharing knowledge and resources or building brand engagement? +- Imagine you didn’t work at the org you do — can you imagine yourself still writing this? +- Is it written in formal / marketing language, or does it sound like you, the human? + +### Who should join the dbt community slack? + +- People who have insight into what it means to do hands-on [analytics engineering](https://www.getdbt.com/analytics-engineering/) work + The dbt Community Slack workspace is fundamentally a place for analytics practitioners to interact with each other — the closer the users are in the community to actual data/analytics engineering work, the more natural their engagement will be (leading to better outcomes for partners and the community). + +- DevRel practitioners with strong focus + DevRel practitioners often have a strong analytics background and a good understanding of the community. It’s essential to be sure they are focused on _contributing,_ not on driving community metrics for partner org (such as signing people up for their slack or events). The metrics will rise naturally through authentic engagement. + +- Founder and executives who are interested in directly engaging with the community + This is either incredibly successful or not at all depending on the profile of the founder. Typically, this works best when the founder has a practitioner-level of technical understanding and is interested in joining not to promote, but to learn and hear from users. + +- Software Engineers at partner products that are building and supporting integrations with either dbt Core or dbt Cloud + This is successful when the engineers are familiar with dbt as a product or at least have taken our training course. The Slack is often a place where end-user questions and feedback is initially shared, so it is recommended that someone technical from the team be present. There are also a handful of channels aimed at those building integrations, which tend to be a font of knowledge. + +### Who might struggle in the dbt community + +- People in marketing roles + dbt Slack is not a marketing channel. 
Attempts to use it as such invariably fall flat and can even lead to people having a negative view of a product. This doesn’t mean that dbt can’t serve marketing objectives, but a long-term commitment to engagement is the only proven method to do this sustainably. + +- People in product roles + The dbt Community can be an invaluable source of feedback on a product. There are two primary ways this can happen — organically (community members proactively suggesting a new feature) and via direct calls for feedback and user research. Immediate calls for engagement must be done in your dedicated #tools channel. Direct calls should be used sparingly, as they can overwhelm more organic discussions and feedback. + +### Who is the audience for an adapter release? + + A new adapter is likely to drive huge community interest from several groups of people: + - People who are currently using the database that the adapter is supporting + - People who may be adopting the database in the near future. + - People who are interested in dbt development in general. + +The database users will be your primary audience and the most helpful in achieving success. Engage them directly in the adapter’s dedicated Slack channel. If one does not exist already, reach out in #channel-requests, and we will get one made for you and include it in an announcement about new channels. + +The final group is where non-slack community engagement becomes important. Twitter and LinkedIn are both great places to interact with a broad audience. A well-orchestrated adapter release can generate impactful and authentic engagement. + +### How to message the initial rollout and follow-up content + +Tell a story that engages dbt users and the community. Highlight new use cases and functionality unlocked by the adapter in a way that will resonate with each segment. + +- Existing users of your technology who are new to dbt + - Provide a general overview of the value dbt will deliver to your users. This can lean on dbt's messaging and talking points which are laid out in the [dbt viewpoint.](/community/resources/viewpoint) + - Give examples of a rollout that speaks to the overall value of dbt and your product. + +- Users who are already familiar with dbt and the community + - Consider unique use cases or advantages your adapter provide over existing adapters. Who will be excited for this? + - Contribute to the dbt Community and ensure that dbt users on your adapter are well supported (tutorial content, packages, documentation, etc). + - Example of a rollout that is compelling for those familiar with dbt: [Firebolt](https://www.linkedin.com/feed/update/urn:li:activity:6879090752459182080/) + +### Tactically manage distribution of content about new or existing adapters + +There are tactical pieces on how and where to share that help ensure success. + +- On slack: + - #i-made-this channel — this channel has a policy against “marketing” and “content marketing” posts, but it should be successful if you write your content with the above guidelines in mind. Even with that, it’s important to post here sparingly. + - Your own database / tool channel — this is where the people who have opted in to receive communications from you and always a great place to share things that are relevant to them. + +- On social media: + - Twitter + - LinkedIn + - Social media posts _from the author_ or an individual connected to the project tend to have better engagement than posts from a company or organization account. 
+ - Ask your partner representative about: + - Retweets and shares from the official dbt Labs accounts. + - Flagging posts internally at dbt Labs to get individual employees to share. + +#### Measuring engagement + +You don’t need 1000 people in a channel to succeed, but you need at least a few active participants who can make it feel lived in. If you’re comfortable working in public, this could be members of your team, or it can be a few people who you know that are highly engaged and would be interested in participating. Having even 2 or 3 regulars hanging out in a channel is all that’s needed for a successful start and is, in fact, much more impactful than 250 people that never post. + +### How to announce a new adapter + +We’d recommend _against_ boilerplate announcements and encourage finding a unique voice. That being said, there are a couple of things that we’d want to include: + +- A summary of the value prop of your database / technology for users who aren’t familiar. +- The personas that might be interested in this news. +- A description of what the adapter _is_. For example: + > With the release of our new dbt adapter, you’ll be able to to use dbt to model and transform your data in [name-of-your-org] +- Particular or unique use cases or functionality unlocked by the adapter. +- Plans for future / ongoing support / development. +- The link to the documentation for using the adapter on the dbt Labs docs site. +- An announcement blog. + +#### Announcing new release versions of existing adapters + +This can vary substantially depending on the nature of the release but a good baseline is the types of release messages that [we put out in the #dbt-releases](https://getdbt.slack.com/archives/C37J8BQEL/p1651242161526509) channel. + +![Full Release Post](/img/adapter-guide/0-full-release-notes.png) + +Breaking this down: + +- Visually distinctive announcement - make it clear this is a release + +- Short written description of what is in the release + +- Links to additional resources + +- Implementation instructions: + +- Future plans + +- Contributor recognition (if applicable) + + + +## Verify a new adapter + +The very first data platform dbt supported was Redshift followed quickly by Postgres (([dbt-core#174](https://github.com/dbt-labs/dbt-core/pull/174)). In 2017, back when dbt Labs (née Fishtown Analytics) was still a data consultancy, we added support for Snowflake and BigQuery. We also turned dbt's database support into an adapter framework ([dbt-core#259](https://github.com/dbt-labs/dbt-core/pull/259/)), and a plugin system a few years later. For years, dbt Labs specialized in those four data platforms and became experts in them. However, the surface area of all possible databases, their respective nuances, and keeping them up-to-date and bug-free is a Herculean and/or Sisyphean task that couldn't be done by a single person or even a single team! Enter the dbt community which enables dbt Core to work on more than 30 different databases (32 as of Sep '22)! + +Free and open-source tools for the data professional are increasingly abundant. This is by-and-large a _good thing_, however it requires due dilligence that wasn't required in a paid-license, closed-source software world. Before taking a dependency on an open-source projet is is important to determine the answer to the following questions: + +1. Does it work? +2. Does it meet my team's specific use case? +3. Does anyone "own" the code, or is anyone liable for ensuring it works? +4. Do bugs get fixed quickly? +5. 
Does it stay up-to-date with new Core features? +6. Is the usage substantial enough to self-sustain? +7. What risks do I take on by taking a dependency on this library? + +These are valid, important questions to answer—especially given that `dbt-core` itself only put out its first stable release (major version v1.0) in December 2021! Indeed, up until now, the majority of new user questions in database-specific channels are some form of: + +- "How mature is `dbt-`? Any gotchas I should be aware of before I start exploring?" +- "has anyone here used `dbt-` for production models?" +- "I've been playing with `dbt-` -- I was able to install and run my initial experiments. I noticed that there are certain features mentioned on the documentation that are marked as 'not ok' or 'not tested'. What are the risks? +I'd love to make a statement on my team to adopt DBT [sic], but I'm pretty sure questions will be asked around the possible limitations of the adapter or if there are other companies out there using dbt [sic] with Oracle DB in production, etc." + +There has been a tendency to trust the dbt Labs-maintained adapters over community- and vendor-supported adapters, but repo ownership is only one among many indicators of software quality. We aim to help our users feel well-informed as to the caliber of an adapter with a new program. + +### Verified by dbt Labs + +The adapter verification program aims to quickly indicate to users which adapters can be trusted to use in production. Previously, doing so was uncharted territory for new users and complicated making the business case to their leadership team. We plan to give quality assurances by: + +1. appointing a key stakeholder for the adapter repository, +2. ensuring that the chosen stakeholder fixes bugs and cuts new releases in a timely manner. Refer to the "Maintaining your new adapter" step for more information. +3. demonstrating that it passes our adapter pytest suite tests, +4. assuring that it works for us internally and ideally an existing team using the adapter in production . + +Every major & minor version of a adapter will be verified internally and given an official :white_check_mark: (custom emoji coming soon), on the ["Supported Data Platforms"](/docs/supported-data-platforms) page. + +### How to get an adapter verified? + +We envision that data platform vendors will be most interested in having their adapter versions verified, however we are open to community adapter verification. If interested, please reach out either to the `partnerships` at `dbtlabs.com` or post in the [#adapter-ecosystem Slack channel](https://getdbt.slack.com/archives/C030A0UF5LM). + +## Build a trusted adapter + +The Trusted adapter program exists to allow adapter maintainers to demonstrate to the dbt community that your adapter is trusted to be used in production. + +### What it means to be trusted + +By opting into the below, you agree to this, and we take you at your word. dbt Labs reserves the right to remove an adapter from the trusted adapter list at any time, should any of the below guidelines not be met. + +### Feature Completeness + +To be considered for the Trusted Adapter program, the adapter must cover the essential functionality of dbt Core given below, with best effort given to support the entire feature set. + +Essential functionality includes (but is not limited to the following features): + +- table, view, and seed materializations +- dbt tests + +The adapter should have the required documentation for connecting and configuring the adapter. 
The dbt docs site should be the single source of truth for this information. These docs should be kept up-to-date.
+
+Proceed to the "Document a new adapter" step for more information.
+
+### Release Cadence
+
+Keeping an adapter up-to-date with dbt Core is an integral part of being a trusted adapter. Therefore, we ask that adapter maintainers:
+
+- Release new minor versions of the adapter with all tests passing within four weeks of dbt Core's release cut.
+- Release new major versions of the adapter with all tests passing within eight weeks of dbt Core's release cut.
+
+### Community Responsiveness
+
+On a best effort basis, active participation and engagement with the dbt Community across the following forums:
+
+- Being responsive to feedback and supporting user enablement in dbt Community’s Slack workspace
+- Responding with comments to issues raised in the public dbt adapter code repository
+- Merging in code contributions from community members as deemed appropriate
+
+### Security Practices
+
+Trusted adapters will not do any of the following:
+
+- Output access credentials for, or data from, the underlying data platform to logs or files
+- Make API calls other than those expressly required for using dbt features (adapters may not add additional logging)
+- Obfuscate code and/or functionality so as to avoid detection
+
+Additionally, to avoid supply-chain attacks:
+
+- Use an automated service to keep Python dependencies up-to-date (such as Dependabot or similar),
+- Publish directly to PyPI from the dbt adapter code repository by using a trusted CI/CD process (such as GitHub Actions)
+- Restrict admin access to both the respective code (GitHub) and package (PyPI) repositories
+- Identify and mitigate security vulnerabilities by use of a static code analyzing tool (such as Snyk) as part of a CI/CD process
+
+### Other considerations
+
+The adapter repository is:
+
+- open-source licensed,
+- published to PyPI, and
+- automatically tested against dbt Labs' provided adapter test suite
+
+### How to get an adapter added to the trusted list
+
+Open an issue on the [docs.getdbt.com GitHub repository](https://github.com/dbt-labs/docs.getdbt.com) using the "Add adapter to Trusted list" template. In addition to contact information, it will ask you to confirm that you agree to the following.
+
+1. My adapter meets the guidelines given above
+2. I will make a best reasonable effort to ensure that this continues to be so
+3. Checkbox: I acknowledge that dbt Labs reserves the right to remove an adapter from the trusted adapter list at any time, should any of the above guidelines not be met.
+
+The approval workflow is as follows:
+
+1. Create and populate the template-created issue
+2. dbt Labs will respond as quickly as possible (maximally four weeks, though likely faster)
+3. If approved, dbt Labs will create and merge a pull request to formally add the adapter to the list.
+
+### Getting help for my trusted adapter
+
+Ask your question in the #adapter-ecosystem channel of the dbt community Slack.
diff --git a/website/docs/guides/airflow-and-dbt-cloud.md b/website/docs/guides/airflow-and-dbt-cloud.md new file mode 100644 index 00000000000..a3ff59af14e --- /dev/null +++ b/website/docs/guides/airflow-and-dbt-cloud.md @@ -0,0 +1,296 @@ +--- +title: Airflow and dbt Cloud +id: airflow-and-dbt-cloud +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['dbt Cloud', 'Orchestration'] +level: 'Intermediate' +recently_updated: true +--- + +## Introduction + +In some cases, [Airflow](https://airflow.apache.org/) may be the preferred orchestrator for your organization over working fully within dbt Cloud. There are a few reasons your team might be considering using Airflow to orchestrate your dbt jobs: + +- Your team is already using Airflow to orchestrate other processes +- Your team needs to ensure that a [dbt job](https://docs.getdbt.com/docs/dbt-cloud/cloud-overview#schedule-and-run-dbt-jobs-in-production) kicks off before or after another process outside of dbt Cloud +- Your team needs flexibility to manage more complex scheduling, such as kicking off one dbt job only after another has completed +- Your team wants to own their own orchestration solution +- You need code to work right now without starting from scratch + +### Prerequisites + +- [dbt Cloud Teams or Enterprise account](https://www.getdbt.com/pricing/) (with [admin access](https://docs.getdbt.com/docs/cloud/manage-access/enterprise-permissions)) in order to create a service token. Permissions for service tokens can be found [here](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens#permissions-for-service-account-tokens). +- A [free Docker account](https://hub.docker.com/signup) in order to sign in to Docker Desktop, which will be installed in the initial setup. +- A local digital scratchpad for temporarily copy-pasting API keys and URLs + +### Airflow + dbt Core + +There are [so many great examples](https://gitlab.com/gitlab-data/analytics/-/blob/master/dags/transformation/dbt_snowplow_backfill.py) from GitLab through their open source data engineering work. This is especially appropriate if you are well-versed in Kubernetes, CI/CD, and docker task management when building your airflow pipelines. If this is you and your team, you’re in good hands reading through more details [here](https://about.gitlab.com/handbook/business-technology/data-team/platform/infrastructure/#airflow) and [here](https://about.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/). + +### Airflow + dbt Cloud API w/Custom Scripts + +This has served as a bridge until the fabled Astronomer + dbt Labs-built dbt Cloud provider became generally available [here](https://registry.astronomer.io/providers/dbt%20Cloud/versions/latest). 
+ +There are many different permutations of this over time: + +- [Custom Python Scripts](https://github.com/sungchun12/airflow-dbt-cloud/blob/main/archive/dbt_cloud_example.py): This is an airflow DAG based on [custom python API utilities](https://github.com/sungchun12/airflow-dbt-cloud/blob/main/archive/dbt_cloud_utils.py) +- [Make API requests directly through the BashOperator based on the docs](https://docs.getdbt.com/dbt-cloud/api-v2-legacy#operation/triggerRun): You can make cURL requests to invoke dbt Cloud to do what you want +- For more options, check out the [official dbt Docs](/docs/deploy/deployments#airflow) on the various ways teams are running dbt in airflow + +These solutions are great, but can be difficult to trust as your team grows and management for things like: testing, job definitions, secrets, and pipelines increase past your team’s capacity. Roles become blurry (or were never clearly defined at the start!). Both data and analytics engineers start digging through custom logging within each other’s workflows to make heads or tails of where and what the issue really is. Not to mention that when the issue is found, it can be even harder to decide on the best path forward for safely implementing fixes. This complex workflow and unclear delineation on process management results in a lot of misunderstandings and wasted time just trying to get the process to work smoothly! + + +In this guide, you'll learn how to: + +1. Creating a working local Airflow environment +2. Invoking a dbt Cloud job with Airflow (with proof!) +3. Reusing tested and trusted Airflow code for your specific use cases + +You’ll also gain a better understanding of how this will: + +- Reduce the cognitive load when building and maintaining pipelines +- Avoid dependency hell (think: `pip install` conflicts) +- Implement better recoveries from failures +- Define clearer workflows so that data and analytics engineers work better, together ♥️ + + +🙌 Let’s get started! 🙌 + +## Install the Astro CLI + +Astro is a managed software service that includes key features for teams working with Airflow. In order to use Astro, we’ll install the Astro CLI, which will give us access to useful commands for working with Airflow locally. You can read more about Astro [here](https://docs.astronomer.io/astro/). + +In this example, we’re using Homebrew to install Astro CLI. Follow the instructions to install the Astro CLI for your own operating system [here](https://docs.astronomer.io/astro/install-cli). + +```bash +brew install astro +``` + + + +## Install and start Docker Desktop + +Docker allows us to spin up an environment with all the apps and dependencies we need for the example. + +Follow the instructions [here](https://docs.docker.com/desktop/) to install Docker desktop for your own operating system. Once Docker is installed, ensure you have it up and running for the next steps. + + + +## Clone the airflow-dbt-cloud repository + +Open your terminal and clone the [airflow-dbt-cloud repository](https://github.com/sungchun12/airflow-dbt-cloud.git). This contains example Airflow DAGs that you’ll use to orchestrate your dbt Cloud job. Once cloned, navigate into the `airflow-dbt-cloud` project. + +```bash +git clone https://github.com/sungchun12/airflow-dbt-cloud.git +cd airflow-dbt-cloud +``` + + + +## Start the Docker container + +You can initialize an Astronomer project in an empty local directory using a Docker container, and then run your project locally using the `start` command. + +1. 
Run the following commands to initialize your project and start your local Airflow deployment: + + ```bash + astro dev init + astro dev start + ``` + + When this finishes, you should see a message similar to the following: + + ```bash + Airflow is starting up! This might take a few minutes… + + Project is running! All components are now available. + + Airflow Webserver: http://localhost:8080 + Postgres Database: localhost:5432/postgres + The default Airflow UI credentials are: admin:admin + The default Postrgres DB credentials are: postgres:postgres + ``` + +2. Open the Airflow interface. Launch your web browser and navigate to the address for the **Airflow Webserver** from your output in Step 1. + + This will take you to your local instance of Airflow. You’ll need to log in with the **default credentials**: + + - Username: admin + - Password: admin + + ![Airflow login screen](/img/guides/orchestration/airflow-and-dbt-cloud/airflow-login.png) + + + +## Create a dbt Cloud service token + +Create a service token from within dbt Cloud using the instructions [found here](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens). Ensure that you save a copy of the token, as you won’t be able to access this later. In this example we use `Account Admin`, but you can also use `Job Admin` instead for token permissions. + + + +## Create a dbt Cloud job + +In your dbt Cloud account create a job, paying special attention to the information in the bullets below. Additional information for creating a dbt Cloud job can be found [here](/guides/bigquery). + +- Configure the job with the commands that you want to include when this job kicks off, as Airflow will be referring to the job’s configurations for this rather than being explicitly coded in the Airflow DAG. This job will run a set of commands rather than a single command. +- Ensure that the schedule is turned **off** since we’ll be using Airflow to kick things off. +- Once you hit `save` on the job, make sure you copy the URL and save it for referencing later. The url will look similar to this: + +```html +https://cloud.getdbt.com/#/accounts/{account_id}/projects/{project_id}/jobs/{job_id}/ +``` + + + +## Add your dbt Cloud API token as a secure connection + + + +Now you have all the working pieces to get up and running with Airflow + dbt Cloud. Let’s dive into make this all work together. We will **set up a connection** and **run a DAG in Airflow** that kicks off a dbt Cloud job. + +1. Navigate to Admin and click on **Connections** + + ![Airflow connections menu](/img/guides/orchestration/airflow-and-dbt-cloud/airflow-connections-menu.png) + +2. Click on the `+` sign to add a new connection, then click on the drop down to search for the dbt Cloud Connection Type + + ![Create connection](/img/guides/orchestration/airflow-and-dbt-cloud/create-connection.png) + + ![Connection type](/img/guides/orchestration/airflow-and-dbt-cloud/connection-type.png) + +3. Add in your connection details and your default dbt Cloud account id. 
This is found in your dbt Cloud URL after the accounts route section (`/accounts/{YOUR_ACCOUNT_ID}`), for example the account with id 16173 would see this in their URL: `https://cloud.getdbt.com/#/accounts/16173/projects/36467/jobs/65767/` + +![https://lh3.googleusercontent.com/sRxe5xbv_LYhIKblc7eiY7AmByr1OibOac2_fIe54rpU3TBGwjMpdi_j0EPEFzM1_gNQXry7Jsm8aVw9wQBSNs1I6Cyzpvijaj0VGwSnmVf3OEV8Hv5EPOQHrwQgK2RhNBdyBxN2](https://lh3.googleusercontent.com/sRxe5xbv_LYhIKblc7eiY7AmByr1OibOac2_fIe54rpU3TBGwjMpdi_j0EPEFzM1_gNQXry7Jsm8aVw9wQBSNs1I6Cyzpvijaj0VGwSnmVf3OEV8Hv5EPOQHrwQgK2RhNBdyBxN2) + +## Add your `job_id` and `account_id` config details to the python file + + Add your `job_id` and `account_id` config details to the python file: [dbt_cloud_provider_eltml.py](https://github.com/sungchun12/airflow-dbt-cloud/blob/main/dags/dbt_cloud_provider_eltml.py). + +1. You’ll find these details within the dbt Cloud job URL, see the comments in the code snippet below for an example. + + ```python + # dbt Cloud Job URL: https://cloud.getdbt.com/#/accounts/16173/projects/36467/jobs/65767/ + # account_id: 16173 + #job_id: 65767 + + # line 28 + default_args={"dbt_cloud_conn_id": "dbt_cloud", "account_id": 16173}, + + trigger_dbt_cloud_job_run = DbtCloudRunJobOperator( + task_id="trigger_dbt_cloud_job_run", + job_id=65767, # line 39 + check_interval=10, + timeout=300, + ) + ``` + +2. Turn on the DAG and verify the job succeeded after running. Note: screenshots taken from different job runs, but the user experience is consistent. + + ![https://lh6.googleusercontent.com/p8AqQRy0UGVLjDGPmcuGYmQ_BRodyL0Zis-eQgSmp69EHbKW51o4S-bCl1fXHlOmwpYEBxD0A-O1Q1hwt-VDVMO1wWH-AIeaoelBx06JXRJ0m1OcHaPpFKH0xDiduIhNlQhhbLiy](https://lh6.googleusercontent.com/p8AqQRy0UGVLjDGPmcuGYmQ_BRodyL0Zis-eQgSmp69EHbKW51o4S-bCl1fXHlOmwpYEBxD0A-O1Q1hwt-VDVMO1wWH-AIeaoelBx06JXRJ0m1OcHaPpFKH0xDiduIhNlQhhbLiy) + + ![Airflow DAG](/img/guides/orchestration/airflow-and-dbt-cloud/airflow-dag.png) + + ![Task run instance](/img/guides/orchestration/airflow-and-dbt-cloud/task-run-instance.png) + + ![https://lh6.googleusercontent.com/S9QdGhLAdioZ3x634CChugsJRiSVtTTd5CTXbRL8ADA6nSbAlNn4zV0jb3aC946c8SGi9FRTfyTFXqjcM-EBrJNK5hQ0HHAsR5Fj7NbdGoUfBI7xFmgeoPqnoYpjyZzRZlXkjtxS](https://lh6.googleusercontent.com/S9QdGhLAdioZ3x634CChugsJRiSVtTTd5CTXbRL8ADA6nSbAlNn4zV0jb3aC946c8SGi9FRTfyTFXqjcM-EBrJNK5hQ0HHAsR5Fj7NbdGoUfBI7xFmgeoPqnoYpjyZzRZlXkjtxS) + +## How do I rerun the dbt Cloud job and downstream tasks in my pipeline? + +If you have worked with dbt Cloud before, you have likely encountered cases where a job fails. In those cases, you have likely logged into dbt Cloud, investigated the error, and then manually restarted the job. + +This section of the guide will show you how to restart the job directly from Airflow. This will specifically run *just* the `trigger_dbt_cloud_job_run` and downstream tasks of the Airflow DAG and not the entire DAG. If only the transformation step fails, you don’t need to re-run the extract and load processes. Let’s jump into how to do that in Airflow. + +1. Click on the task + + ![Task DAG view](/img/guides/orchestration/airflow-and-dbt-cloud/task-dag-view.png) + +2. Clear the task instance + + ![Clear task instance](/img/guides/orchestration/airflow-and-dbt-cloud/clear-task-instance.png) + + ![Approve clearing](/img/guides/orchestration/airflow-and-dbt-cloud/approve-clearing.png) + +3. 
Watch it rerun in real time
+
+   ![Re-run](/img/guides/orchestration/airflow-and-dbt-cloud/re-run.png)
+
+## Cleaning up
+
+At the end of this guide, make sure you shut down your Docker container. When you’re done using Airflow, use the following command to stop the container:
+
+```bash
+$ astro dev stop
+
+[+] Running 3/3
+ ⠿ Container airflow-dbt-cloud_e3fe3c-webserver-1  Stopped 7.5s
+ ⠿ Container airflow-dbt-cloud_e3fe3c-scheduler-1  Stopped 3.3s
+ ⠿ Container airflow-dbt-cloud_e3fe3c-postgres-1  Stopped 0.3s
+```
+
+To verify that the deployment has stopped, use the following command:
+
+```bash
+astro dev ps
+```
+
+This should give you an output like this:
+
+```bash
+Name State Ports
+airflow-dbt-cloud_e3fe3c-webserver-1 exited
+airflow-dbt-cloud_e3fe3c-scheduler-1 exited
+airflow-dbt-cloud_e3fe3c-postgres-1 exited
+```
+
+
+
+## Frequently asked questions
+
+### How can we run specific subsections of the dbt DAG in Airflow?
+
+Because of the way we configured the dbt Cloud job to run in Airflow, you can leave the job definition to your analytics engineers to manage in the dbt Cloud job configuration. If, for example, we need to run hourly-tagged models every hour and daily-tagged models daily, we can create jobs like `Hourly Run` or `Daily Run` and utilize the commands `dbt run -s tag:hourly` and `dbt run -s tag:daily` within each, respectively. We only need to grab our dbt Cloud `account` and `job id`, configure it in an Airflow DAG with the code provided, and then you can be on your way. See more [node selection options](/reference/node-selection/syntax).
+
+### How can I re-run models from the point of failure?
+
+You may want to parse the dbt DAG in Airflow to get the benefit of re-running from the point of failure. However, when you have hundreds of models in your DAG expanded out, it becomes useless for diagnosis and rerunning due to the overhead that comes along with creating an expansive Airflow DAG.
+
+You can’t re-run from failure natively in dbt Cloud today (feature coming!), but you can use a custom rerun parser.
+
+Using a simple Python script coupled with the dbt Cloud provider, you can:
+
+- Avoid managing artifacts in a separate storage bucket (dbt Cloud does this for you)
+- Avoid building your own parsing logic
+- Get clear logs on what models you're rerunning in dbt Cloud (without hardcoding step override commands)
+
+Watch the video below to see how it works!
+
+
+
+### Should Airflow run one big dbt job or many dbt jobs?
+
+Overall we recommend being as purposeful and minimalistic as you can. This is because dbt manages all of the dependencies between models and the orchestration of running those dependencies in order, which in turn has benefits in terms of warehouse processing efforts.
+
+### We want to kick off our dbt jobs after our ingestion tool (such as Fivetran) / data pipelines are done loading data. Any best practices around that?
+
+Our friends at Astronomer answer this question with this example: [here](https://registry.astronomer.io/dags/fivetran-dbt-cloud-census)
+
+### How do you set up a CI/CD workflow with Airflow?
+
+Check out these two resources for accomplishing your own CI/CD pipeline:
+
+- [Continuous Integration with dbt Cloud](/docs/deploy/continuous-integration)
+- [Astronomer's CI/CD Example](https://docs.astronomer.io/software/ci-cd/#example-cicd-workflow)
+
+### Can dbt dynamically create tasks in the DAG like Airflow can?
+
+We prefer to keep models bundled vs. unbundled.
You can go this route, but if you have hundreds of dbt models, it’s more effective to let the dbt Cloud job handle the models and dependencies. Bundling provides the solution to clear observability when things go wrong - we've seen more success in having the ability to clearly see issues in a bundled dbt Cloud job than combing through the nodes of an expansive Airflow DAG. If you still have a use case for this level of control though, our friends at Astronomer answer this question [here](https://www.astronomer.io/blog/airflow-dbt-1/)! + +### Can you trigger notifications if a dbt job fails with Airflow? Is there any way to access the status of the dbt Job to do that? + +Yes, either through [Airflow's email/slack](https://www.astronomer.io/guides/error-notifications-in-airflow/) functionality by itself or combined with [dbt Cloud's notifications](/docs/deploy/job-notifications), which support email and slack notifications. + +### Are there decision criteria for how to best work with dbt Cloud and airflow? + +Check out this deep dive into planning your dbt Cloud + Airflow implementation [here](https://www.youtube.com/watch?v=n7IIThR8hGk)! diff --git a/website/docs/quickstarts/bigquery-qs.md b/website/docs/guides/bigquery-qs.md similarity index 93% rename from website/docs/quickstarts/bigquery-qs.md rename to website/docs/guides/bigquery-qs.md index 546b56c234c..c1f632f0621 100644 --- a/website/docs/quickstarts/bigquery-qs.md +++ b/website/docs/guides/bigquery-qs.md @@ -1,10 +1,12 @@ --- title: "Quickstart for dbt Cloud and BigQuery" id: "bigquery" -time_to_complete: '30 minutes' -platform: 'dbt-cloud' +# time_to_complete: '30 minutes' commenting out until we test +level: 'Beginner' icon: 'bigquery' hide_table_of_contents: true +tags: ['BigQuery', 'dbt Cloud','Quickstart'] +recently_updated: true --- ## Introduction @@ -88,22 +90,25 @@ In order to let dbt connect to your warehouse, you'll need to generate a keyfile 4. Click **Upload a Service Account JSON File** in settings. 5. Select the JSON file you downloaded in [Generate BigQuery credentials](#generate-bigquery-credentials) and dbt Cloud will fill in all the necessary fields. 6. Click **Test Connection**. This verifies that dbt Cloud can access your BigQuery account. -7. Click **Next** if the test succeeds. If it fails, you might need to go back and regenerate your BigQuery credentials. +7. Click **Next** if the test succeeded. If it failed, you might need to go back and regenerate your BigQuery credentials. ## Set up a dbt Cloud managed repository -## Initialize your dbt project +## Initialize your dbt project​ and start developing Now that you have a repository configured, you can initialize your project and start development in dbt Cloud: 1. Click **Start developing in the IDE**. It might take a few minutes for your project to spin up for the first time as it establishes your git connection, clones your repo, and tests the connection to the warehouse. 2. Above the file tree to the left, click **Initialize dbt project**. This builds out your folder structure with example models. 3. Make your initial commit by clicking **Commit and sync**. Use the commit message `initial commit` and click **Commit**. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code. 4. You can now directly query data from your warehouse and execute `dbt run`. 
You can try this out now: + - Click **+ Create new file**, add this query to the new file, and click **Save as** to save the new file: + ```sql + select * from `dbt-tutorial.jaffle_shop.customers` + ``` - In the command line bar at the bottom, enter `dbt run` and click **Enter**. You should see a `dbt run succeeded` message. - - To confirm the success of the above command, navigate to the BigQuery Console and discover the newly created models. ## Build your first model 1. Under **Version Control** on the left, click **Create branch**. You can name it `add-customers-model`. You need to create a new branch since the main branch is set to read-only mode. @@ -171,7 +176,7 @@ select * from final 6. Enter `dbt run` in the command prompt at the bottom of the screen. You should get a successful run and see the three models. -Later, you can connect your business intelligence (BI) tools to these views and tables so they only read cleaned-up data rather than raw data in your BI tool. +Later, you can connect your business intelligence (BI) tools to these views and tables so they only read cleaned up data rather than raw data in your BI tool. #### FAQs @@ -279,7 +284,7 @@ Later, you can connect your business intelligence (BI) tools to these views and 4. Execute `dbt run`. - This time, when you performed a `dbt run`, separate views/tables were created for `stg_customers`, `stg_orders`, and `customers`. dbt inferred the order to run these models. Because `customers` depends on `stg_customers` and `stg_orders`, dbt builds `customers` last. You do not need to explicitly define these dependencies. + This time, when you performed a `dbt run`, separate views/tables were created for `stg_customers`, `stg_orders` and `customers`. dbt inferred the order to run these models. Because `customers` depends on `stg_customers` and `stg_orders`, dbt builds `customers` last. You do not need to explicitly define these dependencies. #### FAQs {#faq-2} diff --git a/website/docs/guides/legacy/building-packages.md b/website/docs/guides/building-packages.md similarity index 88% rename from website/docs/guides/legacy/building-packages.md rename to website/docs/guides/building-packages.md index 2a6803334d4..641a1c6af6d 100644 --- a/website/docs/guides/legacy/building-packages.md +++ b/website/docs/guides/building-packages.md @@ -1,26 +1,38 @@ --- -title: "Building a dbt package" # to do: update this to creating -id: "building-packages" +title: Building dbt packages +id: building-packages +description: "When you have dbt code that might help others, you can create a package for dbt using a GitHub repository." +displayText: Building dbt packages +hoverSnippet: Learn how to create packages for dbt. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['dbt Core'] +level: 'Advanced' +recently_updated: true --- -## Assumed knowledge -This article assumes you are familiar with: +## Introduction + +Creating packages is an **advanced use of dbt**. If you're new to the tool, we recommend that you first use the product for your own analytics before attempting to create a package for others. + +### Prerequisites + +A strong understanding of: - [packages](/docs/build/packages) - administering a repository on GitHub - [semantic versioning](https://semver.org/) -Heads up — developing a package is an **advanced use of dbt**. If you're new to the tool, we recommend that you first use the product for your own company's analytics before attempting to create a package. - -## 1. 
Assess whether a package is the right solution +### Assess whether a package is the right solution Packages typically contain either: - macros that solve a particular analytics engineering problem — for example, [auditing the results of a query](https://hub.getdbt.com/dbt-labs/audit_helper/latest/), [generating code](https://hub.getdbt.com/dbt-labs/codegen/latest/), or [adding additional schema tests to a dbt project](https://hub.getdbt.com/calogica/dbt_expectations/latest/). - models for a common dataset — for example a dataset for software products like [MailChimp](https://hub.getdbt.com/fivetran/mailchimp/latest/) or [Snowplow](https://hub.getdbt.com/dbt-labs/snowplow/latest/), or even models for metadata about your data stack like [Snowflake query spend](https://hub.getdbt.com/gitlabhq/snowflake_spend/latest/) and [the artifacts produced by `dbt run`](https://hub.getdbt.com/tailsdotcom/dbt_artifacts/latest/). In general, there should be a shared set of industry-standard metrics that you can model (e.g. email open rate). Packages are _not_ a good fit for sharing models that contain business-specific logic, for example, writing code for marketing attribution, or monthly recurring revenue. Instead, consider sharing a blog post and a link to a sample repo, rather than bundling this code as a package (here's our blog post on [marketing attribution](https://blog.getdbt.com/modeling-marketing-attribution/) as an example). -## 2. Create your new project -:::note Using the CLI for package development -We tend to use the CLI for package development. The development workflow often involves installing a local copy of your package in another dbt project — at present dbt Cloud is not designed for this workflow. +## Create your new project +:::note Using the command line for package development +We tend to use the command line interface for package development. The development workflow often involves installing a local copy of your package in another dbt project — at present dbt Cloud is not designed for this workflow. ::: 1. Use the [dbt init](/reference/commands/init) command to create a new dbt project, which will be your package: @@ -33,15 +45,15 @@ $ dbt init [package_name] ¹Currently, our package registry only supports packages that are hosted in GitHub. -## 3. Develop your package +## Develop your package We recommend that first-time package authors first develop macros and models for use in their own dbt project. Once your new package is created, you can get to work on moving them across, implementing some additional package-specific design patterns along the way. When working on your package, we often find it useful to install a local copy of the package in another dbt project — this workflow is described [here](https://discourse.getdbt.com/t/contributing-to-an-external-dbt-package/657). -### Follow our best practices +### Follow best practices _Modeling packages only_ -Use our [dbt coding conventions](https://github.com/dbt-labs/corp/blob/main/dbt_style_guide.md), our article on [how we structure our dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview), and our [best practices](best-practices) for all of our advice on how to build your dbt project. 
+Use our [dbt coding conventions](https://github.com/dbt-labs/corp/blob/main/dbt_style_guide.md), our article on [how we structure our dbt projects](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview), and our [best practices](best-practices) for all of our advice on how to build your dbt project. This is where it comes in especially handy to have worked on your own dbt project previously. @@ -103,7 +115,7 @@ Over time, we've developed a set of useful GitHub artifacts that make administer - Descriptions of the main models included in the package ([example](https://github.com/dbt-labs/snowplow)) - GitHub templates, including PR templates and issue templates ([example](https://github.com/dbt-labs/dbt-audit-helper/tree/master/.github)) -## 4. Add integration tests +## Add integration tests _Optional_ We recommend that you implement integration tests to confirm that the package works as expected — this is an even _more_ advanced step, so you may find that you build up to this. @@ -125,7 +137,7 @@ packages: -4. Add resources to the package (seeds, models, tests) so that you can successfully run your project, and compare the output with what you expect. The exact appraoch here will vary depending on your packages. In general you will find that you need to: +4. Add resources to the package (seeds, models, tests) so that you can successfully run your project, and compare the output with what you expect. The exact approach here will vary depending on your packages. In general you will find that you need to: - Add mock data via a [seed](/docs/build/seeds) with a few sample (anonymized) records. Configure the `integration_tests` project to point to the seeds instead of raw data tables. - Add more seeds that represent the expected output of your models, and use the [dbt_utils.equality](https://github.com/dbt-labs/dbt-utils#equality-source) test to confirm the output of your package, and the expected output matches. @@ -134,7 +146,7 @@ packages: 5. (Optional) Use a CI tool, like CircleCI or GitHub Actions, to automate running your dbt project when you open a new Pull Request. For inspiration, check out one of our [CircleCI configs](https://github.com/dbt-labs/snowplow/blob/main/.circleci/config.yml), which runs tests against our four main warehouses. Note: this is an advanced step — if you are going down this path, you may find it useful to say hi on [dbt Slack](https://community.getdbt.com/). -## 5. Deploy the docs for your package +## Deploy the docs for your package _Optional_ A dbt docs site can help a prospective user of your package understand the code you've written. As such, we recommend that you deploy the site generated by `dbt docs generate` and link to the deployed site from your package. @@ -147,12 +159,13 @@ The easiest way we've found to do this is to use [GitHub Pages](https://pages.gi 4. Enable GitHub pages on the repo in the settings tab, and point it to the “docs” subdirectory 4. GitHub should then deploy the docs at `.github.io/`, like so: [fivetran.github.io/dbt_ad_reporting](https://fivetran.github.io/dbt_ad_reporting/) -## 6. Release your package +## Release your package Create a new [release](https://docs.github.com/en/github/administering-a-repository/managing-releases-in-a-repository) once you are ready for others to use your work! Be sure to use [semantic versioning](https://semver.org/) when naming your release. In particular, if new changes will cause errors for users of earlier versions of the package, be sure to use _at least_ a minor release (e.g. 
go from `0.1.1` to `0.2.0`). The release notes should contain an overview of the changes introduced in the new version. Be sure to call out any changes that break the existing interface! -## 7. Add the package to hub.getdbt.com +## Add the package to hub.getdbt.com + Our package registry, [hub.getdbt.com](https://hub.getdbt.com/), gets updated by the [hubcap script](https://github.com/dbt-labs/hubcap). To add your package to hub.getdbt.com, create a PR on the [hubcap repository](https://github.com/dbt-labs/hubcap) to include it in the `hub.json` file. diff --git a/website/docs/quickstarts/codespace-qs.md b/website/docs/guides/codespace-qs.md similarity index 93% rename from website/docs/quickstarts/codespace-qs.md rename to website/docs/guides/codespace-qs.md index 3cd048c97a4..7712ed8f8e8 100644 --- a/website/docs/quickstarts/codespace-qs.md +++ b/website/docs/guides/codespace-qs.md @@ -1,9 +1,11 @@ --- -title: "Quickstart for dbt Core using GitHub Codespaces" +title: Quickstart for dbt Core using GitHub Codespaces id: codespace platform: 'dbt-core' icon: 'fa-github' +level: 'Beginner' hide_table_of_contents: true +tags: ['dbt Core','Quickstart'] --- ## Introduction @@ -19,10 +21,10 @@ dbt Labs provides a [GitHub Codespace](https://docs.github.com/en/codespaces/ove ## Related content -- [Create a GitHub repository](/quickstarts/manual-install?step=2) -- [Build your first models](/quickstarts/manual-install?step=3) -- [Test and document your project](/quickstarts/manual-install?step=4) -- [Schedule a job](/quickstarts/manual-install?step=5) +- [Create a GitHub repository](/guides/manual-install?step=2) +- [Build your first models](/guides/manual-install?step=3) +- [Test and document your project](/guides/manual-install?step=4) +- [Schedule a job](/guides/manual-install?step=5) - Learn more with [dbt Courses](https://courses.getdbt.com/collections) ## Create a codespace diff --git a/website/docs/guides/advanced/creating-new-materializations.md b/website/docs/guides/create-new-materializations.md similarity index 95% rename from website/docs/guides/advanced/creating-new-materializations.md rename to website/docs/guides/create-new-materializations.md index d3081ea8e20..1ad7d202de6 100644 --- a/website/docs/guides/advanced/creating-new-materializations.md +++ b/website/docs/guides/create-new-materializations.md @@ -1,12 +1,18 @@ --- -title: "Creating new materializations" -id: "creating-new-materializations" +title: "Create new materializations" +id: create-new-materializations description: Learn how to create your own materializations. displayText: Creating new materializations hoverSnippet: Learn how to create your own materializations. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['dbt Core'] +level: 'Advanced' +recently_updated: true --- -## Overview +## Introduction The model materializations you're familiar with, `table`, `view`, and `incremental` are implemented as macros in a package that's distributed along with dbt. You can check out the [source code for these materializations](https://github.com/dbt-labs/dbt-core/tree/main/core/dbt/include/global_project/macros/materializations). If you need to create your own materializations, reading these files is a good place to start. Continue reading below for a deep-dive into dbt materializations. 
@@ -110,13 +116,6 @@ Be sure to `commit` the transaction in the `cleanup` phase of the materializatio ### Update the Relation cache - -:::info New in 0.15.0 - -The ability to synchronize the Relation cache is new in dbt v0.15.0 - -::: - Materializations should [return](/reference/dbt-jinja-functions/return) the list of Relations that they have created at the end of execution. dbt will use this list of Relations to update the relation cache in order to reduce the number of queries executed against the database's `information_schema`. If a list of Relations is not returned, then dbt will raise a Deprecation Warning and infer the created relation from the model's configured database, schema, and alias. @@ -172,13 +171,6 @@ For more information on the `config` dbt Jinja function, see the [config](/refer ## Materialization precedence - -:::info New in 0.15.1 - -The materialization resolution order was poorly defined in versions of dbt prior to 0.15.1. Please use this guide for versions of dbt greater than or equal to 0.15.1. - -::: - dbt will pick the materialization macro in the following order (lower takes priority): 1. global project - default diff --git a/website/docs/guides/orchestration/custom-cicd-pipelines/3-dbt-cloud-job-on-merge.md b/website/docs/guides/custom-cicd-pipelines.md similarity index 58% rename from website/docs/guides/orchestration/custom-cicd-pipelines/3-dbt-cloud-job-on-merge.md rename to website/docs/guides/custom-cicd-pipelines.md index d22d1d14284..672c6e6dab8 100644 --- a/website/docs/guides/orchestration/custom-cicd-pipelines/3-dbt-cloud-job-on-merge.md +++ b/website/docs/guides/custom-cicd-pipelines.md @@ -1,13 +1,64 @@ --- -title: Run a dbt Cloud job on merge -id: 3-dbt-cloud-job-on-merge +title: Customizing CI/CD with custom pipelines +id: custom-cicd-pipelines +description: "Learn the benefits of version-controlled analytics code and custom pipelines in dbt for enhanced code testing and workflow automation during the development process." +displayText: Learn version-controlled code, custom pipelines, and enhanced code testing. +hoverSnippet: Learn version-controlled code, custom pipelines, and enhanced code testing. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['dbt Cloud', 'Orchestration', 'CI'] +level: 'Intermediate' +recently_updated: true --- +## Introduction + +One of the core tenets of dbt is that analytic code should be version controlled. This provides a ton of benefit to your organization in terms of collaboration, code consistency, stability, and the ability to roll back to a prior version. There’s an additional benefit that is provided with your code hosting platform that is often overlooked or underutilized. Some of you may have experience using dbt Cloud’s [webhook functionality](https://docs.getdbt.com/docs/dbt-cloud/using-dbt-cloud/cloud-enabling-continuous-integration) to run a job when a PR is created. This is a fantastic capability, and meets most use cases for testing your code before merging to production. However, there are circumstances when an organization needs additional functionality, like running workflows on every commit (linting), or running workflows after a merge is complete. In this article, we will show you how to setup custom pipelines to lint your project and trigger a dbt Cloud job via the API. + +A note on parlance in this article since each code hosting platform uses different terms for similar concepts. 
The terms `pull request` (PR) and `merge request` (MR) are used interchangeably to mean the process of merging one branch into another.
+
+
+### What are pipelines?
+
+Pipelines (which are known by many names, such as workflows, actions, or build steps) are a series of pre-defined jobs that are triggered by specific events in your repository (PR created, commit pushed, branch merged, etc). Those jobs can do pretty much anything your heart desires assuming you have the proper security access and coding chops.
+
+Jobs are executed on [runners](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#runners), which are virtual servers. The runners come pre-configured with Ubuntu Linux, macOS, or Windows. That means the commands you execute are determined by the operating system of your runner. You’ll see how this comes into play later in the setup, but for now just remember that your code is executed on virtual servers that are, typically, hosted by the code hosting platform.
+
+![Diagram of how pipelines work](/img/guides/orchestration/custom-cicd-pipelines/pipeline-diagram.png)
+
+Please note, runners hosted by your code hosting platform provide a certain amount of free time. After that, billing charges may apply depending on how your account is set up. You also have the ability to host your own runners. That is beyond the scope of this article, but check out the links below for more information if you’re interested in setting that up:
+
+- Repo-hosted runner billing information:
+  - [GitHub](https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions)
+  - [GitLab](https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html)
+  - [Bitbucket](https://bitbucket.org/product/features/pipelines#)
+- Self-hosted runner information:
+  - [GitHub](https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners)
+  - [GitLab](https://docs.gitlab.com/runner/)
+  - [Bitbucket](https://support.atlassian.com/bitbucket-cloud/docs/runners/)
+
+Additionally, if you’re using the free tier of GitLab you can still follow this guide, but it may ask you to provide a credit card to verify your account. You’ll see something like this the first time you try to run a pipeline:
+
+![Warning from GitLab showing payment information is required](/img/guides/orchestration/custom-cicd-pipelines/gitlab-cicd-payment-warning.png)
+
+
+### How to set up pipelines
+
+This guide provides details for multiple code hosting platforms. Where steps are unique, they are presented without a selection option. If code is specific to a platform (e.g. GitHub, GitLab, Bitbucket) you will see a selection option for each.
+
+Pipelines can be triggered by various events. The [dbt Cloud webhook](https://docs.getdbt.com/docs/dbt-cloud/using-dbt-cloud/cloud-enabling-continuous-integration) process already triggers a run if you want to run your jobs on a merge request, so this guide focuses on running pipelines for every push and when PRs are merged. Since pushes happen frequently in a project, we’ll keep this job super simple and fast by linting with SQLFluff. The pipeline that runs on merge requests will run less frequently, and can be used to call the dbt Cloud API to trigger a specific job. This can be helpful if you have specific requirements that need to happen when code is updated in production, like running a `--full-refresh` on all impacted incremental models.
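To make the API-trigger half of this concrete before getting into the platform-specific setup, here is a rough, minimal sketch of the call that the merge pipeline ultimately makes. It assumes the dbt Cloud v2 Administrative API, the `requests` library, and placeholder account and job IDs; the pipelines later in this guide use a more complete `python/run_and_monitor_dbt_job.py` script.

```python
import os
import requests

# Minimal sketch: trigger a dbt Cloud job run from a CI/CD pipeline step.
# ACCOUNT_ID and JOB_ID are placeholders; DBT_API_KEY is the repository
# secret described later in this guide.
ACCOUNT_ID = 16173
JOB_ID = 65767
API_KEY = os.environ["DBT_API_KEY"]

response = requests.post(
    f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
    headers={"Authorization": f"Token {API_KEY}"},
    json={"cause": "Triggered by CI/CD pipeline"},
)
response.raise_for_status()

# The run ID is useful if you want to poll the run's status afterwards.
print(response.json()["data"]["id"])
```

If the request succeeds, dbt Cloud queues the job exactly as if you had clicked **Run now** in the UI.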
+ +Here’s a quick look at what this pipeline will accomplish: + +![Diagram showing the pipelines to be created and the programs involved](/img/guides/orchestration/custom-cicd-pipelines/pipeline-programs-diagram.png) + +## Run a dbt Cloud job on merge + This job will take a bit more to setup, but is a good example of how to call the dbt Cloud API from a CI/CD pipeline. The concepts presented here can be generalized and used in whatever way best suits your use case. The setup below shows how to call the dbt Cloud API to run a job every time there's a push to your main branch (The branch where pull requests are typically merged. Commonly referred to as the main, primary, or master branch, but can be named differently). - ### 1. Get your dbt Cloud API key When running a CI/CD pipeline you’ll want to use a service token instead of any individual’s API key. There are [detailed docs](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens) available on this, but below is a quick rundown (this must be performed by an Account Admin): @@ -28,7 +79,7 @@ Here’s a video showing the steps as well: ### 2. Put your dbt Cloud API key into your repo -This next part will happen in you code hosting platform. We need to save your API key from above into a repository secret so the job we create can access it. It is **not** recommended to ever save passwords or API keys in your code, so this step ensures that your key stays secure, but is still usable for your pipelines. +This next part will happen in you code hosting platform. We need to save your API key from above into a repository secret so the job we create can access it. It is **not** recommended to ever save passwords or API keys in your code, so this step ensures that your key stays secure, but is still usable for your pipelines. -In GitHub: - - Open up your repository where you want to run the pipeline (the same one that houses your dbt project) - Click *Settings* to open up the repository options - On the left click the *Security* dropdown - From that list, click on *Actions* - Towards the middle of the screen, click the *New repository secret* button - It will ask you for a name, so let’s call ours `DBT_API_KEY` - - **It’s very important that you copy/paste this name exactly because it’s used in the scripts below.** + - **It’s very important that you copy/paste this name exactly because it’s used in the scripts below.** - In the *Value* section, paste in the key you copied from dbt Cloud - Click *Add secret* and you’re all set! @@ -62,23 +111,21 @@ Here’s a video showing these steps: -In GitLab: - - Open up your repository where you want to run the pipeline (the same one that houses your dbt project) - Click *Settings* > *CI/CD* - Under the *Variables* section, click *Expand,* then click *Add variable* - It will ask you for a name, so let’s call ours `DBT_API_KEY` - - **It’s very important that you copy/paste this name exactly because it’s used in the scripts below.** + - **It’s very important that you copy/paste this name exactly because it’s used in the scripts below.** - In the *Value* section, paste in the key you copied from dbt Cloud - Make sure the check box next to *Protect variable* is unchecked, and the box next to *Mask variable* is selected (see below) - - “Protected” means that the variable is only available in pipelines that run on protected branches or protected tags - that won’t work for us because we want to run this pipeline on multiple branches. 
“Masked” means that it will be available to your pipeline runner, but will be masked in the logs. - + - “Protected” means that the variable is only available in pipelines that run on protected branches or protected tags - that won’t work for us because we want to run this pipeline on multiple branches. “Masked” means that it will be available to your pipeline runner, but will be masked in the logs. + ![View of the GitLab window for entering DBT_API_KEY](/img/guides/orchestration/custom-cicd-pipelines/dbt-api-key-gitlab.png) - + Here’s a video showing these steps: - + - + @@ -91,7 +138,7 @@ In Azure: - Select *Starter pipeline* (this will be updated later in Step 4) - Click on *Variables* and then *New variable* - In the *Name* field, enter the `DBT_API_KEY` - - **It’s very important that you copy/paste this name exactly because it’s used in the scripts below.** + - **It’s very important that you copy/paste this name exactly because it’s used in the scripts below.** - In the *Value* section, paste in the key you copied from dbt Cloud - Make sure the check box next to *Keep this value secret* is checked. This will mask the value in logs, and you won't be able to see the value for the variable in the UI. - Click *OK* and then *Save* to save the variable @@ -99,7 +146,7 @@ In Azure: - + In Bitbucket: @@ -108,16 +155,16 @@ In Bitbucket: - In the left menu, click *Repository Settings* - Scroll to the bottom of the left menu, and select *Repository variables* - In the *Name* field, input `DBT_API_KEY` - - **It’s very important that you copy/paste this name exactly because it’s used in the scripts below.** + - **It’s very important that you copy/paste this name exactly because it’s used in the scripts below.** - In the *Value* section, paste in the key you copied from dbt Cloud - Make sure the check box next to *Secured* is checked. This will mask the value in logs, and you won't be able to see the value for the variable in the UI. - Click *Add* to save the variable - + ![View of the Bitbucket window for entering DBT_API_KEY](/img/guides/orchestration/custom-cicd-pipelines/dbt-api-key-bitbucket.png) - + Here’s a video showing these steps: - + @@ -304,13 +351,12 @@ run-dbt-cloud-job: - For this new job, open the existing Azure pipeline you created above and select the *Edit* button. We'll want to edit the corresponding Azure pipeline YAML file with the appropriate configuration, instead of the starter code, along with including a `variables` section to pass in the required variables. -Copy the below YAML file into your Azure pipeline and update the variables below to match your setup based on the comments in the file. It's worth noting that we changed the `trigger` section so that it will run **only** when there are pushes to a branch named `main` (like a PR merged to your main branch). +Copy the below YAML file into your Azure pipeline and update the variables below to match your setup based on the comments in the file. It's worth noting that we changed the `trigger` section so that it will run **only** when there are pushes to a branch named `main` (like a PR merged to your main branch). Read through [Azure's docs](https://learn.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops) on these filters for additional use cases. @@ -406,13 +452,12 @@ pipelines: - ### 5. Test your new action -Now that you have a shiny new action, it’s time to test it out! 
Since this change is setup to only run on merges to your default branch, you’ll need to create and merge this change into your main branch. Once you do that, you’ll see a new pipeline job has been triggered to run the dbt Cloud job you assigned in the variables section. +Now that you have a shiny new action, it’s time to test it out! Since this change is setup to only run on merges to your default branch, you’ll need to create and merge this change into your main branch. Once you do that, you’ll see a new pipeline job has been triggered to run the dbt Cloud job you assigned in the variables section. Additionally, you’ll see the job in the run history of dbt Cloud. It should be fairly easy to spot because it will say it was triggered by the API, and the *INFO* section will have the branch you used for this guide. @@ -454,3 +499,140 @@ Additionally, you’ll see the job in the run history of dbt Cloud. It should be + +## Run a dbt Cloud job on pull request + +If your git provider is not one with a native integration with dbt Cloud, but you still want to take advantage of CI builds, you've come to the right spot! With just a bit of work it's possible to setup a job that will run a dbt Cloud job when a pull request (PR) is created. + +:::info Run on PR + +If your git provider has a native integration with dbt Cloud, you can take advantage of the setup instructions [here](/docs/deploy/ci-jobs). +This section is only for those projects that connect to their git repository using an SSH key. + +::: + +The setup for this pipeline will use the same steps as the prior page. Before moving on, **follow steps 1-5 from the [prior page](https://docs.getdbt.com/guides/orchestration/custom-cicd-pipelines/3-dbt-cloud-job-on-merge)** + +### 1. Create a pipeline job that runs when PRs are created + + + + +For this job, we'll set it up using the `bitbucket-pipelines.yml` file as in the prior step. The YAML file will look pretty similar to our earlier job, but we’ll pass in the required variables to the Python script using `export` statements. Update this section to match your setup based on the comments in the file. + +**What is this pipeline going to do?** +The setup below will trigger a dbt Cloud job to run every time a PR is opened in this repository. It will also run a fresh version of the pipeline for every commit that is made on the PR until it is merged. +For example: If you open a PR, it will run the pipeline. If you then decide additional changes are needed, and commit/push to the PR branch, a new pipeline will run with the updated code. + +The following varibles control this job: + +- `DBT_JOB_BRANCH`: Tells the dbt Cloud job to run the code in the branch that created this PR +- `DBT_JOB_SCHEMA_OVERRIDE`: Tells the dbt Cloud job to run this into a custom target schema + - The format of this will look like: `DBT_CLOUD_PR_{REPO_KEY}_{PR_NUMBER}` + +```yaml +image: python:3.11.1 + + +pipelines: + # This job will run when pull requests are created in the repository + pull-requests: + '**': + - step: + name: 'Run dbt Cloud PR Job' + script: + # Check to only build if PR destination is master (or other branch). + # Comment or remove line below if you want to run on all PR's regardless of destination branch. 
+ - if [ "${BITBUCKET_PR_DESTINATION_BRANCH}" != "main" ]; then printf 'PR Destination is not master, exiting.'; exit; fi + - export DBT_URL="https://cloud.getdbt.com" + - export DBT_JOB_CAUSE="Bitbucket Pipeline CI Job" + - export DBT_JOB_BRANCH=$BITBUCKET_BRANCH + - export DBT_JOB_SCHEMA_OVERRIDE="DBT_CLOUD_PR_"$BITBUCKET_PROJECT_KEY"_"$BITBUCKET_PR_ID + - export DBT_ACCOUNT_ID=00000 # enter your account id here + - export DBT_PROJECT_ID=00000 # enter your project id here + - export DBT_PR_JOB_ID=00000 # enter your job id here + - python python/run_and_monitor_dbt_job.py +``` + + + + +### 2. Confirm the pipeline runs + +Now that you have a new pipeline, it's time to run it and make sure it works. Since this only triggers when a PR is created, you'll need to create a new PR on a branch that contains the code above. Once you do that, you should see a pipeline that looks like this: + + + + +Bitbucket pipeline: +![dbt run on PR job in Bitbucket](/img/guides/orchestration/custom-cicd-pipelines/bitbucket-run-on-pr.png) + +dbt Cloud job: +![dbt Cloud job showing it was triggered by Bitbucket](/img/guides/orchestration/custom-cicd-pipelines/bitbucket-dbt-cloud-pr.png) + + + + +### 3. Handle those extra schemas in your database + +As noted above, when the PR job runs it will create a new schema based on the PR. To avoid having your database overwhelmed with PR schemas, consider adding a "cleanup" job to your dbt Cloud account. This job can run on a scheduled basis to cleanup any PR schemas that haven't been updated/used recently. + +Add this as a macro to your project. It takes 2 arguments that lets you control which schema get dropped: + +- `age_in_days`: The number of days since the schema was last altered before it should be dropped (default 10 days) +- `database_to_clean`: The name of the database to remove schemas from + +```sql +{# + This macro finds PR schemas older than a set date and drops them + The macro defaults to 10 days old, but can be configured with the input argument age_in_days + Sample usage with different date: + dbt run-operation pr_schema_cleanup --args "{'database_to_clean': 'analytics','age_in_days':'15'}" +#} +{% macro pr_schema_cleanup(database_to_clean, age_in_days=10) %} + + {% set find_old_schemas %} + select + 'drop schema {{ database_to_clean }}.'||schema_name||';' + from {{ database_to_clean }}.information_schema.schemata + where + catalog_name = '{{ database_to_clean | upper }}' + and schema_name ilike 'DBT_CLOUD_PR%' + and last_altered <= (current_date() - interval '{{ age_in_days }} days') + {% endset %} + + {% if execute %} + + {{ log('Schema drop statements:' ,True) }} + + {% set schema_drop_list = run_query(find_old_schemas).columns[0].values() %} + + {% for schema_to_drop in schema_drop_list %} + {% do run_query(schema_to_drop) %} + {{ log(schema_to_drop ,True) }} + {% endfor %} + + {% endif %} + +{% endmacro %} +``` + +This macro goes into a dbt Cloud job that is run on a schedule. The command will look like this (text below for copy/paste): +![dbt Cloud job showing the run operation command for the cleanup macro](/img/guides/orchestration/custom-cicd-pipelines/dbt-macro-cleanup-pr.png) +`dbt run-operation pr_schema_cleanup --args "{ 'database_to_clean': 'development','age_in_days':15}"` + +## Consider risk of conflicts when using multiple orchestration tools + +Running dbt Cloud jobs through a CI/CD pipeline is a form of job orchestration. If you also run jobs using dbt Cloud’s built in scheduler, you now have 2 orchestration tools running jobs. 
The risk with this is that you could run into conflicts - you can imagine a case where you are triggering a pipeline on certain actions and running scheduled jobs in dbt Cloud, you would probably run into job clashes. The more tools you have, the more you have to make sure everything talks to each other. + +That being said, if **the only reason you want to use pipelines is for adding a lint check or run on merge**, you might decide the pros outweigh the cons, and as such you want to go with a hybrid approach. Just keep in mind that if two processes try and run the same job at the same time, dbt Cloud will queue the jobs and run one after the other. It’s a balancing act but can be accomplished with diligence to ensure you’re orchestrating jobs in a manner that does not conflict. diff --git a/website/docs/quickstarts/databricks-qs.md b/website/docs/guides/databricks-qs.md similarity index 99% rename from website/docs/quickstarts/databricks-qs.md rename to website/docs/guides/databricks-qs.md index 08334862517..5a0c5536e7f 100644 --- a/website/docs/quickstarts/databricks-qs.md +++ b/website/docs/guides/databricks-qs.md @@ -1,9 +1,11 @@ --- title: "Quickstart for dbt Cloud and Databricks" id: "databricks" -platform: 'dbt-cloud' +level: 'Beginner' icon: 'databricks' hide_table_of_contents: true +recently_updated: true +tags: ['dbt Cloud', 'Quickstart','Databricks'] --- ## Introduction diff --git a/website/docs/guides/dbt-ecosystem/adapter-development/1-what-are-adapters.md b/website/docs/guides/dbt-ecosystem/adapter-development/1-what-are-adapters.md deleted file mode 100644 index 0959dbee707..00000000000 --- a/website/docs/guides/dbt-ecosystem/adapter-development/1-what-are-adapters.md +++ /dev/null @@ -1,100 +0,0 @@ ---- -title: "What are adapters? Why do we need them?" -id: "1-what-are-adapters" ---- - -Adapters are an essential component of dbt. At their most basic level, they are how dbt Core connects with the various supported data platforms. At a higher-level, dbt Core adapters strive to give analytics engineers more transferrable skills as well as standardize how analytics projects are structured. Gone are the days where you have to learn a new language or flavor of SQL when you move to a new job that has a different data platform. That is the power of adapters in dbt Core. - - Navigating and developing around the nuances of different databases can be daunting, but you are not alone. Visit [#adapter-ecosystem](https://getdbt.slack.com/archives/C030A0UF5LM) Slack channel for additional help beyond the documentation. - -## All databases are not the same - -There's a tremendous amount of work that goes into creating a database. Here is a high-level list of typical database layers (from the outermost layer moving inwards): -- SQL API -- Client Library / Driver -- Server Connection Manager -- Query parser -- Query optimizer -- Runtime -- Storage Access Layer -- Storage - -There's a lot more there than just SQL as a language. Databases (and data warehouses) are so popular because you can abstract away a great deal of the complexity from your brain to the database itself. This enables you to focus more on the data. 
-
-dbt allows for further abstraction and standardization of the outermost layers of a database (SQL API, client library, connection manager) into a framework that both:
- - Opens database technology to less technical users (a large swath of a DBA's role has been automated, similar to how the vast majority of folks with websites today no longer have to be "[webmasters](https://en.wikipedia.org/wiki/Webmaster)").
- - Enables more meaningful conversations about how data warehousing should be done.
-
-This is where dbt adapters become critical.
-
-## What needs to be adapted?
-
-dbt adapters are responsible for _adapting_ dbt's standard functionality to a particular database. Our prototypical database and adapter are PostgreSQL and dbt-postgres, and most of our adapters are somewhat based on the functionality described in dbt-postgres.
-
-Connecting dbt to a new database will require a new adapter to be built or an existing adapter to be extended.
-
-The outermost layers of a database map roughly to the areas in which the dbt adapter framework encapsulates inter-database differences.
-
-### SQL API
-
-Even amongst ANSI-compliant databases, there are differences in the SQL grammar.
-Here are some categories and examples of SQL statements that can be constructed differently:
-
-| Category | Area of differences | Examples |
-|----------------------------------------------|----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
-| Statement syntax | The use of `IF EXISTS` | • `IF EXISTS, DROP TABLE`<br />• `DROP IF EXISTS` |
-| Workflow definition & semantics | Incremental updates | • `MERGE`<br />• `DELETE; INSERT` |
-| Relation and column attributes/configuration | Database-specific materialization configs | • `DIST = ROUND_ROBIN` (Synapse)<br />• `DIST = EVEN` (Redshift) |
-| Permissioning | Grant statements that can only take one grantee at a time vs those that accept lists of grantees | • `grant SELECT on table dinner.corn to corn_kid, everyone`<br />• `grant SELECT on table dinner.corn to corn_kid; grant SELECT on table dinner.corn to everyone` |
-
-### Python Client Library & Connection Manager
-
-The other big category of inter-database differences comes with how the client connects to the database and executes queries against the connection. To integrate with dbt, a data platform must have a pre-existing python client library or support ODBC, using a generic python library like pyodbc.
-
-| Category | Area of differences | Examples |
-|------------------------------|--------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
-| Credentials & authentication | Authentication | • Username & password<br />• MFA with `boto3` or Okta token |
-| Connection opening/closing | Create a new connection to db | • `psycopg2.connect(connection_string)`<br />• `google.cloud.bigquery.Client(...)` |
-| Inserting local data | Load seed `.csv` files into Python memory | • `google.cloud.bigquery.Client.load_table_from_file(...)` (BigQuery)<br />• `INSERT ... INTO VALUES ...` prepared statement (most other databases) |
  • | - - -## How dbt encapsulates and abstracts these differences - -Differences between databases are encoded into discrete areas: - -| Components | Code Path | Function | -|------------------|---------------------------------------------------|-------------------------------------------------------------------------------| -| Python Classes | `adapters/` | Configuration (See above [Python classes](##python classes) | -| Macros | `include//macros/adapters/` | SQL API & statement syntax (for example, how to create schema or how to get table info) | -| Materializations | `include//macros/materializations/` | Table/view/snapshot/ workflow definitions | - - -### Python Classes - -These classes implement all the methods responsible for: -- Connecting to a database and issuing queries. -- Providing dbt with database-specific configuration information. - -| Class | Description | -|--------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| AdapterClass | High-level configuration type conversion and any database-specific python methods needed | -| AdapterCredentials | Typed dictionary of possible profiles and associated methods | -| AdapterConnectionManager | All the methods responsible for connecting to a database and issuing queries | -| AdapterRelation | How relation names should be rendered, printed, and quoted. Do relation names use all three parts? `catalog.model_name` (two-part name) or `database.schema.model_name` (three-part name) | -| AdapterColumn | How names should be rendered, and database-specific properties | - -### Macros - -A set of *macros* responsible for generating SQL that is compliant with the target database. - -### Materializations - -A set of *materializations* and their corresponding helper macros defined in dbt using jinja and SQL. They codify for dbt how model files should be persisted into the database. - -## Adapter Architecture - - -Below is a diagram of how dbt-postgres, the adapter at the center of dbt-core, works. - - diff --git a/website/docs/guides/dbt-ecosystem/adapter-development/2-prerequisites-for-a-new-adapter.md b/website/docs/guides/dbt-ecosystem/adapter-development/2-prerequisites-for-a-new-adapter.md deleted file mode 100644 index 28cd8935937..00000000000 --- a/website/docs/guides/dbt-ecosystem/adapter-development/2-prerequisites-for-a-new-adapter.md +++ /dev/null @@ -1,52 +0,0 @@ ---- -title: "Prerequisites for a new adapter" -id: "2-prerequisites-for-a-new-adapter" ---- - -To learn what an adapter is and they role they serve, see [What are adapters?](1-what-are-adapters) - -It is very important that make sure that you have the right skills, and to understand the level of difficulty required to make an adapter for your data platform. - -## Pre-Requisite Data Warehouse Features - -The more you can answer Yes to the below questions, the easier your adapter development (and user-) experience will be. See the [New Adapter Information Sheet wiki](https://github.com/dbt-labs/dbt-core/wiki/New-Adapter-Information-Sheet) for even more specific questions. - -### Training -- the developer (and any product managers) ideally will have substantial experience as an end-user of dbt. If not, it is highly advised that you at least take the [dbt Fundamentals](https://courses.getdbt.com/courses/fundamentals) and [Advanced Materializations](https://courses.getdbt.com/courses/advanced-materializations) course. 
- -### Database -- Does the database complete transactions fast enough for interactive development? -- Can you execute SQL against the data platform? -- Is there a concept of schemas? -- Does the data platform support ANSI SQL, or at least a subset? -### Driver / Connection Library -- Is there a Python-based driver for interacting with the database that is db API 2.0 compliant (e.g. Psycopg2 for Postgres, pyodbc for SQL Server) -- Does it support: prepared statements, multiple statements, or single sign on token authorization to the data platform? - -### Open source software -- Does your organization have an established process for publishing open source software? - - -It is easiest to build an adapter for dbt when the following the /platform in question has: -- a conventional ANSI-SQL interface (or as close to it as possible), -- a mature connection library/SDK that uses ODBC or Python DB 2 API, and -- a way to enable developers to iterate rapidly with both quick reads and writes - - -## Maintaining your new adapter - -When your adapter becomes more popular, and people start using it, you may quickly become the maintainer of an increasingly popular open source project. With this new role, comes some unexpected responsibilities that not only include code maintenance, but also working with a community of users and contributors. To help people understand what to expect of your project, you should communicate your intentions early and often in your adapter documentation or README. Answer questions like, Is this experimental work that people should use at their own risk? Or is this production-grade code that you're committed to maintaining into the future? - -### Keeping the code compatible with dbt Core - -New minor version releases of `dbt-core` may include changes to the Python interface for adapter plugins, as well as new or updated test cases. The maintainers of `dbt-core` will clearly communicate these changes in documentation and release notes, and they will aim for backwards compatibility whenever possible. - -Patch releases of `dbt-core` will _not_ include breaking changes to adapter-facing code. For more details, see ["About dbt Core versions"](/docs/dbt-versions/core). - -### Versioning and releasing your adapter - -We strongly encourage you to adopt the following approach when versioning and releasing your plugin: -- The minor version of your plugin should match the minor version in `dbt-core` (e.g. 1.1.x). -- Aim to release a new version of your plugin for each new minor version of `dbt-core` (once every three months). -- While your plugin is new, and you're iterating on features, aim to offer backwards compatibility and deprecation notices for at least one minor version. As your plugin matures, aim to leave backwards compatibility and deprecation notices in place until the next major version (dbt Core v2). -- Release patch versions of your plugins whenever needed. These patch releases should contain fixes _only_. 
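As an illustration of that version alignment, a plugin tracking dbt Core 1.1.x might pin its `dbt-core` dependency to the same minor series in its package metadata. This is only a hypothetical sketch — the package name and version numbers below are placeholders:

```python
# setup.py for a hypothetical dbt-myadapter plugin tracking dbt Core 1.1.x
from setuptools import find_namespace_packages, setup

setup(
    name="dbt-myadapter",
    version="1.1.0",  # plugin minor version tracks dbt-core's minor version
    packages=find_namespace_packages(include=["dbt", "dbt.*"]),
    install_requires=[
        # allow patch releases of dbt-core 1.1.x, but not 1.2.0 and above
        "dbt-core~=1.1.0",
    ],
)
```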
diff --git a/website/docs/guides/dbt-ecosystem/adapter-development/3-building-a-new-adapter.md b/website/docs/guides/dbt-ecosystem/adapter-development/3-building-a-new-adapter.md deleted file mode 100644 index 43826ca4b1d..00000000000 --- a/website/docs/guides/dbt-ecosystem/adapter-development/3-building-a-new-adapter.md +++ /dev/null @@ -1,416 +0,0 @@ ---- -title: "Building a new adapter" -id: "3-building-a-new-adapter" ---- - -:::tip -Before you build your adapter, we strongly encourage you to first learn dbt as an end user, learn [what an adapter is and the role they serve](1-what-are-adapters), as well as [data platform prerequisites](2-prerequisites-for-a-new-adapter) -::: - - -This guide will walk you through the first creating the necessary adapter classes and macros, and provide some resources to help you validate that your new adapter is working correctly. Once the adapter is passing most of the functional tests (see ["Testing a new adapter"](4-testing-a-new-adapter) -), please let the community know that is available to use by adding the adapter to the ["Supported Data Platforms"](/docs/supported-data-platforms) page by following the steps given in [Documenting your adapter](/guides/dbt-ecosystem/adapter-development/5-documenting-a-new-adapter). - -For any questions you may have, don't hesitate to ask in the [#adapter-ecosystem](https://getdbt.slack.com/archives/C030A0UF5LM) Slack channel. The community is very helpful and likely has experienced a similar issue as you. - -## Scaffolding a new adapter - To create a new adapter plugin from scratch, you can use the [dbt-database-adapter-scaffold](https://github.com/dbt-labs/dbt-database-adapter-scaffold) to trigger an interactive session which will generate a scaffolding for you to build upon. - - Example usage: - - ``` - $ cookiecutter gh:dbt-labs/dbt-database-adapter-scaffold - ``` - -The generated boilerplate starting project will include a basic adapter plugin file structure, examples of macros, high level method descriptions, etc. - -One of the most important choices you will make during the cookiecutter generation will revolve around the field for `is_sql_adapter` which is a boolean used to correctly apply imports for either a `SQLAdapter` or `BaseAdapter`. Knowing which you will need requires a deeper knowledge of your selected database but a few good guides for the choice are. -- Does your database have a complete SQL API? Can it perform tasks using SQL such as creating schemas, dropping schemas, querying an `information_schema` for metadata calls? If so, it is more likely to be a SQLAdapter where you set `is_sql_adapter` to `True`. -- Most adapters do fall under SQL adapters which is why we chose it as the default `True` value. -- It is very possible to build out a fully functional `BaseAdapter`. This will require a little more ground work as it doesn't come with some prebuilt methods the `SQLAdapter` class provides. See `dbt-bigquery` as a good guide. - -## Implementation Details - -Regardless if you decide to use the cookiecutter template or manually create the plugin, this section will go over each method that is required to be implemented. The table below provides a high-level overview of the classes, methods, and macros you may have to define for your data platform. 
- -| file | component | purpose | -|---------------------------------------------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `./setup.py` | `setup()` function | adapter meta-data (package name, version, author, homepage, etc) | -| `myadapter/dbt/adapters/myadapter/__init__.py` | `AdapterPlugin` | bundle all the information below into a dbt plugin | -| `myadapter/dbt/adapters/myadapter/connections.py` | `MyAdapterCredentials` class | parameters to connect to and configure the database, via a the chosen Python driver | -| `myadapter/dbt/adapters/myadapter/connections.py` | `MyAdapterConnectionManager` class | telling dbt how to interact with the database w.r.t opening/closing connections, executing queries, and fetching data. Effectively a wrapper around the db API or driver. | -| `myadapter/dbt/include/bigquery/` | a dbt project of macro "overrides" in the format of "myadapter__" | any differences in SQL syntax for regular db operations will be modified here from the global_project (e.g. "Create Table As Select", "Get all relations in the current schema", etc) | -| `myadapter/dbt/adapters/myadapter/impl.py` | `MyAdapterConfig` | database- and relation-level configs and | -| `myadapter/dbt/adapters/myadapter/impl.py` | `MyAdapterAdapter` | for changing _how_ dbt performs operations like macros and other needed Python functionality | -| `myadapter/dbt/adapters/myadapter/column.py` | `MyAdapterColumn` | for defining database-specific column such as datatype mappings | - -### Editing `setup.py` - -Edit the file at `myadapter/setup.py` and fill in the missing information. - -You can skip this step if you passed the arguments for `email`, `url`, `author`, and `dependencies` to the cookiecutter template script. If you plan on having nested macro folder structures, you may need to add entries to `package_data` so your macro source files get installed. - -### Editing the connection manager - -Edit the connection manager at `myadapter/dbt/adapters/myadapter/connections.py`. This file is defined in the sections below. - -#### The Credentials class - -The credentials class defines all of the database-specific credentials (e.g. `username` and `password`) that users will need in the [connection profile](/docs/supported-data-platforms) for your new adapter. Each credentials contract should subclass dbt.adapters.base.Credentials, and be implemented as a python dataclass. - -Note that the base class includes required database and schema fields, as dbt uses those values internally. - -For example, if your adapter requires a host, integer port, username string, and password string, but host is the only required field, you'd add definitions for those new properties to the class as types, like this: - - - -```python - -from dataclasses import dataclass -from typing import Optional - -from dbt.adapters.base import Credentials - - -@dataclass -class MyAdapterCredentials(Credentials): - host: str - port: int = 1337 - username: Optional[str] = None - password: Optional[str] = None - - @property - def type(self): - return 'myadapter' - - @property - def unique_field(self): - """ - Hashed and included in anonymous telemetry to track adapter adoption. 
- Pick a field that can uniquely identify one team/organization building with this adapter - """ - return self.host - - def _connection_keys(self): - """ - List of keys to display in the `dbt debug` output. - """ - return ('host', 'port', 'database', 'username') -``` - - - -There are a few things you can do to make it easier for users when connecting to your database: -- Be sure to implement the Credentials' `_connection_keys` method shown above. This method will return the keys that should be displayed in the output of the `dbt debug` command. As a general rule, it's good to return all the arguments used in connecting to the actual database except the password (even optional arguments). -- Create a `profile_template.yml` to enable configuration prompts for a brand-new user setting up a connection profile via the [`dbt init` command](/reference/commands/init). See more details [below](#other-files). -- You may also want to define an `ALIASES` mapping on your Credentials class to include any config names you want users to be able to use in place of 'database' or 'schema'. For example if everyone using the MyAdapter database calls their databases "collections", you might do: - - - -```python -@dataclass -class MyAdapterCredentials(Credentials): - host: str - port: int = 1337 - username: Optional[str] = None - password: Optional[str] = None - - ALIASES = { - 'collection': 'database', - } -``` - - - -Then users can use `collection` OR `database` in their `profiles.yml`, `dbt_project.yml`, or `config()` calls to set the database. - -#### `ConnectionManager` class methods - -Once credentials are configured, you'll need to implement some connection-oriented methods. They are enumerated in the SQLConnectionManager docstring, but an overview will also be provided here. - -**Methods to implement:** -- `open` -- `get_response` -- `cancel` -- `exception_handler` -- `standardize_grants_dict` - -##### `open(cls, connection)` - -`open()` is a classmethod that gets a connection object (which could be in any state, but will have a `Credentials` object with the attributes you defined above) and moves it to the 'open' state. - -Generally this means doing the following: - - if the connection is open already, log and return it. - - If a database needed changes to the underlying connection before re-use, that would happen here - - create a connection handle using the underlying database library using the credentials - - on success: - - set connection.state to `'open'` - - set connection.handle to the handle object - - this is what must have a `cursor()` method that returns a cursor! - - on error: - - set connection.state to `'fail'` - - set connection.handle to `None` - - raise a `dbt.exceptions.FailedToConnectException` with the error and any other relevant information - -For example: - - - -```python - @classmethod - def open(cls, connection): - if connection.state == 'open': - logger.debug('Connection is already open, skipping open.') - return connection - - credentials = connection.credentials - - try: - handle = myadapter_library.connect( - host=credentials.host, - port=credentials.port, - username=credentials.username, - password=credentials.password, - catalog=credentials.database - ) - connection.state = 'open' - connection.handle = handle - return connection -``` - - - -##### `get_response(cls, cursor)` - -`get_response` is a classmethod that gets a cursor object and returns adapter-specific information about the last executed command. 
The return value should be an `AdapterResponse` object that includes items such as `code`, `rows_affected`, `bytes_processed`, and a summary `_message` for logging to stdout.
-
-```python
-    @classmethod
-    def get_response(cls, cursor) -> AdapterResponse:
-        code = cursor.sqlstate or "OK"
-        rows = cursor.rowcount
-        status_message = f"{code} {rows}"
-        return AdapterResponse(
-            _message=status_message,
-            code=code,
-            rows_affected=rows
-        )
-```
-
-##### `cancel(self, connection)`
-
-`cancel` is an instance method that gets a connection object and attempts to cancel any ongoing queries, which is database dependent. Some databases don't support the concept of cancellation; adapters for those databases can simply implement `cancel` as a `pass`, and their adapter classes should implement an `is_cancelable` that returns `False` (on Ctrl+C, connections may remain running). This method must be implemented carefully, as the affected connection will likely be in use in a different thread.
-
-```python
-    def cancel(self, connection):
-        tid = connection.handle.transaction_id()
-        sql = 'select cancel_transaction({})'.format(tid)
-        logger.debug("Cancelling query '{}' ({})".format(connection.name, tid))
-        _, cursor = self.add_query(sql, 'master')
-        res = cursor.fetchone()
-        logger.debug("Canceled query '{}': {}".format(connection.name, res))
-```
-
-##### `exception_handler(self, sql, connection_name='master')`
-
-`exception_handler` is an instance method that returns a context manager that will handle exceptions raised by running queries, catch them, log appropriately, and then raise exceptions dbt knows how to handle.
-
-If you use the (highly recommended) `@contextmanager` decorator, you only have to wrap a `yield` inside a `try` block, like so:
-
-```python
-    @contextmanager
-    def exception_handler(self, sql: str):
-        try:
-            yield
-        except myadapter_library.DatabaseError as exc:
-            self.release()
-            logger.debug('myadapter error: {}'.format(str(exc)))
-            raise dbt.exceptions.DatabaseException(str(exc))
-        except Exception as exc:
-            logger.debug("Error running SQL: {}".format(sql))
-            logger.debug("Rolling back transaction.")
-            self.release()
-            raise dbt.exceptions.RuntimeException(str(exc))
-```
-
-##### `standardize_grants_dict(self, grants_table: agate.Table) -> dict`
-
-`standardize_grants_dict` is a method that returns the dbt-standardized grants dictionary that matches how users configure grants in dbt. The input is the result of a `SHOW GRANTS ON {{model}}` call, loaded into an agate table.
-
-If any massaging of the agate table containing the results of `SHOW GRANTS ON {{model}}` can't easily be accomplished in SQL, it can be done here. For example, the SQL to show grants *should* filter OUT any grants TO the current user/role (e.g. OWNERSHIP). If that's not possible in SQL, it can be done in this method instead.
-
-```python
-    @available
-    def standardize_grants_dict(self, grants_table: agate.Table) -> dict:
-        """
-        :param grants_table: An agate table containing the query result of
-            the SQL returned by get_show_grant_sql
-        :return: A standardized dictionary matching the `grants` config
-        :rtype: dict
-        """
-        grants_dict: Dict[str, List[str]] = {}
-        for row in grants_table:
-            grantee = row["grantee"]
-            privilege = row["privilege_type"]
-            if privilege in grants_dict.keys():
-                grants_dict[privilege].append(grantee)
-            else:
-                grants_dict.update({privilege: [grantee]})
-        return grants_dict
-```
-
-### Editing the adapter implementation
-
-Edit the adapter implementation at `myadapter/dbt/adapters/myadapter/impl.py`.
-
-Very little is required to implement the adapter itself. On some adapters, you will not need to override anything. On others, you'll likely need to override some of the `convert_*` classmethods, or override the `is_cancelable` classmethod to return `False`.
-
-#### `datenow()`
-
-This classmethod provides the adapter's canonical date function. It is not actually used, but it is required anyway on all adapters.
-
-```python
-    @classmethod
-    def date_function(cls):
-        return 'datenow()'
-```
-
-### Editing SQL logic
-
-dbt implements specific SQL operations using Jinja macros. While reasonable defaults are provided for many such operations (like `create_schema`, `drop_schema`, `create_table`, etc.), you may need to override one or more of these macros when building a new adapter.
-
-#### Required macros
-
-The following macros must be implemented, but you can override their behavior for your adapter using the "dispatch" pattern described below. Macros marked (required) do not have a valid default implementation, and are required for dbt to operate.
- -- `alter_column_type` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/columns.sql#L37-L55)) -- `check_schema_exists` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/metadata.sql#L43-L55)) -- `create_schema` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/schema.sql#L1-L9)) -- `drop_relation` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/relation.sql#L34-L42)) -- `drop_schema` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/schema.sql#L12-L20)) -- `get_columns_in_relation` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/columns.sql#L1-L8)) (required) -- `list_relations_without_caching` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/metadata.sql#L58-L65)) (required) -- `list_schemas` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/metadata.sql#L29-L40)) -- `rename_relation` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/relation.sql#L56-L65)) -- `truncate_relation` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/relation.sql#L45-L53)) -- `current_timestamp` ([source](https://github.com/dbt-labs/dbt-core/blob/f988f76fccc1878aaf8d8631c05be3e9104b3b9a/core/dbt/include/global_project/macros/adapters/freshness.sql#L1-L8)) (required) -- `copy_grants` - -#### Adapter dispatch - -Most modern databases support a majority of the standard SQL spec. There are some databases that _do not_ support critical aspects of the SQL spec however, or they provide their own nonstandard mechanisms for implementing the same functionality. To account for these variations in SQL support, dbt provides a mechanism called [multiple dispatch](https://en.wikipedia.org/wiki/Multiple_dispatch) for macros. With this feature, macros can be overridden for specific adapters. This makes it possible to implement high-level methods (like "create ") in a database-specific way. - - - -```jinja2 - -{# dbt will call this macro by name, providing any arguments #} -{% macro create_table_as(temporary, relation, sql) -%} - - {# dbt will dispatch the macro call to the relevant macro #} - {{ return( - adapter.dispatch('create_table_as')(temporary, relation, sql) - ) }} -{%- endmacro %} - - - -{# If no macro matches the specified adapter, "default" will be used #} -{% macro default__create_table_as(temporary, relation, sql) -%} - ... -{%- endmacro %} - - - -{# Example which defines special logic for Redshift #} -{% macro redshift__create_table_as(temporary, relation, sql) -%} - ... -{%- endmacro %} - - - -{# Example which defines special logic for BigQuery #} -{% macro bigquery__create_table_as(temporary, relation, sql) -%} - ... 
-{%- endmacro %} -``` - - - -The `adapter.dispatch()` macro takes a second argument, `packages`, which represents a set of "search namespaces" in which to find potential implementations of a dispatched macro. This allows users of community-supported adapters to extend or "shim" dispatched macros from common packages, such as `dbt-utils`, with adapter-specific versions in their own project or other installed packages. See: -- "Shim" package examples: [`spark-utils`](https://github.com/dbt-labs/spark-utils), [`tsql-utils`](https://github.com/dbt-msft/tsql-utils) -- [`adapter.dispatch` docs](/reference/dbt-jinja-functions/dispatch) - -#### Overriding adapter methods - -While much of dbt's adapter-specific functionality can be modified in adapter macros, it can also make sense to override adapter methods directly. In this example, assume that a database does not support a `cascade` parameter to `drop schema`. Instead, we can implement an approximation where we drop each relation and then drop the schema. - - - -```python - def drop_schema(self, relation: BaseRelation): - relations = self.list_relations( - database=relation.database, - schema=relation.schema - ) - for relation in relations: - self.drop_relation(relation) - super().drop_schema(relation) -``` - - - -#### Grants Macros - -See [this GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/5468) for information on the macros required for `GRANT` statements: -### Other files - -#### `profile_template.yml` - -In order to enable the [`dbt init` command](/reference/commands/init) to prompt users when setting up a new project and connection profile, you should include a **profile template**. The filepath needs to be `dbt/include//profile_template.yml`. It's possible to provide hints, default values, and conditional prompts based on connection methods that require different supporting attributes. Users will also be able to include custom versions of this file in their own projects, with fixed values specific to their organization, to support their colleagues when using your dbt adapter for the first time. - -See examples: -- [dbt-postgres](https://github.com/dbt-labs/dbt-core/blob/main/plugins/postgres/dbt/include/postgres/profile_template.yml) -- [dbt-redshift](https://github.com/dbt-labs/dbt-redshift/blob/main/dbt/include/redshift/profile_template.yml) -- [dbt-snowflake](https://github.com/dbt-labs/dbt-snowflake/blob/main/dbt/include/snowflake/profile_template.yml) -- [dbt-bigquery](https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/include/bigquery/profile_template.yml) - -#### `__version__.py` - -To assure that `dbt --version` provides the latest dbt core version the adapter supports, be sure include a `__version__.py` file. The filepath will be `dbt/adapters//__version__.py`. We recommend using the latest dbt core version and as the adapter is made compatible with later versions, this file will need to be updated. For a sample file, check out this [example](https://github.com/dbt-labs/dbt-snowflake/blob/main/dbt/adapters/snowflake/__version__.py). - -It should be noted that both of these files are included in the bootstrapped output of the `dbt-database-adapter-scaffold` so when using the scaffolding, these files will be included. 
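For reference, the `__version__.py` file itself is tiny. Here is a minimal sketch; the version number is a placeholder, and the note about `setup.py` reading this value is a common convention rather than a requirement:

```python
# dbt/adapters/myadapter/__version__.py
# Single source of the plugin's version string, reported alongside `dbt --version`
# (and typically also read by setup.py so the two never drift apart).
version = "1.1.0"  # placeholder; keep in step with the dbt-core minor version you support
```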
- -## Testing your new adapter - -This has moved to its own page: ["Testing a new adapter"](4-testing-a-new-adapter) - -## Documenting your new adapter - -This has moved to its own page: ["Documenting a new adapter"](/guides/dbt-ecosystem/adapter-development/5-documenting-a-new-adapter) - -## Maintaining your new adapter - -This has moved to a new spot: ["Maintaining your new adapter"](2-prerequisites-for-a-new-adapter##maintaining-your-new-adapter) diff --git a/website/docs/guides/dbt-ecosystem/adapter-development/4-testing-a-new-adapter.md b/website/docs/guides/dbt-ecosystem/adapter-development/4-testing-a-new-adapter.md deleted file mode 100644 index b1b5072670a..00000000000 --- a/website/docs/guides/dbt-ecosystem/adapter-development/4-testing-a-new-adapter.md +++ /dev/null @@ -1,499 +0,0 @@ ---- -title: "Testing a new adapter" -id: "4-testing-a-new-adapter" ---- - -:::info - -Previously, we offered a packaged suite of tests for dbt adapter functionality: [`pytest-dbt-adapter`](https://github.com/dbt-labs/dbt-adapter-tests). We are deprecating that suite, in favor of the newer testing framework outlined in this document. - -::: - -This document has two sections: - -1. "[About the testing framework](#about-the-testing-framework)" describes the standard framework that we maintain for using pytest together with dbt. It includes an example that shows the anatomy of a simple test case. -2. "[Testing your adapter](#testing-your-adapter)" offers a step-by-step guide for using our out-of-the-box suite of "basic" tests, which will validate that your adapter meets a baseline of dbt functionality. - -## Prerequisites - -- Your adapter must be compatible with dbt-core **v1.1** or newer -- You should be familiar with **pytest**: https://docs.pytest.org/ - -## About the testing framework - -dbt-core offers a standard framework for running pre-built functional tests, and for defining your own tests. The core testing framework is built using `pytest`, a mature and standard library for testing Python projects. - -The **[`tests` module](https://github.com/dbt-labs/dbt-core/tree/HEAD/core/dbt/tests)** within `dbt-core` includes basic utilities for setting up pytest + dbt. These are used by all "pre-built" functional tests, and make it possible to quickly write your own tests. - -Those utilities allow you to do three basic things: -1. **Quickly set up a dbt "project."** Define project resources via methods such as `models()` and `seeds()`. Use `project_config_update()` to pass configurations into `dbt_project.yml`. -2. **Define a sequence of dbt commands.** The most important utility is `run_dbt()`, which returns the [results](/reference/dbt-classes#result-objects) of each dbt command. It takes a list of CLI specifiers (subcommand + flags), as well as an optional second argument, `expect_pass=False`, for cases where you expect the command to fail. -3. **Validate the results of those dbt commands.** For example, `check_relations_equal()` asserts that two database objects have the same structure and content. You can also write your own `assert` statements, by inspecting the results of a dbt command, or querying arbitrary database objects with `project.run_sql()`. - -You can see the full suite of utilities, with arguments and annotations, in [`util.py`](https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/tests/util.py). You'll also see them crop up across a number of test cases. While all utilities are intended to be reusable, you won't need all of them for every test. 
In the example below, we'll show a simple test case that uses only a few utilities. - -### Example: a simple test case - -This example will show you the anatomy of a test case using dbt + pytest. We will create reusable components, combine them to form a dbt "project", and define a sequence of dbt commands. Then, we'll use Python `assert` statements to ensure those commands succeed (or fail) as we expect. - -In ["Getting started running basic tests,"](#getting-started-running-basic-tests) we'll offer step-by-step instructions for installing and configuring `pytest`, so that you can run it on your own machine. For now, it's more important to see how the pieces of a test case fit together. - -This example includes a seed, a model, and two tests—one of which will fail. - -1. Define Python strings that will represent the file contents in your dbt project. Defining these in a separate file enables you to reuse the same components across different test cases. The pytest name for this type of reusable component is "fixture." - - - -```python -# seeds/my_seed.csv -my_seed_csv = """ -id,name,some_date -1,Easton,1981-05-20T06:46:51 -2,Lillian,1978-09-03T18:10:33 -3,Jeremiah,1982-03-11T03:59:51 -4,Nolan,1976-05-06T20:21:35 -""".lstrip() - -# models/my_model.sql -my_model_sql = """ -select * from {{ ref('my_seed') }} -union all -select null as id, null as name, null as some_date -""" - -# models/my_model.yml -my_model_yml = """ -version: 2 -models: - - name: my_model - columns: - - name: id - tests: - - unique - - not_null # this test will fail -""" -``` - - - -2. Use the "fixtures" to define the project for your test case. These fixtures are always scoped to the **class**, where the class represents one test case—that is, one dbt project or scenario. (The same test case can be used for one or more actual tests, which we'll see in step 3.) Following the default pytest configurations, the file name must begin with `test_`, and the class name must begin with `Test`. - - - -```python -import pytest -from dbt.tests.util import run_dbt - -# our file contents -from tests.functional.example.fixtures import ( - my_seed_csv, - my_model_sql, - my_model_yml, -) - -# class must begin with 'Test' -class TestExample: - """ - Methods in this class will be of two types: - 1. Fixtures defining the dbt "project" for this test case. - These are scoped to the class, and reused for all tests in the class. - 2. Actual tests, whose names begin with 'test_'. - These define sequences of dbt commands and 'assert' statements. - """ - - # configuration in dbt_project.yml - @pytest.fixture(scope="class") - def project_config_update(self): - return { - "name": "example", - "models": {"+materialized": "view"} - } - - # everything that goes in the "seeds" directory - @pytest.fixture(scope="class") - def seeds(self): - return { - "my_seed.csv": my_seed_csv, - } - - # everything that goes in the "models" directory - @pytest.fixture(scope="class") - def models(self): - return { - "my_model.sql": my_model_sql, - "my_model.yml": my_model_yml, - } - - # continues below -``` - - - -3. Now that we've set up our project, it's time to define a sequence of dbt commands and assertions. We define one or more methods in the same file, on the same class (`TestExampleFailingTest`), whose names begin with `test_`. These methods share the same setup (project scenario) from above, but they can be run independently by pytest—so they shouldn't depend on each other in any way. 
- - - -```python - # continued from above - - # The actual sequence of dbt commands and assertions - # pytest will take care of all "setup" + "teardown" - def test_run_seed_test(self, project): - """ - Seed, then run, then test. We expect one of the tests to fail - An alternative pattern is to use pytest "xfail" (see below) - """ - # seed seeds - results = run_dbt(["seed"]) - assert len(results) == 1 - # run models - results = run_dbt(["run"]) - assert len(results) == 1 - # test tests - results = run_dbt(["test"], expect_pass = False) # expect failing test - assert len(results) == 2 - # validate that the results include one pass and one failure - result_statuses = sorted(r.status for r in results) - assert result_statuses == ["fail", "pass"] - - @pytest.mark.xfail - def test_build(self, project): - """Expect a failing test""" - # do it all - results = run_dbt(["build"]) -``` - - - -3. Our test is ready to run! The last step is to invoke `pytest` from your command line. We'll walk through the actual setup and configuration of `pytest` in the next section. - - - -```sh -$ python3 -m pytest tests/functional/test_example.py -=========================== test session starts ============================ -platform ... -- Python ..., pytest-..., pluggy-... -rootdir: ... -plugins: ... - -tests/functional/test_example.py .X [100%] - -======================= 1 passed, 1 xpassed in 1.38s ======================= -``` - - - -You can find more ways to run tests, along with a full command reference, in the [pytest usage docs](https://docs.pytest.org/how-to/usage.html). - -We've found the `-s` flag (or `--capture=no`) helpful to print logs from the underlying dbt invocations, and to step into an interactive debugger if you've added one. You can also use environment variables to set [global dbt configs](/reference/global-configs/about-global-configs), such as `DBT_DEBUG` (to show debug-level logs). - -## Testing your adapter - -Anyone who installs `dbt-core`, and wishes to define their own test cases, can use the framework presented in the first section. The framework is especially useful for testing standard dbt behavior across different databases. - -To that end, we have built and made available a [package of reusable adapter test cases](https://github.com/dbt-labs/dbt-core/tree/HEAD/tests/adapter), for creators and maintainers of adapter plugins. These test cases cover basic expected functionality, as well as functionality that frequently requires different implementations across databases. - -For the time being, this package is also located within the `dbt-core` repository, but separate from the `dbt-core` Python package. - -### Categories of tests - -In the course of creating and maintaining your adapter, it's likely that you will end up implementing tests that fall into three broad categories: - -1. **Basic tests** that every adapter plugin is expected to pass. These are defined in `tests.adapter.basic`. Given differences across data platforms, these may require slight modification or reimplementation. Significantly overriding or disabling these tests should be with good reason, since each represents basic functionality expected by dbt users. For example, if your adapter does not support incremental models, you should disable the test, [by marking it with `skip` or `xfail`](https://docs.pytest.org/en/latest/how-to/skipping.html), as well as noting that limitation in any documentation, READMEs, and usage guides that accompany your adapter. - -2. 
**Optional tests**, for second-order functionality that is common across plugins, but not required for basic use. Your plugin can opt into these test cases by inheriting existing ones, or reimplementing them with adjustments. For now, this category includes all tests located outside the `basic` subdirectory. More tests will be added as we convert older tests defined on dbt-core and mature plugins to use the standard framework. - -3. **Custom tests**, for behavior that is specific to your adapter / data platform. Each has its own specialties and idiosyncracies. We encourage you to use the same `pytest`-based framework, utilities, and fixtures to write your own custom tests for functionality that is unique to your adapter. - -If you run into an issue with the core framework, or the basic/optional test cases—or if you've written a custom test that you believe would be relevant and useful for other adapter plugin developers—please open an issue or PR in the `dbt-core` repository on GitHub. - -## Getting started running basic tests - -In this section, we'll walk through the three steps to start running our basic test cases on your adapter plugin: - -1. Install dependencies -2. Set up and configure pytest -3. Define test cases - -### Install dependencies - -You should already have a virtual environment with `dbt-core` and your adapter plugin installed. You'll also need to install: -- [`pytest`](https://pypi.org/project/pytest/) -- [`dbt-tests-adapter`](https://pypi.org/project/dbt-tests-adapter/), the set of common test cases -- (optional) [`pytest` plugins](https://docs.pytest.org/en/7.0.x/reference/plugin_list.html)--we'll use `pytest-dotenv` below - -Or specify all dependencies in a requirements file like: - - -```txt -pytest -pytest-dotenv -dbt-tests-adapter -``` - - -```sh -pip install -r dev_requirements.txt -``` - -### Set up and configure pytest - -First, set yourself up to run `pytest` by creating a file named `pytest.ini` at the root of your repository: - - - -```python -[pytest] -filterwarnings = - ignore:.*'soft_unicode' has been renamed to 'soft_str'*:DeprecationWarning - ignore:unclosed file .*:ResourceWarning -env_files = - test.env # uses pytest-dotenv plugin - # this allows you to store env vars for database connection in a file named test.env - # rather than passing them in every CLI command, or setting in `PYTEST_ADDOPTS` - # be sure to add "test.env" to .gitignore as well! -testpaths = - tests/functional # name per convention -``` - - - -Then, create a configuration file within your tests directory. In it, you'll want to define all necessary profile configuration for connecting to your data platform in local development and continuous integration. We recommend setting these values with environment variables, since this file will be checked into version control. - - - -```python -import pytest -import os - -# Import the standard functional fixtures as a plugin -# Note: fixtures with session scope need to be local -pytest_plugins = ["dbt.tests.fixtures.project"] - -# The profile dictionary, used to write out profiles.yml -# dbt will supply a unique schema per test, so we do not specify 'schema' here -@pytest.fixture(scope="class") -def dbt_profile_target(): - return { - 'type': '', - 'threads': 1, - 'host': os.getenv('HOST_ENV_VAR_NAME'), - 'user': os.getenv('USER_ENV_VAR_NAME'), - ... - } -``` - - - -### Define test cases - -As in the example above, each test case is defined as a class, and has its own "project" setup. 
To get started, you can import all basic test cases and try running them without changes. - - - -```python -import pytest - -from dbt.tests.adapter.basic.test_base import BaseSimpleMaterializations -from dbt.tests.adapter.basic.test_singular_tests import BaseSingularTests -from dbt.tests.adapter.basic.test_singular_tests_ephemeral import BaseSingularTestsEphemeral -from dbt.tests.adapter.basic.test_empty import BaseEmpty -from dbt.tests.adapter.basic.test_ephemeral import BaseEphemeral -from dbt.tests.adapter.basic.test_incremental import BaseIncremental -from dbt.tests.adapter.basic.test_generic_tests import BaseGenericTests -from dbt.tests.adapter.basic.test_snapshot_check_cols import BaseSnapshotCheckCols -from dbt.tests.adapter.basic.test_snapshot_timestamp import BaseSnapshotTimestamp -from dbt.tests.adapter.basic.test_adapter_methods import BaseAdapterMethod - -class TestSimpleMaterializationsMyAdapter(BaseSimpleMaterializations): - pass - - -class TestSingularTestsMyAdapter(BaseSingularTests): - pass - - -class TestSingularTestsEphemeralMyAdapter(BaseSingularTestsEphemeral): - pass - - -class TestEmptyMyAdapter(BaseEmpty): - pass - - -class TestEphemeralMyAdapter(BaseEphemeral): - pass - - -class TestIncrementalMyAdapter(BaseIncremental): - pass - - -class TestGenericTestsMyAdapter(BaseGenericTests): - pass - - -class TestSnapshotCheckColsMyAdapter(BaseSnapshotCheckCols): - pass - - -class TestSnapshotTimestampMyAdapter(BaseSnapshotTimestamp): - pass - - -class TestBaseAdapterMethod(BaseAdapterMethod): - pass -``` - - - - -Finally, run pytest: -```sh -python3 -m pytest tests/functional -``` - -### Modifying test cases - -You may need to make slight modifications in a specific test case to get it passing on your adapter. The mechanism to do this is simple: rather than simply inheriting the "base" test with `pass`, you can redefine any of its fixtures or test methods. - -For instance, on Redshift, we need to explicitly cast a column in the fixture input seed to use data type `varchar(64)`: - - - -```python -import pytest -from dbt.tests.adapter.basic.files import seeds_base_csv, seeds_added_csv, seeds_newcolumns_csv -from dbt.tests.adapter.basic.test_snapshot_check_cols import BaseSnapshotCheckCols - -# set the datatype of the name column in the 'added' seed so it -# can hold the '_update' that's added -schema_seed_added_yml = """ -version: 2 -seeds: - - name: added - config: - column_types: - name: varchar(64) -""" - -class TestSnapshotCheckColsRedshift(BaseSnapshotCheckCols): - # Redshift defines the 'name' column such that it's not big enough - # to hold the '_update' added in the test. - @pytest.fixture(scope="class") - def models(self): - return { - "base.csv": seeds_base_csv, - "added.csv": seeds_added_csv, - "seeds.yml": schema_seed_added_yml, - } -``` - - - -As another example, the `dbt-bigquery` adapter asks users to "authorize" replacing a with a by supplying the `--full-refresh` flag. The reason: In the table logic, a view by the same name must first be dropped; if the table query fails, the model will be missing. - -Knowing this possibility, the "base" test case offers a `require_full_refresh` switch on the `test_config` fixture class. 
For BigQuery, we'll switch it on:
-
-```python
-import pytest
-from dbt.tests.adapter.basic.test_base import BaseSimpleMaterializations
-
-class TestSimpleMaterializationsBigQuery(BaseSimpleMaterializations):
-    @pytest.fixture(scope="class")
-    def test_config(self):
-        # effect: add '--full-refresh' flag in requisite 'dbt run' step
-        return {"require_full_refresh": True}
-```
-
-It's always worth asking whether the required modifications represent gaps in perceived or expected dbt functionality. Are these simple implementation details, which any user of this database would understand? Are they limitations worth documenting?
-
-If, on the other hand, they represent poor assumptions in the "basic" test cases, which fail to account for a common pattern in other types of databases, please open an issue or PR in the `dbt-core` repository on GitHub.
-
-### Running with multiple profiles
-
-Some databases support multiple connection methods, which map to genuinely different functionality behind the scenes. For instance, the `dbt-spark` adapter supports connections to Apache Spark clusters _and_ Databricks runtimes, which support additional functionality out of the box, enabled by the Delta file format.
-
-```python
-import os
-
-import pytest
-
-
-def pytest_addoption(parser):
-    parser.addoption("--profile", action="store", default="apache_spark", type=str)
-
-
-# Tests marked with @pytest.mark.skip_profile('apache_spark') use the
-# 'skip_by_profile_type' autouse fixture below
-def pytest_configure(config):
-    config.addinivalue_line(
-        "markers",
-        "skip_profile(profile): skip test for the given profile",
-    )
-
-
-@pytest.fixture(scope="session")
-def dbt_profile_target(request):
-    profile_type = request.config.getoption("--profile")
-    if profile_type == "databricks_sql_endpoint":
-        target = databricks_sql_endpoint_target()
-    elif profile_type == "apache_spark":
-        target = apache_spark_target()
-    else:
-        raise ValueError(f"Invalid profile type '{profile_type}'")
-    return target
-
-
-def apache_spark_target():
-    return {
-        "type": "spark",
-        "host": "localhost",
-        ...
-    }
-
-
-def databricks_sql_endpoint_target():
-    return {
-        "type": "spark",
-        "host": os.getenv("DBT_DATABRICKS_HOST_NAME"),
-        ...
-    }
-
-
-@pytest.fixture(autouse=True)
-def skip_by_profile_type(request):
-    profile_type = request.config.getoption("--profile")
-    if request.node.get_closest_marker("skip_profile"):
-        for skip_profile_type in request.node.get_closest_marker("skip_profile").args:
-            if skip_profile_type == profile_type:
-                pytest.skip(f"skipped on '{profile_type}' profile")
-```
-
-If there are tests that _shouldn't_ run for a given profile:
-
-```python
-# Snapshots require access to the Delta file format, available on our Databricks connection,
-# so let's skip on Apache Spark
-@pytest.mark.skip_profile('apache_spark')
-class TestSnapshotCheckColsSpark(BaseSnapshotCheckCols):
-    @pytest.fixture(scope="class")
-    def project_config_update(self):
-        return {
-            "seeds": {
-                "+file_format": "delta",
-            },
-            "snapshots": {
-                "+file_format": "delta",
-            }
-        }
-```
-
-Finally:
-```sh
-python3 -m pytest tests/functional --profile apache_spark
-python3 -m pytest tests/functional --profile databricks_sql_endpoint
-```
diff --git a/website/docs/guides/dbt-ecosystem/adapter-development/5-documenting-a-new-adapter.md b/website/docs/guides/dbt-ecosystem/adapter-development/5-documenting-a-new-adapter.md
deleted file mode 100644
index 80b994aefb0..00000000000
--- a/website/docs/guides/dbt-ecosystem/adapter-development/5-documenting-a-new-adapter.md
+++ /dev/null
@@ -1,60 +0,0 @@
----
-title: "Documenting a new adapter"
-id: "5-documenting-a-new-adapter"
----
-
-If you've already [built](3-building-a-new-adapter) and [tested](4-testing-a-new-adapter) your adapter, it's time to document it so the dbt community will know that it exists and how to use it.
-
-## Making your adapter available
-
-Many community members maintain their adapter plugins under open source licenses. If you're interested in doing this, we recommend:
-
-- Hosting on a public git provider (for example, GitHub or GitLab)
-- Publishing to [PyPI](https://pypi.org/)
-- Adding to the list of ["Supported Data Platforms"](/docs/supported-data-platforms#community-supported) (more info below)
-
-## General Guidelines
-
-To best inform the dbt community of the new adapter, you should contribute to dbt's open-source documentation site, which uses the [Docusaurus project](https://docusaurus.io/). This is the site you're currently on!
-
-### Conventions
-
-Each `.md` file you create needs a header as shown below. The document id will also need to be added to the config file: `website/sidebars.js`.
-
-```md
----
-title: "Documenting a new adapter"
-id: "documenting-a-new-adapter"
----
-```
-
-### Single Source of Truth
-
-We ask our adapter maintainers to use the [docs.getdbt.com repo](https://github.com/dbt-labs/docs.getdbt.com) (i.e. this site) as the single source of truth for documentation rather than having to maintain the same set of information in three different places. The adapter repo's `README.md` and the data platform's documentation pages should simply link to the corresponding page on this docs site. Keep reading for more information on what should and shouldn't be included on the dbt docs site.
-
-### Assumed Knowledge
-
-To simplify things, assume the reader of this documentation already knows how both dbt and your data platform work. There's already great material out there for learning dbt and the data platform. The documentation we're asking you to add should be what a user who is already proficient in both dbt and your data platform would need to know in order to use both.
Effectively that boils down to two things: how to connect, and how to configure. - -## Topics and Pages to Cover - -The following subjects need to be addressed across three pages of this docs site to have your data platform be listed on our documentation. After the corresponding pull request is merged, we ask that you link to these pages from your adapter repo's `REAMDE` as well as from your product documentation. - - To contribute, all you will have to do make the changes listed in the table below. - -| How To... | File to change within `/website/docs/` | Action | Info to Include | -|----------------------|--------------------------------------------------------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| Connect | `/docs/core/connect-data-platform/{MY-DATA-PLATFORM}-setup.md` | Create | Give all information needed to define a target in `~/.dbt/profiles.yml` and get `dbt debug` to connect to the database successfully. All possible configurations should be mentioned. | -| Configure | `reference/resource-configs/{MY-DATA-PLATFORM}-configs.md` | Create | What options and configuration specific to your data platform do users need to know? e.g. table distribution and indexing options, column_quoting policy, which incremental strategies are supported | -| Discover and Install | `docs/supported-data-platforms.md` | Modify | Is it a vendor- or community- supported adapter? How to install Python adapter package? Ideally with pip and PyPI hosted package, but can also use `git+` link to GitHub Repo | -| Add link to sidebar | `website/sidebars.js` | Modify | Add the document id to the correct location in the sidebar menu | - -For example say I want to document my new adapter: `dbt-ders`. For the "Connect" page, I will make a new Markdown file, `ders-setup.md` and add it to the `/website/docs/core/connect-data-platform/` directory. - -## Example PRs to add new adapter documentation - -Below are some recent pull requests made by partners to document their data platform's adapter: - -- [TiDB](https://github.com/dbt-labs/docs.getdbt.com/pull/1309) -- [SingleStore](https://github.com/dbt-labs/docs.getdbt.com/pull/1044) -- [Firebolt](https://github.com/dbt-labs/docs.getdbt.com/pull/941) diff --git a/website/docs/guides/dbt-ecosystem/adapter-development/6-promoting-a-new-adapter.md b/website/docs/guides/dbt-ecosystem/adapter-development/6-promoting-a-new-adapter.md deleted file mode 100644 index 9bf2f949bef..00000000000 --- a/website/docs/guides/dbt-ecosystem/adapter-development/6-promoting-a-new-adapter.md +++ /dev/null @@ -1,120 +0,0 @@ ---- -title: "Promoting a new adapter" -id: "6-promoting-a-new-adapter" ---- - -## Model for engagement in the dbt community - -The most important thing here is recognizing that people are successful in the community when they join, first and foremost, to engage authentically. - -What does authentic engagement look like? It’s challenging to define explicit rules. One good rule of thumb is to treat people with dignity and respect. - -Contributors to the community should think of contribution *as the end itself,* not a means toward other business KPIs (leads, community members, etc.). 
[We are a mission-driven company.](https://www.getdbt.com/dbt-labs/values/) Some ways to know if you’re authentically engaging: - -- Is an engagement’s *primary* purpose of sharing knowledge and resources or building brand engagement? -- Imagine you didn’t work at the org you do — can you imagine yourself still writing this? -- Is it written in formal / marketing language, or does it sound like you, the human? - -## Who should join the dbt community slack - -### People who have insight into what it means to do hands-on [analytics engineering](https://www.getdbt.com/analytics-engineering/) work - -The dbt Community Slack workspace is fundamentally a place for analytics practitioners to interact with each other — the closer the users are in the community to actual data/analytics engineering work, the more natural their engagement will be (leading to better outcomes for partners and the community). - -### DevRel practitioners with strong focus - -DevRel practitioners often have a strong analytics background and a good understanding of the community. It’s essential to be sure they are focused on *contributing,* not on driving community metrics for partner org (such as signing people up for their slack or events). The metrics will rise naturally through authentic engagement. - -### Founder and executives who are interested in directly engaging with the community - -This is either incredibly successful or not at all depending on the profile of the founder. Typically, this works best when the founder has a practitioner-level of technical understanding and is interested in joining not to promote, but to learn and hear from users. - -### Software Engineers at partner products that are building and supporting integrations with either dbt Core or dbt Cloud - -This is successful when the engineers are familiar with dbt as a product or at least have taken our training course. The Slack is often a place where end-user questions and feedback is initially shared, so it is recommended that someone technical from the team be present. There are also a handful of channels aimed at those building integrations, which tend to be a font of knowledge. - -### Who might struggle in the dbt community -#### People in marketing roles -dbt Slack is not a marketing channel. Attempts to use it as such invariably fall flat and can even lead to people having a negative view of a product. This doesn’t mean that dbt can’t serve marketing objectives, but a long-term commitment to engagement is the only proven method to do this sustainably. - -#### People in product roles -The dbt Community can be an invaluable source of feedback on a product. There are two primary ways this can happen — organically (community members proactively suggesting a new feature) and via direct calls for feedback and user research. Immediate calls for engagement must be done in your dedicated #tools channel. Direct calls should be used sparingly, as they can overwhelm more organic discussions and feedback. - -## Who is the audience for an adapter release - -A new adapter is likely to drive huge community interest from several groups of people: -- People who are currently using the database that the adapter is supporting -- People who may be adopting the database in the near future. -- People who are interested in dbt development in general. - -The database users will be your primary audience and the most helpful in achieving success. Engage them directly in the adapter’s dedicated Slack channel. 
If one does not exist already, reach out in #channel-requests, and we will get one made for you and include it in an announcement about new channels. - -The final group is where non-slack community engagement becomes important. Twitter and LinkedIn are both great places to interact with a broad audience. A well-orchestrated adapter release can generate impactful and authentic engagement. - -## How to message the initial rollout and follow-up content - -Tell a story that engages dbt users and the community. Highlight new use cases and functionality unlocked by the adapter in a way that will resonate with each segment. - -### Existing users of your technology who are new to dbt - -- Provide a general overview of the value dbt will deliver to your users. This can lean on dbt's messaging and talking points which are laid out in the [dbt viewpoint.](/community/resources/viewpoint) - - Give examples of a rollout that speaks to the overall value of dbt and your product. - -### Users who are already familiar with dbt and the community -- Consider unique use cases or advantages your adapter provide over existing adapters. Who will be excited for this? -- Contribute to the dbt Community and ensure that dbt users on your adapter are well supported (tutorial content, packages, documentation, etc). -- Example of a rollout that is compelling for those familiar with dbt: [Firebolt](https://www.linkedin.com/feed/update/urn:li:activity:6879090752459182080/) - -## Tactically manage distribution of content about new or existing adapters - -There are tactical pieces on how and where to share that help ensure success. - -### On slack: -- #i-made-this channel — this channel has a policy against “marketing” and “content marketing” posts, but it should be successful if you write your content with the above guidelines in mind. Even with that, it’s important to post here sparingly. -- Your own database / tool channel — this is where the people who have opted in to receive communications from you and always a great place to share things that are relevant to them. - -### On social media: -- Twitter -- LinkedIn -- Social media posts *from the author* or an individual connected to the project tend to have better engagement than posts from a company or organization account. -- Ask your partner representative about: - - Retweets and shares from the official dbt Labs accounts. - - Flagging posts internally at dbt Labs to get individual employees to share. - -## Measuring engagement - -You don’t need 1000 people in a channel to succeed, but you need at least a few active participants who can make it feel lived in. If you’re comfortable working in public, this could be members of your team, or it can be a few people who you know that are highly engaged and would be interested in participating. Having even 2 or 3 regulars hanging out in a channel is all that’s needed for a successful start and is, in fact, much more impactful than 250 people that never post. - -## How to announce a new adapter - -We’d recommend *against* boilerplate announcements and encourage finding a unique voice. That being said, there are a couple of things that we’d want to include: - -- A summary of the value prop of your database / technology for users who aren’t familiar. -- The personas that might be interested in this news. -- A description of what the adapter *is*. 
For example:
-    > With the release of our new dbt adapter, you'll be able to use dbt to model and transform your data in [name-of-your-org]
-- Particular or unique use cases or functionality unlocked by the adapter.
-- Plans for future / ongoing support / development.
-- The link to the documentation for using the adapter on the dbt Labs docs site.
-- An announcement blog.
-
-## Announcing new release versions of existing adapters
-
-This can vary substantially depending on the nature of the release, but a good baseline is the types of release messages that [we put out in the #dbt-releases](https://getdbt.slack.com/archives/C37J8BQEL/p1651242161526509) channel.
-
-![Full Release Post](/img/adapter-guide/0-full-release-notes.png)
-
-Breaking this down:
-
-- Visually distinctive announcement: make it clear this is a release
-- Short written description of what is in the release
-- Links to additional resources
-- Implementation instructions
-- Future plans
-- Contributor recognition (if applicable)
-
diff --git a/website/docs/guides/dbt-ecosystem/adapter-development/7-verifying-a-new-adapter.md b/website/docs/guides/dbt-ecosystem/adapter-development/7-verifying-a-new-adapter.md
deleted file mode 100644
index 6310569dfad..00000000000
--- a/website/docs/guides/dbt-ecosystem/adapter-development/7-verifying-a-new-adapter.md
+++ /dev/null
@@ -1,41 +0,0 @@
----
-title: "Verifying a new adapter"
-id: "7-verifying-a-new-adapter"
----
-
-## Why verify an adapter?
-
-The very first data platform dbt supported was Redshift, followed quickly by Postgres ([dbt-core#174](https://github.com/dbt-labs/dbt-core/pull/174)). In 2017, back when dbt Labs (née Fishtown Analytics) was still a data consultancy, we added support for Snowflake and BigQuery. We also turned dbt's database support into an adapter framework ([dbt-core#259](https://github.com/dbt-labs/dbt-core/pull/259/)), and a plugin system a few years later. For years, dbt Labs specialized in those four data platforms and became experts in them. However, the surface area of all possible databases, their respective nuances, and keeping them up-to-date and bug-free is a Herculean and/or Sisyphean task that couldn't be done by a single person or even a single team! Enter the dbt community, which enables dbt Core to work on more than 30 different databases (32 as of Sep '22)!
-
-Free and open-source tools for the data professional are increasingly abundant. This is by and large a *good thing*; however, it requires due diligence that wasn't required in a paid-license, closed-source software world. Before taking a dependency on an open-source project, it is important to determine the answer to the following questions:
-
-1. Does it work?
-2. Does it meet my team's specific use case?
-3. Does anyone "own" the code, or is anyone liable for ensuring it works?
-4. Do bugs get fixed quickly?
-5. Does it stay up-to-date with new Core features?
-6. Is the usage substantial enough to self-sustain?
-7. What risks do I take on by taking a dependency on this library?
-
-These are valid, important questions to answer—especially given that `dbt-core` itself only put out its first stable release (major version v1.0) in December 2021! Indeed, up until now, the majority of new user questions in database-specific channels are some form of:
-- "How mature is `dbt-`? Any gotchas I should be aware of before I start exploring?"
-- "has anyone here used `dbt-` for production models?"
-- "I've been playing with `dbt-` -- I was able to install and run my initial experiments. I noticed that there are certain features mentioned on the documentation that are marked as 'not ok' or 'not tested'. What are the risks? -I'd love to make a statement on my team to adopt DBT [sic], but I'm pretty sure questions will be asked around the possible limitations of the adapter or if there are other companies out there using dbt [sic] with Oracle DB in production, etc." - -There has been a tendency to trust the dbt Labs-maintained adapters over community- and vendor-supported adapters, but repo ownership is only one among many indicators of software quality. We aim to help our users feel well-informed as to the caliber of an adapter with a new program. - -## Verified by dbt Labs - -The adapter verification program aims to quickly indicate to users which adapters can be trusted to use in production. Previously, doing so was uncharted territory for new users and complicated making the business case to their leadership team. We plan to give quality assurances by: -1. appointing a key stakeholder for the adapter repository, -2. ensuring that the chosen stakeholder fixes bugs and cuts new releases in a timely manner see maintainer your adapter (["Maintaining your new adapter"](2-prerequisites-for-a-new-adapter#maintaining-your-new-adapter)), -3. demonstrating that it passes our adapter pytest suite tests, -4. assuring that it works for us internally and ideally an existing team using the adapter in production . - - -Every major & minor version of a adapter will be verified internally and given an official :white_check_mark: (custom emoji coming soon), on the ["Supported Data Platforms"](/docs/supported-data-platforms) page. - -## How to get an adapter verified? - -We envision that data platform vendors will be most interested in having their adapter versions verified, however we are open to community adapter verification. If interested, please reach out either to the `partnerships` at `dbtlabs.com` or post in the [#adapter-ecosystem Slack channel](https://getdbt.slack.com/archives/C030A0UF5LM). diff --git a/website/docs/guides/dbt-ecosystem/adapter-development/8-building-a-trusted-adapter.md b/website/docs/guides/dbt-ecosystem/adapter-development/8-building-a-trusted-adapter.md deleted file mode 100644 index 9783ec66460..00000000000 --- a/website/docs/guides/dbt-ecosystem/adapter-development/8-building-a-trusted-adapter.md +++ /dev/null @@ -1,79 +0,0 @@ ---- -title: "Building a Trusted Adapter" -id: "8-building-a-trusted-adapter" ---- - -The Trusted adapter program exists to allow adapter maintainers to demonstrate to the dbt community that your adapter is trusted to be used in production. - -## What does it mean to be trusted - -By opting into the below, you agree to this, and we take you at your word. dbt Labs reserves the right to remove an adapter from the trusted adapter list at any time, should any of the below guidelines not be met. - -### Feature Completeness - -To be considered for the Trusted Adapter program, the adapter must cover the essential functionality of dbt Core given below, with best effort given to support the entire feature set. - -Essential functionality includes (but is not limited to the following features): - -- table, view, and seed materializations -- dbt tests - -The adapter should have the required documentation for connecting and configuring the adapter. The dbt docs site should be the single source of truth for this information. 
These docs should be kept up-to-date.
-
-See [Documenting a new adapter](/guides/dbt-ecosystem/adapter-development/5-documenting-a-new-adapter) for more information.
-
-### Release Cadence
-
-Keeping an adapter up-to-date with dbt Core is an integral part of being a trusted adapter. Therefore, we ask that adapter maintainers:
-
-- Release new minor versions of the adapter, with all tests passing, within four weeks of dbt Core's release cut.
-- Release new major versions of the adapter, with all tests passing, within eight weeks of dbt Core's release cut.
-
-### Community Responsiveness
-
-On a best-effort basis, maintainers actively participate and engage with the dbt Community across the following forums:
-
-- Being responsive to feedback and supporting user enablement in the dbt Community's Slack workspace
-- Responding with comments to issues raised in the public dbt adapter code repository
-- Merging in code contributions from community members as deemed appropriate
-
-### Security Practices
-
-Trusted adapters will not do any of the following:
-
-- Output access credentials for, or data from, the underlying data platform to logs or files.
-- Make API calls other than those expressly required for using dbt features (adapters may not add additional logging)
-- Obfuscate code and/or functionality so as to avoid detection
-
-Additionally, to avoid supply-chain attacks:
-
-- Use an automated service to keep Python dependencies up-to-date (such as Dependabot or similar),
-- Publish directly to PyPI from the dbt adapter code repository by using a trusted CI/CD process (such as GitHub Actions)
-- Restrict admin access to both the respective code (GitHub) and package (PyPI) repositories
-- Identify and mitigate security vulnerabilities by using a static code analysis tool (such as Snyk) as part of a CI/CD process
-
-### Other considerations
-
-The adapter repository is:
-
-- open-source licensed,
-- published to PyPI, and
-- automatically tested against dbt Labs' provided adapter test suite (see the sketch at the end of this page)
-
-## How to get an adapter added to the Trusted list?
-
-Open an issue on the [docs.getdbt.com GitHub repository](https://github.com/dbt-labs/docs.getdbt.com) using the "Add adapter to Trusted list" template. In addition to contact information, it will ask you to confirm that you agree to the following:
-
-1. My adapter meets the guidelines given above.
-2. I will make a best reasonable effort to ensure this continues to be the case.
-3. (checkbox) I acknowledge that dbt Labs reserves the right to remove an adapter from the trusted adapter list at any time, should any of the above guidelines not be met.
-
-The approval workflow is as follows:
-
-1. Create and populate the template-created issue.
-2. dbt Labs will respond as quickly as possible (maximally four weeks, though likely faster).
-3. If approved, dbt Labs will create and merge a pull request to formally add the adapter to the list.
-
-## How to get help with my trusted adapter?
-
-Ask your question in the #adapter-ecosystem channel of the community Slack.
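To make the "automatically tested" expectation above concrete, here is a minimal sketch of how an adapter repository can reuse dbt Labs' shared adapter test suite. It assumes the `dbt-tests-adapter` package is installed and uses a hypothetical `dbt-myadapter` plugin with made-up connection details; the base classes and fixtures your adapter actually needs may differ.

```python
# tests/functional/adapter/test_basic.py -- a sketch, not a drop-in file.
# Reuses shared test cases from the dbt-tests-adapter package by subclassing
# them, so the adapter is exercised against core dbt behaviors.
import pytest

from dbt.tests.adapter.basic.test_base import BaseSimpleMaterializations
from dbt.tests.adapter.basic.test_generic_tests import BaseGenericTests


@pytest.fixture(scope="class")
def dbt_profile_target():
    # Hypothetical connection details for the warehouse the tests run against.
    return {
        "type": "myadapter",
        "host": "localhost",
        "user": "dbt_test_user",
        "password": "dbt_test_password",
        "dbname": "dbt_test",
    }


class TestSimpleMaterializationsMyAdapter(BaseSimpleMaterializations):
    # Inherit the shared table/view/seed materialization tests unchanged.
    pass


class TestGenericTestsMyAdapter(BaseGenericTests):
    # Inherit the shared not_null/unique generic test cases unchanged.
    pass
```

Running a suite like this in CI on every pull request is what makes the release-cadence commitments above verifiable rather than aspirational.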
diff --git a/website/docs/guides/dbt-ecosystem/adapter-development/adapter-development b/website/docs/guides/dbt-ecosystem/adapter-development/adapter-development deleted file mode 100644 index 8b137891791..00000000000 --- a/website/docs/guides/dbt-ecosystem/adapter-development/adapter-development +++ /dev/null @@ -1 +0,0 @@ - diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/1-overview-dbt-python-snowpark.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/1-overview-dbt-python-snowpark.md deleted file mode 100644 index b03cb2ca013..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/1-overview-dbt-python-snowpark.md +++ /dev/null @@ -1,38 +0,0 @@ ---- -title: "Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake" -id: "1-overview-dbt-python-snowpark" -description: "Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake" ---- - -The focus of this workshop will be to demonstrate how we can use both *SQL and python together* in the same workflow to run *both analytics and machine learning models* on dbt Cloud. - -All code in today’s workshop can be found on [GitHub](https://github.com/dbt-labs/python-snowpark-formula1/tree/python-formula1). - -## What you'll use during the lab - -- A [Snowflake account](https://trial.snowflake.com/) with ACCOUNTADMIN access -- A [dbt Cloud account](https://www.getdbt.com/signup/) - -## What you'll learn - -- How to build scalable data transformation pipelines using dbt, and Snowflake using SQL and Python -- How to leverage copying data into Snowflake from a public S3 bucket - -## What you need to know - -- Basic to intermediate SQL and python. -- Basic understanding of dbt fundamentals. We recommend the [dbt Fundamentals course](https://courses.getdbt.com/collections) if you're interested. -- High level machine learning process (encoding, training, testing) -- Simple ML algorithms — we will use logistic regression to keep the focus on the *workflow*, not algorithms! - -## What you'll build - -- A set of data analytics and prediction pipelines using Formula 1 data leveraging dbt and Snowflake, making use of best practices like data quality tests and code promotion between environments -- We will create insights for: - 1. Finding the lap time average and rolling average through the years (is it generally trending up or down)? - 2. Which constructor has the fastest pit stops in 2021? - 3. Predicting the position of each driver given using a decade of data (2010 - 2020) - -As inputs, we are going to leverage Formula 1 datasets hosted on a dbt Labs public S3 bucket. We will create a Snowflake Stage for our CSV files then use Snowflake’s `COPY INTO` function to copy the data in from our CSV files into tables. The Formula 1 is available on [Kaggle](https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020). The data is originally compiled from the [Ergast Developer API](http://ergast.com/mrd/). - -Overall we are going to set up the environments, build scalable pipelines in dbt, establish data tests, and promote code to production. 
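As a preview of that load step (the workshop walks through it click-by-click later), the "create a stage over the S3 bucket, then `COPY INTO` a table" pattern looks roughly like the sketch below. It uses the Snowpark Python `Session` only as a convenient way to send SQL, and every name in it (bucket URL, stage, table, credentials) is a hypothetical placeholder rather than one of the workshop's actual objects.

```python
# Sketch only: the stage + COPY INTO pattern, sent through a Snowpark session.
from snowflake.snowpark import Session

# Placeholder credentials -- substitute your own trial account details.
session = Session.builder.configs({
    "account": "<account_locator>",
    "user": "<user>",
    "password": "<password>",
    "role": "ACCOUNTADMIN",
    "warehouse": "COMPUTE_WH",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# 1. Point an external stage at a public bucket of CSV files.
session.sql("""
    create or replace stage formula1_stage
        url = 's3://example-public-bucket/formula1/'
        file_format = (type = csv skip_header = 1)
""").collect()

# 2. Bulk-load one CSV into a table that already exists in the current schema.
session.sql("""
    copy into circuits
        from @formula1_stage/circuits.csv
        file_format = (type = csv skip_header = 1 field_optionally_enclosed_by = '"')
""").collect()
```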
diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/10-python-transformations.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/10-python-transformations.md deleted file mode 100644 index 446981214e3..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/10-python-transformations.md +++ /dev/null @@ -1,150 +0,0 @@ ---- -title: "Python transformations!" -id: "10-python-transformations" -description: "Python transformations" ---- - -Up until now, SQL has been driving the project (car pun intended) for data cleaning and hierarchical joining. Now it’s time for Python to take the wheel (car pun still intended) for the rest of our lab! For more information about running Python models on dbt, check out our [docs](/docs/build/python-models). To learn more about dbt python works under the hood, check out [Snowpark for Python](https://docs.snowflake.com/en/developer-guide/snowpark/python/index.html), which makes running dbt Python models possible. - -There are quite a few differences between SQL and Python in terms of the dbt syntax and DDL, so we’ll be breaking our code and model runs down further for our python models. - -## Pit stop analysis - -First, we want to find out: which constructor had the fastest pit stops in 2021? (constructor is a Formula 1 team that builds or “constructs” the car). - -1. Create a new file called `fastest_pit_stops_by_constructor.py` in our `aggregates` (this is the first time we are using the `.py` extension!). -2. Copy the following code into the file: - ```python - import numpy as np - import pandas as pd - - def model(dbt, session): - # dbt configuration - dbt.config(packages=["pandas","numpy"]) - - # get upstream data - pit_stops_joined = dbt.ref("pit_stops_joined").to_pandas() - - # provide year so we do not hardcode dates - year=2021 - - # describe the data - pit_stops_joined["PIT_STOP_SECONDS"] = pit_stops_joined["PIT_STOP_MILLISECONDS"]/1000 - fastest_pit_stops = pit_stops_joined[(pit_stops_joined["RACE_YEAR"]==year)].groupby(by="CONSTRUCTOR_NAME")["PIT_STOP_SECONDS"].describe().sort_values(by='mean') - fastest_pit_stops.reset_index(inplace=True) - fastest_pit_stops.columns = fastest_pit_stops.columns.str.upper() - - return fastest_pit_stops.round(2) - ``` - -3. Let’s break down what this code is doing step by step: - - First, we are importing the Python libraries that we are using. A *library* is a reusable chunk of code that someone else wrote that you may want to include in your programs/projects. We are using `numpy` and `pandas`in this Python model. This is similar to a dbt *package*, but our Python libraries do *not* persist across the entire project. - - Defining a function called `model` with the parameter `dbt` and `session`. The parameter `dbt` is a class compiled by dbt, which enables you to run your Python code in the context of your dbt project and DAG. The parameter `session` is a class representing your Snowflake’s connection to the Python backend. The `model` function *must return a single DataFrame*. You can see that all the data transformation happening is within the body of the `model` function that the `return` statement is tied to. - - Then, within the context of our dbt model library, we are passing in a configuration of which packages we need using `dbt.config(packages=["pandas","numpy"])`. - - Use the `.ref()` function to retrieve the data frame `pit_stops_joined` that we created in our last step using SQL. We cast this to a pandas dataframe (by default it's a Snowpark Dataframe). 
- - Create a variable named `year` so we aren’t passing a hardcoded value. - - Generate a new column called `PIT_STOP_SECONDS` by dividing the value of `PIT_STOP_MILLISECONDS` by 1000. - - Create our final data frame `fastest_pit_stops` that holds the records where year is equal to our year variable (2021 in this case), then group the data frame by `CONSTRUCTOR_NAME` and use the `describe()` and `sort_values()` and in descending order. This will make our first row in the new aggregated data frame the team with the fastest pit stops over an entire competition year. - - Finally, it resets the index of the `fastest_pit_stops` data frame. The `reset_index()` method allows you to reset the index back to the default 0, 1, 2, etc indexes. By default, this method will keep the "old" indexes in a column named "index"; to avoid this, use the drop parameter. Think of this as keeping your data “flat and square” as opposed to “tiered”. If you are new to Python, now might be a good time to [learn about indexes for 5 minutes](https://towardsdatascience.com/the-basics-of-indexing-and-slicing-python-lists-2d12c90a94cf) since it's the foundation of how Python retrieves, slices, and dices data. The `inplace` argument means we override the existing data frame permanently. Not to fear! This is what we want to do to avoid dealing with multi-indexed dataframes! - - Convert our Python column names to all uppercase using `.upper()`, so Snowflake recognizes them. - - Finally we are returning our dataframe with 2 decimal places for all the columns using the `round()` method. -4. Zooming out a bit, what are we doing differently here in Python from our typical SQL code: - - Method chaining is a technique in which multiple methods are called on an object in a single statement, with each method call modifying the result of the previous one. The methods are called in a chain, with the output of one method being used as the input for the next one. The technique is used to simplify the code and make it more readable by eliminating the need for intermediate variables to store the intermediate results. - - The way you see method chaining in Python is the syntax `.().()`. For example, `.describe().sort_values(by='mean')` where the `.describe()` method is chained to `.sort_values()`. - - The `.describe()` method is used to generate various summary statistics of the dataset. It's used on pandas dataframe. It gives a quick and easy way to get the summary statistics of your dataset without writing multiple lines of code. - - The `.sort_values()` method is used to sort a pandas dataframe or a series by one or multiple columns. The method sorts the data by the specified column(s) in ascending or descending order. It is the pandas equivalent to `order by` in SQL. - - We won’t go as in depth for our subsequent scripts, but will continue to explain at a high level what new libraries, functions, and methods are doing. - -5. Build the model using the UI which will **execute**: - ```bash - dbt run --select fastest_pit_stops_by_constructor - ``` - in the command bar. - - Let’s look at some details of our first Python model to see what our model executed. There two major differences we can see while running a Python model compared to an SQL model: - - - Our Python model was executed as a stored procedure. Snowflake needs a way to know that it's meant to execute this code in a Python runtime, instead of interpreting in a SQL runtime. We do this by creating a Python stored proc, called by a SQL command. 
- - The `snowflake-snowpark-python` library has been picked up to execute our Python code. Even though this wasn’t explicitly stated this is picked up by the dbt class object because we need our Snowpark package to run Python! - - Python models take a bit longer to run than SQL models, however we could always speed this up by using [Snowpark-optimized Warehouses](https://docs.snowflake.com/en/user-guide/warehouses-snowpark-optimized.html) if we wanted to. Our data is sufficiently small, so we won’t worry about creating a separate warehouse for Python versus SQL files today. - - - The rest of our **Details** output gives us information about how dbt and Snowpark for Python are working together to define class objects and apply a specific set of methods to run our models. - - So which constructor had the fastest pit stops in 2021? Let’s look at our data to find out! - -6. We can't preview Python models directly, so let’s create a new file using the **+** button or the Control-n shortcut to create a new scratchpad. -7. Reference our Python model: - ```sql - select * from {{ ref('fastest_pit_stops_by_constructor') }} - ``` - and preview the output: - - - Not only did Red Bull have the fastest average pit stops by nearly 40 seconds, they also had the smallest standard deviation, meaning they are both fastest and most consistent teams in pit stops. By using the `.describe()` method we were able to avoid verbose SQL requiring us to create a line of code per column and repetitively use the `PERCENTILE_COUNT()` function. - - Now we want to find the lap time average and rolling average through the years (is it generally trending up or down)? - -8. Create a new file called `lap_times_moving_avg.py` in our `aggregates` folder. -9. Copy the following code into the file: - ```python - import pandas as pd - - def model(dbt, session): - # dbt configuration - dbt.config(packages=["pandas"]) - - # get upstream data - lap_times = dbt.ref("int_lap_times_years").to_pandas() - - # describe the data - lap_times["LAP_TIME_SECONDS"] = lap_times["LAP_TIME_MILLISECONDS"]/1000 - lap_time_trends = lap_times.groupby(by="RACE_YEAR")["LAP_TIME_SECONDS"].mean().to_frame() - lap_time_trends.reset_index(inplace=True) - lap_time_trends["LAP_MOVING_AVG_5_YEARS"] = lap_time_trends["LAP_TIME_SECONDS"].rolling(5).mean() - lap_time_trends.columns = lap_time_trends.columns.str.upper() - - return lap_time_trends.round(1) - ``` - -10. Breaking down our code a bit: - - We’re only using the `pandas` library for this model and casting it to a pandas data frame `.to_pandas()`. - - Generate a new column called `LAP_TIMES_SECONDS` by dividing the value of `LAP_TIME_MILLISECONDS` by 1000. - - Create the final dataframe. Get the lap time per year. Calculate the mean series and convert to a data frame. - - Reset the index. - - Calculate the rolling 5 year mean. - - Round our numeric columns to one decimal place. -11. Now, run this model by using the UI **Run model** or - ```bash - dbt run --select lap_times_moving_avg - ``` - in the command bar. - -12. Once again previewing the output of our data using the same steps for our `fastest_pit_stops_by_constructor` model. - - - We can see that it looks like lap times are getting consistently faster over time. Then in 2010 we see an increase occur! Using outside subject matter context, we know that significant rule changes were introduced to Formula 1 in 2010 and 2011 causing slower lap times. - -13. Now is a good time to checkpoint and commit our work to Git. 
Click **Commit and push** and give your commit a message like `aggregate python models` before moving on. - -## The dbt model, .source(), .ref() and .config() functions - -Let’s take a step back before starting machine learning to both review and go more in-depth at the methods that make running dbt python models possible. If you want to know more outside of this lab’s explanation read the documentation [here](/docs/build/python-models?version=1.3). - -- dbt model(dbt, session). For starters, each Python model lives in a .py file in your models/ folder. It defines a function named `model()`, which takes two parameters: - - dbt — A class compiled by dbt Core, unique to each model, enables you to run your Python code in the context of your dbt project and DAG. - - session — A class representing your data platform’s connection to the Python backend. The session is needed to read in tables as DataFrames and to write DataFrames back to tables. In PySpark, by convention, the SparkSession is named spark, and available globally. For consistency across platforms, we always pass it into the model function as an explicit argument called session. -- The `model()` function must return a single DataFrame. On Snowpark (Snowflake), this can be a Snowpark or pandas DataFrame. -- `.source()` and `.ref()` functions. Python models participate fully in dbt's directed acyclic graph (DAG) of transformations. If you want to read directly from a raw source table, use `dbt.source()`. We saw this in our earlier section using SQL with the source function. These functions have the same execution, but with different syntax. Use the `dbt.ref()` method within a Python model to read data from other models (SQL or Python). These methods return DataFrames pointing to the upstream source, model, seed, or snapshot. -- `.config()`. Just like SQL models, there are three ways to configure Python models: - - In a dedicated `.yml` file, within the `models/` directory - - Within the model's `.py` file, using the `dbt.config()` method - - Calling the `dbt.config()` method will set configurations for your model within your `.py` file, similar to the `{{ config() }} macro` in `.sql` model files: - ```python - def model(dbt, session): - - # setting configuration - dbt.config(materialized="table") - ``` - - There's a limit to how complex you can get with the `dbt.config()` method. It accepts only literal values (strings, booleans, and numeric types). Passing another function or a more complex data structure is not possible. The reason is that dbt statically analyzes the arguments to `.config()` while parsing your model without executing your Python code. If you need to set a more complex configuration, we recommend you define it using the config property in a [YAML file](/reference/resource-properties/config). Learn more about configurations [here](/reference/model-configs). diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/11-machine-learning-prep.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/11-machine-learning-prep.md deleted file mode 100644 index bde163b59db..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/11-machine-learning-prep.md +++ /dev/null @@ -1,225 +0,0 @@ ---- -title: "Machine Learning prep: cleaning, encoding, and splits, oh my!" -id: "11-machine-learning-prep" -description: "Machine Learning prep" ---- -Now that we’ve gained insights and business intelligence about Formula 1 at a descriptive level, we want to extend our capabilities into prediction. 
We’re going to take the scenario where we censor the data. This means that we will pretend that we will train a model using earlier data and apply it to future data. In practice, this means we’ll take data from 2010-2019 to train our model and then predict 2020 data. - -In this section, we’ll be preparing our data to predict the final race position of a driver. - -At a high level we’ll be: - -- Creating new prediction features and filtering our dataset to active drivers -- Encoding our data (algorithms like numbers) and simplifying our target variable called `position` -- Splitting our dataset into training, testing, and validation - -## ML data prep - -1. To keep our project organized, we’ll need to create two new subfolders in our `ml` directory. Under the `ml` folder, make the subfolders `prep` and `train_predict`. -2. Create a new file under `ml/prep` called `ml_data_prep`. Copy the following code into the file and **Save**. - ```python - import pandas as pd - - def model(dbt, session): - # dbt configuration - dbt.config(packages=["pandas"]) - - # get upstream data - fct_results = dbt.ref("fct_results").to_pandas() - - # provide years so we do not hardcode dates in filter command - start_year=2010 - end_year=2020 - - # describe the data for a full decade - data = fct_results.loc[fct_results['RACE_YEAR'].between(start_year, end_year)] - - # convert string to an integer - data['POSITION'] = data['POSITION'].astype(float) - - # we cannot have nulls if we want to use total pit stops - data['TOTAL_PIT_STOPS_PER_RACE'] = data['TOTAL_PIT_STOPS_PER_RACE'].fillna(0) - - # some of the constructors changed their name over the year so replacing old names with current name - mapping = {'Force India': 'Racing Point', 'Sauber': 'Alfa Romeo', 'Lotus F1': 'Renault', 'Toro Rosso': 'AlphaTauri'} - data['CONSTRUCTOR_NAME'].replace(mapping, inplace=True) - - # create confidence metrics for drivers and constructors - dnf_by_driver = data.groupby('DRIVER').sum()['DNF_FLAG'] - driver_race_entered = data.groupby('DRIVER').count()['DNF_FLAG'] - driver_dnf_ratio = (dnf_by_driver/driver_race_entered) - driver_confidence = 1-driver_dnf_ratio - driver_confidence_dict = dict(zip(driver_confidence.index,driver_confidence)) - - dnf_by_constructor = data.groupby('CONSTRUCTOR_NAME').sum()['DNF_FLAG'] - constructor_race_entered = data.groupby('CONSTRUCTOR_NAME').count()['DNF_FLAG'] - constructor_dnf_ratio = (dnf_by_constructor/constructor_race_entered) - constructor_relaiblity = 1-constructor_dnf_ratio - constructor_relaiblity_dict = dict(zip(constructor_relaiblity.index,constructor_relaiblity)) - - data['DRIVER_CONFIDENCE'] = data['DRIVER'].apply(lambda x:driver_confidence_dict[x]) - data['CONSTRUCTOR_RELAIBLITY'] = data['CONSTRUCTOR_NAME'].apply(lambda x:constructor_relaiblity_dict[x]) - - #removing retired drivers and constructors - active_constructors = ['Renault', 'Williams', 'McLaren', 'Ferrari', 'Mercedes', - 'AlphaTauri', 'Racing Point', 'Alfa Romeo', 'Red Bull', - 'Haas F1 Team'] - active_drivers = ['Daniel Ricciardo', 'Kevin Magnussen', 'Carlos Sainz', - 'Valtteri Bottas', 'Lance Stroll', 'George Russell', - 'Lando Norris', 'Sebastian Vettel', 'Kimi Räikkönen', - 'Charles Leclerc', 'Lewis Hamilton', 'Daniil Kvyat', - 'Max Verstappen', 'Pierre Gasly', 'Alexander Albon', - 'Sergio Pérez', 'Esteban Ocon', 'Antonio Giovinazzi', - 'Romain Grosjean','Nicholas Latifi'] - - # create flags for active drivers and constructors so we can filter downstream - data['ACTIVE_DRIVER'] = data['DRIVER'].apply(lambda x: int(x in 
active_drivers)) - data['ACTIVE_CONSTRUCTOR'] = data['CONSTRUCTOR_NAME'].apply(lambda x: int(x in active_constructors)) - - return data - ``` -3. As usual, let’s break down what we are doing in this Python model: - - We’re first referencing our upstream `fct_results` table and casting it to a pandas dataframe. - - Filtering on years 2010-2020 since we’ll need to clean all our data we are using for prediction (both training and testing). - - Filling in empty data for `total_pit_stops` and making a mapping active constructors and drivers to avoid erroneous predictions - - ⚠️ You might be wondering why we didn’t do this upstream in our `fct_results` table! The reason for this is that we want our machine learning cleanup to reflect the year 2020 for our predictions and give us an up-to-date team name. However, for business intelligence purposes we can keep the historical data at that point in time. Instead of thinking of one table as “one source of truth” we are creating different datasets fit for purpose: one for historical descriptions and reporting and another for relevant predictions. - - Create new confidence features for drivers and constructors - - Generate flags for the constructors and drivers that were active in 2020 -4. Execute the following in the command bar: - ```bash - dbt run --select ml_data_prep - ``` -5. There are more aspects we could consider for this project, such as normalizing the driver confidence by the number of races entered. Including this would help account for a driver’s history and consider whether they are a new or long-time driver. We’re going to keep it simple for now, but these are some of the ways we can expand and improve our machine learning dbt projects. Breaking down our machine learning prep model: - - Lambda functions — We use some lambda functions to transform our data without having to create a fully-fledged function using the `def` notation. So what exactly are lambda functions? - - In Python, a lambda function is a small, anonymous function defined using the keyword "lambda". Lambda functions are used to perform a quick operation, such as a mathematical calculation or a transformation on a list of elements. They are often used in conjunction with higher-order functions, such as `apply`, `map`, `filter`, and `reduce`. - - `.apply()` method — We used `.apply()` to pass our functions into our lambda expressions to the columns and perform this multiple times in our code. Let’s explain apply a little more: - - The `.apply()` function in the pandas library is used to apply a function to a specified axis of a DataFrame or a Series. In our case the function we used was our lambda function! - - The `.apply()` function takes two arguments: the first is the function to be applied, and the second is the axis along which the function should be applied. The axis can be specified as 0 for rows or 1 for columns. We are using the default value of 0 so we aren’t explicitly writing it in the code. This means that the function will be applied to each *row* of the DataFrame or Series. -6. Let’s look at the preview of our clean dataframe after running our `ml_data_prep` model: - - -## Covariate encoding - -In this next part, we’ll be performing covariate encoding. Breaking down this phrase a bit, a *covariate* is a variable that is relevant to the outcome of a study or experiment, and *encoding* refers to the process of converting data (such as text or categorical variables) into a numerical format that can be used as input for a model. 
This is necessary because most machine learning algorithms can only work with numerical data. Algorithms don’t speak languages, have eyes to see images, etc. so we encode our data into numbers so algorithms can perform tasks by using calculations they otherwise couldn’t. - -🧠 We’ll think about this as : “algorithms like numbers”. - -1. Create a new file under `ml/prep` called `covariate_encoding` copy the code below and save. - ```python - import pandas as pd - import numpy as np - from sklearn.preprocessing import StandardScaler,LabelEncoder,OneHotEncoder - from sklearn.linear_model import LogisticRegression - - def model(dbt, session): - # dbt configuration - dbt.config(packages=["pandas","numpy","scikit-learn"]) - - # get upstream data - data = dbt.ref("ml_data_prep").to_pandas() - - # list out covariates we want to use in addition to outcome variable we are modeling - position - covariates = data[['RACE_YEAR','CIRCUIT_NAME','GRID','CONSTRUCTOR_NAME','DRIVER','DRIVERS_AGE_YEARS','DRIVER_CONFIDENCE','CONSTRUCTOR_RELAIBLITY','TOTAL_PIT_STOPS_PER_RACE','ACTIVE_DRIVER','ACTIVE_CONSTRUCTOR', 'POSITION']] - - # filter covariates on active drivers and constructors - # use fil_cov as short for "filtered_covariates" - fil_cov = covariates[(covariates['ACTIVE_DRIVER']==1)&(covariates['ACTIVE_CONSTRUCTOR']==1)] - - # Encode categorical variables using LabelEncoder - # TODO: we'll update this to both ohe in the future for non-ordinal variables! - le = LabelEncoder() - fil_cov['CIRCUIT_NAME'] = le.fit_transform(fil_cov['CIRCUIT_NAME']) - fil_cov['CONSTRUCTOR_NAME'] = le.fit_transform(fil_cov['CONSTRUCTOR_NAME']) - fil_cov['DRIVER'] = le.fit_transform(fil_cov['DRIVER']) - fil_cov['TOTAL_PIT_STOPS_PER_RACE'] = le.fit_transform(fil_cov['TOTAL_PIT_STOPS_PER_RACE']) - - # Simply target variable "position" to represent 3 meaningful categories in Formula1 - # 1. Podium position 2. Points for team 3. Nothing - no podium or points! - def position_index(x): - if x<4: - return 1 - if x>10: - return 3 - else : - return 2 - - # we are dropping the columns that we filtered on in addition to our training variable - encoded_data = fil_cov.drop(['ACTIVE_DRIVER','ACTIVE_CONSTRUCTOR'],1) - encoded_data['POSITION_LABEL']= encoded_data['POSITION'].apply(lambda x: position_index(x)) - encoded_data_grouped_target = encoded_data.drop(['POSITION'],1) - - return encoded_data_grouped_target - ``` -2. Execute the following in the command bar: - ```bash - dbt run --select covariate_encoding - ``` -3. In this code, we are using a ton of functions from libraries! This is really cool, because we can utilize code other people have developed and bring it into our project simply by using the `import` function. [Scikit-learn](https://scikit-learn.org/stable/), “sklearn” for short, is an extremely popular data science library. Sklearn contains a wide range of machine learning techniques, including supervised and unsupervised learning algorithms, feature scaling and imputation, as well as tools model evaluation and selection. We’ll be using Sklearn for both preparing our covariates and creating models (our next section). -4. Our dataset is pretty small data so we are good to use pandas and `sklearn`. If you have larger data for your own project in mind, consider `dask` or `category_encoders`. -5. Breaking it down a bit more: - - We’re selecting a subset of variables that will be used as predictors for a driver’s position. 
- - Filter the dataset to only include rows using the active driver and constructor flags we created in the last step. - - The next step is to use the `LabelEncoder` from scikit-learn to convert the categorical variables `CIRCUIT_NAME`, `CONSTRUCTOR_NAME`, `DRIVER`, and `TOTAL_PIT_STOPS_PER_RACE` into numerical values. - - Create a new variable called `POSITION_LABEL`, which is a derived from our position variable. - - 💭 Why are we changing our position variable? There are 20 total positions in Formula 1 and we are grouping them together to simplify the classification and improve performance. We also want to demonstrate you can create a new function within your dbt model! - - Our new `position_label` variable has meaning: - - In Formula1 if you are in: - - Top 3 you get a “podium” position - - Top 10 you gain points that add to your overall season total - - Below top 10 you get no points! - - We are mapping our original variable position to `position_label` to the corresponding places above to 1,2, and 3 respectively. - - Drop the active driver and constructor flags since they were filter criteria and additionally drop our original position variable. - -## Splitting into training and testing datasets - -Now that we’ve cleaned and encoded our data, we are going to further split in by time. In this step, we will create dataframes to use for training and prediction. We’ll be creating two dataframes 1) using data from 2010-2019 for training, and 2) data from 2020 for new prediction inferences. We’ll create variables called `start_year` and `end_year` so we aren’t filtering on hardcasted values (and can more easily swap them out in the future if we want to retrain our model on different timeframes). - -1. Create a file called `train_test_dataset` copy and save the following code: - ```python - import pandas as pd - - def model(dbt, session): - - # dbt configuration - dbt.config(packages=["pandas"], tags="train") - - # get upstream data - encoding = dbt.ref("covariate_encoding").to_pandas() - - # provide years so we do not hardcode dates in filter command - start_year=2010 - end_year=2019 - - # describe the data for a full decade - train_test_dataset = encoding.loc[encoding['RACE_YEAR'].between(start_year, end_year)] - - return train_test_dataset - ``` - -2. Create a file called `hold_out_dataset_for_prediction` copy and save the following code below. Now we’ll have a dataset with only the year 2020 that we’ll keep as a hold out set that we are going to use similar to a deployment use case. - ```python - import pandas as pd - - def model(dbt, session): - # dbt configuration - dbt.config(packages=["pandas"], tags="predict") - - # get upstream data - encoding = dbt.ref("covariate_encoding").to_pandas() - - # variable for year instead of hardcoding it - year=2020 - - # filter the data based on the specified year - hold_out_dataset = encoding.loc[encoding['RACE_YEAR'] == year] - - return hold_out_dataset - ``` -3. Execute the following in the command bar: - ```bash - dbt run --select train_test_dataset hold_out_dataset_for_prediction - ``` - To run our temporal data split models, we can use this syntax in the command line to run them both at once. Make sure you use a *space* [syntax](/reference/node-selection/syntax) between the model names to indicate you want to run both! -4. **Commit and push** our changes to keep saving our work as we go using `ml data prep and splits` before moving on. 
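Before moving on, here is a tiny standalone sketch (plain pandas, runnable anywhere) of the most consequential transformation on this page: how the `position_index` logic from the covariate encoding step collapses 20 possible finishing positions into the three classes we will actually predict.

```python
import pandas as pd

# Toy finishing positions for a handful of hypothetical drivers.
positions = pd.Series([1, 3, 7, 12, 20])

def position_index(x):
    # 1 = podium (top 3), 2 = points finish (top 10), 3 = no points
    if x < 4:
        return 1
    if x > 10:
        return 3
    return 2

print(positions.apply(position_index).tolist())  # [1, 1, 2, 3, 3]
```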
- -👏 Now that we’ve finished our machine learning prep work we can move onto the fun part — training and prediction! diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/12-machine-learning-training-testing.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/12-machine-learning-training-testing.md deleted file mode 100644 index 8b353a85fa3..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/12-machine-learning-training-testing.md +++ /dev/null @@ -1,251 +0,0 @@ ---- -title: "Machine Learning: training and prediction " -id: "12-machine-learning-training-prediction" -description: "Machine Learning: training and prediction" ---- - -We’re ready to start training a model to predict the driver’s position. Now is a good time to pause and take a step back and say, usually in ML projects you’ll try multiple algorithms during development and use an evaluation method such as cross validation to determine which algorithm to use. You can definitely do this in your dbt project, but for the content of this lab we’ll have decided on using a logistic regression to predict position (we actually tried some other algorithms using cross validation outside of this lab such as k-nearest neighbors and a support vector classifier but that didn’t perform as well as the logistic regression and a decision tree that overfit). - -There are 3 areas to break down as we go since we are working at the intersection all within one model file: -1. Machine Learning -2. Snowflake and Snowpark -3. dbt Python models - -If you haven’t seen code like this before or use joblib files to save machine learning models, we’ll be going over them at a high level and you can explore the links for more technical in-depth along the way! Because Snowflake and dbt have abstracted away a lot of the nitty gritty about serialization and storing our model object to be called again, we won’t go into too much detail here. There’s *a lot* going on here so take it at your pace! - -## Training and saving a machine learning model - -1. Project organization remains key, so let’s make a new subfolder called `train_predict` under the `ml` folder. -2. 
Now create a new file called `train_test_position` and copy and save the following code: - - ```python - import snowflake.snowpark.functions as F - from sklearn.model_selection import train_test_split - import pandas as pd - from sklearn.metrics import confusion_matrix, balanced_accuracy_score - import io - from sklearn.linear_model import LogisticRegression - from joblib import dump, load - import joblib - import logging - import sys - from joblib import dump, load - - logger = logging.getLogger("mylog") - - def save_file(session, model, path, dest_filename): - input_stream = io.BytesIO() - joblib.dump(model, input_stream) - session._conn.upload_stream(input_stream, path, dest_filename) - return "successfully created file: " + path - - def model(dbt, session): - dbt.config( - packages = ['numpy','scikit-learn','pandas','numpy','joblib','cachetools'], - materialized = "table", - tags = "train" - ) - # Create a stage in Snowflake to save our model file - session.sql('create or replace stage MODELSTAGE').collect() - - #session._use_scoped_temp_objects = False - version = "1.0" - logger.info('Model training version: ' + version) - - # read in our training and testing upstream dataset - test_train_df = dbt.ref("train_test_dataset") - - # cast snowpark df to pandas df - test_train_pd_df = test_train_df.to_pandas() - target_col = "POSITION_LABEL" - - # split out covariate predictors, x, from our target column position_label, y. - split_X = test_train_pd_df.drop([target_col], axis=1) - split_y = test_train_pd_df[target_col] - - # Split out our training and test data into proportions - X_train, X_test, y_train, y_test = train_test_split(split_X, split_y, train_size=0.7, random_state=42) - train = [X_train, y_train] - test = [X_test, y_test] - # now we are only training our one model to deploy - # we are keeping the focus on the workflows and not algorithms for this lab! - model = LogisticRegression() - - # fit the preprocessing pipeline and the model together - model.fit(X_train, y_train) - y_pred = model.predict_proba(X_test)[:,1] - predictions = [round(value) for value in y_pred] - balanced_accuracy = balanced_accuracy_score(y_test, predictions) - - # Save the model to a stage - save_file(session, model, "@MODELSTAGE/driver_position_"+version, "driver_position_"+version+".joblib" ) - logger.info('Model artifact:' + "@MODELSTAGE/driver_position_"+version+".joblib") - - # Take our pandas training and testing dataframes and put them back into snowpark dataframes - snowpark_train_df = session.write_pandas(pd.concat(train, axis=1, join='inner'), "train_table", auto_create_table=True, create_temp_table=True) - snowpark_test_df = session.write_pandas(pd.concat(test, axis=1, join='inner'), "test_table", auto_create_table=True, create_temp_table=True) - - # Union our training and testing data together and add a column indicating train vs test rows - return snowpark_train_df.with_column("DATASET_TYPE", F.lit("train")).union(snowpark_test_df.with_column("DATASET_TYPE", F.lit("test"))) - ``` - -3. Execute the following in the command bar: - ```bash - dbt run --select train_test_position - ``` -4. Breaking down our Python script here: - - We’re importing some helpful libraries. - - Defining a function called `save_file()` that takes four parameters: `session`, `model`, `path` and `dest_filename` that will save our logistic regression model file. - - `session` — an object representing a connection to Snowflake. - - `model` — an object that needs to be saved. 
In this case, it's a Python object that is a scikit-learn that can be serialized with joblib. - - `path` — a string representing the directory or bucket location where the file should be saved. - - `dest_filename` — a string representing the desired name of the file. - - Creating our dbt model - - Within this model we are creating a stage called `MODELSTAGE` to place our logistic regression `joblib` model file. This is really important since we need a place to keep our model to reuse and want to ensure it's there. When using Snowpark commands, it's common to see the `.collect()` method to ensure the action is performed. Think of the session as our “start” and collect as our “end” when [working with Snowpark](https://docs.snowflake.com/en/developer-guide/snowpark/python/working-with-dataframes.html) (you can use other ending methods other than collect). - - Using `.ref()` to connect into our `train_test_dataset` model. - - Now we see the machine learning part of our analysis: - - Create new dataframes for our prediction features from our target variable `position_label`. - - Split our dataset into 70% training (and 30% testing), train_size=0.7 with a `random_state` specified to have repeatable results. - - Specify our model is a logistic regression. - - Fit our model. In a logistic regression this means finding the coefficients that will give the least classification error. - - Round our predictions to the nearest integer since logistic regression creates a probability between for each class and calculate a balanced accuracy to account for imbalances in the target variable. - - Right now our model is only in memory, so we need to use our nifty function `save_file` to save our model file to our Snowflake stage. We save our model as a joblib file so Snowpark can easily call this model object back to create predictions. We really don’t need to know much else as a data practitioner unless we want to. It’s worth noting that joblib files aren’t able to be queried directly by SQL. To do this, we would need to transform the joblib file to an SQL querable format such as JSON or CSV (out of scope for this workshop). - - Finally we want to return our dataframe, but create a new column indicating what rows were used for training and those for training. -5. Viewing our output of this model: - - -6. Let’s pop back over to Snowflake and check that our logistic regression model has been stored in our `MODELSTAGE` using the command: - ```sql - list @modelstage - ``` - - -7. To investigate the commands run as part of `train_test_position` script, navigate to Snowflake query history to view it **Activity > Query History**. We can view the portions of query that we wrote such as `create or replace stage MODELSTAGE`, but we also see additional queries that Snowflake uses to interpret python code. - - -## Predicting on new data - -1. 
Create a new file called `predict_position` and copy and save the following code: - ```python - import logging - import joblib - import pandas as pd - import os - from snowflake.snowpark import types as T - - DB_STAGE = 'MODELSTAGE' - version = '1.0' - # The name of the model file - model_file_path = 'driver_position_'+version - model_file_packaged = 'driver_position_'+version+'.joblib' - - # This is a local directory, used for storing the various artifacts locally - LOCAL_TEMP_DIR = f'/tmp/driver_position' - DOWNLOAD_DIR = os.path.join(LOCAL_TEMP_DIR, 'download') - TARGET_MODEL_DIR_PATH = os.path.join(LOCAL_TEMP_DIR, 'ml_model') - TARGET_LIB_PATH = os.path.join(LOCAL_TEMP_DIR, 'lib') - - # The feature columns that were used during model training - # and that will be used during prediction - FEATURE_COLS = [ - "RACE_YEAR" - ,"CIRCUIT_NAME" - ,"GRID" - ,"CONSTRUCTOR_NAME" - ,"DRIVER" - ,"DRIVERS_AGE_YEARS" - ,"DRIVER_CONFIDENCE" - ,"CONSTRUCTOR_RELAIBLITY" - ,"TOTAL_PIT_STOPS_PER_RACE"] - - def register_udf_for_prediction(p_predictor ,p_session ,p_dbt): - - # The prediction udf - - def predict_position(p_df: T.PandasDataFrame[int, int, int, int, - int, int, int, int, int]) -> T.PandasSeries[int]: - # Snowpark currently does not set the column name in the input dataframe - # The default col names are like 0,1,2,... Hence we need to reset the column - # names to the features that we initially used for training. - p_df.columns = [*FEATURE_COLS] - - # Perform prediction. this returns an array object - pred_array = p_predictor.predict(p_df) - # Convert to series - df_predicted = pd.Series(pred_array) - return df_predicted - - # The list of packages that will be used by UDF - udf_packages = p_dbt.config.get('packages') - - predict_position_udf = p_session.udf.register( - predict_position - ,name=f'predict_position' - ,packages = udf_packages - ) - return predict_position_udf - - def download_models_and_libs_from_stage(p_session): - p_session.file.get(f'@{DB_STAGE}/{model_file_path}/{model_file_packaged}', DOWNLOAD_DIR) - - def load_model(p_session): - # Load the model and initialize the predictor - model_fl_path = os.path.join(DOWNLOAD_DIR, model_file_packaged) - predictor = joblib.load(model_fl_path) - return predictor - - # ------------------------------- - def model(dbt, session): - dbt.config( - packages = ['snowflake-snowpark-python' ,'scipy','scikit-learn' ,'pandas' ,'numpy'], - materialized = "table", - tags = "predict" - ) - session._use_scoped_temp_objects = False - download_models_and_libs_from_stage(session) - predictor = load_model(session) - predict_position_udf = register_udf_for_prediction(predictor, session ,dbt) - - # Retrieve the data, and perform the prediction - hold_out_df = (dbt.ref("hold_out_dataset_for_prediction") - .select(*FEATURE_COLS) - ) - - # Perform prediction. - new_predictions_df = hold_out_df.withColumn("position_predicted" - ,predict_position_udf(*FEATURE_COLS) - ) - - return new_predictions_df - ``` -2. Execute the following in the command bar: - ```bash - dbt run --select predict_position - ``` -3. **Commit and push** our changes to keep saving our work as we go using the commit message `logistic regression model training and application` before moving on. -4. At a high level in this script, we are: - - Retrieving our staged logistic regression model - - Loading the model in - - Placing the model within a user defined function (UDF) to call in line predictions on our driver’s position -5. At a more detailed level: - - Import our libraries. 
- - Create variables to reference back to the `MODELSTAGE` we just created and stored our model to. - - The temporary file paths we created might look intimidating, but all we’re doing here is programmatically using an initial file path and adding to it to create the following directories: - - LOCAL_TEMP_DIR ➡️ /tmp/driver_position - - DOWNLOAD_DIR ➡️ /tmp/driver_position/download - - TARGET_MODEL_DIR_PATH ➡️ /tmp/driver_position/ml_model - - TARGET_LIB_PATH ➡️ /tmp/driver_position/lib - - Provide a list of our feature columns that we used for model training and will now be used on new data for prediction. - - Next, we are creating our main function `register_udf_for_prediction(p_predictor ,p_session ,p_dbt):`. This function is used to register a user-defined function (UDF) that performs the machine learning prediction. It takes three parameters: `p_predictor` is an instance of the machine learning model, `p_session` is an instance of the Snowflake session, and `p_dbt` is an instance of the dbt library. The function creates a UDF named `predict_churn` which takes a pandas dataframe with the input features and returns a pandas series with the predictions. - - ⚠️ Pay close attention to the whitespace here. We are using a function within a function for this script. - - We have 2 simple functions that are programmatically retrieving our file paths to first get our stored model out of our `MODELSTAGE` and downloaded into the session `download_models_and_libs_from_stage` and then to load the contents of our model in (parameters) in `load_model` to use for prediction. - - Take the model we loaded in and call it `predictor` and wrap it in a UDF. - - Return our dataframe with both the features used to predict and the new label. - -🧠 Another way to read this script is from the bottom up. This can help us progressively see what is going into our final dbt model and work backwards to see how the other functions are being referenced. - -6. Let’s take a look at our predicted position alongside our feature variables. Open a new scratchpad and use the following query. I chose to order by the prediction of who would obtain a podium position: - ```sql - select * from {{ ref('predict_position') }} order by position_predicted - ``` -7. We can see that we created predictions in our final dataset, we are ready to move on to testing! diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/13-testing.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/13-testing.md deleted file mode 100644 index bcda9a775fb..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/13-testing.md +++ /dev/null @@ -1,136 +0,0 @@ ---- -title: "Testing" -id: "13-testing" -description: "Testing" ---- -We have now completed building all the models for today’s lab, but how do we know if they meet our assertions? Put another way, how do we know the quality of our data models are any good? This brings us to testing! - -We test data models for mainly two reasons: - -- Ensure that our source data is clean on ingestion before we start data modeling/transformation (aka avoid garbage in, garbage out problem). -- Make sure we don’t introduce bugs in the transformation code we wrote (stop ourselves from creating bad joins/fanouts). - -Testing in dbt comes in two flavors: [generic](/docs/build/tests#generic-tests) and [singular](/docs/build/tests#singular-tests). 
- -You define them in a test block (similar to a macro) and once defined, you can reference them by name in your `.yml` files (applying them to models, columns, sources, snapshots, and seeds). - -You might be wondering: *what about testing Python models?* - -Since the output of our Python models are tables, we can test SQL and Python models the same way! We don’t have to worry about any syntax differences when testing SQL versus Python data models. This means we use `.yml` and `.sql` files to test our entities (tables, views, etc.). Under the hood, dbt is running an SQL query on our tables to see if they meet assertions. If no rows are returned, dbt will surface a passed test. Conversely, if a test results in returned rows, it will fail or warn depending on the configuration (more on that later). - -## Generic tests - -1. To implement generic out-of-the-box tests dbt comes with, we can use YAML files to specify information about our models. To add generic tests to our aggregates model, create a file called `aggregates.yml`, copy the code block below into the file, and save. - - - ```yaml - version: 2 - - models: - - name: fastest_pit_stops_by_constructor - description: Use the python .describe() method to retrieve summary statistics table about pit stops by constructor. Sort by average stop time ascending so the first row returns the fastest constructor. - columns: - - name: constructor_name - description: team that makes the car - tests: - - unique - - - name: lap_times_moving_avg - description: Use the python .rolling() method to calculate the 5 year rolling average of pit stop times alongside the average for each year. - columns: - - name: race_year - description: year of the race - tests: - - relationships: - to: ref('int_lap_times_years') - field: race_year - ``` - -2. Let’s unpack the code we have here. We have both our aggregates models with the model name to know the object we are referencing and the description of the model that we’ll populate in our documentation. At the column level (a level below our model), we are providing the column name followed by our tests. We want to ensure our `constructor_name` is unique since we used a pandas `groupby` on `constructor_name` in the model `fastest_pit_stops_by_constructor`. Next, we want to ensure our `race_year` has referential integrity from the model we selected from `int_lap_times_years` into our subsequent `lap_times_moving_avg` model. -3. Finally, if we want to see how tests were deployed on sources and SQL models, we can look at other files in our project such as the `f1_sources.yml` we created in our Sources and staging section. - -## Using macros for testing - -1. Under your `macros` folder, create a new file and name it `test_all_values_gte_zero.sql`. Copy the code block below and save the file. For clarity, “gte” is an abbreviation for greater than or equal to. - - - ```sql - {% macro test_all_values_gte_zero(table, column) %} - - select * from {{ ref(table) }} where {{ column }} < 0 - - {% endmacro %} - ``` - -2. Macros in Jinja are pieces of code that can be reused multiple times in our SQL models — they are analogous to "functions" in other programming languages, and are extremely useful if you find yourself repeating code across multiple models. -3. We use the `{% macro %}` to indicate the start of the macro and `{% endmacro %}` for the end. The text after the beginning of the macro block is the name we are giving the macro to later call it. In this case, our macro is called `test_all_values_gte_zero`. 
Macros take in *arguments* to pass through, in this case the `table` and the `column`. In the body of the macro, we see an SQL statement that is using the `ref` function to dynamically select the table and then the column. You can always view macros without having to run them by using `dbt run-operation`. You can learn more [here](https://docs.getdbt.com/reference/commands/run-operation). -4. Great, now we want to reference this macro as a test! Let’s create a new test file called `macro_pit_stops_mean_is_positive.sql` in our `tests` folder. - - - -5. Copy the following code into the file and save: - - ```sql - {{ - config( - enabled=true, - severity='warn', - tags = ['bi'] - ) - }} - - {{ test_all_values_gte_zero('fastest_pit_stops_by_constructor', 'mean') }} - ``` - -6. In our testing file, we are applying some configurations to the test including `enabled`, which is an optional configuration for disabling models, seeds, snapshots, and tests. Our severity is set to `warn` instead of `error`, which means our pipeline will still continue to run. We have tagged our test with `bi` since we are applying this test to one of our bi models. - -Then, in our final line, we are calling the `test_all_values_gte_zero` macro that takes in our table and column arguments and inputting our table `'fastest_pit_stops_by_constructor'` and the column `'mean'`. - -## Custom singular tests to validate Python models - -The simplest way to define a test is by writing the exact SQL that will return failing records. We call these "singular" tests, because they're one-off assertions usable for a single purpose. - -These tests are defined in `.sql` files, typically in your `tests` directory (as defined by your test-paths config). You can use Jinja in SQL models (including ref and source) in the test definition, just like you can when creating models. Each `.sql` file contains one select statement, and it defines one test. - -Let’s add a custom test that asserts that the moving average of the lap time over the last 5 years is greater than zero (it’s impossible to have time less than 0!). It is easy to assume if this is not the case the data has been corrupted. - -1. Create a file `lap_times_moving_avg_assert_positive_or_null.sql` under the `tests` folder. - - -2. Copy the following code and save the file: - - ```sql - {{ - config( - enabled=true, - severity='error', - tags = ['bi'] - ) - }} - - with lap_times_moving_avg as ( select * from {{ ref('lap_times_moving_avg') }} ) - - select * - from lap_times_moving_avg - where lap_moving_avg_5_years < 0 and lap_moving_avg_5_years is not null - ``` - -## Putting all our tests together - -1. Time to run our tests! Altogether, we have created 4 tests for our 2 Python models: - - `fastest_pit_stops_by_constructor` - - Unique `constructor_name` - - Lap times are greater than 0 or null (to allow for the first leading values in a rolling calculation) - - `lap_times_moving_avg` - - Referential test on `race_year` - - Mean pit stop times are greater than or equal to 0 (no negative time values) -2. To run the tests on both our models, we can use this syntax in the command line to run them both at once, similar to how we did our data splits earlier. - Execute the following in the command bar: - ```bash - dbt test --select fastest_pit_stops_by_constructor lap_times_moving_avg - ``` - - -3. All 4 of our tests passed (yay for clean data)! To understand the SQL being run against each of our tables, we can click into the details of the test. -4. 
Navigating into the **Details** of the `unique_fastest_pit_stops_by_constructor_name`, we can see that each line `constructor_name` should only have one row. - \ No newline at end of file diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/14-documentation.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/14-documentation.md deleted file mode 100644 index 95ec8ad242f..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/14-documentation.md +++ /dev/null @@ -1,29 +0,0 @@ ---- -title: "Documentation" -id: "14-documentation" -description: "Documentation" ---- -When it comes to documentation, dbt brings together both column and model level descriptions that you can provide as well as details from your Snowflake information schema in a static site for consumption by other data team members and stakeholders. - -We are going to revisit 2 areas of our project to understand our documentation: - -- `intermediate.md` file -- `dbt_project.yml` file - -To start, let’s look back at our `intermediate.md` file. We can see that we provided multi-line descriptions for the models in our intermediate models using [docs blocks](/docs/collaborate/documentation#using-docs-blocks). Then we reference these docs blocks in our `.yml` file. Building descriptions with doc blocks in Markdown files gives you the ability to format your descriptions with Markdown and are particularly helpful when building long descriptions, either at the column or model level. In our `dbt_project.yml`, we added `node_colors` at folder levels. - -1. To see all these pieces come together, execute this in the command bar: - ```bash - dbt docs generate - ``` - This will generate the documentation for your project. Click the book button, as shown in the screenshot below to access the docs. - - -2. Go to our project area and view `int_results`. View the description that we created in our doc block. - - -3. View the mini-lineage that looks at the model we are currently selected on (`int_results` in this case). - - -4. In our `dbt_project.yml`, we configured `node_colors` depending on the file directory. Starting in dbt v1.3, we can see how our lineage in our docs looks. By color coding your project, it can help you cluster together similar models or steps and more easily troubleshoot. - \ No newline at end of file diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/15-deployment.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/15-deployment.md deleted file mode 100644 index d9cedb60861..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/15-deployment.md +++ /dev/null @@ -1,50 +0,0 @@ ---- -title: "Deployment" -id: "15-deployment" -description: "Deployment" ---- - -Before we jump into deploying our code, let's have a quick primer on environments. Up to this point, all of the work we've done in the dbt Cloud IDE has been in our development environment, with code committed to a feature branch and the models we've built created in our development schema in Snowflake as defined in our Development environment connection. Doing this work on a feature branch, allows us to separate our code from what other coworkers are building and code that is already deemed production ready. Building models in a development schema in Snowflake allows us to separate the database objects we might still be modifying and testing from the database objects running production dashboards or other downstream dependencies. 
Together, the combination of a Git branch and Snowflake database objects form our environment. - -Now that we've completed testing and documenting our work, we're ready to deploy our code from our development environment to our production environment and this involves two steps: - -- Promoting code from our feature branch to the production branch in our repository. - - Generally, the production branch is going to be named your main branch and there's a review process to go through before merging code to the main branch of a repository. Here we are going to merge without review for ease of this workshop. -- Deploying code to our production environment. - - Once our code is merged to the main branch, we'll need to run dbt in our production environment to build all of our models and run all of our tests. This will allow us to build production-ready objects into our production environment in Snowflake. Luckily for us, the Partner Connect flow has already created our deployment environment and job to facilitate this step. - -1. Before getting started, let's make sure that we've committed all of our work to our feature branch. If you still have work to commit, you'll be able to select the **Commit and push**, provide a message, and then select **Commit** again. -2. Once all of your work is committed, the git workflow button will now appear as **Merge to main**. Select **Merge to main** and the merge process will automatically run in the background. - - -3. When it's completed, you should see the git button read **Create branch** and the branch you're currently looking at will become **main**. -4. Now that all of our development work has been merged to the main branch, we can build our deployment job. Given that our production environment and production job were created automatically for us through Partner Connect, all we need to do here is update some default configurations to meet our needs. -5. In the menu, select **Deploy** **> Environments** - - -6. You should see two environments listed and you'll want to select the **Deployment** environment then **Settings** to modify it. -7. Before making any changes, let's touch on what is defined within this environment. The Snowflake connection shows the credentials that dbt Cloud is using for this environment and in our case they are the same as what was created for us through Partner Connect. Our deployment job will build in our `PC_DBT_DB` database and use the default Partner Connect role and warehouse to do so. The deployment credentials section also uses the info that was created in our Partner Connect job to create the credential connection. However, it is using the same default schema that we've been using as the schema for our development environment. -8. Let's update the schema to create a new schema specifically for our production environment. Click **Edit** to allow you to modify the existing field values. Navigate to **Deployment Credentials >** **schema.** -9. Update the schema name to **production**. Remember to select **Save** after you've made the change. - -10. By updating the schema for our production environment to **production**, it ensures that our deployment job for this environment will build our dbt models in the **production** schema within the `PC_DBT_DB` database as defined in the Snowflake Connection section. -11. Now let's switch over to our production job. Click on the deploy tab again and then select **Jobs**. You should see an existing and preconfigured **Partner Connect Trial Job**. 
Similar to the environment, click on the job, then select **Settings** to modify it. Let's take a look at the job to understand it before making changes. - - - The Environment section is what connects this job with the environment we want it to run in. This job is already defaulted to use the Deployment environment that we just updated and the rest of the settings we can keep as is. - - The Execution settings section gives us the option to generate docs, run source freshness, and defer to a previous run state. For the purposes of our lab, we're going to keep these settings as is as well and stick with just generating docs. - - The Commands section is where we specify exactly which commands we want to run during this job, and we also want to keep this as is. We want our seed to be uploaded first, then run our models, and finally test them. The order of this is important as well, considering that we need our seed to be created before we can run our incremental model, and we need our models to be created before we can test them. - - Finally, we have the Triggers section, where we have a number of different options for scheduling our job. Given that our data isn't updating regularly here and we're running this job manually for now, we're also going to leave this section alone. - - So, what are we changing then? Just the name! Click **Edit** to allow you to make changes. Then update the name of the job to **Production Job** to denote this as our production deployment job. After that's done, click **Save**. -12. Now let's go to run our job. Clicking on the job name in the path at the top of the screen will take you back to the job run history page where you'll be able to click **Run run** to kick off the job. If you encounter any job failures, try running the job again before further troubleshooting. - - - -13. Let's go over to Snowflake to confirm that everything built as expected in our production schema. Refresh the database objects in your Snowflake account and you should see the production schema now within our default Partner Connect database. If you click into the schema and everything ran successfully, you should be able to see all of the models we developed. - - -## Conclusion - -Fantastic! You’ve finished the workshop! We hope you feel empowered in using both SQL and Python in your dbt Cloud workflows with Snowflake. Having a reliable pipeline to surface both analytics and machine learning is crucial to creating tangible business value from your data. - -For more help and information join our [dbt community Slack](https://www.getdbt.com/community/) which contains more than 50,000 data practitioners today. We have a dedicated slack channel #db-snowflake to Snowflake related content. Happy dbt'ing! \ No newline at end of file diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/2-snowflake-configuration.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/2-snowflake-configuration.md deleted file mode 100644 index e864c363a44..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/2-snowflake-configuration.md +++ /dev/null @@ -1,27 +0,0 @@ ---- -title: "Configure Snowflake" -id: "2-snowflake-configuration" -description: "Configure Snowflake" ---- - - -1. Log in to your trial Snowflake account. You can [sign up for a Snowflake Trial Account using this form](https://signup.snowflake.com/) if you don’t have one. -2. Ensure that your account is set up using **AWS** in the **US East (N. Virginia)**. 
We will be copying the data from a public AWS S3 bucket hosted by dbt Labs in the us-east-1 region. By ensuring our Snowflake environment setup matches our bucket region, we avoid any multi-region data copy and retrieval latency issues. - - - -3. After creating your account and verifying it from your sign-up email, Snowflake will direct you back to the UI called Snowsight. - -4. When Snowsight first opens, your window should look like the following, with you logged in as the ACCOUNTADMIN with demo worksheets open: - - - - -5. Navigate to **Admin > Billing & Terms**. Click **Enable > Acknowledge & Continue** to enable Anaconda Python Packages to run in Snowflake. - - - - - -6. Finally, create a new Worksheet by selecting **+ Worksheet** in the upper right corner. - diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/3-connect-to-data-source.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/3-connect-to-data-source.md deleted file mode 100644 index 9a41e7f45c5..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/3-connect-to-data-source.md +++ /dev/null @@ -1,192 +0,0 @@ ---- -title: "Connect to data source" -id: "3-connect-to-data-source" -description: "Connect to data source" ---- - -We need to obtain our data source by copying our Formula 1 data into Snowflake tables from a public S3 bucket that dbt Labs hosts. - -1. When a new Snowflake account is created, there should be a preconfigured warehouse in your account named `COMPUTE_WH`. -2. If for any reason your account doesn’t have this warehouse, we can create a warehouse using the following script: - - ```sql - create or replace warehouse COMPUTE_WH with warehouse_size=XSMALL - ``` -3. Rename the worksheet to `data setup script` since we will be placing code in this worksheet to ingest the Formula 1 data. Make sure you are still logged in as the **ACCOUNTADMIN** and select the **COMPUTE_WH** warehouse. - - - -4. Copy the following code into the main body of the Snowflake worksheet. You can also find this setup script under the `setup` folder in the [Git repository](https://github.com/dbt-labs/python-snowpark-formula1/blob/main/setup/setup_script_s3_to_snowflake.sql). The script is long since it's bring in all of the data we'll need today! 
- - ```sql - -- create and define our formula1 database - create or replace database formula1; - use database formula1; - create or replace schema raw; - use schema raw; - - -- define our file format for reading in the csvs - create or replace file format csvformat - type = csv - field_delimiter =',' - field_optionally_enclosed_by = '"', - skip_header=1; - - -- - create or replace stage formula1_stage - file_format = csvformat - url = 's3://formula1-dbt-cloud-python-demo/formula1-kaggle-data/'; - - -- load in the 8 tables we need for our demo - -- we are first creating the table then copying our data in from s3 - -- think of this as an empty container or shell that we are then filling - create or replace table formula1.raw.circuits ( - CIRCUITID NUMBER(38,0), - CIRCUITREF VARCHAR(16777216), - NAME VARCHAR(16777216), - LOCATION VARCHAR(16777216), - COUNTRY VARCHAR(16777216), - LAT FLOAT, - LNG FLOAT, - ALT NUMBER(38,0), - URL VARCHAR(16777216) - ); - -- copy our data from public s3 bucket into our tables - copy into circuits - from @formula1_stage/circuits.csv - on_error='continue'; - - create or replace table formula1.raw.constructors ( - CONSTRUCTORID NUMBER(38,0), - CONSTRUCTORREF VARCHAR(16777216), - NAME VARCHAR(16777216), - NATIONALITY VARCHAR(16777216), - URL VARCHAR(16777216) - ); - copy into constructors - from @formula1_stage/constructors.csv - on_error='continue'; - - create or replace table formula1.raw.drivers ( - DRIVERID NUMBER(38,0), - DRIVERREF VARCHAR(16777216), - NUMBER VARCHAR(16777216), - CODE VARCHAR(16777216), - FORENAME VARCHAR(16777216), - SURNAME VARCHAR(16777216), - DOB DATE, - NATIONALITY VARCHAR(16777216), - URL VARCHAR(16777216) - ); - copy into drivers - from @formula1_stage/drivers.csv - on_error='continue'; - - create or replace table formula1.raw.lap_times ( - RACEID NUMBER(38,0), - DRIVERID NUMBER(38,0), - LAP NUMBER(38,0), - POSITION FLOAT, - TIME VARCHAR(16777216), - MILLISECONDS NUMBER(38,0) - ); - copy into lap_times - from @formula1_stage/lap_times.csv - on_error='continue'; - - create or replace table formula1.raw.pit_stops ( - RACEID NUMBER(38,0), - DRIVERID NUMBER(38,0), - STOP NUMBER(38,0), - LAP NUMBER(38,0), - TIME VARCHAR(16777216), - DURATION VARCHAR(16777216), - MILLISECONDS NUMBER(38,0) - ); - copy into pit_stops - from @formula1_stage/pit_stops.csv - on_error='continue'; - - create or replace table formula1.raw.races ( - RACEID NUMBER(38,0), - YEAR NUMBER(38,0), - ROUND NUMBER(38,0), - CIRCUITID NUMBER(38,0), - NAME VARCHAR(16777216), - DATE DATE, - TIME VARCHAR(16777216), - URL VARCHAR(16777216), - FP1_DATE VARCHAR(16777216), - FP1_TIME VARCHAR(16777216), - FP2_DATE VARCHAR(16777216), - FP2_TIME VARCHAR(16777216), - FP3_DATE VARCHAR(16777216), - FP3_TIME VARCHAR(16777216), - QUALI_DATE VARCHAR(16777216), - QUALI_TIME VARCHAR(16777216), - SPRINT_DATE VARCHAR(16777216), - SPRINT_TIME VARCHAR(16777216) - ); - copy into races - from @formula1_stage/races.csv - on_error='continue'; - - create or replace table formula1.raw.results ( - RESULTID NUMBER(38,0), - RACEID NUMBER(38,0), - DRIVERID NUMBER(38,0), - CONSTRUCTORID NUMBER(38,0), - NUMBER NUMBER(38,0), - GRID NUMBER(38,0), - POSITION FLOAT, - POSITIONTEXT VARCHAR(16777216), - POSITIONORDER NUMBER(38,0), - POINTS NUMBER(38,0), - LAPS NUMBER(38,0), - TIME VARCHAR(16777216), - MILLISECONDS NUMBER(38,0), - FASTESTLAP NUMBER(38,0), - RANK NUMBER(38,0), - FASTESTLAPTIME VARCHAR(16777216), - FASTESTLAPSPEED FLOAT, - STATUSID NUMBER(38,0) - ); - copy into results - from @formula1_stage/results.csv - 
on_error='continue'; - - create or replace table formula1.raw.status ( - STATUSID NUMBER(38,0), - STATUS VARCHAR(16777216) - ); - copy into status - from @formula1_stage/status.csv - on_error='continue'; - - ``` -5. Ensure all the commands are selected before running the query — an easy way to do this is to use Ctrl-a to highlight all of the code in the worksheet. Select **run** (blue triangle icon). Notice how the dot next to your **COMPUTE_WH** turns from gray to green as you run the query. The **status** table is the final table of all 8 tables loaded in. - - - -6. Let’s unpack that pretty long query we ran into component parts. We ran this query to load in our 8 Formula 1 tables from a public S3 bucket. To do this, we: - - Created a new database called `formula1` and a schema called `raw` to place our raw (untransformed) data into. - - Defined our file format for our CSV files. Importantly, here we use a parameter called `field_optionally_enclosed_by =` since the string columns in our Formula 1 csv files use quotes. Quotes are used around string values to avoid parsing issues where commas `,` and new lines `/n` in data values could cause data loading errors. - - Created a stage to locate our data we are going to load in. Snowflake Stages are locations where data files are stored. Stages are used to both load and unload data to and from Snowflake locations. Here we are using an external stage, by referencing an S3 bucket. - - Created our tables for our data to be copied into. These are empty tables with the column name and data type. Think of this as creating an empty container that the data will then fill into. - - Used the `copy into` statement for each of our tables. We reference our staged location we created and upon loading errors continue to load in the rest of the data. You should not have data loading errors but if you do, those rows will be skipped and Snowflake will tell you which rows caused errors - -7. Now let's take a look at some of our cool Formula 1 data we just loaded up! - 1. Create a new worksheet by selecting the **+** then **New Worksheet**. - - 2. Navigate to **Database > Formula1 > RAW > Tables**. - 3. Query the data using the following code. There are only 76 rows in the circuits table, so we don’t need to worry about limiting the amount of data we query. - ```sql - select * from formula1.raw.circuits - ``` - 4. Run the query. From here on out, we’ll use the keyboard shortcuts Command-Enter or Control-Enter to run queries and won’t explicitly call out this step. - 5. Review the query results, you should see information about Formula 1 circuits, starting with Albert Park in Australia! - 6. Finally, ensure you have all 8 tables starting with `CIRCUITS` and ending with `STATUS`. Now we are ready to connect into dbt Cloud! - - - - \ No newline at end of file diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/4-configure-dbt.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/4-configure-dbt.md deleted file mode 100644 index 21eaa7e8d7f..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/4-configure-dbt.md +++ /dev/null @@ -1,27 +0,0 @@ ---- -title: "Configure dbt" -id: "4-configure-dbt" -description: "Configure dbt" ---- - -1. We are going to be using [Snowflake Partner Connect](https://docs.snowflake.com/en/user-guide/ecosystem-partner-connect.html) to set up a dbt Cloud account. 
Using this method will allow you to spin up a fully fledged dbt account with your [Snowflake connection](/docs/cloud/connect-data-platform/connect-snowflake), [managed repository](/docs/collaborate/git/managed-repository), environments, and credentials already established. -2. Navigate out of your worksheet back by selecting **home**. -3. In Snowsight, confirm that you are using the **ACCOUNTADMIN** role. -4. Navigate to the **Admin** **> Partner Connect**. Find **dbt** either by using the search bar or navigating the **Data Integration**. Select the **dbt** tile. - -5. You should now see a new window that says **Connect to dbt**. Select **Optional Grant** and add the `FORMULA1` database. This will grant access for your new dbt user role to the FORMULA1 database. - - -6. Ensure the `FORMULA1` is present in your optional grant before clicking **Connect**.  This will create a dedicated dbt user, database, warehouse, and role for your dbt Cloud trial. - - - -7. When you see the **Your partner account has been created** window, click **Activate**. - -8. You should be redirected to a dbt Cloud registration page. Fill out the form. Make sure to save the password somewhere for login in the future. - - - -9. Select **Complete Registration**. You should now be redirected to your dbt Cloud account, complete with a connection to your Snowflake account, a deployment and a development environment, and a sample job. - -10. To help you version control your dbt project, we have connected it to a [managed repository](/docs/collaborate/git/managed-repository), which means that dbt Labs will be hosting your repository for you. This will give you access to a Git workflow without you having to create and host the repository yourself. You will not need to know Git for this workshop; dbt Cloud will help guide you through the workflow. In the future, when you’re developing your own project, [feel free to use your own repository](/docs/cloud/git/connect-github). This will allow you to learn more about features like [Slim CI](/docs/deploy/continuous-integration) builds after this workshop. diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/5-development-schema-name.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/5-development-schema-name.md deleted file mode 100644 index f098c47bdad..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/5-development-schema-name.md +++ /dev/null @@ -1,46 +0,0 @@ ---- -title: "Development schema name and IDE walkthrough" -id: "5-development-schema-name" -description: "Development schema name and IDE walkthrough" ---- - -1. First we are going to change the name of our default schema to where our dbt models will build. By default, the name is `dbt_`. We will change this to `dbt_` to create your own personal development schema. To do this, select **Profile Settings** from the gear icon in the upper right. - - - -2. Navigate to the **Credentials** menu and select **Partner Connect Trial**, which will expand the credentials menu. - - - -3. Click **Edit** and change the name of your schema from `dbt_` to `dbt_YOUR_NAME` replacing `YOUR_NAME` with your initials and name (`hwatson` is used in the lab screenshots). Be sure to click **Save** for your changes! - - -4. We now have our own personal development schema, amazing! When we run our first dbt models they will build into this schema. -5. Let’s open up dbt Cloud’s Integrated Development Environment (IDE) and familiarize ourselves. Choose **Develop** at the top of the UI. - -6. 
When the IDE is done loading, click **Initialize dbt project**. The initialization process creates a collection of files and folders necessary to run your dbt project. - - -7. After the initialization is finished, you can view the files and folders in the file tree menu. As we move through the workshop we'll be sure to touch on a few key files and folders that we'll work with to build out our project. -8. Next click **Commit and push** to commit the new files and folders from the initialize step. We always want our commit messages to be relevant to the work we're committing, so be sure to provide a message like `initialize project` and select **Commit Changes**. - - - - - -9. [Committing](https://www.atlassian.com/git/tutorials/saving-changes/git-commit) your work here will save it to the managed git repository that was created during the Partner Connect signup. This initial commit is the only commit that will be made directly to our `main` branch and from *here on out we'll be doing all of our work on a development branch*. This allows us to keep our development work separate from our production code. -10. There are a couple of key features to point out about the IDE before we get to work. It is a text editor, an SQL and Python runner, and a CLI with Git version control all baked into one package! This allows you to focus on editing your SQL and Python files, previewing the results with the SQL runner (it even runs Jinja!), and building models at the command line without having to move between different applications. The Git workflow in dbt Cloud allows both Git beginners and experts alike to be able to easily version control all of their work with a couple clicks. - - - -11. Let's run our first dbt models! Two example models are included in your dbt project in the `models/examples` folder that we can use to illustrate how to run dbt at the command line. Type `dbt run` into the command line and click **Enter** on your keyboard. When the run bar expands you'll be able to see the results of the run, where you should see the run complete successfully. - - - -12. The run results allow you to see the code that dbt compiles and sends to Snowflake for execution. To view the logs for this run, select one of the model tabs using the  **>** icon and then **Details**. If you scroll down a bit you'll be able to see the compiled code and how dbt interacts with Snowflake. Given that this run took place in our development environment, the models were created in your development schema. - - - - -13. Now let's switch over to Snowflake to confirm that the objects were actually created. Click on the three dots **…** above your database objects and then **Refresh**. Expand the **PC_DBT_DB** database and you should see your development schema. Select the schema, then **Tables**  and **Views**. Now you should be able to see `MY_FIRST_DBT_MODEL` as a table and `MY_SECOND_DBT_MODEL` as a view. - \ No newline at end of file diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/6-foundational-structure.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/6-foundational-structure.md deleted file mode 100644 index 8a938e10c34..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/6-foundational-structure.md +++ /dev/null @@ -1,80 +0,0 @@ ---- -title: "Foundational structure" -id: "6-foundational-structure" -description: "Foundational structure" ---- - -In this step, we’ll need to create a development branch and set up project level configurations. - -1. 
To get started with development for our project, we'll need to create a new Git branch for our work. Select **create branch** and name your development branch. We'll call our branch `snowpark_python_workshop` then click **Submit**. -2. The first piece of development we'll do on the project is to update the `dbt_project.yml` file. Every dbt project requires a `dbt_project.yml` file — this is how dbt knows a directory is a dbt project. The [dbt_project.yml](/reference/dbt_project.yml) file also contains important information that tells dbt how to operate on your project. -3. Select the `dbt_project.yml` file from the file tree to open it and replace all of the existing contents with the following code below. When you're done, save the file by clicking **save**. You can also use the Command-S or Control-S shortcut from here on out. - - ```yaml - # Name your project! Project names should contain only lowercase characters - # and underscores. A good package name should reflect your organization's - # name or the intended use of these models - name: 'snowflake_dbt_python_formula1' - version: '1.3.0' - require-dbt-version: '>=1.3.0' - config-version: 2 - - # This setting configures which "profile" dbt uses for this project. - profile: 'default' - - # These configurations specify where dbt should look for different types of files. - # The `model-paths` config, for example, states that models in this project can be - # found in the "models/" directory. You probably won't need to change these! - model-paths: ["models"] - analysis-paths: ["analyses"] - test-paths: ["tests"] - seed-paths: ["seeds"] - macro-paths: ["macros"] - snapshot-paths: ["snapshots"] - - target-path: "target" # directory which will store compiled SQL files - clean-targets: # directories to be removed by `dbt clean` - - "target" - - "dbt_packages" - - models: - snowflake_dbt_python_formula1: - staging: - - +docs: - node_color: "CadetBlue" - marts: - +materialized: table - aggregates: - +docs: - node_color: "Maroon" - +tags: "bi" - - core: - +docs: - node_color: "#800080" - intermediate: - +docs: - node_color: "MediumSlateBlue" - ml: - prep: - +docs: - node_color: "Indigo" - train_predict: - +docs: - node_color: "#36454f" - - ``` - -4. The key configurations to point out in the file with relation to the work that we're going to do are in the `models` section. - - `require-dbt-version` — Tells dbt which version of dbt to use for your project. We are requiring 1.3.0 and any newer version to run python models and node colors. - - `materialized` — Tells dbt how to materialize models when compiling the code before it pushes it down to Snowflake. All models in the `marts` folder will be built as tables. - - `tags` — Applies tags at a directory level to all models. All models in the `aggregates` folder will be tagged as `bi` (abbreviation for business intelligence). - - `docs` — Specifies the `node_color` either by the plain color name or a hex value. -5. [Materializations](/docs/build/materializations) are strategies for persisting dbt models in a warehouse, with `tables` and `views` being the most commonly utilized types. By default, all dbt models are materialized as views and other materialization types can be configured in the `dbt_project.yml` file or in a model itself. It’s very important to note *Python models can only be materialized as tables or incremental models.* Since all our Python models exist under `marts`, the following portion of our `dbt_project.yml` ensures no errors will occur when we run our Python models. 
Starting with [dbt version 1.4](/docs/dbt-versions/core-upgrade/upgrading-to-v1.4#updates-to-python-models), Python files will automatically get materialized as tables even if not explicitly specified. - - ```yaml - marts:     - +materialized: table - ``` - diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/7-folder-structure.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/7-folder-structure.md deleted file mode 100644 index a47a3b54d48..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/7-folder-structure.md +++ /dev/null @@ -1,27 +0,0 @@ ---- -title: "Folder structure" -id: "7-folder-structure" -description: "Folder structure" ---- -dbt Labs has developed a [project structure guide](/guides/best-practices/how-we-structure/1-guide-overview/) that contains a number of recommendations for how to build the folder structure for your project. Do check out that guide if you want to learn more. Right now we are going to create some folders to organize our files: - -- Sources — This is our Formula 1 dataset and it will be defined in a source YAML file. -- Staging models — These models have a 1:1 with their source table. -- Intermediate — This is where we will be joining some Formula staging models. -- Marts models — Here is where we perform our major transformations. It contains these subfolders: - - aggregates - - core - - ml -1. In your file tree, use your cursor and hover over the `models` subdirectory, click the three dots **…** that appear to the right of the folder name, then select **Create Folder**. We're going to add two new folders to the file path, `staging` and `formula1` (in that order) by typing `staging/formula1` into the file path. - - - - - - If you click into your `models` directory now, you should see the new `staging` folder nested within `models` and the `formula1` folder nested within `staging`. -2. Create two additional folders the same as the last step. Within the `models` subdirectory, create new directories `marts/core`. - -3. We will need to create a few more folders and subfolders using the UI. After you create all the necessary folders, your folder tree should look like this when it's all done: - - - -Remember you can always reference the entire project in [GitHub](https://github.com/dbt-labs/python-snowpark-formula1/tree/python-formula1) to view the complete folder and file strucutre. \ No newline at end of file diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/8-sources-and-staging.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/8-sources-and-staging.md deleted file mode 100644 index 22e49c8a30b..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/8-sources-and-staging.md +++ /dev/null @@ -1,334 +0,0 @@ ---- -title: "Sources and staging" -id: "8-sources-and-staging" -description: "Sources and staging" ---- - -In this section, we are going to create our source and staging models. - -Sources allow us to create a dependency between our source database object and our staging models which will help us when we look at later. Also, if your source changes database or schema, you only have to update it in your `f1_sources.yml` file rather than updating all of the models it might be used in. - -Staging models are the base of our project, where we bring all the individual components we're going to use to build our more complex and useful models into the project. 
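To make that dependency concrete, here is a minimal illustrative sketch of how a staging model points at a source instead of hard-coding a table name — the real `f1_sources.yml` and staging models are built step by step below, so treat this as a preview only:

```sql
-- preview only: the full staging models are created later in this section
select
    circuitid as circuit_id,
    name      as circuit_name
from {{ source('formula1', 'circuits') }}   -- compiles to formula1.raw.circuits
-- if the raw data ever moves, only f1_sources.yml needs to change
```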
- -Since we want to focus on dbt and Python in this workshop, check out our [sources](/docs/build/sources) and [staging](/guides/best-practices/how-we-structure/2-staging) docs if you want to learn more (or take our [dbt Fundamentals](https://courses.getdbt.com/collections) course which covers all of our core functionality). - -## Create sources - -We're going to be using each of our 8 Formula 1 tables from our `formula1` database under the `raw`  schema for our transformations and we want to create those tables as sources in our project. - -1. Create a new file called `f1_sources.yml` with the following file path: `models/staging/formula1/f1_sources.yml`. -2. Then, paste the following code into the file before saving it: - -```yaml -version: 2 - -sources: - - name: formula1 - description: formula 1 datasets with normalized tables - database: formula1 - schema: raw - tables: - - name: circuits - description: One record per circuit, which is the specific race course. - columns: - - name: circuitid - tests: - - unique - - not_null - - name: constructors - description: One record per constructor. Constructors are the teams that build their formula 1 cars. - columns: - - name: constructorid - tests: - - unique - - not_null - - name: drivers - description: One record per driver. This table gives details about the driver. - columns: - - name: driverid - tests: - - unique - - not_null - - name: lap_times - description: One row per lap in each race. Lap times started being recorded in this dataset in 1984 and joined through driver_id. - - name: pit_stops - description: One row per pit stop. Pit stops do not have their own id column, the combination of the race_id and driver_id identify the pit stop. - columns: - - name: stop - tests: - - accepted_values: - values: [1,2,3,4,5,6,7,8] - quote: false - - name: races - description: One race per row. Importantly this table contains the race year to understand trends. - columns: - - name: raceid - tests: - - unique - - not_null - - name: results - columns: - - name: resultid - tests: - - unique - - not_null - description: One row per result. The main table that we join out for grid and position variables. - - name: status - description: One status per row. The status contextualizes whether the race was finished or what issues arose e.g. collisions, engine, etc. - columns: - - name: statusid - tests: - - unique - - not_null -``` - -## Create staging models - -The next step is to set up the staging models for each of the 8 source tables. Given the one-to-one relationship between staging models and their corresponding source tables, we'll build 8 staging models here. We know it’s a lot and in the future, we will seek to update the workshop to make this step less repetitive and more efficient. This step is also a good representation of the real world of data, where you have multiple hierarchical tables that you will need to join together! - -1. Let's go in alphabetical order to easily keep track of all our staging models! Create a new file called `stg_f1_circuits.sql` with this file path `models/staging/formula1/stg_f1_circuits.sql`. 
Then, paste the following code into the file before saving it: - - ```sql - with - - source as ( - - select * from {{ source('formula1','circuits') }} - - ), - - renamed as ( - select - circuitid as circuit_id, - circuitref as circuit_ref, - name as circuit_name, - location, - country, - lat as latitude, - lng as longitude, - alt as altitude - -- omit the url - from source - ) - select * from renamed - ``` - - All we're doing here is pulling the source data into the model using the `source` function, renaming some columns, and omitting the column `url` with a commented note since we don’t need it for our analysis. - -1. Create `stg_f1_constructors.sql` with this file path `models/staging/formula1/stg_f1_constructors.sql`. Paste the following code into it before saving the file: - - ```sql - with - - source as ( - - select * from {{ source('formula1','constructors') }} - - ), - - renamed as ( - select - constructorid as constructor_id, - constructorref as constructor_ref, - name as constructor_name, - nationality as constructor_nationality - -- omit the url - from source - ) - - select * from renamed - ``` - - We have 6 other stages models to create. We can do this by creating new files, then copy and paste the code into our `staging` folder. - -1. Create `stg_f1_drivers.sql` with this file path `models/staging/formula1/stg_f1_drivers.sql`: - - ```sql - with - - source as ( - - select * from {{ source('formula1','drivers') }} - - ), - - renamed as ( - select - driverid as driver_id, - driverref as driver_ref, - number as driver_number, - code as driver_code, - forename, - surname, - dob as date_of_birth, - nationality as driver_nationality - -- omit the url - from source - ) - - select * from renamed - ``` -1. Create `stg_f1_lap_times.sql` with this file path `models/staging/formula1/stg_f1_lap_times.sql`: - - ```sql - with - - source as ( - - select * from {{ source('formula1','lap_times') }} - - ), - - renamed as ( - select - raceid as race_id, - driverid as driver_id, - lap, - position, - time as lap_time_formatted, - milliseconds as lap_time_milliseconds - from source - ) - - select * from renamed - ``` -1. Create `stg_f1_pit_stops.sql` with this file path `models/staging/formula1/stg_f1_pit_stops.sql`: - - ```sql - with - - source as ( - - select * from {{ source('formula1','pit_stops') }} - - ), - - renamed as ( - select - raceid as race_id, - driverid as driver_id, - stop as stop_number, - lap, - time as lap_time_formatted, - duration as pit_stop_duration_seconds, - milliseconds as pit_stop_milliseconds - from source - ) - - select * from renamed - order by pit_stop_duration_seconds desc - ``` - -1. Create ` stg_f1_races.sql` with this file path `models/staging/formula1/stg_f1_races.sql`: - - ```sql - with - - source as ( - - select * from {{ source('formula1','races') }} - - ), - - renamed as ( - select - raceid as race_id, - year as race_year, - round as race_round, - circuitid as circuit_id, - name as circuit_name, - date as race_date, - to_time(time) as race_time, - -- omit the url - fp1_date as free_practice_1_date, - fp1_time as free_practice_1_time, - fp2_date as free_practice_2_date, - fp2_time as free_practice_2_time, - fp3_date as free_practice_3_date, - fp3_time as free_practice_3_time, - quali_date as qualifying_date, - quali_time as qualifying_time, - sprint_date, - sprint_time - from source - ) - - select * from renamed - ``` -1. 
Create `stg_f1_results.sql` with this file path `models/staging/formula1/stg_f1_results.sql`: - - ```sql - with - - source as ( - - select * from {{ source('formula1','results') }} - - ), - - renamed as ( - select - resultid as result_id, - raceid as race_id, - driverid as driver_id, - constructorid as constructor_id, - number as driver_number, - grid, - position::int as position, - positiontext as position_text, - positionorder as position_order, - points, - laps, - time as results_time_formatted, - milliseconds as results_milliseconds, - fastestlap as fastest_lap, - rank as results_rank, - fastestlaptime as fastest_lap_time_formatted, - fastestlapspeed::decimal(6,3) as fastest_lap_speed, - statusid as status_id - from source - ) - - select * from renamed - ``` -1. Last one! Create `stg_f1_status.sql` with this file path: `models/staging/formula1/stg_f1_status.sql`: - - ```sql - with - - source as ( - - select * from {{ source('formula1','status') }} - - ), - - renamed as ( - select - statusid as status_id, - status - from source - ) - - select * from renamed - ``` - After the source and all the staging models are complete for each of the 8 tables, your staging folder should look like this: - - - -1. It’s a good time to delete our example folder since these two models are extraneous to our formula1 pipeline and `my_first_model` fails a `not_null` test that we won’t spend time investigating. dbt Cloud will warn us that this folder will be permanently deleted, and we are okay with that so select **Delete**. - - - -1. Now that the staging models are built and saved, it's time to create the models in our development schema in Snowflake. To do this we're going to enter into the command line `dbt build` to run all of the models in our project, which includes the 8 new staging models and the existing example models. - - Your run should complete successfully and you should see green checkmarks next to all of your models in the run results. We built our 8 staging models as views and ran 13 source tests that we configured in the `f1_sources.yml` file with not that much code, pretty cool! - - - - Let's take a quick look in Snowflake, refresh database objects, open our development schema, and confirm that the new models are there. If you can see them, then we're good to go! - - - - Before we move onto the next section, be sure to commit your new models to your Git branch. Click **Commit and push** and give your commit a message like `profile, sources, and staging setup` before moving on. - - \ No newline at end of file diff --git a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/9-sql-transformations.md b/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/9-sql-transformations.md deleted file mode 100644 index 262bf0e5e52..00000000000 --- a/website/docs/guides/dbt-ecosystem/dbt-python-snowpark/9-sql-transformations.md +++ /dev/null @@ -1,299 +0,0 @@ ---- -title: "SQL transformations" -id: "9-sql-transformations" -description: "SQL transformations" ---- - -Now that we have all our sources and staging models done, it's time to move into where dbt shines — transformation! - -We need to: - -- Create some intermediate tables to join tables that aren’t hierarchical -- Create core tables for business intelligence (BI) tool ingestion -- Answer the two questions about: - - fastest pit stops - - lap time trends about our Formula 1 data by creating aggregate models using python! 
- -## Intermediate models - -We need to join lots of reference tables to our results table to create a human readable dataframe. What does this mean? For example, we don’t only want to have the numeric `status_id` in our table, we want to be able to read in a row of data that a driver could not finish a race due to engine failure (`status_id=5`). - -By now, we are pretty good at creating new files in the correct directories so we won’t cover this in detail. All intermediate models should be created in the path `models/intermediate`. - -1. Create a new file called `int_lap_times_years.sql`. In this model, we are joining our lap time and race information so we can look at lap times over years. In earlier Formula 1 eras, lap times were not recorded (only final results), so we filter out records where lap times are null. - - ```sql - with lap_times as ( - - select * from {{ ref('stg_f1_lap_times') }} - - ), - - races as ( - - select * from {{ ref('stg_f1_races') }} - - ), - - expanded_lap_times_by_year as ( - select - lap_times.race_id, - driver_id, - race_year, - lap, - lap_time_milliseconds - from lap_times - left join races - on lap_times.race_id = races.race_id - where lap_time_milliseconds is not null - ) - - select * from expanded_lap_times_by_year - ``` - -2. Create a file called `in_pit_stops.sql`. Pit stops are a many-to-one (M:1) relationship with our races. We are creating a feature called `total_pit_stops_per_race` by partitioning over our `race_id` and `driver_id`, while preserving individual level pit stops for rolling average in our next section. - - ```sql - with stg_f1__pit_stops as - ( - select * from {{ ref('stg_f1_pit_stops') }} - ), - - pit_stops_per_race as ( - select - race_id, - driver_id, - stop_number, - lap, - lap_time_formatted, - pit_stop_duration_seconds, - pit_stop_milliseconds, - max(stop_number) over (partition by race_id,driver_id) as total_pit_stops_per_race - from stg_f1__pit_stops - ) - - select * from pit_stops_per_race - ``` - -3. Create a file called `int_results.sql`. Here we are using 4 of our tables — `races`, `drivers`, `constructors`, and `status` — to give context to our `results` table. We are now able to calculate a new feature `drivers_age_years` by bringing the `date_of_birth` and `race_year` into the same table. We are also creating a column to indicate if the driver did not finish (dnf) the race, based upon if their `position` was null called, `dnf_flag`. 
- - ```sql - with results as ( - - select * from {{ ref('stg_f1_results') }} - - ), - - races as ( - - select * from {{ ref('stg_f1_races') }} - - ), - - drivers as ( - - select * from {{ ref('stg_f1_drivers') }} - - ), - - constructors as ( - - select * from {{ ref('stg_f1_constructors') }} - ), - - status as ( - - select * from {{ ref('stg_f1_status') }} - ), - - int_results as ( - select - result_id, - results.race_id, - race_year, - race_round, - circuit_id, - circuit_name, - race_date, - race_time, - results.driver_id, - results.driver_number, - forename ||' '|| surname as driver, - cast(datediff('year', date_of_birth, race_date) as int) as drivers_age_years, - driver_nationality, - results.constructor_id, - constructor_name, - constructor_nationality, - grid, - position, - position_text, - position_order, - points, - laps, - results_time_formatted, - results_milliseconds, - fastest_lap, - results_rank, - fastest_lap_time_formatted, - fastest_lap_speed, - results.status_id, - status, - case when position is null then 1 else 0 end as dnf_flag - from results - left join races - on results.race_id=races.race_id - left join drivers - on results.driver_id = drivers.driver_id - left join constructors - on results.constructor_id = constructors.constructor_id - left join status - on results.status_id = status.status_id - ) - - select * from int_results - ``` -1. Create a *Markdown* file `intermediate.md` that we will go over in depth during the [Testing](/guides/dbt-ecosystem/dbt-python-snowpark/13-testing) and [Documentation](/guides/dbt-ecosystem/dbt-python-snowpark/14-documentation) sections. - - ```markdown - # the intent of this .md is to allow for multi-line long form explanations for our intermediate transformations - - # below are descriptions - {% docs int_results %} In this query we want to join out other important information about the race results to have a human readable table about results, races, drivers, constructors, and status. - We will have 4 left joins onto our results table. {% enddocs %} - - {% docs int_pit_stops %} There are many pit stops within one race, aka a M:1 relationship. - We want to aggregate this so we can properly join pit stop information without creating a fanout. {% enddocs %} - - {% docs int_lap_times_years %} Lap times are done per lap. We need to join them out to the race year to understand yearly lap time trends. {% enddocs %} - ``` -1. Create a *YAML* file `intermediate.yml` that we will go over in depth during the [Testing](/guides/dbt-ecosystem/dbt-python-snowpark/13-testing) and [Documentation](/guides/dbt-ecosystem/dbt-python-snowpark/14-documentation) sections. - - ```yaml - version: 2 - - models: - - name: int_results - description: '{{ doc("int_results") }}' - - name: int_pit_stops - description: '{{ doc("int_pit_stops") }}' - - name: int_lap_times_years - description: '{{ doc("int_lap_times_years") }}' - ``` - That wraps up the intermediate models we need to create our core models! - -## Core models - -1. Create a file `fct_results.sql`. This is what I like to refer to as the “mega table” — a really large denormalized table with all our context added in at row level for human readability. Importantly, we have a table `circuits` that is linked through the table `races`. When we joined `races` to `results` in `int_results.sql` we allowed our tables to make the connection from `circuits` to `results` in `fct_results.sql`. 
We are only taking information about pit stops at the result level so our join would not cause a [fanout](https://community.looker.com/technical-tips-tricks-1021/what-is-a-fanout-23327). - - ```sql - with int_results as ( - - select * from {{ ref('int_results') }} - - ), - - int_pit_stops as ( - select - race_id, - driver_id, - max(total_pit_stops_per_race) as total_pit_stops_per_race - from {{ ref('int_pit_stops') }} - group by 1,2 - ), - - circuits as ( - - select * from {{ ref('stg_f1_circuits') }} - ), - base_results as ( - select - result_id, - int_results.race_id, - race_year, - race_round, - int_results.circuit_id, - int_results.circuit_name, - circuit_ref, - location, - country, - latitude, - longitude, - altitude, - total_pit_stops_per_race, - race_date, - race_time, - int_results.driver_id, - driver, - driver_number, - drivers_age_years, - driver_nationality, - constructor_id, - constructor_name, - constructor_nationality, - grid, - position, - position_text, - position_order, - points, - laps, - results_time_formatted, - results_milliseconds, - fastest_lap, - results_rank, - fastest_lap_time_formatted, - fastest_lap_speed, - status_id, - status, - dnf_flag - from int_results - left join circuits - on int_results.circuit_id=circuits.circuit_id - left join int_pit_stops - on int_results.driver_id=int_pit_stops.driver_id and int_results.race_id=int_pit_stops.race_id - ) - - select * from base_results - ``` - -1. Create the file `pit_stops_joined.sql`. Our results and pit stops are at different levels of dimensionality (also called grain). Simply put, we have multiple pit stops per a result. Since we are interested in understanding information at the pit stop level with information about race year and constructor, we will create a new table `pit_stops_joined.sql` where each row is per pit stop. Our new table tees up our aggregation in Python. - - ```sql - with base_results as ( - - select * from {{ ref('fct_results') }} - - ), - - pit_stops as ( - - select * from {{ ref('int_pit_stops') }} - - ), - - pit_stops_joined as ( - - select - base_results.race_id, - race_year, - base_results.driver_id, - constructor_id, - constructor_name, - stop_number, - lap, - lap_time_formatted, - pit_stop_duration_seconds, - pit_stop_milliseconds - from base_results - left join pit_stops - on base_results.race_id=pit_stops.race_id and base_results.driver_id=pit_stops.driver_id - ) - select * from pit_stops_joined - ``` - -1. Enter in the command line and execute `dbt build` to build out our entire pipeline to up to this point. Don’t worry about “overriding” your previous models – dbt workflows are designed to be idempotent so we can run them again and expect the same results. - -1. Let’s talk about our lineage so far. It’s looking good 😎. We’ve shown how SQL can be used to make data type, column name changes, and handle hierarchical joins really well; all while building out our automated lineage! - - - -1. Time to **Commit and push** our changes and give your commit a message like `intermediate and fact models` before moving on. 
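As a closing aside, a full `dbt build` is perfectly fine here because dbt runs are idempotent, but if you only want to rebuild the models from this section plus everything they depend on, dbt's node selection syntax can do that. A sketch of one possible invocation:

```bash
# build fct_results and pit_stops_joined along with all of their upstream models and tests
dbt build --select +fct_results +pit_stops_joined
```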
diff --git a/website/docs/guides/dbt-ecosystem/databricks-guides/how_to_optimize_dbt_models_on_databricks.md b/website/docs/guides/dbt-models-on-databricks.md similarity index 93% rename from website/docs/guides/dbt-ecosystem/databricks-guides/how_to_optimize_dbt_models_on_databricks.md rename to website/docs/guides/dbt-models-on-databricks.md index b5389645258..489a3c28467 100644 --- a/website/docs/guides/dbt-ecosystem/databricks-guides/how_to_optimize_dbt_models_on_databricks.md +++ b/website/docs/guides/dbt-models-on-databricks.md @@ -1,17 +1,26 @@ --- -title: How to optimize and troubleshoot dbt models on Databricks -sidebar_label: "How to optimize and troubleshoot dbt models on Databricks" +title: Optimize and troubleshoot dbt models on Databricks +id: optimize-dbt-models-on-databricks description: "Learn more about optimizing and troubleshooting your dbt models on Databricks" +displayText: Optimizing and troubleshooting your dbt models on Databricks +hoverSnippet: Learn how to optimize and troubleshoot your dbt models on Databricks. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'databricks' +hide_table_of_contents: true +tags: ['Databricks', 'dbt Core','dbt Cloud'] +level: 'Intermediate' +recently_updated: true --- +## Introduction -Continuing our Databricks and dbt guide series from the last [guide](/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project), it’s time to talk about performance optimization. In this follow-up post,  we outline simple strategies to optimize for cost, performance, and simplicity when architecting your data pipelines. We’ve encapsulated these strategies in this acronym-framework: +Building on the [Set up your dbt project with Databricks](/guides/set-up-your-databricks-dbt-project) guide, we'd like to discuss performance optimization. In this follow-up post, we outline simple strategies to optimize for cost, performance, and simplicity when you architect data pipelines. We’ve encapsulated these strategies in this acronym-framework: - Platform Components - Patterns & Best Practices - Performance Troubleshooting -## 1. Platform Components +## Platform Components As you start to develop your dbt projects, one of the first decisions you will make is what kind of backend infrastructure to run your models against. Databricks offers SQL warehouses, All-Purpose Compute, and Jobs Compute, each optimized to workloads they are catered to. Our recommendation is to use Databricks SQL warehouses for all your SQL workloads. SQL warehouses are optimized for SQL workloads when compared to other compute options, additionally, they can scale both vertically to support larger workloads and horizontally to support concurrency. Also, SQL warehouses are easier to manage and provide out-of-the-box features such as query history to help audit and optimize your SQL workloads. Between Serverless, Pro, and Classic SQL Warehouse types that Databricks offers, our standard recommendation for you is to leverage Databricks serverless warehouses. You can explore features of these warehouse types in the [Compare features section](https://www.databricks.com/product/pricing/databricks-sql?_gl=1*2rsmlo*_ga*ZmExYzgzZDAtMWU0Ny00N2YyLWFhYzEtM2RhZTQzNTAyZjZi*_ga_PQSEQ3RZQC*MTY3OTYwMDg0Ni4zNTAuMS4xNjc5NjAyMDMzLjUzLjAuMA..&_ga=2.104593536.1471430337.1679342371-fa1c83d0-1e47-47f2-aac1-3dae43502f6b) on the Databricks pricing page. 
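If you are connecting dbt Core rather than dbt Cloud, the warehouse you pick shows up in your connection profile as the SQL warehouse's HTTP path. Below is a minimal sketch of a `profiles.yml` for the dbt-databricks adapter; the host, `http_path`, and `catalog` values are placeholders, not real endpoints:

```yaml
# profiles.yml — minimal sketch; replace the placeholder values with your workspace details
my_databricks_project:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: main                                     # Unity Catalog (optional)
      schema: analytics
      host: dbc-a1b2c3d4-e5f6.cloud.databricks.com      # placeholder workspace host
      http_path: /sql/1.0/warehouses/1234567890abcdef   # points at a SQL warehouse
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4
```

Pointing `http_path` at a serverless SQL warehouse is all it takes to pick up the scaling and query-history benefits described above.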
@@ -31,11 +40,11 @@ Another technique worth implementing is to provision separate SQL warehouses for Because of the ability of serverless warehouses to spin up in a matter of seconds, setting your auto-stop configuration to a lower threshold will not impact SLAs and end-user experience. From the SQL Workspace UI, the default value is 10 minutes and  you can set it to 5 minutes for a lower threshold with the UI. If you would like more custom settings, you can set the threshold to as low as 1 minute with the [API](https://docs.databricks.com/sql/api/sql-endpoints.html#). -## 2. Patterns & Best Practices +## Patterns & Best Practices Now that we have a solid sense of the infrastructure components, we can shift our focus to best practices and design patterns on pipeline development.  We recommend the staging/intermediate/mart approach which is analogous to the medallion architecture bronze/silver/gold approach that’s recommended by Databricks. Let’s dissect each stage further. -dbt has guidelines on how you can [structure your dbt project](/guides/best-practices/how-we-structure/1-guide-overview) which you can learn more about. +dbt has guidelines on how you can [structure your dbt project](/best-practices/how-we-structure/1-guide-overview) which you can learn more about. ### Bronze / Staging Layer: @@ -49,7 +58,7 @@ The main benefit of leveraging `COPY INTO` is that it's an incremental operation Now that we have our bronze table taken care of, we can proceed with the silver layer. -For cost and performance reasons, many customers opt to implement an incremental pipeline approach. The main benefit with this approach is that you process a lot less data when you insert new records into the silver layer, rather than re-create the table each time with all the data from the bronze layer. However it should be noted that by default, [dbt recommends using views and tables](/guides/best-practices/materializations/1-guide-overview) to start out with and then moving to incremental as you require more performance optimization. +For cost and performance reasons, many customers opt to implement an incremental pipeline approach. The main benefit with this approach is that you process a lot less data when you insert new records into the silver layer, rather than re-create the table each time with all the data from the bronze layer. However it should be noted that by default, [dbt recommends using views and tables](/best-practices/materializations/1-guide-overview) to start out with and then moving to incremental as you require more performance optimization. dbt has an [incremental model materialization](/reference/resource-configs/spark-configs#the-merge-strategy) to facilitate this framework. How this works at a high level is that Databricks will create a temp view with a snapshot of data and then merge that snapshot into the silver table. You can customize the time range of the snapshot to suit your specific use case by configuring the `where` conditional in your `is_incremental` logic. The most straightforward implementation is to merge data using a timestamp that’s later than the current max timestamp in the silver table, but there are certainly valid use cases for increasing the temporal range of the source snapshot. @@ -121,7 +130,7 @@ incremental_predicates = [ }} ``` -## 3. Performance Troubleshooting +## Performance Troubleshooting Performance troubleshooting refers to the process of identifying and resolving issues that impact the performance of your dbt models and overall data pipelines. 
By improving the speed and performance of your Lakehouse platform, you will be able to process data faster, process large and complex queries more effectively, and provide faster time to market.  Let’s go into detail the three effective strategies that you can implement. @@ -166,8 +175,8 @@ Now you might be wondering, how do you identify opportunities for performance im With the [dbt Cloud Admin API](/docs/dbt-cloud-apis/admin-cloud-api), you can  pull the dbt artifacts from your dbt Cloud run,  put the generated `manifest.json` into an S3 bucket, stage it, and model the data using the [dbt artifacts package](https://hub.getdbt.com/brooklyn-data/dbt_artifacts/latest/). That package can help you identify inefficiencies in your dbt models and pinpoint where opportunities for improvement are. -## Conclusion +### Conclusion -This concludes the second guide in our series on “Working with Databricks and dbt”, following [How to set up your Databricks and dbt Project](/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project). +This builds on the content in [Set up your dbt project with Databricks](/guides/set-up-your-databricks-dbt-project). We welcome you to try these strategies on our example open source TPC-H implementation and to provide us with thoughts/feedback as you start to incorporate these features into production. Looking forward to your feedback on [#db-databricks-and-spark](https://getdbt.slack.com/archives/CNGCW8HKL) Slack channel! diff --git a/website/docs/guides/dbt-python-snowpark.md b/website/docs/guides/dbt-python-snowpark.md new file mode 100644 index 00000000000..55e6b68c172 --- /dev/null +++ b/website/docs/guides/dbt-python-snowpark.md @@ -0,0 +1,1925 @@ +--- +title: "Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake" +id: "dbt-python-snowpark" +description: "Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake" +hoverSnippet: Learn how to leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Snowflake'] +level: 'Intermediate' +recently_updated: true +--- + +## Introduction + +The focus of this workshop will be to demonstrate how we can use both *SQL and python together* in the same workflow to run *both analytics and machine learning models* on dbt Cloud. + +All code in today’s workshop can be found on [GitHub](https://github.com/dbt-labs/python-snowpark-formula1/tree/python-formula1). + +### What you'll use during the lab + +- A [Snowflake account](https://trial.snowflake.com/) with ACCOUNTADMIN access +- A [dbt Cloud account](https://www.getdbt.com/signup/) + +### What you'll learn + +- How to build scalable data transformation pipelines using dbt, and Snowflake using SQL and Python +- How to leverage copying data into Snowflake from a public S3 bucket + +### What you need to know + +- Basic to intermediate SQL and python. +- Basic understanding of dbt fundamentals. We recommend the [dbt Fundamentals course](https://courses.getdbt.com/collections) if you're interested. +- High level machine learning process (encoding, training, testing) +- Simple ML algorithms — we will use logistic regression to keep the focus on the *workflow*, not algorithms! 
+
+### What you'll build
+
+- A set of data analytics and prediction pipelines using Formula 1 data leveraging dbt and Snowflake, making use of best practices like data quality tests and code promotion between environments
+- We will create insights for:
+    1. Finding the lap time average and rolling average through the years (is it generally trending up or down)?
+    2. Which constructor has the fastest pit stops in 2021?
+    3. Predicting the position of each driver using a decade of data (2010 - 2020)
+
+As inputs, we are going to leverage Formula 1 datasets hosted on a dbt Labs public S3 bucket. We will create a Snowflake Stage for our CSV files then use Snowflake’s `COPY INTO` function to copy the data in from our CSV files into tables. The Formula 1 data is available on [Kaggle](https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020). The data is originally compiled from the [Ergast Developer API](http://ergast.com/mrd/).
+
+Overall we are going to set up the environments, build scalable pipelines in dbt, establish data tests, and promote code to production.
+
+## Configure Snowflake
+
+1. Log in to your trial Snowflake account. You can [sign up for a Snowflake Trial Account using this form](https://signup.snowflake.com/) if you don’t have one.
+2. Ensure that your account is set up using **AWS** in the **US East (N. Virginia)**. We will be copying the data from a public AWS S3 bucket hosted by dbt Labs in the us-east-1 region. By ensuring our Snowflake environment setup matches our bucket region, we avoid any multi-region data copy and retrieval latency issues.
+
+3. After creating your account and verifying it from your sign-up email, Snowflake will direct you back to the UI called Snowsight.
+
+4. When Snowsight first opens, your window should look like the following, with you logged in as the ACCOUNTADMIN with demo worksheets open:
+
+5. Navigate to **Admin > Billing & Terms**. Click **Enable > Acknowledge & Continue** to enable Anaconda Python Packages to run in Snowflake.
+
+6. Finally, create a new Worksheet by selecting **+ Worksheet** in the upper right corner.
+
+## Connect to data source
+
+We need to obtain our data source by copying our Formula 1 data into Snowflake tables from a public S3 bucket that dbt Labs hosts.
+
+1. When a new Snowflake account is created, there should be a preconfigured warehouse in your account named `COMPUTE_WH`.
+2. If for any reason your account doesn’t have this warehouse, we can create a warehouse using the following script:
+
+    ```sql
+    create or replace warehouse COMPUTE_WH with warehouse_size=XSMALL
+    ```
+
+3. Rename the worksheet to `data setup script` since we will be placing code in this worksheet to ingest the Formula 1 data. Make sure you are still logged in as the **ACCOUNTADMIN** and select the **COMPUTE_WH** warehouse.
+
+4. Copy the following code into the main body of the Snowflake worksheet. You can also find this setup script under the `setup` folder in the [Git repository](https://github.com/dbt-labs/python-snowpark-formula1/blob/main/setup/setup_script_s3_to_snowflake.sql). The script is long since it's bringing in all of the data we'll need today!
+ + ```sql + -- create and define our formula1 database + create or replace database formula1; + use database formula1; + create or replace schema raw; + use schema raw; + + -- define our file format for reading in the csvs + create or replace file format csvformat + type = csv + field_delimiter =',' + field_optionally_enclosed_by = '"', + skip_header=1; + + -- + create or replace stage formula1_stage + file_format = csvformat + url = 's3://formula1-dbt-cloud-python-demo/formula1-kaggle-data/'; + + -- load in the 8 tables we need for our demo + -- we are first creating the table then copying our data in from s3 + -- think of this as an empty container or shell that we are then filling + create or replace table formula1.raw.circuits ( + CIRCUITID NUMBER(38,0), + CIRCUITREF VARCHAR(16777216), + NAME VARCHAR(16777216), + LOCATION VARCHAR(16777216), + COUNTRY VARCHAR(16777216), + LAT FLOAT, + LNG FLOAT, + ALT NUMBER(38,0), + URL VARCHAR(16777216) + ); + -- copy our data from public s3 bucket into our tables + copy into circuits + from @formula1_stage/circuits.csv + on_error='continue'; + + create or replace table formula1.raw.constructors ( + CONSTRUCTORID NUMBER(38,0), + CONSTRUCTORREF VARCHAR(16777216), + NAME VARCHAR(16777216), + NATIONALITY VARCHAR(16777216), + URL VARCHAR(16777216) + ); + copy into constructors + from @formula1_stage/constructors.csv + on_error='continue'; + + create or replace table formula1.raw.drivers ( + DRIVERID NUMBER(38,0), + DRIVERREF VARCHAR(16777216), + NUMBER VARCHAR(16777216), + CODE VARCHAR(16777216), + FORENAME VARCHAR(16777216), + SURNAME VARCHAR(16777216), + DOB DATE, + NATIONALITY VARCHAR(16777216), + URL VARCHAR(16777216) + ); + copy into drivers + from @formula1_stage/drivers.csv + on_error='continue'; + + create or replace table formula1.raw.lap_times ( + RACEID NUMBER(38,0), + DRIVERID NUMBER(38,0), + LAP NUMBER(38,0), + POSITION FLOAT, + TIME VARCHAR(16777216), + MILLISECONDS NUMBER(38,0) + ); + copy into lap_times + from @formula1_stage/lap_times.csv + on_error='continue'; + + create or replace table formula1.raw.pit_stops ( + RACEID NUMBER(38,0), + DRIVERID NUMBER(38,0), + STOP NUMBER(38,0), + LAP NUMBER(38,0), + TIME VARCHAR(16777216), + DURATION VARCHAR(16777216), + MILLISECONDS NUMBER(38,0) + ); + copy into pit_stops + from @formula1_stage/pit_stops.csv + on_error='continue'; + + create or replace table formula1.raw.races ( + RACEID NUMBER(38,0), + YEAR NUMBER(38,0), + ROUND NUMBER(38,0), + CIRCUITID NUMBER(38,0), + NAME VARCHAR(16777216), + DATE DATE, + TIME VARCHAR(16777216), + URL VARCHAR(16777216), + FP1_DATE VARCHAR(16777216), + FP1_TIME VARCHAR(16777216), + FP2_DATE VARCHAR(16777216), + FP2_TIME VARCHAR(16777216), + FP3_DATE VARCHAR(16777216), + FP3_TIME VARCHAR(16777216), + QUALI_DATE VARCHAR(16777216), + QUALI_TIME VARCHAR(16777216), + SPRINT_DATE VARCHAR(16777216), + SPRINT_TIME VARCHAR(16777216) + ); + copy into races + from @formula1_stage/races.csv + on_error='continue'; + + create or replace table formula1.raw.results ( + RESULTID NUMBER(38,0), + RACEID NUMBER(38,0), + DRIVERID NUMBER(38,0), + CONSTRUCTORID NUMBER(38,0), + NUMBER NUMBER(38,0), + GRID NUMBER(38,0), + POSITION FLOAT, + POSITIONTEXT VARCHAR(16777216), + POSITIONORDER NUMBER(38,0), + POINTS NUMBER(38,0), + LAPS NUMBER(38,0), + TIME VARCHAR(16777216), + MILLISECONDS NUMBER(38,0), + FASTESTLAP NUMBER(38,0), + RANK NUMBER(38,0), + FASTESTLAPTIME VARCHAR(16777216), + FASTESTLAPSPEED FLOAT, + STATUSID NUMBER(38,0) + ); + copy into results + from @formula1_stage/results.csv + 
on_error='continue'; + + create or replace table formula1.raw.status ( + STATUSID NUMBER(38,0), + STATUS VARCHAR(16777216) + ); + copy into status + from @formula1_stage/status.csv + on_error='continue'; + + ``` + +5. Ensure all the commands are selected before running the query — an easy way to do this is to use Ctrl-a to highlight all of the code in the worksheet. Select **run** (blue triangle icon). Notice how the dot next to your **COMPUTE_WH** turns from gray to green as you run the query. The **status** table is the final table of all 8 tables loaded in. + + + +6. Let’s unpack that pretty long query we ran into component parts. We ran this query to load in our 8 Formula 1 tables from a public S3 bucket. To do this, we: + - Created a new database called `formula1` and a schema called `raw` to place our raw (untransformed) data into. + - Defined our file format for our CSV files. Importantly, here we use a parameter called `field_optionally_enclosed_by =` since the string columns in our Formula 1 csv files use quotes. Quotes are used around string values to avoid parsing issues where commas `,` and new lines `/n` in data values could cause data loading errors. + - Created a stage to locate our data we are going to load in. Snowflake Stages are locations where data files are stored. Stages are used to both load and unload data to and from Snowflake locations. Here we are using an external stage, by referencing an S3 bucket. + - Created our tables for our data to be copied into. These are empty tables with the column name and data type. Think of this as creating an empty container that the data will then fill into. + - Used the `copy into` statement for each of our tables. We reference our staged location we created and upon loading errors continue to load in the rest of the data. You should not have data loading errors but if you do, those rows will be skipped and Snowflake will tell you which rows caused errors + +7. Now let's take a look at some of our cool Formula 1 data we just loaded up! + 1. Create a new worksheet by selecting the **+** then **New Worksheet**. + + 2. Navigate to **Database > Formula1 > RAW > Tables**. + 3. Query the data using the following code. There are only 76 rows in the circuits table, so we don’t need to worry about limiting the amount of data we query. + + ```sql + select * from formula1.raw.circuits + ``` + + 4. Run the query. From here on out, we’ll use the keyboard shortcuts Command-Enter or Control-Enter to run queries and won’t explicitly call out this step. + 5. Review the query results, you should see information about Formula 1 circuits, starting with Albert Park in Australia! + 6. Finally, ensure you have all 8 tables starting with `CIRCUITS` and ending with `STATUS`. Now we are ready to connect into dbt Cloud! + + + +## Configure dbt Cloud + +1. We are going to be using [Snowflake Partner Connect](https://docs.snowflake.com/en/user-guide/ecosystem-partner-connect.html) to set up a dbt Cloud account. Using this method will allow you to spin up a fully fledged dbt account with your [Snowflake connection](/docs/cloud/connect-data-platform/connect-snowflake), [managed repository](/docs/collaborate/git/managed-repository), environments, and credentials already established. +2. Navigate out of your worksheet back by selecting **home**. +3. In Snowsight, confirm that you are using the **ACCOUNTADMIN** role. +4. Navigate to the **Admin** **> Partner Connect**. Find **dbt** either by using the search bar or navigating the **Data Integration**. 
Select the **dbt** tile. + +5. You should now see a new window that says **Connect to dbt**. Select **Optional Grant** and add the `FORMULA1` database. This will grant access for your new dbt user role to the FORMULA1 database. + + +6. Ensure the `FORMULA1` is present in your optional grant before clicking **Connect**.  This will create a dedicated dbt user, database, warehouse, and role for your dbt Cloud trial. + + + +7. When you see the **Your partner account has been created** window, click **Activate**. + +8. You should be redirected to a dbt Cloud registration page. Fill out the form. Make sure to save the password somewhere for login in the future. + + + +9. Select **Complete Registration**. You should now be redirected to your dbt Cloud account, complete with a connection to your Snowflake account, a deployment and a development environment, and a sample job. + +10. To help you version control your dbt project, we have connected it to a [managed repository](/docs/collaborate/git/managed-repository), which means that dbt Labs will be hosting your repository for you. This will give you access to a Git workflow without you having to create and host the repository yourself. You will not need to know Git for this workshop; dbt Cloud will help guide you through the workflow. In the future, when you’re developing your own project, [feel free to use your own repository](/docs/cloud/git/connect-github). This will allow you to learn more about features like [Slim CI](/docs/deploy/continuous-integration) builds after this workshop. + +## Change development schema name navigate the IDE + +1. First we are going to change the name of our default schema to where our dbt models will build. By default, the name is `dbt_`. We will change this to `dbt_` to create your own personal development schema. To do this, select **Profile Settings** from the gear icon in the upper right. + + + +2. Navigate to the **Credentials** menu and select **Partner Connect Trial**, which will expand the credentials menu. + + + +3. Click **Edit** and change the name of your schema from `dbt_` to `dbt_YOUR_NAME` replacing `YOUR_NAME` with your initials and name (`hwatson` is used in the lab screenshots). Be sure to click **Save** for your changes! + + +4. We now have our own personal development schema, amazing! When we run our first dbt models they will build into this schema. +5. Let’s open up dbt Cloud’s Integrated Development Environment (IDE) and familiarize ourselves. Choose **Develop** at the top of the UI. + +6. When the IDE is done loading, click **Initialize dbt project**. The initialization process creates a collection of files and folders necessary to run your dbt project. + + +7. After the initialization is finished, you can view the files and folders in the file tree menu. As we move through the workshop we'll be sure to touch on a few key files and folders that we'll work with to build out our project. +8. Next click **Commit and push** to commit the new files and folders from the initialize step. We always want our commit messages to be relevant to the work we're committing, so be sure to provide a message like `initialize project` and select **Commit Changes**. + + + + + +9. [Committing](https://www.atlassian.com/git/tutorials/saving-changes/git-commit) your work here will save it to the managed git repository that was created during the Partner Connect signup. 
This initial commit is the only commit that will be made directly to our `main` branch and from *here on out we'll be doing all of our work on a development branch*. This allows us to keep our development work separate from our production code. +10. There are a couple of key features to point out about the IDE before we get to work. It is a text editor, an SQL and Python runner, and a CLI with Git version control all baked into one package! This allows you to focus on editing your SQL and Python files, previewing the results with the SQL runner (it even runs Jinja!), and building models at the command line without having to move between different applications. The Git workflow in dbt Cloud allows both Git beginners and experts alike to be able to easily version control all of their work with a couple clicks. + + + +11. Let's run our first dbt models! Two example models are included in your dbt project in the `models/examples` folder that we can use to illustrate how to run dbt at the command line. Type `dbt run` into the command line and click **Enter** on your keyboard. When the run bar expands you'll be able to see the results of the run, where you should see the run complete successfully. + + + +12. The run results allow you to see the code that dbt compiles and sends to Snowflake for execution. To view the logs for this run, select one of the model tabs using the  **>** icon and then **Details**. If you scroll down a bit you'll be able to see the compiled code and how dbt interacts with Snowflake. Given that this run took place in our development environment, the models were created in your development schema. + + + +13. Now let's switch over to Snowflake to confirm that the objects were actually created. Click on the three dots **…** above your database objects and then **Refresh**. Expand the **PC_DBT_DB** database and you should see your development schema. Select the schema, then **Tables**  and **Views**. Now you should be able to see `MY_FIRST_DBT_MODEL` as a table and `MY_SECOND_DBT_MODEL` as a view. + + +## Create branch and set up project configs + +In this step, we’ll need to create a development branch and set up project level configurations. + +1. To get started with development for our project, we'll need to create a new Git branch for our work. Select **create branch** and name your development branch. We'll call our branch `snowpark_python_workshop` then click **Submit**. +2. The first piece of development we'll do on the project is to update the `dbt_project.yml` file. Every dbt project requires a `dbt_project.yml` file — this is how dbt knows a directory is a dbt project. The [dbt_project.yml](/reference/dbt_project.yml) file also contains important information that tells dbt how to operate on your project. +3. Select the `dbt_project.yml` file from the file tree to open it and replace all of the existing contents with the following code below. When you're done, save the file by clicking **save**. You can also use the Command-S or Control-S shortcut from here on out. + + ```yaml + # Name your project! Project names should contain only lowercase characters + # and underscores. A good package name should reflect your organization's + # name or the intended use of these models + name: 'snowflake_dbt_python_formula1' + version: '1.3.0' + require-dbt-version: '>=1.3.0' + config-version: 2 + + # This setting configures which "profile" dbt uses for this project. + profile: 'default' + + # These configurations specify where dbt should look for different types of files. 
+ # The `model-paths` config, for example, states that models in this project can be + # found in the "models/" directory. You probably won't need to change these! + model-paths: ["models"] + analysis-paths: ["analyses"] + test-paths: ["tests"] + seed-paths: ["seeds"] + macro-paths: ["macros"] + snapshot-paths: ["snapshots"] + + target-path: "target" # directory which will store compiled SQL files + clean-targets: # directories to be removed by `dbt clean` + - "target" + - "dbt_packages" + + models: + snowflake_dbt_python_formula1: + staging: + + +docs: + node_color: "CadetBlue" + marts: + +materialized: table + aggregates: + +docs: + node_color: "Maroon" + +tags: "bi" + + core: + +docs: + node_color: "#800080" + intermediate: + +docs: + node_color: "MediumSlateBlue" + ml: + prep: + +docs: + node_color: "Indigo" + train_predict: + +docs: + node_color: "#36454f" + + ``` + +4. The key configurations to point out in the file with relation to the work that we're going to do are in the `models` section. + - `require-dbt-version` — Tells dbt which version of dbt to use for your project. We are requiring 1.3.0 and any newer version to run python models and node colors. + - `materialized` — Tells dbt how to materialize models when compiling the code before it pushes it down to Snowflake. All models in the `marts` folder will be built as tables. + - `tags` — Applies tags at a directory level to all models. All models in the `aggregates` folder will be tagged as `bi` (abbreviation for business intelligence). + - `docs` — Specifies the `node_color` either by the plain color name or a hex value. +5. [Materializations](/docs/build/materializations) are strategies for persisting dbt models in a warehouse, with `tables` and `views` being the most commonly utilized types. By default, all dbt models are materialized as views and other materialization types can be configured in the `dbt_project.yml` file or in a model itself. It’s very important to note *Python models can only be materialized as tables or incremental models.* Since all our Python models exist under `marts`, the following portion of our `dbt_project.yml` ensures no errors will occur when we run our Python models. Starting with [dbt version 1.4](/docs/dbt-versions/core-upgrade/upgrading-to-v1.4#updates-to-python-models), Python files will automatically get materialized as tables even if not explicitly specified. + + ```yaml + marts:     + +materialized: table + ``` + +## Create folders and organize files + +dbt Labs has developed a [project structure guide](/best-practices/how-we-structure/1-guide-overview/) that contains a number of recommendations for how to build the folder structure for your project. Do check out that guide if you want to learn more. Right now we are going to create some folders to organize our files: + +- Sources — This is our Formula 1 dataset and it will be defined in a source YAML file. +- Staging models — These models have a 1:1 with their source table. +- Intermediate — This is where we will be joining some Formula staging models. +- Marts models — Here is where we perform our major transformations. It contains these subfolders: + - aggregates + - core + - ml + +1. In your file tree, use your cursor and hover over the `models` subdirectory, click the three dots **…** that appear to the right of the folder name, then select **Create Folder**. We're going to add two new folders to the file path, `staging` and `formula1` (in that order) by typing `staging/formula1` into the file path. 
+ + + + + - If you click into your `models` directory now, you should see the new `staging` folder nested within `models` and the `formula1` folder nested within `staging`. +2. Create two additional folders the same as the last step. Within the `models` subdirectory, create new directories `marts/core`. + +3. We will need to create a few more folders and subfolders using the UI. After you create all the necessary folders, your folder tree should look like this when it's all done: + + + +Remember you can always reference the entire project in [GitHub](https://github.com/dbt-labs/python-snowpark-formula1/tree/python-formula1) to view the complete folder and file strucutre. + +## Create source and staging models + +In this section, we are going to create our source and staging models. + +Sources allow us to create a dependency between our source database object and our staging models which will help us when we look at later. Also, if your source changes database or schema, you only have to update it in your `f1_sources.yml` file rather than updating all of the models it might be used in. + +Staging models are the base of our project, where we bring all the individual components we're going to use to build our more complex and useful models into the project. + +Since we want to focus on dbt and Python in this workshop, check out our [sources](/docs/build/sources) and [staging](/best-practices/how-we-structure/2-staging) docs if you want to learn more (or take our [dbt Fundamentals](https://courses.getdbt.com/collections) course which covers all of our core functionality). + +### 1. Create sources + +We're going to be using each of our 8 Formula 1 tables from our `formula1` database under the `raw`  schema for our transformations and we want to create those tables as sources in our project. + +1. Create a new file called `f1_sources.yml` with the following file path: `models/staging/formula1/f1_sources.yml`. +2. Then, paste the following code into the file before saving it: + +```yaml +version: 2 + +sources: + - name: formula1 + description: formula 1 datasets with normalized tables + database: formula1 + schema: raw + tables: + - name: circuits + description: One record per circuit, which is the specific race course. + columns: + - name: circuitid + tests: + - unique + - not_null + - name: constructors + description: One record per constructor. Constructors are the teams that build their formula 1 cars. + columns: + - name: constructorid + tests: + - unique + - not_null + - name: drivers + description: One record per driver. This table gives details about the driver. + columns: + - name: driverid + tests: + - unique + - not_null + - name: lap_times + description: One row per lap in each race. Lap times started being recorded in this dataset in 1984 and joined through driver_id. + - name: pit_stops + description: One row per pit stop. Pit stops do not have their own id column, the combination of the race_id and driver_id identify the pit stop. + columns: + - name: stop + tests: + - accepted_values: + values: [1,2,3,4,5,6,7,8] + quote: false + - name: races + description: One race per row. Importantly this table contains the race year to understand trends. + columns: + - name: raceid + tests: + - unique + - not_null + - name: results + columns: + - name: resultid + tests: + - unique + - not_null + description: One row per result. The main table that we join out for grid and position variables. + - name: status + description: One status per row. 
The status contextualizes whether the race was finished or what issues arose e.g. collisions, engine, etc. + columns: + - name: statusid + tests: + - unique + - not_null +``` + +### 2. Create staging models + +The next step is to set up the staging models for each of the 8 source tables. Given the one-to-one relationship between staging models and their corresponding source tables, we'll build 8 staging models here. We know it’s a lot and in the future, we will seek to update the workshop to make this step less repetitive and more efficient. This step is also a good representation of the real world of data, where you have multiple hierarchical tables that you will need to join together! + +1. Let's go in alphabetical order to easily keep track of all our staging models! Create a new file called `stg_f1_circuits.sql` with this file path `models/staging/formula1/stg_f1_circuits.sql`. Then, paste the following code into the file before saving it: + + ```sql + with + + source as ( + + select * from {{ source('formula1','circuits') }} + + ), + + renamed as ( + select + circuitid as circuit_id, + circuitref as circuit_ref, + name as circuit_name, + location, + country, + lat as latitude, + lng as longitude, + alt as altitude + -- omit the url + from source + ) + select * from renamed + ``` + + All we're doing here is pulling the source data into the model using the `source` function, renaming some columns, and omitting the column `url` with a commented note since we don’t need it for our analysis. + +1. Create `stg_f1_constructors.sql` with this file path `models/staging/formula1/stg_f1_constructors.sql`. Paste the following code into it before saving the file: + + ```sql + with + + source as ( + + select * from {{ source('formula1','constructors') }} + + ), + + renamed as ( + select + constructorid as constructor_id, + constructorref as constructor_ref, + name as constructor_name, + nationality as constructor_nationality + -- omit the url + from source + ) + + select * from renamed + ``` + + We have 6 other stages models to create. We can do this by creating new files, then copy and paste the code into our `staging` folder. + +1. Create `stg_f1_drivers.sql` with this file path `models/staging/formula1/stg_f1_drivers.sql`: + + ```sql + with + + source as ( + + select * from {{ source('formula1','drivers') }} + + ), + + renamed as ( + select + driverid as driver_id, + driverref as driver_ref, + number as driver_number, + code as driver_code, + forename, + surname, + dob as date_of_birth, + nationality as driver_nationality + -- omit the url + from source + ) + + select * from renamed + ``` + +1. Create `stg_f1_lap_times.sql` with this file path `models/staging/formula1/stg_f1_lap_times.sql`: + + ```sql + with + + source as ( + + select * from {{ source('formula1','lap_times') }} + + ), + + renamed as ( + select + raceid as race_id, + driverid as driver_id, + lap, + position, + time as lap_time_formatted, + milliseconds as lap_time_milliseconds + from source + ) + + select * from renamed + ``` + +1. Create `stg_f1_pit_stops.sql` with this file path `models/staging/formula1/stg_f1_pit_stops.sql`: + + ```sql + with + + source as ( + + select * from {{ source('formula1','pit_stops') }} + + ), + + renamed as ( + select + raceid as race_id, + driverid as driver_id, + stop as stop_number, + lap, + time as lap_time_formatted, + duration as pit_stop_duration_seconds, + milliseconds as pit_stop_milliseconds + from source + ) + + select * from renamed + order by pit_stop_duration_seconds desc + ``` + +1. 
Create `stg_f1_races.sql` with this file path `models/staging/formula1/stg_f1_races.sql`: + + ```sql + with + + source as ( + + select * from {{ source('formula1','races') }} + + ), + + renamed as ( + select + raceid as race_id, + year as race_year, + round as race_round, + circuitid as circuit_id, + name as circuit_name, + date as race_date, + to_time(time) as race_time, + -- omit the url + fp1_date as free_practice_1_date, + fp1_time as free_practice_1_time, + fp2_date as free_practice_2_date, + fp2_time as free_practice_2_time, + fp3_date as free_practice_3_date, + fp3_time as free_practice_3_time, + quali_date as qualifying_date, + quali_time as qualifying_time, + sprint_date, + sprint_time + from source + ) + + select * from renamed + ``` + +1. Create `stg_f1_results.sql` with this file path `models/staging/formula1/stg_f1_results.sql`: + + ```sql + with + + source as ( + + select * from {{ source('formula1','results') }} + + ), + + renamed as ( + select + resultid as result_id, + raceid as race_id, + driverid as driver_id, + constructorid as constructor_id, + number as driver_number, + grid, + position::int as position, + positiontext as position_text, + positionorder as position_order, + points, + laps, + time as results_time_formatted, + milliseconds as results_milliseconds, + fastestlap as fastest_lap, + rank as results_rank, + fastestlaptime as fastest_lap_time_formatted, + fastestlapspeed::decimal(6,3) as fastest_lap_speed, + statusid as status_id + from source + ) + + select * from renamed + ``` + +1. Last one! Create `stg_f1_status.sql` with this file path: `models/staging/formula1/stg_f1_status.sql`: + + ```sql + with + + source as ( + + select * from {{ source('formula1','status') }} + + ), + + renamed as ( + select + statusid as status_id, + status + from source + ) + + select * from renamed + ``` + + After the source and all the staging models are complete for each of the 8 tables, your staging folder should look like this: + + + +1. It’s a good time to delete our example folder since these two models are extraneous to our formula1 pipeline and `my_first_model` fails a `not_null` test that we won’t spend time investigating. dbt Cloud will warn us that this folder will be permanently deleted, and we are okay with that so select **Delete**. + + + +1. Now that the staging models are built and saved, it's time to create the models in our development schema in Snowflake. To do this we're going to enter into the command line `dbt build` to run all of the models in our project, which includes the 8 new staging models and the existing example models. + + Your run should complete successfully and you should see green checkmarks next to all of your models in the run results. We built our 8 staging models as views and ran 13 source tests that we configured in the `f1_sources.yml` file with not that much code, pretty cool! + + + + Let's take a quick look in Snowflake, refresh database objects, open our development schema, and confirm that the new models are there. If you can see them, then we're good to go! + + + + Before we move onto the next section, be sure to commit your new models to your Git branch. Click **Commit and push** and give your commit a message like `profile, sources, and staging setup` before moving on. + +## Transform SQL + +Now that we have all our sources and staging models done, it's time to move into where dbt shines — transformation! 
+
+We need to:
+
+- Create some intermediate tables to join tables that aren’t hierarchical
+- Create core tables for business intelligence (BI) tool ingestion
+- Answer the two questions about:
+  - fastest pit stops
+  - lap time trends about our Formula 1 data by creating aggregate models using Python!
+
+### Intermediate models
+
+We need to join lots of reference tables to our results table to create a human readable dataframe. What does this mean? For example, we don’t only want to have the numeric `status_id` in our table, we want to be able to read in a row of data that a driver could not finish a race due to engine failure (`status_id=5`).
+
+By now, we are pretty good at creating new files in the correct directories so we won’t cover this in detail. All intermediate models should be created in the path `models/intermediate`.
+
+1. Create a new file called `int_lap_times_years.sql`. In this model, we are joining our lap time and race information so we can look at lap times over years. In earlier Formula 1 eras, lap times were not recorded (only final results), so we filter out records where lap times are null.
+
+    ```sql
+    with lap_times as (
+
+        select * from {{ ref('stg_f1_lap_times') }}
+
+    ),
+
+    races as (
+
+        select * from {{ ref('stg_f1_races') }}
+
+    ),
+
+    expanded_lap_times_by_year as (
+        select
+            lap_times.race_id,
+            driver_id,
+            race_year,
+            lap,
+            lap_time_milliseconds
+        from lap_times
+        left join races
+            on lap_times.race_id = races.race_id
+        where lap_time_milliseconds is not null
+    )
+
+    select * from expanded_lap_times_by_year
+    ```
+
+2. Create a file called `int_pit_stops.sql`. Pit stops are a many-to-one (M:1) relationship with our races. We are creating a feature called `total_pit_stops_per_race` by partitioning over our `race_id` and `driver_id`, while preserving individual level pit stops for rolling average in our next section.
+
+    ```sql
+    with stg_f1__pit_stops as
+    (
+        select * from {{ ref('stg_f1_pit_stops') }}
+    ),
+
+    pit_stops_per_race as (
+        select
+            race_id,
+            driver_id,
+            stop_number,
+            lap,
+            lap_time_formatted,
+            pit_stop_duration_seconds,
+            pit_stop_milliseconds,
+            max(stop_number) over (partition by race_id,driver_id) as total_pit_stops_per_race
+        from stg_f1__pit_stops
+    )
+
+    select * from pit_stops_per_race
+    ```
+
+3. Create a file called `int_results.sql`. Here we are using 4 of our tables — `races`, `drivers`, `constructors`, and `status` — to give context to our `results` table. We are now able to calculate a new feature `drivers_age_years` by bringing the `date_of_birth` and `race_year` into the same table. We are also creating a column called `dnf_flag` to indicate whether the driver did not finish (dnf) the race, based upon whether their `position` was null.
+ + ```sql + with results as ( + + select * from {{ ref('stg_f1_results') }} + + ), + + races as ( + + select * from {{ ref('stg_f1_races') }} + + ), + + drivers as ( + + select * from {{ ref('stg_f1_drivers') }} + + ), + + constructors as ( + + select * from {{ ref('stg_f1_constructors') }} + ), + + status as ( + + select * from {{ ref('stg_f1_status') }} + ), + + int_results as ( + select + result_id, + results.race_id, + race_year, + race_round, + circuit_id, + circuit_name, + race_date, + race_time, + results.driver_id, + results.driver_number, + forename ||' '|| surname as driver, + cast(datediff('year', date_of_birth, race_date) as int) as drivers_age_years, + driver_nationality, + results.constructor_id, + constructor_name, + constructor_nationality, + grid, + position, + position_text, + position_order, + points, + laps, + results_time_formatted, + results_milliseconds, + fastest_lap, + results_rank, + fastest_lap_time_formatted, + fastest_lap_speed, + results.status_id, + status, + case when position is null then 1 else 0 end as dnf_flag + from results + left join races + on results.race_id=races.race_id + left join drivers + on results.driver_id = drivers.driver_id + left join constructors + on results.constructor_id = constructors.constructor_id + left join status + on results.status_id = status.status_id + ) + + select * from int_results + ``` + +1. Create a *Markdown* file `intermediate.md` that we will go over in depth in the Test and Documentation sections of the [Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake](/guides/dbt-python-snowpark) guide. + + ```markdown + # the intent of this .md is to allow for multi-line long form explanations for our intermediate transformations + + # below are descriptions + {% docs int_results %} In this query we want to join out other important information about the race results to have a human readable table about results, races, drivers, constructors, and status. + We will have 4 left joins onto our results table. {% enddocs %} + + {% docs int_pit_stops %} There are many pit stops within one race, aka a M:1 relationship. + We want to aggregate this so we can properly join pit stop information without creating a fanout. {% enddocs %} + + {% docs int_lap_times_years %} Lap times are done per lap. We need to join them out to the race year to understand yearly lap time trends. {% enddocs %} + ``` + +1. Create a *YAML* file `intermediate.yml` that we will go over in depth during the Test and Document sections of the [Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake](/guides/dbt-python-snowpark) guide. + + ```yaml + version: 2 + + models: + - name: int_results + description: '{{ doc("int_results") }}' + - name: int_pit_stops + description: '{{ doc("int_pit_stops") }}' + - name: int_lap_times_years + description: '{{ doc("int_lap_times_years") }}' + ``` + + That wraps up the intermediate models we need to create our core models! + +### Core models + +1. Create a file `fct_results.sql`. This is what I like to refer to as the “mega table” — a really large denormalized table with all our context added in at row level for human readability. Importantly, we have a table `circuits` that is linked through the table `races`. When we joined `races` to `results` in `int_results.sql` we allowed our tables to make the connection from `circuits` to `results` in `fct_results.sql`. 
We are only taking information about pit stops at the result level so our join would not cause a [fanout](https://community.looker.com/technical-tips-tricks-1021/what-is-a-fanout-23327). + + ```sql + with int_results as ( + + select * from {{ ref('int_results') }} + + ), + + int_pit_stops as ( + select + race_id, + driver_id, + max(total_pit_stops_per_race) as total_pit_stops_per_race + from {{ ref('int_pit_stops') }} + group by 1,2 + ), + + circuits as ( + + select * from {{ ref('stg_f1_circuits') }} + ), + base_results as ( + select + result_id, + int_results.race_id, + race_year, + race_round, + int_results.circuit_id, + int_results.circuit_name, + circuit_ref, + location, + country, + latitude, + longitude, + altitude, + total_pit_stops_per_race, + race_date, + race_time, + int_results.driver_id, + driver, + driver_number, + drivers_age_years, + driver_nationality, + constructor_id, + constructor_name, + constructor_nationality, + grid, + position, + position_text, + position_order, + points, + laps, + results_time_formatted, + results_milliseconds, + fastest_lap, + results_rank, + fastest_lap_time_formatted, + fastest_lap_speed, + status_id, + status, + dnf_flag + from int_results + left join circuits + on int_results.circuit_id=circuits.circuit_id + left join int_pit_stops + on int_results.driver_id=int_pit_stops.driver_id and int_results.race_id=int_pit_stops.race_id + ) + + select * from base_results + ``` + +1. Create the file `pit_stops_joined.sql`. Our results and pit stops are at different levels of dimensionality (also called grain). Simply put, we have multiple pit stops per a result. Since we are interested in understanding information at the pit stop level with information about race year and constructor, we will create a new table `pit_stops_joined.sql` where each row is per pit stop. Our new table tees up our aggregation in Python. + + ```sql + with base_results as ( + + select * from {{ ref('fct_results') }} + + ), + + pit_stops as ( + + select * from {{ ref('int_pit_stops') }} + + ), + + pit_stops_joined as ( + + select + base_results.race_id, + race_year, + base_results.driver_id, + constructor_id, + constructor_name, + stop_number, + lap, + lap_time_formatted, + pit_stop_duration_seconds, + pit_stop_milliseconds + from base_results + left join pit_stops + on base_results.race_id=pit_stops.race_id and base_results.driver_id=pit_stops.driver_id + ) + select * from pit_stops_joined + ``` + +1. Enter in the command line and execute `dbt build` to build out our entire pipeline to up to this point. Don’t worry about “overriding” your previous models – dbt workflows are designed to be idempotent so we can run them again and expect the same results. + +1. Let’s talk about our lineage so far. It’s looking good 😎. We’ve shown how SQL can be used to make data type, column name changes, and handle hierarchical joins really well; all while building out our automated lineage! + + + +1. Time to **Commit and push** our changes and give your commit a message like `intermediate and fact models` before moving on. + +## Running dbt Python models + +Up until now, SQL has been driving the project (car pun intended) for data cleaning and hierarchical joining. Now it’s time for Python to take the wheel (car pun still intended) for the rest of our lab! For more information about running Python models on dbt, check out our [docs](/docs/build/python-models). 
To learn more about dbt python works under the hood, check out [Snowpark for Python](https://docs.snowflake.com/en/developer-guide/snowpark/python/index.html), which makes running dbt Python models possible. + +There are quite a few differences between SQL and Python in terms of the dbt syntax and DDL, so we’ll be breaking our code and model runs down further for our python models. + +### Pit stop analysis + +First, we want to find out: which constructor had the fastest pit stops in 2021? (constructor is a Formula 1 team that builds or “constructs” the car). + +1. Create a new file called `fastest_pit_stops_by_constructor.py` in our `aggregates` (this is the first time we are using the `.py` extension!). +2. Copy the following code into the file: + + ```python + import numpy as np + import pandas as pd + + def model(dbt, session): + # dbt configuration + dbt.config(packages=["pandas","numpy"]) + + # get upstream data + pit_stops_joined = dbt.ref("pit_stops_joined").to_pandas() + + # provide year so we do not hardcode dates + year=2021 + + # describe the data + pit_stops_joined["PIT_STOP_SECONDS"] = pit_stops_joined["PIT_STOP_MILLISECONDS"]/1000 + fastest_pit_stops = pit_stops_joined[(pit_stops_joined["RACE_YEAR"]==year)].groupby(by="CONSTRUCTOR_NAME")["PIT_STOP_SECONDS"].describe().sort_values(by='mean') + fastest_pit_stops.reset_index(inplace=True) + fastest_pit_stops.columns = fastest_pit_stops.columns.str.upper() + + return fastest_pit_stops.round(2) + ``` + +3. Let’s break down what this code is doing step by step: + - First, we are importing the Python libraries that we are using. A *library* is a reusable chunk of code that someone else wrote that you may want to include in your programs/projects. We are using `numpy` and `pandas`in this Python model. This is similar to a dbt *package*, but our Python libraries do *not* persist across the entire project. + - Defining a function called `model` with the parameter `dbt` and `session`. The parameter `dbt` is a class compiled by dbt, which enables you to run your Python code in the context of your dbt project and DAG. The parameter `session` is a class representing your Snowflake’s connection to the Python backend. The `model` function *must return a single DataFrame*. You can see that all the data transformation happening is within the body of the `model` function that the `return` statement is tied to. + - Then, within the context of our dbt model library, we are passing in a configuration of which packages we need using `dbt.config(packages=["pandas","numpy"])`. + - Use the `.ref()` function to retrieve the data frame `pit_stops_joined` that we created in our last step using SQL. We cast this to a pandas dataframe (by default it's a Snowpark Dataframe). + - Create a variable named `year` so we aren’t passing a hardcoded value. + - Generate a new column called `PIT_STOP_SECONDS` by dividing the value of `PIT_STOP_MILLISECONDS` by 1000. + - Create our final data frame `fastest_pit_stops` that holds the records where year is equal to our year variable (2021 in this case), then group the data frame by `CONSTRUCTOR_NAME` and use the `describe()` and `sort_values()` and in descending order. This will make our first row in the new aggregated data frame the team with the fastest pit stops over an entire competition year. + - Finally, it resets the index of the `fastest_pit_stops` data frame. The `reset_index()` method allows you to reset the index back to the default 0, 1, 2, etc indexes. 
By default, this method will keep the "old" indexes in a column named "index"; to avoid this, use the drop parameter. Think of this as keeping your data “flat and square” as opposed to “tiered”. If you are new to Python, now might be a good time to [learn about indexes for 5 minutes](https://towardsdatascience.com/the-basics-of-indexing-and-slicing-python-lists-2d12c90a94cf) since it's the foundation of how Python retrieves, slices, and dices data. The `inplace` argument means we override the existing data frame permanently. Not to fear! This is what we want to do to avoid dealing with multi-indexed dataframes! + - Convert our Python column names to all uppercase using `.upper()`, so Snowflake recognizes them. + - Finally we are returning our dataframe with 2 decimal places for all the columns using the `round()` method. +4. Zooming out a bit, what are we doing differently here in Python from our typical SQL code: + - Method chaining is a technique in which multiple methods are called on an object in a single statement, with each method call modifying the result of the previous one. The methods are called in a chain, with the output of one method being used as the input for the next one. The technique is used to simplify the code and make it more readable by eliminating the need for intermediate variables to store the intermediate results. + - The way you see method chaining in Python is the syntax `.().()`. For example, `.describe().sort_values(by='mean')` where the `.describe()` method is chained to `.sort_values()`. + - The `.describe()` method is used to generate various summary statistics of the dataset. It's used on pandas dataframe. It gives a quick and easy way to get the summary statistics of your dataset without writing multiple lines of code. + - The `.sort_values()` method is used to sort a pandas dataframe or a series by one or multiple columns. The method sorts the data by the specified column(s) in ascending or descending order. It is the pandas equivalent to `order by` in SQL. + + We won’t go as in depth for our subsequent scripts, but will continue to explain at a high level what new libraries, functions, and methods are doing. + +5. Build the model using the UI which will **execute**: + + ```bash + dbt run --select fastest_pit_stops_by_constructor + ``` + + in the command bar. + + Let’s look at some details of our first Python model to see what our model executed. There two major differences we can see while running a Python model compared to an SQL model: + + - Our Python model was executed as a stored procedure. Snowflake needs a way to know that it's meant to execute this code in a Python runtime, instead of interpreting in a SQL runtime. We do this by creating a Python stored proc, called by a SQL command. + - The `snowflake-snowpark-python` library has been picked up to execute our Python code. Even though this wasn’t explicitly stated this is picked up by the dbt class object because we need our Snowpark package to run Python! + + Python models take a bit longer to run than SQL models, however we could always speed this up by using [Snowpark-optimized Warehouses](https://docs.snowflake.com/en/user-guide/warehouses-snowpark-optimized.html) if we wanted to. Our data is sufficiently small, so we won’t worry about creating a separate warehouse for Python versus SQL files today. 
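+
+    If method chaining is new to you, here is a tiny standalone pandas sketch on a made-up dataframe (not our Formula 1 data) that mirrors the `.groupby().describe().sort_values()` chain used in `fastest_pit_stops_by_constructor` above:
+
+    ```python
+    # Standalone illustration of the groupby/describe/sort_values chain on toy data.
+    import pandas as pd
+
+    toy = pd.DataFrame({
+        "CONSTRUCTOR_NAME": ["Red Bull", "Red Bull", "Ferrari", "Ferrari", "McLaren"],
+        "PIT_STOP_SECONDS": [2.1, 2.3, 2.8, 3.0, 2.6],
+    })
+
+    # each method call feeds the next, so no intermediate variables are needed
+    summary = (
+        toy.groupby(by="CONSTRUCTOR_NAME")["PIT_STOP_SECONDS"]
+        .describe()              # count, mean, std, min, quartiles, max per constructor
+        .sort_values(by="mean")  # fastest average pit stop first
+    )
+    summary.reset_index(inplace=True)
+    print(summary.round(2))
+    ```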
+ + + The rest of our **Details** output gives us information about how dbt and Snowpark for Python are working together to define class objects and apply a specific set of methods to run our models. + + So which constructor had the fastest pit stops in 2021? Let’s look at our data to find out! + +6. We can't preview Python models directly, so let’s create a new file using the **+** button or the Control-n shortcut to create a new scratchpad. +7. Reference our Python model: + + ```sql + select * from {{ ref('fastest_pit_stops_by_constructor') }} + ``` + + and preview the output: + + + Not only did Red Bull have the fastest average pit stops by nearly 40 seconds, they also had the smallest standard deviation, meaning they are both fastest and most consistent teams in pit stops. By using the `.describe()` method we were able to avoid verbose SQL requiring us to create a line of code per column and repetitively use the `PERCENTILE_COUNT()` function. + + Now we want to find the lap time average and rolling average through the years (is it generally trending up or down)? + +8. Create a new file called `lap_times_moving_avg.py` in our `aggregates` folder. +9. Copy the following code into the file: + + ```python + import pandas as pd + + def model(dbt, session): + # dbt configuration + dbt.config(packages=["pandas"]) + + # get upstream data + lap_times = dbt.ref("int_lap_times_years").to_pandas() + + # describe the data + lap_times["LAP_TIME_SECONDS"] = lap_times["LAP_TIME_MILLISECONDS"]/1000 + lap_time_trends = lap_times.groupby(by="RACE_YEAR")["LAP_TIME_SECONDS"].mean().to_frame() + lap_time_trends.reset_index(inplace=True) + lap_time_trends["LAP_MOVING_AVG_5_YEARS"] = lap_time_trends["LAP_TIME_SECONDS"].rolling(5).mean() + lap_time_trends.columns = lap_time_trends.columns.str.upper() + + return lap_time_trends.round(1) + ``` + +10. Breaking down our code a bit: + - We’re only using the `pandas` library for this model and casting it to a pandas data frame `.to_pandas()`. + - Generate a new column called `LAP_TIMES_SECONDS` by dividing the value of `LAP_TIME_MILLISECONDS` by 1000. + - Create the final dataframe. Get the lap time per year. Calculate the mean series and convert to a data frame. + - Reset the index. + - Calculate the rolling 5 year mean. + - Round our numeric columns to one decimal place. +11. Now, run this model by using the UI **Run model** or + + ```bash + dbt run --select lap_times_moving_avg + ``` + + in the command bar. + +12. Once again previewing the output of our data using the same steps for our `fastest_pit_stops_by_constructor` model. + + + We can see that it looks like lap times are getting consistently faster over time. Then in 2010 we see an increase occur! Using outside subject matter context, we know that significant rule changes were introduced to Formula 1 in 2010 and 2011 causing slower lap times. + +13. Now is a good time to checkpoint and commit our work to Git. Click **Commit and push** and give your commit a message like `aggregate python models` before moving on. + +### The dbt model, .source(), .ref() and .config() functions + +Let’s take a step back before starting machine learning to both review and go more in-depth at the methods that make running dbt python models possible. If you want to know more outside of this lab’s explanation read the documentation [here](/docs/build/python-models?version=1.3). + +- dbt model(dbt, session). For starters, each Python model lives in a .py file in your models/ folder. 
It defines a function named `model()`, which takes two parameters: + - dbt — A class compiled by dbt Core, unique to each model, enables you to run your Python code in the context of your dbt project and DAG. + - session — A class representing your data platform’s connection to the Python backend. The session is needed to read in tables as DataFrames and to write DataFrames back to tables. In PySpark, by convention, the SparkSession is named spark, and available globally. For consistency across platforms, we always pass it into the model function as an explicit argument called session. +- The `model()` function must return a single DataFrame. On Snowpark (Snowflake), this can be a Snowpark or pandas DataFrame. +- `.source()` and `.ref()` functions. Python models participate fully in dbt's directed acyclic graph (DAG) of transformations. If you want to read directly from a raw source table, use `dbt.source()`. We saw this in our earlier section using SQL with the source function. These functions have the same execution, but with different syntax. Use the `dbt.ref()` method within a Python model to read data from other models (SQL or Python). These methods return DataFrames pointing to the upstream source, model, seed, or snapshot. +- `.config()`. Just like SQL models, there are three ways to configure Python models: + - In a dedicated `.yml` file, within the `models/` directory + - Within the model's `.py` file, using the `dbt.config()` method + - Calling the `dbt.config()` method will set configurations for your model within your `.py` file, similar to the `{{ config() }} macro` in `.sql` model files: + + ```python + def model(dbt, session): + + # setting configuration + dbt.config(materialized="table") + ``` + - There's a limit to how complex you can get with the `dbt.config()` method. It accepts only literal values (strings, booleans, and numeric types). Passing another function or a more complex data structure is not possible. The reason is that dbt statically analyzes the arguments to `.config()` while parsing your model without executing your Python code. If you need to set a more complex configuration, we recommend you define it using the config property in a [YAML file](/reference/resource-properties/config). Learn more about configurations [here](/reference/model-configs). + +## Prepare for machine learning: cleaning, encoding, and splits + +Now that we’ve gained insights and business intelligence about Formula 1 at a descriptive level, we want to extend our capabilities into prediction. We’re going to take the scenario where we censor the data. This means that we will pretend that we will train a model using earlier data and apply it to future data. In practice, this means we’ll take data from 2010-2019 to train our model and then predict 2020 data. + +In this section, we’ll be preparing our data to predict the final race position of a driver. + +At a high level we’ll be: + +- Creating new prediction features and filtering our dataset to active drivers +- Encoding our data (algorithms like numbers) and simplifying our target variable called `position` +- Splitting our dataset into training, testing, and validation + +### ML data prep + +1. To keep our project organized, we’ll need to create two new subfolders in our `ml` directory. Under the `ml` folder, make the subfolders `prep` and `train_predict`. +2. Create a new file under `ml/prep` called `ml_data_prep`. Copy the following code into the file and **Save**. 
+ + ```python + import pandas as pd + + def model(dbt, session): + # dbt configuration + dbt.config(packages=["pandas"]) + + # get upstream data + fct_results = dbt.ref("fct_results").to_pandas() + + # provide years so we do not hardcode dates in filter command + start_year=2010 + end_year=2020 + + # describe the data for a full decade + data = fct_results.loc[fct_results['RACE_YEAR'].between(start_year, end_year)] + + # convert string to an integer + data['POSITION'] = data['POSITION'].astype(float) + + # we cannot have nulls if we want to use total pit stops + data['TOTAL_PIT_STOPS_PER_RACE'] = data['TOTAL_PIT_STOPS_PER_RACE'].fillna(0) + + # some of the constructors changed their name over the year so replacing old names with current name + mapping = {'Force India': 'Racing Point', 'Sauber': 'Alfa Romeo', 'Lotus F1': 'Renault', 'Toro Rosso': 'AlphaTauri'} + data['CONSTRUCTOR_NAME'].replace(mapping, inplace=True) + + # create confidence metrics for drivers and constructors + dnf_by_driver = data.groupby('DRIVER').sum()['DNF_FLAG'] + driver_race_entered = data.groupby('DRIVER').count()['DNF_FLAG'] + driver_dnf_ratio = (dnf_by_driver/driver_race_entered) + driver_confidence = 1-driver_dnf_ratio + driver_confidence_dict = dict(zip(driver_confidence.index,driver_confidence)) + + dnf_by_constructor = data.groupby('CONSTRUCTOR_NAME').sum()['DNF_FLAG'] + constructor_race_entered = data.groupby('CONSTRUCTOR_NAME').count()['DNF_FLAG'] + constructor_dnf_ratio = (dnf_by_constructor/constructor_race_entered) + constructor_relaiblity = 1-constructor_dnf_ratio + constructor_relaiblity_dict = dict(zip(constructor_relaiblity.index,constructor_relaiblity)) + + data['DRIVER_CONFIDENCE'] = data['DRIVER'].apply(lambda x:driver_confidence_dict[x]) + data['CONSTRUCTOR_RELAIBLITY'] = data['CONSTRUCTOR_NAME'].apply(lambda x:constructor_relaiblity_dict[x]) + + #removing retired drivers and constructors + active_constructors = ['Renault', 'Williams', 'McLaren', 'Ferrari', 'Mercedes', + 'AlphaTauri', 'Racing Point', 'Alfa Romeo', 'Red Bull', + 'Haas F1 Team'] + active_drivers = ['Daniel Ricciardo', 'Kevin Magnussen', 'Carlos Sainz', + 'Valtteri Bottas', 'Lance Stroll', 'George Russell', + 'Lando Norris', 'Sebastian Vettel', 'Kimi Räikkönen', + 'Charles Leclerc', 'Lewis Hamilton', 'Daniil Kvyat', + 'Max Verstappen', 'Pierre Gasly', 'Alexander Albon', + 'Sergio Pérez', 'Esteban Ocon', 'Antonio Giovinazzi', + 'Romain Grosjean','Nicholas Latifi'] + + # create flags for active drivers and constructors so we can filter downstream + data['ACTIVE_DRIVER'] = data['DRIVER'].apply(lambda x: int(x in active_drivers)) + data['ACTIVE_CONSTRUCTOR'] = data['CONSTRUCTOR_NAME'].apply(lambda x: int(x in active_constructors)) + + return data + ``` + +3. As usual, let’s break down what we are doing in this Python model: + - We’re first referencing our upstream `fct_results` table and casting it to a pandas dataframe. + - Filtering on years 2010-2020 since we’ll need to clean all our data we are using for prediction (both training and testing). + - Filling in empty data for `total_pit_stops` and making a mapping active constructors and drivers to avoid erroneous predictions + - ⚠️ You might be wondering why we didn’t do this upstream in our `fct_results` table! The reason for this is that we want our machine learning cleanup to reflect the year 2020 for our predictions and give us an up-to-date team name. However, for business intelligence purposes we can keep the historical data at that point in time. 
Instead of thinking of one table as “one source of truth,” we are creating different datasets fit for purpose: one for historical descriptions and reporting, and another for relevant predictions.
+   - Create new confidence features for drivers and constructors
+   - Generate flags for the constructors and drivers that were active in 2020
+4. Execute the following in the command bar:
+
+   ```bash
+   dbt run --select ml_data_prep
+   ```
+
+5. There are more aspects we could consider for this project, such as normalizing the driver confidence by the number of races entered. Including this would help account for a driver’s history and consider whether they are a new or long-time driver. We’re going to keep it simple for now, but these are some of the ways we can expand and improve our machine learning dbt projects. Breaking down our machine learning prep model:
+   - Lambda functions — We use some lambda functions to transform our data without having to create a fully-fledged function using the `def` notation. So what exactly are lambda functions?
+     - In Python, a lambda function is a small, anonymous function defined using the keyword "lambda". Lambda functions are used to perform a quick operation, such as a mathematical calculation or a transformation on a list of elements. They are often used in conjunction with higher-order functions, such as `apply`, `map`, `filter`, and `reduce`.
+   - `.apply()` method — We used `.apply()` to apply our lambda expressions to individual columns, and we do this multiple times in our code. Let’s explain `.apply()` a little more:
+     - The `.apply()` function in the pandas library is used to apply a function to a DataFrame or a Series. In our case, the function we applied was our lambda function!
+     - When called on a DataFrame, `.apply()` takes the function to apply and an `axis` argument (0 to apply the function to each column, 1 to apply it to each row). In our code we call `.apply()` on a single column (a Series), so there is no axis to specify; the lambda simply runs once for every value (every row) in that column.
+6. Let’s look at a preview of our clean dataframe after running our `ml_data_prep` model:
+
+
+### Covariate encoding
+
+In this next part, we’ll be performing covariate encoding. Breaking down this phrase a bit, a *covariate* is a variable that is relevant to the outcome of a study or experiment, and *encoding* refers to the process of converting data (such as text or categorical variables) into a numerical format that can be used as input for a model. This is necessary because most machine learning algorithms can only work with numerical data. Algorithms don’t speak languages or have eyes to see images, so we encode our data into numbers, which lets algorithms perform tasks using calculations they otherwise couldn’t.
+
+🧠 We’ll think about this as: “algorithms like numbers”.
+
+1. Create a new file under `ml/prep` called `covariate_encoding`, copy the code below, and save.
+ + ```python + import pandas as pd + import numpy as np + from sklearn.preprocessing import StandardScaler,LabelEncoder,OneHotEncoder + from sklearn.linear_model import LogisticRegression + + def model(dbt, session): + # dbt configuration + dbt.config(packages=["pandas","numpy","scikit-learn"]) + + # get upstream data + data = dbt.ref("ml_data_prep").to_pandas() + + # list out covariates we want to use in addition to outcome variable we are modeling - position + covariates = data[['RACE_YEAR','CIRCUIT_NAME','GRID','CONSTRUCTOR_NAME','DRIVER','DRIVERS_AGE_YEARS','DRIVER_CONFIDENCE','CONSTRUCTOR_RELAIBLITY','TOTAL_PIT_STOPS_PER_RACE','ACTIVE_DRIVER','ACTIVE_CONSTRUCTOR', 'POSITION']] + + # filter covariates on active drivers and constructors + # use fil_cov as short for "filtered_covariates" + fil_cov = covariates[(covariates['ACTIVE_DRIVER']==1)&(covariates['ACTIVE_CONSTRUCTOR']==1)] + + # Encode categorical variables using LabelEncoder + # TODO: we'll update this to both ohe in the future for non-ordinal variables! + le = LabelEncoder() + fil_cov['CIRCUIT_NAME'] = le.fit_transform(fil_cov['CIRCUIT_NAME']) + fil_cov['CONSTRUCTOR_NAME'] = le.fit_transform(fil_cov['CONSTRUCTOR_NAME']) + fil_cov['DRIVER'] = le.fit_transform(fil_cov['DRIVER']) + fil_cov['TOTAL_PIT_STOPS_PER_RACE'] = le.fit_transform(fil_cov['TOTAL_PIT_STOPS_PER_RACE']) + + # Simply target variable "position" to represent 3 meaningful categories in Formula1 + # 1. Podium position 2. Points for team 3. Nothing - no podium or points! + def position_index(x): + if x<4: + return 1 + if x>10: + return 3 + else : + return 2 + + # we are dropping the columns that we filtered on in addition to our training variable + encoded_data = fil_cov.drop(['ACTIVE_DRIVER','ACTIVE_CONSTRUCTOR'],1) + encoded_data['POSITION_LABEL']= encoded_data['POSITION'].apply(lambda x: position_index(x)) + encoded_data_grouped_target = encoded_data.drop(['POSITION'],1) + + return encoded_data_grouped_target + ``` + +2. Execute the following in the command bar: + + ```bash + dbt run --select covariate_encoding + ``` + +3. In this code, we are using a ton of functions from libraries! This is really cool, because we can utilize code other people have developed and bring it into our project simply by using the `import` function. [Scikit-learn](https://scikit-learn.org/stable/), “sklearn” for short, is an extremely popular data science library. Sklearn contains a wide range of machine learning techniques, including supervised and unsupervised learning algorithms, feature scaling and imputation, as well as tools model evaluation and selection. We’ll be using Sklearn for both preparing our covariates and creating models (our next section). +4. Our dataset is pretty small data so we are good to use pandas and `sklearn`. If you have larger data for your own project in mind, consider `dask` or `category_encoders`. +5. Breaking it down a bit more: + - We’re selecting a subset of variables that will be used as predictors for a driver’s position. + - Filter the dataset to only include rows using the active driver and constructor flags we created in the last step. + - The next step is to use the `LabelEncoder` from scikit-learn to convert the categorical variables `CIRCUIT_NAME`, `CONSTRUCTOR_NAME`, `DRIVER`, and `TOTAL_PIT_STOPS_PER_RACE` into numerical values. + - Create a new variable called `POSITION_LABEL`, which is a derived from our position variable. + - 💭 Why are we changing our position variable? 
There are 20 total positions in Formula 1, and we are grouping them together to simplify the classification and improve performance. We also want to demonstrate that you can create a new function within your dbt model!
+     - Our new `position_label` variable has meaning:
+       - In Formula 1, if you finish in the:
+         - Top 3 you get a “podium” position
+         - Top 10 you gain points that add to your overall season total
+         - Below top 10 you get no points!
+       - We map our original `position` variable to `position_label`, with the three cases above corresponding to 1, 2, and 3 respectively.
+   - Drop the active driver and constructor flags since they were filter criteria, and additionally drop our original position variable.
+
+### Splitting into training and testing datasets
+
+Now that we’ve cleaned and encoded our data, we are going to further split it by time. In this step, we will create dataframes to use for training and prediction. We’ll be creating two dataframes: 1) data from 2010-2019 for training, and 2) data from 2020 for new prediction inferences. We’ll create variables called `start_year` and `end_year` so we aren’t filtering on hardcoded values (and can more easily swap them out in the future if we want to retrain our model on different timeframes).
+
+1. Create a file called `train_test_dataset`, then copy and save the following code:
+
+   ```python
+   import pandas as pd
+
+   def model(dbt, session):
+
+       # dbt configuration
+       dbt.config(packages=["pandas"], tags="train")
+
+       # get upstream data
+       encoding = dbt.ref("covariate_encoding").to_pandas()
+
+       # provide years so we do not hardcode dates in filter command
+       start_year=2010
+       end_year=2019
+
+       # filter the data to the decade we are training on
+       train_test_dataset = encoding.loc[encoding['RACE_YEAR'].between(start_year, end_year)]
+
+       return train_test_dataset
+   ```
+
+2. Create a file called `hold_out_dataset_for_prediction`, then copy and save the following code. Now we’ll have a dataset with only the year 2020 that we’ll keep as a hold-out set, which we are going to use similarly to a deployment use case.
+
+   ```python
+   import pandas as pd
+
+   def model(dbt, session):
+       # dbt configuration
+       dbt.config(packages=["pandas"], tags="predict")
+
+       # get upstream data
+       encoding = dbt.ref("covariate_encoding").to_pandas()
+
+       # variable for year instead of hardcoding it
+       year=2020
+
+       # filter the data based on the specified year
+       hold_out_dataset = encoding.loc[encoding['RACE_YEAR'] == year]
+
+       return hold_out_dataset
+   ```
+
+3. Execute the following in the command bar:
+
+   ```bash
+   dbt run --select train_test_dataset hold_out_dataset_for_prediction
+   ```
+
+   To run our temporal data split models, we can use this syntax in the command line to run them both at once. Make sure you use a *space* [syntax](/reference/node-selection/syntax) between the model names to indicate you want to run both!
+4. **Commit and push** our changes to keep saving our work as we go, using the commit message `ml data prep and splits` before moving on.
+
+👏 Now that we’ve finished our machine learning prep work we can move onto the fun part — training and prediction!
+
+
+## Training a model to predict in machine learning
+
+We’re ready to start training a model to predict the driver’s position. Now is a good time to pause and take a step back: usually in ML projects you’ll try multiple algorithms during development and use an evaluation method such as cross validation to determine which algorithm to use.
You can definitely do this in your dbt project, but for the content of this lab we’ll have decided on using a logistic regression to predict position (we actually tried some other algorithms using cross validation outside of this lab such as k-nearest neighbors and a support vector classifier but that didn’t perform as well as the logistic regression and a decision tree that overfit). + +There are 3 areas to break down as we go since we are working at the intersection all within one model file: + +1. Machine Learning +2. Snowflake and Snowpark +3. dbt Python models + +If you haven’t seen code like this before or use joblib files to save machine learning models, we’ll be going over them at a high level and you can explore the links for more technical in-depth along the way! Because Snowflake and dbt have abstracted away a lot of the nitty gritty about serialization and storing our model object to be called again, we won’t go into too much detail here. There’s *a lot* going on here so take it at your pace! + +### Training and saving a machine learning model + +1. Project organization remains key, so let’s make a new subfolder called `train_predict` under the `ml` folder. +2. Now create a new file called `train_test_position` and copy and save the following code: + + ```python + import snowflake.snowpark.functions as F + from sklearn.model_selection import train_test_split + import pandas as pd + from sklearn.metrics import confusion_matrix, balanced_accuracy_score + import io + from sklearn.linear_model import LogisticRegression + from joblib import dump, load + import joblib + import logging + import sys + from joblib import dump, load + + logger = logging.getLogger("mylog") + + def save_file(session, model, path, dest_filename): + input_stream = io.BytesIO() + joblib.dump(model, input_stream) + session._conn.upload_stream(input_stream, path, dest_filename) + return "successfully created file: " + path + + def model(dbt, session): + dbt.config( + packages = ['numpy','scikit-learn','pandas','numpy','joblib','cachetools'], + materialized = "table", + tags = "train" + ) + # Create a stage in Snowflake to save our model file + session.sql('create or replace stage MODELSTAGE').collect() + + #session._use_scoped_temp_objects = False + version = "1.0" + logger.info('Model training version: ' + version) + + # read in our training and testing upstream dataset + test_train_df = dbt.ref("train_test_dataset") + + # cast snowpark df to pandas df + test_train_pd_df = test_train_df.to_pandas() + target_col = "POSITION_LABEL" + + # split out covariate predictors, x, from our target column position_label, y. + split_X = test_train_pd_df.drop([target_col], axis=1) + split_y = test_train_pd_df[target_col] + + # Split out our training and test data into proportions + X_train, X_test, y_train, y_test = train_test_split(split_X, split_y, train_size=0.7, random_state=42) + train = [X_train, y_train] + test = [X_test, y_test] + # now we are only training our one model to deploy + # we are keeping the focus on the workflows and not algorithms for this lab! 
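+        # the 70/30 split above (train_size=0.7, random_state=42) is reproducible run to run;
+        # below we fit a single logistic regression on the training rows and score the held-out rows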
+ model = LogisticRegression() + + # fit the preprocessing pipeline and the model together + model.fit(X_train, y_train) + y_pred = model.predict_proba(X_test)[:,1] + predictions = [round(value) for value in y_pred] + balanced_accuracy = balanced_accuracy_score(y_test, predictions) + + # Save the model to a stage + save_file(session, model, "@MODELSTAGE/driver_position_"+version, "driver_position_"+version+".joblib" ) + logger.info('Model artifact:' + "@MODELSTAGE/driver_position_"+version+".joblib") + + # Take our pandas training and testing dataframes and put them back into snowpark dataframes + snowpark_train_df = session.write_pandas(pd.concat(train, axis=1, join='inner'), "train_table", auto_create_table=True, create_temp_table=True) + snowpark_test_df = session.write_pandas(pd.concat(test, axis=1, join='inner'), "test_table", auto_create_table=True, create_temp_table=True) + + # Union our training and testing data together and add a column indicating train vs test rows + return snowpark_train_df.with_column("DATASET_TYPE", F.lit("train")).union(snowpark_test_df.with_column("DATASET_TYPE", F.lit("test"))) + ``` + +3. Execute the following in the command bar: + + ```bash + dbt run --select train_test_position + ``` + +4. Breaking down our Python script here: + - We’re importing some helpful libraries. + - Defining a function called `save_file()` that takes four parameters: `session`, `model`, `path` and `dest_filename` that will save our logistic regression model file. + - `session` — an object representing a connection to Snowflake. + - `model` — an object that needs to be saved. In this case, it's a Python object that is a scikit-learn that can be serialized with joblib. + - `path` — a string representing the directory or bucket location where the file should be saved. + - `dest_filename` — a string representing the desired name of the file. + - Creating our dbt model + - Within this model we are creating a stage called `MODELSTAGE` to place our logistic regression `joblib` model file. This is really important since we need a place to keep our model to reuse and want to ensure it's there. When using Snowpark commands, it's common to see the `.collect()` method to ensure the action is performed. Think of the session as our “start” and collect as our “end” when [working with Snowpark](https://docs.snowflake.com/en/developer-guide/snowpark/python/working-with-dataframes.html) (you can use other ending methods other than collect). + - Using `.ref()` to connect into our `train_test_dataset` model. + - Now we see the machine learning part of our analysis: + - Create new dataframes for our prediction features from our target variable `position_label`. + - Split our dataset into 70% training (and 30% testing), train_size=0.7 with a `random_state` specified to have repeatable results. + - Specify our model is a logistic regression. + - Fit our model. In a logistic regression this means finding the coefficients that will give the least classification error. + - Round our predictions to the nearest integer since logistic regression creates a probability between for each class and calculate a balanced accuracy to account for imbalances in the target variable. + - Right now our model is only in memory, so we need to use our nifty function `save_file` to save our model file to our Snowflake stage. We save our model as a joblib file so Snowpark can easily call this model object back to create predictions. We really don’t need to know much else as a data practitioner unless we want to. 
It’s worth noting that joblib files aren’t able to be queried directly by SQL. To do this, we would need to transform the joblib file to an SQL querable format such as JSON or CSV (out of scope for this workshop). + - Finally we want to return our dataframe, but create a new column indicating what rows were used for training and those for training. +5. Viewing our output of this model: + + +6. Let’s pop back over to Snowflake and check that our logistic regression model has been stored in our `MODELSTAGE` using the command: + + ```sql + list @modelstage + ``` + + + +7. To investigate the commands run as part of `train_test_position` script, navigate to Snowflake query history to view it **Activity > Query History**. We can view the portions of query that we wrote such as `create or replace stage MODELSTAGE`, but we also see additional queries that Snowflake uses to interpret python code. + + +### Predicting on new data + +1. Create a new file called `predict_position` and copy and save the following code: + + ```python + import logging + import joblib + import pandas as pd + import os + from snowflake.snowpark import types as T + + DB_STAGE = 'MODELSTAGE' + version = '1.0' + # The name of the model file + model_file_path = 'driver_position_'+version + model_file_packaged = 'driver_position_'+version+'.joblib' + + # This is a local directory, used for storing the various artifacts locally + LOCAL_TEMP_DIR = f'/tmp/driver_position' + DOWNLOAD_DIR = os.path.join(LOCAL_TEMP_DIR, 'download') + TARGET_MODEL_DIR_PATH = os.path.join(LOCAL_TEMP_DIR, 'ml_model') + TARGET_LIB_PATH = os.path.join(LOCAL_TEMP_DIR, 'lib') + + # The feature columns that were used during model training + # and that will be used during prediction + FEATURE_COLS = [ + "RACE_YEAR" + ,"CIRCUIT_NAME" + ,"GRID" + ,"CONSTRUCTOR_NAME" + ,"DRIVER" + ,"DRIVERS_AGE_YEARS" + ,"DRIVER_CONFIDENCE" + ,"CONSTRUCTOR_RELAIBLITY" + ,"TOTAL_PIT_STOPS_PER_RACE"] + + def register_udf_for_prediction(p_predictor ,p_session ,p_dbt): + + # The prediction udf + + def predict_position(p_df: T.PandasDataFrame[int, int, int, int, + int, int, int, int, int]) -> T.PandasSeries[int]: + # Snowpark currently does not set the column name in the input dataframe + # The default col names are like 0,1,2,... Hence we need to reset the column + # names to the features that we initially used for training. + p_df.columns = [*FEATURE_COLS] + + # Perform prediction. 
this returns an array object + pred_array = p_predictor.predict(p_df) + # Convert to series + df_predicted = pd.Series(pred_array) + return df_predicted + + # The list of packages that will be used by UDF + udf_packages = p_dbt.config.get('packages') + + predict_position_udf = p_session.udf.register( + predict_position + ,name=f'predict_position' + ,packages = udf_packages + ) + return predict_position_udf + + def download_models_and_libs_from_stage(p_session): + p_session.file.get(f'@{DB_STAGE}/{model_file_path}/{model_file_packaged}', DOWNLOAD_DIR) + + def load_model(p_session): + # Load the model and initialize the predictor + model_fl_path = os.path.join(DOWNLOAD_DIR, model_file_packaged) + predictor = joblib.load(model_fl_path) + return predictor + + # ------------------------------- + def model(dbt, session): + dbt.config( + packages = ['snowflake-snowpark-python' ,'scipy','scikit-learn' ,'pandas' ,'numpy'], + materialized = "table", + tags = "predict" + ) + session._use_scoped_temp_objects = False + download_models_and_libs_from_stage(session) + predictor = load_model(session) + predict_position_udf = register_udf_for_prediction(predictor, session ,dbt) + + # Retrieve the data, and perform the prediction + hold_out_df = (dbt.ref("hold_out_dataset_for_prediction") + .select(*FEATURE_COLS) + ) + + # Perform prediction. + new_predictions_df = hold_out_df.withColumn("position_predicted" + ,predict_position_udf(*FEATURE_COLS) + ) + + return new_predictions_df + ``` + +2. Execute the following in the command bar: + + ```bash + dbt run --select predict_position + ``` + +3. **Commit and push** our changes to keep saving our work as we go using the commit message `logistic regression model training and application` before moving on. +4. At a high level in this script, we are: + - Retrieving our staged logistic regression model + - Loading the model in + - Placing the model within a user defined function (UDF) to call in line predictions on our driver’s position +5. At a more detailed level: + - Import our libraries. + - Create variables to reference back to the `MODELSTAGE` we just created and stored our model to. + - The temporary file paths we created might look intimidating, but all we’re doing here is programmatically using an initial file path and adding to it to create the following directories: + - LOCAL_TEMP_DIR ➡️ /tmp/driver_position + - DOWNLOAD_DIR ➡️ /tmp/driver_position/download + - TARGET_MODEL_DIR_PATH ➡️ /tmp/driver_position/ml_model + - TARGET_LIB_PATH ➡️ /tmp/driver_position/lib + - Provide a list of our feature columns that we used for model training and will now be used on new data for prediction. + - Next, we are creating our main function `register_udf_for_prediction(p_predictor ,p_session ,p_dbt):`. This function is used to register a user-defined function (UDF) that performs the machine learning prediction. It takes three parameters: `p_predictor` is an instance of the machine learning model, `p_session` is an instance of the Snowflake session, and `p_dbt` is an instance of the dbt library. The function creates a UDF named `predict_churn` which takes a pandas dataframe with the input features and returns a pandas series with the predictions. + - ⚠️ Pay close attention to the whitespace here. We are using a function within a function for this script. 
+
+   - We have two simple functions that programmatically retrieve our file paths: `download_models_and_libs_from_stage` first gets our stored model out of `MODELSTAGE` and downloads it into the session, and `load_model` then loads the contents of our model (its parameters) to use for prediction.
+   - Take the model we loaded in, call it `predictor`, and wrap it in a UDF.
+   - Return our dataframe with both the features used to predict and the new label.
+
+🧠 Another way to read this script is from the bottom up. This can help us progressively see what is going into our final dbt model and work backwards to see how the other functions are being referenced.
+
+6. Let’s take a look at our predicted position alongside our feature variables. Open a new scratchpad and use the following query. I chose to order by the prediction of who would obtain a podium position:
+
+   ```sql
+   select * from {{ ref('predict_position') }} order by position_predicted
+   ```
+
+7. Now that we can see predictions in our final dataset, we are ready to move on to testing!
+
+## Test your data models
+
+We have now completed building all the models for today’s lab, but how do we know if they meet our assertions? Put another way, how do we know whether our data models are any good? This brings us to testing!
+
+We test data models for two main reasons:
+
+- Ensure that our source data is clean on ingestion before we start data modeling/transformation (avoid the garbage in, garbage out problem).
+- Make sure we don’t introduce bugs in the transformation code we wrote (stop ourselves from creating bad joins/fanouts).
+
+Testing in dbt comes in two flavors: [generic](/docs/build/tests#generic-tests) and [singular](/docs/build/tests#singular-tests).
+
+You define generic tests in a test block (similar to a macro) and, once defined, you can reference them by name in your `.yml` files (applying them to models, columns, sources, snapshots, and seeds). Singular tests are one-off SQL assertions, which we’ll write later in this section.
+
+You might be wondering: *what about testing Python models?*
+
+Since the output of our Python models is a table, we can test SQL and Python models the same way! We don’t have to worry about any syntax differences when testing SQL versus Python data models. This means we use `.yml` and `.sql` files to test our entities (tables, views, etc.). Under the hood, dbt runs an SQL query against our tables to see if they meet our assertions. If no rows are returned, dbt will surface a passed test. Conversely, if a test returns rows, it will fail or warn depending on the configuration (more on that later).
+
+### Generic tests
+
+1. To implement the generic out-of-the-box tests dbt comes with, we can use YAML files to specify information about our models. To add generic tests to our aggregates models, create a file called `aggregates.yml`, copy the code block below into the file, and save.
+
+
+   ```yaml
+   version: 2
+
+   models:
+     - name: fastest_pit_stops_by_constructor
+       description: Use the python .describe() method to retrieve a summary statistics table about pit stops by constructor. Sort by average stop time ascending so the first row returns the fastest constructor.
+       columns:
+         - name: constructor_name
+           description: team that makes the car
+           tests:
+             - unique
+
+     - name: lap_times_moving_avg
+       description: Use the python .rolling() method to calculate the 5 year rolling average of lap times alongside the average for each year.
+       columns:
+         - name: race_year
+           description: year of the race
+           tests:
+             - relationships:
+                 to: ref('int_lap_times_years')
+                 field: race_year
+   ```
+
+2. Let’s unpack the code we have here. For both of our aggregates models, we provide the model name (so dbt knows which object we are referencing) and a description of the model that we’ll populate in our documentation. At the column level (a level below our model), we provide the column name followed by our tests. We want to ensure our `constructor_name` is unique since we used a pandas `groupby` on `constructor_name` in the model `fastest_pit_stops_by_constructor`. Next, we want to ensure our `race_year` has referential integrity from the model we selected from, `int_lap_times_years`, into our subsequent `lap_times_moving_avg` model.
+3. Finally, if we want to see how tests were deployed on sources and SQL models, we can look at other files in our project such as the `f1_sources.yml` we created in our Sources and staging section.
+
+### Using macros for testing
+
+1. Under your `macros` folder, create a new file and name it `test_all_values_gte_zero.sql`. Copy the code block below and save the file. For clarity, “gte” is an abbreviation for greater than or equal to.
+
+
+   ```sql
+   {% macro test_all_values_gte_zero(table, column) %}
+
+   select * from {{ ref(table) }} where {{ column }} < 0
+
+   {% endmacro %}
+   ```
+
+2. Macros in Jinja are pieces of code that can be reused multiple times in our SQL models — they are analogous to "functions" in other programming languages, and are extremely useful if you find yourself repeating code across multiple models.
+3. We use `{% macro %}` to indicate the start of the macro and `{% endmacro %}` for the end. The text after the opening of the macro block is the name we are giving the macro so we can call it later. In this case, our macro is called `test_all_values_gte_zero`. Macros take in *arguments* to pass through, in this case the `table` and the `column`. In the body of the macro, we see an SQL statement that uses the `ref` function to dynamically select the table and then the column. You can always view macros without having to run them by using `dbt run-operation`. You can learn more [here](https://docs.getdbt.com/reference/commands/run-operation).
+4. Great, now we want to reference this macro as a test! Let’s create a new test file called `macro_pit_stops_mean_is_positive.sql` in our `tests` folder.
+
+
+
+5. Copy the following code into the file and save:
+
+   ```sql
+   {{
+     config(
+       enabled=true,
+       severity='warn',
+       tags = ['bi']
+     )
+   }}
+
+   {{ test_all_values_gte_zero('fastest_pit_stops_by_constructor', 'mean') }}
+   ```
+
+6. In our testing file, we apply some configurations to the test, including `enabled`, which is an optional configuration for enabling or disabling models, seeds, snapshots, and tests. Our severity is set to `warn` instead of `error`, which means our pipeline will still continue to run. We have tagged our test with `bi` since we are applying this test to one of our BI models.
+
+Then, in our final line, we call the `test_all_values_gte_zero` macro, passing in our table `'fastest_pit_stops_by_constructor'` and the column `'mean'` as arguments.
+
+### Custom singular tests to validate Python models
+
+The simplest way to define a test is by writing the exact SQL that will return failing records. We call these "singular" tests, because they're one-off assertions usable for a single purpose.
+
+These tests are defined in `.sql` files, typically in your `tests` directory (as defined by your test-paths config). You can use Jinja (including `ref` and `source`) in the test definition, just like you can when creating models. Each `.sql` file contains one select statement, and it defines one test.
+
+Let’s add a custom test that asserts that the moving average of the lap time over the last 5 years is greater than zero (it’s impossible to have a time less than 0!). If that’s not the case, it’s safe to assume the data has been corrupted.
+
+1. Create a file `lap_times_moving_avg_assert_positive_or_null.sql` under the `tests` folder.
+
+
+2. Copy the following code and save the file:
+
+   ```sql
+   {{
+     config(
+       enabled=true,
+       severity='error',
+       tags = ['bi']
+     )
+   }}
+
+   with lap_times_moving_avg as ( select * from {{ ref('lap_times_moving_avg') }} )
+
+   select *
+   from lap_times_moving_avg
+   where lap_moving_avg_5_years < 0 and lap_moving_avg_5_years is not null
+   ```
+
+### Putting all our tests together
+
+1. Time to run our tests! Altogether, we have created 4 tests for our 2 Python models:
+   - `fastest_pit_stops_by_constructor`
+     - Unique `constructor_name`
+     - Mean pit stop times are greater than or equal to 0 (no negative time values)
+   - `lap_times_moving_avg`
+     - Referential test on `race_year`
+     - Lap time moving averages are greater than 0 or null (to allow for the first leading values in a rolling calculation)
+2. To run the tests on both our models, we can use this syntax in the command line to run them both at once, similar to how we did our data splits earlier.
+   Execute the following in the command bar:
+
+   ```bash
+   dbt test --select fastest_pit_stops_by_constructor lap_times_moving_avg
+   ```
+
+
+
+3. All 4 of our tests passed (yay for clean data)! To understand the SQL being run against each of our tables, we can click into the details of the test.
+4. Navigating into the **Details** of the `unique_fastest_pit_stops_by_constructor_name` test, we can see the SQL dbt runs to check that each `constructor_name` has only one row.
+
+
+## Document your dbt project
+
+When it comes to documentation, dbt brings together the column- and model-level descriptions that you provide, as well as details from your Snowflake information schema, in a static site for consumption by other data team members and stakeholders.
+
+We are going to revisit 2 areas of our project to understand our documentation:
+
+- `intermediate.md` file
+- `dbt_project.yml` file
+
+To start, let’s look back at our `intermediate.md` file. We can see that we provided multi-line descriptions for the models in our intermediate models using [docs blocks](/docs/collaborate/documentation#using-docs-blocks). Then we reference these docs blocks in our `.yml` file. Building descriptions with docs blocks in Markdown files gives you the ability to format your descriptions with Markdown and is particularly helpful when building long descriptions, either at the column or model level. In our `dbt_project.yml`, we added `node_colors` at folder levels.
+
+1. To see all these pieces come together, execute this in the command bar:
+
+   ```bash
+   dbt docs generate
+   ```
+
+   This will generate the documentation for your project. Click the book button, as shown in the screenshot below, to access the docs.
+
+
+2. Go to our project area and view `int_results`. View the description that we created in our doc block.
+
+
+3. View the mini-lineage that looks at the model we are currently selected on (`int_results` in this case).
+
+
+4. In our `dbt_project.yml`, we configured `node_colors` depending on the file directory. Starting in dbt v1.3, these colors show up in the lineage graph in our docs. Color coding your project helps you cluster together similar models or steps and troubleshoot more easily.
+
+
+
+## Deploy your code
+
+Before we jump into deploying our code, let's have a quick primer on environments. Up to this point, all of the work we've done in the dbt Cloud IDE has been in our development environment, with code committed to a feature branch and the models we've built created in our development schema in Snowflake, as defined in our Development environment connection. Doing this work on a feature branch allows us to separate our code from what other coworkers are building and from code that is already deemed production ready. Building models in a development schema in Snowflake allows us to separate the database objects we might still be modifying and testing from the database objects running production dashboards or other downstream dependencies. Together, the combination of a Git branch and Snowflake database objects forms our environment.
+
+Now that we've completed testing and documenting our work, we're ready to deploy our code from our development environment to our production environment, and this involves two steps:
+
+- Promoting code from our feature branch to the production branch in our repository.
+  - Generally, the production branch is going to be named your main branch and there's a review process to go through before merging code to the main branch of a repository. Here we are going to merge without review for ease of this workshop.
+- Deploying code to our production environment.
+  - Once our code is merged to the main branch, we'll need to run dbt in our production environment to build all of our models and run all of our tests. This will allow us to build production-ready objects into our production environment in Snowflake. Luckily for us, the Partner Connect flow has already created our deployment environment and job to facilitate this step.
+
+1. Before getting started, let's make sure that we've committed all of our work to our feature branch. If you still have work to commit, you'll be able to select **Commit and push**, provide a message, and then select **Commit** again.
+2. Once all of your work is committed, the git workflow button will now appear as **Merge to main**. Select **Merge to main** and the merge process will automatically run in the background.
+
+
+3. When it's completed, you should see the git button read **Create branch** and the branch you're currently looking at will become **main**.
+4. Now that all of our development work has been merged to the main branch, we can build our deployment job. Given that our production environment and production job were created automatically for us through Partner Connect, all we need to do here is update some default configurations to meet our needs.
+5. In the menu, select **Deploy > Environments**.
+
+
+6. You should see two environments listed, and you'll want to select the **Deployment** environment, then **Settings** to modify it.
+7. Before making any changes, let's touch on what is defined within this environment. The Snowflake connection shows the credentials that dbt Cloud is using for this environment, and in our case they are the same as what was created for us through Partner Connect. Our deployment job will build in our `PC_DBT_DB` database and use the default Partner Connect role and warehouse to do so. The Deployment credentials section also uses the info that was created during our Partner Connect setup to create the credential connection. However, it is using the same default schema that we've been using as the schema for our development environment.
+8. Let's update the schema to create a new schema specifically for our production environment. Click **Edit** to allow you to modify the existing field values. Navigate to **Deployment credentials > Schema**.
+9. Update the schema name to **production**. Remember to select **Save** after you've made the change.
+
+10. Updating the schema for our production environment to **production** ensures that our deployment job for this environment will build our dbt models in the **production** schema within the `PC_DBT_DB` database, as defined in the Snowflake Connection section.
+11. Now let's switch over to our production job. Click on the **Deploy** tab again and then select **Jobs**. You should see an existing and preconfigured **Partner Connect Trial Job**. Similar to the environment, click on the job, then select **Settings** to modify it. Let's take a look at the job to understand it before making changes.
+
+   - The Environment section is what connects this job with the environment we want it to run in. This job is already defaulted to use the Deployment environment that we just updated, and the rest of the settings we can keep as is.
+   - The Execution settings section gives us the option to generate docs, run source freshness, and defer to a previous run state. For the purposes of our lab, we're going to keep these settings as is as well and stick with just generating docs.
+   - The Commands section is where we specify exactly which commands we want to run during this job, and we also want to keep this as is. We want our seed to be uploaded first, then run our models, and finally test them. The order of this is important as well, considering that we need our seed to be created before we can run our incremental model, and we need our models to be created before we can test them.
+   - Finally, we have the Triggers section, where we have a number of different options for scheduling our job. Given that our data isn't updating regularly here and we're running this job manually for now, we're also going to leave this section alone.
+
+  So, what are we changing then? Just the name! Click **Edit** to allow you to make changes. Then update the name of the job to **Production Job** to denote this as our production deployment job. After that's done, click **Save**.
+12. Now let's run our job. Clicking on the job name in the path at the top of the screen will take you back to the job run history page, where you'll be able to click **Run now** to kick off the job. If you encounter any job failures, try running the job again before further troubleshooting.
+
+
+
+13. Let's go over to Snowflake to confirm that everything built as expected in our production schema. Refresh the database objects in your Snowflake account and you should see the production schema now within our default Partner Connect database. If you click into the schema and everything ran successfully, you should be able to see all of the models we developed.
+
+
+### Conclusion
+
+Fantastic! You’ve finished the workshop! We hope you feel empowered in using both SQL and Python in your dbt Cloud workflows with Snowflake. Having a reliable pipeline to surface both analytics and machine learning is crucial to creating tangible business value from your data.
+ +For more help and information join our [dbt community Slack](https://www.getdbt.com/community/) which contains more than 50,000 data practitioners today. We have a dedicated slack channel #db-snowflake to Snowflake related content. Happy dbt'ing! diff --git a/website/docs/guides/best-practices/debugging-errors.md b/website/docs/guides/debug-errors.md similarity index 98% rename from website/docs/guides/best-practices/debugging-errors.md rename to website/docs/guides/debug-errors.md index fe600ec4f67..febfb6ac422 100644 --- a/website/docs/guides/best-practices/debugging-errors.md +++ b/website/docs/guides/debug-errors.md @@ -1,13 +1,18 @@ --- -title: "Debugging errors" -id: "debugging-errors" +title: "Debug errors" +id: "debug-errors" description: Learn about errors and the art of debugging them. displayText: Debugging errors hoverSnippet: Learn about errors and the art of debugging those errors. +icon: 'guides' +hide_table_of_contents: true +tags: ['Troubleshooting', 'dbt Core', 'dbt Cloud'] +level: 'Beginner' +recently_updated: true --- - ## General process of debugging + Learning how to debug is a skill, and one that will make you great at your role! 1. Read the error message — when writing the code behind dbt, we try our best to make error messages as useful as we can. The error message dbt produces will normally contain the type of error (more on these error types below), and the file where the error occurred. 2. Inspect the file that was known to cause the issue, and see if there's an immediate fix. diff --git a/website/docs/guides/legacy/debugging-schema-names.md b/website/docs/guides/debug-schema-names.md similarity index 84% rename from website/docs/guides/legacy/debugging-schema-names.md rename to website/docs/guides/debug-schema-names.md index dee2bc57293..c7bf1a195b1 100644 --- a/website/docs/guides/legacy/debugging-schema-names.md +++ b/website/docs/guides/debug-schema-names.md @@ -1,7 +1,19 @@ --- -title: Debugging schema names +title: Debug schema names +id: debug-schema-names +description: Learn how to debug schema names when models build under unexpected schemas. +displayText: Debug schema names +hoverSnippet: Learn how to debug schema names in dbt. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['dbt Core','Troubleshooting'] +level: 'Advanced' +recently_updated: true --- +## Introduction + If a model uses the [`schema` config](/reference/resource-properties/schema) but builds under an unexpected schema, here are some steps for debugging the issue. :::info @@ -12,10 +24,10 @@ You can also follow along via this video: -### 1. Search for a macro named `generate_schema_name` +## Search for a macro named `generate_schema_name` Do a file search to check if you have a macro named `generate_schema_name` in the `macros` directory of your project. -#### I do not have a macro named `generate_schema_name` in my project +### You do not have a macro named `generate_schema_name` in your project This means that you are using dbt's default implementation of the macro, as defined [here](https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/include/global_project/macros/get_custom_name/get_custom_schema.sql#L47C1-L60) ```sql @@ -37,7 +49,7 @@ This means that you are using dbt's default implementation of the macro, as defi Note that this logic is designed so that two dbt users won't accidentally overwrite each other's work by writing to the same schema. 
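+For example, assuming the default macro above and a target schema of `dbt_alice` (a placeholder name), the logic resolves like this:
+
+```sql
+-- model with no schema config:
+--   generate_schema_name(none, node)        --> dbt_alice
+-- model configured with schema='marketing':
+--   generate_schema_name('marketing', node) --> dbt_alice_marketing
+```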
-#### I have a `generate_schema_name` macro in my project that calls another macro +### You have a `generate_schema_name` macro in a project that calls another macro If your `generate_schema_name` macro looks like so: ```sql {% macro generate_schema_name(custom_schema_name, node) -%} @@ -61,22 +73,22 @@ Your project is switching out the `generate_schema_name` macro for another macro {%- endmacro %} ``` -#### I have a `generate_schema_name` macro with custom logic +### You have a `generate_schema_name` macro with custom logic If this is the case — it might be a great idea to reach out to the person who added this macro to your project, as they will have context here — you can use [GitHub's blame feature](https://docs.github.com/en/free-pro-team@latest/github/managing-files-in-a-repository/tracking-changes-in-a-file) to do this. In all cases take a moment to read through the Jinja to see if you can follow the logic. -### 2. Confirm your `schema` config +## Confirm your `schema` config Check if you are using the [`schema` config](/reference/resource-properties/schema) in your model, either via a `{{ config() }}` block, or from `dbt_project.yml`. In both cases, dbt passes this value as the `custom_schema_name` parameter of the `generate_schema_name` macro. -### 3. Confirm your target values +## Confirm your target values Most `generate_schema_name` macros incorporate logic from the [`target` variable](/reference/dbt-jinja-functions/target), in particular `target.schema` and `target.name`. Use the docs [here](/reference/dbt-jinja-functions/target) to help you find the values of each key in this dictionary. -### 4. Put the two together +## Put the two together Now, re-read through the logic of your `generate_schema_name` macro, and mentally plug in your `customer_schema_name` and `target` values. @@ -86,7 +98,7 @@ You should find that the schema dbt is constructing for your model matches the o Note that snapshots do not follow this behavior, check out the docs on [target_schema](/reference/resource-configs/target_schema) instead. ::: -### 5. 
Adjust as necessary +## Adjust as necessary Now that you understand how a model's schema is being generated, you can adjust as necessary: - You can adjust the logic in your `generate_schema_name` macro (or add this macro to your project if you don't yet have one and adjust from there) diff --git a/website/docs/guides/orchestration/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs.md b/website/docs/guides/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs.md similarity index 95% rename from website/docs/guides/orchestration/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs.md rename to website/docs/guides/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs.md index bb1045b3d2f..30221332355 100644 --- a/website/docs/guides/orchestration/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs.md +++ b/website/docs/guides/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs.md @@ -4,8 +4,16 @@ id: how-to-use-databricks-workflows-to-run-dbt-cloud-jobs description: Learn how to use Databricks workflows to run dbt Cloud jobs displayText: "Use Databricks workflows to run dbt Cloud jobs" hoverSnippet: Learn how to use Databricks workflows to run dbt Cloud jobs +# time_to_complete: '30 minutes' commenting out until we test +icon: 'databricks' +hide_table_of_contents: true +tags: ['Databricks', 'dbt Core','dbt Cloud','Orchestration'] +level: 'Intermediate' +recently_updated: true --- +## Introduction + Using Databricks workflows to call the dbt Cloud job API can be useful for several reasons: 1. **Integration with other ETL processes** — If you're already running other ETL processes in Databricks, you can use a Databricks workflow to trigger a dbt Cloud job after those processes are done. @@ -13,7 +21,7 @@ Using Databricks workflows to call the dbt Cloud job API can be useful for sever 3. [**Separation of concerns —**](https://en.wikipedia.org/wiki/Separation_of_concerns) Detailed logs for dbt jobs in the dbt Cloud environment can lead to more modularity and efficient debugging. By doing so, it becomes easier to isolate bugs quickly while still being able to see the overall status in Databricks. 4. **Custom job triggering —** Use a Databricks workflow to trigger dbt Cloud jobs based on custom conditions or logic that aren't natively supported by dbt Cloud's scheduling feature. This can give you more flexibility in terms of when and how your dbt Cloud jobs run. -## Prerequisites +### Prerequisites - Active [Teams or Enterprise dbt Cloud account](https://www.getdbt.com/pricing/) - You must have a configured and existing [dbt Cloud deploy job](/docs/deploy/deploy-jobs) @@ -29,7 +37,7 @@ To use Databricks workflows for running dbt Cloud jobs, you need to perform the - [Create a Databricks Python notebook](#create-a-databricks-python-notebook) - [Configure the workflows to run the dbt Cloud jobs](#configure-the-workflows-to-run-the-dbt-cloud-jobs) -### Set up a Databricks secret scope +## Set up a Databricks secret scope 1. Retrieve **[User API Token](https://docs.getdbt.com/docs/dbt-cloud-apis/user-tokens#user-api-tokens) **or **[Service Account Token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens#generating-service-account-tokens) **from dbt Cloud 2. Set up a **Databricks secret scope**, which is used to securely store your dbt Cloud API key. @@ -47,7 +55,7 @@ databricks secrets put --scope --key --s 5. Replace **``** with the actual API key value that you copied from dbt Cloud in step 1. -### Create a Databricks Python notebook +## Create a Databricks Python notebook 1. 
[Create a **Databricks Python notebook**](https://docs.databricks.com/notebooks/notebooks-manage.html), which executes a Python script that calls the dbt Cloud job API. @@ -165,7 +173,7 @@ DbtJobRunStatus.SUCCESS You can cancel the job from dbt Cloud if necessary. ::: -### Configure the workflows to run the dbt Cloud jobs +## Configure the workflows to run the dbt Cloud jobs You can set up workflows directly from the notebook OR by adding this notebook to one of your existing workflows: @@ -206,6 +214,4 @@ You can set up workflows directly from the notebook OR by adding this notebook t Multiple Workflow tasks can be set up using the same notebook by configuring the `job_id` parameter to point to different dbt Cloud jobs. -## Closing - Using Databricks workflows to access the dbt Cloud job API can improve integration of your data pipeline processes and enable scheduling of more complex workflows. diff --git a/website/docs/guides/legacy/creating-date-partitioned-tables.md b/website/docs/guides/legacy/creating-date-partitioned-tables.md deleted file mode 100644 index 8c461dbe4a8..00000000000 --- a/website/docs/guides/legacy/creating-date-partitioned-tables.md +++ /dev/null @@ -1,117 +0,0 @@ ---- -title: "BigQuery: Creating date-partitioned tables" -id: "creating-date-partitioned-tables" ---- - - -:::caution Deprecated - -The functionality described below was introduced in dbt Core v0.10 (March 2018). In v1.0 (December 2021), it was deprecated in favor of [column-based partitioning](/reference/resource-configs/bigquery-configs#partition-clause) and [incremental modeling](/docs/build/incremental-models). - -::: - -dbt supports the creation of [date partitioned tables](https://cloud.google.com/bigquery/docs/partitioned-tables) in BigQuery. - -To configure a dbt model as a date partitioned , use the `materialized='table'` model configuration in conjunction with a list of `partitions`. dbt will execute your model query once for each specified partition. For example: - - - -```sql -{{ - config( - materialized='table', - partitions=[20180101, 20180102], - verbose=True - ) -}} - -/* - -dbt will interpolate each `partition` wherever it finds [DBT__PARTITION_DATE] -in your model code. This model will create a single table with two partitions: - 1. 20180101 - 2. 20180102 - -These partitions will be created by running the following query against -each of the following date-sharded tables: - - 1. `snowplow`.`events_20180101` - 2. `snowplow`.`events_20180102` - -*/ - -select * -from `snowplow`.`events_[DBT__PARTITION_DATE]` -``` - - - -To make this model more dynamic, we can use the `dbt.partition_range` macro to generate a list of 8-digit dates in a specified range. Further, dbt provides a handy macro, `date_sharded_table`, for getting a date-sharded by its prefix for a given date. Together, this looks like: - - - -```sql -{{ - config( - materialized='table', - partitions=dbt.partition_range('20180101, 20180201'), - verbose=True - ) -}} - --- This model creates a date-partitioned table. There will be one --- partition for each day between 20180101 and 20180201, inclusive. --- The `date_sharded_table` macro below is sugar around [DBT__PARTITION_DATE] - -select * -from `snowplow`.`{{ date_sharded_table('events_') }}` -``` - - - -Finally, it's frequently desirable to only update a date partitioned table for the last day of received data. This can be implemented using the above configurations in conjunction with a clever macro and some [command line variables](/docs/build/project-variables). 
- -First, the macro: - - - -```sql -{% macro yesterday() %} - - {% set today = modules.datetime.date.today() %} - {% set one_day = modules.datetime.timedelta(days=1) %} - {% set yesterday = (today - one_day) %} - - {{ return(yesterday.strftime("%Y%m%d")) }} - -{% endmacro %} -``` - - - -Next, use it in the model: - - - -```sql -{{ - config( - materialized='table', - partitions=dbt.partition_range(var('dates', default=yesterday())), - verbose=True - ) -}} - -select * -from `snowplow`.`{{ date_sharded_table('events_') }}` -``` - - - -If a `dates` variable is provided (eg. on the command line with `--vars`), then dbt will create the partitions for that date range. Otherwise, dbt will create a partition for `yesterday`, overwriting it if it already exists. - -Here's an example of running this model for the first 3 days of 2018 as a part of a backfill: - -``` -dbt run --select partitioned_yesterday --vars 'dates: "20180101, 20180103"' -``` diff --git a/website/docs/guides/legacy/videos.md b/website/docs/guides/legacy/videos.md deleted file mode 100644 index 863029ff6d9..00000000000 --- a/website/docs/guides/legacy/videos.md +++ /dev/null @@ -1,13 +0,0 @@ ---- -title: "Videos 🎥" -id: "videos" ---- - -Check out some cool videos about using and deploying dbt! - -## dbt tutorial (February, 2017) - - - -## dbt docs demo with GitLab (September, 2018) - diff --git a/website/docs/quickstarts/manual-install-qs.md b/website/docs/guides/manual-install-qs.md similarity index 97% rename from website/docs/quickstarts/manual-install-qs.md rename to website/docs/guides/manual-install-qs.md index fc43d38115b..61796fe008a 100644 --- a/website/docs/quickstarts/manual-install-qs.md +++ b/website/docs/guides/manual-install-qs.md @@ -2,20 +2,21 @@ title: "Quickstart for dbt Core from a manual install" id: manual-install description: "Connecting your warehouse to dbt Core using the CLI." -sidebar_label: "Manual install quickstart" +level: 'Beginner' platform: 'dbt-core' icon: 'fa-light fa-square-terminal' +tags: ['dbt Core','Quickstart'] hide_table_of_contents: true --- ## Introduction -When you use dbt Core to work with dbt, you will be editing files locally using a code editor, and running projects using a command line interface (CLI). If you'd rather edit files and run projects using the web-based Integrated Development Environment (IDE), you should refer to the [dbt Cloud quickstarts](/quickstarts). You can also develop and run dbt commands using the [dbt Cloud CLI](/docs/cloud/cloud-cli-installation) — a dbt Cloud powered command line. +When you use dbt Core to work with dbt, you will be editing files locally using a code editor, and running projects using a command line interface (CLI). If you'd rather edit files and run projects using the web-based Integrated Development Environment (IDE), you should refer to the [dbt Cloud quickstarts](/guides). You can also develop and run dbt commands using the [dbt Cloud CLI](/docs/cloud/cloud-cli-installation) — a dbt Cloud powered command line. ### Prerequisites * To use dbt Core, it's important that you know some basics of the Terminal. In particular, you should understand `cd`, `ls` and `pwd` to navigate through the directory structure of your computer easily. * Install dbt Core using the [installation instructions](/docs/core/installation) for your operating system. -* Complete [Setting up (in BigQuery)](/quickstarts/bigquery?step=2) and [Loading data (BigQuery)](/quickstarts/bigquery?step=3). 
+* Complete [Setting up (in BigQuery)](/guides/bigquery?step=2) and [Loading data (BigQuery)](/guides/bigquery?step=3). * [Create a GitHub account](https://github.com/join) if you don't already have one. ### Create a starter project diff --git a/website/docs/guides/migration/tools/migrating-from-spark-to-databricks.md b/website/docs/guides/migrate-from-spark-to-databricks.md similarity index 78% rename from website/docs/guides/migration/tools/migrating-from-spark-to-databricks.md rename to website/docs/guides/migrate-from-spark-to-databricks.md index cd0577c2d96..8fb02ae79d7 100644 --- a/website/docs/guides/migration/tools/migrating-from-spark-to-databricks.md +++ b/website/docs/guides/migrate-from-spark-to-databricks.md @@ -1,18 +1,34 @@ --- -title: "Migrating from dbt-spark to dbt-databricks" -id: "migrating-from-spark-to-databricks" +title: "Migrate from dbt-spark to dbt-databricks" +id: "migrate-from-spark-to-databricks" +description: Learn how to migrate from dbt-spark to dbt-databricks. +displayText: Migrate from Spark to Databricks +hoverSnippet: Learn how to migrate from dbt-spark to dbt-databricks. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Migration', 'dbt Core','dbt Cloud'] +level: 'Intermediate' +recently_updated: true --- -You can [migrate your projects](#migrate-your-dbt-projects) from using the `dbt-spark` adapter to using the [dbt-databricks adapter](https://github.com/databricks/dbt-databricks). In collaboration with dbt Labs, Databricks built this adapter using dbt-spark as the foundation and added some critical improvements. With it, you get an easier set up — requiring only three inputs for authentication — and more features such as support for [Unity Catalog](https://www.databricks.com/product/unity-catalog). +## Introduction -## Simpler authentication +You can migrate your projects from using the `dbt-spark` adapter to using the [dbt-databricks adapter](https://github.com/databricks/dbt-databricks). In collaboration with dbt Labs, Databricks built this adapter using dbt-spark as the foundation and added some critical improvements. With it, you get an easier set up — requiring only three inputs for authentication — and more features such as support for [Unity Catalog](https://www.databricks.com/product/unity-catalog). + +### Prerequisites + +- Your project must be compatible with dbt 1.0 or greater. Refer to [Upgrading to v1.0](/docs/dbt-versions/core-upgrade/upgrading-to-v1.0) for details. For the latest version of dbt, refer to [Upgrading to v1.7](/docs/dbt-versions/core-upgrade/upgrading-to-v1.7). +- For dbt Cloud, you need administrative (admin) privileges to migrate dbt projects. + +### Simpler authentication Previously, you had to provide a `cluster` or `endpoint` ID which was hard to parse from the `http_path` that you were given. Now, it doesn't matter if you're using a cluster or an SQL endpoint because the [dbt-databricks setup](/docs/core/connect-data-platform/databricks-setup) requires the _same_ inputs for both. All you need to provide is: - hostname of the Databricks workspace - HTTP path of the Databricks SQL warehouse or cluster - appropriate credentials -## Better defaults +### Better defaults The `dbt-databricks` adapter provides better defaults than `dbt-spark` does. The defaults help optimize your workflow so you can get the fast performance and cost-effectiveness of Databricks. 
They are: @@ -24,24 +40,14 @@ With dbt-spark, however, the default for `incremental_strategy` is `append`. If For more information on defaults, see [Caveats](/docs/core/connect-data-platform/databricks-setup#caveats). -## Pure Python +### Pure Python If you use dbt Core, you no longer have to download an independent driver to interact with Databricks. The connection information is all embedded in a pure-Python library called `databricks-sql-connector`. -## Migrate your dbt projects - -In both dbt Core and dbt Cloud, you can migrate your projects to the Databricks-specific adapter from the generic Apache Spark adapter. - -### Prerequisites - -- Your project must be compatible with dbt 1.0 or greater. Refer to [Upgrading to v1.0](/docs/dbt-versions/core-upgrade/upgrading-to-v1.0) for details. For the latest version of dbt, refer to [Upgrading to v1.3](/docs/dbt-versions/core-upgrade/upgrading-to-v1.3). -- For dbt Cloud, you need administrative (admin) privileges to migrate dbt projects. - - - +## Migrate your dbt projects in dbt Cloud - +You can migrate your projects to the Databricks-specific adapter from the generic Apache Spark adapter. If you're using dbt Core, then skip to Step 4. The migration to the `dbt-databricks` adapter from `dbt-spark` shouldn't cause any downtime for production jobs. dbt Labs recommends that you schedule the connection change when usage of the IDE is light to avoid disrupting your team. @@ -60,7 +66,7 @@ To update your Databricks connection in dbt Cloud: Everyone in your organization who uses dbt Cloud must refresh the IDE before starting work again. It should refresh in less than a minute. -#### About your credentials +## Configure your credentials When you update the Databricks connection in dbt Cloud, your team will not lose their credentials. This makes migrating easier since it only requires you to delete the Databricks connection and re-add the cluster or endpoint information. @@ -70,9 +76,7 @@ These credentials will not get lost when there's a successful connection to Data - The personal access tokens your team added in their dbt Cloud profile so they can develop in the IDE for a given project. - The access token you added for each deployment environment so dbt Cloud can connect to Databricks during production jobs. - - - +## Migrate dbt projects in dbt Core To migrate your dbt Core projects to the `dbt-databricks` adapter from `dbt-spark`, you: 1. Install the [dbt-databricks adapter](https://github.com/databricks/dbt-databricks) in your environment @@ -80,13 +84,8 @@ To migrate your dbt Core projects to the `dbt-databricks` adapter from `dbt-spar Anyone who's using your project must also make these changes in their environment. - - - - - -### Examples +## Try these examples You can use the following examples of the `profiles.yml` file to see the authentication setup with `dbt-spark` compared to the simpler setup with `dbt-databricks` when connecting to an SQL endpoint. A cluster example would look similar. 
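For orientation, the sketch below shows the general shape of the two setups for an SQL endpoint — every profile name, hostname, path, token, and schema value here is a placeholder, and the authoritative field list is in the [dbt-databricks setup](/docs/core/connect-data-platform/databricks-setup) docs:

```yaml
# Placeholder values throughout -- adjust names, hosts, paths, and credentials for your workspace.

# dbt-spark, using the ODBC method: needs a driver plus a parsed-out endpoint (or cluster) ID.
spark_endpoint_profile:
  target: dev
  outputs:
    dev:
      type: spark
      method: odbc
      driver: /opt/simba/spark/lib/64/libsparkodbc_sb64.so
      host: dbc-l1234567-8901.cloud.databricks.com
      endpoint: 1234567890abcdef
      token: dapiXXXXXXXXXXXXXXXX
      schema: analytics

# dbt-databricks: only the workspace host, the HTTP path, and a credential are needed.
databricks_endpoint_profile:
  target: dev
  outputs:
    dev:
      type: databricks
      host: dbc-l1234567-8901.cloud.databricks.com
      http_path: /sql/1.0/endpoints/1234567890abcdef
      token: dapiXXXXXXXXXXXXXXXX
      schema: analytics
```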
diff --git a/website/docs/guides/migrate-from-stored-procedures.md b/website/docs/guides/migrate-from-stored-procedures.md
new file mode 100644
index 00000000000..c894bce9873
--- /dev/null
+++ b/website/docs/guides/migrate-from-stored-procedures.md
@@ -0,0 +1,377 @@
+---
+title: Migrate from DDL, DML, and stored procedures
+id: migrate-from-stored-procedures
+description: Learn how to transform from a historical codebase of mixed DDL and DML statements to dbt models, including tips and patterns for the shift from a procedural to a declarative approach in defining datasets.
+displayText: Migrate from DDL, DML, and stored procedures
+hoverSnippet: Learn how to transform from a historical codebase of mixed DDL and DML statements to dbt models
+# time_to_complete: '30 minutes' commenting out until we test
+platform: 'dbt-core'
+icon: 'guides'
+hide_table_of_contents: true
+tags: ['Migration', 'dbt Core']
+level: 'Beginner'
+recently_updated: true
+---
+
+## Introduction
+
+One of the more common situations that new dbt adopters encounter is a historical codebase of transformations written as a hodgepodge of DDL and DML statements, or stored procedures. Going from DML statements to dbt models is often a challenging hump for new users to get over, because the process involves a significant paradigm shift from a procedural flow of building a dataset (e.g. a series of DDL and DML statements) to a declarative approach to defining a dataset (e.g. how dbt uses SELECT statements to express data models). This guide aims to provide tips, tricks, and common patterns for converting DML statements to dbt models.
+
+### Preparing to migrate
+
+Before getting into the meat of conversion, it’s worth noting that DML statements will not always illustrate the comprehensive set of columns and column types that an original table might contain. Without knowing the DDL used to create the table, it’s impossible to know precisely whether your conversion effort is apples-to-apples, but you can generally get close.
+
+If your data warehouse supports `SHOW CREATE TABLE`, that can be a quick way to get a comprehensive set of columns you’ll want to recreate. If you don’t have the DDL, but are working on a substantial stored procedure, one approach that can work is to pull column lists out of any DML statements that modify the table, and build up a full set of the columns that appear.
+
+As for ensuring that you have the right column types: since models materialized by dbt generally use `CREATE TABLE AS SELECT` or `CREATE VIEW AS SELECT` as the driver for object creation, tables can end up with unintended column types if the queries aren’t explicit. For example, if you care about `INT` versus `DECIMAL` versus `NUMERIC`, it’s generally going to be best to be explicit. The good news is that this is easy with dbt: you just cast the column to the type you intend.
+
+We also generally recommend that column renaming and type casting happen as close to the source tables as possible, typically in a layer of staging transformations, which helps ensure that future dbt modelers will know where to look for those transformations! See [How we structure our dbt projects](/best-practices/how-we-structure/1-guide-overview) for more guidance on overall project structure.
+
+### Operations we need to map
+
+There are four primary DML statements that you are likely to have to convert to dbt operations while migrating a procedure:
+
+- `INSERT`
+- `UPDATE`
+- `DELETE`
+- `MERGE`
+
+Each of these can be addressed using various techniques in dbt. Handling `MERGE`s is a bit more involved than the rest, but can be handled effectively via dbt. The first three, however, are fairly simple to convert.
+
+## Map INSERTs
+
+An `INSERT` statement is functionally the same as using dbt to `SELECT` from an existing source or other dbt model. If you are faced with an `INSERT`-`SELECT` statement, the easiest way to convert the statement is to just create a new dbt model, and pull the `SELECT` portion of the `INSERT` statement out of the procedure and into the model. That’s basically it!
+
+To really break it down, let’s consider a simple example:
+
+```sql
+INSERT INTO returned_orders (order_id, order_date, total_return)
+
+SELECT order_id, order_date, total FROM orders WHERE type = 'return'
+```
+
+Converting this with a first pass to a [dbt model](/guides/bigquery?step=8) (in a file called `returned_orders.sql`) might look something like:
+
+```sql
+SELECT
+    order_id as order_id,
+    order_date as order_date,
+    total as total_return
+
+FROM {{ ref('orders') }}
+
+WHERE type = 'return'
+```
+
+Functionally, this would create a model (which could be materialized as a table or view depending on needs) called `returned_orders` that contains three columns (`order_id`, `order_date`, and `total_return`), predicated on the `type` column. It achieves the same end as the `INSERT`, just in a declarative fashion, using dbt.
+
+### **A note on `FROM` clauses**
+
+In dbt, using a hard-coded table or view name in a `FROM` clause is one of the most serious mistakes new users make. dbt uses the `ref` and `source` macros to discover the order in which transformations need to execute, and if you don’t use them, you’ll be unable to benefit from dbt’s built-in lineage generation and pipeline execution. In the sample code throughout the remainder of this article, we’ll use `ref` statements in the dbt-converted versions of SQL statements, but it is an exercise for the reader to ensure that those models exist in their dbt projects.
+
+### **Sequential `INSERT`s to an existing table can be `UNION ALL`’ed together**
+
+Since dbt models effectively perform a single `CREATE TABLE AS SELECT` (or, if you break it down into steps, a `CREATE` followed by an `INSERT`), you may run into complexities if there are multiple `INSERT` statements in your transformation that all insert data into the same table. Fortunately, this is a simple thing to handle in dbt. Effectively, the logic is performing a `UNION ALL` between the `INSERT` queries. If I have a transformation flow that looks something like (ignore the contrived nature of the scenario):
+
+```sql
+CREATE TABLE all_customers
+
+INSERT INTO all_customers SELECT * FROM us_customers
+
+INSERT INTO all_customers SELECT * FROM eu_customers
+```
+
+The dbt-ified version of this would end up looking something like:
+
+```sql
+SELECT * FROM {{ ref('us_customers') }}
+
+UNION ALL
+
+SELECT * FROM {{ ref('eu_customers') }}
+```
+
+The logic is functionally equivalent. So if there’s another statement that `INSERT`s into a model that I’ve already created, I can just add that logic into a second `SELECT` statement that is just `UNION ALL`'ed with the first. Easy!
+
+## Map UPDATEs
+
+`UPDATE`s start to increase the complexity of your transformations, but fortunately, they’re pretty darn simple to migrate, as well. The thought process that you go through when translating an `UPDATE` is quite similar to how an `INSERT` works, but the logic for the `SELECT` list in the dbt model is primarily sourced from the content in the `SET` section of the `UPDATE` statement. Let’s look at a simple example:
+
+```sql
+UPDATE orders
+
+SET type = 'return'
+
+WHERE total < 0
+```
+
+The way to look at this is similar to an `INSERT`-`SELECT` statement. The table being updated is the model you want to modify, and since this is an `UPDATE`, that model has likely already been created, and you can either:
+
+- add to it with subsequent transformations
+- create an intermediate model that builds off of the original model – perhaps naming it something like `int_[entity]_[verb].sql`.
+
+The `SELECT` list should contain all of the columns for the table, but for the specific columns being updated by the DML, you’ll use the computation on the right side of the equals sign as the `SELECT`ed value. Then, you can use the target column name on the left of the equals sign as the column alias.
+
+If I were building an intermediate transformation, the above query would translate to something along the lines of:
+
+```sql
+SELECT
+    CASE
+        WHEN total < 0 THEN 'return'
+        ELSE type
+    END AS type,
+
+    order_id,
+    order_date
+
+FROM {{ ref('stg_orders') }}
+```
+
+Since the `UPDATE` statement doesn’t modify every value of the `type` column, we use a `CASE` statement to apply the logic of the `UPDATE`’s `WHERE` clause. We still want to select all of the columns that should end up in the target table. If we left one of the columns out, it wouldn’t be passed through to the target table at all due to dbt’s declarative approach.
+
+Sometimes, you may not be sure what all the columns are in a table, or, as in the situation above, you’re only modifying a small number of columns relative to the total number of columns in the table. It can be cumbersome to list out every column in the table, but fortunately dbt contains some useful utility macros that can help list out the full column list of a table.
+
+Another way I could have written the model a bit more dynamically might be:
+
+```sql
+SELECT
+    {{ dbt_utils.star(from=ref('stg_orders'), except=['type']) }},
+    CASE
+        WHEN total < 0 THEN 'return'
+        ELSE type
+    END AS type
+
+FROM {{ ref('stg_orders') }}
+```
+
+The `dbt_utils.star()` macro will print out the full list of columns in the table, but skip the ones I’ve listed in the `except` list, which allows me to perform the same logic while writing fewer lines of code. This is a simple example of using dbt macros to simplify and shorten your code, and dbt can get a lot more sophisticated as you learn more techniques. Read more about the [dbt_utils package](https://hub.getdbt.com/dbt-labs/dbt_utils/latest/) and the [star macro](https://github.com/dbt-labs/dbt-utils/tree/0.8.6/#star-source).
+
+## Map DELETEs
+
+One of the biggest differences between a procedural transformation and how dbt models data is that dbt, in general, will never destroy data. While there are ways to execute hard `DELETE`s in dbt that are outside of the scope of this article, the general best practice for handling deleted data is to just use soft deletes, and filter out soft-deleted data in a final transformation.
+
+Let’s consider a simple example query:
+
+```sql
+DELETE FROM stg_orders WHERE order_status IS NULL
+```
+
+In a dbt model, you’ll need to first identify the records that should be deleted and then filter them out. There are really two primary ways you might translate this query:
+
+```sql
+SELECT * FROM {{ ref('stg_orders') }} WHERE order_status IS NOT NULL
+```
+
+This first approach just inverts the logic of the `DELETE` to describe the set of records that should remain, instead of the set of records that should be removed. This ties back to the way dbt declaratively describes datasets. You reference the data that should be in a dataset, and the table or view gets created with that set of data.
+
+Another way you could achieve this is by marking the deleted records, and then filtering them out. For example:
+
+```sql
+WITH
+
+soft_deletes AS (
+
+    SELECT
+        *,
+        CASE
+            WHEN order_status IS NULL THEN true
+            ELSE false
+        END AS to_delete
+
+    FROM {{ ref('stg_orders') }}
+
+)
+
+SELECT * FROM soft_deletes WHERE to_delete = false
+```
+
+This approach flags all of the deleted records, and the final `SELECT` filters out any deleted data, so the resulting table contains only the remaining records. It’s a lot more verbose than just inverting the `DELETE` logic, but for complex `DELETE` logic, this ends up being a very effective way of performing the `DELETE` while retaining historical context.
+
+It’s worth calling out that while this doesn’t enable a hard delete, hard deletes can be executed in a number of ways, the most common being to execute a dbt [macro](/docs/build/jinja-macros) as a [run-operation](https://docs.getdbt.com/reference/commands/run-operation), or to use a [post-hook](https://docs.getdbt.com/reference/resource-configs/pre-hook-post-hook/) to perform a `DELETE` statement after the records to-be-deleted have been marked. These are advanced approaches outside the scope of this guide.
+
+## Map MERGEs
+
+dbt has a concept called [materialization](/docs/build/materializations), which determines how a model is physically or logically represented in the warehouse. `INSERT`s, `UPDATE`s, and `DELETE`s will typically be accomplished using table or view materializations. For incremental workloads accomplished via commands like `MERGE` or `UPSERT`, dbt has a particular materialization called [incremental](/docs/build/incremental-models). The incremental materialization is specifically used to handle incremental loads and updates to a table without recreating the entire table from scratch on every run.
+
+### Step 1: Map the MERGE like an INSERT/UPDATE to start
+
+Before we get into the exact details of how to implement an incremental materialization, let’s talk about logic conversion. Extracting the logic of the `MERGE` and handling it as you would an `INSERT` or an `UPDATE` is the easiest way to get started migrating a `MERGE` command.
+
+To see how the logic conversion works, we’ll start with an example `MERGE`. In this scenario, imagine a ride-sharing app where rides are loaded into a details table daily, and tips may be updated at some later date and need to be kept up-to-date:
+
+```sql
+MERGE INTO ride_details USING (
+    SELECT
+        ride_id,
+        subtotal,
+        tip
+
+    FROM rides_to_load
+) AS rtl
+
+ON ride_details.ride_id = rtl.ride_id
+
+WHEN MATCHED THEN UPDATE
+SET ride_details.tip = rtl.tip
+
+WHEN NOT MATCHED THEN INSERT (ride_id, subtotal, tip)
+VALUES (rtl.ride_id, rtl.subtotal, NVL(rtl.tip, 0));
+```
+
+The content of the `USING` clause is a useful piece of code because that can easily be placed in a CTE as a starting point for handling the match statement. I find that the easiest way to break this apart is to treat each match statement as a separate CTE that builds on the previous match statements.
+
+We can ignore the `ON` clause for now, as that will only come into play once we get to a point where we’re ready to turn this into an incremental.
+
+As with `UPDATE`s and `INSERT`s, you can use the `SELECT` list and aliases to name columns appropriately for the target table, and `UNION` together the `INSERT` statements (taking care to use `UNION`, rather than `UNION ALL`, to avoid duplicates).
+
+The `MERGE` would end up translating to something like this:
+
+```sql
+WITH
+
+using_clause AS (
+
+    SELECT
+        ride_id,
+        subtotal,
+        tip
+
+    FROM {{ ref('rides_to_load') }}
+
+),
+
+updates AS (
+
+    SELECT
+        ride_id,
+        subtotal,
+        tip
+
+    FROM using_clause
+
+),
+
+inserts AS (
+
+    SELECT
+        ride_id,
+        subtotal,
+        NVL(tip, 0) AS tip
+
+    FROM using_clause
+
+)
+
+SELECT *
+
+FROM updates
+
+UNION
+
+SELECT *
+
+FROM inserts
+```
+
+To be clear, this transformation isn’t complete. The logic here is similar to the `MERGE`, but will not actually do the same thing, since the `updates` and `inserts` CTEs are both selecting from the same source query. We’ll need to ensure we grab the separate sets of data as we transition to the incremental materialization.
+
+One important caveat is that dbt does not natively support `DELETE` as a `MATCH` action. If you have a line in your `MERGE` statement that uses `WHEN MATCHED THEN DELETE`, you’ll want to treat it like an update and add a soft-delete flag, which is then filtered out in a follow-on transformation.
+
+### Step 2: Convert to incremental materialization
+
+As mentioned above, incremental materializations are a little special in that, when the target table does not exist, the materialization functions in nearly the same way as a standard table materialization and executes a `CREATE TABLE AS SELECT` statement. If the target table does exist, however, the materialization instead executes a `MERGE` statement.
+
+Since a `MERGE` requires a `JOIN` condition between the `USING` clause and the target table, we need a way to specify how dbt determines whether a record triggers a match. That particular piece of information is specified in the dbt model configuration.
+
+We can add the following `config()` block to the top of our model to specify how it should build incrementally:
+
+```sql
+{{
+    config(
+        materialized='incremental',
+        unique_key='ride_id',
+        incremental_strategy='merge'
+    )
+}}
+```
+
+The three configuration fields in this example are the most important ones:
+
+- Setting `materialized='incremental'` tells dbt to apply UPSERT logic to the target table.
+- The `unique_key` should be a primary key of the target table. This is used to match records with the existing table.
+- `incremental_strategy` here is set to `merge`, which merges incoming rows into any existing rows in the target table whose `unique_key` value matches. There are [various incremental strategies](/docs/build/incremental-models#about-incremental_strategy) for different situations and warehouses.
+
+The bulk of the work in converting a model to an incremental materialization comes in determining how the logic should change for incremental loads versus full backfills or initial loads. dbt offers a special macro, `is_incremental()`, which evaluates to false for initial loads or for backfills (called full refreshes in dbt parlance), but true for incremental loads.
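In isolation, the pattern looks something like the following minimal sketch — an illustrative aside rather than part of the original procedure, reusing the ride-sharing columns from the example above (`{{ this }}`, which refers to the model’s own target table, is explained at the end of this section):

```sql
{{
    config(
        materialized='incremental',
        unique_key='ride_id',
        incremental_strategy='merge'
    )
}}

SELECT
    ride_id,
    subtotal,
    tip,
    load_timestamp

FROM {{ ref('rides_to_load') }}

-- On the first run (or with --full-refresh) this filter is skipped and the whole
-- source is selected; on incremental runs only rows newer than the target are pulled in.
{% if is_incremental() %}

WHERE load_timestamp > (SELECT max(load_timestamp) FROM {{ this }})

{% endif %}
```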
+
+This macro can be used to augment the model code to adjust how data is loaded for subsequent loads. How that logic should be added will depend a little bit on how data is received. Some common ways might be:
+
+1. The source table is truncated ahead of incremental loads, and only contains the data to be loaded in that increment.
+2. The source table contains all historical data, and there is a load timestamp column that identifies new data to be loaded.
+
+In the first case, the work is essentially done already. Since the source table always contains only the new data to be loaded, the query doesn’t have to change for incremental loads. The second case, however, requires the use of the `is_incremental()` macro to correctly handle the logic.
+
+Taking the converted `MERGE` statement that we’d put together previously, we’d augment it to add this additional logic:
+
+```sql
+WITH
+
+using_clause AS (
+
+    SELECT
+        ride_id,
+        subtotal,
+        tip,
+        load_timestamp
+
+    FROM {{ ref('rides_to_load') }}
+
+    {% if is_incremental() %}
+
+    WHERE load_timestamp > (SELECT max(load_timestamp) FROM {{ this }})
+
+    {% endif %}
+
+),
+
+updates AS (
+
+    SELECT
+        ride_id,
+        subtotal,
+        tip,
+        load_timestamp
+
+    FROM using_clause
+
+    {% if is_incremental() %}
+
+    WHERE ride_id IN (SELECT ride_id FROM {{ this }})
+
+    {% endif %}
+
+),
+
+inserts AS (
+
+    SELECT
+        ride_id,
+        subtotal,
+        NVL(tip, 0) AS tip,
+        load_timestamp
+
+    FROM using_clause
+
+    WHERE ride_id NOT IN (SELECT ride_id FROM updates)
+
+)
+
+SELECT * FROM updates UNION SELECT * FROM inserts
+```
+
+There are a couple of important concepts to understand here:
+
+1. The code in the `is_incremental()` conditional block only executes for incremental executions of this model code. If the target table doesn’t exist, or if the `--full-refresh` option is used, that code will not execute.
+2. `{{ this }}` is a special keyword in dbt that, when used in a Jinja block, refers to the model for which the code is executing. So if you have a model in a file called `my_incremental_model.sql`, `{{ this }}` will refer to `my_incremental_model` (fully qualified with database and schema name if necessary). By using that keyword, we can leverage the current state of the target table to inform the source query.
+
+## Migrate stored procedures
+
+The techniques shared above are useful ways to get started converting the individual DML statements that are often found in stored procedures. Using these types of patterns, legacy procedural code can be rapidly transitioned to dbt models that are much more readable, maintainable, and benefit from software engineering best practices like DRY principles. Additionally, once transformations are rewritten as dbt models, it becomes much easier to test the transformations to ensure that the data being used downstream is high-quality and trustworthy.
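On that last point: once the converted models exist, a small amount of YAML is enough to start testing them. As a minimal sketch — assuming the converted incremental model is named `ride_details`, as in the example above — you might add something like:

```yaml
# models/schema.yml (hypothetical path) -- basic tests for the converted model
version: 2

models:
  - name: ride_details
    description: "Incremental model converted from the legacy MERGE-based load."
    columns:
      - name: ride_id
        description: "Ride identifier, also used as the incremental unique_key."
        tests:
          - unique
          - not_null
```

Running `dbt test` then checks those assumptions on every build.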
diff --git a/website/docs/guides/migration/tools/migrating-from-stored-procedures/1-migrating-from-stored-procedures.md b/website/docs/guides/migration/tools/migrating-from-stored-procedures/1-migrating-from-stored-procedures.md deleted file mode 100644 index aae8b373b2c..00000000000 --- a/website/docs/guides/migration/tools/migrating-from-stored-procedures/1-migrating-from-stored-procedures.md +++ /dev/null @@ -1,27 +0,0 @@ ---- -title: Migrating from DDL, DML, and stored procedures -id: 1-migrating-from-stored-procedures ---- - -One of the more common situations that new dbt adopters encounter is a historical codebase of transformations written as a hodgepodge of DDL and DML statements, or stored procedures. Going from DML statements to dbt models is often a challenging hump for new users to get over, because the process involves a significant paradigm shift between a procedural flow of building a dataset (e.g. a series of DDL and DML statements) to a declarative approach to defining a dataset (e.g. how dbt uses SELECT statements to express data models). This guide aims to provide tips, tricks, and common patterns for converting DML statements to dbt models. - -## Preparing to migrate - -Before getting into the meat of conversion, it’s worth noting that DML statements will not always illustrate a comprehensive set of columns and column types that an original table might contain. Without knowing the DDL to create the table, it’s impossible to know precisely if your conversion effort is apples-to-apples, but you can generally get close. - -If your supports `SHOW CREATE TABLE`, that can be a quick way to get a comprehensive set of columns you’ll want to recreate. If you don’t have the DDL, but are working on a substantial stored procedure, one approach that can work is to pull column lists out of any DML statements that modify the table, and build up a full set of the columns that appear. - -As for ensuring that you have the right column types, since models materialized by dbt generally use `CREATE TABLE AS SELECT` or `CREATE VIEW AS SELECT` as the driver for object creation, tables can end up with unintended column types if the queries aren’t explicit. For example, if you care about `INT` versus `DECIMAL` versus `NUMERIC`, it’s generally going to be best to be explicit. The good news is that this is easy with dbt: you just cast the column to the type you intend. - -We also generally recommend that column renaming and type casting happen as close to the source tables as possible, typically in a layer of staging transformations, which helps ensure that future dbt modelers will know where to look for those transformations! See [How we structure our dbt projects](/guides/best-practices/how-we-structure/1-guide-overview) for more guidance on overall project structure. - -### Operations we need to map - -There are four primary DML statements that you are likely to have to convert to dbt operations while migrating a procedure: - -- `INSERT` -- `UPDATE` -- `DELETE` -- `MERGE` - -Each of these can be addressed using various techniques in dbt. Handling `MERGE`s is a bit more involved than the rest, but can be handled effectively via dbt. The first three, however, are fairly simple to convert. 
diff --git a/website/docs/guides/migration/tools/migrating-from-stored-procedures/2-mapping-inserts.md b/website/docs/guides/migration/tools/migrating-from-stored-procedures/2-mapping-inserts.md deleted file mode 100644 index d8f31a0f14a..00000000000 --- a/website/docs/guides/migration/tools/migrating-from-stored-procedures/2-mapping-inserts.md +++ /dev/null @@ -1,57 +0,0 @@ ---- -title: Inserts -id: 2-inserts ---- - -An `INSERT` statement is functionally the same as using dbt to `SELECT` from an existing source or other dbt model. If you are faced with an `INSERT`-`SELECT` statement, the easiest way to convert the statement is to just create a new dbt model, and pull the `SELECT` portion of the `INSERT` statement out of the procedure and into the model. That’s basically it! - -To really break it down, let’s consider a simple example: - -```sql -INSERT INTO returned_orders (order_id, order_date, total_return) - -SELECT order_id, order_date, total FROM orders WHERE type = 'return' -``` - -Converting this with a first pass to a [dbt model](/quickstarts/bigquery?step=8) (in a file called returned_orders.sql) might look something like: - -```sql -SELECT - order_id as order_id, - order_date as order_date, - total as total_return - -FROM {{ ref('orders') }} - -WHERE type = 'return' -``` - -Functionally, this would create a model (which could be materialized as a table or view depending on needs) called `returned_orders` that contains three columns: `order_id`, `order_date`, `total_return`) predicated on the type column. It achieves the same end as the `INSERT`, just in a declarative fashion, using dbt. - -## **A note on `FROM` clauses** - -In dbt, using a hard-coded table or view name in a `FROM` clause is one of the most serious mistakes new users make. dbt uses the ref and source macros to discover the ordering that transformations need to execute in, and if you don’t use them, you’ll be unable to benefit from dbt’s built-in lineage generation and pipeline execution. In the sample code throughout the remainder of this article, we’ll use ref statements in the dbt-converted versions of SQL statements, but it is an exercise for the reader to ensure that those models exist in their dbt projects. - -## **Sequential `INSERT`s to an existing table can be `UNION ALL`’ed together** - -Since dbt models effectively perform a single `CREATE TABLE AS SELECT` (or if you break it down into steps, `CREATE`, then an `INSERT`), you may run into complexities if there are multiple `INSERT` statements in your transformation that all insert data into the same table. Fortunately, this is a simple thing to handle in dbt. Effectively, the logic is performing a `UNION ALL` between the `INSERT` queries. If I have a transformation flow that looks something like (ignore the contrived nature of the scenario): - -```sql -CREATE TABLE all_customers - -INSERT INTO all_customers SELECT * FROM us_customers - -INSERT INTO all_customers SELECT * FROM eu_customers -``` - -The dbt-ified version of this would end up looking something like: - -```sql -SELECT * FROM {{ ref('us_customers') }} - -UNION ALL - -SELECT * FROM {{ ref('eu_customers') }} -``` - -The logic is functionally equivalent. So if there’s another statement that `INSERT`s into a model that I’ve already created, I can just add that logic into a second `SELECT` statement that is just `UNION ALL`'ed with the first. Easy! 
diff --git a/website/docs/guides/migration/tools/migrating-from-stored-procedures/3-mapping-updates.md b/website/docs/guides/migration/tools/migrating-from-stored-procedures/3-mapping-updates.md deleted file mode 100644 index b6f0874fb6b..00000000000 --- a/website/docs/guides/migration/tools/migrating-from-stored-procedures/3-mapping-updates.md +++ /dev/null @@ -1,55 +0,0 @@ ---- -title: Updates -id: 3-updates ---- - -`UPDATE`s start to increase the complexity of your transformations, but fortunately, they’re pretty darn simple to migrate, as well. The thought process that you go through when translating an `UPDATE` is quite similar to how an `INSERT` works, but the logic for the `SELECT` list in the dbt model is primarily sourced from the content in the `SET` section of the `UPDATE` statement. Let’s look at a simple example: - -```sql -UPDATE orders - -SET type = 'return' - -WHERE total < 0 -``` - -The way to look at this is similar to an `INSERT`-`SELECT` statement. The table being updated is the model you want to modify, and since this is an `UPDATE`, that model has likely already been created, and you can either: - -- add to it with subsequent transformations -- create an intermediate model that builds off of the original model – perhaps naming it something like `int_[entity]_[verb].sql`. - -The `SELECT` list should contain all of the columns for the table, but for the specific columns being updated by the DML, you’ll use the computation on the right side of the equals sign as the `SELECT`ed value. Then, you can use the target column name on the left of the equals sign as the column alias. - -If I were building an intermediate transformation from the above query would translate to something along the lines of: - -```sql -SELECT - CASE - WHEN total < 0 THEN 'return' - ELSE type - END AS type, - - order_id, - order_date - -FROM {{ ref('stg_orders') }} -``` - -Since the `UPDATE` statement doesn’t modify every value of the type column, we use a `CASE` statement to apply the contents’ `WHERE` clause. We still want to select all of the columns that should end up in the target table. If we left one of the columns out, it wouldn’t be passed through to the target table at all due to dbt’s declarative approach. - -Sometimes, you may not be sure what all the columns are in a table, or in the situation as above, you’re only modifying a small number of columns relative to the total number of columns in the table. It can be cumbersome to list out every column in the table, but fortunately dbt contains some useful utility macros that can help list out the full column list of a table. - -Another way I could have written the model a bit more dynamically might be: - -```sql -SELECT - {{ dbt_utils.star(from=ref('stg_orders'), except=['type']) }}, - CASE - WHEN total < 0 THEN 'return' - ELSE type - END AS type, - -FROM {{ ref('stg_orders') }} -``` - -The `dbt_utils.star()` macro will print out the full list of columns in the table, but skip the ones I’ve listed in the except list, which allows me to perform the same logic while writing fewer lines of code. This is a simple example of using dbt macros to simplify and shorten your code, and dbt can get a lot more sophisticated as you learn more techniques. Read more about the [dbt_utils package](https://hub.getdbt.com/dbt-labs/dbt_utils/latest/) and the [star macro](https://github.com/dbt-labs/dbt-utils/tree/0.8.6/#star-source). 
diff --git a/website/docs/guides/migration/tools/migrating-from-stored-procedures/4-mapping-deletes.md b/website/docs/guides/migration/tools/migrating-from-stored-procedures/4-mapping-deletes.md deleted file mode 100644 index 1a8c6435d42..00000000000 --- a/website/docs/guides/migration/tools/migrating-from-stored-procedures/4-mapping-deletes.md +++ /dev/null @@ -1,45 +0,0 @@ ---- -title: Deletes -id: 4-deletes ---- - -One of the biggest differences between a procedural transformation and how dbt models data is that dbt, in general, will never destroy data. While there are ways to execute hard `DELETE`s in dbt that are outside of the scope of this article, the general best practice for handling deleted data is to just use soft deletes, and filter out soft-deleted data in a final transformation. - -Let’s consider a simple example query: - -```sql -DELETE FROM stg_orders WHERE order_status IS NULL -``` - -In a dbt model, you’ll need to first identify the records that should be deleted and then filter them out. There are really two primary ways you might translate this query: - -```sql -SELECT * FROM {{ ref('stg_orders') }} WHERE order_status IS NOT NULL -``` - -This first approach just inverts the logic of the DELETE to describe the set of records that should remain, instead of the set of records that should be removed. This ties back to the way dbt declaratively describes datasets. You reference the data that should be in a dataset, and the table or view gets created with that set of data. - -Another way you could achieve this is by marking the deleted records, and then filtering them out. For example: - -```sql -WITH - -soft_deletes AS ( - - SELECT - *, - CASE - WHEN order_status IS NULL THEN true - ELSE false - END AS to_delete - - FROM {{ ref('stg_orders') }} - -) - -SELECT * FROM soft_deletes WHERE to_delete = false -``` - -This approach flags all of the deleted records, and the final `SELECT` filters out any deleted data, so the resulting table contains only the remaining records. It’s a lot more verbose than just inverting the `DELETE` logic, but for complex `DELETE` logic, this ends up being a very effective way of performing the `DELETE` that retains historical context. - -It’s worth calling out that while this doesn’t enable a hard delete, hard deletes can be executed a number of ways, the most common being to execute a dbt [macros](/docs/build/jinja-macros) via as a [run-operation](https://docs.getdbt.com/reference/commands/run-operation), or by using a [post-hook](https://docs.getdbt.com/reference/resource-configs/pre-hook-post-hook/) to perform a `DELETE` statement after the records to-be-deleted have been marked. These are advanced approaches outside the scope of this guide. diff --git a/website/docs/guides/migration/tools/migrating-from-stored-procedures/5-mapping-merges.md b/website/docs/guides/migration/tools/migrating-from-stored-procedures/5-mapping-merges.md deleted file mode 100644 index d059ab9a258..00000000000 --- a/website/docs/guides/migration/tools/migrating-from-stored-procedures/5-mapping-merges.md +++ /dev/null @@ -1,184 +0,0 @@ ---- -title: Merges -id: 5-merges ---- - -dbt has a concept called [materialization](/docs/build/materializations), which determines how a model is physically or logically represented in the warehouse. `INSERT`s, `UPDATE`s, and `DELETE`s will typically be accomplished using table or view materializations. 
For incremental workloads accomplished via commands like `MERGE` or `UPSERT`, dbt has a particular materialization called [incremental](/docs/build/incremental-models). The incremental materialization is specifically used to handle incremental loads and updates to a table without recreating the entire table from scratch on every run. - -## Step 1: Map the MERGE like an INSERT/UPDATE to start - -Before we get into the exact details of how to implement an incremental materialization, let’s talk about logic conversion. Extracting the logic of the `MERGE` and handling it as you would an `INSERT` or an `UPDATE` is the easiest way to get started migrating a `MERGE` command. . - -To see how the logic conversion works, we’ll start with an example `MERGE`. In this scenario, imagine a ride sharing app where rides are loaded into a details table daily, and tips may be updated at some later date, and need to be kept up-to-date: - -```sql -MERGE INTO ride_details USING ( - SELECT - ride_id, - subtotal, - tip - - FROM rides_to_load AS rtl - - ON ride_details.ride_id = rtl.ride_id - - WHEN MATCHED THEN UPDATE - - SET ride_details.tip = rtl.tip - - WHEN NOT MATCHED THEN INSERT (ride_id, subtotal, tip) - VALUES (rtl.ride_id, rtl.subtotal, NVL(rtl.tip, 0, rtl.tip) -); -``` - -The content of the `USING` clause is a useful piece of code because that can easily be placed in a CTE as a starting point for handling the match statement. I find that the easiest way to break this apart is to treat each match statement as a separate CTE that builds on the previous match statements. - -We can ignore the `ON` clause for now, as that will only come into play once we get to a point where we’re ready to turn this into an incremental. - -As with `UPDATE`s and `INSERT`s, you can use the `SELECT` list and aliases to name columns appropriately for the target table, and `UNION` together `INSERT` statements (taking care to use `UNION`, rather than `UNION ALL` to avoid duplicates). - -The `MERGE` would end up translating to something like this: - -```sql -WITH - -using_clause AS ( - - SELECT - ride_id, - subtotal, - tip - - FROM {{ ref('rides_to_load') }} - -), - -updates AS ( - - SELECT - ride_id, - subtotal, - tip - - FROM using_clause - -), - -inserts AS ( - - SELECT - ride_id, - subtotal, - NVL(tip, 0, tip) - - FROM using_clause - -) - -SELECT * - -FROM updates - -UNION inserts -``` - -To be clear, this transformation isn’t complete. The logic here is similar to the `MERGE`, but will not actually do the same thing, since the updates and inserts CTEs are both selecting from the same source query. We’ll need to ensure we grab the separate sets of data as we transition to the incremental materialization. - -One important caveat is that dbt does not natively support `DELETE` as a `MATCH` action. If you have a line in your `MERGE` statement that uses `WHEN MATCHED THEN DELETE`, you’ll want to treat it like an update and add a soft-delete flag, which is then filtered out in a follow-on transformation. - -### Step 2: Convert to incremental materialization - -As mentioned above, incremental materializations are a little special in that when the target table does not exist, the materialization functions in nearly the same way as a standard table materialization, and executes a `CREATE TABLE AS SELECT` statement. If the target table does exist, however, the materialization instead executes a `MERGE` statement. 
- -Since a `MERGE` requires a `JOIN` condition between the `USING` clause and the target table, we need a way to specify how dbt determines whether or not a record triggers a match or not. That particular piece of information is specified in the dbt model configuration. - -We can add the following `config()` block to the top of our model to specify how it should build incrementally: - -```sql -{{ - config( - materialized='incremental', - unique_key='ride_id', - incremental_strategy='merge' - ) -}} -``` - -The three configuration fields in this example are the most important ones. - -- Setting `materialized='incremental'` tells dbt to apply UPSERT logic to the target table. -- The `unique_key` should be a primary key of the target table. This is used to match records with the existing table. -- `incremental_strategy` here is set to MERGE any existing rows in the target table with a value for the `unique_key` which matches the incoming batch of data. There are [various incremental strategies](/docs/build/incremental-models#about-incremental_strategy) for different situations and warehouses. - -The bulk of the work in converting a model to an incremental materialization comes in determining how the logic should change for incremental loads versus full backfills or initial loads. dbt offers a special macro, `is_incremental()`, which evaluates false for initial loads or for backfills (called full refreshes in dbt parlance), but true for incremental loads. - -This macro can be used to augment the model code to adjust how data is loaded for subsequent loads. How that logic should be added will depend a little bit on how data is received. Some common ways might be: - -1. The source table is truncated ahead of incremental loads, and only contains the data to be loaded in that increment. -2. The source table contains all historical data, and there is a load timestamp column that identifies new data to be loaded. - -In the first case, the work is essentially done already. Since the source table always contains only the new data to be loaded, the query doesn’t have to change for incremental loads. The second case, however, requires the use of the `is_incremental()` macro to correctly handle the logic. - -Taking the converted `MERGE` statement that we’d put together previously, we’d augment it to add this additional logic: - -```sql -WITH - -using_clause AS ( - - SELECT - ride_id, - subtotal, - tip, - max(load_timestamp) as load_timestamp - - FROM {{ ref('rides_to_load') }} - - - {% if is_incremental() %} - - WHERE load_timestamp > (SELECT max(load_timestamp) FROM {{ this }}) - - {% endif %} - -), - -updates AS ( - - SELECT - ride_id, - subtotal, - tip, - load_timestamp - - FROM using_clause - - {% if is_incremental() %} - - WHERE ride_id IN (SELECT ride_id FROM {{ this }}) - - {% endif %} - -), - -inserts AS ( - - SELECT - ride_id, - subtotal, - NVL(tip, 0, tip), - load_timestamp - - FROM using_clause - - WHERE ride_id NOT IN (SELECT ride_id FROM updates) - -) - -SELECT * FROM updates UNION inserts -``` - -There are a couple important concepts to understand here: - -1. The code in the `is_incremental()` conditional block only executes for incremental executions of this model code. If the target table doesn’t exist, or if the `--full-refresh` option is used, that code will not execute. -2. `{{ this }}` is a special keyword in dbt that when used in a Jinja block, self-refers to the model for which the code is executing. 
So if you have a model in a file called `my_incremental_model.sql`, `{{ this }}` will refer to `my_incremental_model` (fully qualified with database and schema name if necessary). By using that keyword, we can leverage the current state of the target table to inform the source query. diff --git a/website/docs/guides/migration/tools/migrating-from-stored-procedures/6-migrating-from-stored-procedures-conclusion.md b/website/docs/guides/migration/tools/migrating-from-stored-procedures/6-migrating-from-stored-procedures-conclusion.md deleted file mode 100644 index 6fddf15c163..00000000000 --- a/website/docs/guides/migration/tools/migrating-from-stored-procedures/6-migrating-from-stored-procedures-conclusion.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -title: Putting it all together -id: 6-migrating-from-stored-procedures-conclusion ---- - -The techniques shared above are useful ways to get started converting the individual DML statements that are often found in stored procedures. Using these types of patterns, legacy procedural code can be rapidly transitioned to dbt models that are much more readable, maintainable, and benefit from software engineering best practices like DRY principles. Additionally, once transformations are rewritten as dbt models, it becomes much easier to test the transformations to ensure that the data being used downstream is high-quality and trustworthy. diff --git a/website/docs/guides/orchestration/airflow-and-dbt-cloud/1-airflow-and-dbt-cloud.md b/website/docs/guides/orchestration/airflow-and-dbt-cloud/1-airflow-and-dbt-cloud.md deleted file mode 100644 index d6760771b79..00000000000 --- a/website/docs/guides/orchestration/airflow-and-dbt-cloud/1-airflow-and-dbt-cloud.md +++ /dev/null @@ -1,55 +0,0 @@ ---- -title: Airflow and dbt Cloud -id: 1-airflow-and-dbt-cloud ---- - -In some cases, [Airflow](https://airflow.apache.org/) may be the preferred orchestrator for your organization over working fully within dbt Cloud. There are a few reasons your team might be considering using Airflow to orchestrate your dbt jobs: - -- Your team is already using Airflow to orchestrate other processes -- Your team needs to ensure that a [dbt job](https://docs.getdbt.com/docs/dbt-cloud/cloud-overview#schedule-and-run-dbt-jobs-in-production) kicks off before or after another process outside of dbt Cloud -- Your team needs flexibility to manage more complex scheduling, such as kicking off one dbt job only after another has completed -- Your team wants to own their own orchestration solution -- You need code to work right now without starting from scratch - -## How are people using Airflow + dbt today? - -### Airflow + dbt Core - -There are [so many great examples](https://gitlab.com/gitlab-data/analytics/-/blob/master/dags/transformation/dbt_snowplow_backfill.py) from GitLab through their open source data engineering work. This is especially appropriate if you are well-versed in Kubernetes, CI/CD, and docker task management when building your airflow pipelines. If this is you and your team, you’re in good hands reading through more details [here](https://about.gitlab.com/handbook/business-technology/data-team/platform/infrastructure/#airflow) and [here](https://about.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/). - -### Airflow + dbt Cloud API w/Custom Scripts - -This has served as a bridge until the fabled Astronomer + dbt Labs-built dbt Cloud provider became generally available [here](https://registry.astronomer.io/providers/dbt%20Cloud/versions/latest). 
- -There are many different permutations of this over time: - -- [Custom Python Scripts](https://github.com/sungchun12/airflow-dbt-cloud/blob/main/archive/dbt_cloud_example.py): This is an airflow DAG based on [custom python API utilities](https://github.com/sungchun12/airflow-dbt-cloud/blob/main/archive/dbt_cloud_utils.py) -- [Make API requests directly through the BashOperator based on the docs](https://docs.getdbt.com/dbt-cloud/api-v2-legacy#operation/triggerRun): You can make cURL requests to invoke dbt Cloud to do what you want -- For more options, check out the [official dbt Docs](/docs/deploy/deployments#airflow) on the various ways teams are running dbt in airflow - -## This guide's process - -These solutions are great, but can be difficult to trust as your team grows and management for things like: testing, job definitions, secrets, and pipelines increase past your team’s capacity. Roles become blurry (or were never clearly defined at the start!). Both data and analytics engineers start digging through custom logging within each other’s workflows to make heads or tails of where and what the issue really is. Not to mention that when the issue is found, it can be even harder to decide on the best path forward for safely implementing fixes. This complex workflow and unclear delineation on process management results in a lot of misunderstandings and wasted time just trying to get the process to work smoothly! - -### A better way - -After today’s walkthrough, you’ll get hands-on experience: - -1. Creating a working local Airflow environment -2. Invoking a dbt Cloud job with Airflow (with proof!) -3. Reusing tested and trusted Airflow code for your specific use cases - -While you’re learning the ropes, you’ll also gain a better understanding of how this helps to: - -- Reduce the cognitive load when building and maintaining pipelines -- Avoid dependency hell (think: `pip install` conflicts) -- Implement better recoveries from failures -- Define clearer workflows so that data and analytics engineers work better, together ♥️ - -### Prerequisites - -- [dbt Cloud Teams or Enterprise account](https://www.getdbt.com/pricing/) (with [admin access](https://docs.getdbt.com/docs/cloud/manage-access/enterprise-permissions)) in order to create a service token. Permissions for service tokens can be found [here](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens#permissions-for-service-account-tokens). -- A [free Docker account](https://hub.docker.com/signup) in order to sign in to Docker Desktop, which will be installed in the initial setup. -- A local digital scratchpad for temporarily copy-pasting API keys and URLs - -🙌 Let’s get started! 🙌 diff --git a/website/docs/guides/orchestration/airflow-and-dbt-cloud/2-setting-up-airflow-and-dbt-cloud.md b/website/docs/guides/orchestration/airflow-and-dbt-cloud/2-setting-up-airflow-and-dbt-cloud.md deleted file mode 100644 index 9c3b8eb7f1b..00000000000 --- a/website/docs/guides/orchestration/airflow-and-dbt-cloud/2-setting-up-airflow-and-dbt-cloud.md +++ /dev/null @@ -1,90 +0,0 @@ ---- -title: Setting up Airflow and dbt Cloud -id: 2-setting-up-airflow-and-dbt-cloud ---- - -## 1. Install the Astro CLI - -Astro is a managed software service that includes key features for teams working with Airflow. In order to use Astro, we’ll install the Astro CLI, which will give us access to useful commands for working with Airflow locally. You can read more about Astro [here](https://docs.astronomer.io/astro/). 
- -In this example, we’re using Homebrew to install Astro CLI. Follow the instructions to install the Astro CLI for your own operating system [here](https://docs.astronomer.io/astro/install-cli). - -```bash -brew install astro -``` - - - -## 2. Install and start Docker Desktop - -Docker allows us to spin up an environment with all the apps and dependencies we need for the example. - -Follow the instructions [here](https://docs.docker.com/desktop/) to install Docker desktop for your own operating system. Once Docker is installed, ensure you have it up and running for the next steps. - - - -## 3. Clone the airflow-dbt-cloud repository - -Open your terminal and clone the [airflow-dbt-cloud repository](https://github.com/sungchun12/airflow-dbt-cloud.git). This contains example Airflow DAGs that you’ll use to orchestrate your dbt Cloud job. Once cloned, navigate into the `airflow-dbt-cloud` project. - -```bash -git clone https://github.com/sungchun12/airflow-dbt-cloud.git -cd airflow-dbt-cloud -``` - - - -## 4. Start the Docker container - -You can initialize an Astronomer project in an empty local directory using a Docker container, and then run your project locally using the `start` command. - -1. Run the following commands to initialize your project and start your local Airflow deployment: - - ```bash - astro dev init - astro dev start - ``` - - When this finishes, you should see a message similar to the following: - - ```bash - Airflow is starting up! This might take a few minutes… - - Project is running! All components are now available. - - Airflow Webserver: http://localhost:8080 - Postgres Database: localhost:5432/postgres - The default Airflow UI credentials are: admin:admin - The default Postrgres DB credentials are: postgres:postgres - ``` - -2. Open the Airflow interface. Launch your web browser and navigate to the address for the **Airflow Webserver** from your output in Step 1. - - This will take you to your local instance of Airflow. You’ll need to log in with the **default credentials**: - - - Username: admin - - Password: admin - - ![Airflow login screen](/img/guides/orchestration/airflow-and-dbt-cloud/airflow-login.png) - - - -## 5. Create a dbt Cloud service token - -Create a service token from within dbt Cloud using the instructions [found here](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens). Ensure that you save a copy of the token, as you won’t be able to access this later. In this example we use `Account Admin`, but you can also use `Job Admin` instead for token permissions. - - - -## 6. Create a dbt Cloud job - -In your dbt Cloud account create a job, paying special attention to the information in the bullets below. Additional information for creating a dbt Cloud job can be found [here](/quickstarts/bigquery). - -- Configure the job with the commands that you want to include when this job kicks off, as Airflow will be referring to the job’s configurations for this rather than being explicitly coded in the Airflow DAG. This job will run a set of commands rather than a single command. -- Ensure that the schedule is turned **off** since we’ll be using Airflow to kick things off. -- Once you hit `save` on the job, make sure you copy the URL and save it for referencing later. 
The url will look similar to this: - -```html -https://cloud.getdbt.com/#/accounts/{account_id}/projects/{project_id}/jobs/{job_id}/ -``` - - diff --git a/website/docs/guides/orchestration/airflow-and-dbt-cloud/3-running-airflow-and-dbt-cloud.md b/website/docs/guides/orchestration/airflow-and-dbt-cloud/3-running-airflow-and-dbt-cloud.md deleted file mode 100644 index d6fd32bdba9..00000000000 --- a/website/docs/guides/orchestration/airflow-and-dbt-cloud/3-running-airflow-and-dbt-cloud.md +++ /dev/null @@ -1,104 +0,0 @@ ---- -title: Running Airflow and dbt Cloud -id: 3-running-airflow-and-dbt-cloud ---- - - - -Now you have all the working pieces to get up and running with Airflow + dbt Cloud. Let’s dive into make this all work together. We will **set up a connection** and **run a DAG in Airflow** that kicks off a dbt Cloud job. - -## 1. Add your dbt Cloud API token as a secure connection - -1. Navigate to Admin and click on **Connections** - - ![Airflow connections menu](/img/guides/orchestration/airflow-and-dbt-cloud/airflow-connections-menu.png) - -2. Click on the `+` sign to add a new connection, then click on the drop down to search for the dbt Cloud Connection Type - - ![Create connection](/img/guides/orchestration/airflow-and-dbt-cloud/create-connection.png) - - ![Connection type](/img/guides/orchestration/airflow-and-dbt-cloud/connection-type.png) - -3. Add in your connection details and your default dbt Cloud account id. This is found in your dbt Cloud URL after the accounts route section (`/accounts/{YOUR_ACCOUNT_ID}`), for example the account with id 16173 would see this in their URL: `https://cloud.getdbt.com/#/accounts/16173/projects/36467/jobs/65767/` - -![https://lh3.googleusercontent.com/sRxe5xbv_LYhIKblc7eiY7AmByr1OibOac2_fIe54rpU3TBGwjMpdi_j0EPEFzM1_gNQXry7Jsm8aVw9wQBSNs1I6Cyzpvijaj0VGwSnmVf3OEV8Hv5EPOQHrwQgK2RhNBdyBxN2](https://lh3.googleusercontent.com/sRxe5xbv_LYhIKblc7eiY7AmByr1OibOac2_fIe54rpU3TBGwjMpdi_j0EPEFzM1_gNQXry7Jsm8aVw9wQBSNs1I6Cyzpvijaj0VGwSnmVf3OEV8Hv5EPOQHrwQgK2RhNBdyBxN2) - -## 2. Add your `job_id` and `account_id` config details to the python file: [dbt_cloud_provider_eltml.py](https://github.com/sungchun12/airflow-dbt-cloud/blob/main/dags/dbt_cloud_provider_eltml.py) - -1. You’ll find these details within the dbt Cloud job URL, see the comments in the code snippet below for an example. - - ```python - # dbt Cloud Job URL: https://cloud.getdbt.com/#/accounts/16173/projects/36467/jobs/65767/ - # account_id: 16173 - #job_id: 65767 - - # line 28 - default_args={"dbt_cloud_conn_id": "dbt_cloud", "account_id": 16173}, - - trigger_dbt_cloud_job_run = DbtCloudRunJobOperator( - task_id="trigger_dbt_cloud_job_run", - job_id=65767, # line 39 - check_interval=10, - timeout=300, - ) - ``` - -2. Turn on the DAG and verify the job succeeded after running. Note: screenshots taken from different job runs, but the user experience is consistent. 
- - ![https://lh6.googleusercontent.com/p8AqQRy0UGVLjDGPmcuGYmQ_BRodyL0Zis-eQgSmp69EHbKW51o4S-bCl1fXHlOmwpYEBxD0A-O1Q1hwt-VDVMO1wWH-AIeaoelBx06JXRJ0m1OcHaPpFKH0xDiduIhNlQhhbLiy](https://lh6.googleusercontent.com/p8AqQRy0UGVLjDGPmcuGYmQ_BRodyL0Zis-eQgSmp69EHbKW51o4S-bCl1fXHlOmwpYEBxD0A-O1Q1hwt-VDVMO1wWH-AIeaoelBx06JXRJ0m1OcHaPpFKH0xDiduIhNlQhhbLiy) - - ![Airflow DAG](/img/guides/orchestration/airflow-and-dbt-cloud/airflow-dag.png) - - ![Task run instance](/img/guides/orchestration/airflow-and-dbt-cloud/task-run-instance.png) - - ![https://lh6.googleusercontent.com/S9QdGhLAdioZ3x634CChugsJRiSVtTTd5CTXbRL8ADA6nSbAlNn4zV0jb3aC946c8SGi9FRTfyTFXqjcM-EBrJNK5hQ0HHAsR5Fj7NbdGoUfBI7xFmgeoPqnoYpjyZzRZlXkjtxS](https://lh6.googleusercontent.com/S9QdGhLAdioZ3x634CChugsJRiSVtTTd5CTXbRL8ADA6nSbAlNn4zV0jb3aC946c8SGi9FRTfyTFXqjcM-EBrJNK5hQ0HHAsR5Fj7NbdGoUfBI7xFmgeoPqnoYpjyZzRZlXkjtxS) - -## How do I rerun the dbt Cloud job and downstream tasks in my pipeline? - -If you have worked with dbt Cloud before, you have likely encountered cases where a job fails. In those cases, you have likely logged into dbt Cloud, investigated the error, and then manually restarted the job. - -This section of the guide will show you how to restart the job directly from Airflow. This will specifically run *just* the `trigger_dbt_cloud_job_run` and downstream tasks of the Airflow DAG and not the entire DAG. If only the transformation step fails, you don’t need to re-run the extract and load processes. Let’s jump into how to do that in Airflow. - -1. Click on the task - - ![Task DAG view](/img/guides/orchestration/airflow-and-dbt-cloud/task-dag-view.png) - -2. Clear the task instance - - ![Clear task instance](/img/guides/orchestration/airflow-and-dbt-cloud/clear-task-instance.png) - - ![Approve clearing](/img/guides/orchestration/airflow-and-dbt-cloud/approve-clearing.png) - -3. Watch it rerun in real time - - ![Re-run](/img/guides/orchestration/airflow-and-dbt-cloud/re-run.png) - -## Cleaning up - -At the end of this guide, make sure you shut down your docker container. When you’re done using Airflow, use the following command to stop the container: - -```bash -$ astrocloud dev stop - -[+] Running 3/3 - ⠿ Container airflow-dbt-cloud_e3fe3c-webserver-1 Stopped 7.5s - ⠿ Container airflow-dbt-cloud_e3fe3c-scheduler-1 Stopped 3.3s - ⠿ Container airflow-dbt-cloud_e3fe3c-postgres-1 Stopped 0.3s -``` - -To verify that the deployment has stopped, use the following command: - -```bash -astrocloud dev ps -``` - -This should give you an output like this: - -```bash -Name State Ports -airflow-dbt-cloud_e3fe3c-webserver-1 exited -airflow-dbt-cloud_e3fe3c-scheduler-1 exited -airflow-dbt-cloud_e3fe3c-postgres-1 exited -``` - - diff --git a/website/docs/guides/orchestration/airflow-and-dbt-cloud/4-airflow-and-dbt-cloud-faqs.md b/website/docs/guides/orchestration/airflow-and-dbt-cloud/4-airflow-and-dbt-cloud-faqs.md deleted file mode 100644 index 5766d8c0b79..00000000000 --- a/website/docs/guides/orchestration/airflow-and-dbt-cloud/4-airflow-and-dbt-cloud-faqs.md +++ /dev/null @@ -1,50 +0,0 @@ ---- -title: Airflow and dbt Cloud FAQs -id: 4-airflow-and-dbt-cloud-faqs ---- -## 1. How can we run specific subsections of the dbt DAG in Airflow? - -Because of the way we configured the dbt Cloud job to run in Airflow, you can leave this job to your analytics engineers to define in the job configurations from dbt Cloud. 
If, for example, we need to run hourly-tagged models every hour and daily-tagged models daily, we can create jobs like `Hourly Run` or `Daily Run` and utilize the commands `dbt run -s tag:hourly` and `dbt run -s tag:daily` within each, respectively. We only need to grab our dbt Cloud `account` and `job id`, configure it in an Airflow DAG with the code provided, and then you can be on your way. See more node selection options [here](/reference/node-selection/syntax). - -## 2. How can I re-run models from the point of failure? - -You may want to parse the dbt DAG in Airflow to get the benefit of re-running from the point of failure. However, when you have hundreds of models in your DAG expanded out, it becomes useless for diagnosis and rerunning due to the overhead that comes along with creating an expansive Airflow DAG. - -You can’t re-run from failure natively in dbt Cloud today (feature coming!), but you can use a custom rerun parser. - -Using a simple Python script coupled with the dbt Cloud provider, you can: - -- Avoid managing artifacts in a separate storage bucket (dbt Cloud does this for you) -- Avoid building your own parsing logic -- Get clear logs on what models you're rerunning in dbt Cloud (without hard coding step override commands) - -Watch the video below to see how it works! - - - -## 3. Should Airflow run one big dbt job or many dbt jobs? - -Overall, we recommend being as purposeful and minimalistic as you can. This is because dbt manages all of the dependencies between models and the orchestration of running those dependencies in order, which in turn has benefits in terms of warehouse processing efforts. - -## 4. We want to kick off our dbt jobs after our ingestion tool (such as Fivetran) / data pipelines are done loading data. Any best practices around that? - -Our friends at Astronomer answer this question with an example [here](https://registry.astronomer.io/dags/fivetran-dbt-cloud-census). - -## 5. How do you set up a CI/CD workflow with Airflow? - -Check out these two resources for accomplishing your own CI/CD pipeline: - -- [Continuous Integration with dbt Cloud](/docs/deploy/continuous-integration) -- [Astronomer's CI/CD Example](https://docs.astronomer.io/software/ci-cd/#example-cicd-workflow) - -## 6. Can dbt dynamically create tasks in the DAG like Airflow can? - -We prefer to keep models bundled vs. unbundled. You can go this route, but if you have hundreds of dbt models, it’s more effective to let the dbt Cloud job handle the models and dependencies. Bundling provides clear observability when things go wrong - we've seen more success in being able to clearly see issues in a bundled dbt Cloud job than in combing through the nodes of an expansive Airflow DAG. If you still have a use case for this level of control though, our friends at Astronomer answer this question [here](https://www.astronomer.io/blog/airflow-dbt-1/)! - -## 7. Can you trigger notifications if a dbt job fails with Airflow? Is there any way to access the status of the dbt job to do that? - -Yes, either through [Airflow's email/Slack](https://www.astronomer.io/guides/error-notifications-in-airflow/) functionality by itself or combined with [dbt Cloud's notifications](/docs/deploy/job-notifications), which support email and Slack notifications. - -## 8. Are there decision criteria for how to best work with dbt Cloud and Airflow? - -Check out this deep dive into planning your dbt Cloud + Airflow implementation [here](https://www.youtube.com/watch?v=n7IIThR8hGk)!
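Several of the answers above lean on the dbt Cloud Administrative API, so a small reference script can be useful outside of Airflow as well (for example, to drive the custom notifications mentioned in question 7). The following is a minimal sketch, not part of the original guide: it assumes the `requests` library is installed, reuses the example account and job IDs shown earlier in this guide, and reads a service token from a `DBT_CLOUD_API_TOKEN` environment variable — adjust all of these to your own setup.

```python
import os
import time

import requests

# These mirror the example IDs used earlier in this guide -- replace with your own.
DBT_CLOUD_HOST = "https://cloud.getdbt.com"
ACCOUNT_ID = 16173
JOB_ID = 65767
HEADERS = {"Authorization": f"Token {os.environ['DBT_CLOUD_API_TOKEN']}"}


def trigger_job(cause: str = "Triggered via the dbt Cloud API") -> int:
    """Kick off a run of the job and return the new run's id."""
    url = f"{DBT_CLOUD_HOST}/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/"
    response = requests.post(url, headers=HEADERS, json={"cause": cause})
    response.raise_for_status()
    return response.json()["data"]["id"]


def wait_for_run(run_id: int, poll_seconds: int = 10) -> str:
    """Poll the run until it completes, then return its human-readable status."""
    url = f"{DBT_CLOUD_HOST}/api/v2/accounts/{ACCOUNT_ID}/runs/{run_id}/"
    while True:
        run = requests.get(url, headers=HEADERS).json()["data"]
        if run["is_complete"]:
            return run["status_humanized"]
        time.sleep(poll_seconds)


if __name__ == "__main__":
    run_id = trigger_job()
    print(f"Run {run_id} finished with status: {wait_for_run(run_id)}")
```

This is roughly the trigger-then-poll loop that the `DbtCloudRunJobOperator` used above wraps for you, which is why the Airflow setup only needs an account ID, a job ID, and a connection.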
diff --git a/website/docs/guides/orchestration/custom-cicd-pipelines/1-cicd-background.md b/website/docs/guides/orchestration/custom-cicd-pipelines/1-cicd-background.md deleted file mode 100644 index a66259c6c49..00000000000 --- a/website/docs/guides/orchestration/custom-cicd-pipelines/1-cicd-background.md +++ /dev/null @@ -1,43 +0,0 @@ ---- -title: Customizing CI/CD with Custom Pipelines -id: 1-cicd-background ---- - -One of the core tenets of dbt is that analytic code should be version controlled. This provides a ton of benefit to your organization in terms of collaboration, code consistency, stability, and the ability to roll back to a prior version. There’s an additional benefit that is provided with your code hosting platform that is often overlooked or underutilized. Some of you may have experience using dbt Cloud’s [webhook functionality](https://docs.getdbt.com/docs/dbt-cloud/using-dbt-cloud/cloud-enabling-continuous-integration) to run a job when a PR is created. This is a fantastic capability, and meets most use cases for testing your code before merging to production. However, there are circumstances when an organization needs additional functionality, like running workflows on every commit (linting), or running workflows after a merge is complete. In this article, we will show you how to setup custom pipelines to lint your project and trigger a dbt Cloud job via the API. - -A note on parlance in this article since each code hosting platform uses different terms for similar concepts. The terms `pull request` (PR) and `merge request` (MR) are used interchangeably to mean the process of merging one branch into another branch. - - -## What are pipelines? - -Pipelines (which are known by many names, such as workflows, actions, or build steps) are a series of pre-defined jobs that are triggered by specific events in your repository (PR created, commit pushed, branch merged, etc). Those jobs can do pretty much anything your heart desires assuming you have the proper security access and coding chops. - -Jobs are executed on [runners](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#runners), which are virtual servers. The runners come pre-configured with Ubuntu Linux, macOS, or Windows. That means the commands you execute are determined by the operating system of your runner. You’ll see how this comes into play later in the setup, but for now just remember that your code is executed on virtual servers that are, typically, hosted by the code hosting platform. - -![Diagram of how pipelines work](/img/guides/orchestration/custom-cicd-pipelines/pipeline-diagram.png) - -Please note, runners hosted by your code hosting platform provide a certain amount of free time. After that, billing charges may apply depending on how your account is setup. You also have the ability to host your own runners. 
That is beyond the scope of this article, but checkout the links below for more information if you’re interested in setting that up: - -- Repo-hosted runner billing information: - - [GitHub](https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions) - - [GitLab](https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html) - - [Bitbucket](https://bitbucket.org/product/features/pipelines#) -- Self-hosted runner information: - - [GitHub](https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners) - - [GitLab](https://docs.gitlab.com/runner/) - - [Bitbucket](https://support.atlassian.com/bitbucket-cloud/docs/runners/) - -Additionally, if you’re using the free tier of GitLab you can still follow this guide, but it may ask you to provide a credit card to verify your account. You’ll see something like this the first time you try to run a pipeline: - -![Warning from GitLab showing payment information is required](/img/guides/orchestration/custom-cicd-pipelines/gitlab-cicd-payment-warning.png) - - -## How to setup pipelines - -This guide provides details for multiple code hosting platforms. Where steps are unique, they are presented without a selection option. If code is specific to a platform (i.e. GitHub, GitLab, Bitbucket) you will see a selection option for each. - -Pipelines can be triggered by various events. The [dbt Cloud webhook](https://docs.getdbt.com/docs/dbt-cloud/using-dbt-cloud/cloud-enabling-continuous-integration) process already triggers a run if you want to run your jobs on a merge request, so this guide focuses on running pipelines for every push and when PRs are merged. Since pushes happen frequently in a project, we’ll keep this job super simple and fast by linting with SQLFluff. The pipeline that runs on merge requests will run less frequently, and can be used to call the dbt Cloud API to trigger a specific job. This can be helpful if you have specific requirements that need to happen when code is updated in production, like running a `--full-refresh` on all impacted incremental models. - -Here’s a quick look at what this pipeline will accomplish: - -![Diagram showing the pipelines to be created and the programs involved](/img/guides/orchestration/custom-cicd-pipelines/pipeline-programs-diagram.png) diff --git a/website/docs/guides/orchestration/custom-cicd-pipelines/4-dbt-cloud-job-on-pr.md b/website/docs/guides/orchestration/custom-cicd-pipelines/4-dbt-cloud-job-on-pr.md deleted file mode 100644 index 1a75fdc17ac..00000000000 --- a/website/docs/guides/orchestration/custom-cicd-pipelines/4-dbt-cloud-job-on-pr.md +++ /dev/null @@ -1,131 +0,0 @@ ---- -title: Run a dbt Cloud job on pull request -id: 4-dbt-cloud-job-on-pr ---- - -:::info Run on PR - -If your git provider has a native integration with dbt Cloud, you can take advantage of the setup instructions [here](/docs/deploy/ci-jobs). -This section is only for those projects that connect to their git repository using an SSH key. - -::: - -If your git provider is not one with a native integration with dbt Cloud, but you still want to take advantage of CI builds, you've come to the right spot! With just a bit of work it's possible to setup a job that will run a dbt Cloud job when a pull request (PR) is created. - -The setup for this pipeline will use the same steps as the prior page. Before moving on, **follow steps 1-5 from the [prior page](https://docs.getdbt.com/guides/orchestration/custom-cicd-pipelines/3-dbt-cloud-job-on-merge)** - -### 6. 
Create a pipeline job that runs when PRs are created - - - -For this job, we'll set it up using the `bitbucket-pipelines.yml` file as in the prior step. The YAML file will look pretty similar to our earlier job, but we’ll pass in the required variables to the Python script using `export` statements. Update this section to match your setup based on the comments in the file. - -**What is this pipeline going to do?** -The setup below will trigger a dbt Cloud job to run every time a PR is opened in this repository. It will also run a fresh version of the pipeline for every commit that is made on the PR until it is merged. -For example: If you open a PR, it will run the pipeline. If you then decide additional changes are needed, and commit/push to the PR branch, a new pipeline will run with the updated code. - -The following varibles control this job: - - `DBT_JOB_BRANCH`: Tells the dbt Cloud job to run the code in the branch that created this PR - - `DBT_JOB_SCHEMA_OVERRIDE`: Tells the dbt Cloud job to run this into a custom target schema - - The format of this will look like: `DBT_CLOUD_PR_{REPO_KEY}_{PR_NUMBER}` - - -```yaml -image: python:3.11.1 - - -pipelines: - # This job will run when pull requests are created in the repository - pull-requests: - '**': - - step: - name: 'Run dbt Cloud PR Job' - script: - # Check to only build if PR destination is master (or other branch). - # Comment or remove line below if you want to run on all PR's regardless of destination branch. - - if [ "${BITBUCKET_PR_DESTINATION_BRANCH}" != "main" ]; then printf 'PR Destination is not master, exiting.'; exit; fi - - export DBT_URL="https://cloud.getdbt.com" - - export DBT_JOB_CAUSE="Bitbucket Pipeline CI Job" - - export DBT_JOB_BRANCH=$BITBUCKET_BRANCH - - export DBT_JOB_SCHEMA_OVERRIDE="DBT_CLOUD_PR_"$BITBUCKET_PROJECT_KEY"_"$BITBUCKET_PR_ID - - export DBT_ACCOUNT_ID=00000 # enter your account id here - - export DBT_PROJECT_ID=00000 # enter your project id here - - export DBT_PR_JOB_ID=00000 # enter your job id here - - python python/run_and_monitor_dbt_job.py -``` - - - - -### 7. Confirm the pipeline runs - -Now that you have a new pipeline, it's time to run it and make sure it works. Since this only triggers when a PR is created, you'll need to create a new PR on a branch that contains the code above. Once you do that, you should see a pipeline that looks like this: - - - - -Bitbucket pipeline: -![dbt run on PR job in Bitbucket](/img/guides/orchestration/custom-cicd-pipelines/bitbucket-run-on-pr.png) - -dbt Cloud job: -![dbt Cloud job showing it was triggered by Bitbucket](/img/guides/orchestration/custom-cicd-pipelines/bitbucket-dbt-cloud-pr.png) - - - - -### 8. Handle those extra schemas in your database - -As noted above, when the PR job runs it will create a new schema based on the PR. To avoid having your database overwhelmed with PR schemas, consider adding a "cleanup" job to your dbt Cloud account. This job can run on a scheduled basis to cleanup any PR schemas that haven't been updated/used recently. - -Add this as a macro to your project. 
It takes two arguments that let you control which schemas get dropped: - - `age_in_days`: The number of days since the schema was last altered before it should be dropped (default 10 days) - - `database_to_clean`: The name of the database to remove schemas from - -```sql -{# - This macro finds PR schemas older than a set date and drops them - The macro defaults to 10 days old, but can be configured with the input argument age_in_days - Sample usage with different date: - dbt run-operation pr_schema_cleanup --args "{'database_to_clean': 'analytics','age_in_days':'15'}" -#} -{% macro pr_schema_cleanup(database_to_clean, age_in_days=10) %} - - {% set find_old_schemas %} - select - 'drop schema {{ database_to_clean }}.'||schema_name||';' - from {{ database_to_clean }}.information_schema.schemata - where - catalog_name = '{{ database_to_clean | upper }}' - and schema_name ilike 'DBT_CLOUD_PR%' - and last_altered <= (current_date() - interval '{{ age_in_days }} days') - {% endset %} - - {% if execute %} - - {{ log('Schema drop statements:' ,True) }} - - {% set schema_drop_list = run_query(find_old_schemas).columns[0].values() %} - - {% for schema_to_drop in schema_drop_list %} - {% do run_query(schema_to_drop) %} - {{ log(schema_to_drop ,True) }} - {% endfor %} - - {% endif %} - -{% endmacro %} -``` - -This macro goes into a dbt Cloud job that is run on a schedule. The command will look like this (text below for copy/paste): -![dbt Cloud job showing the run operation command for the cleanup macro](/img/guides/orchestration/custom-cicd-pipelines/dbt-macro-cleanup-pr.png) -`dbt run-operation pr_schema_cleanup --args "{ 'database_to_clean': 'development','age_in_days':15}"` diff --git a/website/docs/guides/orchestration/custom-cicd-pipelines/5-something-to-consider.md b/website/docs/guides/orchestration/custom-cicd-pipelines/5-something-to-consider.md deleted file mode 100644 index 6b39c5ce405..00000000000 --- a/website/docs/guides/orchestration/custom-cicd-pipelines/5-something-to-consider.md +++ /dev/null @@ -1,8 +0,0 @@ ---- -title: Something to Consider -id: 5-something-to-consider ---- - -Running dbt Cloud jobs through a CI/CD pipeline is a form of job orchestration. If you also run jobs using dbt Cloud’s built-in scheduler, you now have two orchestration tools running jobs. The risk with this is that you could run into conflicts - if you are triggering pipelines on certain actions while also running scheduled jobs in dbt Cloud, you will probably run into job clashes. The more tools you have, the more you have to make sure everything talks to each other. - -That being said, if **the only reason you want to use pipelines is for adding a lint check or run on merge**, you might decide the pros outweigh the cons, and as such you want to go with a hybrid approach. Just keep in mind that if two processes try to run the same job at the same time, dbt Cloud will queue the jobs and run one after the other. It’s a balancing act but can be accomplished with diligence to ensure you’re orchestrating jobs in a manner that does not conflict.
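To make the job-clash risk concrete, here is a minimal sketch (not part of the original guide) of a guard you could run at the top of a pipeline step: it asks the dbt Cloud Administrative API whether the job already has a run in flight before triggering a new one. The `requests` dependency, the token environment variable name, and the five-run lookback are assumptions to adapt to your setup.

```python
import os

import requests

DBT_CLOUD_HOST = "https://cloud.getdbt.com"
# DBT_ACCOUNT_ID and DBT_PR_JOB_ID match the variables exported in the pipeline examples
# above; the token variable name is illustrative.
ACCOUNT_ID = os.environ["DBT_ACCOUNT_ID"]
JOB_ID = os.environ["DBT_PR_JOB_ID"]
HEADERS = {"Authorization": f"Token {os.environ['DBT_API_KEY']}"}


def job_has_run_in_flight() -> bool:
    """Return True if any of the job's most recent runs is still executing."""
    url = f"{DBT_CLOUD_HOST}/api/v2/accounts/{ACCOUNT_ID}/runs/"
    params = {"job_definition_id": JOB_ID, "order_by": "-id", "limit": 5}
    runs = requests.get(url, headers=HEADERS, params=params).json()["data"]
    return any(not run["is_complete"] for run in runs)


if __name__ == "__main__":
    if job_has_run_in_flight():
        print("A run is already in progress in dbt Cloud -- skipping this trigger.")
    else:
        print("No run in flight -- safe to trigger from the pipeline.")
```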
\ No newline at end of file diff --git a/website/docs/guides/orchestration/set-up-ci/1-introduction.md b/website/docs/guides/orchestration/set-up-ci/1-introduction.md deleted file mode 100644 index 97df16b4ce1..00000000000 --- a/website/docs/guides/orchestration/set-up-ci/1-introduction.md +++ /dev/null @@ -1,10 +0,0 @@ ---- -title: "Get started with Continuous Integration tests" -slug: overview ---- - -By validating your code _before_ it goes into production, you don't need to spend your afternoon fielding messages from people whose reports are suddenly broken. - -A solid CI setup is critical to preventing avoidable downtime and broken trust. dbt Cloud uses **sensible defaults** to get you up and running in a performant and cost-effective way in minimal time. - -After that, there's time to get fancy, but let's walk before we run. diff --git a/website/docs/guides/orchestration/set-up-ci/2-quick-setup.md b/website/docs/guides/orchestration/set-up-ci/2-quick-setup.md deleted file mode 100644 index 9b6d46fe2b2..00000000000 --- a/website/docs/guides/orchestration/set-up-ci/2-quick-setup.md +++ /dev/null @@ -1,50 +0,0 @@ ---- -title: "Baseline: Enable CI in 15 minutes" -slug: in-15-minutes -description: Find issues before they are deployed to production with dbt Cloud's Slim CI. ---- - -In this guide, we're going to add a **CI environment**, where proposed changes can be validated in the context of the entire project without impacting production systems. We will use a single set of deployment credentials (like the Prod environment), but models are built in a separate location to avoid impacting others (like the Dev environment). - -Your git flow will look like this: - - -## Prerequisites - -As part of your initial dbt Cloud setup, you should already have Development and Production environments configured. Let's recap what each does: - -- Your **Development environment** powers the IDE. Each user has individual credentials, and builds into an individual dev schema. Nothing you do here impacts any of your colleagues. -- Your **Production environment** brings the canonical version of your project to life for downstream consumers. There is a single set of deployment credentials, and everything is built into your production schema(s). - -## Step 1: Create a new CI environment - -See [Create a new environment](/docs/dbt-cloud-environments#create-a-deployment-environment). The environment should be called **CI**. Just like your existing Production environment, it will be a Deployment-type environment. - -When setting a Schema in the **Deployment Credentials** area, remember that dbt Cloud will automatically generate a custom schema name for each PR to ensure that they don't interfere with your deployed models. This means you can safely set the same Schema name as your Production job. - -## Step 2: Double-check your Production environment is identified - -Go into your existing Production environment, and ensure that the **Set as Production environment** checkbox is set. It'll make things easier later. - -## Step 3: Create a new job in the CI environment - -Use the **Continuous Integration Job** template, and call the job **CI Check**. - -In the Execution Settings, your command will be preset to `dbt build --select state:modified+`. Let's break this down: - -- [`dbt build`](/reference/commands/build) runs all nodes (seeds, models, snapshots, tests) at once in DAG order. If something fails, nodes that depend on it will be skipped. 
-- The [`state:modified+` selector](/reference/node-selection/methods#the-state-method) means that only modified nodes and their children will be run ("Slim CI"). In addition to [not wasting time](https://discourse.getdbt.com/t/how-we-sped-up-our-ci-runs-by-10x-using-slim-ci/2603) building and testing nodes that weren't changed in the first place, this significantly reduces compute costs. - -To be able to find modified nodes, dbt needs to have something to compare against. dbt Cloud uses the last successful run of any job in your Production environment as its [comparison state](/reference/node-selection/syntax#about-node-selection). As long as you identified your Production environment in Step 2, you won't need to touch this. If you didn't, pick the right environment from the dropdown. - -## Step 4: Test your process - -That's it! There are other steps you can take to be even more confident in your work, such as [validating that your structure follows best practices](/guides/orchestration/set-up-ci/run-dbt-project-evaluator) and [linting your code](/guides/orchestration/set-up-ci/lint-on-push), but this covers the most critical checks. - -To test your new flow, create a new branch in the dbt Cloud IDE, then add a new file or modify an existing one. Commit it, then create a new Pull Request (not a draft). Within a few seconds, you'll see a new check appear in your git provider. - -## Things to keep in mind - -- If you make a new commit while a CI run based on older code is in progress, it will be automatically canceled and replaced with the fresh code. -- An unlimited number of CI jobs can run at once. If 10 developers all commit code to different PRs at the same time, each person will get their own schema containing their changes. Once each PR is merged, dbt Cloud will drop that schema. -- CI jobs will never block a production run. diff --git a/website/docs/guides/orchestration/set-up-ci/3-run-dbt-project-evaluator.md b/website/docs/guides/orchestration/set-up-ci/3-run-dbt-project-evaluator.md deleted file mode 100644 index 646a9cb42b7..00000000000 --- a/website/docs/guides/orchestration/set-up-ci/3-run-dbt-project-evaluator.md +++ /dev/null @@ -1,46 +0,0 @@ ---- -title: "Enforce best practices with dbt project evaluator" -slug: run-dbt-project-evaluator -description: dbt Project Evaluator can be run from inside of your existing dbt Cloud CI job to identify common flaws in projects. ---- - -dbt Project Evaluator is a package designed to identify deviations from best practices common to many dbt projects, including modeling, testing, documentation, structure and performance problems. For an introduction to the package, read its [launch blog post](/blog/align-with-dbt-project-evaluator). - -## Step 1: Install the package - -As with all packages, add a reference to `dbt-labs/dbt_project_evaluator` to your `packages.yml` file. See the [dbt Package Hub](https://hub.getdbt.com/dbt-labs/dbt_project_evaluator/latest/) for full installation instructions. - -## Step 2: Define test severity with an environment variable - -As noted in the [documentation](https://dbt-labs.github.io/dbt-project-evaluator/latest/ci-check/), tests in the package are set to `warn` severity by default. - -To have these tests fail in CI, create a new environment variable called `DBT_PROJECT_EVALUATOR_SEVERITY`. Set the project-wide default to `warn`, and set it to `error` in the CI environment.
- -In your `dbt_project.yml` file, override the severity configuration: - -```yaml -tests: -dbt_project_evaluator: - +severity: "{{ env_var('DBT_PROJECT_EVALUATOR_SEVERITY', 'warn') }}" -``` - -## Step 3: Update your CI commands - -Because these tests should only run after the rest of your project has been built, your existing CI command will need to be updated to exclude the dbt_project_evaluator package. You will then add a second step which builds _only_ the package's models and tests. - -Update your steps to: - -```bash -dbt build --select state:modified+ --exclude package:dbt_project_evaluator -dbt build --select package:dbt_project_evaluator -``` - -## Step 4: Apply any customizations - -Depending on the state of your project when you roll out the evaluator, you may need to skip some tests or allow exceptions for some areas. To do this, refer to the documentation on: - -- [disabling tests](https://dbt-labs.github.io/dbt-project-evaluator/latest/customization/customization/) -- [excluding groups of models from a specific test](https://dbt-labs.github.io/dbt-project-evaluator/latest/customization/exceptions/) -- [excluding packages or sources/models based on path](https://dbt-labs.github.io/dbt-project-evaluator/latest/customization/excluding-packages-and-paths/) - -If you create a seed to exclude groups of models from a specific test, remember to disable the default seed and include `dbt_project_evaluator_exceptions` in your second `dbt build` command above. diff --git a/website/docs/guides/orchestration/set-up-ci/4-lint-on-push.md b/website/docs/guides/orchestration/set-up-ci/4-lint-on-push.md deleted file mode 100644 index 1932ffe1019..00000000000 --- a/website/docs/guides/orchestration/set-up-ci/4-lint-on-push.md +++ /dev/null @@ -1,190 +0,0 @@ ---- -title: "Run linting checks with SQLFluff" -slug: lint-on-push -description: Enforce your organization's SQL style guide with by running SQLFluff in your git workflow whenever new code is pushed. ---- - -By [linting](/docs/cloud/dbt-cloud-ide/lint-format#lint) your project during CI, you can ensure that code styling standards are consistently enforced, without spending human time nitpicking comma placement. - -The steps below create an action/pipeline which uses [SQLFluff](https://docs.sqlfluff.com/en/stable/) to scan your code and look for linting errors. If you don't already have SQLFluff rules defined, check out [our recommended config file](/guides/best-practices/how-we-style/2-how-we-style-our-sql). - -### 1. Create a YAML file to define your pipeline - -The YAML files defined below are what tell your code hosting platform the steps to run. In this setup, you’re telling the platform to run a SQLFluff lint job every time a commit is pushed. - - - - -GitHub Actions are defined in the `.github/workflows` directory. To define the job for your action, add a new file named `lint_on_push.yml` under the `workflows` folder. Your final folder structure will look like this: - -```sql -my_awesome_project -├── .github -│ ├── workflows -│ │ └── lint_on_push.yml -``` - -**Key pieces:** - -- `on:` defines when the pipeline is run. This workflow will run whenever code is pushed to any branch except `main`. For other trigger options, check out [GitHub’s docs](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows). -- `runs-on: ubuntu-latest` - this defines the operating system we’re using to run the job -- `uses:` - When the Ubuntu server is created, it is completely empty. 
[`checkout`](https://github.com/actions/checkout#checkout-v3) and [`setup-python`](https://github.com/actions/setup-python#setup-python-v3) are public GitHub Actions which enable the server to access the code in your repo, and set up Python correctly. -- `run:` - these steps are run at the command line, as though you typed them at a prompt yourself. This will install sqlfluff and lint the project. Be sure to set the correct `--dialect` for your project. - -For a full breakdown of the properties in a workflow file, see [Understanding the workflow file](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#understanding-the-workflow-file) on GitHub's website. - -```yaml -name: lint dbt project on push - -on: - push: - branches-ignore: - - 'main' - -jobs: - # this job runs SQLFluff with a specific set of rules - # note the dialect is set to Snowflake, so make that specific to your setup - # details on linter rules: https://docs.sqlfluff.com/en/stable/rules.html - lint_project: - name: Run SQLFluff linter - runs-on: ubuntu-latest - - steps: - - uses: "actions/checkout@v3" - - uses: "actions/setup-python@v4" - with: - python-version: "3.9" - - name: Install SQLFluff - run: "pip install sqlfluff" - - name: Lint project - run: "sqlfluff lint models --dialect snowflake" - -``` - - - - -Create a `.gitlab-ci.yml` file in your **root directory** to define the triggers for when to execute the script below. You’ll put the code below into this file. - -```sql -my_awesome_project -├── dbt_project.yml -├── .gitlab-ci.yml -``` - -**Key pieces:** - -- `image: python:3.9` - this defines the virtual image we’re using to run the job -- `rules:` - defines when the pipeline is run. This workflow will run whenever code is pushed to any branch except `main`. For other rules, refer to [GitLab’s documentation](https://docs.gitlab.com/ee/ci/yaml/#rules). -- `script:` - this is how we’re telling the GitLab runner to execute the Python script we defined above. - -```yaml -image: python:3.9 - -stages: - - pre-build - -# this job runs SQLFluff with a specific set of rules -# note the dialect is set to Snowflake, so make that specific to your setup -# details on linter rules: https://docs.sqlfluff.com/en/stable/rules.html -lint-project: - stage: pre-build - rules: - - if: $CI_PIPELINE_SOURCE == "push" && $CI_COMMIT_BRANCH != 'main' - script: - - pip install sqlfluff - - sqlfluff lint models --dialect snowflake -``` - - - - -Create a `bitbucket-pipelines.yml` file in your **root directory** to define the triggers for when to execute the script below. You’ll put the code below into this file. - -```sql -my_awesome_project -├── bitbucket-pipelines.yml -├── dbt_project.yml -``` - -**Key pieces:** - -- `image: python:3.11.1` - this defines the virtual image we’re using to run the job -- `'**':` - this is used to filter when the pipeline runs. In this case we’re telling it to run on every push event, and you can see at line 12 we're creating a dummy pipeline for `main`. More information on filtering when a pipeline is run can be found in [Bitbucket's documentation](https://support.atlassian.com/bitbucket-cloud/docs/pipeline-triggers/) -- `script:` - this is how we’re telling the Bitbucket runner to execute the Python script we defined above. 
- -```yaml -image: python:3.11.1 - - -pipelines: - branches: - '**': # this sets a wildcard to run on every branch - - step: - name: Lint dbt project - script: - - pip install sqlfluff==0.13.1 - - sqlfluff lint models --dialect snowflake --rules L019,L020,L021,L022 - - 'main': # override if your default branch doesn't run on a branch named "main" - - step: - script: - - python --version -``` - - - - -### 2. Commit and push your changes to make sure everything works - -After you finish creating the YAML files, commit and push your code to trigger your pipeline for the first time. If everything goes well, you should see the pipeline in your code platform. When you click into the job you’ll get a log showing that SQLFluff was run. If your code failed linting you’ll get an error in the job with a description of what needs to be fixed. If everything passed the lint check, you’ll see a successful job run. - - - - -In your repository, click the *Actions* tab - -![Image showing the GitHub action for lint on push](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-github.png) - -Sample output from SQLFluff in the `Run SQLFluff linter` job: - -![Image showing the logs in GitHub for the SQLFluff run](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-logs-github.png) - - - - -In the menu option go to *CI/CD > Pipelines* - -![Image showing the GitLab action for lint on push](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-gitlab.png) - -Sample output from SQLFluff in the `Run SQLFluff linter` job: - -![Image showing the logs in GitLab for the SQLFluff run](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-logs-gitlab.png) - - - - -In the left menu pane, click on *Pipelines* - -![Image showing the Bitbucket action for lint on push](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-bitbucket.png) - -Sample output from SQLFluff in the `Run SQLFluff linter` job: - -![Image showing the logs in Bitbucket for the SQLFluff run](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-logs-bitbucket.png) - - - diff --git a/website/docs/guides/orchestration/set-up-ci/5-multiple-checks.md b/website/docs/guides/orchestration/set-up-ci/5-multiple-checks.md deleted file mode 100644 index 4bfe2d936d4..00000000000 --- a/website/docs/guides/orchestration/set-up-ci/5-multiple-checks.md +++ /dev/null @@ -1,62 +0,0 @@ ---- -title: "Advanced: Create a release train with additional environments" -slug: multiple-environments -description: Large and complex enterprises sometimes require additional layers of validation before deployment. Learn how to add these checks with dbt Cloud. ---- - -:::caution Are you sure you need this? -This approach can increase release safety, but creates additional manual steps in the deployment process as well as a greater maintenance burden. - -As such, it may slow down the time it takes to get new features into production. - -The team at Sunrun maintained a SOX-compliant deployment in dbt while reducing the number of environments. Check out [their Coalesce presentation](https://www.youtube.com/watch?v=vmBAO2XN-fM) to learn more. -::: - -In this section, we will add a new **QA** environment. New features will branch off from and be merged back into the associated `qa` branch, and a member of your team (the "Release Manager") will create a PR against `main` to be validated in the CI environment before going live. 
- -The git flow will look like this: - - -## Prerequisites - -- You have the **Development**, **CI**, and **Production** environments, as described in [the Baseline setup](/guides/orchestration/set-up-ci/in-15-minutes). - - -## Step 1: Create a `release` branch in your git repo - -As noted above, this branch will outlive any individual feature, and will be the base of all feature development for a period of time. Your team might choose to create a new branch for each sprint (`qa/sprint-01`, `qa/sprint-02`, etc), tie it to a version of your data product (`qa/1.0`, `qa/1.1`), or just have a single `qa` branch which remains active indefinitely. - -## Step 2: Update your Development environment to use the `qa` branch - -See [Custom branch behavior](/docs/dbt-cloud-environments#custom-branch-behavior). Setting `qa` as your custom branch ensures that the IDE creates new branches and PRs with the correct target, instead of using `main`. - - - -## Step 3: Create a new QA environment - -See [Create a new environment](/docs/dbt-cloud-environments#create-a-deployment-environment). The environment should be called **QA**. Just like your existing Production and CI environments, it will be a Deployment-type environment. - -Set its branch to `qa` as well. - -## Step 4: Create a new job - -Use the **Continuous Integration Job** template, and call the job **QA Check**. - -In the Execution Settings, your command will be preset to `dbt build --select state:modified+`. Let's break this down: - -- [`dbt build`](/reference/commands/build) runs all nodes (seeds, models, snapshots, tests) at once in DAG order. If something fails, nodes that depend on it will be skipped. -- The [`state:modified+` selector](/reference/node-selection/methods#the-state-method) means that only modified nodes and their children will be run ("Slim CI"). In addition to [not wasting time](https://discourse.getdbt.com/t/how-we-sped-up-our-ci-runs-by-10x-using-slim-ci/2603) building and testing nodes that weren't changed in the first place, this significantly reduces compute costs. - -To be able to find modified nodes, dbt needs to have something to compare against. Normally, we use the Production environment as the source of truth, but in this case there will be new code merged into `qa` long before it hits the `main` branch and Production environment. Because of this, we'll want to defer the Release environment to itself. - -### Optional: also add a compile-only job - -dbt Cloud uses the last successful run of any job in that environment as its [comparison state](/reference/node-selection/syntax#about-node-selection). If you have a lot of PRs in flight, the comparison state could switch around regularly. - -Adding a regularly-scheduled job inside of the QA environment whose only command is `dbt compile` can regenerate a more stable manifest for comparison purposes. - -## Step 5: Test your process - -When the Release Manager is ready to cut a new release, they will manually open a PR from `qa` into `main` from their git provider (e.g. GitHub, GitLab, Azure DevOps). dbt Cloud will detect the new PR, at which point the existing check in the CI environment will trigger and run. When using the [baseline configuration](/guides/orchestration/set-up-ci/in-15-minutes), it's possible to kick off the PR creation from inside of the dbt Cloud IDE. Under this paradigm, that button will create PRs targeting your QA branch instead. - -To test your new flow, create a new branch in the dbt Cloud IDE then add a new file or modify an existing one. 
Commit it, then create a new Pull Request (not a draft) against your `qa` branch. You'll see the integration tests begin to run. Once they complete, manually create a PR against `main`, and within a few seconds you’ll see the tests run again but this time incorporating all changes from all code that hasn't been merged to main yet. diff --git a/website/docs/guides/dbt-ecosystem/databricks-guides/productionizing-your-dbt-databricks-project.md b/website/docs/guides/productionize-your-dbt-databricks-project.md similarity index 89% rename from website/docs/guides/dbt-ecosystem/databricks-guides/productionizing-your-dbt-databricks-project.md rename to website/docs/guides/productionize-your-dbt-databricks-project.md index a3b4be5a051..b95d8ffd2dd 100644 --- a/website/docs/guides/dbt-ecosystem/databricks-guides/productionizing-your-dbt-databricks-project.md +++ b/website/docs/guides/productionize-your-dbt-databricks-project.md @@ -1,19 +1,27 @@ --- -title: Productionizing your dbt Databricks project -id: "productionizing-your-dbt-databricks-project" -sidebar_label: "Productionizing your dbt Databricks project" -description: "Learn how to deliver models to end users and use best practices to maintain production data" +title: Productionize your dbt Databricks project +id: productionize-your-dbt-databricks-project +description: "Learn how to deliver models to end users and use best practices to maintain production data." +displayText: Productionize your dbt Databricks project +hoverSnippet: Learn how to Productionize your dbt Databricks project. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'databricks' +hide_table_of_contents: true +tags: ['Databricks','dbt Core','dbt Cloud'] +level: 'Intermediate' +recently_updated: true --- +## Introduction Welcome to the third installment of our comprehensive series on optimizing and deploying your data pipelines using Databricks and dbt Cloud. In this guide, we'll dive into delivering these models to end users while incorporating best practices to ensure that your production data remains reliable and timely. -## Prerequisites +### Prerequisites -If you don't have any of the following requirements, refer to the instructions in the [setup guide](/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project) to catch up: +If you don't have any of the following requirements, refer to the instructions in the [Set up your dbt project with Databricks](/guides/set-up-your-databricks-dbt-project) for help meeting these requirements: -- You have [set up your Databricks and dbt Cloud environments](/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project). -- You have [optimized your dbt models for peak performance](/guides/dbt-ecosystem/databricks-guides/how_to_optimize_dbt_models_on_databricks). +- You have [Set up your dbt project with Databricks](/guides/set-up-your-databricks-dbt-project). +- You have [optimized your dbt models for peak performance](/guides/optimize-dbt-models-on-databricks). - You have created two catalogs in Databricks: *dev* and *prod*. - You have created Databricks Service Principal to run your production jobs. - You have at least one [deployment environment](/docs/deploy/deploy-environments) in dbt Cloud. @@ -44,7 +52,7 @@ Let’s [create a job](/docs/deploy/deploy-jobs#create-and-schedule-jobs) in dbt 1. Create a new job by clicking **Deploy** in the header, click **Jobs** and then **Create job**. 2. **Name** the job “Daily refresh”. 3. 
Set the **Environment** to your *production* environment. - - This will allow the job to inherit the catalog, schema, credentials, and environment variables defined in the [setup guide](https://docs.getdbt.com/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project#defining-your-dbt-deployment-environment). + - This will allow the job to inherit the catalog, schema, credentials, and environment variables defined in [Set up your dbt project with Databricks](/guides/set-up-your-databricks-dbt-project). 4. Under **Execution Settings** - Check the **Generate docs on run** checkbox to configure the job to automatically generate project docs each time this job runs. This will ensure your documentation stays evergreen as models are added and modified. - Select the **Run on source freshness** checkbox to configure dbt [source freshness](/docs/deploy/source-freshness) as the first step of this job. Your sources will need to be configured to [snapshot freshness information](/docs/build/sources#snapshotting-source-data-freshness) for this to drive meaningful insights. @@ -67,7 +75,7 @@ After your job is set up and runs successfully, configure your **[project artifa This will be our main production job to refresh data that will be used by end users. Another job everyone should include in their dbt project is a continuous integration job. -### Add a CI job +## Add a CI job CI/CD, or Continuous Integration and Continuous Deployment/Delivery, has become a standard practice in software development for rapidly delivering new features and bug fixes while maintaining high quality and stability. dbt Cloud enables you to apply these practices to your data transformations. @@ -79,7 +87,7 @@ dbt allows you to write [tests](/docs/build/tests) for your data pipeline, which 2. **Development**: Running tests during development ensures that your code changes do not break existing assumptions, enabling developers to iterate faster by catching problems immediately after writing code. 3. **CI checks**: Automated CI jobs run and test your pipeline end-to end when a pull request is created, providing confidence to developers, code reviewers, and end users that the proposed changes are reliable and will not cause disruptions or data quality issues -Your CI job will ensure that the models build properly and pass any tests applied to them. We recommend creating a separate *test* environment and having a dedicated service principal. This will ensure the temporary schemas created during CI tests are in their own catalog and cannot unintentionally expose data to other users. Repeat the [steps](/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project) used to create your *prod* environment to create a *test* environment. After setup, you should have: +Your CI job will ensure that the models build properly and pass any tests applied to them. We recommend creating a separate *test* environment and having a dedicated service principal. This will ensure the temporary schemas created during CI tests are in their own catalog and cannot unintentionally expose data to other users. Repeat the steps in [Set up your dbt project with Databricks](/guides/set-up-your-databricks-dbt-project) to create your *prod* environment to create a *test* environment. After setup, you should have: - A catalog called *test* - A service principal called *dbt_test_sp* @@ -89,7 +97,7 @@ We recommend setting up a dbt Cloud CI job. 
This will decrease the job’s runti With dbt tests and SlimCI, you can feel confident that your production data will be timely and accurate even while delivering at high velocity. -### Monitor your jobs +## Monitor your jobs Keeping a close eye on your dbt Cloud jobs is crucial for maintaining a robust and efficient data pipeline. By monitoring job performance and quickly identifying potential issues, you can ensure that your data transformations run smoothly. dbt Cloud provides three entry points to monitor the health of your project: run history, deployment monitor, and status tiles. @@ -101,7 +109,7 @@ The deployment monitor in dbt Cloud offers a higher-level view of your run histo By adding [status tiles](/docs/deploy/dashboard-status-tiles) to your BI dashboards, you can give stakeholders visibility into the health of your data pipeline without leaving their preferred interface. Status tiles instill confidence in your data and help prevent unnecessary inquiries or context switching. To implement dashboard status tiles, you'll need to have dbt docs with [exposures](/docs/build/exposures) defined. -### Notifications +## Set up notifications Setting up [notifications](/docs/deploy/job-notifications) in dbt Cloud allows you to receive alerts via email or a Slack channel whenever a run ends. This ensures that the appropriate teams are notified and can take action promptly when jobs fail or are canceled. To set up notifications: @@ -109,9 +117,9 @@ Setting up [notifications](/docs/deploy/job-notifications) in dbt Cloud allows y 2. Select the **Notifications** tab. 3. Choose the desired notification type (Email or Slack) and configure the relevant settings. -If you require notifications through other means than email or Slack, you can use dbt Cloud's outbound [webhooks](/docs/deploy/webhooks) feature to relay job events to other tools. Webhooks enable you to [integrate dbt Cloud with a wide range of SaaS applications](/guides/orchestration/webhooks), extending your pipeline’s automation into other systems. +If you require notifications through other means than email or Slack, you can use dbt Cloud's outbound [webhooks](/docs/deploy/webhooks) feature to relay job events to other tools. Webhooks enable you to integrate dbt Cloud with a wide range of SaaS applications, extending your pipeline’s automation into other systems. -### Troubleshooting +## Troubleshooting When a disruption occurs in your production pipeline, it's essential to know how to troubleshoot issues effectively to minimize downtime and maintain a high degree of trust with your stakeholders. @@ -122,13 +130,13 @@ The five key steps for troubleshooting dbt Cloud issues are: 3. Isolate the problem by running one model at a time in the IDE or undoing the code that caused the issue. 4. Check for problems in compiled files and logs. -Consult the [Debugging errors documentation](/guides/best-practices/debugging-errors) for a comprehensive list of error types and diagnostic methods. +Consult the [Debugging errors documentation](/guides/debug-errors) for a comprehensive list of error types and diagnostic methods. To troubleshoot issues with a dbt Cloud job, navigate to the "Deploy > Run History" tab in your dbt Cloud project and select the failed run. Then, expand the run steps to view [console and debug logs](/docs/deploy/run-visibility#access-logs) to review the detailed log messages. To obtain additional information, open the Artifacts tab and download the compiled files associated with the run. 
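If clicking through the Artifacts tab becomes repetitive, the same files can be pulled down programmatically. The sketch below is not part of the original guide: it uses the dbt Cloud Administrative API's run artifacts endpoint to fetch `run_results.json` for a run and print the nodes that did not succeed. The environment variable names are placeholders and the `requests` library is assumed to be installed.

```python
import os

import requests

# Placeholders -- substitute your account ID, the run you are investigating, and a service token.
DBT_CLOUD_HOST = "https://cloud.getdbt.com"
ACCOUNT_ID = os.environ["DBT_ACCOUNT_ID"]
RUN_ID = os.environ["DBT_RUN_ID"]
HEADERS = {"Authorization": f"Token {os.environ['DBT_CLOUD_API_TOKEN']}"}

url = f"{DBT_CLOUD_HOST}/api/v2/accounts/{ACCOUNT_ID}/runs/{RUN_ID}/artifacts/run_results.json"
run_results = requests.get(url, headers=HEADERS).json()

# Each entry in "results" describes one executed node (model, test, seed, or snapshot).
for result in run_results["results"]:
    if result["status"] not in ("success", "pass"):
        print(f'{result["unique_id"]}: {result["status"]} -> {result.get("message")}')
```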
If your jobs are taking longer than expected, use the [model timing](/docs/deploy/run-visibility#model-timing) dashboard to identify bottlenecks in your pipeline. Analyzing the time taken for each model execution helps you pinpoint the slowest components and optimize them for better performance. The Databricks [Query History](https://docs.databricks.com/sql/admin/query-history.html) lets you inspect granular details such as time spent in each task, rows returned, I/O performance, and execution plan. -For more on performance tuning, see our guide on [How to Optimize and Troubleshoot dbt Models on Databricks](/guides/dbt-ecosystem/databricks-guides/how_to_optimize_dbt_models_on_databricks). +For more on performance tuning, see our guide on [How to Optimize and Troubleshoot dbt Models on Databricks](/guides/optimize-dbt-models-on-databricks). ## Advanced considerations @@ -148,11 +156,11 @@ Inserting dbt Cloud jobs into a Databricks Workflows allows you to chain togethe - Logs and Run History: Accessing logs and run history becomes more convenient when using dbt Cloud. - Monitoring and Notification Features: dbt Cloud comes equipped with monitoring and notification features like the ones described above that can help you stay informed about the status and performance of your jobs. -To trigger your dbt Cloud job from Databricks, follow the instructions in our [Databricks Workflows to run dbt Cloud jobs guide](/guides/orchestration/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs). +To trigger your dbt Cloud job from Databricks, follow the instructions in our [Databricks Workflows to run dbt Cloud jobs guide](/guides/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs). -### Data masking +## Data masking -Our [Best Practices for dbt and Unity Catalog](/guides/dbt-ecosystem/databricks-guides/dbt-unity-catalog-best-practices) guide recommends using separate catalogs *dev* and *prod* for development and deployment environments, with Unity Catalog and dbt Cloud handling configurations and permissions for environment isolation. Ensuring security while maintaining efficiency in your development and deployment environments is crucial. Additional security measures may be necessary to protect sensitive data, such as personally identifiable information (PII). +Our [Best Practices for dbt and Unity Catalog](/best-practices/dbt-unity-catalog-best-practices) guide recommends using separate catalogs *dev* and *prod* for development and deployment environments, with Unity Catalog and dbt Cloud handling configurations and permissions for environment isolation. Ensuring security while maintaining efficiency in your development and deployment environments is crucial. Additional security measures may be necessary to protect sensitive data, such as personally identifiable information (PII). Databricks leverages [Dynamic Views](https://docs.databricks.com/data-governance/unity-catalog/create-views.html#create-a-dynamic-view) to enable data masking based on group membership. Because views in Unity Catalog use Spark SQL, you can implement advanced data masking by using more complex SQL expressions and regular expressions. You can now also apply fine grained access controls like row filters in preview and column masks in preview on tables in Databricks Unity Catalog, which will be the recommended approach to protect sensitive data once this goes GA. 
Additionally, in the near term, Databricks Unity Catalog will also enable Attribute Based Access Control natively, which will make protecting sensitive data at scale simpler. @@ -179,10 +187,10 @@ Unity Catalog is a unified governance solution for your lakehouse. It provides a To get the most out of both tools, you can use the [persist docs config](/reference/resource-configs/persist_docs) to push table and column descriptions written in dbt into Unity Catalog, making the information easily accessible to both tools' users. Keeping the descriptions in dbt ensures they are version controlled and can be reproduced after a table is dropped. -## Additional resources +### Related docs - [Advanced deployments course](https://courses.getdbt.com/courses/advanced-deployment) if you want a deeper dive into these topics - [Autoscaling CI: The intelligent Slim CI](https://docs.getdbt.com/blog/intelligent-slim-ci) - [Trigger a dbt Cloud Job in your automated workflow with Python](https://discourse.getdbt.com/t/triggering-a-dbt-cloud-job-in-your-automated-workflow-with-python/2573) -- [Databricks + dbt Cloud Quickstart Guide](/quickstarts/databricks) +- [Databricks + dbt Cloud Quickstart Guide](/guides/databricks) - Reach out to your Databricks account team to get access to preview features on Databricks. diff --git a/website/docs/quickstarts/redshift-qs.md b/website/docs/guides/redshift-qs.md similarity index 99% rename from website/docs/quickstarts/redshift-qs.md rename to website/docs/guides/redshift-qs.md index 67f66d6e275..9296e6c6568 100644 --- a/website/docs/quickstarts/redshift-qs.md +++ b/website/docs/guides/redshift-qs.md @@ -1,9 +1,10 @@ --- title: "Quickstart for dbt Cloud and Redshift" -id: "redshift" -platform: 'dbt-cloud' +id: redshift +level: 'Beginner' icon: 'redshift' hide_table_of_contents: true +tags: ['Redshift', 'dbt Cloud','Quickstart'] --- ## Introduction diff --git a/website/docs/guides/migration/tools/refactoring-legacy-sql.md b/website/docs/guides/refactoring-legacy-sql.md similarity index 93% rename from website/docs/guides/migration/tools/refactoring-legacy-sql.md rename to website/docs/guides/refactoring-legacy-sql.md index d9acfea6dab..a339e523020 100644 --- a/website/docs/guides/migration/tools/refactoring-legacy-sql.md +++ b/website/docs/guides/refactoring-legacy-sql.md @@ -2,15 +2,24 @@ title: Refactoring legacy SQL to dbt id: refactoring-legacy-sql description: This guide walks through refactoring a long SQL query (perhaps from a stored procedure) into modular dbt data models. +displayText: Creating new materializations +hoverSnippet: Learn how to refactoring a long SQL query into modular dbt data models. +# time_to_complete: '30 minutes' commenting out until we test +platform: 'dbt-cloud' +icon: 'guides' +hide_table_of_contents: true +tags: ['SQL'] +level: 'Advanced' +recently_updated: true --- -You may have already learned how to build dbt models from scratch. +## Introduction -But in reality, you probably already have some queries or stored procedures that power analyses and dashboards, and now you’re wondering how to port those into dbt. +You may have already learned how to build dbt models from scratch. But in reality, you probably already have some queries or stored procedures that power analyses and dashboards, and now you’re wondering how to port those into dbt. There are two parts to accomplish this: migration and refactoring. In this guide we’re going to learn a process to help us turn legacy SQL code into modular dbt models. 
-When migrating and refactoring code, it’s of course important to stay organized. We'll do this by following several steps (jump directly from the right sidebar): +When migrating and refactoring code, it’s of course important to stay organized. We'll do this by following several steps: 1. Migrate your code 1:1 into dbt 2. Implement dbt sources rather than referencing raw database tables @@ -21,9 +30,10 @@ When migrating and refactoring code, it’s of course important to stay organize Let's get into it! -:::info More resources. -This guide is excerpted from the new dbt Learn On-demand Course, "Refactoring SQL for Modularity" - if you're curious, pick up the [free refactoring course here](https://courses.getdbt.com/courses/refactoring-sql-for-modularity), which includes example and practice refactoring projects. Or for a more in-depth look at migrating DDL and DML from stored procedures check out [this guide](/guides/migration/tools/migrating-from-stored-procedures/1-migrating-from-stored-procedures). +:::info More resources +This guide is excerpted from the new dbt Learn On-demand Course, "Refactoring SQL for Modularity" - if you're curious, pick up the [free refactoring course here](https://courses.getdbt.com/courses/refactoring-sql-for-modularity), which includes example and practice refactoring projects. Or for a more in-depth look at migrating DDL and DML from stored procedures, refer to the[Migrate from stored procedures](/guides/migrate-from-stored-procedures) guide. ::: + ## Migrate your existing SQL code @@ -38,7 +48,7 @@ To get going, you'll copy your legacy SQL query into your dbt project, by saving Once you've copied it over, you'll want to `dbt run` to execute the query and populate the in your warehouse. -If this is your first time running dbt, you may want to start with the [Introduction to dbt](/docs/introduction) and the earlier sections of the [quickstart guide](/quickstarts) before diving into refactoring. +If this is your first time running dbt, you may want to start with the [Introduction to dbt](/docs/introduction) and the earlier sections of the [quickstart guide](/guides) before diving into refactoring. This step may sound simple, but if you're porting over an existing set of SQL transformations to a new SQL dialect, you will need to consider how your legacy SQL dialect differs from your new SQL flavor, and you may need to modify your legacy code to get it to run at all. @@ -206,7 +216,7 @@ This allows anyone after us to easily step through the CTEs when troubleshooting ## Port CTEs to individual data models Rather than keep our SQL code confined to one long SQL file, we'll now start splitting it into modular + reusable [dbt data models](https://docs.getdbt.com/docs/build/models). -Internally at dbt Labs, we follow roughly this [data modeling technique](https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique/) and we [structure our dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) accordingly. +Internally at dbt Labs, we follow roughly this [data modeling technique](https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique/) and we [structure our dbt projects](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview) accordingly. We'll follow those structures in this walkthrough, but your team's conventions may of course differ from ours. 
@@ -243,7 +253,7 @@ Under the hood, it generates comparison queries between our before and after sta Sure, we could write our own query manually to audit these models, but using the dbt `audit_helper` package gives us a head start and allows us to identify variances more quickly. -## Ready for refactoring practice? +### Ready for refactoring practice? Head to the free on-demand course, [Refactoring from Procedural SQL to dbt](https://courses.getdbt.com/courses/refactoring-sql-for-modularity) for a more in-depth refactoring example + a practice refactoring problem to test your skills. Questions on this guide or the course? Drop a note in #learn-on-demand in [dbt Community Slack](https://getdbt.com/community). diff --git a/website/docs/guides/orchestration/webhooks/serverless-datadog.md b/website/docs/guides/serverless-datadog.md similarity index 67% rename from website/docs/guides/orchestration/webhooks/serverless-datadog.md rename to website/docs/guides/serverless-datadog.md index 6bd38869259..931ba9832ab 100644 --- a/website/docs/guides/orchestration/webhooks/serverless-datadog.md +++ b/website/docs/guides/serverless-datadog.md @@ -1,62 +1,71 @@ --- title: "Create Datadog events from dbt Cloud results" -id: webhooks-guide-serverless-datadog -slug: serverless-datadog -description: Configure a serverless app to add Datadog logs +id: serverless-datadog +description: Configure a serverless app to add dbt Cloud events to Datadog logs. +hoverSnippet: Learn how to configure a serverless app to add dbt Cloud events to Datadog logs. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Webhooks'] +level: 'Advanced' +recently_updated: true --- +## Introduction + This guide will teach you how to build and host a basic Python app which will add dbt Cloud job events to Datadog. To do this, when a dbt Cloud job completes it will create a log entry for each node that was run, containing all information about the node provided by the [Discovery API](/docs/dbt-cloud-apis/discovery-schema-job-models). In this example, we will use [fly.io](https://fly.io) for hosting/running the service. fly.io is a platform for running full stack apps without provisioning servers etc. This level of usage should comfortably fit inside of the Free tier. You can also use an alternative tool such as [AWS Lambda](https://adem.sh/blog/tutorial-fastapi-aws-lambda-serverless) or [Google Cloud Run](https://github.com/sekR4/FastAPI-on-Google-Cloud-Run). -## Prerequisites +### Prerequisites + This guide assumes some familiarity with: - [dbt Cloud Webhooks](/docs/deploy/webhooks) - CLI apps - Deploying code to a serverless code runner like fly.io or AWS Lambda -## Integration steps - -### 1. Clone the `dbt-cloud-webhooks-datadog` repo +## Clone the `dbt-cloud-webhooks-datadog` repo [This repository](https://github.com/dpguthrie/dbt-cloud-webhooks-datadog) contains the sample code for validating a webhook and creating logs in Datadog. -### 2. 
Install `flyctl` and sign up for fly.io +## Install `flyctl` and sign up for fly.io Follow the directions for your OS in the [fly.io docs](https://fly.io/docs/hands-on/install-flyctl/), then from your command line, run the following commands: Switch to the directory containing the repo you cloned in step 1: -```shell -#example: replace with your actual path -cd ~/Documents/GitHub/dbt-cloud-webhooks-datadog -``` + + ```shell + #example: replace with your actual path + cd ~/Documents/GitHub/dbt-cloud-webhooks-datadog + ``` Sign up for fly.io: -```shell -flyctl auth signup -``` + ```shell + flyctl auth signup + ``` Your console should show `successfully logged in as YOUR_EMAIL` when you're done, but if it doesn't then sign in to fly.io from your command line: -```shell -flyctl auth login -``` + ```shell + flyctl auth login + ``` + +## Launch your fly.io app -### 3. Launch your fly.io app Launching your app publishes it to the web and makes it ready to catch webhook events: -```shell -flyctl launch -``` + ```shell + flyctl launch + ``` -You will see a message saying that an existing `fly.toml` file was found. Type `y` to copy its configuration to your new app. +1. You will see a message saying that an existing `fly.toml` file was found. Type `y` to copy its configuration to your new app. -Choose an app name of your choosing, such as `YOUR_COMPANY-dbt-cloud-webhook-datadog`, or leave blank and one will be generated for you. Note that your name can only contain numbers, lowercase letters and dashes. +2. Choose an app name of your choosing, such as `YOUR_COMPANY-dbt-cloud-webhook-datadog`, or leave blank and one will be generated for you. Note that your name can only contain numbers, lowercase letters and dashes. -Choose a deployment region, and take note of the hostname that is generated (normally `APP_NAME.fly.dev`). +3. Choose a deployment region, and take note of the hostname that is generated (normally `APP_NAME.fly.dev`). -When asked if you would like to set up Postgresql or Redis databases, type `n` for each. +4. When asked if you would like to set up Postgresql or Redis databases, type `n` for each. -Type `y` when asked if you would like to deploy now. +5. Type `y` when asked if you would like to deploy now.
    Sample output from the setup wizard: @@ -86,16 +95,16 @@ Wrote config file fly.toml
    ### 4. Create a Datadog API Key [Create an API Key for your Datadog account](https://docs.datadoghq.com/account_management/api-app-keys/) and make note of it and your Datadog site (e.g. `datadoghq.com`) for later. -### 5. Configure a new webhook in dbt Cloud -See [Create a webhook subscription](/docs/deploy/webhooks#create-a-webhook-subscription) for full instructions. Your event should be **Run completed**. - -Set the webhook URL to the host name you created earlier (`APP_NAME.fly.dev`) +## Configure a new webhook in dbt Cloud -Make note of the Webhook Secret Key for later. +1. See [Create a webhook subscription](/docs/deploy/webhooks#create-a-webhook-subscription) for full instructions. Your event should be **Run completed**. +2. Set the webhook URL to the host name you created earlier (`APP_NAME.fly.dev`). +3. Make note of the Webhook Secret Key for later. *Do not test the endpoint*; it won't work until you have stored the auth keys (next step) -### 6. Store secrets +## Store secrets + The application requires four secrets to be set, using these names: - `DBT_CLOUD_SERVICE_TOKEN`: a dbt Cloud [user token](https://docs.getdbt.com/docs/dbt-cloud-apis/user-tokens) or [service account token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens) with at least the `Metdata Only` permission. - `DBT_CLOUD_AUTH_TOKEN`: the Secret Key for the dbt Cloud webhook you created earlier. @@ -103,9 +112,10 @@ The application requires four secrets to be set, using these names: - `DD_SITE`: The Datadog site for your organisation, e.g. `datadoghq.com`. Set these secrets as follows, replacing `abc123` etc with actual values: -```shell -flyctl secrets set DBT_CLOUD_SERVICE_TOKEN=abc123 DBT_CLOUD_AUTH_TOKEN=def456 DD_API_KEY=ghi789 DD_SITE=datadoghq.com -``` + ```shell + flyctl secrets set DBT_CLOUD_SERVICE_TOKEN=abc123 DBT_CLOUD_AUTH_TOKEN=def456 DD_API_KEY=ghi789 DD_SITE=datadoghq.com + ``` + +## Deploy your app -### 7. Deploy your app After you set your secrets, fly.io will redeploy your application. When it has completed successfully, go back to the dbt Cloud webhook settings and click **Test Endpoint**. diff --git a/website/docs/guides/orchestration/webhooks/serverless-pagerduty.md b/website/docs/guides/serverless-pagerduty.md similarity index 87% rename from website/docs/guides/orchestration/webhooks/serverless-pagerduty.md rename to website/docs/guides/serverless-pagerduty.md index 5455af60110..50cc1b2b36e 100644 --- a/website/docs/guides/orchestration/webhooks/serverless-pagerduty.md +++ b/website/docs/guides/serverless-pagerduty.md @@ -1,10 +1,18 @@ --- -title: "Create PagerDuty alarms from failed dbt Cloud tasks" -id: webhooks-guide-serverless-pagerduty -slug: serverless-pagerduty -description: Configure a serverless app to create PagerDuty alarms +title: "Trigger PagerDuty alarms when dbt Cloud jobs fail" +id: serverless-pagerduty +description: Use webhooks to configure a serverless app to trigger PagerDuty alarms. +hoverSnippet: Learn how to configure a serverless app that uses webhooks to trigger PagerDuty alarms. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Webhooks'] +level: 'Advanced' +recently_updated: true --- +## Introduction + This guide will teach you how to build and host a basic Python app which will monitor dbt Cloud jobs and create PagerDuty alarms based on failure. To do this, when a dbt Cloud job completes it will: - Check for any failed nodes (e.g. 
non-passing tests or errored models), and - create a PagerDuty alarm based on those nodes by calling the PagerDuty Events API. Events are deduplicated per run ID. @@ -13,20 +21,20 @@ This guide will teach you how to build and host a basic Python app which will mo In this example, we will use fly.io for hosting/running the service. fly.io is a platform for running full stack apps without provisioning servers etc. This level of usage should comfortably fit inside of the Free tier. You can also use an alternative tool such as [AWS Lambda](https://adem.sh/blog/tutorial-fastapi-aws-lambda-serverless) or [Google Cloud Run](https://github.com/sekR4/FastAPI-on-Google-Cloud-Run). -## Prerequisites +### Prerequisites + This guide assumes some familiarity with: - [dbt Cloud Webhooks](/docs/deploy/webhooks) - CLI apps - Deploying code to a serverless code runner like fly.io or AWS Lambda -## Integration steps -### 1. Clone the `dbt-cloud-webhooks-pagerduty` repo +## Clone the `dbt-cloud-webhooks-pagerduty` repo [This repository](https://github.com/dpguthrie/dbt-cloud-webhooks-pagerduty) contains the sample code for validating a webhook and creating events in PagerDuty. -### 2. Install `flyctl` and sign up for fly.io +## Install `flyctl` and sign up for fly.io Follow the directions for your OS in the [fly.io docs](https://fly.io/docs/hands-on/install-flyctl/), then from your command line, run the following commands: @@ -46,7 +54,7 @@ Your console should show `successfully logged in as YOUR_EMAIL` when you're done flyctl auth login ``` -### 3. Launch your fly.io app +## Launch your fly.io app Launching your app publishes it to the web and makes it ready to catch webhook events: ```shell flyctl launch @@ -87,12 +95,12 @@ Wrote config file fly.toml
    -### 4. Create a PagerDuty integration application +## Create a PagerDuty integration application See [PagerDuty's guide](https://developer.pagerduty.com/docs/ZG9jOjExMDI5NTgw-events-api-v2-overview#getting-started) for full instructions. Make note of the integration key for later. -### 5. Configure a new webhook in dbt Cloud +## Configure a new webhook in dbt Cloud See [Create a webhook subscription](/docs/deploy/webhooks#create-a-webhook-subscription) for full instructions. Your event should be **Run completed**. Set the webhook URL to the host name you created earlier (`APP_NAME.fly.dev`) @@ -101,7 +109,7 @@ Make note of the Webhook Secret Key for later. *Do not test the endpoint*; it won't work until you have stored the auth keys (next step) -### 6. Store secrets +## Store secrets The application requires three secrets to be set, using these names: - `DBT_CLOUD_SERVICE_TOKEN`: a dbt Cloud [user token](https://docs.getdbt.com/docs/dbt-cloud-apis/user-tokens) or [service account token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens) with at least the `Metdata Only` permission. - `DBT_CLOUD_AUTH_TOKEN`: the Secret Key for the dbt Cloud webhook you created earlier. @@ -112,5 +120,6 @@ Set these secrets as follows, replacing `abc123` etc with actual values: flyctl secrets set DBT_CLOUD_SERVICE_TOKEN=abc123 DBT_CLOUD_AUTH_TOKEN=def456 PD_ROUTING_KEY=ghi789 ``` -### 7. Deploy your app -After you set your secrets, fly.io will redeploy your application. When it has completed successfully, go back to the dbt Cloud webhook settings and click **Test Endpoint**. \ No newline at end of file +## Deploy your app + +After you set your secrets, fly.io will redeploy your application. When it has completed successfully, go back to the dbt Cloud webhook settings and click **Test Endpoint**. diff --git a/website/docs/guides/set-up-ci.md b/website/docs/guides/set-up-ci.md new file mode 100644 index 00000000000..83362094ec6 --- /dev/null +++ b/website/docs/guides/set-up-ci.md @@ -0,0 +1,355 @@ +--- +title: "Get started with Continuous Integration tests" +description: Implement a CI environment for safe project validation. +hoverSnippet: Learn how to implement a CI environment for safe project validation. +id: set-up-ci +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['dbt Cloud', 'Orchestration', 'CI'] +level: 'Intermediate' +recently_updated: true +--- + +## Introduction + +By validating your code _before_ it goes into production, you don't need to spend your afternoon fielding messages from people whose reports are suddenly broken. + +A solid CI setup is critical to preventing avoidable downtime and broken trust. dbt Cloud uses **sensible defaults** to get you up and running in a performant and cost-effective way in minimal time. + +After that, there's time to get fancy, but let's walk before we run. + +In this guide, we're going to add a **CI environment**, where proposed changes can be validated in the context of the entire project without impacting production systems. We will use a single set of deployment credentials (like the Prod environment), but models are built in a separate location to avoid impacting others (like the Dev environment). + +Your git flow will look like this: + + +### Prerequisites + +As part of your initial dbt Cloud setup, you should already have Development and Production environments configured. Let's recap what each does: + +- Your **Development environment** powers the IDE. 
Each user has individual credentials, and builds into an individual dev schema. Nothing you do here impacts any of your colleagues. +- Your **Production environment** brings the canonical version of your project to life for downstream consumers. There is a single set of deployment credentials, and everything is built into your production schema(s). + +## Create a new CI environment + +See [Create a new environment](/docs/dbt-cloud-environments#create-a-deployment-environment). The environment should be called **CI**. Just like your existing Production environment, it will be a Deployment-type environment. + +When setting a Schema in the **Deployment Credentials** area, remember that dbt Cloud will automatically generate a custom schema name for each PR to ensure that they don't interfere with your deployed models. This means you can safely set the same Schema name as your Production job. + +### 1. Double-check your Production environment is identified + +Go into your existing Production environment, and ensure that the **Set as Production environment** checkbox is set. It'll make things easier later. + +### 2. Create a new job in the CI environment + +Use the **Continuous Integration Job** template, and call the job **CI Check**. + +In the Execution Settings, your command will be preset to `dbt build --select state:modified+`. Let's break this down: + +- [`dbt build`](/reference/commands/build) runs all nodes (seeds, models, snapshots, tests) at once in DAG order. If something fails, nodes that depend on it will be skipped. +- The [`state:modified+` selector](/reference/node-selection/methods#the-state-method) means that only modified nodes and their children will be run ("Slim CI"). In addition to [not wasting time](https://discourse.getdbt.com/t/how-we-sped-up-our-ci-runs-by-10x-using-slim-ci/2603) building and testing nodes that weren't changed in the first place, this significantly reduces compute costs. + +To be able to find modified nodes, dbt needs to have something to compare against. dbt Cloud uses the last successful run of any job in your Production environment as its [comparison state](/reference/node-selection/syntax#about-node-selection). As long as you identified your Production environment in Step 2, you won't need to touch this. If you didn't, pick the right environment from the dropdown. + +### 3. Test your process + +That's it! There are other steps you can take to be even more confident in your work, such as validating your structure follows best practices and linting your code. For more information, refer to [Get started with Continuous Integration tests](/guides/set-up-ci). + +To test your new flow, create a new branch in the dbt Cloud IDE then add a new file or modify an existing one. Commit it, then create a new Pull Request (not a draft). Within a few seconds, you’ll see a new check appear in your git provider. + +### Things to keep in mind + +- If you make a new commit while a CI run based on older code is in progress, it will be automatically canceled and replaced with the fresh code. +- An unlimited number of CI jobs can run at once. If 10 developers all commit code to different PRs at the same time, each person will get their own schema containing their changes. Once each PR is merged, dbt Cloud will drop that schema. +- CI jobs will never block a production run. 
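If you want to preview which nodes a run like the **CI Check** job would pick up, you can reproduce the same selection locally with dbt's state method. This is a sketch under one assumption: you've downloaded the artifacts (notably `manifest.json`) from a recent production run into a local folder, here called `prod-artifacts/`:

```shell
# list the nodes that `state:modified+` would select, compared against production artifacts
dbt ls --select state:modified+ --state prod-artifacts/

# or build them locally, mirroring what the CI job runs
dbt build --select state:modified+ --state prod-artifacts/
```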
+ +## Enforce best practices with dbt project evaluator + +dbt Project Evaluator is a package designed to identify deviations from best practices common to many dbt projects, including modeling, testing, documentation, structure and performance problems. For an introduction to the package, read its [launch blog post](/blog/align-with-dbt-project-evaluator). + +### 1. Install the package + +As with all packages, add a reference to `dbt-labs/dbt_project_evaluator` to your `packages.yml` file. See the [dbt Package Hub](https://hub.getdbt.com/dbt-labs/dbt_project_evaluator/latest/) for full installation instructions. + +### 2. Define test severity with an environment variable + +As noted in the [documentation](https://dbt-labs.github.io/dbt-project-evaluator/latest/ci-check/), tests in the package are set to `warn` severity by default. + +To have these tests fail in CI, create a new environment called `DBT_PROJECT_EVALUATOR_SEVERITY`. Set the project-wide default to `warn`, and set it to `error` in the CI environment. + +In your `dbt_project.yml` file, override the severity configuration: + +```yaml +tests: +dbt_project_evaluator: + +severity: "{{ env_var('DBT_PROJECT_EVALUATOR_SEVERITY', 'warn') }}" +``` + +### 3. Update your CI commands + +Because these tests should only run after the rest of your project has been built, your existing CI command will need to be updated to exclude the dbt_project_evaluator package. You will then add a second step which builds _only_ the package's models and tests. + +Update your steps to: + +```bash +dbt build --select state:modified+ --exclude package:dbt_project_evaluator +dbt build --select package:dbt_project_evaluator +``` + +### 4. Apply any customizations + +Depending on the state of your project when you roll out the evaluator, you may need to skip some tests or allow exceptions for some areas. To do this, refer to the documentation on: + +- [disabling tests](https://dbt-labs.github.io/dbt-project-evaluator/latest/customization/customization/) +- [excluding groups of models from a specific test](https://dbt-labs.github.io/dbt-project-evaluator/latest/customization/exceptions/) +- [excluding packages or sources/models based on path](https://dbt-labs.github.io/dbt-project-evaluator/latest/customization/excluding-packages-and-paths/) + +If you create a seed to exclude groups of models from a specific test, remember to disable the default seed and include `dbt_project_evaluator_exceptions` in your second `dbt build` command above. + +## Run linting checks with SQLFluff + +By [linting](/docs/cloud/dbt-cloud-ide/lint-format#lint) your project during CI, you can ensure that code styling standards are consistently enforced, without spending human time nitpicking comma placement. + +The steps below create an action/pipeline which uses [SQLFluff](https://docs.sqlfluff.com/en/stable/) to scan your code and look for linting errors. If you don't already have SQLFluff rules defined, check out [our recommended config file](/best-practices/how-we-style/2-how-we-style-our-sql). + +### 1. Create a YAML file to define your pipeline + +The YAML files defined below are what tell your code hosting platform the steps to run. In this setup, you’re telling the platform to run a SQLFluff lint job every time a commit is pushed. + + + + +GitHub Actions are defined in the `.github/workflows` directory. To define the job for your action, add a new file named `lint_on_push.yml` under the `workflows` folder. 
Your final folder structure will look like this: + +```sql +my_awesome_project +├── .github +│ ├── workflows +│ │ └── lint_on_push.yml +``` + +**Key pieces:** + +- `on:` defines when the pipeline is run. This workflow will run whenever code is pushed to any branch except `main`. For other trigger options, check out [GitHub’s docs](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows). +- `runs-on: ubuntu-latest` - this defines the operating system we’re using to run the job +- `uses:` - When the Ubuntu server is created, it is completely empty. [`checkout`](https://github.com/actions/checkout#checkout-v3) and [`setup-python`](https://github.com/actions/setup-python#setup-python-v3) are public GitHub Actions which enable the server to access the code in your repo, and set up Python correctly. +- `run:` - these steps are run at the command line, as though you typed them at a prompt yourself. This will install sqlfluff and lint the project. Be sure to set the correct `--dialect` for your project. + +For a full breakdown of the properties in a workflow file, see [Understanding the workflow file](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#understanding-the-workflow-file) on GitHub's website. + +```yaml +name: lint dbt project on push + +on: + push: + branches-ignore: + - 'main' + +jobs: + # this job runs SQLFluff with a specific set of rules + # note the dialect is set to Snowflake, so make that specific to your setup + # details on linter rules: https://docs.sqlfluff.com/en/stable/rules.html + lint_project: + name: Run SQLFluff linter + runs-on: ubuntu-latest + + steps: + - uses: "actions/checkout@v3" + - uses: "actions/setup-python@v4" + with: + python-version: "3.9" + - name: Install SQLFluff + run: "pip install sqlfluff" + - name: Lint project + run: "sqlfluff lint models --dialect snowflake" + +``` + + + + +Create a `.gitlab-ci.yml` file in your **root directory** to define the triggers for when to execute the script below. You’ll put the code below into this file. + +```sql +my_awesome_project +├── dbt_project.yml +├── .gitlab-ci.yml +``` + +**Key pieces:** + +- `image: python:3.9` - this defines the virtual image we’re using to run the job +- `rules:` - defines when the pipeline is run. This workflow will run whenever code is pushed to any branch except `main`. For other rules, refer to [GitLab’s documentation](https://docs.gitlab.com/ee/ci/yaml/#rules). +- `script:` - this is how we’re telling the GitLab runner to execute the Python script we defined above. + +```yaml +image: python:3.9 + +stages: + - pre-build + +# this job runs SQLFluff with a specific set of rules +# note the dialect is set to Snowflake, so make that specific to your setup +# details on linter rules: https://docs.sqlfluff.com/en/stable/rules.html +lint-project: + stage: pre-build + rules: + - if: $CI_PIPELINE_SOURCE == "push" && $CI_COMMIT_BRANCH != 'main' + script: + - pip install sqlfluff + - sqlfluff lint models --dialect snowflake +``` + + + + +Create a `bitbucket-pipelines.yml` file in your **root directory** to define the triggers for when to execute the script below. You’ll put the code below into this file. + +```sql +my_awesome_project +├── bitbucket-pipelines.yml +├── dbt_project.yml +``` + +**Key pieces:** + +- `image: python:3.11.1` - this defines the virtual image we’re using to run the job +- `'**':` - this is used to filter when the pipeline runs. 
In this case we’re telling it to run on every push event, and you can see at line 12 we're creating a dummy pipeline for `main`. More information on filtering when a pipeline is run can be found in [Bitbucket's documentation](https://support.atlassian.com/bitbucket-cloud/docs/pipeline-triggers/) +- `script:` - this is how we’re telling the Bitbucket runner to execute the Python script we defined above. + +```yaml +image: python:3.11.1 + + +pipelines: + branches: + '**': # this sets a wildcard to run on every branch + - step: + name: Lint dbt project + script: + - pip install sqlfluff==0.13.1 + - sqlfluff lint models --dialect snowflake --rules L019,L020,L021,L022 + + 'main': # override if your default branch doesn't run on a branch named "main" + - step: + script: + - python --version +``` + + + + +### 2. Commit and push your changes to make sure everything works + +After you finish creating the YAML files, commit and push your code to trigger your pipeline for the first time. If everything goes well, you should see the pipeline in your code platform. When you click into the job you’ll get a log showing that SQLFluff was run. If your code failed linting you’ll get an error in the job with a description of what needs to be fixed. If everything passed the lint check, you’ll see a successful job run. + + + + +In your repository, click the _Actions_ tab + +![Image showing the GitHub action for lint on push](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-github.png) + +Sample output from SQLFluff in the `Run SQLFluff linter` job: + +![Image showing the logs in GitHub for the SQLFluff run](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-logs-github.png) + + + + +In the menu option go to *CI/CD > Pipelines* + +![Image showing the GitLab action for lint on push](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-gitlab.png) + +Sample output from SQLFluff in the `Run SQLFluff linter` job: + +![Image showing the logs in GitLab for the SQLFluff run](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-logs-gitlab.png) + + + + +In the left menu pane, click on *Pipelines* + +![Image showing the Bitbucket action for lint on push](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-bitbucket.png) + +Sample output from SQLFluff in the `Run SQLFluff linter` job: + +![Image showing the logs in Bitbucket for the SQLFluff run](/img/guides/orchestration/custom-cicd-pipelines/lint-on-push-logs-bitbucket.png) + + + + +## Advanced: Create a release train with additional environments + +Large and complex enterprises sometimes require additional layers of validation before deployment. Learn how to add these checks with dbt Cloud. + +:::caution Are you sure you need this? +This approach can increase release safety, but creates additional manual steps in the deployment process as well as a greater maintenance burden. + +As such, it may slow down the time it takes to get new features into production. + +The team at Sunrun maintained a SOX-compliant deployment in dbt while reducing the number of environments. Check out [their Coalesce presentation](https://www.youtube.com/watch?v=vmBAO2XN-fM) to learn more. +::: + +In this section, we will add a new **QA** environment. New features will branch off from and be merged back into the associated `qa` branch, and a member of your team (the "Release Manager") will create a PR against `main` to be validated in the CI environment before going live. 
+ +The git flow will look like this: + + +### Advanced prerequisites + +- You have the **Development**, **CI**, and **Production** environments, as described in [the Baseline setup](/guides/set-up-ci). + +### 1. Create a `release` branch in your git repo + +As noted above, this branch will outlive any individual feature, and will be the base of all feature development for a period of time. Your team might choose to create a new branch for each sprint (`qa/sprint-01`, `qa/sprint-02`, etc), tie it to a version of your data product (`qa/1.0`, `qa/1.1`), or just have a single `qa` branch which remains active indefinitely. + +### 2. Update your Development environment to use the `qa` branch + +See [Custom branch behavior](/docs/dbt-cloud-environments#custom-branch-behavior). Setting `qa` as your custom branch ensures that the IDE creates new branches and PRs with the correct target, instead of using `main`. + + + +### 3. Create a new QA environment + +See [Create a new environment](/docs/dbt-cloud-environments#create-a-deployment-environment). The environment should be called **QA**. Just like your existing Production and CI environments, it will be a Deployment-type environment. + +Set its branch to `qa` as well. + +### 4. Create a new job + +Use the **Continuous Integration Job** template, and call the job **QA Check**. + +In the Execution Settings, your command will be preset to `dbt build --select state:modified+`. Let's break this down: + +- [`dbt build`](/reference/commands/build) runs all nodes (seeds, models, snapshots, tests) at once in DAG order. If something fails, nodes that depend on it will be skipped. +- The [`state:modified+` selector](/reference/node-selection/methods#the-state-method) means that only modified nodes and their children will be run ("Slim CI"). In addition to [not wasting time](https://discourse.getdbt.com/t/how-we-sped-up-our-ci-runs-by-10x-using-slim-ci/2603) building and testing nodes that weren't changed in the first place, this significantly reduces compute costs. + +To be able to find modified nodes, dbt needs to have something to compare against. Normally, we use the Production environment as the source of truth, but in this case there will be new code merged into `qa` long before it hits the `main` branch and Production environment. Because of this, we'll want to defer the Release environment to itself. + +### Optional: also add a compile-only job + +dbt Cloud uses the last successful run of any job in that environment as its [comparison state](/reference/node-selection/syntax#about-node-selection). If you have a lot of PRs in flight, the comparison state could switch around regularly. + +Adding a regularly-scheduled job inside of the QA environment whose only command is `dbt compile` can regenerate a more stable manifest for comparison purposes. + +### 5. Test your process + +When the Release Manager is ready to cut a new release, they will manually open a PR from `qa` into `main` from their git provider (e.g. GitHub, GitLab, Azure DevOps). dbt Cloud will detect the new PR, at which point the existing check in the CI environment will trigger and run. When using the [baseline configuration](/guides/set-up-ci), it's possible to kick off the PR creation from inside of the dbt Cloud IDE. Under this paradigm, that button will create PRs targeting your QA branch instead. + +To test your new flow, create a new branch in the dbt Cloud IDE then add a new file or modify an existing one. 
Commit it, then create a new Pull Request (not a draft) against your `qa` branch. You'll see the integration tests begin to run. Once they complete, manually create a PR against `main`, and within a few seconds you’ll see the tests run again but this time incorporating all changes from all code that hasn't been merged to main yet. diff --git a/website/docs/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project.md b/website/docs/guides/set-up-your-databricks-dbt-project.md similarity index 81% rename from website/docs/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project.md rename to website/docs/guides/set-up-your-databricks-dbt-project.md index b0be39a4273..c17c6a1f99e 100644 --- a/website/docs/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project.md +++ b/website/docs/guides/set-up-your-databricks-dbt-project.md @@ -1,5 +1,18 @@ -# How to set up your Databricks and dbt project - +--- +title: Set up your dbt project with Databricks +id: set-up-your-databricks-dbt-project +description: "Learn more about setting up your dbt project with Databricks." +displayText: Setting up your dbt project with Databricks +hoverSnippet: Learn how to set up your dbt project with Databricks. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'databricks' +hide_table_of_contents: true +tags: ['Databricks', 'dbt Core','dbt Cloud'] +level: 'Intermediate' +recently_updated: true +--- + +## Introduction Databricks and dbt Labs are partnering to help data teams think like software engineering teams and ship trusted data, faster. The dbt-databricks adapter enables dbt users to leverage the latest Databricks features in their dbt project. Hundreds of customers are now using dbt and Databricks to build expressive and reliable data pipelines on the Lakehouse, generating data assets that enable analytics, ML, and AI use cases throughout the business. @@ -7,7 +20,7 @@ In this guide, we discuss how to set up your dbt project on the Databricks Lakeh ## Configuring the Databricks Environments -To get started, we will use Databricks’s Unity Catalog. Without it, we would not be able to design separate [environments](https://docs.getdbt.com/docs/collaborate/environments) for development and production per our [best practices](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview). It also allows us to ensure the proper access controls have been applied using SQL. You will need to be using the dbt-databricks adapter to use it (as opposed to the dbt-spark adapter). +To get started, we will use Databricks’s Unity Catalog. Without it, we would not be able to design separate [environments](https://docs.getdbt.com/docs/collaborate/environments) for development and production per our [best practices](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview). It also allows us to ensure the proper access controls have been applied using SQL. You will need to be using the dbt-databricks adapter to use it (as opposed to the dbt-spark adapter). We will set up two different *catalogs* in Unity Catalog: **dev** and **prod**. A catalog is a top-level container for *schemas* (previously known as databases in Databricks), which in turn contain tables and views. 
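If you are creating these catalogs yourself, the setup is a few SQL statements. This is only a sketch: the catalog names follow the convention above, but `dbt_developers` is a placeholder group and your account admin may prefer to manage grants differently:

```sql
-- create the two top-level containers
CREATE CATALOG IF NOT EXISTS dev;
CREATE CATALOG IF NOT EXISTS prod;

-- developers can create schemas and objects in dev
GRANT USE CATALOG, CREATE SCHEMA ON CATALOG dev TO `dbt_developers`;

-- developers can see the prod catalog but need further grants to read or write;
-- the production service principal gets its own grants separately
GRANT USE CATALOG ON CATALOG prod TO `dbt_developers`;
```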
@@ -33,7 +46,7 @@ Service principals are used to remove humans from deploying to production for co
[Let’s create a service principal](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#add-a-service-principal-to-your-databricks-account) in Databricks:
1. Have your Databricks Account admin [add a service principal](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#add-a-service-principal-to-your-databricks-account) to your account. The service principal’s name should differentiate itself from a user ID and make its purpose clear (eg dbt_prod_sp).
-2. Add the service principal added to any groups it needs to be a member of at this time. There are more details on permissions in our ["Unity Catalog best practices" guide](dbt-unity-catalog-best-practices).
+2. Add the service principal to any groups it needs to be a member of at this time. There are more details on permissions in our ["Unity Catalog best practices" guide](/best-practices/dbt-unity-catalog-best-practices).
3. [Add the service principal to your workspace](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#add-a-service-principal-to-a-workspace) and apply any [necessary entitlements](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#add-a-service-principal-to-a-workspace-using-the-admin-console), such as Databricks SQL access and Workspace access.
## Setting up Databricks Compute
@@ -55,13 +68,13 @@ We are not covering python in this post but if you want to learn more, check out
Now that the Databricks components are in place, we can configure our dbt project. This involves connecting dbt to our Databricks SQL warehouse to run SQL queries and using a version control system like GitHub to store our transformation code.
-If you are migrating an existing dbt project from the dbt-spark adapter to dbt-databricks, follow this [migration guide](https://docs.getdbt.com/guides/migration/tools/migrating-from-spark-to-databricks#migration) to switch adapters without needing to update developer credentials and other existing configs.
+If you are migrating an existing dbt project from the dbt-spark adapter to dbt-databricks, follow this [migration guide](/guides/migrate-from-spark-to-databricks) to switch adapters without needing to update developer credentials and other existing configs.
-If you’re starting a new dbt project, follow the steps below. For a more detailed setup flow, check out our [quickstart guide.](/quickstarts/databricks)
+If you’re starting a new dbt project, follow the steps below. For a more detailed setup flow, check out our [quickstart guide](/guides/databricks).
### Connect dbt to Databricks
-First, you’ll need to connect your dbt project to Databricks so it can send transformation instructions and build objects in Unity Catalog. Follow the instructions for [dbt Cloud](/quickstarts/databricks?step=4) or [Core](https://docs.getdbt.com/reference/warehouse-setups/databricks-setup) to configure your project’s connection credentials.
+First, you’ll need to connect your dbt project to Databricks so it can send transformation instructions and build objects in Unity Catalog. Follow the instructions for [dbt Cloud](/guides/databricks?step=4) or [Core](https://docs.getdbt.com/reference/warehouse-setups/databricks-setup) to configure your project’s connection credentials.
Each developer must generate their Databricks PAT and use the token in their development credentials.
They will also specify a unique developer schema that will store the tables and views generated by dbt runs executed from their IDE. This provides isolated developer environments and ensures data access is fit for purpose. @@ -80,11 +93,11 @@ For your development credentials/profiles.yml: During your first invocation of `dbt run`, dbt will create the developer schema if it doesn't already exist in the dev catalog. -### Defining your dbt deployment environment +## Defining your dbt deployment environment -Last, we need to give dbt a way to deploy code outside of development environments. To do so, we’ll use dbt [environments](https://docs.getdbt.com/docs/collaborate/environments) to define the production targets that end users will interact with. +We need to give dbt a way to deploy code outside of development environments. To do so, we’ll use dbt [environments](https://docs.getdbt.com/docs/collaborate/environments) to define the production targets that end users will interact with. -Core projects can use [targets in profiles](https://docs.getdbt.com/docs/core/connection-profiles#understanding-targets-in-profiles) to separate environments. [dbt Cloud environments](https://docs.getdbt.com/docs/cloud/develop-in-the-cloud#set-up-and-access-the-cloud-ide) allow you to define environments via the UI and [schedule jobs](/quickstarts/databricks#create-and-run-a-job) for specific environments. +Core projects can use [targets in profiles](https://docs.getdbt.com/docs/core/connection-profiles#understanding-targets-in-profiles) to separate environments. [dbt Cloud environments](https://docs.getdbt.com/docs/cloud/develop-in-the-cloud#set-up-and-access-the-cloud-ide) allow you to define environments via the UI and [schedule jobs](/guides/databricks#create-and-run-a-job) for specific environments. Let’s set up our deployment environment: @@ -94,10 +107,10 @@ Let’s set up our deployment environment: 4. Set the schema to the default for your prod environment. This can be overridden by [custom schemas](https://docs.getdbt.com/docs/build/custom-schemas#what-is-a-custom-schema) if you need to use more than one. 5. Provide your Service Principal token. -### Connect dbt to your git repository +## Connect dbt to your git repository -Next, you’ll need somewhere to store and version control your code that allows you to collaborate with teammates. Connect your dbt project to a git repository with [dbt Cloud](/quickstarts/databricks#set-up-a-dbt-cloud-managed-repository). [Core](/quickstarts/manual-install#create-a-repository) projects will use the git CLI. +Next, you’ll need somewhere to store and version control your code that allows you to collaborate with teammates. Connect your dbt project to a git repository with [dbt Cloud](/guides/databricks#set-up-a-dbt-cloud-managed-repository). [Core](/guides/manual-install#create-a-repository) projects will use the git CLI. -## Next steps +### Next steps -Now that your project is configured, you can start transforming your Databricks data with dbt. To help you scale efficiently, we recommend you follow our best practices, starting with the ["Unity Catalog best practices" guide](dbt-unity-catalog-best-practices). +Now that your project is configured, you can start transforming your Databricks data with dbt. To help you scale efficiently, we recommend you follow our best practices, starting with the [Unity Catalog best practices](/best-practices/dbt-unity-catalog-best-practices), then you can [Optimize dbt models on Databricks](/guides/optimize-dbt-models-on-databricks). 
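For dbt Core users following this setup, the development target described above ends up looking roughly like the sketch below. The project name, host, HTTP path, and schema values are placeholders, and dbt Cloud users enter the same details through the UI rather than a `profiles.yml` file:

```yaml
# ~/.dbt/profiles.yml (sketch with placeholder values)
my_databricks_project:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: dev                       # the dev catalog set up in Unity Catalog
      schema: dbt_your_name              # your personal developer schema
      host: your-workspace.cloud.databricks.com
      http_path: /sql/1.0/warehouses/your_warehouse_id
      token: "{{ env_var('DATABRICKS_TOKEN') }}"  # your personal access token
      threads: 4
```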
diff --git a/website/docs/guides/migration/sl-migration.md b/website/docs/guides/sl-migration.md similarity index 93% rename from website/docs/guides/migration/sl-migration.md rename to website/docs/guides/sl-migration.md index 56cd6dc9d80..0cfde742af2 100644 --- a/website/docs/guides/migration/sl-migration.md +++ b/website/docs/guides/sl-migration.md @@ -1,14 +1,22 @@ --- title: "Legacy dbt Semantic Layer migration guide" -sidebar_label: "Legacy dbt Semantic Layer migration" +id: "sl-migration" description: "Learn how to migrate from the legacy dbt Semantic Layer to the latest one." -tags: [Semantic Layer] +hoverSnippet: Migrate from the legacy dbt Semantic Layer to the latest one. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Semantic Layer','Migration'] +level: 'Intermediate' +recently_updated: true --- +## Introduction + The legacy Semantic Layer will be deprecated in H2 2023. Additionally, the `dbt_metrics` package will not be supported in dbt v1.6 and later. If you are using `dbt_metrics`, you'll need to upgrade your configurations before upgrading to v1.6. This guide is for people who have the legacy dbt Semantic Layer setup and would like to migrate to the new dbt Semantic Layer. The estimated migration time is two weeks. -## Step 1: Migrate metric configs to the new spec +## Migrate metric configs to the new spec The metrics specification in dbt Core is changed in v1.6 to support the integration of MetricFlow. It's strongly recommended that you refer to [Build your metrics](/docs/build/build-metrics-intro) and before getting started so you understand the core concepts of the Semantic Layer. @@ -35,7 +43,7 @@ dbt Labs recommends completing these steps in a local dev environment (such as t **To make this process easier, dbt Labs provides a [custom migration tool](https://github.com/dbt-labs/dbt-converter) that automates these steps for you. You can find installation instructions in the [README](https://github.com/dbt-labs/dbt-converter/blob/master/README.md). Derived metrics aren’t supported in the migration tool, and will have to be migrated manually.* -## Step 2: Audit metric values after the migration +## Audit metric values after the migration You might need to audit metric values during the migration to ensure that the historical values of key business metrics are the same. @@ -58,7 +66,7 @@ You might need to audit metric values during the migration to ensure that the hi 1. Run the [dbt-audit](https://github.com/dbt-labs/dbt-audit-helper) helper on both models to compare the metric values. -## Step 3: Setup the Semantic Layer in a new environment +## Setup the Semantic Layer in a new environment This step is only relevant to users who want the legacy and new semantic layer to run in parallel for a short time. This will let you recreate content in downstream tools like Hex and Mode with minimal downtime. If you do not need to recreate assets in these tools skip to step 5. @@ -79,7 +87,7 @@ This step is only relevant to users who want the legacy and new semantic layer t At this point, both the new semantic layer and the old semantic layer will be running. The new semantic layer will be pointing at your migration branch with the updated metrics definitions. -## Step 4: Update connection in downstream integrations +## Update connection in downstream integrations Now that your Semantic Layer is set up, you will need to update any downstream integrations that used the legacy Semantic Layer. 
@@ -105,7 +113,7 @@ To learn more about integrating with Hex, check out their [documentation](https: 3. For specific SQL syntax details, refer to [Querying the API for metric metadata](/docs/dbt-cloud-apis/sl-jdbc#querying-the-api-for-metric-metadata) to query metrics using the API. -## Step 5: Merge your metrics migration branch to main, and upgrade your production environment to 1.6. +## Merge your metrics migration branch to main, and upgrade your production environment to 1.6. 1. Upgrade your production environment to 1.6 or higher. * **Note** — The old metrics definitions are no longer valid so your dbt jobs will not pass. @@ -118,7 +126,7 @@ If you created a new environment in [Step 3](#step-3-setup-the-semantic-layer-in 4. Delete your migration environment. Be sure to update your connection details in any downstream tools to account for the environment change. -## Related docs +### Related docs - [MetricFlow quickstart guide](/docs/build/sl-getting-started) - [Example dbt project](https://github.com/dbt-labs/jaffle-sl-template) diff --git a/website/docs/guides/dbt-ecosystem/sl-partner-integration-guide.md b/website/docs/guides/sl-partner-integration-guide.md similarity index 96% rename from website/docs/guides/dbt-ecosystem/sl-partner-integration-guide.md rename to website/docs/guides/sl-partner-integration-guide.md index 936a54465e8..04f58f525bd 100644 --- a/website/docs/guides/dbt-ecosystem/sl-partner-integration-guide.md +++ b/website/docs/guides/sl-partner-integration-guide.md @@ -1,17 +1,26 @@ --- -title: "dbt Semantic Layer integration best practices" +title: "Integrate with dbt Semantic Layer using best practices" id: "sl-partner-integration-guide" description: Learn about partner integration guidelines, roadmap, and connectivity. +hoverSnippet: Learn how to integrate with the Semantic Layer using best practices +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Semantic Layer','Best practices'] +level: 'Advanced' +recently_updated: true --- -To fit your tool within the world of the Semantic Layer, dbt Labs offers some best practice recommendations for how to expose metrics and allow users to interact with them seamlessly. +## Introduction + +To fit your tool within the world of the Semantic Layer, dbt Labs offers some best practice recommendations for how to expose metrics and allow users to interact with them seamlessly. :::note This is an evolving guide that is meant to provide recommendations based on our experience. If you have any feedback, we'd love to hear it! ::: -## Requirements +### Prerequisites To build a dbt Semantic Layer integration: @@ -37,7 +46,7 @@ When building an integration, we recommend you expose certain metadata in the re - The version of dbt they are on. -## Best practices on exposing metrics +## Use best practices when exposing metrics Best practices for exposing metrics are summarized into five themes: @@ -121,7 +130,7 @@ For transparency and additional context, we recommend you have an easy way for t In the cases where our APIs support either a string or a filter list for the `where` clause, we always recommend that your application utilizes the filter list in order to gain maximum pushdown benefits. The `where` string may be more intuitive for users writing queries during testing, but it will not have the performance benefits of the filter list in a production environment. 
-## Example stages of an integration +## Understand stages of an integration These are recommendations on how to evolve a Semantic Layer integration and not a strict runbook. @@ -149,7 +158,7 @@ These are recommendations on how to evolve a Semantic Layer integration and not * Suggest metrics to users based on teams/identity, and so on. -## Related docs +### Related docs - [Use the dbt Semantic Layer](/docs/use-dbt-semantic-layer/dbt-sl) to learn about the product. - [Build your metrics](/docs/build/build-metrics-intro) for more info about MetricFlow and its components. diff --git a/website/docs/quickstarts/snowflake-qs.md b/website/docs/guides/snowflake-qs.md similarity index 99% rename from website/docs/quickstarts/snowflake-qs.md rename to website/docs/guides/snowflake-qs.md index 33e253e8c15..abb18276b97 100644 --- a/website/docs/quickstarts/snowflake-qs.md +++ b/website/docs/guides/snowflake-qs.md @@ -1,8 +1,9 @@ --- title: "Quickstart for dbt Cloud and Snowflake" id: "snowflake" -platform: 'dbt-cloud' +level: 'Beginner' icon: 'snowflake' +tags: ['dbt Cloud','Quickstart','Snowflake'] hide_table_of_contents: true --- ## Introduction diff --git a/website/docs/quickstarts/starburst-galaxy-qs.md b/website/docs/guides/starburst-galaxy-qs.md similarity index 99% rename from website/docs/quickstarts/starburst-galaxy-qs.md rename to website/docs/guides/starburst-galaxy-qs.md index 33228710509..1822c83fa90 100644 --- a/website/docs/quickstarts/starburst-galaxy-qs.md +++ b/website/docs/guides/starburst-galaxy-qs.md @@ -1,9 +1,10 @@ --- title: "Quickstart for dbt Cloud and Starburst Galaxy" id: "starburst-galaxy" -platform: 'dbt-cloud' +level: 'Beginner' icon: 'starburst' hide_table_of_contents: true +tags: ['dbt Cloud','Quickstart'] --- ## Introduction diff --git a/website/docs/guides/advanced/using-jinja.md b/website/docs/guides/using-jinja.md similarity index 97% rename from website/docs/guides/advanced/using-jinja.md rename to website/docs/guides/using-jinja.md index 1cbe88dc9ca..9f098bb637f 100644 --- a/website/docs/guides/advanced/using-jinja.md +++ b/website/docs/guides/using-jinja.md @@ -1,8 +1,18 @@ --- -title: "Using Jinja" +title: "Use Jinja to improve your SQL code" id: "using-jinja" +description: "Learn how to improve your SQL code using Jinja." +hoverSnippet: "Improve your SQL code with Jinja" +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Jinja', 'dbt Core'] +level: 'Advanced' +recently_updated: true --- +## Introduction + In this guide, we're going to take a common pattern used in SQL, and then use Jinja to improve our code. If you'd like to work through this query, add [this CSV](https://github.com/dbt-labs/jaffle_shop/blob/core-v1.0.0/seeds/raw_payments.csv) to the `seeds/` folder of your dbt project, and then execute `dbt seed`. 
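As a preview of where this ends up, the sketch below shows the general shape of the refactored query: a Jinja loop that pivots payment methods into columns. It assumes the `raw_payments` seed has `order_id`, `payment_method`, and `amount` columns; the guide builds up to (and refines) this pattern step by step:

```sql
-- a rough sketch of the Jinja-driven pivot this guide works toward
{% set payment_methods = ['bank_transfer', 'credit_card', 'gift_card'] %}

select
    order_id,
    {% for payment_method in payment_methods %}
    sum(case when payment_method = '{{ payment_method }}' then amount end) as {{ payment_method }}_amount
    {%- if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('raw_payments') }}
group by 1
```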
diff --git a/website/docs/guides/orchestration/webhooks/zapier-ms-teams.md b/website/docs/guides/zapier-ms-teams.md similarity index 92% rename from website/docs/guides/orchestration/webhooks/zapier-ms-teams.md rename to website/docs/guides/zapier-ms-teams.md index 148e16b2469..66596d590e0 100644 --- a/website/docs/guides/orchestration/webhooks/zapier-ms-teams.md +++ b/website/docs/guides/zapier-ms-teams.md @@ -1,9 +1,16 @@ --- title: "Post to Microsoft Teams when a job finishes" -id: webhooks-guide-zapier-ms-teams -slug: zapier-ms-teams -description: Use Zapier and the dbt Cloud API to post to Microsoft Teams +id: zapier-ms-teams +description: Use Zapier and dbt Cloud webhooks to post to Microsoft Teams when a job finishes running. +hoverSnippet: Learn how to use Zapier with dbt Cloud webhooks to post in Microsoft Teams when a job finishes running. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Webhooks'] +level: 'Advanced' +recently_updated: true --- +## Introduction This guide will show you how to set up an integration between dbt Cloud jobs and Microsoft Teams using [dbt Cloud Webhooks](/docs/deploy/webhooks) and Zapier, similar to the [native Slack integration](/docs/deploy/job-notifications#slack-notifications). @@ -14,19 +21,20 @@ When a dbt Cloud job finishes running, the integration will: - Post a summary to a Microsoft Teams channel. ![Screenshot of a message in MS Teams showing a summary of a dbt Cloud run which failed](/img/guides/orchestration/webhooks/zapier-ms-teams/ms-teams-ui.png) -## Prerequisites + +### Prerequisites In order to set up the integration, you should have familiarity with: - [dbt Cloud Webhooks](/docs/deploy/webhooks) - Zapier -## Integration steps -### 1. Set up the connection between Zapier and Microsoft Teams + +## Set up the connection between Zapier and Microsoft Teams * Install the [Zapier app in Microsoft Teams](https://appsource.microsoft.com/en-us/product/office/WA200002044) and [grant Zapier access to your account](https://zapier.com/blog/how-to-automate-microsoft-teams/). **Note**: To receive the message, add the Zapier app to the team's channel during installation. -### 2. Create a new Zap in Zapier +## Create a new Zap in Zapier Use **Webhooks by Zapier** as the Trigger, and **Catch Raw Hook** as the Event. If you don't intend to [validate the authenticity of your webhook](/docs/deploy/webhooks#validate-a-webhook) (not recommended!) then you can choose **Catch Hook** instead. Press **Continue**, then copy the webhook URL. @@ -34,6 +42,7 @@ Press **Continue**, then copy the webhook URL. ![Screenshot of the Zapier UI, showing the webhook URL ready to be copied](/img/guides/orchestration/webhooks/zapier-common/catch-raw-hook.png) ### 3. Configure a new webhook in dbt Cloud + See [Create a webhook subscription](/docs/deploy/webhooks#create-a-webhook-subscription) for full instructions. Choose either **Run completed** or **Run errored**, but not both, or you'll get double messages when a run fails. Make note of the Webhook Secret Key for later. @@ -42,14 +51,15 @@ Once you've tested the endpoint in dbt Cloud, go back to Zapier and click **Test The sample body's values are hard-coded and not reflective of your project, but they give Zapier a correctly-shaped object during development. -### 4. 
Store secrets +## Store secrets + In the next step, you will need the Webhook Secret Key from the prior step, and a dbt Cloud [user token](https://docs.getdbt.com/docs/dbt-cloud-apis/user-tokens) or [service account token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens). Zapier allows you to [store secrets](https://help.zapier.com/hc/en-us/articles/8496293271053-Save-and-retrieve-data-from-Zaps), which prevents your keys from being displayed in plaintext in the Zap code. You will be able to access them via the [StoreClient utility](https://help.zapier.com/hc/en-us/articles/8496293969549-Store-data-from-code-steps-with-StoreClient). -### 5. Add a code action +## Add a code action Select **Code by Zapier** as the App, and **Run Python** as the Event. In the **Set up action** area, add two items to **Input Data**: `raw_body` and `auth_header`. Map those to the `1. Raw Body` and `1. Headers Http Authorization` fields from the **Catch Raw Hook** step above. @@ -141,19 +151,21 @@ for step in run_data_results['run_steps']: output = {'outcome_message': outcome_message} ``` -### 6. Add the Microsoft Teams action +## Add the Microsoft Teams action + Select **Microsoft Teams** as the App, and **Send Channel Message** as the Action. In the **Set up action** area, choose the team and channel. Set the **Message Text Format** to **markdown**, then put **2. Outcome Message** from the Run Python in Code by Zapier output into the **Message Text** field. ![Screenshot of the Zapier UI, showing the mappings of prior steps to an MS Teams message](/img/guides/orchestration/webhooks/zapier-ms-teams/ms-teams-zap-config.png) -### 7. Test and deploy +## Test and deploy + As you have gone through each step, you should have tested the outputs, so you can now try posting a message into your Teams channel. When you're happy with it, remember to ensure that your `run_id` and `account_id` are no longer hardcoded, then publish your Zap. -## Other notes +### Other notes - If you post to a chat instead of a team channel, you don't need to add the Zapier app to Microsoft Teams. - If you post to a chat instead of a team channel, note that markdown is not supported and you will need to remove the markdown formatting. - If you chose the **Catch Hook** trigger instead of **Catch Raw Hook**, you will need to pass each required property from the webhook as an input instead of running `json.loads()` against the raw body. You will also need to remove the validation code. diff --git a/website/docs/guides/orchestration/webhooks/zapier-new-cloud-job.md b/website/docs/guides/zapier-new-cloud-job.md similarity index 89% rename from website/docs/guides/orchestration/webhooks/zapier-new-cloud-job.md rename to website/docs/guides/zapier-new-cloud-job.md index 0764c6c7911..b16fa94bc21 100644 --- a/website/docs/guides/orchestration/webhooks/zapier-new-cloud-job.md +++ b/website/docs/guides/zapier-new-cloud-job.md @@ -1,28 +1,34 @@ --- title: "Trigger a dbt Cloud job after a run finishes" -id: webhooks-guide-zapier-new-cloud-job -slug: zapier-new-cloud-job -description: Use Zapier to interact with the dbt Cloud API +id: zapier-new-cloud-job +description: Use Zapier to trigger a dbt Cloud job once a run completes. +hoverSnippet: Learn how to use Zapier to trigger a dbt Cloud job once a run completes. 
+# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Webhooks'] +level: 'Advanced' +recently_updated: true --- +## Introduction + This guide will show you how to trigger a dbt Cloud job based on the successful completion of a different job. This can be useful when you need to trigger a job in a different project. Remember that dbt works best when it understands the whole context of the it has been asked to run, so use this ability judiciously. -## Prerequisites +### Prerequisites In order to set up the integration, you should have familiarity with: - [dbt Cloud Webhooks](/docs/deploy/webhooks) - Zapier -## Integration steps - -### 1. Create a new Zap in Zapier +## Create a new Zap in Zapier Use **Webhooks by Zapier** as the Trigger, and **Catch Raw Hook** as the Event. If you don't intend to [validate the authenticity of your webhook](/docs/deploy/webhooks#validate-a-webhook) (not recommended!) then you can choose **Catch Hook** instead. Press **Continue**, then copy the webhook URL. ![Screenshot of the Zapier UI, showing the webhook URL ready to be copied](/img/guides/orchestration/webhooks/zapier-common/catch-raw-hook.png) -### 2. Configure a new webhook in dbt Cloud +## Configure a new webhook in dbt Cloud See [Create a webhook subscription](/docs/deploy/webhooks#create-a-webhook-subscription) for full instructions. Your event should be **Run completed**, and you need to change the **Jobs** list to only contain the job you want to trigger the next run. Make note of the Webhook Secret Key for later. @@ -31,14 +37,14 @@ Once you've tested the endpoint in dbt Cloud, go back to Zapier and click **Test The sample body's values are hard-coded and not reflective of your project, but they give Zapier a correctly-shaped object during development. -### 3. Store secrets +## Store secrets In the next step, you will need the Webhook Secret Key from the prior step, and a dbt Cloud [user token](https://docs.getdbt.com/docs/dbt-cloud-apis/user-tokens) or [service account token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens). Zapier allows you to [store secrets](https://help.zapier.com/hc/en-us/articles/8496293271053-Save-and-retrieve-data-from-Zaps), which prevents your keys from being displayed in plaintext in the Zap code. You will be able to access them via the [StoreClient utility](https://help.zapier.com/hc/en-us/articles/8496293969549-Store-data-from-code-steps-with-StoreClient). -### 4. Add a code action +## Add a code action Select **Code by Zapier** as the App, and **Run Python** as the Event. In the **Set up action** area, add two items to **Input Data**: `raw_body` and `auth_header`. Map those to the `1. Raw Body` and `1. Headers Http Authorization` fields from the **Catch Raw Hook** step above. @@ -87,5 +93,6 @@ if hook_data['runStatus'] == "Success": return ``` -### 5. Test and deploy +## Test and deploy + When you're happy with it, remember to ensure that your `account_id` is no longer hardcoded, then publish your Zap. 
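To make the shape of that Code by Zapier step concrete, here is a minimal sketch of the Run Python action for this guide, not the guide's exact code. It assumes you stored your secrets under the names `DBT_WEBHOOK_KEY` and `DBT_CLOUD_SERVICE_TOKEN`, the `account_id` and `target_job_id` values are hypothetical placeholders, and `StoreClient` and `input_data` are provided by Zapier's Code step environment.

```python
import hashlib
import hmac
import json

import requests

# StoreClient and input_data are injected by Zapier's Python Code step.
secret_store = StoreClient('abc123')  # replace with your Storage by Zapier UUID secret
hook_secret = secret_store.get('DBT_WEBHOOK_KEY')        # assumed secret name
api_token = secret_store.get('DBT_CLOUD_SERVICE_TOKEN')  # assumed secret name

# Validate the webhook by comparing an HMAC-SHA256 of the raw body against the Authorization header.
signature = hmac.new(hook_secret.encode('utf-8'), input_data['raw_body'].encode('utf-8'), hashlib.sha256)
if signature.hexdigest() != input_data['auth_header']:
    raise Exception("Calculated signature doesn't match contents of the Authorization header. This webhook may not have been sent from dbt Cloud.")

hook_data = json.loads(input_data['raw_body'])['data']

if hook_data['runStatus'] == "Success":
    account_id = 12345     # placeholder -- don't leave this hardcoded when you publish the Zap
    target_job_id = 67890  # placeholder -- the job you want to trigger next
    # Trigger the downstream job over the dbt Cloud Administrative API.
    # Swap cloud.getdbt.com for your region's access URL if needed.
    response = requests.post(
        f'https://cloud.getdbt.com/api/v2/accounts/{account_id}/jobs/{target_job_id}/run/',
        headers={'Authorization': f'Token {api_token}'},
        json={'cause': 'Triggered by Zapier after an upstream dbt Cloud run completed'},
    )
    response.raise_for_status()

output = {'job_triggered': hook_data['runStatus'] == "Success"}
```

Validating the signature before parsing the payload means a forged request fails fast, and keeping the IDs in named variables makes it easier to spot anything still hardcoded before you publish.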
diff --git a/website/docs/guides/orchestration/webhooks/zapier-refresh-mode-report.md b/website/docs/guides/zapier-refresh-mode-report.md similarity index 90% rename from website/docs/guides/orchestration/webhooks/zapier-refresh-mode-report.md rename to website/docs/guides/zapier-refresh-mode-report.md index f682baae8e2..5bab165b11d 100644 --- a/website/docs/guides/orchestration/webhooks/zapier-refresh-mode-report.md +++ b/website/docs/guides/zapier-refresh-mode-report.md @@ -1,10 +1,18 @@ --- title: "Refresh a Mode dashboard when a job completes" -id: webhooks-guide-zapier-refresh-mode-report -slug: zapier-refresh-mode-report -description: Use Zapier to trigger a Mode dashboard refresh +id: zapier-refresh-mode-report +description: Use Zapier to trigger a Mode dashboard refresh when a dbt Cloud job completes. +hoverSnippet: Learn how to use Zapier to trigger a Mode dashboard refresh when a dbt Cloud job completes. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Webhooks'] +level: 'Advanced' +recently_updated: true --- +## Introduction + This guide will teach you how to refresh a Mode dashboard when a dbt Cloud job has completed successfully and there is fresh data available. The integration will: - Receive a webhook notification in Zapier @@ -12,23 +20,21 @@ This guide will teach you how to refresh a Mode dashboard when a dbt Cloud job h Although we are using the Mode API for a concrete example, the principles are readily transferrable to your [tool](https://learn.hex.tech/docs/develop-logic/hex-api/api-reference#operation/RunProject) [of](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/refresh-dataset) [choice](https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_ref.htm#update_workbook_now). -## Prerequisites +### Prerequisites In order to set up the integration, you should have familiarity with: - [dbt Cloud Webhooks](/docs/deploy/webhooks) - Zapier - The [Mode API](https://mode.com/developer/api-reference/introduction/) -## Integration steps - -### 1. Create a new Zap in Zapier +## Create a new Zap in Zapier Use **Webhooks by Zapier** as the Trigger, and **Catch Raw Hook** as the Event. If you don't intend to [validate the authenticity of your webhook](/docs/deploy/webhooks#validate-a-webhook) (not recommended!) then you can choose **Catch Hook** instead. Press **Continue**, then copy the webhook URL. ![Screenshot of the Zapier UI, showing the webhook URL ready to be copied](/img/guides/orchestration/webhooks/zapier-common/catch-raw-hook.png) -### 2. Configure a new webhook in dbt Cloud +## Configure a new webhook in dbt Cloud See [Create a webhook subscription](/docs/deploy/webhooks#create-a-webhook-subscription) for full instructions. Your event should be **Run completed**, and you need to change the **Jobs** list to only contain any jobs whose completion should trigger a report refresh. Make note of the Webhook Secret Key for later. @@ -37,20 +43,19 @@ Once you've tested the endpoint in dbt Cloud, go back to Zapier and click **Test The sample body's values are hard-coded and not reflective of your project, but they give Zapier a correctly-shaped object during development. -### 3. 
Store secrets +## Store secrets In the next step, you will need the Webhook Secret Key from the prior step, and a dbt Cloud [user token](https://docs.getdbt.com/docs/dbt-cloud-apis/user-tokens) or [service account token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens), as well as a [Mode API token and secret](https://mode.com/developer/api-reference/authentication/). Zapier allows you to [store secrets](https://help.zapier.com/hc/en-us/articles/8496293271053-Save-and-retrieve-data-from-Zaps), which prevents your keys from being displayed in plaintext in the Zap code. You will be able to access them via the [StoreClient utility](https://help.zapier.com/hc/en-us/articles/8496293969549-Store-data-from-code-steps-with-StoreClient). - This guide assumes the names for the secret keys are: `DBT_WEBHOOK_KEY`, `MODE_API_TOKEN`, and `MODE_API_SECRET`. If you are using different names, make sure you update all references to them in the sample code. This guide uses a short-lived code action to store the secrets, but you can also use a tool like Postman to interact with the [REST API](https://store.zapier.com/) or create a separate Zap and call the [Set Value Action](https://help.zapier.com/hc/en-us/articles/8496293271053-Save-and-retrieve-data-from-Zaps#3-set-a-value-in-your-store-0-3). -#### a. Create a Storage by Zapier connection +### a. Create a Storage by Zapier connection If you haven't already got one, go to and create a new connection. Remember the UUID secret you generate for later. -#### b. Add a temporary code step +### b. Add a temporary code step Choose **Run Python** as the Event. Run the following code: ```python store = StoreClient('abc123') #replace with your UUID secret @@ -60,7 +65,7 @@ store.set('MODE_API_SECRET', 'abc123') #replace with your Mode API Secret ``` Test the step. You can delete this Action when the test succeeds. The key will remain stored as long as it is accessed at least once every three months. -### 4. Add a code action +## Add a code action Select **Code by Zapier** as the App, and **Run Python** as the Event. In the **Set up action** area, add two items to **Input Data**: `raw_body` and `auth_header`. Map those to the `1. Raw Body` and `1. Headers Http Authorization` fields from the **Catch Raw Hook** step above. @@ -124,5 +129,5 @@ if hook_data['runStatus'] == "Success": return ``` -### 5. Test and deploy -You can iterate on the Code step by modifying the code and then running the test again. When you're happy with it, you can publish your Zap. \ No newline at end of file +## Test and deploy +You can iterate on the Code step by modifying the code and then running the test again. When you're happy with it, you can publish your Zap. 
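As a rough sketch of how the pieces of this guide fit together, the Run Python step can validate the webhook, check the run status, and then call Mode. The secret names match the ones this guide assumes; the `workspace` and `report_token` values are placeholders, and the Mode endpoint and Basic auth scheme shown here should be confirmed against the Mode API reference linked above. `StoreClient` and `input_data` are provided by Zapier's Code step environment.

```python
import hashlib
import hmac
import json

import requests

# StoreClient and input_data are injected by Zapier's Python Code step.
secret_store = StoreClient('abc123')  # replace with your Storage by Zapier UUID secret
hook_secret = secret_store.get('DBT_WEBHOOK_KEY')
mode_api_token = secret_store.get('MODE_API_TOKEN')
mode_api_secret = secret_store.get('MODE_API_SECRET')

# Validate that the webhook genuinely came from dbt Cloud.
signature = hmac.new(hook_secret.encode('utf-8'), input_data['raw_body'].encode('utf-8'), hashlib.sha256)
if signature.hexdigest() != input_data['auth_header']:
    raise Exception("Calculated signature doesn't match contents of the Authorization header. This webhook may not have been sent from dbt Cloud.")

hook_data = json.loads(input_data['raw_body'])['data']

run_triggered = False
if hook_data['runStatus'] == "Success":
    workspace = 'your_mode_workspace'  # placeholder -- your Mode workspace name
    report_token = 'abc123def456'      # placeholder -- the token from the report's URL
    # Mode's API uses HTTP Basic auth with your API token and secret; verify the
    # run-report endpoint against Mode's API reference before relying on it.
    refresh = requests.post(
        f'https://app.mode.com/api/{workspace}/reports/{report_token}/runs',
        auth=(mode_api_token, mode_api_secret),
        json={},
    )
    refresh.raise_for_status()
    run_triggered = True

output = {'report_refresh_triggered': run_triggered}
```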
diff --git a/website/docs/guides/orchestration/webhooks/zapier-refresh-tableau-workbook.md b/website/docs/guides/zapier-refresh-tableau-workbook.md similarity index 92% rename from website/docs/guides/orchestration/webhooks/zapier-refresh-tableau-workbook.md rename to website/docs/guides/zapier-refresh-tableau-workbook.md index 52a9ae63523..f614b64eaa2 100644 --- a/website/docs/guides/orchestration/webhooks/zapier-refresh-tableau-workbook.md +++ b/website/docs/guides/zapier-refresh-tableau-workbook.md @@ -1,16 +1,24 @@ --- title: "Refresh Tableau workbook with extracts after a job finishes" -id: webhooks-guide-zapier-refresh-tableau-workbook -slug: zapier-refresh-tableau-workbook -description: Use Zapier to trigger a Tableau workbook refresh +id: zapier-refresh-tableau-workbook +description: Use Zapier to trigger a Tableau workbook refresh once a dbt Cloud job completes successfully. +hoverSnippet: Learn how to use Zapier to trigger a Tableau workbook refresh once a dbt Cloud job completes successfully. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Webhooks'] +level: 'Advanced' +recently_updated: true --- +## Introduction + This guide will teach you how to refresh a Tableau workbook that leverages [extracts](https://help.tableau.com/current/pro/desktop/en-us/extracting_data.htm) when a dbt Cloud job has completed successfully and there is fresh data available. The integration will: - Receive a webhook notification in Zapier - Trigger a refresh of a Tableau workbook -## Prerequisites +### Prerequisites To set up the integration, you need to be familiar with: @@ -19,19 +27,18 @@ To set up the integration, you need to be familiar with: - The [Tableau API](https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api.htm) - The [version](https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_concepts_versions.htm#rest_api_versioning) of Tableau's REST API that is compatible with your server -## Integration steps -### 1. Obtain authentication credentials from Tableau +## Obtain authentication credentials from Tableau To authenticate with the Tableau API, obtain a [Personal Access Token](https://help.tableau.com/current/server/en-us/security_personal_access_tokens.htm) from your Tableau Server/Cloud instance. In addition, make sure your Tableau workbook uses data sources that allow refresh access, which is usually set when publishing. -### 2. Create a new Zap in Zapier +## Create a new Zap in Zapier To trigger an action with the delivery of a webhook in Zapier, you'll want to create a new Zap with **Webhooks by Zapier** as the Trigger and **Catch Raw Hook** as the Event. However, if you choose not to [validate the authenticity of your webhook](/docs/deploy/webhooks#validate-a-webhook), which isn't recommended, you can choose **Catch Hook** instead. Press **Continue**, then copy the webhook URL. ![Screenshot of the Zapier UI, showing the webhook URL ready to be copied](/img/guides/orchestration/webhooks/zapier-common/catch-raw-hook.png) -### 3. Configure a new webhook in dbt Cloud +## Configure a new webhook in dbt Cloud To set up a webhook subscription for dbt Cloud, follow the instructions in [Create a webhook subscription](/docs/deploy/webhooks#create-a-webhook-subscription). For the event, choose **Run completed** and modify the **Jobs** list to include only the jobs that should trigger a report refresh. Remember to save the Webhook Secret Key for later. 
Paste in the webhook URL obtained from Zapier in step 2 into the **Endpoint** field and test the endpoint. @@ -40,7 +47,7 @@ Once you've tested the endpoint in dbt Cloud, go back to Zapier and click **Test The sample body's values are hard-coded and not reflective of your project, but they give Zapier a correctly-shaped object during development. -### 4. Store secrets +## Store secrets In the next step, you will need the Webhook Secret Key from the prior step, and your Tableau authentication credentials and details. Specifically, you'll need your Tableau server/site URL, server/site name, PAT name, and PAT secret. Zapier allows you to [store secrets](https://help.zapier.com/hc/en-us/articles/8496293271053-Save-and-retrieve-data-from-Zaps), which prevents your keys from being displayed in plaintext in the Zap code. You will be able to access them via the [StoreClient utility](https://help.zapier.com/hc/en-us/articles/8496293969549-Store-data-from-code-steps-with-StoreClient). @@ -49,11 +56,11 @@ This guide assumes the names for the secret keys are: `DBT_WEBHOOK_KEY`, `TABLEA This guide uses a short-lived code action to store the secrets, but you can also use a tool like Postman to interact with the [REST API](https://store.zapier.com/) or create a separate Zap and call the [Set Value Action](https://help.zapier.com/hc/en-us/articles/8496293271053-Save-and-retrieve-data-from-Zaps#3-set-a-value-in-your-store-0-3). -#### a. Create a Storage by Zapier connection +### a. Create a Storage by Zapier connection Create a new connection at https://zapier.com/app/connections/storage if you don't already have one and remember the UUID secret you generate for later. -#### b. Add a temporary code step +### b. Add a temporary code step Choose **Run Python** as the Event and input the following code: @@ -68,7 +75,7 @@ store.set('TABLEAU_API_TOKEN_SECRET', 'abc123') #replace with your Tableau API S Test the step to run the code. You can delete this action when the test succeeds. The keys will remain stored as long as it is accessed at least once every three months. -### 5. Add a code action +## Add a code action Select **Code by Zapier** as the App, and **Run Python** as the Event. In the **Set up action** area, add two items to **Input Data**: `raw_body` and `auth_header`. Map those to the `1. Raw Body` and `1. Headers Http Authorization` fields from the **Catch Raw Hook** step above. @@ -161,5 +168,5 @@ refresh_trigger = requests.post(refresh_url, data=json.dumps(refresh_data), head return {"message": "Workbook refresh has been queued"} ``` -### 6. Test and deploy +## Test and deploy To make changes to your code, you can modify it and test it again. When you're happy with it, you can publish your Zap. diff --git a/website/docs/guides/orchestration/webhooks/zapier-slack.md b/website/docs/guides/zapier-slack.md similarity index 94% rename from website/docs/guides/orchestration/webhooks/zapier-slack.md rename to website/docs/guides/zapier-slack.md index 6ce89eadd12..61b96658f95 100644 --- a/website/docs/guides/orchestration/webhooks/zapier-slack.md +++ b/website/docs/guides/zapier-slack.md @@ -1,10 +1,18 @@ --- title: "Post to Slack with error context when a job fails" -id: webhooks-guide-zapier-slack -slug: zapier-slack -description: Use Zapier and the dbt Cloud API to post error context to Slack +id: zapier-slack +description: Use a webhook or Slack message to trigger Zapier and post error context in Slack when a job fails. 
+hoverSnippet: Learn how to use a webhook or Slack message to trigger Zapier to post error context in Slack when a job fails. +# time_to_complete: '30 minutes' commenting out until we test +icon: 'guides' +hide_table_of_contents: true +tags: ['Webhooks'] +level: 'Advanced' +recently_updated: true --- +## Introduction + This guide will show you how to set up an integration between dbt Cloud jobs and Slack using [dbt Cloud webhooks](/docs/deploy/webhooks) and Zapier. It builds on the native [native Slack integration](/docs/deploy/job-notifications#slack-notifications) by attaching error message details of models and tests in a thread. Note: Because there is not a webhook for Run Cancelled, you may want to keep the standard Slack integration installed to receive those notifications. You could also use the [alternative integration](#alternate-approach) that augments the native integration without replacing it. @@ -17,21 +25,20 @@ When a dbt Cloud job finishes running, the integration will: - Create a threaded message attached to that post which contains any reasons that the job failed ![Screenshot of a message in Slack showing a summary of a dbt Cloud run which failed](/img/guides/orchestration/webhooks/zapier-slack/slack-thread-example.png) -## Prerequisites + +### Prerequisites In order to set up the integration, you should have familiarity with: - [dbt Cloud webhooks](/docs/deploy/webhooks) - Zapier -## Integration steps - -### 1. Create a new Zap in Zapier -Use **Webhooks by Zapier** as the Trigger, and **Catch Raw Hook** as the Event. If you don't intend to [validate the authenticity of your webhook](/docs/deploy/webhooks#validate-a-webhook) (not recommended!) then you can choose **Catch Hook** instead. -Click **Continue**, then copy the webhook URL. +## Create a new Zap in Zapier +1. Use **Webhooks by Zapier** as the Trigger, and **Catch Raw Hook** as the Event. If you don't intend to [validate the authenticity of your webhook](/docs/deploy/webhooks#validate-a-webhook) (not recommended!) then you can choose **Catch Hook** instead. +2. Click **Continue**, then copy the webhook URL. ![Screenshot of the Zapier UI, showing the webhook URL ready to be copied](/img/guides/orchestration/webhooks/zapier-common/catch-raw-hook.png) -### 2. Configure a new webhook in dbt Cloud +## Configure a new webhook in dbt Cloud See [Create a webhook subscription](/docs/deploy/webhooks#create-a-webhook-subscription) for full instructions. Choose **Run completed** as the Event. You can alternatively choose **Run errored**, but you will need to account for the fact that the necessary metadata [might not be available immediately](/docs/deploy/webhooks#completed-errored-event-difference). Remember the Webhook Secret Key for later. @@ -40,7 +47,7 @@ Once you've tested the endpoint in dbt Cloud, go back to Zapier and click **Test The sample body's values are hardcoded and not reflective of your project, but they give Zapier a correctly-shaped object during development. -### 3. Store secrets +## Store secrets In the next step, you will need the Webhook Secret Key from the prior step, and a dbt Cloud [user token](https://docs.getdbt.com/docs/dbt-cloud-apis/user-tokens) or [service account token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens). Zapier allows you to [store secrets](https://help.zapier.com/hc/en-us/articles/8496293271053-Save-and-retrieve-data-from-Zaps). This prevents your keys from being displayed as plaintext in the Zap code. 
You can access them with the [StoreClient utility](https://help.zapier.com/hc/en-us/articles/8496293969549-Store-data-from-code-steps-with-StoreClient). @@ -48,7 +55,7 @@ Zapier allows you to [store secrets](https://help.zapier.com/hc/en-us/articles/8 -### 4. Add a code action +## Add a code action Select **Code by Zapier** as the App, and **Run Python** as the Event. In the **Set up action** section, add two items to **Input Data**: `raw_body` and `auth_header`. Map those to the `1. Raw Body` and `1. Headers Http Authorization` fields from the previous **Catch Raw Hook** step. @@ -153,7 +160,7 @@ send_error_thread = len(threaded_errors_post) > 0 output = {'step_summary_post': step_summary_post, 'send_error_thread': send_error_thread, 'threaded_errors_post': threaded_errors_post} ``` -### 5. Add Slack actions in Zapier +## Add Slack actions in Zapier Select **Slack** as the App, and **Send Channel Message** as the Action. In the **Action** section, choose which **Channel** to post to. Set the **Message Text** field to **2. Step Summary Post** from the Run Python in Code by Zapier output. @@ -170,11 +177,11 @@ Add another **Send Channel Message in Slack** action. In the **Action** section, ![Screenshot of the Zapier UI, showing the mappings of prior steps to a Slack message](/img/guides/orchestration/webhooks/zapier-slack/thread-slack-config.png) -### 7. Test and deploy +## Test and deploy When you're done testing your Zap, make sure that your `run_id` and `account_id` are no longer hardcoded in the Code step, then publish your Zap. -## Alternate approach +## Alternatively, use a dbt Cloud app Slack message to trigger Zapier Instead of using a webhook as your trigger, you can keep the existing dbt Cloud app installed in your Slack workspace and use its messages being posted to your channel as the trigger. In this case, you can skip validating the webhook and only need to load the context from the thread. diff --git a/website/docs/reference/commands/init.md b/website/docs/reference/commands/init.md index ac55717c0ec..e9cc2ccba4e 100644 --- a/website/docs/reference/commands/init.md +++ b/website/docs/reference/commands/init.md @@ -36,7 +36,7 @@ If you've just cloned or downloaded an existing dbt project, `dbt init` can stil `dbt init` knows how to prompt for connection information by looking for a file named `profile_template.yml`. It will look for this file in two places: -- **Adapter plugin:** What's the bare minumum Postgres profile? What's the type of each field, what are its defaults? This information is stored in a file called [`dbt/include/postgres/profile_template.yml`](https://github.com/dbt-labs/dbt-core/blob/main/plugins/postgres/dbt/include/postgres/profile_template.yml). If you're the maintainer of an adapter plugin, we highly recommend that you add a `profile_template.yml` to your plugin, too. See more details in [building-a-new-adapter](/guides/dbt-ecosystem/adapter-development/3-building-a-new-adapter). +- **Adapter plugin:** What's the bare minimum Postgres profile? What's the type of each field, what are its defaults? This information is stored in a file called [`dbt/include/postgres/profile_template.yml`](https://github.com/dbt-labs/dbt-core/blob/main/plugins/postgres/dbt/include/postgres/profile_template.yml). If you're the maintainer of an adapter plugin, we highly recommend that you add a `profile_template.yml` to your plugin, too. Refer to the [Build, test, document, and promote adapters](/guides/adapter-creation) guide for more information.
- **Existing project:** If you're the maintainer of an existing project, and you want to help new users get connected to your database quickly and easily, you can include your own custom `profile_template.yml` in the root of your project, alongside `dbt_project.yml`. For common connection attributes, set the values in `fixed`; leave user-specific attributes in `prompts`, but with custom hints and defaults as you'd like. diff --git a/website/docs/reference/dbt-jinja-functions/run_query.md b/website/docs/reference/dbt-jinja-functions/run_query.md index cdd65a7b4dc..87970e024ed 100644 --- a/website/docs/reference/dbt-jinja-functions/run_query.md +++ b/website/docs/reference/dbt-jinja-functions/run_query.md @@ -15,7 +15,7 @@ Returns a [Table](https://agate.readthedocs.io/page/api/table.html) object with **Note:** The `run_query` macro will not begin a transaction automatically - if you wish to run your query inside of a transaction, please use `begin` and `commit ` statements as appropriate. :::info Using run_query for the first time? -Check out the section of the Getting Started guide on [using Jinja](/guides/advanced/using-jinja#dynamically-retrieve-the-list-of-payment-methods) for an example of working with the results of the `run_query` macro! +Check out the section of the Getting Started guide on [using Jinja](/guides/using-jinja#dynamically-retrieve-the-list-of-payment-methods) for an example of working with the results of the `run_query` macro! ::: **Example Usage:** diff --git a/website/docs/reference/events-logging.md b/website/docs/reference/events-logging.md index dec1dafcb8e..ffdeb7bb752 100644 --- a/website/docs/reference/events-logging.md +++ b/website/docs/reference/events-logging.md @@ -4,7 +4,7 @@ title: "Events and logs" As dbt runs, it generates events. The most common way to see those events is as log messages, written in real time to two places: - The command line terminal (`stdout`), to provide interactive feedback while running dbt. -- The debug log file (`logs/dbt.log`), to enable detailed [debugging of errors](/guides/best-practices/debugging-errors) when they occur. The text-formatted log messages in this file include all `DEBUG`-level events, as well as contextual information, such as log level and thread name. The location of this file can be configured via [the `log_path` config](/reference/project-configs/log-path). +- The debug log file (`logs/dbt.log`), to enable detailed [debugging of errors](/guides/debug-errors) when they occur. The text-formatted log messages in this file include all `DEBUG`-level events, as well as contextual information, such as log level and thread name. The location of this file can be configured via [the `log_path` config](/reference/project-configs/log-path). diff --git a/website/docs/reference/node-selection/syntax.md b/website/docs/reference/node-selection/syntax.md index bb2aeefd742..d0ea4a9acd8 100644 --- a/website/docs/reference/node-selection/syntax.md +++ b/website/docs/reference/node-selection/syntax.md @@ -96,7 +96,7 @@ by comparing code in the current project against the state manifest. - [Deferring](/reference/node-selection/defer) to another environment, whereby dbt can identify upstream, unselected resources that don't exist in your current environment and instead "defer" their references to the environment provided by the state manifest. - The [`dbt clone` command](/reference/commands/clone), whereby dbt can clone nodes based on their location in the manifest provided to the `--state` flag. 
-Together, the `state:` selector and deferral enable ["slim CI"](/guides/legacy/best-practices#run-only-modified-models-to-test-changes-slim-ci). We expect to add more features in future releases that can leverage artifacts passed to the `--state` flag. +Together, the `state:` selector and deferral enable ["slim CI"](/best-practices/best-practice-workflows#run-only-modified-models-to-test-changes-slim-ci). We expect to add more features in future releases that can leverage artifacts passed to the `--state` flag. ### Establishing state @@ -190,7 +190,7 @@ dbt build --select "source_status:fresher+" ``` -For more example commands, refer to [Pro-tips for workflows](/guides/legacy/best-practices.md#pro-tips-for-workflows). +For more example commands, refer to [Pro-tips for workflows](/best-practices/best-practice-workflows#pro-tips-for-workflows). ### The "source_status" status diff --git a/website/docs/reference/resource-configs/contract.md b/website/docs/reference/resource-configs/contract.md index 59cc511890b..ccc10099a12 100644 --- a/website/docs/reference/resource-configs/contract.md +++ b/website/docs/reference/resource-configs/contract.md @@ -48,7 +48,7 @@ models: -When dbt compares data types, it will not compare granular details such as size, precision, or scale. We don't think you should sweat the difference between `varchar(256)` and `varchar(257)`, because it doesn't really affect the experience of downstream queriers. You can accomplish a more-precise assertion by [writing or using a custom test](/guides/best-practices/writing-custom-generic-tests). +When dbt compares data types, it will not compare granular details such as size, precision, or scale. We don't think you should sweat the difference between `varchar(256)` and `varchar(257)`, because it doesn't really affect the experience of downstream queriers. You can accomplish a more-precise assertion by [writing or using a custom test](/best-practices/writing-custom-generic-tests). Note that you need to specify a varchar size or numeric scale, otherwise dbt relies on default values. For example, if a `numeric` type defaults to a precision of 38 and a scale of 0, then the numeric column stores 0 digits to the right of the decimal (it only stores whole numbers), which might cause it to fail contract enforcement. To avoid this implicit coercion, specify your `data_type` with a nonzero scale, like `numeric(38, 6)`. dbt Core 1.7 and higher provides a warning if you don't specify precision and scale when providing a numeric data type. 
diff --git a/website/docs/reference/resource-configs/no-configs.md b/website/docs/reference/resource-configs/no-configs.md index 5a4ba4eaaa2..5eec26917c8 100644 --- a/website/docs/reference/resource-configs/no-configs.md +++ b/website/docs/reference/resource-configs/no-configs.md @@ -8,4 +8,4 @@ If you were guided to this page from a data platform setup article, it most like - Setting up the profile is the only action the end-user needs to take on the data platform, or - The subsequent actions the end-user needs to take are not currently documented -If you'd like to contribute to data platform-specifc configuration information, refer to [Documenting a new adapter](/guides/dbt-ecosystem/adapter-development/5-documenting-a-new-adapter) \ No newline at end of file +If you'd like to contribute to data platform-specific configuration information, refer to [Documenting a new adapter](/guides/adapter-creation) diff --git a/website/docs/reference/resource-properties/tests.md b/website/docs/reference/resource-properties/tests.md index 6e2c02c6bc5..0fe86ccc57d 100644 --- a/website/docs/reference/resource-properties/tests.md +++ b/website/docs/reference/resource-properties/tests.md @@ -298,7 +298,7 @@ models: -Check out the guide on writing a [custom generic test](/guides/best-practices/writing-custom-generic-tests) for more information. +Check out the guide on writing a [custom generic test](/best-practices/writing-custom-generic-tests) for more information. ### Custom test name diff --git a/website/docs/sql-reference/aggregate-functions/sql-array-agg.md b/website/docs/sql-reference/aggregate-functions/sql-array-agg.md index 430be4b4316..a6f508a7bef 100644 --- a/website/docs/sql-reference/aggregate-functions/sql-array-agg.md +++ b/website/docs/sql-reference/aggregate-functions/sql-array-agg.md @@ -59,4 +59,4 @@ Looking at the query results—this makes sense! We’d expect newer orders to l There are definitely too many use cases to list out for using the ARRAY_AGG function in your dbt models, but it’s very likely that ARRAY_AGG is used pretty downstream in your since you likely don’t want your data so bundled up earlier in your DAG to improve modularity and dryness. A few downstream use cases for ARRAY_AGG: - In [`export_` models](https://www.getdbt.com/open-source-data-culture/reverse-etl-playbook) that are used to send data to platforms using a tool to pair down multiple rows into a single row. Some downstream platforms, for example, require certain values that we’d usually keep as separate rows to be one singular row per customer or user. ARRAY_AGG is handy to bring multiple column values together by a singular id, such as creating an array of all items a user has ever purchased and sending that array downstream to an email platform to create a custom email campaign. -- Similar to export models, you may see ARRAY_AGG used in [mart tables](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts) to create final aggregate arrays per a singular dimension; performance concerns of ARRAY_AGG in these likely larger tables can potentially be bypassed with use of [incremental models in dbt](https://docs.getdbt.com/docs/build/incremental-models). +- Similar to export models, you may see ARRAY_AGG used in [mart tables](/best-practices/how-we-structure/4-marts) to create final aggregate arrays per a singular dimension; performance concerns of ARRAY_AGG in these likely larger tables can potentially be bypassed with use of [incremental models in dbt](/docs/build/incremental-models). 
diff --git a/website/docs/sql-reference/aggregate-functions/sql-avg.md b/website/docs/sql-reference/aggregate-functions/sql-avg.md index d7d2fccc3c4..d1dba119292 100644 --- a/website/docs/sql-reference/aggregate-functions/sql-avg.md +++ b/website/docs/sql-reference/aggregate-functions/sql-avg.md @@ -48,7 +48,7 @@ Snowflake, Databricks, Google BigQuery, and Amazon Redshift all support the abil ## AVG function use cases We most commonly see the AVG function used in data work to calculate: -- The average of key metrics (ex. Average CSAT, average lead time, average order amount) in downstream [fact or dim models](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts) +- The average of key metrics (ex. Average CSAT, average lead time, average order amount) in downstream [fact or dim models](/best-practices/how-we-structure/4-marts) - Rolling or moving averages (ex. 7-day, 30-day averages for key metrics) using window functions - Averages in [dbt metrics](https://docs.getdbt.com/docs/build/metrics) diff --git a/website/docs/sql-reference/aggregate-functions/sql-round.md b/website/docs/sql-reference/aggregate-functions/sql-round.md index 053a2ebdd8e..bc9669e22cb 100644 --- a/website/docs/sql-reference/aggregate-functions/sql-round.md +++ b/website/docs/sql-reference/aggregate-functions/sql-round.md @@ -57,7 +57,7 @@ Google BigQuery, Amazon Redshift, Snowflake, and Databricks all support the abil ## ROUND function use cases -If you find yourself rounding numeric data, either in data models or ad-hoc analyses, you’re probably rounding to improve the readability and usability of your data using downstream [intermediate](https://docs.getdbt.com/guides/best-practices/how-we-structure/3-intermediate) or [mart models](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts). Specifically, you’ll likely use the ROUND function to: +If you find yourself rounding numeric data, either in data models or ad-hoc analyses, you’re probably rounding to improve the readability and usability of your data using downstream [intermediate](/best-practices/how-we-structure/3-intermediate) or [mart models](/best-practices/how-we-structure/4-marts). Specifically, you’ll likely use the ROUND function to: - Make numeric calculations using division or averages a little cleaner and easier to understand - Create concrete buckets of data for a cleaner distribution of values during ad-hoc analysis diff --git a/website/docs/sql-reference/clauses/sql-limit.md b/website/docs/sql-reference/clauses/sql-limit.md index 74cc2e12123..a02b851e37d 100644 --- a/website/docs/sql-reference/clauses/sql-limit.md +++ b/website/docs/sql-reference/clauses/sql-limit.md @@ -51,7 +51,7 @@ This simple query using the [Jaffle Shop’s](https://github.com/dbt-labs/jaffle After ensuring that this is the result you want from this query, you can omit the LIMIT in your final data model. :::tip Save money and time by limiting data in development -You could limit your data used for development by manually adding a LIMIT statement, a WHERE clause to your query, or by using a [dbt macro to automatically limit data based](https://docs.getdbt.com/guides/legacy/best-practices#limit-the-data-processed-when-in-development) on your development environment to help reduce your warehouse usage during dev periods. 
+You could limit your data used for development by manually adding a LIMIT statement, a WHERE clause to your query, or by using a [dbt macro to automatically limit data based](/best-practices/best-practice-workflows#limit-the-data-processed-when-in-development) on your development environment to help reduce your warehouse usage during dev periods. ::: ## LIMIT syntax in Snowflake, Databricks, BigQuery, and Redshift diff --git a/website/docs/sql-reference/clauses/sql-order-by.md b/website/docs/sql-reference/clauses/sql-order-by.md index 660794adc14..d18946d0d16 100644 --- a/website/docs/sql-reference/clauses/sql-order-by.md +++ b/website/docs/sql-reference/clauses/sql-order-by.md @@ -57,7 +57,7 @@ Since the ORDER BY clause is a SQL fundamental, data warehouses, including Snowf ## ORDER BY use cases We most commonly see the ORDER BY clause used in data work to: -- Analyze data for both initial exploration of raw data sources and ad hoc querying of [mart datasets](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts) +- Analyze data for both initial exploration of raw data sources and ad hoc querying of [mart datasets](/best-practices/how-we-structure/4-marts) - Identify the top 5/10/50/100 of a dataset when used in pair with a [LIMIT](/sql-reference/limit) - (For Snowflake) Optimize the performance of large incremental models that use both a `cluster_by` [configuration](https://docs.getdbt.com/reference/resource-configs/snowflake-configs#using-cluster_by) and ORDER BY statement - Control the ordering of window function partitions (ex. `row_number() over (partition by user_id order by updated_at)`) diff --git a/website/docs/sql-reference/joins/sql-inner-join.md b/website/docs/sql-reference/joins/sql-inner-join.md index 0cf8a3894bd..951e3675bc7 100644 --- a/website/docs/sql-reference/joins/sql-inner-join.md +++ b/website/docs/sql-reference/joins/sql-inner-join.md @@ -66,5 +66,5 @@ Because there’s no `user_id` = 4 in Table A and no `user_id` = 2 in Table B, r ## SQL inner join use cases -There are probably countless scenarios where you’d want to inner join multiple tables together—perhaps you have some really nicely structured tables with the exact same primary keys that should really just be one larger, wider table or you’re joining two tables together don’t want any null or missing column values if you used a left or right join—it’s all pretty dependent on your source data and end use cases. Where you will not (and should not) see inner joins is in [staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging) that are used to clean and prep raw source data for analytics uses. Any joins in your dbt projects should happen further downstream in [intermediate](https://docs.getdbt.com/guides/best-practices/how-we-structure/3-intermediate) and [mart models](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts) to improve modularity and DAG cleanliness. +There are probably countless scenarios where you’d want to inner join multiple tables together—perhaps you have some really nicely structured tables with the exact same primary keys that should really just be one larger, wider table or you’re joining two tables together and don’t want any null or missing column values if you used a left or right join—it’s all pretty dependent on your source data and end use cases.
Where you will not (and should not) see inner joins is in [staging models](/best-practices/how-we-structure/2-staging) that are used to clean and prep raw source data for analytics uses. Any joins in your dbt projects should happen further downstream in [intermediate](/best-practices/how-we-structure/3-intermediate) and [mart models](/best-practices/how-we-structure/4-marts) to improve modularity and DAG cleanliness. diff --git a/website/docs/sql-reference/joins/sql-left-join.md b/website/docs/sql-reference/joins/sql-left-join.md index 841edc41cdd..914f83bb7e3 100644 --- a/website/docs/sql-reference/joins/sql-left-join.md +++ b/website/docs/sql-reference/joins/sql-left-join.md @@ -73,4 +73,4 @@ Left joins are a fundamental in data modeling and analytics engineering work—t Something to note if you use left joins: if there are multiple records for an individual key in the left join database object, be aware that duplicates can potentially be introduced in the final query result. This is where dbt tests, such as testing for uniqueness and [equal row count](https://github.com/dbt-labs/dbt-utils#equal_rowcount-source) across upstream source tables and downstream child models, can help you identify faulty data modeling logic and improve data quality. ::: -Where you will not (and should not) see left joins is in [staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging) that are used to clean and prep raw source data for analytics uses. Any joins in your dbt projects should happen further downstream in [intermediate](https://docs.getdbt.com/guides/best-practices/how-we-structure/3-intermediate) and [mart models](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts) to improve modularity and cleanliness. \ No newline at end of file +Where you will not (and should not) see left joins is in [staging models](/best-practices/how-we-structure/2-staging) that are used to clean and prep raw source data for analytics uses. Any joins in your dbt projects should happen further downstream in [intermediate](/best-practices/how-we-structure/3-intermediate) and [mart models](/best-practices/how-we-structure/4-marts) to improve modularity and cleanliness. diff --git a/website/docs/sql-reference/joins/sql-self-join.md b/website/docs/sql-reference/joins/sql-self-join.md index 0eef0fcab7c..6d9a7d3261e 100644 --- a/website/docs/sql-reference/joins/sql-self-join.md +++ b/website/docs/sql-reference/joins/sql-self-join.md @@ -66,6 +66,6 @@ This query utilizing a self join adds the `parent_name` of skus that have non-nu ## SQL self join use cases -Again, self joins are probably rare in your dbt project and will most often be utilized in tables that contain a hierarchical structure, such as consisting of a column which is a foreign key to the primary key of the same table. If you do have use cases for self joins, such as in the example above, you’ll typically want to perform that self join early upstream in your , such as in a [staging](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging) or [intermediate](https://docs.getdbt.com/guides/best-practices/how-we-structure/3-intermediate) model; if your raw, unjoined table is going to need to be accessed further downstream sans self join, that self join should happen in a modular intermediate model. 
+Again, self joins are probably rare in your dbt project and will most often be utilized in tables that contain a hierarchical structure, such as consisting of a column which is a foreign key to the primary key of the same table. If you do have use cases for self joins, such as in the example above, you’ll typically want to perform that self join early upstream in your DAG, such as in a [staging](/best-practices/how-we-structure/2-staging) or [intermediate](/best-practices/how-we-structure/3-intermediate) model; if your raw, unjoined table is going to need to be accessed further downstream sans self join, that self join should happen in a modular intermediate model. -You can also use self joins to create a cartesian product (aka a cross join) of a table against itself. Again, slim use cases, but still there for you if you need it 😉 \ No newline at end of file +You can also use self joins to create a cartesian product (aka a cross join) of a table against itself. Again, slim use cases, but still there for you if you need it 😉 diff --git a/website/docs/sql-reference/operators/sql-not.md b/website/docs/sql-reference/operators/sql-not.md index e9156cb9720..fcfa7627c0b 100644 --- a/website/docs/sql-reference/operators/sql-not.md +++ b/website/docs/sql-reference/operators/sql-not.md @@ -55,4 +55,4 @@ This simple query using the sample dataset [Jaffle Shop’s](https://github.com/ ## NOT operator example use cases -There are probably many scenarios where you’d want to use the NOT operators in your WHERE clauses or case statements, but we commonly see NOT operators used to remove nulls or boolean-identifed deleted rows in source data in [staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging). This removal of unnecessary rows can potentially help the performance of downstream [intermediate](https://docs.getdbt.com/guides/best-practices/how-we-structure/3-intermediate) and [mart models](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts). \ No newline at end of file +There are probably many scenarios where you’d want to use the NOT operators in your WHERE clauses or case statements, but we commonly see NOT operators used to remove nulls or boolean-identified deleted rows in source data in [staging models](https://docs.getdbt.com/best-practices/how-we-structure/2-staging). This removal of unnecessary rows can potentially help the performance of downstream [intermediate](https://docs.getdbt.com/best-practices/how-we-structure/3-intermediate) and [mart models](https://docs.getdbt.com/best-practices/how-we-structure/4-marts). diff --git a/website/docs/sql-reference/other/sql-cast.md b/website/docs/sql-reference/other/sql-cast.md index cf24a12706e..9d41400e825 100644 --- a/website/docs/sql-reference/other/sql-cast.md +++ b/website/docs/sql-reference/other/sql-cast.md @@ -50,7 +50,7 @@ After running this query, the `orders` table will look a little something like t Let’s be clear: the resulting data from this query looks exactly the same as the upstream `orders` model. However, the `order_id` and `customer_id` fields are now strings, meaning you could easily concat different string variables to them. -> Casting columns to their appropriate types typically happens in our dbt project’s [staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging).
A few reasons for that: data cleanup and standardization, such as aliasing, casting, and lower or upper casing, should ideally happen in staging models to create downstream uniformity and improve downstream performance. +> Casting columns to their appropriate types typically happens in our dbt project’s [staging models](https://docs.getdbt.com/best-practices/how-we-structure/2-staging). A few reasons for that: data cleanup and standardization, such as aliasing, casting, and lower or upper casing, should ideally happen in staging models to create downstream uniformity and improve downstream performance. ## SQL CAST function syntax in Snowflake, Databricks, BigQuery, and Redshift @@ -66,4 +66,4 @@ You know at one point you’re going to need to cast a column to a different dat - tools [defaulting to certain data types](https://airbytehq.github.io/integrations/sources/google-sheets/) - BI tools require certain fields to be specific data types -A key thing to remember when you’re casting data is the user experience in your end BI tool: are business users expecting `customer_id` to be filtered on 1 or '1'? What is more intuitive for them? If one `id` field is an integer, all `id` fields should be integers. Just like all data modeling, consistency and standardization is key when determining when and what to cast. \ No newline at end of file +A key thing to remember when you’re casting data is the user experience in your end BI tool: are business users expecting `customer_id` to be filtered on 1 or '1'? What is more intuitive for them? If one `id` field is an integer, all `id` fields should be integers. Just like all data modeling, consistency and standardization is key when determining when and what to cast. diff --git a/website/docs/sql-reference/other/sql-comments.md b/website/docs/sql-reference/other/sql-comments.md index 811f2b4339e..7fe5e970a85 100644 --- a/website/docs/sql-reference/other/sql-comments.md +++ b/website/docs/sql-reference/other/sql-comments.md @@ -53,7 +53,7 @@ We recommend leveraging inline comments in the following situations: - Explain complex code logic that if you had to scratch your head at, someone else will have to scratch their head at - Explain niche, unique-to-your-business logic -- Separate out field types (ex. Ids, booleans, strings, dates, numerics, and timestamps) in [staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging) to create more readable, organized, and formulaic models +- Separate out field types (ex. Ids, booleans, strings, dates, numerics, and timestamps) in [staging models](https://docs.getdbt.com/best-practices/how-we-structure/2-staging) to create more readable, organized, and formulaic models - Clearly label tech debt (`-- [TODO]: TECH DEBT`) in queries or models diff --git a/website/docs/sql-reference/statements/sql-select.md b/website/docs/sql-reference/statements/sql-select.md index 49132524096..0b914d9c1da 100644 --- a/website/docs/sql-reference/statements/sql-select.md +++ b/website/docs/sql-reference/statements/sql-select.md @@ -42,8 +42,8 @@ You may also commonly see queries that `select * from table_name`. The asterisk Leverage [dbt utils’ star macro](/blog/star-sql-love-letter) to be able to both easily select many and specifically exclude certain columns. ::: -In a dbt project, analytics engineers will typically write models that contain multiple CTEs that build to one greater query. 
For folks that are newer to analytics engineering or dbt, we recommend they check out the [“How we structure our dbt projects” guide](/guides/best-practices/how-we-structure/1-guide-overview) to better understand why analytics folks like modular data modeling and CTEs. +In a dbt project, analytics engineers will typically write models that contain multiple CTEs that build to one greater query. For folks that are newer to analytics engineering or dbt, we recommend they check out the [“How we structure our dbt projects” guide](/best-practices/how-we-structure/1-guide-overview) to better understand why analytics folks like modular data modeling and CTEs. ## SELECT statement syntax in Snowflake, Databricks, BigQuery, and Redshift -While we know the data warehouse players like to have their own slightly different flavors and syntax for SQL, they have conferred together that the SELECT statement is sacred and unchangeable. As a result, writing the actual `select…from` statement across Snowflake, Databricks, Google BigQuery, and Amazon Redshift would look the same. However, the actual SQL manipulation of data within the SELECT statement (ex. adding dates, casting columns) might look slightly different between each data warehouse. \ No newline at end of file +While we know the data warehouse players like to have their own slightly different flavors and syntax for SQL, they have conferred together that the SELECT statement is sacred and unchangeable. As a result, writing the actual `select…from` statement across Snowflake, Databricks, Google BigQuery, and Amazon Redshift would look the same. However, the actual SQL manipulation of data within the SELECT statement (ex. adding dates, casting columns) might look slightly different between each data warehouse. diff --git a/website/docs/sql-reference/string-functions/sql-lower.md b/website/docs/sql-reference/string-functions/sql-lower.md index 8c8622bb77a..7b1a5a2c2b3 100644 --- a/website/docs/sql-reference/string-functions/sql-lower.md +++ b/website/docs/sql-reference/string-functions/sql-lower.md @@ -54,7 +54,7 @@ After running this query, the `customers` table will look a little something lik Now, all characters in the `first_name` and `last_name` columns are lowercase. -> Changing all string columns to lowercase to create uniformity across data sources typically happens in our [dbt project’s staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging). There are a few reasons for that: data cleanup and standardization, such as aliasing, casting, and lowercasing, should ideally happen in staging models to create downstream uniformity and improve downstream performance. +> Changing all string columns to lowercase to create uniformity across data sources typically happens in our [dbt project’s staging models](https://docs.getdbt.com/best-practices/how-we-structure/2-staging). There are a few reasons for that: data cleanup and standardization, such as aliasing, casting, and lowercasing, should ideally happen in staging models to create downstream uniformity and improve downstream performance. 
## SQL LOWER function syntax in Snowflake, Databricks, BigQuery, and Redshift diff --git a/website/docs/sql-reference/string-functions/sql-trim.md b/website/docs/sql-reference/string-functions/sql-trim.md index ad54a015437..b9555feb630 100644 --- a/website/docs/sql-reference/string-functions/sql-trim.md +++ b/website/docs/sql-reference/string-functions/sql-trim.md @@ -50,4 +50,4 @@ In this query, you’re adding superfluous asterisks to a string using the [CONC ## TRIM function use cases -If string values in your raw data have extra white spaces or miscellaneous characters, you’ll leverage the TRIM (and subset RTRIM AND LTRIM) functions to help you quickly remove them. You’ll likely do this cleanup in [staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging), where you’re probably standardizing casing and doing other minor formatting changes to string values, so you can use a clean and consistent format across your downstream models. +If string values in your raw data have extra white spaces or miscellaneous characters, you’ll leverage the TRIM (and subset RTRIM AND LTRIM) functions to help you quickly remove them. You’ll likely do this cleanup in [staging models](https://docs.getdbt.com/best-practices/how-we-structure/2-staging), where you’re probably standardizing casing and doing other minor formatting changes to string values, so you can use a clean and consistent format across your downstream models. diff --git a/website/docs/sql-reference/string-functions/sql-upper.md b/website/docs/sql-reference/string-functions/sql-upper.md index cf7694f8e46..a505537ac5d 100644 --- a/website/docs/sql-reference/string-functions/sql-upper.md +++ b/website/docs/sql-reference/string-functions/sql-upper.md @@ -46,7 +46,7 @@ After running this query, the `customers` table will look a little something lik Now, all characters in the `first_name` are uppercase (and `last_name` are unchanged). -> Changing string columns to uppercase to create uniformity across data sources typically happens in our [dbt project’s staging models](https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging). There are a few reasons for that: data cleanup and standardization, such as aliasing, casting, and lower or upper casing, should ideally happen in staging models to create downstream uniformity and improve downstream performance. +> Changing string columns to uppercase to create uniformity across data sources typically happens in our [dbt project’s staging models](https://docs.getdbt.com/best-practices/how-we-structure/2-staging). There are a few reasons for that: data cleanup and standardization, such as aliasing, casting, and lower or upper casing, should ideally happen in staging models to create downstream uniformity and improve downstream performance. ## SQL UPPER function syntax in Snowflake, Databricks, BigQuery, and Redshift diff --git a/website/docs/terms/cte.md b/website/docs/terms/cte.md index d4a4bb15915..f67480325b4 100644 --- a/website/docs/terms/cte.md +++ b/website/docs/terms/cte.md @@ -66,7 +66,7 @@ When people talk about how CTEs can simplify your queries, they specifically mea #### Establish Structure -In leveraging CTEs, you can break complex code into smaller segments, ultimately helping provide structure to your code. At dbt Labs, we often like to use the [import, logical, and final structure](/guides/migration/tools/refactoring-legacy-sql#implement-cte-groupings) for CTEs which creates a predictable and organized structure to your dbt models. 
+In leveraging CTEs, you can break complex code into smaller segments, ultimately helping provide structure to your code. At dbt Labs, we often like to use the [import, logical, and final structure](/guides/refactoring-legacy-sql?step=5#implement-cte-groupings) for CTEs which creates a predictable and organized structure to your dbt models. #### Easily identify dependencies @@ -181,7 +181,7 @@ CTEs are essentially temporary views that can be used throughout a query. They a If you’re interested in reading more about CTE best practices, check out some of our favorite content around model refactoring and style: -- [Refactoring Legacy SQL to dbt](/guides/migration/tools/refactoring-legacy-sql#implement-cte-groupings) +- [Refactoring Legacy SQL to dbt](/guides/refactoring-legacy-sql?step=5#implement-cte-groupings) - [dbt Labs Style Guide](https://github.com/dbt-labs/corp/blob/main/dbt_style_guide.md#ctes) - [Modular Data Modeling Technique](https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique/) diff --git a/website/docs/terms/dag.md b/website/docs/terms/dag.md index f4247c785a4..c6b91300bfc 100644 --- a/website/docs/terms/dag.md +++ b/website/docs/terms/dag.md @@ -65,7 +65,7 @@ See the DAG above? It follows a more traditional approach to data modeling where Instead, there are some key elements that can help you create a more streamlined DAG and [modular data models](https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique/): -- Leveraging [staging, intermediate, and mart layers](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) to create layers of distinction between sources and transformed data +- Leveraging [staging, intermediate, and mart layers](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview) to create layers of distinction between sources and transformed data - Abstracting code that’s used across multiple models to its own model - Joining on surrogate keys versus on multiple values @@ -106,6 +106,6 @@ A Directed acyclic graph (DAG) is a visual representation of your data models an Ready to restructure (or create your first) DAG? Check out some of the resources below to better understand data modularity, data lineage, and how dbt helps bring it all together: - [Data modeling techniques for more modularity](https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique/) -- [How we structure our dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) +- [How we structure our dbt projects](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview) - [How to audit your DAG](https://www.youtube.com/watch?v=5W6VrnHVkCA) -- [Refactoring legacy SQL to dbt](/guides/migration/tools/refactoring-legacy-sql) +- [Refactoring legacy SQL to dbt](/guides/refactoring-legacy-sql) diff --git a/website/docs/terms/data-lineage.md b/website/docs/terms/data-lineage.md index a03687eaba3..d0162c35616 100644 --- a/website/docs/terms/data-lineage.md +++ b/website/docs/terms/data-lineage.md @@ -89,7 +89,7 @@ The biggest challenges around data lineage become more apparent as your data, sy As dbt projects scale with data and organization growth, the number of sources, models, macros, seeds, and [exposures](https://docs.getdbt.com/docs/build/exposures) invariably grow. And with an increasing number of nodes in your DAG, it can become harder to audit your DAG for WET code or inefficiencies. 
-Working with dbt projects with thousands of models and nodes can feel overwhelming, but remember: your DAG and data lineage are meant to help you, not be your enemy. Tackle DAG audits in chunks, document all models, and [leverage strong structure conventions](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview). +Working with dbt projects with thousands of models and nodes can feel overwhelming, but remember: your DAG and data lineage are meant to help you, not be your enemy. Tackle DAG audits in chunks, document all models, and [leverage strong structure conventions](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview). :::tip dbt project evaluator @@ -113,4 +113,4 @@ DAGs, data lineage, and root cause analysis…tell me more! Check out some of ou - [Glossary: DRY](https://docs.getdbt.com/terms/dry) - [Data techniques for modularity](https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique/) -- [How we structure our dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) +- [How we structure our dbt projects](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview) diff --git a/website/docs/terms/data-wrangling.md b/website/docs/terms/data-wrangling.md index a5b4e99f312..4a26507adfd 100644 --- a/website/docs/terms/data-wrangling.md +++ b/website/docs/terms/data-wrangling.md @@ -12,7 +12,7 @@ hoverSnippet: Data wrangling describes the different processes used to transform Data wrangling describes the different processes used to transform raw data into a consistent and easily usable format. For analytics engineers, you may know this better by the name of data cleaning. In data science or machine learning, "wrangling" often refers to prepping the data for model creation. -The ultimate goal of data wrangling is to work in a way that allows you to dive right into analysis on a dataset or build upon that data in a downstream model without worrying about basic cleaning like renaming, datatype casting, etc. Data wrangling acts as preparation for the development of [intermediate, fct/dim, or mart data models](/guides/best-practices/how-we-structure/1-guide-overview) that form the base layer that other data work can be built off of. Analytics engineers tend to do data wrangling work in the staging layer as a first transformation step after loading the data. This eliminates a foundational step done by an analytics engineer or analyst when building a downstream data model or dashboard. +The ultimate goal of data wrangling is to work in a way that allows you to dive right into analysis on a dataset or build upon that data in a downstream model without worrying about basic cleaning like renaming, datatype casting, etc. Data wrangling acts as preparation for the development of [intermediate, fct/dim, or mart data models](/best-practices/how-we-structure/1-guide-overview) that form the base layer that other data work can be built off of. Analytics engineers tend to do data wrangling work in the staging layer as a first transformation step after loading the data. This eliminates a foundational step done by an analytics engineer or analyst when building a downstream data model or dashboard. 
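As a rough sketch of that staging-layer cleanup (the `jaffle_shop` source and the column names here are made up for illustration), a staging model might handle the renaming, trimming, casing, and casting in one place so downstream models never have to:

```sql
-- models/staging/stg_customers.sql (hypothetical)
with source as (
    select * from {{ source('jaffle_shop', 'customers') }}
),

renamed as (
    select
        id as customer_id,                           -- rename to a consistent key name
        trim(first_name) as first_name,              -- remove stray whitespace
        upper(last_name) as last_name,               -- standardize casing across sources
        cast(created_at as timestamp) as created_at  -- explicit datatype casting
    from source
)

select * from renamed
```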
## Data wrangling steps @@ -164,4 +164,4 @@ You could argue that data wrangling is one of the most important parts of an ana - [Our favorite SQL functions](https://www.getdbt.com/sql-foundations/top-sql-functions/) - [Glossary: Data warehouse](/terms/data-warehouse) - [Glossary: Primary key](/terms/primary-key) -- [Glossary: JSON](/terms/json) \ No newline at end of file +- [Glossary: JSON](/terms/json) diff --git a/website/docs/terms/dimensional-modeling.md b/website/docs/terms/dimensional-modeling.md index d0b5e9384a5..de88f7c318d 100644 --- a/website/docs/terms/dimensional-modeling.md +++ b/website/docs/terms/dimensional-modeling.md @@ -28,7 +28,7 @@ If you run a bakery (and we’d be interested in seeing the data person + baker Just as eating raw flour isn’t that appetizing, neither is deriving insights from raw data since it rarely has a nice structure that makes it poised for analytics. There’s some considerable work that’s needed to organize data and make it usable for business users. -This is where dimensional modeling comes into play; it’s a method that can help data folks create meaningful entities (cupcakes and cookies) to live inside their [data mart](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts) (your glass display) and eventually use for business intelligence purposes (eating said cookies). +This is where dimensional modeling comes into play; it’s a method that can help data folks create meaningful entities (cupcakes and cookies) to live inside their [data mart](https://docs.getdbt.com/best-practices/how-we-structure/4-marts) (your glass display) and eventually use for business intelligence purposes (eating said cookies). So I guess we take it back—you’re not just trying to build a bakery, you’re also trying to build a top-notch foundation for meaningful analytics. Dimensional modeling can be a method to get you part of the way there. @@ -135,7 +135,7 @@ If your end data consumers are less comfortable with SQL and your BI tool doesn The benefits and drawbacks of dimensional modeling are pretty straightforward. Generally, the main advantages can be boiled down to: -* **More accessibility**: Since the output of good dimensional modeling is a [data mart](https://docs.getdbt.com/guides/best-practices/how-we-structure/4-marts), the tables created are easier to understand and more accessible to end consumers. +* **More accessibility**: Since the output of good dimensional modeling is a [data mart](https://docs.getdbt.com/best-practices/how-we-structure/4-marts), the tables created are easier to understand and more accessible to end consumers. * **More flexibility**: Easy to slice, dice, filter, and view your data in whatever way suits your purpose. * **Performance**: Fact and dimension models are typically materialized as tables or [incremental models](https://docs.getdbt.com/docs/build/incremental-models). Since these often form the core understanding of a business, they are queried often. Materializing them as tables allows them to be more performant in downstream BI platforms. 
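For a rough illustration of the performance point above (model, column, and ref names are hypothetical), a fact model can be materialized incrementally so each run only processes new rows:

```sql
-- models/marts/fct_orders.sql (hypothetical)
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- on incremental runs, only pick up rows newer than what's already in the table
where order_date > (select max(order_date) from {{ this }})
{% endif %}
```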
@@ -156,4 +156,4 @@ Dimensional modeling is a tough, complex, and opinionated topic in the data worl * [Modular data modeling techniques](https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique/) * [Stakeholder-friendly model naming conventions](https://docs.getdbt.com/blog/stakeholder-friendly-model-names/) -* [How we structure our dbt projects guide](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview) +* [How we structure our dbt projects guide](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview) diff --git a/website/docs/terms/dry.md b/website/docs/terms/dry.md index be3d03ed4f0..b1649278cd2 100644 --- a/website/docs/terms/dry.md +++ b/website/docs/terms/dry.md @@ -89,7 +89,7 @@ DRY code is a principle that you should always be striving for. It saves you tim ## Further reading * [Data modeling technique for more modularity](https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique/) -* [Why we use so many CTEs](https://docs.getdbt.com/docs/guides/best-practices) +* [Why we use so many CTEs](https://docs.getdbt.com/docs/best-practices) * [Glossary: CTE](https://docs.getdbt.com/terms/cte) * [Glossary: Materialization](https://docs.getdbt.com/terms/materialization) * [Glossary: View](https://docs.getdbt.com/terms/view) diff --git a/website/docs/terms/idempotent.md b/website/docs/terms/idempotent.md index 8772ba58b62..ea3ef0a099b 100644 --- a/website/docs/terms/idempotent.md +++ b/website/docs/terms/idempotent.md @@ -20,4 +20,4 @@ A non-idempotent version of the "_Save_" button might do something like "Append If word processors only gave us non-idempotent "Append paragraph" / "Update paragraph" / "Delete paragraph" operations, then saving our document changes would be a lot more difficult! We'd have to keep track of which paragraphs we previously saved, and either make sure to not save them again or have a process in place to regularly clean up duplicate paragraphs. The implementation of the "_Save_" button in word processors takes the collection of low-level non-idempotent filesystem operations (read/append/overwrite/delete), and systematically runs them in a certain order so that the _user_ doesn't have to deal with the non-idempotency. The user can just focus on writing -- choosing words, editing for clarity, ensuring paragraphs aren't too long, etc. -- and the word processor deals with making sure the words get persisted properly to disk. -This word processing analogy is very similar to what dbt does for [data transformation](https://www.getdbt.com/analytics-engineering/transformation/): it takes the collection of low-level non-idempotent database operations (`SELECT`/`INSERT`/`UPDATE`/`DELETE` -- collectively known as DML statements), and systematically runs them in a certain order so that analytics engineers don't have to deal with non-idempotency. We can just focus on the data -- [choosing good model and column names](https://docs.getdbt.com/blog/on-the-importance-of-naming), [documenting them](/community/resources/viewpoint#documentation), [ensuring data consumers can understand them](https://docs.getdbt.com/docs/guides/best-practices#consider-the-information-architecture-of-your-data-warehouse), etc. -- and [`dbt run`](https://docs.getdbt.com/reference/commands/run) will make sure the database ends up in the right state. 
+This word processing analogy is very similar to what dbt does for [data transformation](https://www.getdbt.com/analytics-engineering/transformation/): it takes the collection of low-level non-idempotent database operations (`SELECT`/`INSERT`/`UPDATE`/`DELETE` -- collectively known as DML statements), and systematically runs them in a certain order so that analytics engineers don't have to deal with non-idempotency. We can just focus on the data -- [choosing good model and column names](https://docs.getdbt.com/blog/on-the-importance-of-naming), [documenting them](/community/resources/viewpoint#documentation), [ensuring data consumers can understand them](https://docs.getdbt.com/docs/best-practices#consider-the-information-architecture-of-your-data-warehouse), etc. -- and [`dbt run`](https://docs.getdbt.com/reference/commands/run) will make sure the database ends up in the right state. diff --git a/website/docs/terms/view.md b/website/docs/terms/view.md index 90cd5d1f36f..53c122ca9e6 100644 --- a/website/docs/terms/view.md +++ b/website/docs/terms/view.md @@ -33,4 +33,4 @@ You shouldn’t expect a view in itself to be your final destination in terms of ## Further reading -- [Best practices guide on choosing table vs view materializations](/guides/best-practices) +- [Best practices guide on choosing table vs view materializations](/best-practices) diff --git a/website/docusaurus.config.js b/website/docusaurus.config.js index ce81d614c65..ee593e568f4 100644 --- a/website/docusaurus.config.js +++ b/website/docusaurus.config.js @@ -130,12 +130,12 @@ var siteSettings = { href: 'https://courses.getdbt.com', }, { - label: 'Guides', - to: '/guides/best-practices', + label: 'Best Practices', + to: '/best-practices', }, { - label: "Quickstarts", - to: "/quickstarts", + label: "Guides", + to: "/guides", }, { label: "Developer Blog", diff --git a/website/package-lock.json b/website/package-lock.json index b15a903e97f..282056e5922 100644 --- a/website/package-lock.json +++ b/website/package-lock.json @@ -36,6 +36,7 @@ "react-dom": "^17.0.1", "react-full-screen": "^1.1.1", "react-is": "^18.1.0", + "react-select": "^5.7.5", "react-tooltip": "^4.2.21", "redoc": "^2.0.0-rc.57", "rehype-katex": "^5.0.0", @@ -3098,6 +3099,59 @@ "node": ">=12" } }, + "node_modules/@emotion/babel-plugin": { + "version": "11.11.0", + "resolved": "https://registry.npmjs.org/@emotion/babel-plugin/-/babel-plugin-11.11.0.tgz", + "integrity": "sha512-m4HEDZleaaCH+XgDDsPF15Ht6wTLsgDTeR3WYj9Q/k76JtWhrJjcP4+/XlG8LGT/Rol9qUfOIztXeA84ATpqPQ==", + "dependencies": { + "@babel/helper-module-imports": "^7.16.7", + "@babel/runtime": "^7.18.3", + "@emotion/hash": "^0.9.1", + "@emotion/memoize": "^0.8.1", + "@emotion/serialize": "^1.1.2", + "babel-plugin-macros": "^3.1.0", + "convert-source-map": "^1.5.0", + "escape-string-regexp": "^4.0.0", + "find-root": "^1.1.0", + "source-map": "^0.5.7", + "stylis": "4.2.0" + } + }, + "node_modules/@emotion/babel-plugin/node_modules/@emotion/memoize": { + "version": "0.8.1", + "resolved": "https://registry.npmjs.org/@emotion/memoize/-/memoize-0.8.1.tgz", + "integrity": "sha512-W2P2c/VRW1/1tLox0mVUalvnWXxavmv/Oum2aPsRcoDJuob75FC3Y8FbpfLwUegRcxINtGUMPq0tFCvYNTBXNA==" + }, + "node_modules/@emotion/babel-plugin/node_modules/source-map": { + "version": "0.5.7", + "resolved": "https://registry.npmjs.org/source-map/-/source-map-0.5.7.tgz", + "integrity": "sha512-LbrmJOMUSdEVxIKvdcJzQC+nQhe8FUZQTXQy6+I75skNgn3OoQ0DZA8YnFa7gp8tqtL3KPf1kmo0R5DoApeSGQ==", + "engines": { + "node": ">=0.10.0" + } + }, + 
"node_modules/@emotion/cache": { + "version": "11.11.0", + "resolved": "https://registry.npmjs.org/@emotion/cache/-/cache-11.11.0.tgz", + "integrity": "sha512-P34z9ssTCBi3e9EI1ZsWpNHcfY1r09ZO0rZbRO2ob3ZQMnFI35jB536qoXbkdesr5EUhYi22anuEJuyxifaqAQ==", + "dependencies": { + "@emotion/memoize": "^0.8.1", + "@emotion/sheet": "^1.2.2", + "@emotion/utils": "^1.2.1", + "@emotion/weak-memoize": "^0.3.1", + "stylis": "4.2.0" + } + }, + "node_modules/@emotion/cache/node_modules/@emotion/memoize": { + "version": "0.8.1", + "resolved": "https://registry.npmjs.org/@emotion/memoize/-/memoize-0.8.1.tgz", + "integrity": "sha512-W2P2c/VRW1/1tLox0mVUalvnWXxavmv/Oum2aPsRcoDJuob75FC3Y8FbpfLwUegRcxINtGUMPq0tFCvYNTBXNA==" + }, + "node_modules/@emotion/hash": { + "version": "0.9.1", + "resolved": "https://registry.npmjs.org/@emotion/hash/-/hash-0.9.1.tgz", + "integrity": "sha512-gJB6HLm5rYwSLI6PQa+X1t5CFGrv1J1TWG+sOyMCeKz2ojaj6Fnl/rZEspogG+cvqbt4AE/2eIyD2QfLKTBNlQ==" + }, "node_modules/@emotion/is-prop-valid": { "version": "0.8.8", "resolved": "https://registry.npmjs.org/@emotion/is-prop-valid/-/is-prop-valid-0.8.8.tgz", @@ -3111,6 +3165,56 @@ "resolved": "https://registry.npmjs.org/@emotion/memoize/-/memoize-0.7.4.tgz", "integrity": "sha512-Ja/Vfqe3HpuzRsG1oBtWTHk2PGZ7GR+2Vz5iYGelAw8dx32K0y7PjVuxK6z1nMpZOqAFsRUPCkK1YjJ56qJlgw==" }, + "node_modules/@emotion/react": { + "version": "11.11.1", + "resolved": "https://registry.npmjs.org/@emotion/react/-/react-11.11.1.tgz", + "integrity": "sha512-5mlW1DquU5HaxjLkfkGN1GA/fvVGdyHURRiX/0FHl2cfIfRxSOfmxEH5YS43edp0OldZrZ+dkBKbngxcNCdZvA==", + "dependencies": { + "@babel/runtime": "^7.18.3", + "@emotion/babel-plugin": "^11.11.0", + "@emotion/cache": "^11.11.0", + "@emotion/serialize": "^1.1.2", + "@emotion/use-insertion-effect-with-fallbacks": "^1.0.1", + "@emotion/utils": "^1.2.1", + "@emotion/weak-memoize": "^0.3.1", + "hoist-non-react-statics": "^3.3.1" + }, + "peerDependencies": { + "react": ">=16.8.0" + }, + "peerDependenciesMeta": { + "@types/react": { + "optional": true + } + } + }, + "node_modules/@emotion/serialize": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/@emotion/serialize/-/serialize-1.1.2.tgz", + "integrity": "sha512-zR6a/fkFP4EAcCMQtLOhIgpprZOwNmCldtpaISpvz348+DP4Mz8ZoKaGGCQpbzepNIUWbq4w6hNZkwDyKoS+HA==", + "dependencies": { + "@emotion/hash": "^0.9.1", + "@emotion/memoize": "^0.8.1", + "@emotion/unitless": "^0.8.1", + "@emotion/utils": "^1.2.1", + "csstype": "^3.0.2" + } + }, + "node_modules/@emotion/serialize/node_modules/@emotion/memoize": { + "version": "0.8.1", + "resolved": "https://registry.npmjs.org/@emotion/memoize/-/memoize-0.8.1.tgz", + "integrity": "sha512-W2P2c/VRW1/1tLox0mVUalvnWXxavmv/Oum2aPsRcoDJuob75FC3Y8FbpfLwUegRcxINtGUMPq0tFCvYNTBXNA==" + }, + "node_modules/@emotion/serialize/node_modules/@emotion/unitless": { + "version": "0.8.1", + "resolved": "https://registry.npmjs.org/@emotion/unitless/-/unitless-0.8.1.tgz", + "integrity": "sha512-KOEGMu6dmJZtpadb476IsZBclKvILjopjUii3V+7MnXIQCYh8W3NgNcgwo21n9LXZX6EDIKvqfjYxXebDwxKmQ==" + }, + "node_modules/@emotion/sheet": { + "version": "1.2.2", + "resolved": "https://registry.npmjs.org/@emotion/sheet/-/sheet-1.2.2.tgz", + "integrity": "sha512-0QBtGvaqtWi+nx6doRwDdBIzhNdZrXUppvTM4dtZZWEGTXL/XE/yJxLMGlDT1Gt+UHH5IX1n+jkXyytE/av7OA==" + }, "node_modules/@emotion/stylis": { "version": "0.8.5", "resolved": "https://registry.npmjs.org/@emotion/stylis/-/stylis-0.8.5.tgz", @@ -3121,6 +3225,24 @@ "resolved": "https://registry.npmjs.org/@emotion/unitless/-/unitless-0.7.5.tgz", 
"integrity": "sha512-OWORNpfjMsSSUBVrRBVGECkhWcULOAJz9ZW8uK9qgxD+87M7jHRcvh/A96XXNhXTLmKcoYSQtBEX7lHMO7YRwg==" }, + "node_modules/@emotion/use-insertion-effect-with-fallbacks": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/@emotion/use-insertion-effect-with-fallbacks/-/use-insertion-effect-with-fallbacks-1.0.1.tgz", + "integrity": "sha512-jT/qyKZ9rzLErtrjGgdkMBn2OP8wl0G3sQlBb3YPryvKHsjvINUhVaPFfP+fpBcOkmrVOVEEHQFJ7nbj2TH2gw==", + "peerDependencies": { + "react": ">=16.8.0" + } + }, + "node_modules/@emotion/utils": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/@emotion/utils/-/utils-1.2.1.tgz", + "integrity": "sha512-Y2tGf3I+XVnajdItskUCn6LX+VUDmP6lTL4fcqsXAv43dnlbZiuW4MWQW38rW/BVWSE7Q/7+XQocmpnRYILUmg==" + }, + "node_modules/@emotion/weak-memoize": { + "version": "0.3.1", + "resolved": "https://registry.npmjs.org/@emotion/weak-memoize/-/weak-memoize-0.3.1.tgz", + "integrity": "sha512-EsBwpc7hBUJWAsNPBmJy4hxWx12v6bshQsldrVmjxJoc3isbxhOrF2IcCpaXxfvq03NwkI7sbsOLXbYuqF/8Ww==" + }, "node_modules/@endiliey/react-ideal-image": { "version": "0.0.11", "resolved": "https://registry.npmjs.org/@endiliey/react-ideal-image/-/react-ideal-image-0.0.11.tgz", @@ -3204,6 +3326,28 @@ "resolved": "https://registry.npmjs.org/@faker-js/faker/-/faker-5.5.3.tgz", "integrity": "sha512-R11tGE6yIFwqpaIqcfkcg7AICXzFg14+5h5v0TfF/9+RMDL6jhzCy/pxHVOfbALGdtVYdt6JdR21tuxEgl34dw==" }, + "node_modules/@floating-ui/core": { + "version": "1.5.0", + "resolved": "https://registry.npmjs.org/@floating-ui/core/-/core-1.5.0.tgz", + "integrity": "sha512-kK1h4m36DQ0UHGj5Ah4db7R0rHemTqqO0QLvUqi1/mUUp3LuAWbWxdxSIf/XsnH9VS6rRVPLJCncjRzUvyCLXg==", + "dependencies": { + "@floating-ui/utils": "^0.1.3" + } + }, + "node_modules/@floating-ui/dom": { + "version": "1.5.3", + "resolved": "https://registry.npmjs.org/@floating-ui/dom/-/dom-1.5.3.tgz", + "integrity": "sha512-ClAbQnEqJAKCJOEbbLo5IUlZHkNszqhuxS4fHAVxRPXPya6Ysf2G8KypnYcOTpx6I8xcgF9bbHb6g/2KpbV8qA==", + "dependencies": { + "@floating-ui/core": "^1.4.2", + "@floating-ui/utils": "^0.1.3" + } + }, + "node_modules/@floating-ui/utils": { + "version": "0.1.4", + "resolved": "https://registry.npmjs.org/@floating-ui/utils/-/utils-0.1.4.tgz", + "integrity": "sha512-qprfWkn82Iw821mcKofJ5Pk9wgioHicxcQMxx+5zt5GSKoqdWvgG5AxVmpmUUjzTLPVSH5auBrhI93Deayn/DA==" + }, "node_modules/@fortawesome/fontawesome-common-types": { "version": "6.4.0", "resolved": "https://registry.npmjs.org/@fortawesome/fontawesome-common-types/-/fontawesome-common-types-6.4.0.tgz", @@ -6820,6 +6964,14 @@ "@types/react-router": "*" } }, + "node_modules/@types/react-transition-group": { + "version": "4.4.7", + "resolved": "https://registry.npmjs.org/@types/react-transition-group/-/react-transition-group-4.4.7.tgz", + "integrity": "sha512-ICCyBl5mvyqYp8Qeq9B5G/fyBSRC0zx3XM3sCC6KkcMsNeAHqXBKkmat4GqdJET5jtYUpZXrxI5flve5qhi2Eg==", + "dependencies": { + "@types/react": "*" + } + }, "node_modules/@types/resize-observer-browser": { "version": "0.1.7", "resolved": "https://registry.npmjs.org/@types/resize-observer-browser/-/resize-observer-browser-0.1.7.tgz", @@ -7981,6 +8133,20 @@ "node": "^10.13.0 || ^12.13.0 || ^14.15.0 || >=15.0.0" } }, + "node_modules/babel-plugin-macros": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/babel-plugin-macros/-/babel-plugin-macros-3.1.0.tgz", + "integrity": "sha512-Cg7TFGpIr01vOQNODXOOaGz2NpCU5gl8x1qJFbb6hbZxR7XrcE2vtbAsTAbJ7/xwJtUuJEw8K8Zr/AE0LHlesg==", + "dependencies": { + "@babel/runtime": "^7.12.5", + "cosmiconfig": "^7.0.0", + "resolve": 
"^1.19.0" + }, + "engines": { + "node": ">=10", + "npm": ">=6" + } + }, "node_modules/babel-plugin-polyfill-corejs2": { "version": "0.3.3", "resolved": "https://registry.npmjs.org/babel-plugin-polyfill-corejs2/-/babel-plugin-polyfill-corejs2-0.3.3.tgz", @@ -12111,6 +12277,11 @@ "url": "https://github.com/avajs/find-cache-dir?sponsor=1" } }, + "node_modules/find-root": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/find-root/-/find-root-1.1.0.tgz", + "integrity": "sha512-NKfW6bec6GfKc0SGx1e07QZY9PE99u0Bft/0rzSD5k3sO/vwkVUpDUKVm5Gpp5Ue3YfShPFTX2070tDs5kB9Ng==" + }, "node_modules/find-up": { "version": "6.3.0", "resolved": "https://registry.npmjs.org/find-up/-/find-up-6.3.0.tgz", @@ -14081,9 +14252,9 @@ } }, "node_modules/is-core-module": { - "version": "2.11.0", - "resolved": "https://registry.npmjs.org/is-core-module/-/is-core-module-2.11.0.tgz", - "integrity": "sha512-RRjxlvLDkD1YJwDbroBHMb+cukurkDWNyHx7D3oNB5x9rb5ogcksMC5wHCadcXoo67gVr/+3GFySh3134zi6rw==", + "version": "2.13.0", + "resolved": "https://registry.npmjs.org/is-core-module/-/is-core-module-2.13.0.tgz", + "integrity": "sha512-Z7dk6Qo8pOCp3l4tsX2C5ZVas4V+UxwQodwZhLopL91TX8UyyHEXafPcyoeeWuLrwzHcr3igO78wNLwHJHsMCQ==", "dependencies": { "has": "^1.0.3" }, @@ -17456,6 +17627,11 @@ "node": ">= 4.0.0" } }, + "node_modules/memoize-one": { + "version": "6.0.0", + "resolved": "https://registry.npmjs.org/memoize-one/-/memoize-one-6.0.0.tgz", + "integrity": "sha512-rkpe71W0N0c0Xz6QD0eJETuWAJGnJ9afsl1srmwPrI+yBCkge5EycXXbYRyvL29zZVUWQCY7InPRCv3GDXuZNw==" + }, "node_modules/merge-descriptors": { "version": "1.0.1", "resolved": "https://registry.npmjs.org/merge-descriptors/-/merge-descriptors-1.0.1.tgz", @@ -17898,9 +18074,15 @@ } }, "node_modules/nanoid": { - "version": "3.3.4", - "resolved": "https://registry.npmjs.org/nanoid/-/nanoid-3.3.4.tgz", - "integrity": "sha512-MqBkQh/OHTS2egovRtLk45wEyNXwF+cokD+1YPf9u5VfJiRdAiRwB2froX5Co9Rh20xs4siNPm8naNotSD6RBw==", + "version": "3.3.6", + "resolved": "https://registry.npmjs.org/nanoid/-/nanoid-3.3.6.tgz", + "integrity": "sha512-BGcqMMJuToF7i1rt+2PWSNVnWIkGCU78jBG3RxO/bZlnZPK2Cmi2QaffxGO/2RvWi9sL+FAiRiXMgsyxQ1DIDA==", + "funding": [ + { + "type": "github", + "url": "https://github.com/sponsors/ai" + } + ], "bin": { "nanoid": "bin/nanoid.cjs" }, @@ -19034,9 +19216,9 @@ } }, "node_modules/postcss": { - "version": "8.4.21", - "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.4.21.tgz", - "integrity": "sha512-tP7u/Sn/dVxK2NnruI4H9BG+x+Wxz6oeZ1cJ8P6G/PZY0IKk4k/63TDsQf2kQq3+qoJeLm2kIBUNlZe3zgb4Zg==", + "version": "8.4.31", + "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.4.31.tgz", + "integrity": "sha512-PS08Iboia9mts/2ygV3eLpY5ghnUcfLV/EXTOW1E2qYxJKGGBUtNjN76FYHnMs36RmARn41bC0AZmn+rR0OVpQ==", "funding": [ { "type": "opencollective", @@ -19045,10 +19227,14 @@ { "type": "tidelift", "url": "https://tidelift.com/funding/github/npm/postcss" + }, + { + "type": "github", + "url": "https://github.com/sponsors/ai" } ], "dependencies": { - "nanoid": "^3.3.4", + "nanoid": "^3.3.6", "picocolors": "^1.0.0", "source-map-js": "^1.0.2" }, @@ -20644,6 +20830,26 @@ "resolved": "https://registry.npmjs.org/react-is/-/react-is-16.13.1.tgz", "integrity": "sha512-24e6ynE2H+OKt4kqsOvNd8kBpV65zoxbA4BVsEOB3ARVWQki/DHzaUoC5KuON/BiccDaCCTZBuOcfZs70kR8bQ==" }, + "node_modules/react-select": { + "version": "5.7.5", + "resolved": "https://registry.npmjs.org/react-select/-/react-select-5.7.5.tgz", + "integrity": 
"sha512-jgYZa2xgKP0DVn5GZk7tZwbRx7kaVz1VqU41S8z1KWmshRDhlrpKS0w80aS1RaK5bVIXpttgSou7XCjWw1ncKA==", + "dependencies": { + "@babel/runtime": "^7.12.0", + "@emotion/cache": "^11.4.0", + "@emotion/react": "^11.8.1", + "@floating-ui/dom": "^1.0.1", + "@types/react-transition-group": "^4.4.0", + "memoize-one": "^6.0.0", + "prop-types": "^15.6.0", + "react-transition-group": "^4.3.0", + "use-isomorphic-layout-effect": "^1.1.2" + }, + "peerDependencies": { + "react": "^16.8.0 || ^17.0.0 || ^18.0.0", + "react-dom": "^16.8.0 || ^17.0.0 || ^18.0.0" + } + }, "node_modules/react-tabs": { "version": "3.2.3", "resolved": "https://registry.npmjs.org/react-tabs/-/react-tabs-3.2.3.tgz", @@ -20696,6 +20902,30 @@ "uuid": "dist/bin/uuid" } }, + "node_modules/react-transition-group": { + "version": "4.4.5", + "resolved": "https://registry.npmjs.org/react-transition-group/-/react-transition-group-4.4.5.tgz", + "integrity": "sha512-pZcd1MCJoiKiBR2NRxeCRg13uCXbydPnmB4EOeRrY7480qNWO8IIgQG6zlDkm6uRMsURXPuKq0GWtiM59a5Q6g==", + "dependencies": { + "@babel/runtime": "^7.5.5", + "dom-helpers": "^5.0.1", + "loose-envify": "^1.4.0", + "prop-types": "^15.6.2" + }, + "peerDependencies": { + "react": ">=16.6.0", + "react-dom": ">=16.6.0" + } + }, + "node_modules/react-transition-group/node_modules/dom-helpers": { + "version": "5.2.1", + "resolved": "https://registry.npmjs.org/dom-helpers/-/dom-helpers-5.2.1.tgz", + "integrity": "sha512-nRCa7CK3VTrM2NmGkIy4cbK7IZlgBE/PYMn55rrXefr5xXDP0LdtfPnblFDoVdcAfslJ7or6iqAUnx0CCGIWQA==", + "dependencies": { + "@babel/runtime": "^7.8.7", + "csstype": "^3.0.2" + } + }, "node_modules/react-universal-interface": { "version": "0.6.2", "resolved": "https://registry.npmjs.org/react-universal-interface/-/react-universal-interface-0.6.2.tgz", @@ -21378,11 +21608,11 @@ "integrity": "sha512-LwZrotdHOo12nQuZlHEmtuXdqGoOD0OhaxopaNFxWzInpEgaLWoVuAMbTzixuosCx2nEG58ngzW3vxdWoxIgdg==" }, "node_modules/resolve": { - "version": "1.22.1", - "resolved": "https://registry.npmjs.org/resolve/-/resolve-1.22.1.tgz", - "integrity": "sha512-nBpuuYuY5jFsli/JIs1oldw6fOQCBioohqWZg/2hiaOybXOft4lonv85uDOKXdf8rhyK159cxU5cDcK/NKk8zw==", + "version": "1.22.6", + "resolved": "https://registry.npmjs.org/resolve/-/resolve-1.22.6.tgz", + "integrity": "sha512-njhxM7mV12JfufShqGy3Rz8j11RPdLy4xi15UurGJeoHLfJpVXKdh3ueuOqbYUcDZnffr6X739JBo5LzyahEsw==", "dependencies": { - "is-core-module": "^2.9.0", + "is-core-module": "^2.13.0", "path-parse": "^1.0.7", "supports-preserve-symlinks-flag": "^1.0.0" }, @@ -27423,6 +27653,60 @@ "tslib": "^2.4.0" } }, + "@emotion/babel-plugin": { + "version": "11.11.0", + "resolved": "https://registry.npmjs.org/@emotion/babel-plugin/-/babel-plugin-11.11.0.tgz", + "integrity": "sha512-m4HEDZleaaCH+XgDDsPF15Ht6wTLsgDTeR3WYj9Q/k76JtWhrJjcP4+/XlG8LGT/Rol9qUfOIztXeA84ATpqPQ==", + "requires": { + "@babel/helper-module-imports": "^7.16.7", + "@babel/runtime": "^7.18.3", + "@emotion/hash": "^0.9.1", + "@emotion/memoize": "^0.8.1", + "@emotion/serialize": "^1.1.2", + "babel-plugin-macros": "^3.1.0", + "convert-source-map": "^1.5.0", + "escape-string-regexp": "^4.0.0", + "find-root": "^1.1.0", + "source-map": "^0.5.7", + "stylis": "4.2.0" + }, + "dependencies": { + "@emotion/memoize": { + "version": "0.8.1", + "resolved": "https://registry.npmjs.org/@emotion/memoize/-/memoize-0.8.1.tgz", + "integrity": "sha512-W2P2c/VRW1/1tLox0mVUalvnWXxavmv/Oum2aPsRcoDJuob75FC3Y8FbpfLwUegRcxINtGUMPq0tFCvYNTBXNA==" + }, + "source-map": { + "version": "0.5.7", + "resolved": 
"https://registry.npmjs.org/source-map/-/source-map-0.5.7.tgz", + "integrity": "sha512-LbrmJOMUSdEVxIKvdcJzQC+nQhe8FUZQTXQy6+I75skNgn3OoQ0DZA8YnFa7gp8tqtL3KPf1kmo0R5DoApeSGQ==" + } + } + }, + "@emotion/cache": { + "version": "11.11.0", + "resolved": "https://registry.npmjs.org/@emotion/cache/-/cache-11.11.0.tgz", + "integrity": "sha512-P34z9ssTCBi3e9EI1ZsWpNHcfY1r09ZO0rZbRO2ob3ZQMnFI35jB536qoXbkdesr5EUhYi22anuEJuyxifaqAQ==", + "requires": { + "@emotion/memoize": "^0.8.1", + "@emotion/sheet": "^1.2.2", + "@emotion/utils": "^1.2.1", + "@emotion/weak-memoize": "^0.3.1", + "stylis": "4.2.0" + }, + "dependencies": { + "@emotion/memoize": { + "version": "0.8.1", + "resolved": "https://registry.npmjs.org/@emotion/memoize/-/memoize-0.8.1.tgz", + "integrity": "sha512-W2P2c/VRW1/1tLox0mVUalvnWXxavmv/Oum2aPsRcoDJuob75FC3Y8FbpfLwUegRcxINtGUMPq0tFCvYNTBXNA==" + } + } + }, + "@emotion/hash": { + "version": "0.9.1", + "resolved": "https://registry.npmjs.org/@emotion/hash/-/hash-0.9.1.tgz", + "integrity": "sha512-gJB6HLm5rYwSLI6PQa+X1t5CFGrv1J1TWG+sOyMCeKz2ojaj6Fnl/rZEspogG+cvqbt4AE/2eIyD2QfLKTBNlQ==" + }, "@emotion/is-prop-valid": { "version": "0.8.8", "resolved": "https://registry.npmjs.org/@emotion/is-prop-valid/-/is-prop-valid-0.8.8.tgz", @@ -27436,6 +27720,50 @@ "resolved": "https://registry.npmjs.org/@emotion/memoize/-/memoize-0.7.4.tgz", "integrity": "sha512-Ja/Vfqe3HpuzRsG1oBtWTHk2PGZ7GR+2Vz5iYGelAw8dx32K0y7PjVuxK6z1nMpZOqAFsRUPCkK1YjJ56qJlgw==" }, + "@emotion/react": { + "version": "11.11.1", + "resolved": "https://registry.npmjs.org/@emotion/react/-/react-11.11.1.tgz", + "integrity": "sha512-5mlW1DquU5HaxjLkfkGN1GA/fvVGdyHURRiX/0FHl2cfIfRxSOfmxEH5YS43edp0OldZrZ+dkBKbngxcNCdZvA==", + "requires": { + "@babel/runtime": "^7.18.3", + "@emotion/babel-plugin": "^11.11.0", + "@emotion/cache": "^11.11.0", + "@emotion/serialize": "^1.1.2", + "@emotion/use-insertion-effect-with-fallbacks": "^1.0.1", + "@emotion/utils": "^1.2.1", + "@emotion/weak-memoize": "^0.3.1", + "hoist-non-react-statics": "^3.3.1" + } + }, + "@emotion/serialize": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/@emotion/serialize/-/serialize-1.1.2.tgz", + "integrity": "sha512-zR6a/fkFP4EAcCMQtLOhIgpprZOwNmCldtpaISpvz348+DP4Mz8ZoKaGGCQpbzepNIUWbq4w6hNZkwDyKoS+HA==", + "requires": { + "@emotion/hash": "^0.9.1", + "@emotion/memoize": "^0.8.1", + "@emotion/unitless": "^0.8.1", + "@emotion/utils": "^1.2.1", + "csstype": "^3.0.2" + }, + "dependencies": { + "@emotion/memoize": { + "version": "0.8.1", + "resolved": "https://registry.npmjs.org/@emotion/memoize/-/memoize-0.8.1.tgz", + "integrity": "sha512-W2P2c/VRW1/1tLox0mVUalvnWXxavmv/Oum2aPsRcoDJuob75FC3Y8FbpfLwUegRcxINtGUMPq0tFCvYNTBXNA==" + }, + "@emotion/unitless": { + "version": "0.8.1", + "resolved": "https://registry.npmjs.org/@emotion/unitless/-/unitless-0.8.1.tgz", + "integrity": "sha512-KOEGMu6dmJZtpadb476IsZBclKvILjopjUii3V+7MnXIQCYh8W3NgNcgwo21n9LXZX6EDIKvqfjYxXebDwxKmQ==" + } + } + }, + "@emotion/sheet": { + "version": "1.2.2", + "resolved": "https://registry.npmjs.org/@emotion/sheet/-/sheet-1.2.2.tgz", + "integrity": "sha512-0QBtGvaqtWi+nx6doRwDdBIzhNdZrXUppvTM4dtZZWEGTXL/XE/yJxLMGlDT1Gt+UHH5IX1n+jkXyytE/av7OA==" + }, "@emotion/stylis": { "version": "0.8.5", "resolved": "https://registry.npmjs.org/@emotion/stylis/-/stylis-0.8.5.tgz", @@ -27446,6 +27774,22 @@ "resolved": "https://registry.npmjs.org/@emotion/unitless/-/unitless-0.7.5.tgz", "integrity": "sha512-OWORNpfjMsSSUBVrRBVGECkhWcULOAJz9ZW8uK9qgxD+87M7jHRcvh/A96XXNhXTLmKcoYSQtBEX7lHMO7YRwg==" }, + 
"@emotion/use-insertion-effect-with-fallbacks": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/@emotion/use-insertion-effect-with-fallbacks/-/use-insertion-effect-with-fallbacks-1.0.1.tgz", + "integrity": "sha512-jT/qyKZ9rzLErtrjGgdkMBn2OP8wl0G3sQlBb3YPryvKHsjvINUhVaPFfP+fpBcOkmrVOVEEHQFJ7nbj2TH2gw==", + "requires": {} + }, + "@emotion/utils": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/@emotion/utils/-/utils-1.2.1.tgz", + "integrity": "sha512-Y2tGf3I+XVnajdItskUCn6LX+VUDmP6lTL4fcqsXAv43dnlbZiuW4MWQW38rW/BVWSE7Q/7+XQocmpnRYILUmg==" + }, + "@emotion/weak-memoize": { + "version": "0.3.1", + "resolved": "https://registry.npmjs.org/@emotion/weak-memoize/-/weak-memoize-0.3.1.tgz", + "integrity": "sha512-EsBwpc7hBUJWAsNPBmJy4hxWx12v6bshQsldrVmjxJoc3isbxhOrF2IcCpaXxfvq03NwkI7sbsOLXbYuqF/8Ww==" + }, "@endiliey/react-ideal-image": { "version": "0.0.11", "resolved": "https://registry.npmjs.org/@endiliey/react-ideal-image/-/react-ideal-image-0.0.11.tgz", @@ -27502,6 +27846,28 @@ "resolved": "https://registry.npmjs.org/@faker-js/faker/-/faker-5.5.3.tgz", "integrity": "sha512-R11tGE6yIFwqpaIqcfkcg7AICXzFg14+5h5v0TfF/9+RMDL6jhzCy/pxHVOfbALGdtVYdt6JdR21tuxEgl34dw==" }, + "@floating-ui/core": { + "version": "1.5.0", + "resolved": "https://registry.npmjs.org/@floating-ui/core/-/core-1.5.0.tgz", + "integrity": "sha512-kK1h4m36DQ0UHGj5Ah4db7R0rHemTqqO0QLvUqi1/mUUp3LuAWbWxdxSIf/XsnH9VS6rRVPLJCncjRzUvyCLXg==", + "requires": { + "@floating-ui/utils": "^0.1.3" + } + }, + "@floating-ui/dom": { + "version": "1.5.3", + "resolved": "https://registry.npmjs.org/@floating-ui/dom/-/dom-1.5.3.tgz", + "integrity": "sha512-ClAbQnEqJAKCJOEbbLo5IUlZHkNszqhuxS4fHAVxRPXPya6Ysf2G8KypnYcOTpx6I8xcgF9bbHb6g/2KpbV8qA==", + "requires": { + "@floating-ui/core": "^1.4.2", + "@floating-ui/utils": "^0.1.3" + } + }, + "@floating-ui/utils": { + "version": "0.1.4", + "resolved": "https://registry.npmjs.org/@floating-ui/utils/-/utils-0.1.4.tgz", + "integrity": "sha512-qprfWkn82Iw821mcKofJ5Pk9wgioHicxcQMxx+5zt5GSKoqdWvgG5AxVmpmUUjzTLPVSH5auBrhI93Deayn/DA==" + }, "@fortawesome/fontawesome-common-types": { "version": "6.4.0", "resolved": "https://registry.npmjs.org/@fortawesome/fontawesome-common-types/-/fontawesome-common-types-6.4.0.tgz", @@ -30317,6 +30683,14 @@ "@types/react-router": "*" } }, + "@types/react-transition-group": { + "version": "4.4.7", + "resolved": "https://registry.npmjs.org/@types/react-transition-group/-/react-transition-group-4.4.7.tgz", + "integrity": "sha512-ICCyBl5mvyqYp8Qeq9B5G/fyBSRC0zx3XM3sCC6KkcMsNeAHqXBKkmat4GqdJET5jtYUpZXrxI5flve5qhi2Eg==", + "requires": { + "@types/react": "*" + } + }, "@types/resize-observer-browser": { "version": "0.1.7", "resolved": "https://registry.npmjs.org/@types/resize-observer-browser/-/resize-observer-browser-0.1.7.tgz", @@ -31224,6 +31598,16 @@ "@types/babel__traverse": "^7.0.6" } }, + "babel-plugin-macros": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/babel-plugin-macros/-/babel-plugin-macros-3.1.0.tgz", + "integrity": "sha512-Cg7TFGpIr01vOQNODXOOaGz2NpCU5gl8x1qJFbb6hbZxR7XrcE2vtbAsTAbJ7/xwJtUuJEw8K8Zr/AE0LHlesg==", + "requires": { + "@babel/runtime": "^7.12.5", + "cosmiconfig": "^7.0.0", + "resolve": "^1.19.0" + } + }, "babel-plugin-polyfill-corejs2": { "version": "0.3.3", "resolved": "https://registry.npmjs.org/babel-plugin-polyfill-corejs2/-/babel-plugin-polyfill-corejs2-0.3.3.tgz", @@ -34367,6 +34751,11 @@ "pkg-dir": "^4.1.0" } }, + "find-root": { + "version": "1.1.0", + "resolved": 
"https://registry.npmjs.org/find-root/-/find-root-1.1.0.tgz", + "integrity": "sha512-NKfW6bec6GfKc0SGx1e07QZY9PE99u0Bft/0rzSD5k3sO/vwkVUpDUKVm5Gpp5Ue3YfShPFTX2070tDs5kB9Ng==" + }, "find-up": { "version": "6.3.0", "resolved": "https://registry.npmjs.org/find-up/-/find-up-6.3.0.tgz", @@ -35795,9 +36184,9 @@ } }, "is-core-module": { - "version": "2.11.0", - "resolved": "https://registry.npmjs.org/is-core-module/-/is-core-module-2.11.0.tgz", - "integrity": "sha512-RRjxlvLDkD1YJwDbroBHMb+cukurkDWNyHx7D3oNB5x9rb5ogcksMC5wHCadcXoo67gVr/+3GFySh3134zi6rw==", + "version": "2.13.0", + "resolved": "https://registry.npmjs.org/is-core-module/-/is-core-module-2.13.0.tgz", + "integrity": "sha512-Z7dk6Qo8pOCp3l4tsX2C5ZVas4V+UxwQodwZhLopL91TX8UyyHEXafPcyoeeWuLrwzHcr3igO78wNLwHJHsMCQ==", "requires": { "has": "^1.0.3" } @@ -38307,6 +38696,11 @@ "fs-monkey": "^1.0.3" } }, + "memoize-one": { + "version": "6.0.0", + "resolved": "https://registry.npmjs.org/memoize-one/-/memoize-one-6.0.0.tgz", + "integrity": "sha512-rkpe71W0N0c0Xz6QD0eJETuWAJGnJ9afsl1srmwPrI+yBCkge5EycXXbYRyvL29zZVUWQCY7InPRCv3GDXuZNw==" + }, "merge-descriptors": { "version": "1.0.1", "resolved": "https://registry.npmjs.org/merge-descriptors/-/merge-descriptors-1.0.1.tgz", @@ -38605,9 +38999,9 @@ } }, "nanoid": { - "version": "3.3.4", - "resolved": "https://registry.npmjs.org/nanoid/-/nanoid-3.3.4.tgz", - "integrity": "sha512-MqBkQh/OHTS2egovRtLk45wEyNXwF+cokD+1YPf9u5VfJiRdAiRwB2froX5Co9Rh20xs4siNPm8naNotSD6RBw==" + "version": "3.3.6", + "resolved": "https://registry.npmjs.org/nanoid/-/nanoid-3.3.6.tgz", + "integrity": "sha512-BGcqMMJuToF7i1rt+2PWSNVnWIkGCU78jBG3RxO/bZlnZPK2Cmi2QaffxGO/2RvWi9sL+FAiRiXMgsyxQ1DIDA==" }, "napi-build-utils": { "version": "1.0.2", @@ -39455,11 +39849,11 @@ } }, "postcss": { - "version": "8.4.21", - "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.4.21.tgz", - "integrity": "sha512-tP7u/Sn/dVxK2NnruI4H9BG+x+Wxz6oeZ1cJ8P6G/PZY0IKk4k/63TDsQf2kQq3+qoJeLm2kIBUNlZe3zgb4Zg==", + "version": "8.4.31", + "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.4.31.tgz", + "integrity": "sha512-PS08Iboia9mts/2ygV3eLpY5ghnUcfLV/EXTOW1E2qYxJKGGBUtNjN76FYHnMs36RmARn41bC0AZmn+rR0OVpQ==", "requires": { - "nanoid": "^3.3.4", + "nanoid": "^3.3.6", "picocolors": "^1.0.0", "source-map-js": "^1.0.2" } @@ -40579,6 +40973,22 @@ "prop-types": "^15.7.2" } }, + "react-select": { + "version": "5.7.5", + "resolved": "https://registry.npmjs.org/react-select/-/react-select-5.7.5.tgz", + "integrity": "sha512-jgYZa2xgKP0DVn5GZk7tZwbRx7kaVz1VqU41S8z1KWmshRDhlrpKS0w80aS1RaK5bVIXpttgSou7XCjWw1ncKA==", + "requires": { + "@babel/runtime": "^7.12.0", + "@emotion/cache": "^11.4.0", + "@emotion/react": "^11.8.1", + "@floating-ui/dom": "^1.0.1", + "@types/react-transition-group": "^4.4.0", + "memoize-one": "^6.0.0", + "prop-types": "^15.6.0", + "react-transition-group": "^4.3.0", + "use-isomorphic-layout-effect": "^1.1.2" + } + }, "react-tabs": { "version": "3.2.3", "resolved": "https://registry.npmjs.org/react-tabs/-/react-tabs-3.2.3.tgz", @@ -40614,6 +41024,28 @@ } } }, + "react-transition-group": { + "version": "4.4.5", + "resolved": "https://registry.npmjs.org/react-transition-group/-/react-transition-group-4.4.5.tgz", + "integrity": "sha512-pZcd1MCJoiKiBR2NRxeCRg13uCXbydPnmB4EOeRrY7480qNWO8IIgQG6zlDkm6uRMsURXPuKq0GWtiM59a5Q6g==", + "requires": { + "@babel/runtime": "^7.5.5", + "dom-helpers": "^5.0.1", + "loose-envify": "^1.4.0", + "prop-types": "^15.6.2" + }, + "dependencies": { + "dom-helpers": { + "version": "5.2.1", + "resolved": 
"https://registry.npmjs.org/dom-helpers/-/dom-helpers-5.2.1.tgz", + "integrity": "sha512-nRCa7CK3VTrM2NmGkIy4cbK7IZlgBE/PYMn55rrXefr5xXDP0LdtfPnblFDoVdcAfslJ7or6iqAUnx0CCGIWQA==", + "requires": { + "@babel/runtime": "^7.8.7", + "csstype": "^3.0.2" + } + } + } + }, "react-universal-interface": { "version": "0.6.2", "resolved": "https://registry.npmjs.org/react-universal-interface/-/react-universal-interface-0.6.2.tgz", @@ -41126,11 +41558,11 @@ "integrity": "sha512-LwZrotdHOo12nQuZlHEmtuXdqGoOD0OhaxopaNFxWzInpEgaLWoVuAMbTzixuosCx2nEG58ngzW3vxdWoxIgdg==" }, "resolve": { - "version": "1.22.1", - "resolved": "https://registry.npmjs.org/resolve/-/resolve-1.22.1.tgz", - "integrity": "sha512-nBpuuYuY5jFsli/JIs1oldw6fOQCBioohqWZg/2hiaOybXOft4lonv85uDOKXdf8rhyK159cxU5cDcK/NKk8zw==", + "version": "1.22.6", + "resolved": "https://registry.npmjs.org/resolve/-/resolve-1.22.6.tgz", + "integrity": "sha512-njhxM7mV12JfufShqGy3Rz8j11RPdLy4xi15UurGJeoHLfJpVXKdh3ueuOqbYUcDZnffr6X739JBo5LzyahEsw==", "requires": { - "is-core-module": "^2.9.0", + "is-core-module": "^2.13.0", "path-parse": "^1.0.7", "supports-preserve-symlinks-flag": "^1.0.0" } diff --git a/website/package.json b/website/package.json index afb7a9b1cd4..b0105102359 100644 --- a/website/package.json +++ b/website/package.json @@ -39,6 +39,7 @@ "react-dom": "^17.0.1", "react-full-screen": "^1.1.1", "react-is": "^18.1.0", + "react-select": "^5.7.5", "react-tooltip": "^4.2.21", "redoc": "^2.0.0-rc.57", "rehype-katex": "^5.0.0", diff --git a/website/plugins/buildQuickstartIndexPage/index.js b/website/plugins/buildQuickstartIndexPage/index.js index 4724478883a..368a717a6a5 100644 --- a/website/plugins/buildQuickstartIndexPage/index.js +++ b/website/plugins/buildQuickstartIndexPage/index.js @@ -6,10 +6,13 @@ module.exports = function buildQuickstartIndexPage() { name: 'docusaurus-build-quickstart-index-page-plugin', async loadContent() { // Quickstart files directory - const quickstartDirectory = 'docs/quickstarts' + const quickstartDirectory = 'docs/guides' // Get all Quickstart files and content - const quickstartFiles = fs.readdirSync(quickstartDirectory) + const quickstartFiles = fs.readdirSync(quickstartDirectory, { withFileTypes: true }) + .filter(dirent => dirent.isFile()) + .map(dirent => dirent.name) + const quickstartData = quickstartFiles.reduce((arr, quickstartFile) => { const fileData = fs.readFileSync( @@ -19,8 +22,12 @@ module.exports = function buildQuickstartIndexPage() { if(!fileData) return null - // convert frontmatter to json + // Convert frontmatter to json const fileJson = matter(fileData) + + // Add the original directory to build links + fileJson.data.original_directory = quickstartDirectory.replace('docs/', '') + if(!fileJson) return null @@ -35,7 +42,7 @@ module.exports = function buildQuickstartIndexPage() { async contentLoaded({content, actions}) { const {createData, addRoute} = actions; - // Sort quickstarts by platform if available + // Sort guides by platform if available const contentSorted = content.sort((a, b) => { if(!a?.data?.platform || !b?.data?.platform) return @@ -53,7 +60,7 @@ module.exports = function buildQuickstartIndexPage() { // Build the quickstart index page addRoute({ - path: `/quickstarts`, + path: `/guides`, component: '@site/src/components/quickstartGuideList/index.js', modules: { // propName -> JSON file path diff --git a/website/sidebars.js b/website/sidebars.js index 8920a7180d4..66ba731fb1b 100644 --- a/website/sidebars.js +++ b/website/sidebars.js @@ -29,8 +29,8 @@ const sidebarSettings = { }, // 
About dbt Cloud directory { type: "link", - label: "Quickstarts", - href: `/quickstarts`, + label: "Guides", + href: `/guides`, }, { type: "category", @@ -963,7 +963,7 @@ const sidebarSettings = { ], }, ], - guides: [ + bestpractices: [ { type: "category", label: "Best practices", @@ -972,7 +972,7 @@ const sidebarSettings = { title: "Best practice guides", description: "Learn how dbt Labs approaches building projects through our current viewpoints on structure, style, and setup.", - slug: "/guides/best-practices", + slug: "best-practices", }, items: [ { @@ -980,14 +980,14 @@ const sidebarSettings = { label: "How we structure our dbt projects", link: { type: "doc", - id: "guides/best-practices/how-we-structure/1-guide-overview", + id: "best-practices/how-we-structure/1-guide-overview", }, items: [ - "guides/best-practices/how-we-structure/2-staging", - "guides/best-practices/how-we-structure/3-intermediate", - "guides/best-practices/how-we-structure/4-marts", - "guides/best-practices/how-we-structure/5-semantic-layer-marts", - "guides/best-practices/how-we-structure/6-the-rest-of-the-project", + "best-practices/how-we-structure/2-staging", + "best-practices/how-we-structure/3-intermediate", + "best-practices/how-we-structure/4-marts", + "best-practices/how-we-structure/5-semantic-layer-marts", + "best-practices/how-we-structure/6-the-rest-of-the-project", ], }, { @@ -995,15 +995,15 @@ const sidebarSettings = { label: "How we style our dbt projects", link: { type: "doc", - id: "guides/best-practices/how-we-style/0-how-we-style-our-dbt-projects", + id: "best-practices/how-we-style/0-how-we-style-our-dbt-projects", }, items: [ - "guides/best-practices/how-we-style/1-how-we-style-our-dbt-models", - "guides/best-practices/how-we-style/2-how-we-style-our-sql", - "guides/best-practices/how-we-style/3-how-we-style-our-python", - "guides/best-practices/how-we-style/4-how-we-style-our-jinja", - "guides/best-practices/how-we-style/5-how-we-style-our-yaml", - "guides/best-practices/how-we-style/6-how-we-style-conclusion", + "best-practices/how-we-style/1-how-we-style-our-dbt-models", + "best-practices/how-we-style/2-how-we-style-our-sql", + "best-practices/how-we-style/3-how-we-style-our-python", + "best-practices/how-we-style/4-how-we-style-our-jinja", + "best-practices/how-we-style/5-how-we-style-our-yaml", + "best-practices/how-we-style/6-how-we-style-conclusion", ], }, { @@ -1011,15 +1011,14 @@ const sidebarSettings = { label: "How we build our metrics", link: { type: "doc", - id: "guides/best-practices/how-we-build-our-metrics/semantic-layer-1-intro", + id: "best-practices/how-we-build-our-metrics/semantic-layer-1-intro", }, items: [ - "guides/best-practices/how-we-build-our-metrics/semantic-layer-2-setup", - "guides/best-practices/how-we-build-our-metrics/semantic-layer-3-build-semantic-models", - "guides/best-practices/how-we-build-our-metrics/semantic-layer-4-build-metrics", - "guides/best-practices/how-we-build-our-metrics/semantic-layer-5-refactor-a-mart", - "guides/best-practices/how-we-build-our-metrics/semantic-layer-6-advanced-metrics", - "guides/best-practices/how-we-build-our-metrics/semantic-layer-7-conclusion", + "best-practices/how-we-build-our-metrics/semantic-layer-3-build-semantic-models", + "best-practices/how-we-build-our-metrics/semantic-layer-4-build-metrics", + "best-practices/how-we-build-our-metrics/semantic-layer-5-refactor-a-mart", + "best-practices/how-we-build-our-metrics/semantic-layer-6-advanced-metrics", + 
"best-practices/how-we-build-our-metrics/semantic-layer-7-conclusion", ], }, { @@ -1027,11 +1026,11 @@ const sidebarSettings = { label: "How we build our dbt Mesh projects", link: { type: "doc", - id: "guides/best-practices/how-we-mesh/mesh-1-intro", + id: "best-practices/how-we-mesh/mesh-1-intro", }, items: [ - "guides/best-practices/how-we-mesh/mesh-2-structures", - "guides/best-practices/how-we-mesh/mesh-3-implementation", + "best-practices/how-we-mesh/mesh-2-structures", + "best-practices/how-we-mesh/mesh-3-implementation", ], }, { @@ -1039,226 +1038,20 @@ const sidebarSettings = { label: "Materialization best practices", link: { type: "doc", - id: "guides/best-practices/materializations/materializations-guide-1-guide-overview", + id: "best-practices/materializations/materializations-guide-1-guide-overview", }, items: [ - "guides/best-practices/materializations/materializations-guide-2-available-materializations", - "guides/best-practices/materializations/materializations-guide-3-configuring-materializations", - "guides/best-practices/materializations/materializations-guide-4-incremental-models", - "guides/best-practices/materializations/materializations-guide-5-best-practices", - "guides/best-practices/materializations/materializations-guide-6-examining-builds", - "guides/best-practices/materializations/materializations-guide-7-conclusion", + "best-practices/materializations/materializations-guide-2-available-materializations", + "best-practices/materializations/materializations-guide-3-configuring-materializations", + "best-practices/materializations/materializations-guide-4-incremental-models", + "best-practices/materializations/materializations-guide-5-best-practices", + "best-practices/materializations/materializations-guide-6-examining-builds", + "best-practices/materializations/materializations-guide-7-conclusion", ], }, - "guides/best-practices/debugging-errors", - "guides/best-practices/writing-custom-generic-tests", - ], - }, - { - type: "category", - label: "Orchestration", - link: { - type: "generated-index", - title: "Orchestration guides", - description: - "Learn how to orchestrate your data transformations in dbt, using dbt Cloud, a variety of popular tools, or both working together.", - slug: "/guides/orchestration", - }, - items: [ - { - type: "category", - label: "Airflow and dbt Cloud", - link: { - type: "doc", - id: "guides/orchestration/airflow-and-dbt-cloud/1-airflow-and-dbt-cloud", - }, - items: [ - "guides/orchestration/airflow-and-dbt-cloud/2-setting-up-airflow-and-dbt-cloud", - "guides/orchestration/airflow-and-dbt-cloud/3-running-airflow-and-dbt-cloud", - "guides/orchestration/airflow-and-dbt-cloud/4-airflow-and-dbt-cloud-faqs", - ], - }, - { - type: "category", - label: "Set up Continuous Integration", - link: { - type: "doc", - id: "guides/orchestration/set-up-ci/introduction", - }, - items: [ - "guides/orchestration/set-up-ci/quick-setup", - "guides/orchestration/set-up-ci/run-dbt-project-evaluator", - "guides/orchestration/set-up-ci/lint-on-push", - "guides/orchestration/set-up-ci/multiple-checks", - ], - }, - { - type: "category", - label: "Custom Continuous Deployment Workflows", - link: { - type: "doc", - id: "guides/orchestration/custom-cicd-pipelines/1-cicd-background", - }, - items: [ - "guides/orchestration/custom-cicd-pipelines/3-dbt-cloud-job-on-merge", - "guides/orchestration/custom-cicd-pipelines/4-dbt-cloud-job-on-pr", - "guides/orchestration/custom-cicd-pipelines/5-something-to-consider", - ], - }, - { - type: "category", - label: "Webhooks 
with dbt Cloud and SaaS apps", - link: { - type: "generated-index", - title: "Use dbt Cloud's webhooks with other SaaS apps", - description: - "Learn how to use webhooks to trigger actions in other tools by using Zapier or a serverless platform.", - slug: "/guides/orchestration/webhooks", - }, - items: [ - { - type: "autogenerated", - dirName: "guides/orchestration/webhooks", - }, - ], - }, - "guides/orchestration/how-to-use-databricks-workflows-to-run-dbt-cloud-jobs", - ], - }, - { - type: "category", - label: "Migration", - items: [ - "guides/migration/sl-migration", - { - type: "category", - label: "Versions", - items: [ - "docs/dbt-versions/core-upgrade/upgrading-to-v1.7", - "docs/dbt-versions/core-upgrade/upgrading-to-v1.6", - "docs/dbt-versions/core-upgrade/upgrading-to-v1.5", - "docs/dbt-versions/core-upgrade/upgrading-to-v1.4", - "docs/dbt-versions/core-upgrade/upgrading-to-v1.3", - "docs/dbt-versions/core-upgrade/upgrading-to-v1.2", - "docs/dbt-versions/core-upgrade/upgrading-to-v1.1", - "docs/dbt-versions/core-upgrade/upgrading-to-v1.0", - ], - }, - { - type: "category", - label: "Tools", - link: { - type: "generated-index", - title: "Tool migration guides", - description: - "Learn how to migrate to dbt from other tools and platforms.", - slug: "/guides/migration/tools", - }, - items: [ - { - type: "category", - label: "Migrating from stored procedures", - link: { - type: "doc", - id: "guides/migration/tools/migrating-from-stored-procedures/1-migrating-from-stored-procedures", - }, - items: [ - "guides/migration/tools/migrating-from-stored-procedures/2-inserts", - "guides/migration/tools/migrating-from-stored-procedures/3-updates", - "guides/migration/tools/migrating-from-stored-procedures/4-deletes", - "guides/migration/tools/migrating-from-stored-procedures/5-merges", - "guides/migration/tools/migrating-from-stored-procedures/6-migrating-from-stored-procedures-conclusion", - ], - }, - "guides/migration/tools/migrating-from-spark-to-databricks", - "guides/migration/tools/refactoring-legacy-sql", - ], - }, - ], - }, - { - type: "category", - label: "dbt Ecosystem", - link: { - type: "generated-index", - title: "dbt Ecosystem guides", - description: "Learn about the dbt ecosystem and how to build with dbt.", - slug: "/guides/dbt-ecosystem/", - }, - items: [ - { - type: "category", - label: "Adapter development", - link: { - type: "doc", - id: "guides/dbt-ecosystem/adapter-development/1-what-are-adapters", - }, - items: [ - "guides/dbt-ecosystem/adapter-development/2-prerequisites-for-a-new-adapter", - "guides/dbt-ecosystem/adapter-development/3-building-a-new-adapter", - "guides/dbt-ecosystem/adapter-development/4-testing-a-new-adapter", - "guides/dbt-ecosystem/adapter-development/5-documenting-a-new-adapter", - "guides/dbt-ecosystem/adapter-development/6-promoting-a-new-adapter", - "guides/dbt-ecosystem/adapter-development/7-verifying-a-new-adapter", - "guides/dbt-ecosystem/adapter-development/8-building-a-trusted-adapter", - ], - }, - { - type: "category", - label: "dbt Python Snowpark", - link: { - type: "doc", - id: "guides/dbt-ecosystem/dbt-python-snowpark/1-overview-dbt-python-snowpark", - }, - items: [ - "guides/dbt-ecosystem/dbt-python-snowpark/2-snowflake-configuration", - "guides/dbt-ecosystem/dbt-python-snowpark/3-connect-to-data-source", - "guides/dbt-ecosystem/dbt-python-snowpark/4-configure-dbt", - "guides/dbt-ecosystem/dbt-python-snowpark/5-development-schema-name", - "guides/dbt-ecosystem/dbt-python-snowpark/6-foundational-structure", - 
"guides/dbt-ecosystem/dbt-python-snowpark/7-folder-structure", - "guides/dbt-ecosystem/dbt-python-snowpark/8-sources-and-staging", - "guides/dbt-ecosystem/dbt-python-snowpark/9-sql-transformations", - "guides/dbt-ecosystem/dbt-python-snowpark/10-python-transformations", - "guides/dbt-ecosystem/dbt-python-snowpark/11-machine-learning-prep", - "guides/dbt-ecosystem/dbt-python-snowpark/12-machine-learning-training-prediction", - "guides/dbt-ecosystem/dbt-python-snowpark/13-testing", - "guides/dbt-ecosystem/dbt-python-snowpark/14-documentation", - "guides/dbt-ecosystem/dbt-python-snowpark/15-deployment", - ], - }, - { - type: "category", - label: "Databricks and dbt", - link: { - type: "doc", - id: "guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project", - }, - items: [ - "guides/dbt-ecosystem/databricks-guides/dbt-unity-catalog-best-practices", - "guides/dbt-ecosystem/databricks-guides/how_to_optimize_dbt_models_on_databricks", - "guides/dbt-ecosystem/databricks-guides/productionizing-your-dbt-databricks-project", - ], - }, - "guides/dbt-ecosystem/sl-partner-integration-guide", - ], - }, - { - type: "category", - label: "Advanced", - items: [ - "guides/advanced/creating-new-materializations", - "guides/advanced/using-jinja", - ], - }, - { - type: "category", - label: "Legacy", - items: [ - "guides/legacy/debugging-schema-names", - "guides/legacy/best-practices", - "guides/legacy/building-packages", - "guides/legacy/videos", + "best-practices/writing-custom-generic-tests", + "best-practices/best-practice-workflows", + "best-practices/dbt-unity-catalog-best-practices", ], }, ], diff --git a/website/snippets/_legacy-sl-callout.md b/website/snippets/_legacy-sl-callout.md index f45c6b68af3..97c95512332 100644 --- a/website/snippets/_legacy-sl-callout.md +++ b/website/snippets/_legacy-sl-callout.md @@ -6,6 +6,6 @@ The dbt Semantic Layer has undergone a [significant revamp](https://www.getdbt.c **What’s changed?** The dbt_metrics package has been [deprecated](https://docs.getdbt.com/blog/deprecating-dbt-metrics) and replaced with [MetricFlow](/docs/build/about-metricflow?version=1.6), a new framework for defining metrics in dbt. This means dbt_metrics is no longer supported after dbt v1.5 and won't receive any code fixes. -**What should you do?** If you're using the legacy Semantic Layer, we **highly** recommend you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher to use the new dbt Semantic Layer. To migrate to the new Semantic Layer, refer to the dedicated [migration guide](/guides/migration/sl-migration) for more info. +**What should you do?** If you're using the legacy Semantic Layer, we **highly** recommend you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt v1.6 or higher to use the new dbt Semantic Layer. To migrate to the new Semantic Layer, refer to the dedicated [migration guide](/guides/sl-migration) for more info. ::: diff --git a/website/snippets/_new-sl-setup.md b/website/snippets/_new-sl-setup.md index ad248bc3ca9..3cb6e09eb4c 100644 --- a/website/snippets/_new-sl-setup.md +++ b/website/snippets/_new-sl-setup.md @@ -7,7 +7,7 @@ You can set up the dbt Semantic Layer in dbt Cloud at the environment and projec - You must have a successful run in your new environment. :::tip -If you're using the legacy Semantic Layer, dbt Labs strongly recommends that you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt version 1.6 or newer to use the latest dbt Semantic Layer. 
Refer to the dedicated [migration guide](/guides/migration/sl-migration) for details. +If you're using the legacy Semantic Layer, dbt Labs strongly recommends that you [upgrade your dbt version](/docs/dbt-versions/upgrade-core-in-cloud) to dbt version 1.6 or newer to use the latest dbt Semantic Layer. Refer to the dedicated [migration guide](/guides/sl-migration) for details. ::: 1. In dbt Cloud, create a new [deployment environment](/docs/deploy/deploy-environments#create-a-deployment-environment) or use an existing environment on dbt 1.6 or higher. diff --git a/website/snippets/dbt-databricks-for-databricks.md b/website/snippets/dbt-databricks-for-databricks.md index 930e7a85a9f..f1c5ec84af1 100644 --- a/website/snippets/dbt-databricks-for-databricks.md +++ b/website/snippets/dbt-databricks-for-databricks.md @@ -1,4 +1,4 @@ :::info If you're using Databricks, use `dbt-databricks` If you're using Databricks, the `dbt-databricks` adapter is recommended over `dbt-spark`. -If you're still using dbt-spark with Databricks consider [migrating from the dbt-spark adapter to the dbt-databricks adapter](/guides/migration/tools/migrating-from-spark-to-databricks#migrate-your-dbt-projects). -::: \ No newline at end of file +If you're still using dbt-spark with Databricks consider [migrating from the dbt-spark adapter to the dbt-databricks adapter](/guides/migrate-from-spark-to-databricks). +::: diff --git a/website/src/components/quickstartGuideCard/index.js b/website/src/components/quickstartGuideCard/index.js index fdc629bd7b0..104bb5cb35b 100644 --- a/website/src/components/quickstartGuideCard/index.js +++ b/website/src/components/quickstartGuideCard/index.js @@ -3,26 +3,67 @@ import Link from "@docusaurus/Link"; import styles from "./styles.module.css"; import getIconType from "../../utils/get-icon-type"; -function QuickstartGuideCard({ frontMatter }) { - const { id, title, time_to_complete, icon } = frontMatter; +export default function QuickstartGuideCard({ frontMatter }) { + const { id, title, time_to_complete, icon, tags, level, recently_updated } = + frontMatter; + return ( - + + {recently_updated && ( + Updated + )} {icon && getIconType(icon, styles.icon)} - +

    {title}

    {time_to_complete && ( {time_to_complete} )} - - Start + + Start + + {(tags || level) && ( +
    + {tags && + tags.map((tag, i) => ( +
    + {tag} +
    + ))} + {level &&
    {level}
    } +
    + )} ); } -export default QuickstartGuideCard; +// Component that handles the information under the title on the quickstart guide page +export function QuickstartGuideTitle({ frontMatter }) { + const { time_to_complete, tags, level, recently_updated } = + frontMatter; + + return ( +
+      {recently_updated && (
+        Updated
+      )}
+      {time_to_complete && (
+        {time_to_complete}
+      )}
+
+      {(tags || level) && (
+
+          {tags &&
+            tags.map((tag, i) => (
+
+                {tag}
+
+            ))}
+          {level && {level}}
+
+      )}
+
+  );
+}
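For reviewers who haven't worked with these components before, the sketch below shows how the two exports above are typically consumed: the guides index page passes each guide's frontmatter into `QuickstartGuideCard`, and an individual guide page renders the same metadata under its title with `QuickstartGuideTitle`. The frontmatter values and the wrapper component here are illustrative assumptions, not code from this PR.

```jsx
// Illustrative only — a minimal consumer of the two exports above.
// The guide values are made up; real guides define these fields
// (id, title, icon, time_to_complete, tags, level, recently_updated)
// in their markdown frontmatter.
import React from "react";
import QuickstartGuideCard, { QuickstartGuideTitle } from "../quickstartGuideCard";

const exampleGuide = {
  id: "example-guide",               // hypothetical guide id
  title: "Quickstart for dbt Cloud", // hypothetical title
  icon: "dbt-bit",
  time_to_complete: "30 minutes",
  tags: ["dbt Cloud"],
  level: "Beginner",
  recently_updated: true,
};

export default function ExampleGuidesIndex() {
  return (
    <>
      {/* Card as it would appear on the guides index page */}
      <QuickstartGuideCard frontMatter={exampleGuide} />
      {/* Metadata row as it would appear under a guide's title */}
      <QuickstartGuideTitle frontMatter={exampleGuide} />
    </>
  );
}
```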
diff --git a/website/src/components/quickstartGuideCard/styles.module.css b/website/src/components/quickstartGuideCard/styles.module.css
index 8202f694fcd..5df40c8479e 100644
--- a/website/src/components/quickstartGuideCard/styles.module.css
+++ b/website/src/components/quickstartGuideCard/styles.module.css
@@ -1,24 +1,28 @@
 .quickstartCard {
-  border: 1px solid #EFF2F3;
-  border-radius: var(--border-radius);
+  outline: 1px solid #EFF2F3;
+  border-radius: 10px;
   box-shadow: 0px 11px 24px rgba(138, 138, 138, .1);
   padding: 2.5rem 2.5rem 1.5rem 2.5rem;
   flex: 0 0 30%;
-  border-bottom: solid 4px var(--color-light-teal);
   display: flex;
   flex-direction: column;
   text-decoration: none !important;
   transition: all 0.2s ease-in-out;
+  position: relative;
 }
 
 .quickstartCard:hover {
-  border-bottom-color: var(--color-orange);
   transform: translateY(-7px);
+  outline: 2px solid var( --color-green-blue);
+}
+
+.quickstartCard:hover > .start {
+  text-decoration: underline;
 }
 
 .quickstartCard .icon {
-  max-width: 25px;
-  font-size: 25px;
+  max-width: 46px;
+  font-size: 46px;
   margin-bottom: .8rem;
   color: var(--ifm-menu-color);
 }
@@ -45,21 +49,106 @@
   color:var(--ifm-menu-color)
 }
 
+[data-theme='dark'] .quickstartCard .recently_updated {
+  color: #fff;
+}
+
 .quickstartCard .start {
   font-size: 1.125rem;
   margin-top: auto;
   padding-top: 2rem;
+  font-weight: 600;
 }
 
 [data-theme='dark'] .quickstartCard .start {
   color: #fff;
 }
 
-[data-theme='dark'] .quickstartCard:hover .start {
-  text-decoration: underline;
+.quickstartCard .start i {
+  margin-left: 4px;
+  font-size: .9rem;
+}
+
+.quickstartCard .recently_updated {
+  position: absolute;
+  top: 1.5rem;
+  right: 1.5rem;
+}
+
+.quickstartCard .tag_container {
+  display: flex;
+  flex-wrap: wrap;
+  gap: 0.375rem;
+  margin-top: 1rem;
+}
+
+.quickstartCard .tag_container .tag {
+  background: #E5E7EB;
+  border-radius: 1.5rem;
+  color:#262A38;
+  padding: 0rem 0.75rem;
+}
+
+[data-theme='dark'] .quickstartCard .tag_container .tag {
+  background: #374151;
+  color: #fff;
+}
+
+.infoContainer {
+  display: flex;
+  margin-bottom: 4rem;
+}
+
+.infoContainer > * {
+  border-left: solid #e0e3e8 3px;
+  padding: 0 1rem 0 1rem;
+}
+
+.infoContainer > *:first-child {
+  border: none;
+  padding-left: 0;
+}
+
+.infoContainer .tag_container {
+  display: flex;
+  flex-wrap: wrap;
+  gap: 0.375rem;
+  align-items: center;
+}
+
+.infoContainer .tag_container .tag {
+  background: #E5E7EB;
+  border-radius: 1.5rem;
+  color:#262A38;
+  padding: 0rem 0.75rem;
+}
+
+[data-theme='dark'] .infoContainer .tag_container .tag {
+  background: #374151;
+  color: #fff;
+}
+
+
+.infoContainer .time_to_complete {
+  font-weight: 700;
+
+}
+
+.infoContainer .recently_updated {
+  color: var(--color-green-blue);
+}
+
+[data-theme='dark'] .infoContainer .recently_updated {
+  color: #fff;
 }
 
-.quickstartCard .start:after {
-  content: " →";
-  margin-left: 5px;
+@media (max-width: 996px) {
+  .infoContainer {
+    gap: 1rem;
+    flex-direction: column;
+  }
+  .infoContainer > * {
+    border: none;
+    padding: 0;
+  }
 }
diff --git a/website/src/components/quickstartGuideList/index.js b/website/src/components/quickstartGuideList/index.js
index 954d54e6d47..05c8c041a0e 100644
--- a/website/src/components/quickstartGuideList/index.js
+++ b/website/src/components/quickstartGuideList/index.js
@@ -1,19 +1,68 @@
 import React from 'react';
+import { useState, useEffect, useMemo } from 'react';
 import Head from '@docusaurus/Head';
 import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
 import Layout from '@theme/Layout';
 import Hero from '@site/src/components/hero';
 import QuickstartGuideCard from '../quickstartGuideCard'
 import styles from './styles.module.css';
+import { SelectDropdown } from '../selectDropdown';
+import SearchInput from '../searchInput';
 
-const quickstartTitle = 'Quickstarts'
+const quickstartTitle = 'Guides'
 const quickstartDescription = 'dbt Core is a powerful open-source tool for data transformations and dbt Cloud is the fastest and most reliable way to deploy your dbt jobs. With the help of a sample project, learn how to quickly start using dbt and one of the most common data platforms.'
 
+
 function QuickstartList({ quickstartData }) {
-  const { siteConfig } = useDocusaurusContext()
-
+  const { siteConfig } = useDocusaurusContext();
+  const [filteredData, setFilteredData] = useState(() => quickstartData);
+  const [selectedTags, setSelectedTags] = useState([]);
+  const [selectedLevel, setSelectedLevel] = useState([]);
+  const [searchInput, setSearchInput] = useState('');
+
   // Build meta title from quickstartTitle and docusaurus config site title
-  const metaTitle = `${quickstartTitle}${siteConfig?.title ? ` | ${siteConfig.title}` : ''}`
+  const metaTitle = `${quickstartTitle}${siteConfig?.title ? ` | ${siteConfig.title}` : ''}`;
+
+  // UseMemo to prevent re-rendering on every filter change
+  // Get tag options
+  // Populated from the tags frontmatter array
+  const tagOptions = useMemo(() => {
+    const tags = new Set();
+    quickstartData.forEach(guide =>
+      guide?.data?.tags?.forEach(tag => tags.add(tag))
+    );
+    // Sort alphabetically
+    return Array.from(tags).sort((a, b) => a.toLowerCase().localeCompare(b.toLowerCase())).map(tag => ({ value: tag, label: tag }));
+  }, [quickstartData]);
+
+  // Get level options
+  // Populated by the level frontmatter string
+  const levelOptions = useMemo(() => {
+    const levels = new Set();
+    quickstartData.forEach(guide =>
+      guide?.data?.level && levels.add(guide.data.level)
+    );
+    return Array.from(levels).map(level => ({ value: level, label: level }));
+  }, [quickstartData]);
+
+  // Handle all filters
+  const handleDataFilter = () => {
+    const filteredGuides = quickstartData.filter((guide) => {
+      const tagsMatch = selectedTags.length === 0 || (Array.isArray(guide?.data?.tags) && selectedTags.every((tag) =>
+        guide?.data?.tags.includes(tag.value)
+      ));
+      const levelMatch = selectedLevel.length === 0 || (guide?.data?.level && selectedLevel.some((level) =>
+        guide?.data?.level === level.value
+      ));
+      const titleMatch = searchInput === '' || guide?.data?.title?.toLowerCase().includes(searchInput.toLowerCase());
+      return tagsMatch && levelMatch && titleMatch;
+    });
+    setFilteredData(filteredGuides);
+  };
+
+  useEffect(() => {
+    handleDataFilter();
+  }, [selectedTags, selectedLevel, searchInput]);
 
   return (
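The filtering hunk above combines three independent predicates: every selected tag must be present on a guide, any selected level may match, and the title check is a case-insensitive substring search. For reviewers who want to sanity-check that behavior outside React, here is a small framework-free sketch of the same predicates; the `filterGuides` helper and the sample guide objects are illustrative, not part of the component.

```js
// Standalone sketch of the rules added in handleDataFilter.
// `filterGuides` and the sample data are hypothetical; the component
// keeps this logic inside React state and a useEffect hook.
const guides = [
  { data: { title: "Quickstart for dbt Cloud and Snowflake", tags: ["Snowflake"], level: "Beginner" } },
  { data: { title: "Airflow and dbt Cloud", tags: ["Orchestration"], level: "Intermediate" } },
];

function filterGuides(guides, selectedTags, selectedLevel, searchInput) {
  return guides.filter((guide) => {
    // Every selected tag must be present on the guide (logical AND).
    const tagsMatch =
      selectedTags.length === 0 ||
      (Array.isArray(guide?.data?.tags) &&
        selectedTags.every((tag) => guide.data.tags.includes(tag.value)));
    // Any selected level may match (logical OR).
    const levelMatch =
      selectedLevel.length === 0 ||
      (guide?.data?.level &&
        selectedLevel.some((level) => guide.data.level === level.value));
    // Case-insensitive substring match on the title.
    const titleMatch =
      searchInput === "" ||
      guide?.data?.title?.toLowerCase().includes(searchInput.toLowerCase());
    return tagsMatch && levelMatch && titleMatch;
  });
}

// Example: only the Snowflake beginner guide survives these filters.
console.log(
  filterGuides(guides, [{ value: "Snowflake" }], [{ value: "Beginner" }], "snowflake")
);
```

Note the asymmetry in the design: tags are ANDed (`every`) while levels are ORed (`some`), which mirrors the split in the diff.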
@@ -22,23 +71,32 @@ function QuickstartList({ quickstartData }) {
-
-
-          {quickstartData && quickstartData.length > 0 ? (
+
+            {tagOptions && tagOptions.length > 0 && (
+
+            )}
+            {levelOptions && levelOptions.length > 0 && (
+
+            )}
+             setSearchInput(value)} placeholder='Search Guides' />
+
+
+          {filteredData && filteredData.length > 0 ? (
             <>
-              {quickstartData.map((guide, i) => (
-
+              {filteredData.map((guide) => (
+
               ))}
-          ) :
-            No quickstarts are available at this time. 😕
+          ) :
+            No quickstarts are available with the selected filters.
           }
@@ -46,4 +104,4 @@ function QuickstartList({ quickstartData }) {
   )
 }
 
-export default QuickstartList
+export default QuickstartList;
diff --git a/website/src/components/quickstartGuideList/styles.module.css b/website/src/components/quickstartGuideList/styles.module.css
index 8c4e45edc8c..4e1518efd2b 100644
--- a/website/src/components/quickstartGuideList/styles.module.css
+++ b/website/src/components/quickstartGuideList/styles.module.css
@@ -18,12 +18,23 @@
 .quickstartCardContainer {
   display: grid;
   grid-template-columns: 1fr 1fr 1fr;
-  grid-gap: 2rem;
-  padding: 5rem 1rem;
+  grid-gap: 1rem;
+  padding: 2rem 1rem 5rem;
+}
+
+.quickstartFilterContainer {
+  display: grid;
+  grid-template-columns: 1fr 1fr 1fr;
+  grid-gap: 1rem;
+  padding-top: 4rem;
+}
+
+.quickstartFilterContainer > div:first-child {
+  padding: 0;
 }
 
 @media (max-width: 996px) {
-  .quickstartCardContainer {
+  .quickstartCardContainer, .quickstartFilterContainer {
     grid-template-columns: 1fr;
   }
 }
diff --git a/website/src/components/quickstartTOC/index.js b/website/src/components/quickstartTOC/index.js
index 8c9b8fba910..3ff5e027208 100644
--- a/website/src/components/quickstartTOC/index.js
+++ b/website/src/components/quickstartTOC/index.js
@@ -81,13 +81,19 @@ function QuickstartTOC() {
       buttonContainer.classList.add(style.buttonContainer);
       const prevButton = document.createElement("a");
       const nextButton = document.createElement("a");
+      const nextButtonIcon = document.createElement("i");
+      const prevButtonIcon = document.createElement("i");
 
+      prevButtonIcon.classList.add("fa-regular", "fa-arrow-left");
       prevButton.textContent = "Back";
+      prevButton.prepend(prevButtonIcon);
       prevButton.classList.add(clsx(style.button, style.prevButton));
       prevButton.disabled = index === 0;
       prevButton.addEventListener("click", () => handlePrev(index + 1));
 
+      nextButtonIcon.classList.add("fa-regular", "fa-arrow-right");
       nextButton.textContent = "Next";
+      nextButton.appendChild(nextButtonIcon);
       nextButton.classList.add(clsx(style.button, style.nextButton));
       nextButton.disabled = index === stepWrappers.length - 1;
       nextButton.addEventListener("click", () => handleNext(index + 1));
@@ -190,7 +196,24 @@ function QuickstartTOC() {
     updateStep(activeStep, stepNumber);
   };
 
+  // Handle TOC menu click
+  const handleTocMenuClick = () => {
+    const tocList = document.querySelector(`.${style.tocList}`);
+    const tocMenuBtn = document.querySelector(`.${style.toc_menu_btn}`);
+    const tocListStyles = window.getComputedStyle(tocList);
+
+    if (tocListStyles.display === "none") {
+      tocList.style.display = "block";
+      tocMenuBtn.querySelector("i").style.transform = "rotate(0deg)";
+    } else {
+      tocList.style.display = "none";
+      tocMenuBtn.querySelector("i").style.transform = "rotate(-90deg)";
+    }
+  };
+
   return (
+    <>
+      Menu
       {tocData.map((step) => (
       ))}
+
   );
 }
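The TOC changes above build the Back/Next links and the collapsible mobile menu with plain DOM APIs rather than JSX. As a reference for reviewers, here is a condensed, framework-free sketch of the same pattern; the `fa-*` icon class names mirror the diff, while the `createStepNav` helper and its wiring are illustrative assumptions, not exports of the component.

```js
// Condensed sketch of the DOM pattern used in quickstartTOC/index.js.
// `createStepNav` is a hypothetical helper for illustration only.
function createStepNav(onPrev, onNext) {
  const container = document.createElement("div");

  // "Back" link with a leading left-arrow icon, as in the diff.
  const prevButton = document.createElement("a");
  const prevIcon = document.createElement("i");
  prevIcon.classList.add("fa-regular", "fa-arrow-left");
  prevButton.textContent = "Back";
  prevButton.prepend(prevIcon);
  prevButton.addEventListener("click", onPrev);

  // "Next" link with a trailing right-arrow icon.
  const nextButton = document.createElement("a");
  const nextIcon = document.createElement("i");
  nextIcon.classList.add("fa-regular", "fa-arrow-right");
  nextButton.textContent = "Next";
  nextButton.appendChild(nextIcon);
  nextButton.addEventListener("click", onNext);

  container.append(prevButton, nextButton);
  return container;
}
```

The mobile menu button in `handleTocMenuClick` follows the same approach: it flips the TOC list's `display` value and rotates the caret icon between 0deg and -90deg.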
diff --git a/website/src/components/quickstartTOC/styles.module.css b/website/src/components/quickstartTOC/styles.module.css
index edfd0380098..892e6f73be6 100644
--- a/website/src/components/quickstartTOC/styles.module.css
+++ b/website/src/components/quickstartTOC/styles.module.css
@@ -1,5 +1,5 @@
 .quickstartTitle {
-  padding: 1rem 0 2rem;
+
 }
 
 .tocList {
@@ -8,15 +8,16 @@
   margin: 0;
   width: 370px;
   flex-shrink: 0;
-  padding-right: 3rem;
+  padding-right: 4rem;
+  margin-right: 4rem;
+  border-right: solid 4px #EFF2F3;
 }
 
 .tocList li {
   padding: 1rem;
   display: block;
-  border: 1px solid #EFF2F3;
-  box-shadow: 0px 11px 24px rgba(138, 138, 138, 0.1), 0px 0px 0px rgba(138, 138, 138, 0.1);
-  border-radius: 10px;
+  box-shadow: 0px 10px 16px 0px rgba(31, 41, 55, 0.10);
+  border-radius: 8px;
   margin-bottom: 1rem;
   display: grid;
   grid-template-columns: 1fr 5fr;
@@ -32,13 +33,13 @@
   height: 30px;
   text-align: center;
   line-height: 27px;
-  color: var(--color-light-teal);
-  border: solid 1px var(--color-light-teal);
+  color: var(--color-green-blue);
+  border: solid 1px var(--color-green-blue);
   margin-bottom: auto;
 }
 
 .tocList .active span {
-  background: var(--color-light-teal);
+  background: var( --color-green-blue);
   color: var(--color-white);
 }
 
@@ -52,7 +53,7 @@ html[data-theme="dark"] .tocList li span {
 }
 
 html[data-theme="dark"] .tocList .active span {
-  border-color: var(--color-light-teal);
+  border-color: var(--color-green-blue);
 }
 
 .tocItem {
@@ -73,28 +74,47 @@ html[data-theme="dark"] .tocList .active span {
   transition-property: color, background, border-color;
   transition-duration: var(--ifm-button-transition-duration);
   transition-timing-function: var(--ifm-transition-timing-default);
-  border: 2px solid var(--color-light-teal);
+  border: 2px solid var(--color-green-blue);
+  color: var(--color-green-blue);
   border-radius: 5px;
   width: 125px;
   text-align: center;
 }
 
-.buttonContainer a:hover {
-  background: var(--color-light-teal);
-  color: var(--color-white)
+.stepWrapper .buttonContainer a:hover {
+  background: var(--color-green-blue);
+  color: var(--color-white);
+}
+
+html[data-theme="dark"] .stepWrapper .buttonContainer a:hover {
+  color: var(--color-white) !important;
 }
 
 .buttonContainer .prevButton {
   margin-right: auto;
 }
 
+.buttonContainer .prevButton i {
+  font-size: .8rem;
+  margin-right: .4rem;
+}
+
 .buttonContainer .nextButton {
   margin-left: auto;
 }
 
-.stepWrapper[data-step="1"] .nextButton {
-  background: var(--color-light-teal);
-  color: var(--color-white)
+.buttonContainer .nextButton i {
+  font-size: .8rem;
+  margin-left: .4rem;
+}
+
+.stepWrapper[data-step="1"] a.nextButton {
+  background: var(--color-green-blue);
+  color: var(--color-white);
+}
+
+html[data-theme="dark"] .stepWrapper[data-step="1"] a.nextButton {
+  color: var(--color-white) !important;
 }
 
 .stepWrapper.hidden {
@@ -105,12 +125,26 @@
   display: none;
 }
 
+.toc_menu_btn {
+  display: none;
+}
+
+.toc_menu_btn i {
+  transform: rotate(-90deg);
+  vertical-align: middle;
+}
+
 @media (max-width: 996px) {
   .tocList {
     width: 100%;
     padding-right: 0;
     margin-bottom: 2rem;
-    height: 160px;
-    overflow-y: auto;
+    display: none;
+  }
+
+  .toc_menu_btn {
+    display: inline-block;
+    margin-bottom: 2rem;
+    cursor: pointer;
   }
 }
diff --git a/website/src/components/searchInput/index.js b/website/src/components/searchInput/index.js
new file mode 100644
index 00000000000..e0a5faf4a82
--- /dev/null
+++ b/website/src/components/searchInput/index.js
@@ -0,0 +1,26 @@
+import React from "react";
+import styles from "./styles.module.css";
+
+const SearchInput = ({
+  value,
+  onChange,
+  placeholder = "Search...",
+  ...props
+}) => {
+  return (
+
+  );
+};
+
+export default SearchInput;
diff --git a/website/src/components/searchInput/styles.module.css b/website/src/components/searchInput/styles.module.css
new file mode 100644
index 00000000000..ae19a3bb81b
--- /dev/null
+++ b/website/src/components/searchInput/styles.module.css
@@ -0,0 +1,30 @@
+.inputContainer {
+  padding: 0 1rem;
+  border-radius: 0.3125rem;
+  border: 2px solid var(--navy-200-c-6-ccd-4, #C6CCD4);
+
+
+}
+
+.inputContainer:active, .input:focus {
+  border: 2px solid #4f5d75;
+  outline: none;
+}
+
+.input::placeholder {
+  all: unset;
+  -webkit-text-security: initial;
+}
+
+.inputContainer .input {
+  border: none;
+  min-height: 38px;
+  font-size: .975rem;
+  color: var(--ifm-font-color-base);
+  font-family: var(--ifm-font-family-base);
+}
+
+[data-theme='dark'] .input{
+  background: #1b1b1d;
+  color: #e3e3e3;
+}
diff --git a/website/src/components/selectDropdown/index.js b/website/src/components/selectDropdown/index.js
new file mode 100644
index 00000000000..b6378518c25
--- /dev/null
+++ b/website/src/components/selectDropdown/index.js
@@ -0,0 +1,21 @@
+import React from "react";
+import Select from "react-select";
+import styles from "./styles.module.css";
+
+export const SelectDropdown = ({ options, value, onChange, isMulti, placeHolder }) => {
+  return (