Merge branch 'current' into brian-gillet-patch-2
mirnawong1 authored Nov 30, 2022
2 parents a30a98f + 38cc18a commit 1374ad6
Showing 82 changed files with 1,350 additions and 212 deletions.
9 changes: 8 additions & 1 deletion README.md
@@ -23,7 +23,14 @@ You can use components documented in the [docusaurus library](https://v2.docusau

# Writing content

When writing content, you should refer to the [style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md) and [content types](/contributing/content-types.md) to help you understand our writing standards and how we break down information in the product documentation.
The dbt Labs docs are written in Markdown and sometimes HTML. When writing content, refer to the [style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md) and [content types](/contributing/content-types.md) to help you understand our writing standards and how we break down information in the product documentation.

## SME and editorial reviews

All submitted PRs receive an editorial review from the dbt Labs Docs team.

Content submitted by our users and the open-source community is also reviewed by dbt Labs subject matter experts (SMEs) to help ensure technical accuracy.


## Versioning and single-sourcing content

2 changes: 1 addition & 1 deletion contributing/content-types.md
@@ -102,7 +102,7 @@ Procedural content should include troubleshooting tips as frequently as possible

## Guide

Guides (formerly called long-form procedural articles) are highly-approachable articles that group information in context to help readers complete a complex task or set of related tasks. Guides eliminate duplication and ensure the customer finds contextual content in the right place. Guides may be a set of tasks within the reader’s larger workflow, such as including use cases.
Guides are highly approachable articles that group information in context to help readers complete a complex task or set of related tasks. Guides eliminate duplication and ensure people find contextual content in the right place. A guide may cover a set of tasks within the reader’s larger workflow and may include use cases.

Guides combine the content types within a single article to illustrate an entire workflow within a single page, rather than splitting the workflow out into separate pieces. Guides containing multiple procedures help us scale as more options are added to the product. Users may need to complete different procedures within the guide at different times, or refer back to the guide for conceptual content or to complete a follow-up task.

Example usage: If there is a large number of the same type of setting, use a guide that gathers all of the tasks in context.
45 changes: 3 additions & 42 deletions website/blog/2021-11-29-dbt-airflow-spiritual-alignment.md
@@ -22,7 +22,7 @@ In my experience, these are false dichotomies, that sound great as hot takes but

<!--truncate-->

In my days as a data consultant and now as a member of the dbt Labs Solutions Architecture team, I’ve frequently seen Airflow, dbt Core & dbt Cloud ([via the API](https://docs.getdbt.com/dbt-cloud/api-v2)) blended as needed, based on the needs of a specific data pipeline, or a team’s structure and skillset.
In my days as a data consultant and now as a member of the dbt Labs Solutions Architecture team, I’ve frequently seen Airflow, dbt Core & dbt Cloud ([via the official provider](https://registry.astronomer.io/providers/dbt-cloud?type=Operators&utm_campaign=Monthly+Product+Updates&utm_medium=email&_hsmi=208603877&utm_content=208603877&utm_source=hs_email)) blended as needed, based on the needs of a specific data pipeline, or a team’s structure and skillset.

More fundamentally, I think it’s important to call out that Airflow + dbt are **spiritually aligned** in purpose. They both exist to facilitate clear communication across data teams, in service of producing trustworthy data.

@@ -123,8 +123,6 @@ When a dbt run fails within an Airflow pipeline, an engineer monitoring the over

dbt provides common programmatic interfaces (the [dbt Cloud Admin + Metadata APIs](/docs/dbt-cloud/dbt-cloud-api/cloud-apis), and [.json-based artifacts](/reference/artifacts/dbt-artifacts) in the case of dbt Core) that provide the context needed for the engineer to self-serve—either by rerunning from a point of failure or reaching out to the owner.

![dbt run log](/img/blog/airflow-dbt-run-log.png "dbt run log")
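To make the artifact route concrete, here is a minimal sketch (mine, not from the original post) that reads dbt Core’s `run_results.json` to surface the failed nodes, assuming the default `target/` output directory:

```python
import json
from pathlib import Path

# dbt Core writes run_results.json to the target/ directory after each invocation.
results = json.loads(Path("target/run_results.json").read_text())

# Each entry describes one executed node; "error" (models) and "fail" (tests)
# are the statuses an engineer would rerun from or escalate to the owner.
failed_nodes = [
    r["unique_id"]
    for r in results["results"]
    if r["status"] in ("error", "fail")
]
print(failed_nodes)
```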

## Why I ❤️ dbt Cloud + Airflow

dbt Core is a fantastic framework for developing data transformation + testing logic. It is less fantastic as a shared interface for data analysts + engineers to collaborate **_on production runs of transformation jobs_**.
@@ -191,25 +189,7 @@ This means that whether you’re actively developing or you simply want to rerun

### dbt Cloud + Airflow

With dbt Cloud and its aforementioned [APIs](https://docs.getdbt.com/docs/dbt-cloud/dbt-cloud-api/cloud-apis), any dbt user can configure dbt runs from the UI.

In Airflow, engineers can then call the API, and everyone can move on with their lives. This allows the API to be a programmatic interface between analysts and data engineers, rather than relying on the human interface.

If you look at what this practically looks like in code (my [airflow-toolkit repo is here](https://github.com/sungchun12/airflow-toolkit/blob/demo-sung/dags/examples/dbt_cloud_example.py)), just a few settings need to be configured after you create the initial Python API call ([see here](https://github.com/sungchun12/airflow-toolkit/blob/95d40ac76122de337e1b1cdc8eed35ba1c3051ed/dags/dbt_cloud_utils.py)):

```python
# Configure the job-runner wrapper from the airflow-toolkit repo linked above.
dbt_cloud_job_runner_config = dbt_cloud_job_runner(
    account_id=4238, project_id=12220, job_id=12389, cause=dag_file_name
)
```
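Under the hood, a wrapper like this boils down to a single authenticated POST against the dbt Cloud API. A minimal sketch, assuming your API token lives in a `DBT_CLOUD_API_TOKEN` environment variable (the account and job IDs are placeholders):

```python
import os

import requests

ACCOUNT_ID = 4238  # placeholder dbt Cloud account ID
JOB_ID = 12389     # placeholder dbt Cloud job ID

# Trigger a run of an existing dbt Cloud job (API v2).
response = requests.post(
    f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
    headers={"Authorization": f"Token {os.environ['DBT_CLOUD_API_TOKEN']}"},
    json={"cause": "Triggered via Airflow"},
)
response.raise_for_status()
run_id = response.json()["data"]["id"]  # poll this run ID for status
```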

If the operator fails, it’s an Airflow problem. If the dbt run returns a model or test failure, it’s a dbt problem and the analyst can be notified to hop into the dbt Cloud UI to debug.

#### Using the new dbt Cloud Provider
#### Using the dbt Cloud Provider

With the new dbt Cloud Provider, you can use Airflow to orchestrate and monitor your dbt Cloud jobs without any of the overhead of dbt Core. Out of the box, the dbt Cloud provider comes with:

@@ -221,26 +201,7 @@ TL;DR - This combines the end-to-end visibility of everything (from ingestion th

#### Setting up Airflow and dbt Cloud

To set up Airflow and dbt Cloud, you can:


1. Set up a dbt Cloud job, as in the example below.

![job settings](/img/blog/2021-11-29-dbt-airflow-spiritual-alignment/job-settings.png)

2. Set up an Airflow Connection ID

![airflow dbt run select](/img/blog/2021-11-29-dbt-airflow-spiritual-alignment/airflow-connection-ID.png)

3. ~~Set up your Airflow DAG similar to this example.~~

4. Use Airflow to call the dbt Cloud API via the new `DbtCloudRunJobOperator` to run the job and monitor it in real time through the dbt Cloud interface.

![dbt Cloud API graph](/img/blog/2021-11-29-dbt-airflow-spiritual-alignment/dbt-Cloud-API-graph.png)

![Monitor Job Runs](/img/blog/2021-11-29-dbt-airflow-spiritual-alignment/Monitor-Job-Runs.png)

![run number](/img/blog/2021-11-29-dbt-airflow-spiritual-alignment/run-number.png)
To set up Airflow and dbt Cloud, you can follow the [step-by-step instructions](https://docs.getdbt.com/guides/orchestration/airflow-and-dbt-cloud/2-setting-up-airflow-and-dbt-cloud).
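For reference, a minimal DAG using the provider’s `DbtCloudRunJobOperator` might look like the sketch below (the connection ID and job ID are placeholders, not values from the guide):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

with DAG(
    dag_id="dbt_cloud_example",
    start_date=datetime(2022, 11, 30),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Triggers the dbt Cloud job and, by default, waits for it to finish,
    # so a model or test failure in dbt Cloud fails this Airflow task.
    trigger_dbt_cloud_job = DbtCloudRunJobOperator(
        task_id="trigger_dbt_cloud_job",
        dbt_cloud_conn_id="dbt_cloud_default",  # Airflow Connection ID from step 2
        job_id=12389,  # placeholder job ID
        check_interval=60,
        timeout=3600,
    )
```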

If your task errors or fails in any of the above use cases, you can view the logs within dbt Cloud (think: data engineers can trust analytics engineers to resolve errors).

website/blog/2022-05-19-redshift-configurations-dbt-model-optimizations.md
@@ -234,13 +234,13 @@ I won’t get into our modeling methodology at dbt Labs in this article, but the

### Staggered joins

![Staggered-Joins.png](/img/blog/2022-05-19-redshift-configurations-dbt-model-optimizations/Staggered-Joins.png)
![Staggered-Joins.png](/img/blog/2022-05-19-redshift-configurations-dbt-model-optimizations/Staggered-Joins.jpg)

In this method, you piece out your joins based on the main table they’re joining to. For example, if you had five tables that were all joined using `person_id`, you would stage your data (doing your cleanup too, of course), distribute those staged tables using `dist='person_id'`, and then marry them up in a table downstream. With that new table, you can choose the distribution key you’ll need for the next step. In our example above, the next step is joining to the `anonymous_visitor_profiles` table, which is distributed by `mask_id`, so the results of our join should also be distributed by `mask_id`.
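In dbt on Redshift, that distribution choice is just a model-level config. A minimal sketch (the model and column names here are illustrative, not from the post):

```sql
-- models/staging/person_joined.sql (hypothetical model name)
-- Distribute the staged join on the key the downstream join will use.
{{ config(
    materialized='table',
    dist='person_id',
    sort='person_id'
) }}

select
    p.person_id,
    p.mask_id,
    v.visit_count
from {{ ref('stg_persons') }} as p
left join {{ ref('stg_visits') }} as v
    on p.person_id = v.person_id
```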

### Resolve to a single key

![Resolve-to-single-key](/img/blog/2022-05-19-redshift-configurations-dbt-model-optimizations/Resolve-to-single-key.png)
![Resolve-to-single-key](/img/blog/2022-05-19-redshift-configurations-dbt-model-optimizations/Resolve-to-single-key.jpg)

This method takes some time to think about, and it may not make sense depending on what you need. It’s definitely a balance between coherence, usability, and performance.
