Commit

Merge branch 'dbt-teradata-1.6' of https://github.com/tallamohan/docs.getdbt.com into dbt-teradata-1.6
Talla committed Nov 1, 2023
2 parents efdd7f3 + 49232ca commit 32c48bc
Showing 254 changed files with 4,608 additions and 1,092 deletions.
1 change: 1 addition & 0 deletions .github/labeler.yml
@@ -3,6 +3,7 @@ developer blog:

guides:
- website/docs/guides/**/*
- website/docs/quickstarts/**/*

content:
- website/docs/**/*
8 changes: 4 additions & 4 deletions website/blog/2021-11-29-dbt-airflow-spiritual-alignment.md
@@ -144,22 +144,22 @@ An analyst will be in the dark when attempting to debug this, and will need to r
This can be perfectly ok, in the event your data team is structured for data engineers to exclusively own dbt modeling duties, but that’s a quite uncommon org structure pattern from what I’ve seen. And if you have easy solutions for this analyst-blindness problem, I’d love to hear them.

Once the data has been ingested, dbt Core can be used to model it for consumption. Most of the time, users choose to either:
-Use the dbt CLI+ [BashOperator](https://registry.astronomer.io/providers/apache-airflow/modules/bashoperator) with Airflow (If you take this route, you can use an external secrets manager to manage credentials externally), or
+Use the dbt Core CLI+ [BashOperator](https://registry.astronomer.io/providers/apache-airflow/modules/bashoperator) with Airflow (If you take this route, you can use an external secrets manager to manage credentials externally), or
Use the [KubernetesPodOperator](https://registry.astronomer.io/providers/kubernetes/modules/kubernetespodoperator) for each dbt job, as data teams have at places like [Gitlab](https://gitlab.com/gitlab-data/analytics/-/blob/master/dags/transformation/dbt_trusted_data.py#L72) and [Snowflake](https://www.snowflake.com/blog/migrating-airflow-from-amazon-ec2-to-kubernetes/).

Both approaches are equally valid; the right one will depend on the team and use case at hand.

| | Dependency management | Overhead | Flexibility | Infrastructure Overhead |
|---|---|---|---|---|
-| dbt CLI + BashOperator | Medium | Low | Medium | Low |
+| dbt Core CLI + BashOperator | Medium | Low | Medium | Low |
| Kubernetes Pod Operator | Very Easy | Medium | High | Medium |
| | | | | |

If you have DevOps resources available to you, and your team is comfortable with concepts like Kubernetes pods and containers, you can use the KubernetesPodOperator to run each job in a Docker image so that you never have to think about Python dependencies. Furthermore, you’ll create a library of images containing your dbt models that can be run on any containerized environment. However, setting up development environments, CI/CD, and managing the arrays of containers can mean a lot of overhead for some teams. Tools like the [astro-cli](https://github.com/astronomer/astro-cli) can make this easier, but at the end of the day, there’s no getting around the need for Kubernetes resources for the Gitlab approach.

-If you’re just looking to get started or just don’t want to deal with containers, using the BashOperator to call the dbt CLI can be a great way to begin scheduling your dbt workloads with Airflow.
+If you’re just looking to get started or just don’t want to deal with containers, using the BashOperator to call the dbt Core CLI can be a great way to begin scheduling your dbt workloads with Airflow.
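
For the BashOperator route, the operator ultimately just runs a shell command on the worker. Here is a minimal, hedged sketch of what that command might look like; the project path, profiles directory, and target name are illustrative assumptions, not from the original post:

```bash
# Hypothetical bash_command wired into an Airflow BashOperator.
# Paths, profiles directory, and target name are assumptions for illustration.
set -euo pipefail

cd /opt/airflow/dbt/my_project   # assumed location of the dbt project
dbt deps                         # install package dependencies
dbt build --profiles-dir /opt/airflow/dbt --target prod   # run and test the project
```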

-It’s important to note that whichever approach you choose, this is just a first step; your actual production needs may have more requirements. If you need granularity and dependencies between your dbt models, like the team at [Updater does, you may need to deconstruct the entire dbt DAG in Airflow.](https://www.astronomer.io/guides/airflow-dbt#use-case-2-dbt-airflow-at-the-model-level) If you’re okay managing some extra dependencies, but want to maximize control over what abstractions you expose to your end users, you may want to use the [GoCardlessProvider](https://github.com/gocardless/airflow-dbt), which wraps the BashOperator and dbt CLI.
+It’s important to note that whichever approach you choose, this is just a first step; your actual production needs may have more requirements. If you need granularity and dependencies between your dbt models, like the team at [Updater does, you may need to deconstruct the entire dbt DAG in Airflow.](https://www.astronomer.io/guides/airflow-dbt#use-case-2-dbt-airflow-at-the-model-level) If you’re okay managing some extra dependencies, but want to maximize control over what abstractions you expose to your end users, you may want to use the [GoCardlessProvider](https://github.com/gocardless/airflow-dbt), which wraps the BashOperator and dbt Core CLI.

#### Rerunning jobs from failure

2 changes: 1 addition & 1 deletion website/blog/2022-02-23-founding-an-AE-team-smartsheet.md
@@ -114,7 +114,7 @@ In the interest of getting a proof of concept out the door (I highly favor focus

- Our own Dev, Prod & Publish databases
- Our own code repository which we managed independently
-- dbt CLI
+- dbt Core CLI
- Virtual Machine running dbt on a schedule

None of us had used dbt before, but we’d heard amazing things about it. We hotly debated the choice between dbt and building our own lightweight stack, and looking back now, I couldn’t be happier with choosing dbt. While there was a learning curve that slowed us down initially, we’re now seeing the benefit of that decision. Onboarding new analysts is a breeze and much of the functionality we need is pre-built. The more we use the tool, the faster we are at using it and the more value we’re gaining from the product.
6 changes: 3 additions & 3 deletions website/blog/2022-07-26-pre-commit-dbt.md
@@ -112,7 +112,7 @@ The last step of our flow is to make those pre-commit checks part of the day-to-

Adding periodic pre-commit checks can be done in two different ways: through CI (Continuous Integration) actions, or as git hooks when running dbt locally.

-#### a) Adding pre-commit-dbt to the CI flow (works for dbt Cloud and dbt CLI users)
+#### a) Adding pre-commit-dbt to the CI flow (works for dbt Cloud and dbt Core users)

The example below will assume GitHub actions as the CI engine but similar behavior could be achieved in any other CI tool.

@@ -237,9 +237,9 @@ With that information, I could now go back to dbt, document my model customers a

We could set up rules that prevent any change to be merged if the GitHub action fails. Alternatively, this action step can be defined as merely informational.

-#### b) Installing the pre-commit git hooks (for dbt CLI users)
+#### b) Installing the pre-commit git hooks (for dbt Core users)

-If we develop locally with the dbt CLI, we could also execute `pre-commit install` to install the git hooks. What it means then is that every time we want to commit code in git, the pre-commit hooks will run and will prevent us from committing if any step fails.
+If we develop locally with the dbt Core CLI, we could also execute `pre-commit install` to install the git hooks. What it means then is that every time we want to commit code in git, the pre-commit hooks will run and will prevent us from committing if any step fails.

If we want to commit code without performing all the steps of the pre-hook we could use the environment variable SKIP or the git flag `--no-verify` as described [in the documentation](https://pre-commit.com/#temporarily-disabling-hooks). (e.g. we might want to skip the auto `dbt docs generate` locally to prevent it from running at every commit and rely on running it manually from time to time)
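
As a hedged sketch of that local workflow (the hook id in the SKIP example is an assumption; substitute whichever hooks your `.pre-commit-config.yaml` actually defines):

```bash
# Install the git hooks once per clone of the repository
pre-commit install

# From now on, every `git commit` runs the configured hooks automatically
git commit -m "Document the customers model"

# Skip one specific hook for a single commit (hook id is an assumed example)
SKIP=dbt-docs-generate git commit -m "Small fix; regenerate docs later"

# Or bypass every hook for a single commit
git commit --no-verify -m "Emergency hotfix"
```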

118 changes: 118 additions & 0 deletions website/blog/2023-10-31-to-defer-or-to-clone.md
@@ -0,0 +1,118 @@
---

title: To defer or to clone, that is the question
description: "In dbt v1.6, we introduce support for zero-copy cloning via the new dbt clone command. In this blog post, Kshitij will cover what clone is, how it is different from deferral, and when to use each."
slug: to-defer-or-to-clone

image: /img/blog/2023-10-31-to-defer-or-to-clone/preview.png

authors: [kshitij_aranke, doug_beatty]

tags: [analytics craft]
hide_table_of_contents: false

date: 2023-10-31
is_featured: true

---

Hi all, I’m Kshitij, a senior software engineer on the Core team at dbt Labs.
One of the coolest moments of my career here thus far has been shipping the new `dbt clone` command as part of the dbt-core v1.6 release.

However, one of the most frequent questions I’ve received is “when” to clone, which calls for guidance beyond [the documentation on “how” to clone](https://docs.getdbt.com/reference/commands/clone).
In this blog post, I’ll attempt to provide this guidance by answering these FAQs:

1. What is `dbt clone`?
2. How is it different from deferral?
3. Should I defer or should I clone?
<!--truncate-->
## What is `dbt clone`?

`dbt clone` is a new command in dbt 1.6 that leverages native zero-copy clone functionality on supported warehouses to **copy entire schemas for free, almost instantly**.

### How is this possible?

Well, the warehouse “cheats” by only copying metadata from the `source` schema to the `target` schema; the underlying data remains at rest during this operation.
This metadata includes materialized objects like tables and views, which is why you see a **clone** of these objects in the target schema.

In computer science jargon, `clone` makes a copy of the pointer from the `source` schema to the underlying data; after the operation there are now two pointers (`source` and `target` schemas) that each point to the same underlying data.
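
As a quick, hedged illustration of the command itself (the artifacts path and selector below are assumptions for the example; see the clone documentation for the full set of options):

```bash
# Sketch of dbt clone usage. --state points at the artifacts (manifest.json)
# from the run whose objects you want cloned into your current target schema.
dbt clone --state path/to/prod/artifacts

# Clone only a subset of models, overwriting anything already in the target
dbt clone --select finance_mart+ --state path/to/prod/artifacts --full-refresh
```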

## How is cloning different from deferral?

On the surface, cloning and deferral seem similar – **they’re both ways to save costs in the data warehouse.**
They do this by bypassing expensive model re-computations – clone by [eagerly copying](https://en.wikipedia.org/wiki/Evaluation_strategy#Eager_evaluation) an entire schema into the target schema, and defer by [lazily referencing](https://en.wikipedia.org/wiki/Lazy_evaluation) pre-built models in the source schema.

Let’s unpack this sentence and explore its first-order effects:

| | defer | clone |
|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| **How do I use it?** | Implicit via the `--defer` flag | Explicit via the `dbt clone` command |
| **What are its outputs?** | Doesn't create any objects itself, but dbt might create objects in the target schema if they’ve changed from those in the source schema. | Copies objects from the source schema to the target schema in the data warehouse; these copies persist after the operation is finished. |
| **How does it work?** | Compares manifests between the source and target dbt runs, and overrides `ref` so that models not built in the target run resolve to objects built in the source run. | Uses zero-copy cloning if available to copy objects from the source schema to the target schema, else creates pointer views (`select * from my_model`) |
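
To make the “How do I use it?” row concrete: deferral is just a flag passed to existing commands, whereas cloning is its own command. A minimal sketch, with the artifacts path assumed for illustration:

```bash
# Deferral in action: build one model locally while every unbuilt upstream
# reference resolves to the objects recorded in the source run's manifest.
dbt run --select my_model --defer --state path/to/prod/artifacts

# Cloning, by contrast, is an explicit command that copies the source run's
# objects into the current target schema.
dbt clone --state path/to/prod/artifacts
```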

These first-order effects lead to the following second-order effects that truly distinguish clone and defer from each other:

| | defer | clone |
|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------|
| **Where can I use objects built in the target schema?** | Only within the context of dbt | Any downstream tool (e.g. BI) |
| **Can I safely modify objects built in the target schema?** | No, since this would modify production data | Yes, cloning is a cheap way to create a sandbox of production data for experimentation |
| **Will data in the target schema drift from data in the source schema?** | No, since deferral will always point to the latest version of the source schema | Yes, since clone is a point-in-time operation |
| **Can I use multiple source schemas at once?** | Yes, defer can dynamically switch between source schemas e.g. ref unchanged models from production and changed models from staging | No, clone copies objects from one source schema to one target schema |

## Should I defer or should I clone?

Putting together all the points above, here’s a handy cheat sheet for when to defer and when to clone:

| | defer | clone |
|---------------------------------------------------------------------------|-------|-------|
| **Save time & cost by avoiding re-computation** | ✅ | ✅ |
| **Create database objects to be available in downstream tools (e.g. BI)** | ❌ | ✅ |
| **Safely modify objects in the target schema** | ❌ | ✅ |
| **Avoid creating new database objects** | ✅ | ❌ |
| **Avoid data drift** | ✅ | ❌ |
| **Support multiple dynamic sources** | ✅ | ❌ |

To absolutely drive this point home:

1. If you send someone this cheatsheet by linking to this page, you are deferring to this page
2. If you print out this page and write notes in the margins, you have cloned this page

## Putting it in practice

Using the cheat sheet above, let’s explore a few common scenarios and explore whether we should use defer or clone for each:

1. **Testing staging datasets in BI**

In this scenario, we want to:
1. Make a copy of our production dataset available in our downstream BI tool
2. To safely iterate on this copy without breaking production datasets

Therefore, we should use **clone** in this scenario

2. **[Slim CI](https://discourse.getdbt.com/t/how-we-sped-up-our-ci-runs-by-10x-using-slim-ci/2603)**

In this scenario, we want to:
1. Refer to production models wherever possible to speed up continuous integration (CI) runs
2. Only run and test models in the CI staging environment that have changed from the production environment
3. Reference models from different environments – prod for unchanged models, and staging for modified models

Therefore, we should use **defer** in this scenario

3. **[Blue/Green Deployments](https://discourse.getdbt.com/t/performing-a-blue-green-deploy-of-your-dbt-project-on-snowflake/1349)**

In this scenario, we want to:
1. Ensure that all tests are always passing on the production dataset, even if that dataset is slightly stale
2. Atomically rollback a promotion to production if tests aren’t passing across the entire staging dataset

In this scenario, we can use **clone** to implement a deployment strategy known as blue-green deployments where we build the entire staging dataset and then run tests against it, and only clone it over to production if all tests pass.
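
Here are hedged command sketches for the Slim CI and blue/green scenarios above; the artifact paths and target names are assumptions, and your CI system determines how the production or staging manifest is actually fetched:

```bash
# Slim CI (defer): run and test only models changed relative to production,
# while unchanged references resolve to the production schema.
dbt build --select state:modified+ --defer --state path/to/prod/artifacts

# Blue/green (clone): build and test everything in a staging target first...
dbt build --target staging

# ...then, only if that succeeds, clone the verified staging objects into prod.
# path/to/staging/artifacts is assumed to hold the manifest from the run above.
dbt clone --target prod --state path/to/staging/artifacts
```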


As a rule of thumb, deferral lends itself better to continuous integration (CI) use cases whereas cloning lends itself better to continuous deployment (CD) use cases.

## Wrapping Up

In this post, we covered what `dbt clone` is, how it is different from deferral, and when to use each. Often, they can be used together within the same project in different parts of the deployment lifecycle.

Thanks for reading, and I look forward to seeing what you build with `dbt clone`.

*Thanks to [Jason Ganz](https://docs.getdbt.com/author/jason_ganz) and [Gwen Windflower](https://www.linkedin.com/in/gwenwindflower/) for reviewing drafts of this article*
9 changes: 9 additions & 0 deletions website/blog/authors.yml
@@ -306,6 +306,15 @@ kira_furuichi:
name: Kira Furuichi
organization: dbt Labs

kshitij_aranke:
image_url: /img/blog/authors/kshitij-aranke.jpg
job_title: Senior Software Engineer
links:
- icon: fa-linkedin
url: https://www.linkedin.com/in/aranke/
name: Kshitij Aranke
organization: dbt Labs

lauren_benezra:
image_url: /img/blog/authors/lauren-benezra.jpeg
job_title: Analytics Engineer
7 changes: 6 additions & 1 deletion website/blog/ctas.yml
@@ -14,4 +14,9 @@
header: Join data practitioners worldwide at Coalesce 2023
subheader: Kicking off on October 16th, both online and in-person (Sydney, London, and San Diego)
button_text: Register now
url: https://coalesce.getdbt.com/?utm_medium=internal&utm_source=docs&utm_campaign=q3-2024_coalesce-2023_aw&utm_content=coalesce____&utm_term=all___
url: https://coalesce.getdbt.com/?utm_medium=internal&utm_source=docs&utm_campaign=q3-2024_coalesce-2023_aw&utm_content=coalesce____&utm_term=all___
- name: coalesce_2023_catchup
header: Missed Coalesce 2023?
subheader: Watch highlights and full sessions from Coalesce 2023, dbt Labs' annual analytics engineering conference.
button_text: Watch the talks
url: https://www.youtube.com/playlist?list=PL0QYlrC86xQnT3HLh-XgvoTf9F3lbsADf
2 changes: 1 addition & 1 deletion website/blog/metadata.yml
Expand Up @@ -2,7 +2,7 @@
featured_image: ""

# This CTA lives in right sidebar on blog index
featured_cta: "coalesce_2023_signup"
featured_cta: "coalesce_2023_catchup"

# Show or hide hero title, description, cta from blog index
show_title: true
8 changes: 8 additions & 0 deletions website/dbt-versions.js
@@ -27,6 +27,10 @@ exports.versions = [
]

exports.versionedPages = [
{
"page": "reference/resource-configs/store_failures_as",
"firstVersion": "1.7",
},
{
"page": "docs/build/build-metrics-intro",
"firstVersion": "1.6",
@@ -170,6 +174,10 @@ exports.versionedPages = [
{
"page": "reference/resource-configs/grants",
"firstVersion": "1.2",
},
{
"page": "docs/build/saved-queries",
"firstVersion": "1.7",
}
]

2 changes: 1 addition & 1 deletion website/docs/community/resources/viewpoint.md
@@ -7,7 +7,7 @@ id: "viewpoint"

In 2015-2016, a team of folks at RJMetrics had the opportunity to observe, and participate in, a significant evolution of the analytics ecosystem. The seeds of dbt were conceived in this environment, and the viewpoint below was written to reflect what we had learned and how we believed the world should be different. **dbt is our attempt to address the workflow challenges we observed, and as such, this viewpoint is the most foundational statement of the dbt project's goals.**

-The remainder of this document is largely unedited from [the original post](https://blog.getdbt.com/building-a-mature-analytics-workflow/).
+The remainder of this document is largely unedited from [the original post](https://getdbt.com/blog/building-a-mature-analytics-workflow).

:::

2 changes: 1 addition & 1 deletion website/docs/dbt-cli/cli-overview.md
@@ -3,7 +3,7 @@ title: "CLI overview"
description: "Run your dbt project from the command line."
---

-dbt Core ships with a command-line interface (CLI) for running your dbt project. The dbt CLI is free to use and available as an [open source project](https://github.com/dbt-labs/dbt-core).
+dbt Core ships with a command-line interface (CLI) for running your dbt project. dbt Core and its CLI are free to use and available as an [open source project](https://github.com/dbt-labs/dbt-core).

When using the command line, you can run commands and do other work from the current or _working directory_ on your computer. Before running the dbt project from the command line, make sure the working directory is your dbt project directory. For more details, see "[Creating a dbt project](/docs/build/projects)."

