Merge branch 'current' into update-ratio-metrics

mirnawong1 authored Oct 7, 2024
2 parents c6ab151 + 0fd277a commit 66ecd13

Showing 76 changed files with 8,748 additions and 4,410 deletions.
12 changes: 12 additions & 0 deletions website/dbt-versions.js
@@ -18,6 +18,10 @@ exports.versions = [
version: "1.9.1",
customDisplay: "Cloud (Versionless)",
},
{
version: "1.9",
isPrerelease: true,
},
{
version: "1.8",
EOLDate: "2025-04-15",
@@ -42,6 +46,14 @@ exports.versions = [
 * @property {string} lastVersion The last version the page is visible in the sidebar
 */
exports.versionedPages = [
  {
    page: "docs/build/incremental-microbatch",
    firstVersion: "1.9",
  },
  {
    page: "reference/resource-configs/snapshot_meta_column_names",
    firstVersion: "1.9",
  },
  {
    page: "reference/resource-configs/target_database",
    lastVersion: "1.8",
@@ -102,12 +102,14 @@ We’ve focused heavily thus far on the primary area of action in our dbt projec

### Project splitting

One important, growing consideration in the analytics engineering ecosystem is how and when to split a codebase into multiple dbt projects. Our present stance on this for most projects, particularly for teams starting out, is straightforward: you should avoid it unless you have no other option or it saves you from an even more complex workaround. If you do have the need to split up your project, it’s completely possible through the use of private packages, but the added complexity and separation is, for most organizations, a hindrance, not a help, at present. That said, this is very likely subject to change! [We want to create a world where it’s easy to bring lots of dbt projects together into a cohesive lineage](https://github.com/dbt-labs/dbt-core/discussions/5244). In a world where it’s simple to break up monolithic dbt projects into multiple connected projects, perhaps inside of a modern mono repo, the calculus will be different, and the below situations we recommend against may become totally viable. So watch this space!
One important, growing consideration in the analytics engineering ecosystem is how and when to split a codebase into multiple dbt projects. Currently, our advice for most teams, especially those just starting out, is fairly simple: in most cases where a split makes sense, we recommend doing it with [dbt Mesh](/best-practices/how-we-mesh/mesh-1-intro)! dbt Mesh allows organizations to handle complexity by connecting several dbt projects rather than relying on one big, monolithic project. This approach is designed to speed up development while maintaining governance.

- ❌ **Business groups or departments.** Conceptual separations within the project are not a good reason to split up your project. Splitting up, for instance, marketing and finance modeling into separate projects will not only add unnecessary complexity but destroy the unifying effect of collaborating across your organization on cohesive definitions and business logic.
- ❌ **ML vs Reporting use cases.** Similarly to the point above, splitting a project up based on different use cases, particularly more standard BI versus ML features, is a common idea. We tend to discourage it for the time being. As with the previous point, a foundational goal of implementing dbt is to create a single source of truth in your organization. The features you’re providing to your data science teams should be coming from the same marts and metrics that serve reports on executive dashboards.
As breaking up monolithic dbt projects into smaller, connected projects (potentially within a modern mono repo) becomes easier, the scenarios we currently advise against may soon become feasible. So watch this space!

- ✅ **Business groups or departments.** Conceptual separations within the project are the primary reason to split up your project. This allows your business domains to own their own data products and still collaborate using dbt Mesh. For more information about dbt Mesh, please refer to our [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-5-faqs).
- ✅ **Data governance.** Structural, organizational needs — such as data governance and security — are one of the few worthwhile reasons to split up a project. If, for instance, you work at a healthcare company with only a small team cleared to access raw data with PII in it, you may need to split out your staging models into their own projects to preserve those policies. In that case, you would import your staging project into the project that builds on those staging models as a [private package](https://docs.getdbt.com/docs/build/packages/#private-packages); a minimal example of that import follows this list.
- ✅ **Project size.** At a certain point, your project may grow to have simply too many models to present a viable development experience. If you have 1000s of models, it absolutely makes sense to find a way to split up your project.
- ❌ **ML vs Reporting use cases.** Similarly to the point above, splitting a project up based on different use cases, particularly more standard BI versus ML features, is a common idea. We tend to discourage it for the time being. As with the previous point, a foundational goal of implementing dbt is to create a single source of truth in your organization. The features you’re providing to your data science teams should be coming from the same marts and metrics that serve reports on executive dashboards.
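
A minimal sketch of what that private-package import could look like in the downstream project's `packages.yml` (the repository URL and revision here are hypothetical):

```yaml
packages:
  - git: "https://github.com/your-org/staging-project.git"  # hypothetical private repository
    revision: "1.2.0"  # pin to a tag, branch, or commit SHA
```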

## Final considerations

285 changes: 285 additions & 0 deletions website/docs/docs/build/incremental-microbatch.md
@@ -0,0 +1,285 @@
---
title: "About microbatch incremental models"
description: "Learn about the 'microbatch' strategy for incremental models."
id: "incremental-microbatch"
---

# About microbatch incremental models <Lifecycle status="beta" />

:::info Microbatch

The `microbatch` strategy is available in beta for [dbt Cloud Versionless](/docs/dbt-versions/upgrade-dbt-version-in-cloud#versionless) and dbt Core v1.9. We have been developing it behind a flag to prevent unintended interactions with existing custom incremental strategies. To enable this feature, set the environment variable `DBT_EXPERIMENTAL_MICROBATCH` to `True` in your dbt Cloud environments or wherever you're running dbt Core.

Read and participate in the discussion: [dbt-core#10672](https://github.com/dbt-labs/dbt-core/discussions/10672)

:::
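
For example, if you run dbt Core locally from a Unix-like shell, enabling the flag might look like the sketch below; dbt Cloud users would instead add the variable to their environment settings.

```bash
# Enable the beta microbatch strategy for this shell session,
# then invoke dbt as usual.
export DBT_EXPERIMENTAL_MICROBATCH=True
dbt run
```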

## What is "microbatch" in dbt?

Incremental models in dbt are a [materialization](/docs/build/materializations) designed to efficiently update your data warehouse tables by only transforming and loading _new or changed data_ since the last run. Instead of reprocessing an entire dataset every time, incremental models process a smaller number of rows, and then append, update, or replace those rows in the existing table. This can significantly reduce the time and resources required for your data transformations.

Microbatch incremental models make it possible to process transformations on very large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or in specified backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` and `batch_size` you configure.

Each "batch" corresponds to a single bounded time period (by default, a single day of data). Where other incremental strategies operate only on "old" and "new" data, microbatch models treat every batch as an atomic unit that can be built or replaced on its own. Each batch is independent and <Term id="idempotent" />. This is a powerful abstraction that makes it possible for dbt to run batches separately — in the future, concurrently — and to retry them independently.

### Example

A `sessions` model is aggregating and enriching data that comes from two other models:
- `page_views` is a large, time-series table. It contains many rows, new records almost always arrive after existing ones, and existing records rarely update.
- `customers` is a relatively small dimensional table. Customer attributes update often, and not in a time-based manner — that is, older customers are just as likely to change column values as newer customers.

The `page_view_start` column in `page_views` is configured as that model's `event_time`. The `customers` model does not configure an `event_time`. Therefore, each batch of `sessions` will filter `page_views` to the equivalent time-bounded batch, and it will not filter `customers` (a full scan for every batch).
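
As a sketch of that upstream configuration (assuming a standard properties file for `page_views`), the `event_time` column might be declared like this:

```yaml
models:
  - name: page_views
    config:
      event_time: page_view_start  # the column recording when each page view occurred
```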

We run the `sessions` model on October 1, 2024, and then again on October 2. It produces the following queries:

<Tabs>

<TabItem value="Model definition">

<File name="models/sessions.sql">

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01'
) }}

with page_views as (

    -- this ref will be auto-filtered
    select * from {{ ref('page_views') }}

),

customers as (

    -- this ref won't
    select * from {{ ref('customers') }}

),

...
```

</File>

</TabItem>

<TabItem value="Compiled (Oct 1, 2024)">

<File name="target/compiled/sessions.sql">

```sql

with page_views as (

    select * from (
        -- filtered on configured event_time
        select * from "analytics"."page_views"
        where page_view_start >= '2024-10-01 00:00:00' -- Oct 1
          and page_view_start < '2024-10-02 00:00:00'
    )

),

customers as (

    select * from "analytics"."customers"

),

...
```

</File>

</TabItem>

<TabItem value="Compiled (Oct 2, 2024)">

<File name="target/compiled/sessions.sql">

```sql

with page_views as (

    select * from (
        -- filtered on configured event_time
        select * from "analytics"."page_views"
        where page_view_start >= '2024-10-02 00:00:00' -- Oct 2
          and page_view_start < '2024-10-03 00:00:00'
    )

),

customers as (

    select * from "analytics"."customers"

),

...
```

</File>

</TabItem>

</Tabs>

dbt will instruct the data platform to take the result of each batch query and insert, update, or replace the contents of the `analytics.sessions` table for the same day of data. To perform this operation, dbt will use the most efficient atomic mechanism for "full batch" replacement that is available on each data platform.

It does not matter whether the table already contains data for that day or not. Given the same input data, no matter how many times a batch is reprocessed, the resulting table is the same.
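
As a purely conceptual illustration (dbt generates and manages this step itself, and the exact mechanism varies by platform), replacing the October 1 batch could amount to something like the following, where the batch relation name is hypothetical:

```sql
-- Conceptual sketch only: dbt may instead use merge, insert overwrite,
-- or partition replacement, depending on the data platform.
delete from analytics.sessions
where session_start >= '2024-10-01 00:00:00'
  and session_start < '2024-10-02 00:00:00';

insert into analytics.sessions
select * from analytics.sessions__batch_20241001;  -- hypothetical relation holding the Oct 1 batch result
```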

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_filters.png" title="Each batch of sessions filters page_views to the matching time-bound batch, but doesn't filter sessions, performing a full scan for each batch."/>

### Relevant configs

Several configurations are relevant to microbatch models, and some are required:

| Config | Type | Description | Default |
|----------|------|---------------|---------|
| `event_time` | Column (required) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A |
| `begin` | Date (required) | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A |
| `batch_size` | String (optional) | The granularity of your batches. The default is `day` (and currently this is the only granularity supported). | `day` |
| `lookback` | Integer (optional) | The number of batches prior to the latest bookmark to process, to capture late-arriving records. | `0` |

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/>

As a best practice, we recommend configuring `full_refresh: False` on microbatch models so that they ignore invocations with the `--full-refresh` flag. If you need to reprocess historical data, do so with a targeted backfill that specifies explicit start and end dates.
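
For example, a minimal sketch of setting this in the model's YAML configuration (reusing the `sessions` model from the example above):

```yaml
models:
  - name: sessions
    config:
      full_refresh: false  # ignore --full-refresh; reprocess history with targeted event-time backfills instead
```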

### Usage

**You must write your model query to process (read and return) exactly one "batch" of data**. This is a simplifying assumption and a powerful one:
- You don’t need to think about `is_incremental` filtering
- You don't need to pick among DML strategies (upserting/merging/replacing)
- You can preview your model, and see the exact records for a given batch that will appear when that batch is processed and written to the table

When you run a microbatch model, dbt will evaluate which batches need to be loaded, break them up into a SQL query per batch, and load each one independently.

dbt will automatically filter upstream inputs (`source` or `ref`) that define `event_time`, based on the `lookback` and `batch_size` configs for this model.

During standard incremental runs, dbt will process batches according to the current timestamp and the configured `lookback`, with one query per batch.

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_lookback.png" title="Configure a lookback to reprocess additional batches during standard incremental runs"/>

**Note:** If there’s an upstream model that configures `event_time`, but you *don’t* want the reference to it to be filtered, you can specify `ref('upstream_model').render()` to opt-out of auto-filtering. This isn't generally recommended — most models which configure `event_time` are fairly large, and if the reference is not filtered, each batch will perform a full scan of this input table.
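
For instance, a minimal sketch of opting a single reference out of auto-filtering, reusing the `page_views` model from the earlier example:

```sql
with page_views_history as (

    -- .render() opts this reference out of auto-filtering, so every batch
    -- scans the full page_views table rather than a single time slice
    select * from {{ ref('page_views').render() }}

)

select * from page_views_history
```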

### Backfills

Whether to fix erroneous source data, or retroactively apply a change in business logic, you may need to reprocess a large amount of historical data.

Backfilling a microbatch model is as simple as selecting it to run or build, and specifying a "start" and "end" for `event_time`. As always, dbt will process the batches between the start and end as independent queries.

```bash
dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"
```

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_backfill.png" title="Backfill a range of batches by specifying event-time start and end dates"/>

### Retry

If one or more of your batches fail, you can use `dbt retry` to reprocess _only_ the failed batches.
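
For example, after an invocation in which some batches failed:

```bash
# Reprocess only the batches that failed in the previous invocation
dbt retry
```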

![Partial retry](https://github.com/user-attachments/assets/f94c4797-dcc7-4875-9623-639f70c97b8f)

### Timezones

For now, dbt assumes that all values supplied are in UTC:

- `event_time`
- `begin`
- `--event-time-start`
- `--event-time-end`

While we may consider adding support for custom timezones in the future, we also believe that defining these values in UTC makes everyone's lives easier.

## How `microbatch` compares to other incremental strategies

Most incremental models rely on the end user (you) to explicitly tell dbt what "new" means, in the context of each model, by writing a filter in an `{% if is_incremental() %}` conditional block. You are responsible for crafting this SQL in a way that queries [`{{ this }}`](/reference/dbt-jinja-functions/this) to check when the most recent record was last loaded, with an optional look-back window for late-arriving records.

Other incremental strategies will control _how_ the data is being added into the table — whether append-only `insert`, `delete` + `insert`, `merge`, `insert overwrite`, etc — but they all have this in common.

As an example:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='date_day'
    )
}}

select * from {{ ref('stg_events') }}

{% if is_incremental() %}
-- this filter will only be applied on an incremental run
-- add a lookback window of 3 days to account for late-arriving records
where date_day >= (select {{ dbt.dateadd("day", -3, "max(date_day)") }} from {{ this }})
{% endif %}

```

For this incremental model:

- "New" records are those with a `date_day` greater than the maximum `date_day` that has previously been loaded
- The lookback window is 3 days
- When there are new records for a given `date_day`, the existing data for `date_day` is deleted and the new data is inserted

Let’s take our same example from before, and instead use the new `microbatch` incremental strategy:

<File name="models/staging/stg_events.sql">

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_occurred_at',
        batch_size='day',
        lookback=3,
        begin='2020-01-01',
        full_refresh=false
    )
}}

select * from {{ ref('stg_events') }} -- this ref will be auto-filtered
```

</File>

Where you’ve also set an `event_time` for the model’s direct parents - in this case `stg_events`:

<File name="models/staging/stg_events.yml">

```yaml
models:
  - name: stg_events
    config:
      event_time: my_time_field
```
</File>

And that’s it!

When you run the model, each batch templates a separate query. For example, if you were running the model on October 1, dbt would template separate queries for each day between September 28 and October 1, inclusive — four batches in total.

The query for `2024-10-01` would look like:

<File name="target/compiled/staging/stg_events.sql">

```sql
select * from (
    select * from "analytics"."stg_events"
    where my_time_field >= '2024-10-01 00:00:00'
      and my_time_field < '2024-10-02 00:00:00'
)
```

</File>

Based on your data platform, dbt will choose the most efficient atomic mechanism to insert, update, or replace these four batches (`2024-09-28`, `2024-09-29`, `2024-09-30`, and `2024-10-01`) in the existing table.
1 change: 1 addition & 0 deletions website/docs/docs/build/incremental-models-overview.md
@@ -42,4 +42,5 @@ Transaction management, a process used in certain data platforms, ensures that a
## Related docs
- [Incremental models](/docs/build/incremental-models) to learn how to configure incremental models in dbt.
- [Incremental strategies](/docs/build/incremental-strategy) to understand how dbt implements incremental models on different databases.
- [Microbatch](/docs/build/incremental-microbatch) <Lifecycle status="beta" /> to understand a new incremental strategy intended for efficient and resilient processing of very large time-series datasets.
- [Materializations best practices](/best-practices/materializations/1-guide-overview) to learn about the best practices for using materializations in dbt.