add key configs for adapters using microbatch #6582

Merged
merged 12 commits on Dec 5, 2024
65 changes: 53 additions & 12 deletions website/docs/docs/build/incremental-microbatch.md
@@ -36,7 +36,7 @@ Each "batch" corresponds to a single bounded time period (by default, a single day).

This is a powerful abstraction that makes it possible for dbt to run batches [separately](#backfills), concurrently, and [retry](#retry) them independently.

### Example
## Example

A `sessions` model aggregates and enriches data that comes from two other models:
- `page_views` is a large, time-series table. It contains many rows, new records almost always arrive after existing ones, and existing records rarely update. It uses the `page_view_start` column as its `event_time`.
@@ -175,22 +175,63 @@ It does not matter whether the table already contains data for that day. Given t

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_filters.png" title="Each batch of sessions filters page_views to the matching time-bound batch, but doesn't filter sessions, performing a full scan for each batch."/>

### Relevant configs
## Relevant configs

Several configurations are relevant to microbatch models, and some are required:

| Config | Type | Description | Default |
|----------|------|---------------|---------|
| [`event_time`](/reference/resource-configs/event-time) | Column (required) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A |
| `begin` | Date (required) | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A |
| `batch_size` | String (required) | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year` | N/A |
| `lookback` | Integer (optional) | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` |
| Config | Description | Default | Type | Required |
|----------|---------------|---------|------|---------|
| [`event_time`](/reference/resource-configs/event-time) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A | Column | Required |
| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/>
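
For reference, here's a minimal sketch of a model that sets these configs together; the column name, `begin` date, and `lookback` value are illustrative, not prescribed:

```sql
-- A sketch only: adjust event_time, begin, batch_size, and lookback for your data.
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day',
    lookback=3
) }}

select * from {{ ref('page_views') }}
```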

### Required configs for specific adapters
added a separate header for this @QMalcolm 🙏

Some adapters require additional configurations for the microbatch strategy because each adapter implements it differently.

The following table lists the configurations that specific adapters require, in addition to the standard microbatch configs:

| Adapter | `unique_key` config | `partition_by` config |
|----------|------------------|--------------------|
| [`dbt-postgres`](/reference/resource-configs/postgres-configs#incremental-materialization-strategies) | ✅ Required | N/A |
| [`dbt-spark`](/reference/resource-configs/spark-configs#incremental-models) | N/A | ✅ Required |
| [`dbt-bigquery`](/reference/resource-configs/bigquery-configs#merge-behavior-incremental-models) | N/A | ✅ Required |

For example, if you're using `dbt-postgres`, configure `unique_key` as follows:

<File name="models/sessions.sql">

```sql
-- unique_key is required for dbt-postgres' microbatch strategy
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    unique_key='sales_id',
    event_time='transaction_date',
    begin='2023-01-01',
    batch_size='day'
) }}

select
    sales_id,
    transaction_date,
    customer_id,
    product_id,
    total_amount
from {{ source('sales', 'transactions') }}

```

</File>

In this example, `unique_key` is required because `dbt-postgres`' microbatch implementation uses the `merge` strategy, which needs a `unique_key` to identify which rows in the data warehouse to merge. Without a `unique_key`, dbt can't match rows between the incoming batch and the existing table.
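
For adapters that require `partition_by` instead (`dbt-spark` and `dbt-bigquery`), the config follows the adapter's standard incremental `partition_by` shape. Here's a hedged sketch for `dbt-bigquery`; the table and column names are illustrative, and the partition granularity is assumed to match the `batch_size`:

```sql
-- A sketch for dbt-bigquery: partition_by is required for microbatch.
-- The partition field matches event_time and its granularity matches batch_size.
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    partition_by={
      'field': 'transaction_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    event_time='transaction_date',
    begin='2023-01-01',
    batch_size='day'
) }}

select
    sales_id,
    transaction_date,
    customer_id,
    product_id,
    total_amount
from {{ source('sales', 'transactions') }}
```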

### Full refresh

As a best practice, we recommend configuring `full_refresh: False` on microbatch models so that they ignore invocations with the `--full-refresh` flag. If you need to reprocess historical data, do so with a targeted backfill that specifies explicit start and end dates.
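
For example, a minimal sketch of opting a model out of `--full-refresh` (the `event_time` column and `begin` date are placeholders):

```sql
-- full_refresh=false tells dbt to ignore the --full-refresh flag for this model.
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day',
    full_refresh=false
) }}

select * from {{ ref('page_views') }}
```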

### Usage
## Usage

**You must write your model query to process (read and return) exactly one "batch" of data**. This is a simplifying assumption and a powerful one:
- You don’t need to think about `is_incremental` filtering
@@ -207,7 +248,7 @@ During standard incremental runs, dbt will process batches according to the curr

**Note:** If there’s an upstream model that configures `event_time`, but you *don’t* want the reference to it to be filtered, you can specify `ref('upstream_model').render()` to opt-out of auto-filtering. This isn't generally recommended — most models that configure `event_time` are fairly large, and if the reference is not filtered, each batch will perform a full scan of this input table.
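
As a sketch (model and column names are illustrative): `page_views` configures `event_time`, so dbt filters that reference to the current batch, while the reference to `customers` uses `.render()` to opt out of filtering:

```sql
select
    pv.page_view_id,
    pv.page_view_start,
    c.customer_name
-- page_views configures event_time, so dbt filters this reference to the current batch
from {{ ref('page_views') }} as pv
-- .render() opts this reference out of auto-filtering (not generally recommended)
left join {{ ref('customers').render() }} as c
    on pv.customer_id = c.customer_id
```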

### Backfills
## Backfills

Whether to fix erroneous source data or retroactively apply a change in business logic, you may need to reprocess a large amount of historical data.

@@ -222,13 +263,13 @@ dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_backfill.png" title="Configure a lookback to reprocess additional batches during standard incremental runs"/>

### Retry
## Retry

If one or more of your batches fail, you can use `dbt retry` to reprocess _only_ the failed batches.

![Partial retry](https://github.com/user-attachments/assets/f94c4797-dcc7-4875-9623-639f70c97b8f)

### Timezones
## Timezones

For now, dbt assumes that all values supplied are in UTC:
