From b48ed25c0da351410005bb1d89dc63b5c8b56659 Mon Sep 17 00:00:00 2001
From: mirnawong1
Date: Tue, 3 Dec 2024 15:16:30 +0000
Subject: [PATCH 1/6] add note about adapter requirement

---
 .../docs/docs/build/incremental-microbatch.md | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md
index 9055aa7650..9138d14e51 100644
--- a/website/docs/docs/build/incremental-microbatch.md
+++ b/website/docs/docs/build/incremental-microbatch.md
@@ -179,12 +179,18 @@ It does not matter whether the table already contains data for that day. Given t
 
 Several configurations are relevant to microbatch models, and some are required:
 
-| Config | Type | Description | Default |
-|----------|------|---------------|---------|
-| [`event_time`](/reference/resource-configs/event-time) | Column (required) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A |
-| `begin` | Date (required) | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A |
-| `batch_size` | String (required) | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A |
-| `lookback` | Integer (optional) | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` |
+| Config | Description | Default | Type | Required |
+|----------|---------------|---------|------|---------|
+| [`event_time`](/reference/resource-configs/event-time) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A | Column | Required |
+| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
+| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
+| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
+| `unique_key` | The column(s) (string or array) or expression that uniquely identifies each record. Required for the `check` strategy. | N/A | String | Optional* |
+| `partition_by` | The column(s) (string or array) or expression used to partition the table. Required for the `check` strategy. | N/A | String | Optional* |
+
+***Note:**
+- `unique_key` is _required_ for the check strategy when using the `dbt-postgres` adapter.
+- `partition_by` is _required_ for the check strategy when using the `dbt-spark` and `dbt-bigquery` adapters.
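To make the configs added in this patch concrete, here is a minimal sketch of a daily-grain microbatch model that uses all four of them. The model and column names (`stg_user_events`, `event_occurred_at`) are hypothetical and not part of the patch:

```sql
{# Hypothetical daily microbatch model: event_time marks when each row
   occurred, begin anchors the first batch, batch_size sets the grain,
   and lookback reprocesses recent batches to catch late-arriving records. #}
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_occurred_at',
    begin='2023-10-01',
    batch_size='day',
    lookback=3
) }}

select * from {{ ref('stg_user_events') }}
```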
From 68d14318742cc387a9249ed11716e60b63cbf7d9 Mon Sep 17 00:00:00 2001
From: mirnawong1
Date: Tue, 3 Dec 2024 15:37:43 +0000
Subject: [PATCH 2/6] clarify microbatch

---
 website/docs/docs/build/incremental-microbatch.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md
index 9138d14e51..0c411c02d1 100644
--- a/website/docs/docs/build/incremental-microbatch.md
+++ b/website/docs/docs/build/incremental-microbatch.md
@@ -189,8 +189,8 @@ Several configurations are relevant to microbatch models, and some are required:
 | `partition_by` | The column(s) (string or array) or expression used to partition the table. Required for the `check` strategy. | N/A | String | Optional* |
 
 ***Note:**
-- `unique_key` is _required_ for the check strategy when using the `dbt-postgres` adapter.
-- `partition_by` is _required_ for the check strategy when using the `dbt-spark` and `dbt-bigquery` adapters.
+- `unique_key` is _required_ for the microbatch strategy when using the `dbt-postgres` adapter.
+- `partition_by` is _required_ for the microbatch strategy when using the `dbt-spark` and `dbt-bigquery` adapters.

From 2f82a60108cf2590204bf63213c114233e01a1d0 Mon Sep 17 00:00:00 2001
From: Mirna Wong <89008547+mirnawong1@users.noreply.github.com>
Date: Wed, 4 Dec 2024 10:54:34 +0000
Subject: [PATCH 3/6] Update website/docs/docs/build/incremental-microbatch.md

---
 website/docs/docs/build/incremental-microbatch.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md
index 0c411c02d1..d15d5b0582 100644
--- a/website/docs/docs/build/incremental-microbatch.md
+++ b/website/docs/docs/build/incremental-microbatch.md
@@ -186,7 +186,7 @@ Several configurations are relevant to microbatch models, and some are required:
 | `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
 | `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
 | `unique_key` | The column(s) (string or array) or expression that uniquely identifies each record. Required for the `check` strategy. | N/A | String | Optional* |
-| `partition_by` | The column(s) (string or array) or expression used to partition the table. Required for the `check` strategy. | N/A | String | Optional* |
+| `partition_by` | The column(s) (string or array) or expression used to partition the table. | N/A | String | Optional* |
 
 ***Note:**
 - `unique_key` is _required_ for the microbatch strategy when using the `dbt-postgres` adapter.
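The note above makes `partition_by` a requirement on `dbt-spark` and `dbt-bigquery`. As a hedged illustration of what that can look like on BigQuery (the model and column names are hypothetical, and the partition granularity is assumed to match `batch_size`):

```sql
{# Hypothetical microbatch model for dbt-bigquery, where partition_by is
   required; the partition granularity mirrors the daily batch_size. #}
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2024-01-01',
    batch_size='day',
    partition_by={
        'field': 'session_start',
        'data_type': 'timestamp',
        'granularity': 'day'
    }
) }}

select * from {{ ref('stg_sessions') }}
```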
From 6704c0d2fe4bc127474474dbd6109d25e61071cc Mon Sep 17 00:00:00 2001
From: Mirna Wong <89008547+mirnawong1@users.noreply.github.com>
Date: Wed, 4 Dec 2024 10:54:41 +0000
Subject: [PATCH 4/6] Update website/docs/docs/build/incremental-microbatch.md

---
 website/docs/docs/build/incremental-microbatch.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md
index d15d5b0582..ead67b930a 100644
--- a/website/docs/docs/build/incremental-microbatch.md
+++ b/website/docs/docs/build/incremental-microbatch.md
@@ -185,7 +185,7 @@ Several configurations are relevant to microbatch models, and some are required:
 | `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
 | `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
 | `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
-| `unique_key` | The column(s) (string or array) or expression that uniquely identifies each record. Required for the `check` strategy. | N/A | String | Optional* |
+| `unique_key` | The column(s) (string or array) or expression that uniquely identifies each record. | N/A | String | Optional* |
 | `partition_by` | The column(s) (string or array) or expression used to partition the table. | N/A | String | Optional* |
 
 ***Note:**
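Since the table allows `unique_key` to be a string or an array, a brief hypothetical sketch of both forms (the `orders` names are invented for illustration):

```sql
{# unique_key given as a single column (string); a compound key could be
   passed as an array instead, e.g. unique_key=['order_id', 'order_line']. #}
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='ordered_at',
    begin='2024-01-01',
    batch_size='day',
    unique_key='order_id'
) }}

select * from {{ ref('stg_orders') }}
```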
From 0b93919eeda7028a7447678e3956bc700b4a1bbe Mon Sep 17 00:00:00 2001
From: mirnawong1
Date: Wed, 4 Dec 2024 11:29:19 +0000
Subject: [PATCH 5/6] rejig page

---
 .../docs/docs/build/incremental-microbatch.md | 59 +++++++++++++++----
 1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md
index ead67b930a..377710f48f 100644
--- a/website/docs/docs/build/incremental-microbatch.md
+++ b/website/docs/docs/build/incremental-microbatch.md
@@ -36,7 +36,7 @@ Each "batch" corresponds to a single bounded time period (by default, a single d
 
 This is a powerful abstraction that makes it possible for dbt to run batches [separately](#backfills), concurrently, and [retry](#retry) them independently.
 
-### Example
+## Example
 
 A `sessions` model aggregates and enriches data that comes from two other models:
 - `page_views` is a large, time-series table. It contains many rows, new records almost always arrive after existing ones, and existing records rarely update. It uses the `page_view_start` column as its `event_time`.
@@ -175,7 +175,7 @@ It does not matter whether the table already contains data for that day. Given t
 
-### Relevant configs
+## Relevant configs
 
 Several configurations are relevant to microbatch models, and some are required:
 
 | `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
 | `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
 | `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
-| `unique_key` | The column(s) (string or array) or expression that uniquely identifies each record. | N/A | String | Optional* |
-| `partition_by` | The column(s) (string or array) or expression used to partition the table. | N/A | String | Optional* |
-
-***Note:**
-- `unique_key` is _required_ for the microbatch strategy when using the `dbt-postgres` adapter.
-- `partition_by` is _required_ for the microbatch strategy when using the `dbt-spark` and `dbt-bigquery` adapters.
+### Required configs for specific adapters
+Some adapters require additional configurations for the microbatch strategy because each adapter implements the strategy differently.
+
+The following table lists the configurations these adapters require in addition to the standard microbatch configs:
+
+| Adapter | `unique_key` config | `partition_by` config |
+|----------|------------------|--------------------|
+| [`dbt-postgres`](/reference/resource-configs/postgres-configs#incremental-materialization-strategies) | ✅ Required | N/A |
+| [`dbt-spark`](/reference/resource-configs/spark-configs#incremental-models) | N/A | ✅ Required |
+| [`dbt-bigquery`](/reference/resource-configs/bigquery-configs#merge-behavior-incremental-models) | N/A | ✅ Required |
+
+For example, if you're using `dbt-postgres`, configure `unique_key` as follows:
+
+```sql
+{# unique_key is required for the microbatch strategy on dbt-postgres #}
+{{ config(
+    materialized='incremental',
+    incremental_strategy='microbatch',
+    unique_key='sales_id',
+    event_time='transaction_date',
+    begin='2023-01-01',
+    batch_size='day'
+) }}
+
+select
+    sales_id,
+    transaction_date,
+    customer_id,
+    product_id,
+    total_amount
+from {{ source('sales', 'transactions') }}
+```
+
+ In this example, `unique_key` is required because `dbt-postgres`' microbatch uses the `merge` strategy, which needs a `unique_key` to identify which rows in the data warehouse need to get merged. Without a `unique_key`, dbt won't be able to match rows between the incoming batch and the existing table.
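For comparison, a hedged sketch of the `dbt-spark` case from the adapter table above, where `partition_by` is required instead of `unique_key` (reusing the hypothetical `sales` source; on Spark, each batch replaces a whole partition, so the model partitions on the event-time column):

```sql
{# partition_by is required for the microbatch strategy on dbt-spark;
   partitioning on the event_time column lets each daily batch map
   onto the partitions it overwrites. #}
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    partition_by=['transaction_date'],
    event_time='transaction_date',
    begin='2023-01-01',
    batch_size='day'
) }}

select
    sales_id,
    transaction_date,
    customer_id,
    product_id,
    total_amount
from {{ source('sales', 'transactions') }}
```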
+### Full refresh
+
 As a best practice, we recommend configuring `full_refresh: False` on microbatch models so that they ignore invocations with the `--full-refresh` flag. If you need to reprocess historical data, do so with a targeted backfill that specifies explicit start and end dates.
 
-### Usage
+## Usage
 
 **You must write your model query to process (read and return) exactly one "batch" of data**. This is a simplifying assumption and a powerful one:
 - You don’t need to think about `is_incremental` filtering
 - You don’t need to pick among DML strategies (upserting/merging/replacing)
 - You can preview your model, and see the exact records dbt will process when you `dbt run`
 
 During standard incremental runs, dbt will process batches according to the current timestamp and the configured `lookback` (with one query per batch).
 
 **Note:** If there’s an upstream model that configures `event_time`, but you *don’t* want the reference to it to be filtered, you can specify `ref('upstream_model').render()` to opt out of auto-filtering. This isn't generally recommended — most models that configure `event_time` are fairly large, and if the reference is not filtered, each batch will perform a full scan of this input table.
 
-### Backfills
+## Backfills
 
 Whether to fix erroneous source data or retroactively apply a change in business logic, you may need to reprocess a large amount of historical data.
 
 Backfilling a microbatch model is as simple as selecting it to run or build, and specifying a "start" and "end" for `event_time`. Note that `--event-time-start` and `--event-time-end` are mutually necessary, meaning that if you specify one, you must specify the other.
 
 As always, dbt will process the batches between the start and end as independent queries.
 
 ```bash
 dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"
 ```
 
-### Retry
+## Retry
 
 If one or more of your batches fail, you can use `dbt retry` to reprocess _only_ the failed batches.
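A hedged sketch of that flow from the command line (the `sessions` model name is hypothetical; `dbt retry` reuses the results of the previous invocation to pick up only what failed):

```bash
# Suppose one batch of the run fails while the others succeed:
dbt run --select sessions

# Reprocesses only the failed batch, leaving successful batches untouched:
dbt retry
```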
![Partial retry](https://github.com/user-attachments/assets/f94c4797-dcc7-4875-9623-639f70c97b8f)
 
-### Timezones
+## Timezones
 
 For now, dbt assumes that all values supplied are in UTC:

From 6050b24acf26522ed2bc441b4ca43a40c0c78bd3 Mon Sep 17 00:00:00 2001
From: Mirna Wong <89008547+mirnawong1@users.noreply.github.com>
Date: Wed, 4 Dec 2024 12:29:02 +0000
Subject: [PATCH 6/6] Update website/docs/docs/build/incremental-microbatch.md

Co-authored-by: nataliefiann <120089939+nataliefiann@users.noreply.github.com>
---
 website/docs/docs/build/incremental-microbatch.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md
index 377710f48f..023e0f25d6 100644
--- a/website/docs/docs/build/incremental-microbatch.md
+++ b/website/docs/docs/build/incremental-microbatch.md
@@ -223,7 +223,7 @@ from {{ source('sales', 'transactions') }}
 ```
 
-- In this example, `unique_key` is required because `dbt-postgres`' microbatch uses the `merge` strategy, which needs a `unique_key` to identify which rows in the data warehouse need to get merged. Without a `unique_key`, dbt won't be able to match rows between the incoming batch and the existing table.
+- In this example, `unique_key` is required because the `dbt-postgres` implementation of microbatch uses the `merge` strategy, which needs a `unique_key` to identify which rows in the data warehouse need to get merged. Without a `unique_key`, dbt won't be able to match rows between the incoming batch and the existing table.