From 8cb3b7cb041291d19525d82d728109b42f9e4dfa Mon Sep 17 00:00:00 2001
From: mirnawong1
Date: Tue, 26 Nov 2024 13:11:31 +0000
Subject: [PATCH 1/5] bela's feedback

---
 .../docs/docs/build/incremental-microbatch.md | 29 ++++++++++++++-----
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md
index 30070834ff9..987b467d75f 100644
--- a/website/docs/docs/build/incremental-microbatch.md
+++ b/website/docs/docs/build/incremental-microbatch.md
@@ -8,7 +8,7 @@ id: "incremental-microbatch"
 
 :::info Microbatch
 
-The `microbatch` strategy is available in beta for [dbt Cloud Versionless](/docs/dbt-versions/upgrade-dbt-version-in-cloud#versionless) and dbt Core v1.9.
+The new `microbatch` strategy is available in beta for [dbt Cloud Versionless](/docs/dbt-versions/upgrade-dbt-version-in-cloud#versionless) and dbt Core v1.9.
 
 If you use a custom microbatch macro, set a [distinct behavior flag](/reference/global-configs/behavior-changes#custom-microbatch-strategy) in your `dbt_project.yml` to enable batched execution. If you don't have a custom microbatch macro, you don't need to set this flag as dbt will handle microbatching automatically for any model using the [microbatch strategy](#how-microbatch-compares-to-other-incremental-strategies).
 
@@ -22,17 +22,32 @@ Refer to [Supported incremental strategies by adapter](/docs/build/incremental-s
 
 Incremental models in dbt are a [materialization](/docs/build/materializations) designed to efficiently update your data warehouse tables by only transforming and loading _new or changed data_ since the last run. Instead of reprocessing an entire dataset every time, incremental models process a smaller number of rows, and then append, update, or replace those rows in the existing table. This can significantly reduce the time and resources required for your data transformations.
 
-Microbatch incremental models make it possible to process transformations on very large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or in specified backfills — it will split the processing into multiple queries (or "batches"), based on the [`event_time`](/reference/resource-configs/event-time) and `batch_size` you configure.
+Microbatch is a new incremental strategy designed for large time-series datasets:
+- It complements, rather than replaces, existing incremental strategies by focusing on efficiency and simplicity in batch processing.
+- Unlike traditional incremental strategies, microbatch doesn't require managing uniqueness constraints or implementing complex conditional logic for [backfilling](#backfills).
+- It relies solely on a time column ([`event_time`](/reference/resource-configs/event-time)) to handle data partitioning and filtering.
+- Note that microbatch might not be the best strategy for all use cases. Consider other strategies if you don't have a reliable `event_time` column or if you want more control over the incremental logic. Read more in [How `microbatch` compares to other incremental strategies](#how-microbatch-compares-to-other-incremental-strategies).
 
-Each "batch" corresponds to a single bounded time period (by default, a single day of data). Where other incremental strategies operate only on "old" and "new" data, microbatch models treat every batch as an atomic unit that can be built or replaced on its own. Each batch is independent and <Term id="idempotent" />. This is a powerful abstraction that makes it possible for dbt to run batches separately — in the future, concurrently — and to retry them independently.
+### How microbatch works
+
+When dbt runs a microbatch model — whether for the first time, during incremental runs, or in specified backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` and `batch_size` you configure.
+
+Each "batch" corresponds to a single bounded time period (by default, a single day of data). Where other incremental strategies operate only on "old" and "new" data, microbatch models treat every batch as an atomic unit that can be built or replaced on its own. Each batch is independent and <Term id="idempotent" />. This is a powerful abstraction that makes it possible for dbt to run batches [separately](#backfills) — in the future, concurrently — and to [retry](#retry) them independently.
 
 ### Example
 
-A `sessions` model aggregates and enriches data that comes from two other models.
-- `page_views` is a large, time-series table. It contains many rows, new records almost always arrive after existing ones, and existing records rarely update.
-- `customers` is a relatively small dimensional table. Customer attributes update often, and not in a time-based manner — that is, older customers are just as likely to change column values as newer customers.
+A `sessions` model aggregates and enriches data that comes from two other models:
+- `page_views` is a large, time-series table. It contains many rows, new records almost always arrive after existing ones, and existing records rarely update. It uses the `page_view_start` column as its `event_time`.
+- `customers` is a relatively small dimensional table. Customer attributes update often, and not in a time-based manner — that is, older customers are just as likely to change column values as newer customers. The `customers` model doesn't configure an `event_time` column.
+
+As a result:
 
-The `page_view_start` column in `page_views` is configured as that model's `event_time`. The `customers` model does not configure an `event_time`. Therefore, each batch of `sessions` will filter `page_views` to the equivalent time-bounded batch, and it will not filter `customers` (a full scan for every batch).
+- Each batch of `sessions` will filter `page_views` to the equivalent time-bounded batch.
+- The `customers` table isn't filtered, resulting in a full scan for every batch.
+
+:::tip
+In addition to configuring `event_time` for the target table, you can also specify it for any upstream models that you want to filter, even if they have different time columns.
+::: From 2eb0ff9f5f66e1973a0f6d9d8d23ebd0463b243d Mon Sep 17 00:00:00 2001 From: Mirna Wong <89008547+mirnawong1@users.noreply.github.com> Date: Tue, 26 Nov 2024 13:15:19 +0000 Subject: [PATCH 2/5] Update website/docs/docs/build/incremental-microbatch.md --- website/docs/docs/build/incremental-microbatch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md index 987b467d75f..1a545a2254f 100644 --- a/website/docs/docs/build/incremental-microbatch.md +++ b/website/docs/docs/build/incremental-microbatch.md @@ -22,7 +22,7 @@ Refer to [Supported incremental strategies by adapter](/docs/build/incremental-s Incremental models in dbt are a [materialization](/docs/build/materializations) designed to efficiently update your data warehouse tables by only transforming and loading _new or changed data_ since the last run. Instead of reprocessing an entire dataset every time, incremental models process a smaller number of rows, and then append, update, or replace those rows in the existing table. This can significantly reduce the time and resources required for your data transformations. -Microbatch is a new incremental strategy designed for large time-series datasets: +Microbatch is an incremental strategy designed for large time-series datasets: - It complements, rather than replaces, existing incremental strategies by focusing on efficiency and simplicity in batch processing. - Unlike traditional incremental strategies, microbatch doesn't require managing uniqueness constraints or implementing complex conditional logic for [backfilling](#backfills). - It relies solely on a time column ([`event_time`](/reference/resource-configs/event-time)) to handle data partitioning and filtering. From 75c44662091473b290b3915306ba6b576ac545e4 Mon Sep 17 00:00:00 2001 From: Mirna Wong <89008547+mirnawong1@users.noreply.github.com> Date: Tue, 26 Nov 2024 22:21:31 +0000 Subject: [PATCH 3/5] Update incremental-microbatch.md Co-authored-by: Grace Goheen <53586774+graciegoheen@users.noreply.github.com> --- website/docs/docs/build/incremental-microbatch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md index 1a545a2254f..e0dacb7da42 100644 --- a/website/docs/docs/build/incremental-microbatch.md +++ b/website/docs/docs/build/incremental-microbatch.md @@ -46,7 +46,7 @@ As a result: - The `customers` table isn't filtered, resulting in a full scan for every batch. :::tip -In addition to configuring `event_time` for the target table, you can also specify it for any upstream models that you want to filter, even if they have different time columns. +In addition to configuring `event_time` for the target table, you should also specify it for any upstream models that you want to filter, even if they have different time columns. 
 :::

From 147f080ad3926a92cd7bf16cee14a02ba3c79f9b Mon Sep 17 00:00:00 2001
From: Mirna Wong <89008547+mirnawong1@users.noreply.github.com>
Date: Tue, 26 Nov 2024 22:25:52 +0000
Subject: [PATCH 4/5] Update incremental-microbatch.md

Co-authored-by: Grace Goheen <53586774+graciegoheen@users.noreply.github.com>
---
 website/docs/docs/build/incremental-microbatch.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md
index e0dacb7da42..3525c43bb94 100644
--- a/website/docs/docs/build/incremental-microbatch.md
+++ b/website/docs/docs/build/incremental-microbatch.md
@@ -24,7 +24,7 @@ Incremental models in dbt are a [materialization](/docs/build/materializations)
 
 Microbatch is an incremental strategy designed for large time-series datasets:
 - It complements, rather than replaces, existing incremental strategies by focusing on efficiency and simplicity in batch processing.
-- Unlike traditional incremental strategies, microbatch doesn't require managing uniqueness constraints or implementing complex conditional logic for [backfilling](#backfills).
+- Unlike traditional incremental strategies, microbatch doesn't require implementing complex conditional logic for [backfilling](#backfills).
 - It relies solely on a time column ([`event_time`](/reference/resource-configs/event-time)) to handle data partitioning and filtering.
 - Note that microbatch might not be the best strategy for all use cases. Consider other strategies if you don't have a reliable `event_time` column or if you want more control over the incremental logic. Read more in [How `microbatch` compares to other incremental strategies](#how-microbatch-compares-to-other-incremental-strategies).

From 9c52827e9886e8e0a74dcf3a3afba0dddd7dd6b4 Mon Sep 17 00:00:00 2001
From: mirnawong1
Date: Wed, 27 Nov 2024 10:12:32 +0000
Subject: [PATCH 5/5] grace's feedback

---
 website/docs/docs/build/incremental-microbatch.md | 6 ++++--
 website/docs/docs/build/incremental-models.md     | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md
index 3525c43bb94..e6c8284cc4b 100644
--- a/website/docs/docs/build/incremental-microbatch.md
+++ b/website/docs/docs/build/incremental-microbatch.md
@@ -23,16 +23,18 @@ Refer to [Supported incremental strategies by adapter](/docs/build/incremental-s
 Incremental models in dbt are a [materialization](/docs/build/materializations) designed to efficiently update your data warehouse tables by only transforming and loading _new or changed data_ since the last run. Instead of reprocessing an entire dataset every time, incremental models process a smaller number of rows, and then append, update, or replace those rows in the existing table. This can significantly reduce the time and resources required for your data transformations.
 
 Microbatch is an incremental strategy designed for large time-series datasets:
+- It relies solely on a time column ([`event_time`](/reference/resource-configs/event-time)) to define time-based ranges for filtering. Set the `event_time` column for your microbatch model and its direct parents (upstream models). Note that this is different from `partition_by`, which groups rows into partitions.
 - It complements, rather than replaces, existing incremental strategies by focusing on efficiency and simplicity in batch processing.
 - Unlike traditional incremental strategies, microbatch doesn't require implementing complex conditional logic for [backfilling](#backfills).
-- It relies solely on a time column ([`event_time`](/reference/resource-configs/event-time)) to handle data partitioning and filtering.
 - Note that microbatch might not be the best strategy for all use cases. Consider other strategies if you don't have a reliable `event_time` column or if you want more control over the incremental logic. Read more in [How `microbatch` compares to other incremental strategies](#how-microbatch-compares-to-other-incremental-strategies).
 
 ### How microbatch works
 
 When dbt runs a microbatch model — whether for the first time, during incremental runs, or in specified backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` and `batch_size` you configure.
 
-Each "batch" corresponds to a single bounded time period (by default, a single day of data). Where other incremental strategies operate only on "old" and "new" data, microbatch models treat every batch as an atomic unit that can be built or replaced on its own. Each batch is independent and <Term id="idempotent" />. This is a powerful abstraction that makes it possible for dbt to run batches [separately](#backfills) — in the future, concurrently — and to [retry](#retry) them independently.
+Each "batch" corresponds to a single bounded time period (by default, a single day of data). Where other incremental strategies operate only on "old" and "new" data, microbatch models treat every batch as an atomic unit that can be built or replaced on its own. Each batch is independent and <Term id="idempotent" />.
+
+This is a powerful abstraction that makes it possible for dbt to run batches [separately](#backfills) and concurrently, and to [retry](#retry) them independently.
 
 ### Example
 
diff --git a/website/docs/docs/build/incremental-models.md b/website/docs/docs/build/incremental-models.md
index a56246addf3..d7b6ecd8f54 100644
--- a/website/docs/docs/build/incremental-models.md
+++ b/website/docs/docs/build/incremental-models.md
@@ -114,7 +114,7 @@ When you define a `unique_key`, you'll see this behavior for each row of "new" d
 Please note that if there's a unique_key with more than one row in either the existing target table or the new incremental rows, the incremental model may fail depending on your database and [incremental strategy](/docs/build/incremental-strategy). If you're having issues running an incremental model, it's a good idea to double check that the unique key is truly unique in both your existing database table and your new incremental rows. You can [learn more about surrogate keys here](https://www.getdbt.com/blog/guide-to-surrogate-key).
 
 :::info
-While common incremental strategies, such as`delete+insert` + `merge`, might use `unique_key`, others don't. For example, the `insert_overwrite` strategy does not use `unique_key`, because it operates on partitions of data rather than individual rows. For more information, see [About incremental_strategy](/docs/build/incremental-strategy).
+While common incremental strategies, such as `delete+insert` and `merge`, might use `unique_key`, others don't. For example, the `insert_overwrite` strategy does not use `unique_key`, because it operates on partitions of data rather than individual rows. For more information, see [About incremental_strategy](/docs/build/incremental-strategy).
 :::
 
 #### `unique_key` example
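As a companion to the `sessions`/`page_views` example these patches build up, here is a minimal sketch of what the resulting microbatch model could look like. It is illustrative only: the model body, column names, and the `begin` and `lookback` values are assumptions, not code from the docs page or this PR.

```sql
-- models/sessions.sql: a minimal microbatch sketch (names and values are assumptions).
-- event_time: the column dbt uses to bound each batch's time window.
-- batch_size: each batch covers one day of data.
-- begin: the earliest date processed when the model is built from scratch.
-- lookback: also reprocess this many of the most recent batches on each run.
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='page_view_start',
        batch_size='day',
        begin='2024-01-01',
        lookback=3
    )
}}

select
    page_views.page_view_id,
    page_views.page_view_start,
    customers.customer_name
-- page_views configures an event_time, so dbt filters it to each batch's window
from {{ ref('page_views') }} as page_views
-- customers configures no event_time, so every batch scans the whole table
left join {{ ref('customers') }} as customers
    on page_views.customer_id = customers.customer_id
```

With this in place, a targeted backfill is just a bounded run, for example `dbt run --select sessions --event-time-start "2024-09-01" --event-time-end "2024-09-04"`, with no custom `is_incremental()` logic required.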
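The tip about upstream models corresponds to setting `event_time` in those models' configs, for example in a properties file. A sketch, where the file location and model name are assumptions:

```yaml
# models/staging/properties.yml (illustrative)
models:
  - name: page_views
    config:
      event_time: page_view_start  # lets dbt filter this input to each batch's time window
```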