Skip to content

Commit

Permalink
Merge branch 'current' into update-simple-metrics
Browse files Browse the repository at this point in the history
  • Loading branch information
mirnawong1 authored Dec 9, 2024
2 parents 8f6cd6a + 66509e8 commit 6c6c31b
Show file tree
Hide file tree
Showing 8 changed files with 244 additions and 30 deletions.
7 changes: 2 additions & 5 deletions website/blog/2021-11-23-how-to-upgrade-dbt-versions.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,17 +12,14 @@ date: 2021-11-29
is_featured: true
---

import Latest from '/snippets/_release-stages-from-versionless.md'

<Latest/>

:::tip February 2024 Update

It's been a few years since dbt-core turned 1.0! Since then, we've committed to releasing zero breaking changes whenever possible and it's become much easier to upgrade dbt Core versions.

In 2024, we're taking this promise further by:

- Stabilizing interfaces for everyone — adapter maintainers, metadata consumers, and (of course) people writing dbt code everywhere — as discussed in [our November 2023 roadmap update](https://github.com/dbt-labs/dbt-core/blob/main/docs/roadmap/2023-11-dbt-tng.md).
- Introducing **Latest** release track in dbt Cloud. No more manual upgrades and no need for _a second sandbox project_ just to try out new features in development. For more details, refer to [Upgrade Core version in Cloud](/docs/dbt-versions/upgrade-dbt-version-in-cloud).
- Introducing [Release tracks](/docs/dbt-versions/cloud-release-tracks) (formerly known as Versionless) to dbt Cloud. No more manual upgrades and no need for _a second sandbox project_ just to try out new features in development. For more details, refer to [Upgrade Core version in Cloud](/docs/dbt-versions/upgrade-dbt-version-in-cloud).

We're leaving the rest of this post as is, so we can all remember how it used to be. Enjoy a stroll down memory lane.

Expand Down
6 changes: 1 addition & 5 deletions website/blog/2024-04-22-extended-attributes.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,6 @@ date: 2024-04-22
is_featured: true
---

import Latest from '/snippets/_release-stages-from-versionless.md'

<Latest/>

dbt Cloud now includes a suite of new features that enable configuring precise and unique connections to data platforms at the environment and user level. These enable more sophisticated setups, like connecting a project to multiple warehouse accounts, first-class support for [staging environments](/docs/deploy/deploy-environments#staging-environment), and user-level [overrides for specific dbt versions](/docs/dbt-versions/upgrade-dbt-version-in-cloud#override-dbt-version). This gives dbt Cloud developers the features they need to tackle more complex tasks, like Write-Audit-Publish (WAP) workflows and safely testing dbt version upgrades. While you still configure a default connection at the project level and per-developer, you now have tools to get more advanced in a secure way. Soon, dbt Cloud will take this even further allowing multiple connections to be set globally and reused with _global connections_.

<!--truncate-->
Expand Down Expand Up @@ -84,7 +80,7 @@ All you need to do is configure an environment as staging and enable the **Defer

## Upgrading on a curve

Lastly, let’s consider a more specialized use case. Imagine we have a "tiger team" (consisting of a lone analytics engineer named Dave) tasked with upgrading from dbt version 1.6 to the new **Latest release track**, to take advantage of new features and performance improvements. We want to keep the rest of the data team being productive in dbt 1.6 for the time being, while enabling Dave to upgrade and do his work with Latest (and greatest) dbt.
Lastly, let’s consider a more specialized use case. Imagine we have a "tiger team" (consisting of a lone analytics engineer named Dave) tasked with upgrading from dbt version 1.6 to the new **[Latest release track](/docs/dbt-versions/cloud-release-tracks)**, to take advantage of new features and performance improvements. We want to keep the rest of the data team being productive in dbt 1.6 for the time being, while enabling Dave to upgrade and do his work with Latest (and greatest) dbt.

### Development environment

Expand Down
18 changes: 7 additions & 11 deletions website/blog/2024-06-12-putting-your-dag-on-the-internet.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,11 +12,7 @@ date: 2024-06-14
is_featured: true
---

import Latest from '/snippets/_release-stages-from-versionless.md'

<Latest/>

**New in dbt: allow Snowflake Python models to access the internet**
## New in dbt: allow Snowflake Python models to access the internet

With dbt 1.8, dbt released support for Snowflake’s [external access integrations](https://docs.snowflake.com/en/developer-guide/external-network-access/external-network-access-overview) further enabling the use of dbt + AI to enrich your data. This allows querying of external APIs within dbt Python models, a functionality that was required for dbt Cloud customer, [EQT AB](https://eqtgroup.com/). Learn about why they needed it and how they helped build the feature and get it shipped!

Expand Down Expand Up @@ -49,7 +45,7 @@ This API is open and if it requires an API key, handle it similarly to managing
For simplicity’s sake, we will show how to create them using [pre-hooks](/reference/resource-configs/pre-hook-post-hook) in a model configuration yml file:


```
```yml
models:
- name: external_access_sample
config:
Expand All @@ -61,7 +57,7 @@ models:
Then we can simply use the new external_access_integrations configuration parameter to use our network rule within a Python model (called external_access_sample.py):
```
```python
import snowflake.snowpark as snowpark
def model(dbt, session: snowpark.Session):
dbt.config(
Expand All @@ -79,7 +75,7 @@ def model(dbt, session: snowpark.Session):
The result is a model with some json I can parse, for example, in a SQL model to extract some information:


```
```sql
{{
config(
materialized='incremental',
Expand Down Expand Up @@ -112,12 +108,12 @@ The result is a model that will keep track of dbt invocations, and the current U

This is a very new area to Snowflake and dbt -- something special about SQL and dbt is that it’s very resistant to external entropy. The second we rely on API calls, Python packages and other external dependencies, we open up to a lot more external entropy. APIs will change, break, and your models could fail.

Traditionally dbt is the T in ELT (dbt overview [here](https://docs.getdbt.com/terms/elt)), and this functionality unlocks brand new EL capabilities for which best practices do not yet exist. What’s clear is that EL workloads should be separated from T workloads, perhaps in a different modeling layer. Note also that unless using incremental models, your historical data can easily be deleted. dbt has seen a lot of use cases for this, including this AI example as outlined in this external [engineering blog post](https://klimmy.hashnode.dev/enhancing-your-dbt-project-with-large-language-models).
Traditionally dbt is the T in ELT (dbt overview [here](https://docs.getdbt.com/terms/elt)), and this functionality unlocks brand new EL capabilities for which best practices do not yet exist. What’s clear is that EL workloads should be separated from T workloads, perhaps in a different modeling layer. Note also that unless using incremental models, your historical data can easily be deleted. dbt has seen a lot of use cases for this, including this AI example as outlined in this external [engineering blog post](https://klimmy.hashnode.dev/enhancing-your-dbt-project-with-large-language-models).

**A few words about the power of Commercial Open Source Software**
## A few words about the power of Commercial Open Source Software

In order to get this functionality shipped quickly, EQT opened a pull request, Snowflake helped with some problems we had with CI and a member of dbt Labs helped write the tests and merge the code in!

dbt now features this functionality in dbt 1.8+ and the "Latest" release track in dbt Cloud (dbt overview [here](/docs/dbt-versions/cloud-release-tracks)).
dbt now features this functionality in dbt 1.8+ and all [Release tracks](/docs/dbt-versions/cloud-release-tracks) in dbt Cloud.

dbt Labs staff and community members would love to chat more about it in the [#db-snowflake](https://getdbt.slack.com/archives/CJN7XRF1B) slack channel.
140 changes: 132 additions & 8 deletions website/docs/docs/build/incremental-microbatch.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,8 @@ Incremental models in dbt are a [materialization](/docs/build/materializations)
Microbatch is an incremental strategy designed for large time-series datasets:
- It relies solely on a time column ([`event_time`](/reference/resource-configs/event-time)) to define time-based ranges for filtering. Set the `event_time` column for your microbatch model and its direct parents (upstream models). Note, this is different to `partition_by`, which groups rows into partitions.
- It complements, rather than replaces, existing incremental strategies by focusing on efficiency and simplicity in batch processing.
- Unlike traditional incremental strategies, microbatch doesn't require implementing complex conditional logic for [backfilling](#backfills).
- Unlike traditional incremental strategies, microbatch enables you to [reprocess failed batches](/docs/build/incremental-microbatch#retry), auto-detect [parallel batch execution](#parallel-batch-execution), and eliminate the need to implement complex conditional logic for [backfilling](#backfills).

- Note, microbatch might not be the best strategy for all use cases. Consider other strategies for use cases such as not having a reliable `event_time` column or if you want more control over the incremental logic. Read more in [How `microbatch` compares to other incremental strategies](#how-microbatch-compares-to-other-incremental-strategies).

### How microbatch works
Expand Down Expand Up @@ -179,12 +180,15 @@ It does not matter whether the table already contains data for that day. Given t

Several configurations are relevant to microbatch models, and some are required:

| Config | Type | Description | Default |
|----------|------|---------------|---------|
| [`event_time`](/reference/resource-configs/event-time) | Column (required) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A |
| [`begin`](/reference/resource-configs/begin) | Date (required) | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A |
| [`batch_size`](/reference/resource-configs/batch-size) | String (required) | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year` | N/A |
| [`lookback`](/reference/resource-configs/lookback) | Integer (optional) | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` |

| Config | Description | Default | Type | Required |
|----------|---------------|---------|------|---------|
| [`event_time`](/reference/resource-configs/event-time) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A | Column | Required |
| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year` | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
| `concurrent_batches` | An override for whether batches run concurrently (at the same time) or sequentially (one after the other). | `None` | Boolean | Optional |


<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/>

Expand Down Expand Up @@ -280,7 +284,127 @@ For now, dbt assumes that all values supplied are in UTC:

While we may consider adding support for custom time zones in the future, we also believe that defining these values in UTC makes everyone's lives easier.

## How `microbatch` compares to other incremental strategies?
## Parallel batch execution

The microbatch strategy offers the benefit of updating a model in smaller, more manageable batches. Depending on your use case, configuring your microbatch models to run in parallel offers faster processing, in comparison to running batches sequentially.

Parallel batch execution means that multiple batches are processed at the same time, instead of one after the other (sequentially) for faster processing of your microbatch models.

dbt automatically detects whether a batch can be run in parallel in most cases, which means you don’t need to configure this setting. However, the `concurrent_batches` config is available as an override (not a gate), allowing you to specify whether batches should or shouldn’t be run in parallel in specific cases.

For example, if you have a microbatch model with 12 batches, you can execute those batches to run in parallel. Specifically they'll run in parallel limited by the number of [available threads](/docs/running-a-dbt-project/using-threads).

### Prerequisites

To enable parallel execution, you must:

- Use a supported adapter:
- Snowflake
- Databricks
- More adapters coming soon!
- We'll be continuing to test and add concurrency support for adapters. This means that some adapters might get concurrency support _after_ the 1.9 initial release.

- Meet [additional conditions](#how-parallel-batch-execution-works) described in the following section.

### How parallel batch execution works

A batch can only run in parallel if all of these conditions are met:

| Condition | Parallel execution | Sequential execution|
| ---------------| :------------------: | :----------: |
| **Not** the first batch | ✅ | - |
| **Not** the last batch | ✅ | - |
| [Adapter supports](#prerequisites) parallel batches | ✅ | - |


After checking for the conditions in the previous table &mdash; and if `concurrent_batches` value isn't set, dbt will intelligently auto-detect if the model invokes the [`{{ this }}`](/reference/dbt-jinja-functions/this) Jinja function. If it references `{{ this }}`, the batches will run sequentially since `{{ this }}` represents the database of the current model and referencing the same relation causes conflict.

Otherwise, if `{{ this }}` isn't detected (and other conditions are met), the batches will run in parallel, which can be overriden when you [set a value for `concurrent_batches`](/reference/resource-properties/concurrent_batches).

### Parallel or sequential execution

Choosing between parallel batch execution and sequential processing depends on the specific requirements of your use case.

- Parallel batch execution is faster but requires logic independent of batch execution order. For example, if you're developing a data pipeline for a system that processes user transactions in batches, each batch is executed in parallel for better performance. However, the logic used to process each transaction shouldn't depend on the order of how batches are executed or completed.
- Sequential processing is slower but essential for calculations like [cumulative metrics](/docs/build/cumulative) in microbatch models. It processes data in the correct order, allowing each step to build on the previous one.

<!-- You can override the check for `this` by setting `concurrent_batches` to either `True` or `False`. If set to `False`, the batch will be run sequentially. If set to `True` the batch will be run in parallel (assuming [1], [2], and [3])
To override the `this` check, use the `concurrent_batches` configuration:


<File name='dbt_project.yml'>

```yaml
models:
+concurrent_batches: True
```

</File>

or:

<File name='models/my_model.sql'>

```sql
{{
config(
materialized='incremental',
concurrent_batches=True,
incremental_strategy='microbatch'
...
)
}}
select ...
```

</File>
-->

### Configure `concurrent_batches`

By default, dbt auto-detects whether batches can run in parallel for microbatch models, and this works correctly in most cases. However, you can override dbt's detection by setting the [`concurrent_batches` config](/reference/resource-properties/concurrent_batches) in your `dbt_project.yml` or model `.sql` file to specify parallel or sequential execution, given you meet all the [conditions](#prerequisites):

<Tabs>
<TabItem value="yaml" label="dbt_project.yml">

<File name='dbt_project.yml'>

```yaml
models:
+concurrent_batches: true # value set to true to run batches in parallel
```

</File>
</TabItem>

<TabItem value="sql" label="my_model.sql">

<File name='models/my_model.sql'>

```sql
{{
config(
materialized='incremental',
incremental_strategy='microbatch',
event_time='session_start',
begin='2020-01-01',
batch_size='day
concurrent_batches=true, # value set to true to run batches in parallel
...
)
}}
select ...
```
</File>
</TabItem>
</Tabs>

## How microbatch compares to other incremental strategies

As data warehouses roll out new operations for concurrently replacing/upserting data partitions, we may find that the new operation for the data warehouse is more efficient than what the adapter uses for microbatch. In such instances, we reserve the right the update the default operation for microbatch, so long as it works as intended/documented for models that fit the microbatch paradigm.

Most incremental models rely on the end user (you) to explicitly tell dbt what "new" means, in the context of each model, by writing a filter in an `{% if is_incremental() %}` conditional block. You are responsible for crafting this SQL in a way that queries [`{{ this }}`](/reference/dbt-jinja-functions/this) to check when the most recent record was last loaded, with an optional look-back window for late-arriving records.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,8 @@ Starting in Core 1.9, you can use the new [microbatch strategy](/docs/build/incr
- Simplified query design: Write your model query for a single batch of data. dbt will use your `event_time``lookback`, and `batch_size` configurations to automatically generate the necessary filters for you, making the process more streamlined and reducing the need for you to manage these details.
- Independent batch processing: dbt automatically breaks down the data to load into smaller batches based on the specified `batch_size` and processes each batch independently, improving efficiency and reducing the risk of query timeouts. If some of your batches fail, you can use `dbt retry` to load only the failed batches.
- Targeted reprocessing: To load a *specific* batch or batches, you can use the CLI arguments `--event-time-start` and `--event-time-end`.
- [Automatic parallel batch execution](/docs/build/incremental-microbatch#parallel-batch-execution): Process multiple batches at the same time, instead of one after the other (sequentially) for faster processing of your microbatch models. dbt intelligently auto-detects if your batches can run in parallel, while also allowing you to manually override parallel execution with the `concurrent_batches` config.


Currently microbatch is supported on these adapters with more to come:
* postgres
Expand Down
10 changes: 9 additions & 1 deletion website/docs/reference/dbt-jinja-functions/config.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,13 +34,21 @@ __Args__:

The `config.get` function is used to get configurations for a model from the end-user. Configs defined in this way are optional, and a default value can be provided.

There are 3 cases:
1. The configuration variable exists, it is not `None`
1. The configuration variable exists, it is `None`
1. The configuration variable does not exist

Example usage:
```sql
{% materialization incremental, default -%}
-- Example w/ no default. unique_key will be None if the user does not provide this configuration
{%- set unique_key = config.get('unique_key') -%}

-- Example w/ default value. Default to 'id' if 'unique_key' not provided
-- Example w/ alternate value. Use alternative of 'id' if 'unique_key' config is provided, but it is None
{%- set unique_key = config.get('unique_key') or 'id' -%}

-- Example w/ default value. Default to 'id' if the 'unique_key' config does not exist
{%- set unique_key = config.get('unique_key', default='id') -%}
...
```
Expand Down
Loading

0 comments on commit 6c6c31b

Please sign in to comment.