Merge branch 'current' into explorer-udpate-preview
mirnawong1 authored Oct 7, 2024
2 parents 017238e + bd73e21 commit 7f0527e
Showing 17 changed files with 540 additions and 28 deletions.
@@ -102,12 +102,14 @@ We’ve focused heavily thus far on the primary area of action in our dbt projec

### Project splitting

One important, growing consideration in the analytics engineering ecosystem is how and when to split a codebase into multiple dbt projects. Our present stance on this for most projects, particularly for teams starting out, is straightforward: you should avoid it unless you have no other option or it saves you from an even more complex workaround. If you do have the need to split up your project, it’s completely possible through the use of private packages, but the added complexity and separation is, for most organizations, a hindrance, not a help, at present. That said, this is very likely subject to change! [We want to create a world where it’s easy to bring lots of dbt projects together into a cohesive lineage](https://github.com/dbt-labs/dbt-core/discussions/5244). In a world where it’s simple to break up monolithic dbt projects into multiple connected projects, perhaps inside of a modern mono repo, the calculus will be different, and the below situations we recommend against may become totally viable. So watch this space!
One important, growing consideration in the analytics engineering ecosystem is how and when to split a codebase into multiple dbt projects. Currently, our advice for most teams, especially those just starting, is fairly simple: in most cases, we recommend doing so with [dbt Mesh](/best-practices/how-we-mesh/mesh-1-intro)! dbt Mesh allows organizations to handle complexity by connecting several dbt projects rather than relying on one big, monolithic project. This approach is designed to speed up development while maintaining governance.

- ❌ **Business groups or departments.** Conceptual separations within the project are not a good reason to split up your project. Splitting up, for instance, marketing and finance modeling into separate projects will not only add unnecessary complexity but destroy the unifying effect of collaborating across your organization on cohesive definitions and business logic.
- ❌ **ML vs Reporting use cases.** Similarly to the point above, splitting a project up based on different use cases, particularly more standard BI versus ML features, is a common idea. We tend to discourage it for the time being. As with the previous point, a foundational goal of implementing dbt is to create a single source of truth in your organization. The features you’re providing to your data science teams should be coming from the same marts and metrics that serve reports on executive dashboards.
As breaking up monolithic dbt projects into smaller, connected projects (potentially within a modern mono repo) becomes easier, the scenarios we currently advise against may soon become feasible. So watch this space!

- ✅ **Business groups or departments.** Conceptual separations within the project are the primary reason to split up your project. This allows your business domains to own their own data products and still collaborate using dbt Mesh. For more information about dbt Mesh, please refer to our [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-5-faqs).
- ✅ **Data governance.** Structural, organizational needs — such as data governance and security — are one of the few worthwhile reasons to split up a project. If, for instance, you work at a healthcare company with only a small team cleared to access raw data with PII in it, you may need to split out your staging models into their own projects to preserve those policies. In that case, you would import your staging project into the project that builds on those staging models as a [private package](https://docs.getdbt.com/docs/build/packages/#private-packages) (see the packages.yml sketch after this list).
- ✅ **Project size.** At a certain point, your project may grow to have simply too many models to present a viable development experience. If you have 1000s of models, it absolutely makes sense to find a way to split up your project.
- ❌ **ML vs Reporting use cases.** Similarly to the point above, splitting a project up based on different use cases, particularly more standard BI versus ML features, is a common idea. We tend to discourage it for the time being. As with the previous point, a foundational goal of implementing dbt is to create a single source of truth in your organization. The features you’re providing to your data science teams should be coming from the same marts and metrics that serve reports on executive dashboards.
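
To illustrate the private package approach mentioned in the data governance bullet, here is a minimal `packages.yml` sketch for the downstream project; the repository URL and revision are illustrative placeholders, not real values:

```yaml
# packages.yml in the downstream project -- imports the access-restricted staging project
# (the repository URL and revision below are illustrative placeholders)
packages:
  - git: "https://github.com/your-org/staging-project.git"
    revision: "1.0.0"
```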

## Final considerations

2 changes: 1 addition & 1 deletion website/docs/docs/build/incremental-microbatch.md
@@ -28,7 +28,7 @@ A `sessions` model is aggregating and enriching data that comes from two other m
- `page_views` is a large, time-series table. It contains many rows, new records almost always arrive after existing ones, and existing records rarely update.
- `customers` is a relatively small dimensional table. Customer attributes update often, and not in a time-based manner — that is, older customers are just as likely to change column values as newer customers.

The `page_view_start` column in `page_views` is configured as that model's `event_time`. The `customers` model does not configure an `event_time`. Therefore, each batch of `sessions` will filter `page_views` to the equivalent time-bounded batch, and it will not filter `sessions` (a full scan for every batch).
The `page_view_start` column in `page_views` is configured as that model's `event_time`. The `customers` model does not configure an `event_time`. Therefore, each batch of `sessions` will filter `page_views` to the equivalent time-bounded batch, and it will not filter `customers` (a full scan for every batch).
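
As a rough sketch of the configuration described above (only `page_view_start` comes from the example itself; the `session_start` column name and the `begin` date are illustrative assumptions):

```yaml
# a sketch of the event_time setup described above
models:
  - name: page_views
    config:
      event_time: page_view_start   # time-series input: each batch filters on this column
  - name: customers
    # no event_time configured, so every sessions batch scans the full customers table
  - name: sessions
    config:
      materialized: incremental
      incremental_strategy: microbatch
      event_time: session_start     # illustrative column name
      batch_size: day
      begin: "2024-01-01"           # illustrative start date for the earliest batch
```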

We run the `sessions` model on October 1, 2024, and then again on October 2. It produces the following queries:

32 changes: 20 additions & 12 deletions website/docs/docs/build/snapshots.md
@@ -52,20 +52,25 @@ It is not possible to "preview data" or "compile sql" for snapshots in dbt Cloud

<VersionBlock firstVersion="1.9">

In dbt Cloud Versionless and dbt Core v1.9 and later, snapshots are configurations defined in YAML files (typically in your snapshots directory). You'll configure your snapshot to tell dbt how to detect record changes.
Configure your snapshots in YAML files to tell dbt how to detect record changes. Define snapshot configurations alongside your models for a cleaner, faster, and more consistent setup.

<File name='snapshots/orders_snapshot.yml'>

```yaml
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
  - name: string
    relation: relation # source('my_source', 'my_table') or ref('my_model')
    config:
      schema: snapshots
      database: analytics
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      [database](/reference/resource-configs/database): string
      [schema](/reference/resource-configs/schema): string
      [alias](/reference/resource-configs/alias): string
      [strategy](/reference/resource-configs/strategy): timestamp | check
      [unique_key](/reference/resource-configs/unique_key): column_name_or_expression
      [check_cols](/reference/resource-configs/check_cols): [column_name] | all
      [updated_at](/reference/resource-configs/updated_at): column_name
      [invalidate_hard_deletes](/reference/resource-configs/invalidate_hard_deletes): true | false
      [snapshot_meta_column_names](/reference/resource-configs/snapshot_meta_column_names): dictionary

```

</File>
@@ -82,6 +87,7 @@ The following table outlines the configurations available for snapshots:
| [check_cols](/reference/resource-configs/check_cols) | If using the `check` strategy, then the columns to check | Only if using the `check` strategy | ["status"] |
| [updated_at](/reference/resource-configs/updated_at) | If using the `timestamp` strategy, the timestamp column to compare | Only if using the `timestamp` strategy | updated_at |
| [invalidate_hard_deletes](/reference/resource-configs/invalidate_hard_deletes) | Find hard deleted records in source and set `dbt_valid_to` to current time if the record no longer exists | No | True |
| [snapshot_meta_column_names](/reference/resource-configs/snapshot_meta_column_names) | Customize the names of the snapshot meta fields | No | dictionary |

- In versions prior to v1.9, the `target_schema` (required) and `target_database` (optional) configurations defined a single schema or database to build a snapshot across users and environments. This created problems when testing or developing a snapshot, as there was no clear separation between development and production environments. In v1.9, `target_schema` became optional, allowing snapshots to be environment-aware. By default, without `target_schema` or `target_database` defined, snapshots now use the `generate_schema_name` or `generate_database_name` macros to determine where to build. Developers can still set a custom location with [`schema`](/reference/resource-configs/schema) and [`database`](/reference/resource-configs/database) configs, consistent with other resource types.
- A number of other configurations are also supported (for example, `tags` and `post-hook`). For the complete list, refer to [Snapshot configurations](/reference/snapshot-configs).
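
Putting the spec and table above together, a filled-in snapshot definition might look like the following sketch; it reuses the `orders_snapshot` source and column names from the earlier example:

```yaml
# snapshots/orders_snapshot.yml -- a concrete sketch of the spec above
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
```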
@@ -160,7 +166,7 @@ To add a snapshot to your project follow these steps. For users on versions 1.8

### Configuration best practices

<Expandable alt_header="Use thetimestamp strategy where possible">
<Expandable alt_header="Use the timestamp strategy where possible">

This strategy handles column additions and deletions better than the `check` strategy.

@@ -188,9 +194,9 @@ Snapshots can't be rebuilt. Because of this, it's a good idea to put snapshots i

</Expandable>

<Expandable alt_header="Use ephemeral model to clean or tranform data before snapshotting">
<Expandable alt_header="Use ephemeral model to clean or transform data before snapshotting">

If you need to clean or transform your data before snapshotting, create an ephemeral model (or a staging model) that applies the necessary transformations. Then, reference this model in your snapshot configuration. This approach keeps your snapshot definitions clean and allows you to test and run transformations separately.
If you need to clean or transform your data before snapshotting, create an ephemeral model or a staging model that applies the necessary transformations. Then, reference this model in your snapshot configuration. This approach keeps your snapshot definitions clean and allows you to test and run transformations separately.
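
A minimal sketch of that pattern, assuming a hypothetical `stg_orders` staging model materialized as ephemeral:

```yaml
# models/staging/staging.yml -- hypothetical cleaned model, never persisted to the warehouse
models:
  - name: stg_orders
    config:
      materialized: ephemeral

# snapshots/orders_snapshot.yml -- the snapshot references the cleaned model instead of the raw source
snapshots:
  - name: orders_snapshot
    relation: ref('stg_orders')
    config:
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
```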

</Expandable>
</VersionBlock>
@@ -203,6 +209,8 @@ When you run the [`dbt snapshot` command](/reference/commands/snapshot):
- The `dbt_valid_to` column will be updated for any existing records that have changed
- The updated record and any new records will be inserted into the snapshot table. These records will now have `dbt_valid_to = null`

Note that these column names can be customized to your team or organizational conventions using the [snapshot_meta_column_names](#snapshot-meta-fields) config.
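
For example, a sketch of renaming the meta fields to a team convention; the custom names on the right are illustrative, and the dictionary keys are assumed to mirror the default meta column names:

```yaml
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
    config:
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      snapshot_meta_column_names:     # keys assumed to mirror the default column names
        dbt_valid_from: valid_from
        dbt_valid_to: valid_to
        dbt_scd_id: scd_id
        dbt_updated_at: last_updated_at
```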

Snapshots can be referenced in downstream models the same way as referencing models — by using the [ref](/reference/dbt-jinja-functions/ref) function.

## Detecting row changes
29 changes: 29 additions & 0 deletions website/docs/docs/cloud/connect-data-platform/connect-teradata.md
@@ -0,0 +1,29 @@
---
title: "Connect Teradata"
id: connect-teradata
description: "Configure the Teradata platform connection in dbt Cloud."
sidebar_label: "Connect Teradata"
---

# Connect Teradata <Lifecycle status="preview" />

Your environment(s) must be on ["Versionless"](/docs/dbt-versions/versionless-cloud) to use the Teradata connection.

| Field | Description | Type | Required? | Example |
| ----------------------------- | --------------------------------------------------------------------------------------------- | -------------- | --------- | ------- |
| Host | Host name of your Teradata environment. | String | Required | host-name.env.clearscape.teradata.com |
| Port | The database port number. Equivalent to the Teradata JDBC Driver DBS_PORT connection parameter.| Quoted integer | Optional | 1025 |
| Retries | Number of times to retry connecting to the database upon error. | Integer | Optional | 10 |
| Request timeout | The waiting period between connection attempts in seconds. Default is "1" second. | Quoted integer | Optional | 3 |

<Lightbox src="/img/docs/dbt-cloud/teradata-connection.png" title="Example of the Teradata connection fields." />

### Development and deployment credentials

| Field | Description | Type | Required? | Example |
| ------------------------------|-----------------------------------------------------------------------------------------------|----------------|-----------|--------------------|
| Username | The database username. Equivalent to the Teradata JDBC Driver USER connection parameter. | String | Required | database_username |
| Password | The database password. Equivalent to the Teradata JDBC Driver PASSWORD connection parameter. | String | Required | DatabasePassword123 |
| Schema | Specifies the initial database to use after login, rather than the user's default database. | String | Required | dbtlabsdocstest |

<Lightbox src="/img/docs/dbt-cloud/teradata-deployment.png" title="Example of the developer credential fields." />
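
For reference, dbt Core users connect through the dbt-teradata adapter's `profiles.yml` rather than the dbt Cloud fields above. The sketch below reuses the example values from the tables; treat the field names as assumptions and confirm them against the adapter documentation:

```yaml
# profiles.yml -- a sketch only; verify field names against the dbt-teradata adapter docs
my_teradata_project:
  target: dev
  outputs:
    dev:
      type: teradata
      host: host-name.env.clearscape.teradata.com
      user: database_username
      password: DatabasePassword123
      schema: dbtlabsdocstest
      port: 1025
      threads: 4
```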
@@ -42,7 +42,7 @@ Historically, managing incremental models involved several manual steps and resp

While this works for many use-cases, there’s a clear limitation with this approach: *Some datasets are just too big to fit into one query.*

Starting in Core 1.9, you can use the new microbatch strategy to optimize your largest datasets -- **process your event data in discrete periods with their own SQL queries, rather than all at once.** The benefits include:
Starting in Core 1.9, you can use the new [microbatch strategy](/docs/build/incremental-microbatch#what-is-microbatch-in-dbt) to optimize your largest datasets -- **process your event data in discrete periods with their own SQL queries, rather than all at once.** The benefits include:

- Simplified query design: Write your model query for a single batch of data. dbt will use your `event_time`, `lookback`, and `batch_size` configurations to automatically generate the necessary filters for you, making the process more streamlined and reducing the need for you to manage these details.
- Independent batch processing: dbt automatically breaks down the data to load into smaller batches based on the specified `batch_size` and processes each batch independently, improving efficiency and reducing the risk of query timeouts. If some of your batches fail, you can use `dbt retry` to load only the failed batches.
2 changes: 1 addition & 1 deletion website/docs/docs/dbt-versions/release-notes.md
@@ -26,7 +26,7 @@ Release notes are grouped by month for both multi-tenant and virtual private clo
- **New**: In dbt Cloud Versionless, [Snapshots](/docs/build/snapshots) have been updated to use YAML configuration files instead of SQL snapshot blocks. This new feature simplifies snapshot management and improves performance, and will soon be released in dbt Core 1.9.
- Who does this affect? New users on Versionless can define snapshots using the new YAML specification. Users upgrading to Versionless who use snapshots can keep their existing configuration or can choose to migrate their snapshot definitions to YAML.
- Users on dbt 1.8 and earlier: No action is needed; existing snapshots will continue to work as before. However, we recommend upgrading to Versionless to take advantage of the new snapshot features.
- **Behavior change:** Set [`state_modified_compare_more_unrendered`](/reference/global-configs/behavior-changes#source-definitions-for-state) to true to reduce false positives for `state:modified` when configs differ between `dev` and `prod` environments.
- **Behavior change:** Set [`state_modified_compare_more_unrendered_values`](/reference/global-configs/behavior-changes#source-definitions-for-state) to true to reduce false positives for `state:modified` when configs differ between `dev` and `prod` environments.
- **Behavior change:** Set the [`skip_nodes_if_on_run_start_fails`](/reference/global-configs/behavior-changes#failures-in-on-run-start-hooks) flag to `True` to skip all selected resources from running if there is a failure on an `on-run-start` hook (see the flags sketch after this list).
- **Enhancement**: In dbt Cloud Versionless, snapshots defined in SQL files can now use `config` defined in `schema.yml` YAML files. This update resolves the previous limitation that required snapshot properties to be defined exclusively in `dbt_project.yml` and/or a `config()` block within the SQL file. This will also be released in dbt Core 1.9.
- **New**: In dbt Cloud Versionless, the `snapshot_meta_column_names` config allows for customizing the snapshot metadata columns. This feature allows an organization to align these automatically-generated column names with their conventions, and will be included in the upcoming dbt Core 1.9 release.
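
The two behavior-change flags noted above are project-level settings. A minimal sketch of opting in, assuming the standard `flags:` block in `dbt_project.yml`:

```yaml
# dbt_project.yml -- a sketch; both flags stay off until you opt in
flags:
  state_modified_compare_more_unrendered_values: true
  skip_nodes_if_on_run_start_fails: true
```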