
Commit

Merge branch 'current' into hovaesco/update-trino
mirnawong1 authored Dec 21, 2023
2 parents 61b080e + 9566806 commit 8b8328c
Showing 13 changed files with 216 additions and 26 deletions.
99 changes: 99 additions & 0 deletions website/blog/2023-12-20-partner-integration-guide.md
@@ -0,0 +1,99 @@
---
title: "How to integrate with dbt"
description: "This guide will cover the ways to integrate with dbt Cloud"
slug: integrating-with-dbtcloud

authors: [amy_chen]

tags: [dbt Cloud, Integrations, APIs]
hide_table_of_contents: false

date: 2023-12-20
is_featured: false
---
## Overview

Over the course of my three years running the Partner Engineering team at dbt Labs, the most common question I've been asked is, "How do we integrate with dbt?" Because those conversations often start out at the same place, I decided to create this guide so I'm no longer the blocker to fundamental information. This also allows us to skip the intro and get to the fun conversations so much faster, like what a joint solution for our customers would look like.

This guide doesn't include how to integrate with dbt Core. If you’re interested in creating a dbt adapter, please check out the [adapter development guide](https://docs.getdbt.com/guides/dbt-ecosystem/adapter-development/1-what-are-adapters) instead.

Instead, we're going to focus on integrating with dbt Cloud. Integrating with dbt Cloud is a key requirement to become a dbt Labs technology partner, opening the door to a variety of collaborative commercial opportunities.

Here I'll cover how to get started, potential use cases you may want to solve for, and the integration points available to do so.

## New to dbt Cloud?

If you're new to dbt and dbt Cloud, we recommend you and your software developers try our [Getting Started Quickstarts](https://docs.getdbt.com/guides) after reading [What is dbt](https://docs.getdbt.com/docs/introduction). The documentation will help you familiarize yourself with how our users interact with dbt. By going through this, you will also create a sample dbt project to test your integration.

If you require a partner dbt Cloud account to test on, we can upgrade an existing account or a trial account. This account may only be used for development, training, and demonstration purposes. Please contact your partner manager if you're interested, and provide the account ID (found in the URL). Our partner accounts include all of the enterprise-level functionality and can be provided with a signed partnership agreement.

## Integration points

- [Discovery API (formerly referred to as Metadata API)](https://docs.getdbt.com/docs/dbt-cloud-apis/discovery-api)
- **Overview** — This GraphQL API allows you to query the metadata that dbt Cloud generates every time a dbt project is run. We have two schemas available (environment-level and job-level). By default, we recommend that you integrate with the environment-level schema because it contains the latest state and historical run results of all the jobs run on the dbt Cloud project. The job-level schema only provides the metadata of one job, giving you a small snapshot of part of the project. A minimal query sketch follows this list.
- [Administrative (Admin) API](https://docs.getdbt.com/docs/dbt-cloud-apis/admin-cloud-api)
- **Overview** — This REST API allows you to orchestrate dbt Cloud job runs and helps you administer a dbt Cloud account. For metadata retrieval, we recommend integrating with the Discovery API instead.
- [Webhooks](https://docs.getdbt.com/docs/deploy/webhooks)
- **Overview** — Outbound webhooks can send notifications about your dbt Cloud jobs to other systems. These webhooks allow you to get the latest information about your dbt jobs in real time.
- [Semantic Layers/Metrics](https://docs.getdbt.com/docs/dbt-cloud-apis/sl-api-overview)
- **Overview** — Our Semantic Layer is made up of two parts: metrics definitions and the ability to interactively query the dbt metrics. For more details, here is a [basic overview](https://docs.getdbt.com/docs/use-dbt-semantic-layer/dbt-sl) and [our best practices](https://docs.getdbt.com/guides/dbt-ecosystem/sl-partner-integration-guide).
- Metrics definitions can be pulled from the Discovery API (linked above) or the Semantic Layer Driver/GraphQL API. The key difference is that the Discovery API can't pull the semantic graph, which provides the list of available dimensions that can be queried per metric; that's only available with the SL Driver/APIs. The trade-off is that the SL Driver/APIs don't have access to the lineage of the entire dbt project (that is, how the dbt metrics depend on dbt models).
- Three integration points are available for the Semantic Layer API.
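
To make the environment-level recommendation concrete, here's a minimal sketch of querying the Discovery API from Python. The endpoint URL, header format, and GraphQL field names are illustrative assumptions based on the documentation linked above; confirm them against the current Discovery API schema and the customer's access URL before relying on them.

```python
import requests

# Assumptions: metadata endpoint for a multi-tenant US instance and an
# environment-level query shape; verify both against the Discovery API docs.
DISCOVERY_URL = "https://metadata.cloud.getdbt.com/graphql"
SERVICE_TOKEN = "dbtc_..."   # account-scoped service token (hypothetical value)
ENVIRONMENT_ID = 123456      # the customer's production environment ID (hypothetical)

QUERY = """
query Models($environmentId: BigInt!, $first: Int!) {
  environment(id: $environmentId) {
    applied {
      models(first: $first) {
        edges {
          node {
            name
            uniqueId
            executionInfo { lastRunStatus executeCompletedAt }
          }
        }
      }
    }
  }
}
"""

response = requests.post(
    DISCOVERY_URL,
    headers={"Authorization": f"Bearer {SERVICE_TOKEN}"},
    json={"query": QUERY, "variables": {"environmentId": ENVIRONMENT_ID, "first": 50}},
    timeout=30,
)
response.raise_for_status()
for edge in response.json()["data"]["environment"]["applied"]["models"]["edges"]:
    node = edge["node"]
    print(node["uniqueId"], node["executionInfo"]["lastRunStatus"])
```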

## dbt Cloud hosting and authentication

To use the dbt Cloud APIs, you'll need the customer's access URL, which varies depending on their dbt Cloud setup. Refer to [Regions & IP addresses](https://docs.getdbt.com/docs/cloud/about-cloud/regions-ip-addresses) to understand all the possible configurations. My recommendation is to have the customer provide their own URL to simplify support.

If the customer is on an Azure single tenant instance, they don't currently have access to the Discovery API or the Semantic Layer APIs.

For authentication, we highly recommend that your integration uses account service tokens. You can read more about [how to create a service token and what permission sets to provide it](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens). Please note that depending on their plan type, customers will have access to different permission sets. We _do not_ recommend that users supply their user bearer tokens for authentication: this can cause issues if the user leaves the organization, and it grants access to all of the dbt Cloud accounts associated with that user rather than just the account (and related projects) they want to integrate with.
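
As a rough illustration of service-token authentication, the sketch below triggers a dbt Cloud job run through the Admin API. The access URL, account ID, and job ID are placeholders, and the exact path and header conventions should be confirmed against the Admin API reference linked above.

```python
import requests

# Placeholders — the customer supplies their access URL, account ID, and job ID.
ACCESS_URL = "https://cloud.getdbt.com"   # varies by the customer's dbt Cloud setup
ACCOUNT_ID = 1234
JOB_ID = 5678
SERVICE_TOKEN = "dbtc_..."                # account service token, not a user bearer token

resp = requests.post(
    f"{ACCESS_URL}/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
    headers={"Authorization": f"Token {SERVICE_TOKEN}"},
    json={"cause": "Triggered by partner integration"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["id"])  # the run ID, which you can poll for status
```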

## Potential use cases

- Event-based orchestration
- **Desired action** — You want to be notified when a scheduled dbt Cloud job completes, or you want to kick off a dbt Cloud job yourself. You can align your product's schedule to the dbt Cloud run schedule.
- **Examples** — Kicking off a dbt Cloud job after the ETL job that extracts and loads the data has completed, or receiving a webhook after a job has completed to kick off your reverse ETL job (see the webhook sketch after this list).
- **Integration points** — Webhooks and/or Admin API
- dbt lineage
- **Desired action** — You want to pull the dbt lineage metadata into your tool.
- **Example** — In your tool, you want to pull in the dbt DAG into your lineage diagram. For details on what you could pull and how to do this, refer to [Use cases and examples for the Discovery API](https://docs.getdbt.com/docs/dbt-cloud-apis/discovery-use-cases-and-examples).
- **Integration points** — Discovery API
- dbt environment/job metadata
- **Desired action** — You want to pull dbt Cloud job information into your tool, including the status of the jobs, the status of the tables built in the run, which tests passed, etc.
- **Example** — In your business intelligence tool, stakeholders select from tables that a dbt model created. You show the last time the model ran and passed its tests to demonstrate that the tables are current and can be trusted. For details on what you could pull and how to do this, refer to [What's the latest state of each model](https://docs.getdbt.com/docs/dbt-cloud-apis/discovery-use-cases-and-examples#whats-the-latest-state-of-each-model).
- **Integration points** — Discovery API
- dbt model documentation
- **Desired action** — You want to pull dbt project information into your tool, including model descriptions, column descriptions, etc.
- **Example** — You want to extract the dbt model description so you can display it and help stakeholders understand what they are selecting from. This way, the model creators can easily pass on the information without updating another system. For details on what you could pull and how to do this, refer to [What does this dataset and its columns mean](https://docs.getdbt.com/docs/dbt-cloud-apis/discovery-use-cases-and-examples#what-does-this-dataset-and-its-columns-mean).
- **Integration points** — Discovery API
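
For the event-based orchestration use case above, here's a minimal sketch of a webhook receiver in Python. It assumes the HMAC-SHA256 signature scheme and `eventType`/`data` field names described in the webhooks documentation; treat the header name and payload shape as assumptions to verify against a real payload.

```python
import hashlib
import hmac
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical secret obtained when the webhook subscription is created in dbt Cloud.
WEBHOOK_SECRET = os.environ.get("DBT_CLOUD_WEBHOOK_SECRET", "change-me")

class DbtCloudWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))

        # dbt Cloud signs the raw request body with the webhook secret; the exact
        # header name and scheme should be confirmed against the webhooks docs.
        expected = hmac.new(WEBHOOK_SECRET.encode(), body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, self.headers.get("Authorization", "")):
            self.send_response(403)
            self.end_headers()
            return

        event = json.loads(body)
        # Field names are illustrative; inspect a real payload for the exact shape.
        if event.get("eventType") == "job.run.completed":
            run = event.get("data", {})
            print(f"Run {run.get('runId')} finished with status {run.get('runStatus')}")
            # ...kick off the downstream (for example, reverse ETL) job here...

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), DbtCloudWebhookHandler).serve_forever()
```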

dbt Core-only users won't have access to the integration points above. For dbt metadata, our partners often build a dbt Core integration by using the [dbt artifact](https://www.getdbt.com/product/semantic-layer/) files generated by each run and provided by the user. With the Discovery API, we provide a dynamic way to get the latest information parsed out for you.

## dbt Cloud plans & permissions

[The dbt Cloud plan type](https://www.getdbt.com/pricing) will change what the user has access to. There are four different types of plans:

- **Developer** — This plan is free and available to one user, with a limited number of successful models built. It can't access the APIs, webhooks, or Semantic Layer and is limited to just one project.
- **Team** — This plan provides access to the APIs, webhooks, and Semantic Layer. You can have up to eight users on the account and one dbt Cloud project. It's limited to 15,000 successful models built.
- **Enterprise** (multi-tenant/multi-cell) — This plan provides access to the APIs, webhooks, and Semantic Layer. You can have more than one dbt Cloud project, based on how many dbt projects/domains the customer has using dbt. The majority of our enterprise customers are on multi-tenant dbt Cloud instances.
- **Enterprise** (single tenant) — This plan might have access to the APIs, webhooks, and Semantic Layer. If you're working with a specific customer, let us know and we can confirm whether their instance has access.

## FAQs

- What is a dbt Cloud project?
- A dbt Cloud project is made up of two connections: one to the Git repository and one to the data warehouse/platform. Most customers will have only one dbt Cloud project in their account, but there are enterprise clients who might have more depending on their use cases. The project also encapsulates, at a minimum, two types of environments: a development environment and a deployment environment.
- Folks commonly refer to the [dbt project](https://docs.getdbt.com/docs/build/projects) as the code hosted in their Git repository.
- What is a dbt Cloud environment?
- For an overview, check out [About environments](https://docs.getdbt.com/docs/environments-in-dbt). At a minimum, a project will have one deployment-type environment that jobs are executed in. The development environment powers the dbt Cloud IDE and Cloud CLI.
- Can we write back to the dbt project?
- At this moment, we don't have a Write API. A dbt project is hosted in a Git repository, so if you have a Git provider integration, you can manually open a pull request (PR) on the project to maintain the version control process.
- Can you provide column-level information in the lineage?
- Column-level lineage is currently in beta release with more information to come.
- How do I get a Partner Account?
- Contact your Partner Manager with your account ID (in your URL).
- Why shouldn't I use the Admin API to pull out the dbt artifacts for metadata?
- We recommend against integrating with the Admin API to extract the dbt artifacts for metadata. The Discovery API provides more extensive information, a user-friendly structure, and a more reliable integration point.
- How do I get access to the dbt brand assets?
- Check out our [Brand guidelines](https://www.getdbt.com/brand-guidelines/) page. Please make sure you’re not using our old logo (hint: there should only be one hole in the logo). Please also note that the name dbt and the dbt logo are trademarked by dbt Labs, and that use is governed by our brand guidelines, which are fairly specific for commercial uses. If you have any questions about proper use of our marks, please ask your partner manager.
- How do I engage with the partnerships team?
- Email [email protected].
2 changes: 1 addition & 1 deletion website/blog/authors.yml
@@ -1,6 +1,6 @@
amy_chen:
image_url: /img/blog/authors/achen.png
job_title: Staff Partner Engineer
job_title: Product Partnerships Manager
links:
- icon: fa-linkedin
url: https://www.linkedin.com/in/yuanamychen/
14 changes: 9 additions & 5 deletions website/docs/docs/build/incremental-models.md
@@ -154,17 +154,21 @@ For detailed usage instructions, check out the [dbt run](/reference/commands/run

# Understanding incremental models
## When should I use an incremental model?
It's often desirable to build models as tables in your data warehouse since downstream queries are more performant. While the `table` materialization also creates your models as tables, it rebuilds the table on each dbt run. These runs can become problematic in that they use a lot of compute when either:
* source data tables have millions, or even billions, of rows.
* the transformations on the source data are computationally expensive (that is, take a long time to execute), for example, complex Regex functions, or UDFs are being used to transform data.

Like many things in programming, incremental models are a trade-off between complexity and performance. While they are not as straightforward as the `view` and `table` materializations, they can lead to significantly better performance of your dbt runs.
Building models as tables in your data warehouse is often preferred for better query performance. However, using `table` materialization can be computationally intensive, especially when:

- Source data has millions or billions of rows.
- Data transformations on the source data are computationally expensive (take a long time to execute) and complex, like using Regex or UDFs.

Incremental models offer a balance between complexity and performance: they're less straightforward than the `view` and `table` materializations, but they can significantly improve the performance of your dbt runs.

In addition to these considerations for incremental models, it's important to understand their limitations and challenges, particularly with large datasets. For more insights into efficient strategies, performance considerations, and the handling of late-arriving data in incremental models, refer to the [On the Limits of Incrementality](https://discourse.getdbt.com/t/on-the-limits-of-incrementality/303) discourse discussion.

## Understanding the is_incremental() macro
The `is_incremental()` macro will return `True` if _all_ of the following conditions are met:
* the destination table already exists in the database
* dbt is _not_ running in full-refresh mode
* the running model is configured with `materialized='incremental'`
* The running model is configured with `materialized='incremental'`

Note that the SQL in your model needs to be valid whether `is_incremental()` evaluates to `True` or `False`.

4 changes: 4 additions & 0 deletions website/docs/docs/build/join-logic.md
@@ -84,6 +84,10 @@ mf query --metrics average_purchase_price --dimensions metric_time,user_id__type

## Multi-hop joins

:::info
MetricFlow can join three tables at most, supporting multi-hop joins with a limit of two hops.
:::

MetricFlow allows users to join measures and dimensions across a graph of entities, which we refer to as a 'multi-hop join.' This is because users can move from one table to another like a 'hop' within a graph.

Here's an example schema for reference:
45 changes: 45 additions & 0 deletions website/docs/docs/cloud/migration.md
@@ -0,0 +1,45 @@
---
title: "Multi-cell migration checklist"
id: migration
description: "Prepare for account migration to AWS cell-based architecture."
pagination_next: null
pagination_prev: null
---

dbt Labs is in the process of migrating dbt Cloud to a new _cell-based architecture_. This architecture will be the foundation of dbt Cloud for years to come, and will bring improved scalability, reliability, and security to all customers and users of dbt Cloud.

There is some preparation required to ensure a successful migration.

Migrations are being scheduled on a per-account basis. _If you haven't received any communication (either with a banner or by email) about a migration date, you don't need to take any action at this time._ dbt Labs will share migration date information with you, with appropriate advance notice, before we complete any migration steps in the dbt Cloud backend.

This document outlines the steps that you must take to prevent service disruptions before your environment is migrated over to the cell-based architecture. This will impact areas such as login, IP restrictions, and API access.

## Pre-migration checklist

Prior to your migration date, your dbt Cloud account admin will need to make some changes to your account.

If your account is scheduled for migration, you will see a banner indicating your migration date when you log in. If you don't see a banner, you don't need to take any action.

1. **IP addresses** — dbt Cloud will be using new IPs to access your warehouse after the migration. Make sure to allow inbound traffic from these IPs in your firewall and include them in any database grants. All six of the IPs below should be added to allowlists.
* Old IPs: `52.45.144.63`, `54.81.134.249`, `52.22.161.231`
* New IPs: `52.3.77.232`, `3.214.191.130`, `34.233.79.135`
2. **APIs and integrations** — Each dbt Cloud account will be allocated a static access URL like `aa000.us1.dbt.com`. You should begin migrating your API access and partner integrations to use the new static subdomain as soon as possible (a small sketch follows this checklist). You can find your access URL on:
* Any page where you generate or manage API tokens.
* The **Account Settings** > **Account page**.

:::important Multiple account access
Be careful: each account that you have access to will have a different, dedicated [access URL](https://next.docs.getdbt.com/docs/cloud/about-cloud/access-regions-ip-addresses#accessing-your-account).
:::

3. **IDE sessions** — Any uncommitted changes in the IDE might be lost during the migration process. dbt Labs _strongly_ encourages you to commit all changes in the IDE before your scheduled migration time.
4. **User invitations** — Any pending user invitations will be invalidated during the migration. You can resend the invitations once the migration is complete.
5. **Git integrations** — Integrations with GitHub, GitLab, and Azure DevOps will need to be manually updated. dbt Labs will not be migrating any accounts using these integrations at this time. If you're using one of these integrations and your account is scheduled for migration, please contact support and we will delay your migration.
6. **SSO integrations** — Integrations with SSO identity providers (IdPs) will need to be manually updated. dbt Labs will not be migrating any accounts using SSO at this time. If you're using one of these integrations and your account is scheduled for migration, please contact support and we will delay your migration.
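
As a small sketch of the API change in step 2, the only update most integrations need is the base URL used to reach the dbt Cloud APIs. The subdomain below reuses the documentation's `aa000.us1.dbt.com` example and is hypothetical for any real account; confirm the endpoint and response shape against the Admin API reference.

```python
import requests

# Hypothetical static access URL allocated to the account (see step 2 above).
ACCESS_URL = "https://aa000.us1.dbt.com"   # previously, for example, https://cloud.getdbt.com
SERVICE_TOKEN = "dbtc_..."
ACCOUNT_ID = 1234

resp = requests.get(
    f"{ACCESS_URL}/api/v2/accounts/{ACCOUNT_ID}/",
    headers={"Authorization": f"Token {SERVICE_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["name"])  # sanity check that the new URL resolves the account
```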

## Post-migration

After migration, if you completed all the [Pre-migration checklist](#pre-migration-checklist) items, your dbt Cloud resources and jobs will continue to work as they did before.

You have the option to log in to dbt Cloud at a different URL:
* If you were previously logging in at `cloud.getdbt.com`, you should instead plan to log in at `us1.dbt.com`. The original URL will still work, but you'll have to click through to be redirected upon login.
* You may also log in directly with your account’s unique [access URL](https://next.docs.getdbt.com/docs/cloud/about-cloud/access-regions-ip-addresses#accessing-your-account).
5 changes: 1 addition & 4 deletions website/docs/docs/core/connect-data-platform/spark-setup.md
@@ -20,10 +20,6 @@ meta:
<Snippet path="warehouse-setups-cloud-callout" />
<Snippet path="dbt-databricks-for-databricks" />

:::note
See [Databricks setup](#databricks-setup) for the Databricks version of this page.
:::

import SetUpPages from '/snippets/_setup-pages-intro.md';

<SetUpPages meta={frontMatter.meta} />
@@ -204,6 +200,7 @@ connect_retries: 3


<VersionBlock firstVersion="1.7">

### Server side configuration

Spark can be customized using [Application Properties](https://spark.apache.org/docs/latest/configuration.html). Using these properties, the execution can be customized, for example, by allocating more memory to the driver process. The Spark SQL runtime can also be set through these properties. For example, this allows the user to [set Spark catalogs](https://spark.apache.org/docs/latest/configuration.html#spark-sql).
