Merge branch 'current' into amychen1776-patch-8
matthewshaver authored Feb 12, 2024
2 parents bdfcdbb + 546b6f2 commit 95ea1d6
Showing 52 changed files with 623 additions and 220 deletions.
28 changes: 14 additions & 14 deletions website/blog/2023-01-24-aggregating-test-failures.md
@@ -16,9 +16,9 @@ Testing the quality of data in your warehouse is an important aspect in any matu

<!--truncate-->

At [Tempus](https://www.tempus.com/), a precision medicine company specializing in oncology, high quality data is a necessary component for high quality clinical models. With roughly 1,000 dbt models, nearly a hundred data sources, and a dozen different data quality stakeholders, producing a framework that allows stakeholders to take action on test failures is challenging. Without an actionable framework, data quality tests can backfire — in early 2022, we had nearly a thousand tests, hundreds of which failed on a daily basis yet were wholly ignored.
Producing a data quality framework that allows stakeholders to take action on test failures is challenging. Without an actionable framework, data quality tests can backfire — one failing test becomes two becomes ten and suddenly you have too many test failures to act on any of them.

Recently, we overhauled our testing framework. We cut the number of tests down to 200, creating a more mature framework that includes metadata and emphasizes actionability. Our system for managing data quality is a three step process, described below:
Recently, we overhauled our testing framework. We cut the number of tests down by 80% to create a more mature framework that includes metadata and emphasizes actionability. Our system for managing data quality is a three step process, described below:

1. Leveraging the contextual knowledge of stakeholders, writing specific, high quality data tests, perpetuating test failure results into aliased models for easy access.
1. Aggregating test failure results using Jinja macros and pre-configured metadata to pull together high level summary tables.
@@ -37,35 +37,35 @@ Data Integrity tests (Generic Tests) are simple — they're tests akin to a
```yaml
version: 2
models:
  - name: patient
  - name: customer
    columns:
      - name: id
        description: Unique ID associated with the record
        tests:
          - unique:
              alias: patient__id__unique
              alias: id__unique
          - not_null:
              alias: patient__id__not_null
              alias: id__not_null
```
<center><i>Example Data Integrity Tests in a YAML file — the alias argument is an important piece that will be touched on later.</i></center><br />
Context Driven Tests are more complex and look a lot more like models. Essentially, they’re data models that select bad data or records we don’t want, defined as SQL files that live in the `dbt/tests` directory. An example is shown below —

```sql
{{ config(
    tags=['check_birth_date_in_range', 'patient'],
    alias='ad_hoc__check_birth-date_in_range'
    tags=['check_purchase_date_in_range', 'customer'],
    alias='ad_hoc__check_purchase_date_in_range'
  )
}}
SELECT
  id,
  birth_date
  purchase_date
FROM
  {{ ref('patient') }}
WHERE birth_date < '1900-01-01'
  {{ ref('customer') }}
WHERE purchase_date < '1900-01-01'
```
<center><i>The above test selects all patients with a birth date before 1900, due to data rules we have about maximum patient age.</i></center><br />
<center><i>The above test selects all customers who have made a purchase before 1900. The idea is that any customer that exists before 1900 probably isn't real.</i></center><br />

Importantly, we leverage [Test Aliasing](https://docs.getdbt.com/reference/resource-configs/alias) to ensure that our tests all follow a standard and predictable naming convention; our naming convention for Data Integrity tests is *table_name__column_name__test_name*, and our naming convention for Context Driven Tests is *ad_hoc__test_name*. Finally, to ensure all of our tests can then be aggregated, we modify the `dbt_project.yml` file and [set the `store_failures` config to `true`](https://docs.getdbt.com/reference/resource-configs/store_failures), thus persisting test failures into SQL tables.
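
As a rough sketch, that `dbt_project.yml` change might look like the following (the project name here is hypothetical):

```yaml
# Hypothetical sketch — persist all test failures as tables in the warehouse
tests:
  our_dbt_project:
    +store_failures: true
```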

@@ -86,15 +86,15 @@ After defining our metadata Seed file, we begin the process of aggregating our d
    incremental_strategy = 'merge',
    unique_key='row_key',
    full_refresh=false,
    tags=['dq_test_warning_failures','clinical_mart', 'data_health']
    tags=['dq_test_warning_failures','customer_mart', 'data_health']
  )
}}
WITH failures as (
  SELECT
    count(*) as test_failures,
    _TABLE_SUFFIX as table_suffix,
  FROM {{ var('clinical_mart_schema') }}_dbt_test__audit.`*`
  FROM {{ var('customer_mart_schema') }}_dbt_test__audit.`*`
  GROUP BY _TABLE_SUFFIX
),

@@ -131,4 +131,4 @@ With our finalized data quality base table, there are many other options for cle

First, we create views on top of the base table that filter down by test owner. We strongly believe that test noise is the biggest risk towards the success of a quality framework. Creating specific views is like giving each team a magnifying glass that lets them zoom into only the tests they care about. We also have a dashboard, currently in Google Looker Studio, that shows historical test failures with a suite of filters to let users magnify high severity tests and constructs machine-composed example queries for users to select failing records. When a test fails, a business analyst can copy and paste a query from the dashboard and get all the relevant information.
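
For illustration, one of those per-owner views might look roughly like the following sketch (the schema, table, and column names are hypothetical, not the ones used in our project):

```sql
-- Hypothetical sketch: expose only one team's failures from the aggregated base table
CREATE OR REPLACE VIEW data_health.failures__analytics_team AS
SELECT
  test_name,
  test_failures,
  severity,
  updated_at
FROM data_health.test_failure_summary
WHERE test_owner = 'analytics_team';
```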

As with any framework, it’s always a work in progress — we still encounter issues with noise in our tests, and still struggle to wrangle our users to care when a test fails. However, we’ve found that this data framework works exceptionally well at enabling data users to create and deploy their own tests. All they need to do is submit a pull request with SQL code that flags bad data, and write one line of metadata.
8 changes: 4 additions & 4 deletions website/dbt-versions.js
@@ -1,4 +1,8 @@
exports.versions = [
  {
    version: "1.8",
    isPrerelease: "true",
  },
  {
    version: "1.7",
    EOLDate: "2024-10-30",
@@ -174,10 +178,6 @@ exports.versionedPages = [
"page": "reference/resource-configs/grants",
"firstVersion": "1.2",
},
{
"page": "docs/build/saved-queries",
"firstVersion": "1.7",
},
{
"page": "reference/resource-configs/on_configuration_change",
"firstVersion": "1.6",
2 changes: 1 addition & 1 deletion website/docs/docs/build/about-metricflow.md
@@ -17,7 +17,7 @@ Before you start, consider the following guidelines:
- Define metrics in YAML and query them using these [new metric specifications](https://github.com/dbt-labs/dbt-core/discussions/7456) (see the sketch after this list).
- You must be on [dbt version](/docs/dbt-versions/upgrade-core-in-cloud) 1.6 or higher to use MetricFlow.
- Use MetricFlow with Snowflake, BigQuery, Databricks, Postgres (dbt Core only), or Redshift.
- Discover insights and query your metrics using the [dbt Semantic Layer](/docs/use-dbt-semantic-layer/dbt-sl) and its diverse range of [available integrations](/docs/use-dbt-semantic-layer/avail-sl-integrations). You must have a dbt Cloud account on the [Team or Enterprise plan](https://www.getdbt.com/pricing/).
- Discover insights and query your metrics using the [dbt Semantic Layer](/docs/use-dbt-semantic-layer/dbt-sl) and its diverse range of [available integrations](/docs/use-dbt-semantic-layer/avail-sl-integrations).
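
For orientation, a minimal metric defined in YAML might look like the following sketch (the metric and measure names are illustrative, not taken from this page):

```yaml
# Illustrative sketch of a simple metric using the new specifications
metrics:
  - name: order_total
    label: Order total
    description: Sum of all order amounts.
    type: simple
    type_params:
      measure: order_total
```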

## MetricFlow

4 changes: 2 additions & 2 deletions website/docs/docs/build/build-metrics-intro.md
@@ -22,14 +22,14 @@ MetricFlow allows you to:
<div className="grid--3-col">

<Card
title="Get started with the dbt Semantic Layer"
title="Get started with the dbt Semantic Layer and MetricFlow"
body="Use this guide to build and define metrics with MetricFlow, set up the dbt Semantic Layer, and query them using downstream tools."
link="/docs/build/sl-getting-started"
icon="dbt-bit"/>

<Card
title="About MetricFlow"
body="Understand MetricFlow's core concepts, key principles, and how to use this powerful tool."
body="Understand MetricFlow's core concepts, how to use joins, how to save commonly used queries, and what commands are available."
link="/docs/build/about-metricflow"
icon="dbt-bit"/>

6 changes: 3 additions & 3 deletions website/docs/docs/build/data-tests.md
@@ -43,13 +43,13 @@ These tests are defined in `.sql` files, typically in your `tests` directory (as

```sql
-- Refunds have a negative amount, so the total amount should always be >= 0.
-- Therefore return records where this isn't true to make the test fail
-- Therefore return records where total_amount < 0 to make the test fail.
select
order_id,
sum(amount) as total_amount
from {{ ref('fct_payments' )}}
group by 1
having not(total_amount >= 0)
having total_amount < 0
```

</File>
@@ -247,7 +247,7 @@ This workflow allows you to query and examine failing records much more quickly

<Lightbox src="/img/docs/building-a-dbt-project/test-store-failures.gif" title="Store test failures in the database for faster development-time debugging."/>

Note that, if you elect to store test failures:
Note that, if you choose to store test failures (a minimal configuration sketch follows this list):
* Test result tables are created in a schema suffixed or named `dbt_test__audit`, by default. It is possible to change this value by setting a `schema` config. (For more details on schema naming, see [using custom schemas](/docs/build/custom-schemas).)
- A test's results will always **replace** previous failures for the same test.
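
A minimal sketch of these configurations in `dbt_project.yml` (the custom schema name is hypothetical):

```yaml
# Hypothetical sketch: store failures and write them to a custom audit schema
# (on newer dbt versions this config block may be keyed `data_tests` instead of `tests`)
tests:
  +store_failures: true
  +schema: test_failures
```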

28 changes: 28 additions & 0 deletions website/docs/docs/build/metricflow-commands.md
@@ -75,6 +75,7 @@ You can use the `dbt sl` prefix before the command name to execute them in the d
- [`list dimensions`](#list) &mdash; Lists unique dimensions for metrics.
- [`list dimension-values`](#list-dimension-values) &mdash; List dimensions with metrics.
- [`list entities`](#list-entities) &mdash; Lists all unique entities.
- [`list saved-queries`](#list-saved-queries) &mdash; Lists available saved queries. Use the `--show-exports` flag to display each export listed under a saved query.
- [`query`](#query) &mdash; Query metrics, saved queries, and dimensions you want to see in the command line interface. Refer to [query examples](#query-examples) to help you get started.

<!--below commands aren't supported in dbt cloud yet
@@ -174,6 +175,33 @@ Options:
--help Show this message and exit.
```

### List saved queries

This command lists all available saved queries:

```bash
dbt sl list saved-queries
```

You can also add the `--show-exports` flag (or option) to show each export listed under a saved query:

```bash
dbt sl list saved-queries --show-exports
```

**Output**

```bash
dbt sl list saved-queries --show-exports

The list of available saved queries:
- new_customer_orders
exports:
- Export(new_customer_orders_table, exportAs=TABLE)
- Export(new_customer_orders_view, exportAs=VIEW)
- Export(new_customer_orders, alias=orders, schemas=customer_schema, exportAs=TABLE)
```

### Validate-configs

The following command performs validations against the defined semantic model configurations.
2 changes: 1 addition & 1 deletion website/docs/docs/build/metrics-overview.md
@@ -40,7 +40,7 @@ metrics:
This page explains the different supported metric types you can add to your dbt project.
### Conversion metrics <Lifecycle status='new'/>
### Conversion metrics
[Conversion metrics](/docs/build/conversion) help you track when a base event and a subsequent conversion event occur for an entity within a set time period.
46 changes: 39 additions & 7 deletions website/docs/docs/build/saved-queries.md
@@ -8,15 +8,21 @@ tags: [Metrics, Semantic Layer]

Saved queries are a way to save commonly used queries in MetricFlow. You can group metrics, dimensions, and filters that are logically related into a saved query.

To define a saved query, refer to the following specification:
### Exports and saved queries comparison

Parameter | Description | Type |
| --------- | ----------- | ---- |
| `name` | The name of the metric. | Required |
| `description` | The description of the metric. | Optional |
| `query_params` | The query parameters for the saved query: `metrics`, `group_by`, and `where`. | Required |
Saved queries are distinct from [exports](/docs/use-dbt-semantic-layer/exports), which schedule and execute saved queries using [dbt Cloud's job scheduler](/docs/deploy/job-scheduler). The following table compares the features and usage of exports and saved queries:

The following is an example of a saved query:
| Feature | Exports | <div style={{width:'250px', textAlign:'center'}}>Saved queries</div> |
| ----------- | ----------- | ---------------- |
| **Availability** | Available on dbt Cloud [Team or Enterprise](https://www.getdbt.com/pricing/) plans on dbt versions 1.7 or newer.| Available in both dbt Core and dbt Cloud. |
| **Purpose** | To materialize saved queries in your data platform and expose metrics and dimensions as a view or table. | To define and manage common Semantic Layer queries in YAML, which includes metrics and dimensions. |
| **Usage** | Automatically runs saved queries and materializes them within your data platform. Exports count towards [queried metrics](/docs/cloud/billing#what-counts-as-a-queried-metric) usage. <br /><br />**Example**: Creating a weekly aggregated table for active user metrics, automatically updated and stored in the data platform. | Used for organizing and reusing common MetricFlow queries within dbt projects.<br /><br />**Example**: Grouping related metrics together for better organization, and including commonly used dimensions and filters. |
| **Integration** | Must have the dbt Semantic Layer configured in your dbt project.<br /><br />Tightly integrated with the [MetricFlow Server](/docs/use-dbt-semantic-layer/sl-architecture#components) and dbt Cloud's job scheduler. | Integrated into the dbt <Term id="dag" /> and managed alongside other dbt nodes. |
| **Configuration** | Defined within the `saved_queries` configuration. Set up within the dbt Cloud environment and job scheduler settings. | Defined in YAML format within dbt project files. |

All metrics in a saved query need to use the same dimensions in the `group_by` or `where` clauses. The following is an example of a saved query:

<File name='semantic_model.yml'>

```yaml
saved_queries:
@@ -32,5 +38,31 @@ saved_queries:
    where:
      - "{{ Dimension('listing__capacity_latest') }} > 3"
```
</File>
## Parameters
To define a saved query, refer to the following parameters:
| Parameter | Type | Required | Description |
|-------|---------|----------|----------------|
| `name` | String | Required | Name of the saved query object. |
| `description` | String | Required | A description of the saved query. |
| `query_params` | Structure | Required | Contains the query parameters. |
| `query_params::metrics` | List or String | Optional | A list of the metrics to be used in the query as specified in the command line interface. |
| `query_params::group_bys` | List or String | Optional | A list of the Entities and Dimensions to be used in the query, which include the `Dimension` or `TimeDimension`. |
| `query_params::where` | List or String | Optional | A list of strings that may include the `Dimension` or `TimeDimension` objects. |
| `exports` | List or Structure | Optional | A list of exports to be specified with the exports structure. |
| `exports::name` | String | Required | Name of the export object. |
| `exports::config` | List or Structure | Required | A config section for any parameters specifying the export. |
| `exports::config::export_as` | String | Required | The type of export to run. Options currently include `table` and `view`, with `cache` planned for the near future. |
| `exports::config::schema` | String | Optional | The schema used for creating the table or view. This option cannot be used for caching. |
| `exports::config::alias` | String | Optional | The table alias to use to write the table or view. This option cannot be used for caching. |

All metrics in a saved query need to use the same dimensions in the `group_by` or `where` clauses.
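
Putting these parameters together, a sketch of a saved query that also defines an export might look like the following (all names and values are illustrative, not taken from this page):

```yaml
# Illustrative sketch: a saved query with one export materialized as a table
saved_queries:
  - name: weekly_order_metrics
    description: Commonly queried order metrics, grouped by day.
    query_params:
      metrics:
        - orders
        - order_total
      group_by:
        - TimeDimension('metric_time', 'day')
      where:
        - "{{ Dimension('order__is_completed') }} = true"
    exports:
      - name: weekly_order_metrics_table
        config:
          export_as: table
          schema: analytics_exports
          alias: weekly_order_metrics
```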


## Related docs

- [Exports](/docs/use-dbt-semantic-layer/exports)
- [Set up the dbt Semantic Layer](/docs/use-dbt-semantic-layer/setup-sl)