Merge branch 'current' into explorer-dev-blog
dave-connors-3 authored Feb 13, 2024
2 parents 13aa2f0 + 2b7539f commit bfb6bec
Showing 40 changed files with 282 additions and 202 deletions.
28 changes: 14 additions & 14 deletions website/blog/2023-01-24-aggregating-test-failures.md
@@ -16,9 +16,9 @@ Testing the quality of data in your warehouse is an important aspect in any matu

<!--truncate-->

At [Tempus](https://www.tempus.com/), a precision medicine company specializing in oncology, high quality data is a necessary component for high quality clinical models. With roughly 1,000 dbt models, nearly a hundred data sources, and a dozen different data quality stakeholders, producing a framework that allows stakeholders to take action on test failures is challenging. Without an actionable framework, data quality tests can backfire — in early 2022, we had nearly a thousand tests, hundreds of which failed on a daily basis yet were wholly ignored.
Producing a data quality framework that allows stakeholders to take action on test failures is challenging. Without an actionable framework, data quality tests can backfire — one failing test becomes two becomes ten and suddenly you have too many test failures to act on any of them.

Recently, we overhauled our testing framework. We cut the number of tests down to 200, creating a more mature framework that includes metadata and emphasizes actionability. Our system for managing data quality is a three step process, described below:
Recently, we overhauled our testing framework. We cut the number of tests down by 80% to create a more mature framework that includes metadata and emphasizes actionability. Our system for managing data quality is a three step process, described below:

1. Leveraging the contextual knowledge of stakeholders to write specific, high quality data tests, then persisting test failure results into aliased models for easy access.
1. Aggregating test failure results using Jinja macros and pre-configured metadata to pull together high level summary tables.
@@ -37,35 +37,35 @@ Data Integrity tests (Generic Tests)  are simple — they’re tests akin to a
```yaml
version: 2
models:
- name: patient
- name: customer
columns:
- name: id
description: Unique ID associated with the record
tests:
- unique:
alias: patient__id__unique
alias: id__unique
- not_null:
alias: patient__id__not_null
alias: id__not_null
```
<center><i>Example Data Integrity Tests in a YAML file — the alias argument is an important piece that will be touched on later.</i></center><br />
Context Driven Tests are more complex and look a lot more like models. Essentially, they’re data models that select bad data or records we don’t want, defined as SQL files that live in the `dbt/tests` directory. An example is shown below —

```sql
{{ config(
tags=['check_birth_date_in_range', 'patient'],
alias='ad_hoc__check_birth-date_in_range'
tags=['check_purchase_date_in_range', 'customer'],
alias='ad_hoc__check_purchase_date_in_range'
)
}}
SELECT
id,
birth_date
purchase_date
FROM
{{ ref('patient') }}
WHERE birth_date < '1900-01-01'
{{ ref('customer') }}
WHERE purchase_date < '1900-01-01'
```
<center><i>The above test selects all patients with a birth date before 1900, due to data rules we have about maximum patient age.</i></center><br />
<center><i>The above test selects all customers who have made a purchase before 1900. The idea is that any customer that exists before 1900 probably isn't real.</i></center><br />

Importantly, we leverage [Test Aliasing](https://docs.getdbt.com/reference/resource-configs/alias) to ensure that our tests all follow a standard and predictable naming convention; our naming convention for Data Integrity tests is *table_name__column_name__test_name*, and our naming convention for Context Driven Tests is *ad_hoc__test_name*. Finally, to ensure all of our tests can then be aggregated, we modify the `dbt_project.yml` file and [set the `store_failures` config to `true`](https://docs.getdbt.com/reference/resource-configs/store_failures), thus persisting test failures into SQL tables.
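
A minimal sketch of that `dbt_project.yml` change, assuming the setting is applied to every test in the project rather than scoped to a single package or folder:

```yaml
# dbt_project.yml (fragment)
# Assumption: applied project-wide; it could instead be nested under a
# project or folder key to limit which tests store their failures.
tests:
  +store_failures: true
```

With this in place, each failing test writes its rows to a table named after the test's alias in a schema suffixed with `dbt_test__audit`, which is why predictable aliases matter for the aggregation step.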

@@ -86,15 +86,15 @@ After defining our metadata Seed file, we begin the process of aggregating our d
incremental_strategy = 'merge',
unique_key='row_key',
full_refresh=false,
tags=['dq_test_warning_failures','clinical_mart', 'data_health']
tags=['dq_test_warning_failures','customer_mart', 'data_health']
)
}}
WITH failures as (
SELECT
count(*) as test_failures,
_TABLE_SUFFIX as table_suffix,
FROM {{ var('clinical_mart_schema') }}_dbt_test__audit.`*`
FROM {{ var('customer_mart_schema') }}_dbt_test__audit.`*`
GROUP BY _TABLE_SUFFIX
),

Expand Down Expand Up @@ -131,4 +131,4 @@ With our finalized data quality base table, there are many other options for cle

First, we create views on top of the base table that filter down by test owner. We strongly believe that test noise is the biggest risk towards the success of a quality framework. Creating specific views is like giving each team a magnifying glass that lets them zoom into only the tests they care about. We also have a dashboard, currently in Google Looker Studio, that shows historical test failures with a suite of filters to let users magnify high severity tests and constructs machine-composed example queries for users to select failing records. When a test fails, a business analyst can copy and paste a query from the dashboard and get all the relevant information.

As with any framework, it’s always a work in progress — we still encounter issues with noise in our tests, and still struggle to wrangle our users to care when a test fails. However, we’ve found that this data framework works exceptionally well at enabling data users to create and deploy their own tests. All they need to do is submit a pull request with SQL code that flags bad data, and write one line of metadata.
As with any framework, it’s always a work in progress — we still encounter issues with noise in our tests, and still struggle to wrangle our users to care when a test fails. However, we’ve found that this data framework works exceptionally well at enabling data users to create and deploy their own tests. All they need to do is submit a pull request with SQL code that flags bad data, and write one line of metadata.
4 changes: 4 additions & 0 deletions website/dbt-versions.js
@@ -1,4 +1,8 @@
exports.versions = [
{
version: "1.8",
isPrerelease: "true",
},
{
version: "1.7",
EOLDate: "2024-10-30",
6 changes: 3 additions & 3 deletions website/docs/docs/build/data-tests.md
@@ -43,13 +43,13 @@ These tests are defined in `.sql` files, typically in your `tests` directory (as

```sql
-- Refunds have a negative amount, so the total amount should always be >= 0.
-- Therefore return records where this isn't true to make the test fail
-- Therefore return records where total_amount < 0 to make the test fail.
select
order_id,
sum(amount) as total_amount
from {{ ref('fct_payments' )}}
group by 1
having not(total_amount >= 0)
having total_amount < 0
```

</File>
@@ -247,7 +247,7 @@ This workflow allows you to query and examine failing records much more quickly

<Lightbox src="/img/docs/building-a-dbt-project/test-store-failures.gif" title="Store test failures in the database for faster development-time debugging."/>

Note that, if you elect to store test failures:
Note that, if you choose to store test failures:
* Test result tables are created in a schema suffixed or named `dbt_test__audit`, by default. It is possible to change this value by setting a `schema` config. (For more details on schema naming, see [using custom schemas](/docs/build/custom-schemas).)
- A test's results will always **replace** previous failures for the same test.

28 changes: 28 additions & 0 deletions website/docs/docs/build/metricflow-commands.md
@@ -75,6 +75,7 @@ You can use the `dbt sl` prefix before the command name to execute them in the d
- [`list dimensions`](#list) &mdash; Lists unique dimensions for metrics.
- [`list dimension-values`](#list-dimension-values) &mdash; List dimensions with metrics.
- [`list entities`](#list-entities) &mdash; Lists all unique entities.
- [`list saved-queries`](#list-saved-queries) &mdash; Lists available saved queries. Use the `--show-exports` flag to display each export listed under a saved query.
- [`query`](#query) &mdash; Query metrics, saved queries, and dimensions you want to see in the command line interface. Refer to [query examples](#query-examples) to help you get started.

<!--below commands aren't supported in dbt cloud yet
@@ -174,6 +175,33 @@ Options:
--help Show this message and exit.
```

### List saved queries

This command lists all available saved queries:

```bash
dbt sl list saved-queries
```

You can also add the `--show-exports` flag (or option) to show each export listed under a saved query:

```bash
dbt sl list saved-queries --show-exports
```

**Output**

```bash
dbt sl list saved-queries --show-exports

The list of available saved queries:
- new_customer_orders
exports:
- Export(new_customer_orders_table, exportAs=TABLE)
- Export(new_customer_orders_view, exportAs=VIEW)
- Export(new_customer_orders, alias=orders, schemas=customer_schema, exportAs=TABLE)
```

### Validate-configs

The following command performs validations against the defined semantic model configurations.
2 changes: 1 addition & 1 deletion website/docs/docs/build/metrics-overview.md
@@ -40,7 +40,7 @@ metrics:
This page explains the different supported metric types you can add to your dbt project.
### Conversion metrics <Lifecycle status='new'/>
### Conversion metrics
[Conversion metrics](/docs/build/conversion) help you track when a base event and a subsequent conversion event occurs for an entity within a set time period.
8 changes: 4 additions & 4 deletions website/docs/docs/build/saved-queries.md
@@ -18,7 +18,7 @@ Saved queries are distinct from [exports](/docs/use-dbt-semantic-layer/exports),
| **Purpose** | To materialize saved queries in your data platform and expose metrics and dimensions as a view or table. | To define and manage common Semantic Layer queries in YAML, which includes metrics and dimensions. |
| **Usage** | Automatically runs saved queries and materializes them within your data platform. Exports count towards [queried metrics](/docs/cloud/billing#what-counts-as-a-queried-metric) usage. <br /><br />**Example**: Creating a weekly aggregated table for active user metrics, automatically updated and stored in the data platform. | Used for organizing and reusing common MetricFlow queries within dbt projects.<br /><br /><br />**Example**: Group related metrics together for better organization, and include commonly used dimensions and filters. | For materializing query results in the data platform. |
| **Integration** | Must have the dbt Semantic Layer configured in your dbt project.<br /><br />Tightly integrated with the [MetricFlow Server](/docs/use-dbt-semantic-layer/sl-architecture#components) and dbt Cloud's job scheduler. | Integrated into the dbt <Term id="dag" /> and managed alongside other dbt nodes. |
| **Configuration** | Configured within dbt Cloud environment and job scheduler settings. | Defined in YAML format within dbt project files. |
| **Configuration** | Defined within the `saved_queries` configuration. Set up within the dbt Cloud environment and job scheduler settings. | Defined in YAML format within dbt project files. |

All metrics in a saved query need to use the same dimensions in the `group_by` or `where` clauses. The following is an example of a saved query:

@@ -51,13 +51,13 @@ To define a saved query, refer to the following parameters:
| `query_params` | Structure | Required | Contains the query parameters. |
| `query_params::metrics` | List or String | Optional | A list of the metrics to be used in the query as specified in the command line interface. |
| `query_params::group_bys` | List or String | Optional | A list of the Entities and Dimensions to be used in the query, which include the `Dimension` or `TimeDimension`. |
| `query_params::where` | LList or String or String | Optional | A list of string which may include the `Dimension` or `TimeDimension` objects. |
| `query_params::where` | List or String | Optional | A list of strings, which may include the `Dimension` or `TimeDimension` objects. |
| `exports` | List or Structure | Optional | A list of exports to be specified with the exports structure. |
| `exports::name` | String | Required | Name of the export object. |
| `exports::config` | List or Structure | Required | A config section for any parameters specifying the export. |
| `exports::config::export_as` | String | Required | The type of export to run. Options include table or view currently and cache in the near future. |
| `exports::config` | String | Optional | The schema used for creating the table or view. This option cannot be used for caching. |
| `exports::config` | String | Optional | The table alias to use to write the table or view. This option cannot be used for caching. |
| `exports::config::schema` | String | Optional | The schema used for creating the table or view. This option cannot be used for caching. |
| `exports::config::alias` | String | Optional | The table alias to use to write the table or view. This option cannot be used for caching. |

All metrics in a saved query need to use the same dimensions in the `group_by` or `where` clauses.
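
As an illustrative sketch only, the export parameters in the table above might be combined as follows. The names `new_customer_orders`, `orders`, and `customer_schema` are borrowed from the sample `--show-exports` output in the MetricFlow commands change in this commit, and the `orders` metric is a hypothetical placeholder rather than something defined here:

```yaml
saved_queries:
  - name: new_customer_orders
    query_params:
      metrics:
        - orders                    # hypothetical metric name
    exports:
      - name: new_customer_orders
        config:
          export_as: table          # required: table or view
          schema: customer_schema   # optional: schema to write into
          alias: orders             # optional: name of the written table
```

Because `schema` and `alias` are optional, leaving them out should fall back to default naming for the export.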
