Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing company language to abide by request from legal #4875

Merged
merged 4 commits into from
Feb 8, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 14 additions & 14 deletions website/blog/2023-01-24-aggregating-test-failures.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,9 +16,9 @@ Testing the quality of data in your warehouse is an important aspect in any matu

<!--truncate-->

At [Tempus](https://www.tempus.com/), a precision medicine company specializing in oncology, high quality data is a necessary component for high quality clinical models. With roughly 1,000 dbt models, nearly a hundred data sources, and a dozen different data quality stakeholders, producing a framework that allows stakeholders to take action on test failures is challenging. Without an actionable framework, data quality tests can backfire — in early 2022, we had nearly a thousand tests, hundreds of which failed on a daily basis yet were wholly ignored.
Producing a data quality framework that allows stakeholders to take action on test failures is challenging. Without an actionable framework, data quality tests can backfire — one failing test becomes two becomes ten and suddenly you have too many test failures to act on any of them.

Recently, we overhauled our testing framework. We cut the number of tests down to 200, creating a more mature framework that includes metadata and emphasizes actionability. Our system for managing data quality is a three step process, described below:
Recently, we overhauled our testing framework. We cut the number of tests down by 80% to create a more mature framework that includes metadata and emphasizes actionability. Our system for managing data quality is a three step process, described below:

1. Leveraging the contextual knowledge of stakeholders, writing specific, high quality data tests, perpetuating test failure results into aliased models for easy access.
1. Aggregating test failure results using Jinja macros and pre-configured metadata to pull together high level summary tables.
Expand All @@ -37,35 +37,35 @@ Data Integrity tests (Generic Tests)  are simple — they’re tests akin to a
```yaml
version: 2
models:
- name: patient
- name: customer
columns:
- name: id
description: Unique ID associated with the record
tests:
- unique:
alias: patient__id__unique
alias: id__unique
- not_null:
alias: patient__id__not_null
alias: id__not_null
```
<center><i>Example Data Integrity Tests in a YAML file — the alias argument is an important piece that will be touched on later.</i></center><br />

Context Driven Tests are more complex and look a lot more like models. Essentially, they’re data models that select bad data or records we don’t want, defined as SQL files that live in the `dbt/tests` directory. An example is shown below —

```sql
{{ config(
tags=['check_birth_date_in_range', 'patient'],
alias='ad_hoc__check_birth-date_in_range'
tags=['check_purchase_date_in_range', 'customer'],
alias='ad_hoc__check_purchase_date_in_range
)
}}

SELECT
id,
birth_date
purchase_date
FROM
{{ ref('patient') }}
WHERE birth_date < '1900-01-01'
{{ ref('customer') }}
WHERE purchase_date < '1900-01-01'
```
<center><i>The above test selects all patients with a birth date before 1900, due to data rules we have about maximum patient age.</i></center><br />
<center><i>The above test selects all customers who have made a purchase before 1900. The idea is that any customer that exists before 1900 probably isn't real.</i></center><br />

Importantly, we leverage [Test Aliasing](https://docs.getdbt.com/reference/resource-configs/alias) to ensure that our tests all follow a standard and predictable naming convention; our naming convention for Data Integrity tests is *table_name_ _column_name__test_name*, and our naming convention for Context Driven Tests is *ad_hoc__test_name*. Finally, to ensure all of our tests can then be aggregated, we modify the `dbt_project.yml` file  and [set the `store_failures` tag to ‘TRUE’](https://docs.getdbt.com/reference/resource-configs/store_failures), thus persisting test failures into SQL tables.

Expand All @@ -86,15 +86,15 @@ After defining our metadata Seed file, we begin the process of aggregating our d
incremental_strategy = 'merge',
unique_key='row_key',
full_refresh=false,
tags=['dq_test_warning_failures','clinical_mart', 'data_health']
tags=['dq_test_warning_failures','customer_mart', 'data_health']
)
}}

WITH failures as (
SELECT
count(*) as test_failures,
_TABLE_SUFFIX as table_suffix,
FROM {{ var('clinical_mart_schema') }}_dbt_test__audit.`*`
FROM {{ var('customer_mart_schema') }}_dbt_test__audit.`*`
GROUP BY _TABLE_SUFFIX
),

Expand Down Expand Up @@ -131,4 +131,4 @@ With our finalized data quality base table, there are many other options for cle

First, we create views on top of the base table that filter down by test owner. We strongly believe that test noise is the biggest risk towards the success of a quality framework. Creating specific views is like giving each team a magnifying glass that lets them zoom into only the tests they care about. We also have a dashboard, currently in Google Looker Studio, that shows historical test failures with a suite of filters to let users magnify high severity tests and constructs machine-composed example queries for users to select failing records. When a test fails, a business analyst can copy and paste a query from the dashboard and get all the relevant information.

As with any framework, it’s always a work in progress — we still encounter issues with noise in our tests, and still struggle to wrangle our users to care when a test fails. However, we’ve found that this data framework works exceptionally well at enabling data users to create and deploy their own tests. All they need to do is submit a pull request with SQL code that flags bad data, and write one line of metadata.
As with any framework, it’s always a work in progress — we still encounter issues with noise in our tests, and still struggle to wrangle our users to care when a test fails. However, we’ve found that this data framework works exceptionally well at enabling data users to create and deploy their own tests. All they need to do is submit a pull request with SQL code that flags bad data, and write one line of metadata.
Loading