Data reliability tool for profiling and testing your data
PipeRider is a CLI tool for building data profiles and writing assertion tests, so you can evaluate and track your data's reliability over time.
- Profile Your Data to explore/understand what kind of dataset you're dealing with e.g. completeness, duplicates, missing values, distributions
- Test Your Data to verify that your data is within acceptable range and formatted correctly
- Observe & Monitor Your Data to keep an eye on how that data changes over time
- SQL-based (additionally supports CSV)
- Data Profiling Characteristics
  - Provides rich data profiling metrics
  - e.g. missing, uniqueness, duplicate_rows, quantiles, histogram
- Test datasets with a mix of custom and built-in assertion definitions
- Auto-generates recommended assertions based on your single-run profiles
- Generates single-run reports to visualize your data profile and assertion test results (example)
- Generates comparison reports to visualize how your data has changed over time (example)
- Supported Datasources: Snowflake, BigQuery, Redshift, Postgres, SQLite, DuckDB, CSV, Parquet.
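To make the profiling metrics listed above more concrete, here is a minimal Python sketch of how such metrics could be computed for a single column. It is an illustration only, not PipeRider's implementation: the function name `profile_column` and the result keys are assumptions, and duplicates are counted per column here, whereas `duplicate_rows` refers to whole rows.

```python
from collections import Counter
from statistics import quantiles


def profile_column(values):
    """Compute a few illustrative profiling metrics for one column.

    `values` is a list in which None marks a missing value.
    The key names below are hypothetical, not PipeRider's schema.
    """
    total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "total": total,
        "missing": total - len(present),
        "distinct": len(counts),
        # values that occur exactly once among the non-missing values
        "unique": sum(1 for c in counts.values() if c == 1),
        # non-missing entries whose value occurs more than once
        "duplicates": sum(c for c in counts.values() if c > 1),
        # three cut points splitting the data into four groups
        "quartiles": quantiles(present, n=4) if len(present) >= 2 else [],
    }


ages = [34, 29, None, 41, 29, 52, None, 38]
profile = profile_column(ages)
print(profile)
```

A real profiler would compute these inside the database with SQL rather than in Python; the sketch only shows what each metric measures.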
This repo is a fork of the piperider package that adds a reconcile feature.
Create a GitHub personal access token, then use it to install the package:
pip install "piperider @ git+https://${GITHUB_USER}:${GITHUB_TOKEN}@github.com/chenxuanrong/piperider@domain"
# To install database drivers
pip install "piperider[snowflake] @ git+https://${GITHUB_USER}:${GITHUB_TOKEN}@github.com/chenxuanrong/piperider@domain"
By default, PipeRider ships with a built-in SQLite connector. Extra connectors are available (GITHUB_REPO stands for the git+https URL used above):
connector | install |
---|---|
snowflake | pip install 'piperider[snowflake] @GITHUB_REPO' |
postgres | pip install 'piperider[postgres] @GITHUB_REPO' |
bigquery | pip install 'piperider[bigquery] @GITHUB_REPO' |
redshift | pip install 'piperider[redshift] @GITHUB_REPO' |
parquet | pip install 'piperider[parquet] @GITHUB_REPO' |
csv | pip install 'piperider[csv] @GITHUB_REPO' |
duckdb | pip install 'piperider[duckdb] @GITHUB_REPO' |
Use a comma to install multiple connectors in one line:
pip install 'piperider[postgres,snowflake] @GITHUB_REPO'
Once installed, initialize a new project and verify it with the following commands.
piperider init # initializes project config
piperider diagnose # verifies your data source connection & project config
Next, execute piperider run, which will do a number of things:
- Create a single-run profile of your data source
- Auto-generate recommended or template assertions files (first-run only)
- Test that single-run profile against any available assertions, including custom and/or recommended assertions
- Generate a static HTML report, which helps visualize the single-run profile and its assertion results.
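The "test the profile against assertions" step above can be pictured as checking profiled metrics against acceptable ranges. The following Python sketch shows that idea under stated assumptions: the `check_range` helper, the metric names, and the profile dictionary are all hypothetical and do not reflect PipeRider's assertion format or internals.

```python
def check_range(profile, metric, low, high):
    """Return a result dict saying whether profile[metric] lies in [low, high]."""
    actual = profile[metric]
    return {
        "metric": metric,
        "expected": [low, high],
        "actual": actual,
        "passed": low <= actual <= high,
    }


# Hypothetical single-run profile for one column (not PipeRider's real schema).
column_profile = {"missing_ratio": 0.02, "min": 0, "max": 120}

results = [
    check_range(column_profile, "missing_ratio", 0.0, 0.05),
    check_range(column_profile, "max", 0, 100),
]

for r in results:
    status = "PASS" if r["passed"] else "FAIL"
    print(f"{status} {r['metric']}: {r['actual']} (expected within {r['expected']})")
```

Here the first check passes and the second fails, because the profiled maximum of 120 lies outside the accepted range of 0 to 100.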
Common Usages/Tips:
piperider run # profile all tables in the data source.
piperider run --table $TABLENAME # profile a specific table
piperider generate-report -o $PATHNAME # Specify the output location of the generated report
piperider generate-assertions # To re-generate the recommended assertions after the first-run
With at least two runs completed, you can then run piperider compare-reports, which will generate a comparison report that presents the changes between them (e.g. schema changes, column renaming, distributions).
Common Usages/Tips:
piperider compare-reports --last # Compare the last two reports automatically
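As a rough picture of what a comparison between two runs involves, the sketch below diffs the schemas captured by two runs. It assumes each run can be reduced to a mapping of column name to type; the `diff_schemas` function and the sample data are hypothetical and are not how PipeRider stores or compares its reports.

```python
def diff_schemas(base, target):
    """Compare two {column_name: type} mappings and report the differences."""
    shared = set(base) & set(target)
    return {
        "added": sorted(set(target) - set(base)),
        "removed": sorted(set(base) - set(target)),
        "type_changed": sorted(c for c in shared if base[c] != target[c]),
    }


# Hypothetical schemas captured by two separate runs.
run_1 = {"id": "integer", "name": "string", "price": "integer"}
run_2 = {"id": "integer", "name": "string", "price": "numeric", "currency": "string"}

changes = diff_schemas(run_1, run_2)
print(changes)
```

A renamed column would show up in such a diff as one removed column plus one added column; telling a rename apart from an unrelated drop and add needs extra information such as matching distributions.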
For more details on the generated report, see the doc
See Generated Single-Run Report
Create a file .piperider/reconcile.yml after initialisation.
A sample reconcile declaration:
Reconciles:
  - name: migration
    description: This project reconciles v1 and v2 migration
    base_source: v1
    target_source: v2
    suites:
      - name: address_table
        description: Compare property address table in v1 and v2
        base:
          table: property_address
          join_key: address_id
        target:
          table: address
          join_key: id
        rules:
          - name: street_name
            description: Compare street name
            base_column: street_name
            target_column: street_name
          - name: postcode
            description: Compare postcode
            base_column: postcode
            target_column: postcode
      - name: <another_suite>
  - name: <second_project>
  - name: <third_project>
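To clarify what a suite such as address_table in the declaration above expresses, here is a minimal Python sketch of the underlying idea: join base and target rows on their join keys, then compare the paired columns named by each rule. This is an assumption about the general technique, not the fork's actual implementation; the `reconcile` function, the report keys, and the sample rows are all hypothetical.

```python
def reconcile(base_rows, target_rows, base_key, target_key, rules):
    """Join base and target rows on their keys and compare the ruled columns.

    Returns, per rule, how many joined rows match or differ, plus the
    number of keys that exist on only one side.
    """
    base_by_key = {row[base_key]: row for row in base_rows}
    target_by_key = {row[target_key]: row for row in target_rows}
    common = base_by_key.keys() & target_by_key.keys()

    report = {
        "only_in_base": len(base_by_key.keys() - common),
        "only_in_target": len(target_by_key.keys() - common),
        "rules": {},
    }
    for rule in rules:
        equal = sum(
            1
            for key in common
            if base_by_key[key][rule["base_column"]]
            == target_by_key[key][rule["target_column"]]
        )
        report["rules"][rule["name"]] = {
            "equal": equal,
            "not_equal": len(common) - equal,
        }
    return report


# Illustrative rows standing in for the v1 and v2 tables.
property_address = [
    {"address_id": "a1", "street_name": "High St", "postcode": "2000"},
    {"address_id": "a2", "street_name": "King St", "postcode": "3000"},
    {"address_id": "a3", "street_name": "Queen St", "postcode": "4000"},
]
address = [
    {"id": "a1", "street_name": "High St", "postcode": "2000"},
    {"id": "a2", "street_name": "King Street", "postcode": "3000"},
    {"id": "a4", "street_name": "George St", "postcode": "5000"},
]
rules = [
    {"name": "street_name", "base_column": "street_name", "target_column": "street_name"},
    {"name": "postcode", "base_column": "postcode", "target_column": "postcode"},
]

report = reconcile(property_address, address, "address_id", "id", rules)
print(report)
```

In this sample, two keys join, one street name differs, both postcodes match, and one key is left unmatched on each side.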
Common Usage
piperider reconcile
piperider reconcile --project migration
The result is saved to .piperider/reconciles/latest/reconcile.json
Notes:
- base_source and target_source are defined in config.yml and credential.yml
Limitations:
- join_key only supports a single string for now. If multiple columns are required to join the base and target tables, use string concatenation. A list-of-columns configuration will be supported in the future.
- By default, the first project defined in reconcile.yml is executed. To select a different project, pass --project <name>.
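The string concatenation workaround for multi-column joins can be illustrated as follows. This Python sketch only demonstrates the idea of a composite key and why a separator is worth including; the `composite_key` helper and the column names are hypothetical, and it does not show how the concatenated column should be produced for PipeRider.

```python
def composite_key(row, columns, sep="|"):
    """Concatenate several column values into a single string key.

    The separator keeps distinct value pairs such as ("12", "3000") and
    ("1", "23000") from collapsing into the same key.
    """
    return sep.join(str(row[c]) for c in columns)


row_a = {"street_no": "12", "postcode": "3000"}
row_b = {"street_no": "1", "postcode": "23000"}
columns = ["street_no", "postcode"]

key_a = composite_key(row_a, columns)
key_b = composite_key(row_b, columns)
print(key_a, key_b)

# Without a separator the two different rows would share one key.
print(composite_key(row_a, columns, sep="") == composite_key(row_b, columns, sep=""))
```

Whatever separator is chosen, it should be a character that cannot occur in the key columns themselves, and the same expression must be used on both the base and target sides.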
See setup dev environment and the contributing guidelines to get started.