Macros that help with fuzzy text matching, with the aim of keeping dbt models agnostic of data warehouses.
Current coverage:
Algorithm | Snowflake | BigQuery | Azure Synapse (Dedicated SQL Pool) |
---|---|---|---|
Edit distance based | |||
Levenshtein Distance | ✔️ | ✔️ | ✔️ |
Jaro-Winkler Similarity | ✔️ | ❌ | ❌ |
New to dbt packages? Read more about them here.
- Include this package in your
packages.yml
file — check here for the latest version number:
packages:
- package: calum-mcg/fuzzy_text
version: X.X.X ## update to latest version here
- Run
dbt deps
to install the package.
levenshtein_distance (source)
This macro generates the levenshtein distance between two strings.
str1
(required): First string to comparestr2
(required): Second string to comparemax
(optional, default=none): Maximum distance to compute (integer)
Copy the macro into a statement tab in the dbt Cloud IDE, or into a model, and compile your code
... {{ fuzzy_text.levenshtein_distance('input_string_column', 'comparison_string_column') }} as levenshtein_distance ...
jaro_winkler (source)
This macro generates the Jaro-Winkler similarity between two strings.
str1
(required): First string to comparestr2
(required): Second string to compare
Copy the macro into a statement tab in the dbt Cloud IDE, or into a model, and compile your code
... {{ fuzzy_text.jaro_winkler('input_string_column', 'comparison_string_column') }} as jaro_winkler ...
Pull requests are the best way to propose changes to the codebase. Steps required:
- Create an issue in the repo with a description of the problem / bug / improvement required
- Clone the
main
branch with a suitable branch name, e.g.feature/add-cool-thing
- Add tests for supported adaptors in the
integration_tests
folder - If required, update the README documentation to include usage and an example.
- Issue a pull request, provide:
- a description of changes
- add a reviewer
- reference original issue (from step 1)