diff --git a/README.md b/README.md
index 6fbf9c1..b6bb1ae 100644
--- a/README.md
+++ b/README.md
@@ -34,15 +34,15 @@ Users are welcomed, however, to utilize their own Synthea and/or OMOP vocabulary
 source dbt-env/bin/activate # activate the environment for Mac and Linux
 OR
 dbt-env\Scripts\activate # activate the environment for Windows
 ```
- 4. In your virtual environment, install dbt and other required dependencies as follows:
+
+### DuckDB Setup
+ 1. In your virtual environment, install the requirements for DuckDB (see [here for contents](./requirements/duckdb.in)):
 ```bash
-pip3 install -r requirements.txt
+pip3 install -r requirements/duckdb.txt
 pre-commit install
 ```
- - This will install dbt-core, the dbt duckdb and postgres adapters, SQLFluff (a SQL linter), pre-commit (in order to run SQLFluff on all newly-committed code in this repo), duckdb (to support bootstrapping scripts), and various dependencies for the listed packages
-### DuckDB Setup
- 1. Set up your [profiles.yml file](https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml):
+ 2. Set up your [profiles.yml file](https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml):
 - Create a directory `.dbt` in your root directory if one doesn't exist already, then create a `profiles.yml` file in `.dbt`
 - Add the following block to the file:
 ```yaml
@@ -55,23 +55,23 @@ synthea_omop_etl:
   target: dev
 ```
- 2. Ensure your profile is setup correctly using dbt debug:
+ 3. Ensure your profile is set up correctly using `dbt debug`:
 ```bash
 dbt debug
 ```

- 3. Load dbt dependencies:
+ 4. Load dbt dependencies:
 ```bash
 dbt deps
 ```

- 4. **If you'd like to run the default ETL using the pre-seeded Synthea dataset,** run `dbt seed` to load the CSVs with the Synthea dataset and vocabulary data. This materializes the seed CSVs as tables in your target schema (vocab) and a _synthea schema (Synthea tables). **Then, skip to step 9 below.**
+ 5. **If you'd like to run the default ETL using the pre-seeded Synthea dataset,** run `dbt seed` to load the CSVs with the Synthea dataset and vocabulary data. This materializes the seed CSVs as tables in your target schema (vocab) and a _synthea schema (Synthea tables). **Then, skip to step 9 below.**
 ```bash
 dbt seed
 ```

- 5. **If you'd like to run the ETL on your own Synthea dataset,** first toggle the `seed_source` variable in `dbt_project.yml` to `false`. This will tell dbt not to look for the source data in the seed schemas.
+ 6. **If you'd like to run the ETL on your own Synthea dataset,** first toggle the `seed_source` variable in `dbt_project.yml` to `false`. This will tell dbt not to look for the source data in the seed schemas.

- 6. **[BYO DATA ONLY]** Load your Synthea and Vocabulary data into the database by running the following commands (modify the commands as needed to specify the path to the folder storing the Synthea and vocabulary csv files, respectively). The vocabulary tables will be created in the target schema specified in your profiles.yml for the profile you are targeting. The Synthea tables will be created in a schema named "_synthea". **NOTE only Synthea v3.0.0 is supported at this time.**
+ 7. **[BYO DATA ONLY]** Load your Synthea and vocabulary data into the database by running the following commands (modify them as needed to point to the folders storing the Synthea and vocabulary CSV files, respectively). The vocabulary tables will be created in the target schema specified in your profiles.yml for the profile you are targeting. The Synthea tables will be created in a schema named "_synthea". **NOTE: only Synthea v3.0.0 is supported at this time.**
 ``` bash
 file_dict=$(python3 scripts/python/get_csv_filepaths.py path/to/synthea/csvs)
 dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: false}"
@@ -79,21 +79,26 @@ file_dict=$(python3 scripts/python/get_csv_filepaths.py path/to/vocab/csvs)
 dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: true}"
 ```

- 7. Seed the location mapper and currently unused empty OMOP tables:
+ 8. Seed the location mapper and currently unused empty OMOP tables:
 ```bash
 dbt seed --select states omop
 ```

- 8. Build the OMOP tables:
+ 9. Build the OMOP tables:
 ```bash
 dbt build # or `dbt run`, `dbt test`
 ```

 ### Postgres Setup

- 1. Set up a local Postgres database with a dedicated schema for developing this project (e.g. `dbt_synthea_dev`)
+ 1. In your virtual environment, install the requirements for Postgres (see [here for contents](./requirements/postgres.in)):
+```bash
+pip3 install -r requirements/postgres.txt
+pre-commit install
+```
+ 2. Set up a local Postgres database with a dedicated schema for developing this project (e.g. `dbt_synthea_dev`)

- 2. Set up your [profiles.yml file](https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml):
+ 3. Set up your [profiles.yml file](https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml):
 - Create a directory `.dbt` in your root directory if one doesn't exist already, then create a `profiles.yml` file in `.dbt`
 - Add the following block to the file:
 ```yaml
@@ -111,37 +116,37 @@ synthea_omop_etl:
   target: dev
 ```

- 3. Ensure your profile is setup correctly using dbt debug:
+ 4. Ensure your profile is set up correctly using `dbt debug`:
 ```bash
 dbt debug
 ```

- 4. Load dbt dependencies:
+ 5. Load dbt dependencies:
 ```bash
 dbt deps
 ```

- 5. **If you'd like to run the default ETL using the pre-seeded Synthea dataset,** run `dbt seed` to load the CSVs with the Synthea dataset and vocabulary data. This materializes the seed CSVs as tables in your target schema (vocab) and a _synthea schema (Synthea tables). **Then, skip to step 10 below.**
+ 6. **If you'd like to run the default ETL using the pre-seeded Synthea dataset,** run `dbt seed` to load the CSVs with the Synthea dataset and vocabulary data. This materializes the seed CSVs as tables in your target schema (vocab) and a _synthea schema (Synthea tables). **Then, skip to step 11 below.**
 ```bash
 dbt seed
 ```

- 6. **If you'd like to run the ETL on your own Synthea dataset,** first toggle the `seed_source` variable in `dbt_project.yml` to `false`. This will tell dbt not to look for the source data in the seed schemas.
+ 7. **If you'd like to run the ETL on your own Synthea dataset,** first toggle the `seed_source` variable in `dbt_project.yml` to `false`. This will tell dbt not to look for the source data in the seed schemas.

- 7. **[BYO DATA ONLY]** Create the empty vocabulary and Synthea tables by running the following commands. The vocabulary tables will be created in the target schema specified in your profiles.yml for the profile you are targeting. The Synthea tables will be created in a schema named "_synthea".
+ 8. **[BYO DATA ONLY]** Create the empty vocabulary and Synthea tables by running the following commands. The vocabulary tables will be created in the target schema specified in your profiles.yml for the profile you are targeting. The Synthea tables will be created in a schema named "_synthea".
 ``` bash
 dbt run-operation create_vocab_tables
 dbt run-operation create_synthea_tables
 ```

- 8. **[BYO DATA ONLY]** Use the technology/package of your choice to load the OMOP vocabulary and raw Synthea files into these newly-created tables. **NOTE only Synthea v3.0.0 is supported at this time.**
+ 9. **[BYO DATA ONLY]** Use the technology/package of your choice to load the OMOP vocabulary and raw Synthea files into these newly-created tables. **NOTE: only Synthea v3.0.0 is supported at this time.**

- 9. Seed the location mapper and currently unused empty OMOP tables:
+ 10. Seed the location mapper and currently unused empty OMOP tables:
 ```bash
 dbt seed --select states omop
 ```

- 10. Build the OMOP tables:
+ 11. Build the OMOP tables:
 ```bash
 dbt build # or `dbt run`, `dbt test`
 ```
diff --git a/requirements.txt b/requirements.txt
deleted file mode 100644
index 59f2262..0000000
--- a/requirements.txt
+++ /dev/null
@@ -1,63 +0,0 @@
-agate==1.7.1
-annotated-types==0.7.0
-attrs==23.2.0
-Babel==2.15.0
-certifi==2024.6.2
-cffi==1.16.0
-cfgv==3.4.0
-charset-normalizer==3.3.2
-click==8.1.7
-colorama==0.4.6
-daff==1.3.46
-dbt-adapters==1.2.1
-dbt-common==1.3.0
-dbt-core==1.8.2
-dbt-duckdb==1.8.1
-dbt-extractor==0.5.1
-dbt-postgres==1.8.1
-dbt-semantic-interfaces==0.5.1
-distlib==0.3.8
-duckdb==0.10.2
-filelock==3.15.1
-identify==2.5.36
-idna==3.7
-importlib-metadata==6.11.0
-isodate==0.6.1
-Jinja2==3.1.4
-jsonschema==4.22.0
-jsonschema-specifications==2023.12.1
-leather==0.4.0
-Logbook==1.5.3
-MarkupSafe==2.1.5
-mashumaro==3.13
-minimal-snowplow-tracker==0.0.2
-more-itertools==10.3.0
-msgpack==1.0.8
-networkx==3.3
-nodeenv==1.9.1
-packaging==24.1
-parsedatetime==2.6
-pathspec==0.11.2
-platformdirs==4.2.2
-pre-commit==3.7.1
-protobuf==4.25.3
-psycopg2-binary==2.9.9
-pycparser==2.22
-pydantic==2.7.4
-pydantic_core==2.18.4
-python-dateutil==2.9.0.post0
-python-slugify==8.0.4
-pytimeparse==1.1.8
-pytz==2024.1
-PyYAML==6.0.1
-referencing==0.35.1
-requests==2.32.3
-rpds-py==0.18.1
-setuptools==70.0.0
-six==1.16.0
-sqlparse==0.5.0
-text-unidecode==1.3
-typing_extensions==4.12.2
-urllib3==1.26.18
-virtualenv==20.26.2
-zipp==3.19.2
diff --git a/requirements/common.in b/requirements/common.in
new file mode 100644
index 0000000..38fa754
--- /dev/null
+++ b/requirements/common.in
@@ -0,0 +1,4 @@
+pre-commit==3.8
+black==24.8
+sqlfluff==3.2
+sqlfluff-templater-dbt==3.2
\ No newline at end of file
diff --git a/requirements/duckdb.in b/requirements/duckdb.in
new file mode 100644
index 0000000..5d54ef1
--- /dev/null
+++ b/requirements/duckdb.in
@@ -0,0 +1,3 @@
+-r common.in
+
+dbt-duckdb==1.8
diff --git a/requirements/duckdb.txt b/requirements/duckdb.txt
new file mode 100644
index 0000000..f78a979
--- /dev/null
+++ b/requirements/duckdb.txt
@@ -0,0 +1,226 @@
+# This file was autogenerated by uv via the following command:
+#    uv pip compile duckdb.in
+agate==1.9.1
+    # via
+    #   dbt-adapters
+    #   dbt-common
+    #   dbt-core
+annotated-types==0.7.0
+    # via pydantic
+appdirs==1.4.4
+    # via sqlfluff
+attrs==24.2.0
+    # via
+    #   jsonschema
+    #   referencing
+babel==2.16.0
+    # via agate
+black==24.8.0
+    # via -r tools.in
+certifi==2024.8.30
+    # via requests
+cfgv==3.4.0
+    # via pre-commit
+chardet==5.2.0
+    # via
+    #   diff-cover
+    #   sqlfluff
+charset-normalizer==3.3.2
+    # via requests
+click==8.1.7
+    # via
+    #   black
+    #   dbt-core
+    #   dbt-semantic-interfaces
+    #   sqlfluff
+colorama==0.4.6
+    # via
+    #   dbt-common
+    #   sqlfluff
+daff==1.3.46
+    # via dbt-core
+dbt-adapters==1.7.0
+    # via
+    #   dbt-core
+    #   dbt-duckdb
+dbt-common==1.10.0
+    # via
+    #   dbt-adapters
+    #   dbt-core
+    #   dbt-duckdb
+dbt-core==1.8.7
+    # via
+    #   dbt-duckdb
+    #   sqlfluff-templater-dbt
+dbt-duckdb==1.8.0
+    # via -r duckdb.in
+dbt-extractor==0.5.1
+    # via dbt-core
+dbt-semantic-interfaces==0.5.1
+    # via dbt-core
+deepdiff==7.0.1
+    # via dbt-common
+diff-cover==9.2.0
+    # via sqlfluff
+distlib==0.3.8
+    # via virtualenv
+duckdb==1.1.1
+    # via dbt-duckdb
+filelock==3.16.1
+    # via virtualenv
+identify==2.6.1
+    # via pre-commit
+idna==3.10
+    # via requests
+importlib-metadata==6.11.0
+    # via dbt-semantic-interfaces
+iniconfig==2.0.0
+    # via pytest
+isodate==0.6.1
+    # via
+    #   agate
+    #   dbt-common
+jinja2==3.1.4
+    # via
+    #   dbt-common
+    #   dbt-core
+    #   dbt-semantic-interfaces
+    #   diff-cover
+    #   jinja2-simple-tags
+    #   sqlfluff
+jinja2-simple-tags==0.6.1
+    # via sqlfluff-templater-dbt
+jsonschema==4.23.0
+    # via
+    #   dbt-common
+    #   dbt-semantic-interfaces
+jsonschema-specifications==2023.12.1
+    # via jsonschema
+leather==0.4.0
+    # via agate
+logbook==1.5.3
+    # via dbt-core
+markupsafe==2.1.5
+    # via jinja2
+mashumaro==3.13.1
+    # via
+    #   dbt-adapters
+    #   dbt-common
+    #   dbt-core
+minimal-snowplow-tracker==0.0.2
+    # via dbt-core
+more-itertools==10.5.0
+    # via dbt-semantic-interfaces
+msgpack==1.1.0
+    # via mashumaro
+mypy-extensions==1.0.0
+    # via black
+networkx==3.3
+    # via dbt-core
+nodeenv==1.9.1
+    # via pre-commit
+ordered-set==4.1.0
+    # via deepdiff
+packaging==24.1
+    # via
+    #   black
+    #   dbt-core
+    #   pytest
+parsedatetime==2.6
+    # via agate
+pathspec==0.12.1
+    # via
+    #   black
+    #   dbt-common
+    #   dbt-core
+    #   sqlfluff
+platformdirs==4.3.6
+    # via
+    #   black
+    #   virtualenv
+pluggy==1.5.0
+    # via
+    #   diff-cover
+    #   pytest
+pre-commit==3.8.0
+    # via -r tools.in
+protobuf==4.25.5
+    # via
+    #   dbt-adapters
+    #   dbt-common
+    #   dbt-core
+pydantic==2.9.2
+    # via dbt-semantic-interfaces
+pydantic-core==2.23.4
+    # via pydantic
+pygments==2.18.0
+    # via diff-cover
+pytest==8.3.3
+    # via sqlfluff
+python-dateutil==2.9.0.post0
+    # via
+    #   dbt-common
+    #   dbt-semantic-interfaces
+python-slugify==8.0.4
+    # via agate
+pytimeparse==1.1.8
+    # via agate
+pytz==2024.2
+    # via
+    #   dbt-adapters
+    #   dbt-core
+pyyaml==6.0.2
+    # via
+    #   dbt-core
+    #   dbt-semantic-interfaces
+    #   pre-commit
+    #   sqlfluff
+referencing==0.35.1
+    # via
+    #   jsonschema
+    #   jsonschema-specifications
+regex==2024.9.11
+    # via sqlfluff
+requests==2.32.3
+    # via
+    #   dbt-common
+    #   dbt-core
+    #   minimal-snowplow-tracker
+rpds-py==0.20.0
+    # via
+    #   jsonschema
+    #   referencing
+six==1.16.0
+    # via
+    #   isodate
+    #   minimal-snowplow-tracker
+    #   python-dateutil
+sqlfluff==3.2.0
+    # via
+    #   -r tools.in
+    #   sqlfluff-templater-dbt
+sqlfluff-templater-dbt==3.2.0
+    # via -r tools.in
+sqlparse==0.5.1
+    # via dbt-core
+tblib==3.0.0
+    # via sqlfluff
+text-unidecode==1.3
+    # via python-slugify
+tqdm==4.66.5
+    # via sqlfluff
+typing-extensions==4.12.2
+    # via
+    #   dbt-adapters
+    #   dbt-common
+    #   dbt-core
+    #   dbt-semantic-interfaces
+    #   mashumaro
+    #   pydantic
+    #   pydantic-core
+urllib3==2.2.3
+    # via requests
+virtualenv==20.26.6
+    # via pre-commit
+zipp==3.20.2
+    # via importlib-metadata
diff --git a/requirements/postgres.in b/requirements/postgres.in
new file mode 100644
index 0000000..0080085
--- /dev/null
+++ b/requirements/postgres.in
@@ -0,0 +1,3 @@
+-r common.in
+
+dbt-postgres==1.8
diff --git a/requirements/postgres.txt b/requirements/postgres.txt
new file mode 100644
index 0000000..4800d41
--- /dev/null
+++ b/requirements/postgres.txt
@@ -0,0 +1,227 @@
+# This file was autogenerated by uv via the following command:
+#    uv pip compile postgres.in
+agate==1.9.1
+    # via
+    #   dbt-adapters
+    #   dbt-common
+    #   dbt-core
+    #   dbt-postgres
+annotated-types==0.7.0
+    # via pydantic
+appdirs==1.4.4
+    # via sqlfluff
+attrs==24.2.0
+    # via
+    #   jsonschema
+    #   referencing
+babel==2.16.0
+    # via agate
+black==24.8.0
+    # via -r tools.in
+certifi==2024.8.30
+    # via requests
+cfgv==3.4.0
+    # via pre-commit
+chardet==5.2.0
+    # via
+    #   diff-cover
+    #   sqlfluff
+charset-normalizer==3.3.2
+    # via requests
+click==8.1.7
+    # via
+    #   black
+    #   dbt-core
+    #   dbt-semantic-interfaces
+    #   sqlfluff
+colorama==0.4.6
+    # via
+    #   dbt-common
+    #   sqlfluff
+daff==1.3.46
+    # via dbt-core
+dbt-adapters==1.7.0
+    # via
+    #   dbt-core
+    #   dbt-postgres
+dbt-common==1.10.0
+    # via
+    #   dbt-adapters
+    #   dbt-core
+    #   dbt-postgres
+dbt-core==1.8.7
+    # via
+    #   dbt-postgres
+    #   sqlfluff-templater-dbt
+dbt-extractor==0.5.1
+    # via dbt-core
+dbt-postgres==1.8.0
+    # via -r postgres.in
+dbt-semantic-interfaces==0.5.1
+    # via dbt-core
+deepdiff==7.0.1
+    # via dbt-common
+diff-cover==9.2.0
+    # via sqlfluff
+distlib==0.3.8
+    # via virtualenv
+filelock==3.16.1
+    # via virtualenv
+identify==2.6.1
+    # via pre-commit
+idna==3.10
+    # via requests
+importlib-metadata==6.11.0
+    # via dbt-semantic-interfaces
+iniconfig==2.0.0
+    # via pytest
+isodate==0.6.1
+    # via
+    #   agate
+    #   dbt-common
+jinja2==3.1.4
+    # via
+    #   dbt-common
+    #   dbt-core
+    #   dbt-semantic-interfaces
+    #   diff-cover
+    #   jinja2-simple-tags
+    #   sqlfluff
+jinja2-simple-tags==0.6.1
+    # via sqlfluff-templater-dbt
+jsonschema==4.23.0
+    # via
+    #   dbt-common
+    #   dbt-semantic-interfaces
+jsonschema-specifications==2023.12.1
+    # via jsonschema
+leather==0.4.0
+    # via agate
+logbook==1.5.3
+    # via dbt-core
+markupsafe==2.1.5
+    # via jinja2
+mashumaro==3.13.1
+    # via
+    #   dbt-adapters
+    #   dbt-common
+    #   dbt-core
+minimal-snowplow-tracker==0.0.2
+    # via dbt-core
+more-itertools==10.5.0
+    # via dbt-semantic-interfaces
+msgpack==1.1.0
+    # via mashumaro
+mypy-extensions==1.0.0
+    # via black
+networkx==3.3
+    # via dbt-core
+nodeenv==1.9.1
+    # via pre-commit
+ordered-set==4.1.0
+    # via deepdiff
+packaging==24.1
+    # via
+    #   black
+    #   dbt-core
+    #   pytest
+parsedatetime==2.6
+    # via agate
+pathspec==0.12.1
+    # via
+    #   black
+    #   dbt-common
+    #   dbt-core
+    #   sqlfluff
+platformdirs==4.3.6
+    # via
+    #   black
+    #   virtualenv
+pluggy==1.5.0
+    # via
+    #   diff-cover
+    #   pytest
+pre-commit==3.8.0
+    # via -r tools.in
+protobuf==4.25.5
+    # via
+    #   dbt-adapters
+    #   dbt-common
+    #   dbt-core
+psycopg2-binary==2.9.9
+    # via dbt-postgres
+pydantic==2.9.2
+    # via dbt-semantic-interfaces
+pydantic-core==2.23.4
+    # via pydantic
+pygments==2.18.0
+    # via diff-cover
+pytest==8.3.3
+    # via sqlfluff
+python-dateutil==2.9.0.post0
+    # via
+    #   dbt-common
+    #   dbt-semantic-interfaces
+python-slugify==8.0.4
+    # via agate
+pytimeparse==1.1.8
+    # via agate
+pytz==2024.2
+    # via
+    #   dbt-adapters
+    #   dbt-core
+pyyaml==6.0.2
+    # via
+    #   dbt-core
+    #   dbt-semantic-interfaces
+    #   pre-commit
+    #   sqlfluff
+referencing==0.35.1
+    # via
+    #   jsonschema
+    #   jsonschema-specifications
+regex==2024.9.11
+    # via sqlfluff
+requests==2.32.3
+    # via
+    #   dbt-common
+    #   dbt-core
+    #   minimal-snowplow-tracker
+rpds-py==0.20.0
+    # via
+    #   jsonschema
+    #   referencing
+six==1.16.0
+    # via
+    #   isodate
+    #   minimal-snowplow-tracker
+    #   python-dateutil
+sqlfluff==3.2.0
+    # via
+    #   -r tools.in
+    #   sqlfluff-templater-dbt
+sqlfluff-templater-dbt==3.2.0
+    # via -r tools.in
+sqlparse==0.5.1
+    # via dbt-core
+tblib==3.0.0
+    # via sqlfluff
+text-unidecode==1.3
+    # via python-slugify
+tqdm==4.66.5
+    # via sqlfluff
+typing-extensions==4.12.2
+    # via
+    #   dbt-adapters
+    #   dbt-common
+    #   dbt-core
+    #   dbt-semantic-interfaces
+    #   mashumaro
+    #   pydantic
+    #   pydantic-core
+urllib3==2.2.3
+    # via requests
+virtualenv==20.26.6
+    # via pre-commit
+zipp==3.20.2
+    # via importlib-metadata
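
---

The lockfile headers above record that the new `requirements/*.txt` files were generated with `uv pip compile`. A minimal sketch of regenerating them after editing the `.in` specs — note the `pip3 install uv` bootstrap and the `-o`/`--output-file` flag are assumptions for illustration; the headers only record `uv pip compile <name>.in`:

```bash
# Install uv if it isn't already available (assumed bootstrap step)
pip3 install uv

# Recompile each lockfile from its .in spec; common.in is pulled in
# via the `-r common.in` line at the top of each spec file.
cd requirements
uv pip compile duckdb.in -o duckdb.txt
uv pip compile postgres.in -o postgres.txt
```

Keeping the shared tooling pins (pre-commit, black, sqlfluff) in `common.in` means both adapter-specific lockfiles stay in sync after a single edit.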