prepare to create a v1.0
atsvetkova-ody committed Feb 8, 2021
1 parent 4d954bf commit e4d55f1
Showing 42 changed files with 2,258 additions and 2,852 deletions.
122 changes: 101 additions & 21 deletions README.md
@@ -1,39 +1,119 @@
# MIMIC IV to OMOP CDM Conversion #

### What is this repository for? ###

The project implements an ETL conversion of the MIMIC IV PhysioNet dataset to the OMOP CDM format.

* Version 1.0

### Concepts / Philosophy ###

The ETL proceeds in five steps:
* Create a snapshot of the source data. The snapshot data is stored in staging source tables with the prefix "src_".
* Clean the source data: filter out rows that are not used, format values, and apply some business rules. Create intermediate tables with the prefix "lk_" and the suffix "clean".
* Map distinct source codes to concepts in the vocabulary tables. Create intermediate tables with the prefix "lk_" and the suffix "concept".
    * Custom mapping is implemented with custom concepts generated in the vocabulary tables beforehand.
* Join the cleaned data and the mapped codes. Create intermediate tables with the prefix "lk_" and the suffix "mapped". A sketch of this pattern follows the list.
* Distribute the mapped data across the target CDM tables according to the target_domain_id values.
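As a rough illustration of the clean / concept / mapped pattern, the join step might look like the sketch below (BigQuery SQL; the table and column names are illustrative assumptions, not the actual scripts in etl/etl/):

```
-- Hypothetical example: join cleaned diagnoses to their mapped concepts.
CREATE OR REPLACE TABLE `etl_dataset.lk_diagnoses_mapped` AS
SELECT
    cln.subject_id,
    cln.source_code,
    cln.start_datetime,
    con.target_concept_id,
    con.target_domain_id    -- used later to route rows to the final CDM table
FROM `etl_dataset.lk_diagnoses_clean` AS cln
LEFT JOIN `etl_dataset.lk_diagnoses_concept` AS con
    ON cln.source_code = con.source_code;
```

In the last step, rows with target_domain_id = 'Condition' would then land in condition_occurrence, rows with 'Drug' in drug_exposure, and so on.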

Intermediate and staging CDM tables have additional working fields, such as unit_id. The unit_id field is composed during the ETL steps and reads from right to left: source table name, initial target table abbreviation, final target table name or abbreviation. For example, unit_id = 'drug.cond.diagnoses_icd' means that the rows with this unit_id belong to the drug_exposure table, were initially prepared for the condition_occurrence table, and originate from the source table diagnoses_icd.
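For example, the unit_id breakdown of a staging CDM table can be checked with a query like this (a minimal sketch; the dataset and table names are placeholders):

```
-- Row counts per unit_id, i.e. per source-to-target path.
SELECT unit_id, COUNT(*) AS row_count
FROM `etl_dataset.cdm_drug_exposure`
GROUP BY unit_id
ORDER BY row_count DESC;
```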

Vocabularies are kept in a separate dataset and are copied as part of the snapshot data too.


### How to run the conversion ###

* The ETL process encapsulates the following workflows: ddl, vocabulary_refresh, staging, etl, ut, and unload.
* The unload workflow creates the final OMOP CDM dataset, which can be analysed with OHDSI tools such as Atlas or DQD; a quick sanity check is sketched below.
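Once the unload workflow completes, a sanity check on the final dataset might look like the following (a minimal sketch; `atlas_dataset` is a placeholder for the dataset name configured in the .etlconf file, and the tables queried are standard OMOP CDM tables):

```
-- Confirm the CDM source metadata was populated by the ETL.
SELECT cdm_source_name, cdm_version, vocabulary_version
FROM `atlas_dataset.cdm_source`;

-- Confirm persons were loaded.
SELECT COUNT(*) AS person_count
FROM `atlas_dataset.person`;
```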

* How to run the ETL end-to-end:
    * update the config files accordingly
    * perform the vocabulary_refresh steps if needed (see vocabulary_refresh/README.md)
    * set the project root (the location of this file) as the current directory, then run:

```
cd vocabulary_refresh
# refresh the vocabularies in steps 10, 20, 30 (see vocabulary_refresh/README.md)
python vocabulary_refresh.py -s10
python vocabulary_refresh.py -s20
python vocabulary_refresh.py -s30
cd ../
# run the workflows in order: DDL, staging, ETL, unit tests, metrics
python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_ddl.conf
python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_staging.conf
python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf
python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_ut.conf
python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_metrics.conf
```
* How to look at the UT and Metrics reports:
    * find the metrics dataset name in the corresponding .etlconf file, then run the queries below

```
-- UT report
SELECT report_starttime, table_id, test_type, field_name
FROM metrics_dataset.report_unit_test
WHERE NOT test_passed
;
-- Metrics - row count
SELECT * FROM metrics_dataset.me_total ORDER BY table_name;
-- Metrics - person and visit summary
SELECT
category, name, count AS row_count
FROM metrics_dataset.me_persons_visits ORDER BY category, name;
-- Metrics - Mapping rates
SELECT
table_name, concept_field,
count AS rows_mapped,
percent AS percent_mapped,
total AS rows_total
FROM metrics_dataset.me_mapping_rate
ORDER BY table_name, concept_field
;
-- Metrics - Top 100 Mapped and Unmapped
SELECT
table_name, concept_field, category, source_value, concept_id, concept_name,
count AS row_count,
percent AS rows_percent
FROM metrics_dataset.me_tops_together
ORDER BY table_name, concept_field, category, count DESC;
```

* More options for running parts of the ETL:
    * Run a workflow:
        * with local variables: `python scripts/run_workflow.py -c conf/workflow_etl.conf`
            * copy the "variables" section from file.etlconf
        * with global variables: `python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf`
    * Run explicitly named scripts (space-delimited):
      `python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf etl/etl/cdm_drug_era.sql`
    * Run in the background:
      `nohup python scripts/run_workflow.py -e conf/full.etlconf -c conf/workflow_etl.conf > ../out_full_etl.out &`
    * Continue after an error:
      `nohup python scripts/run_workflow.py -e conf/full.etlconf -c conf/workflow_etl.conf etl/etl/cdm_observation.sql etl/etl/cdm_observation_period.sql etl/etl/cdm_fact_relationship.sql etl/etl/cdm_condition_era.sql etl/etl/cdm_drug_era.sql etl/etl/cdm_dose_era.sql etl/etl/cdm_cdm_source.sql >> ../out_full_etl.out &`


### Change Log (latest first) ###


**2021-02-08**

* Set version to v1.0

* Drug_exposure table
    * pharmacy.medication replaces particular values of prescription.drug
    * the source value format is changed to COALESCE(pharmacy.medication.selected, prescription.drug) || prescription.prod_strength
* Labevents mapping is replaced with the new reviewed version
    * vocabulary affected: mimiciv_meas_lab_loinc
    * lk_meas_labevents_clean and lk_meas_labevents_mapped are changed accordingly
* Unload for Atlas
    * the technical fields unit_id, load_row_id, load_table_id, and trace_id are removed from the tables devoted to Atlas
* Delivery export script
    * tables are exported to a single directory, one file per table; if a table is too large, it is exported to multiple files
* Bugfixes and cleanup
    * real environment names are replaced with placeholders


**2021-02-01**

* Waveforms POC-2 (load from a folder tree and CSV files)
    * Waveform POC-2 is created for 4 MIMIC III waveform files uploaded to the bucket
    * iterates through the folder tree, captures metadata, and loads the CSVs
* Bugfixes


32 changes: 16 additions & 16 deletions conf/dev.etlconf
@@ -3,28 +3,28 @@

"variables":
{
"@source_project": "physionet-data",
"@core_dataset": "mimic_demo_core",
"@hosp_dataset": "mimic_demo_hosp",
"@icu_dataset": "mimic_demo_icu",
"@ed_dataset": "mimic_demo_ed",
"@source_project": "source_project...",
"@core_dataset": "core...",
"@hosp_dataset": "hosp...",
"@icu_dataset": "icu...",
"@ed_dataset": "ed...",

"@voc_project": "odysseus-mimic-dev",
"@voc_dataset": "vocabulary_2020_09_11",
"@voc_project": "etl_project...",
"@voc_dataset": "voc...",

"@wf_project": "odysseus-mimic-dev",
"@wf_dataset": "waveform_source_poc",
"@wf_project": "etl_project...",
"@wf_dataset": "wf...",

"@etl_project": "odysseus-mimic-dev",
"@etl_dataset": "mimiciv_demo_cdm_2021_01_20",
"@etl_project": "etl_project...",
"@etl_dataset": "etl...",

"@metrics_project": "odysseus-mimic-dev",
"@metrics_dataset": "mimiciv_demo_metrics_2021_01_20",
"@metrics_project": "etl_project...",
"@metrics_dataset": "metrics...",

"@atlas_project": "odysseus-mimic-dev",
"@atlas_dataset": "mimiciv_demo_202101_cdm_531",
"@atlas_project": "etl_project...",
"@atlas_dataset": "atlas...",

"@waveforms_csv_path": "gs://mimic_iv_to_omop/waveforms/source_data/csv"
"@waveforms_csv_path": "gs://bucket..."

},

38 changes: 22 additions & 16 deletions conf/full.etlconf
@@ -3,28 +3,28 @@

"variables":
{
"@source_project": "physionet-data",
"@core_dataset": "mimic_core",
"@hosp_dataset": "mimic_hosp",
"@icu_dataset": "mimic_icu",
"@ed_dataset": "mimic_ed",
"@source_project": "source_project...",
"@core_dataset": "core...",
"@hosp_dataset": "hosp...",
"@icu_dataset": "icu...",
"@ed_dataset": "ed...",

"@voc_project": "odysseus-mimic-dev",
"@voc_dataset": "vocabulary_2020_09_11",
"@voc_project": "etl_project...",
"@voc_dataset": "voc...",

"@wf_project": "odysseus-mimic-dev",
"@wf_dataset": "waveform_source_poc",
"@wf_project": "etl_project...",
"@wf_dataset": "wf...",

"@etl_project": "odysseus-mimic-dev",
"@etl_dataset": "mimiciv_full_cdm_2021_01_31",
"@etl_project": "etl_project...",
"@etl_dataset": "etl...",

"@metrics_project": "odysseus-mimic-dev",
"@metrics_dataset": "mimiciv_full_metrics_2021_01_31",
"@metrics_project": "etl_project...",
"@metrics_dataset": "metrics...",

"@atlas_project": "odysseus-mimic-dev",
"@atlas_dataset": "mimiciv_full_202101_cdm_531",
"@atlas_project": "etl_project...",
"@atlas_dataset": "atlas...",

"@waveforms_csv_path": "gs://mimic_iv_to_omop/waveforms/source_data/csv"
"@waveforms_csv_path": "gs://bucket..."

},

@@ -68,6 +68,12 @@
"conf": "workflow_qa.conf"
},

{
"workflow": "metrics",
"comment": "build metrics with metrics_gen scripts",
"type": "sql",
"conf": "workflow_metrics.conf"
},
{
"workflow": "gen_scripts",
"comment": "automation to generate similar queries for some tasks",
2 changes: 1 addition & 1 deletion custom_mapping_csv/custom_mapping_list.tsv
@@ -1,6 +1,6 @@
"file_name" "source_vocabulary_id" "min_concept_id" "max_concept_id" "row_count" "target_domains"
"gcpt_mimic_generated.csv" "mimiciv_mimic_generated" 2000000000 2000001000 "all(?)"
"gcpt_meas_lab_loinc.csv" "mimiciv_meas_lab_loinc" 2000001001 2000001173 173 "measurement"
"gcpt_meas_lab_loinc.csv" "mimiciv_meas_lab_loinc" 2000001001 2000001235 235 "measurement"
"gcpt_obs_insurance.csv" "mimiciv_obs_insurance" 2000001301 2000001305 5 "observation, Meas Value"
"gcpt_per_ethnicity.csv" "mimiciv_per_ethnicity" 2000001401 2000001408 8 "person"
"gcpt_obs_marital.csv" "mimiciv_obs_marital" 2000001501 2000001507 7 "observation"