This repository presents an approach using ontologies and weak supervision to identify rare diseases from clinical notes. The idea is illustrated below and the data annotation for rare disease entity linking and ontology matching is available for download.
The latest preprint, Ontology-Driven and Weakly Supervised Rare Disease Identification from Clinical Notes, is available on arXiv and has been accepted for BMC Medical Informatics and Decision Making. It extends previous work published at IEEE EMBC 2021.
A graphical illustration of the entity linking and ontology matching process:
The process to create weakly labelled data with contextual representation is illustrated below:
The annotations of rare disease mentions created from this research are available in the folder `data annotation`.
- Main packages: see `requirement.txt` (with conda scripts inside) for a full list. The key ones are BERT-as-service (follow its guide to install), scikit_learn, Huggingface Transformers, numpy, nltk, gensim, pandas, and medcat, among others.
- SemEHR can be installed from https://github.com/CogStack/CogStack-SemEHR
- A minimised SemEHR version was used to process the MIMIC-III radiology reports.
- BlueBERT (Base, Uncased, PubMed+MIMIC-III) models are from https://github.com/ncbi-nlp/bluebert or https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12
- Ontology matching:
  - ORDO to ICD-10 or UMLS: https://www.ebi.ac.uk/ols/ontologies/ordo
  - ICD-10 to ICD-9: https://www.health.govt.nz/nz-health-statistics/data-references/mapping-tools/mapping-between-icd-10-and-icd-9
  - UMLS to ICD-9-CM: https://bioportal.bioontology.org/ontologies/ICD9CM
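The chain of mappings listed above (ORDO to ICD-10, then ICD-10 to ICD-9) can be thought of as a composition of one-to-many lookup tables. The snippet below is a minimal sketch of that idea, not the repository's implementation: the function name `compose_mappings` is hypothetical, and every identifier in the toy tables is a made-up placeholder rather than a real ORDO or ICD code.

```python
from collections import defaultdict


def compose_mappings(first, second):
    """Compose two one-to-many mappings: a -> b and b -> c give a -> c."""
    composed = defaultdict(set)
    for source, intermediates in first.items():
        for mid in intermediates:
            # Intermediate codes with no entry in the second table are skipped.
            composed[source].update(second.get(mid, ()))
    # Drop sources that reached no target at all.
    return {src: sorted(targets) for src, targets in composed.items() if targets}


# Placeholder identifiers only (hypothetical, not real ORDO/ICD codes).
ordo_to_icd10 = {
    "ORDO_A": ["ICD10_X", "ICD10_Y"],
    "ORDO_B": ["ICD10_Z"],
}
icd10_to_icd9 = {
    "ICD10_X": ["ICD9_1"],
    "ICD10_Y": ["ICD9_1", "ICD9_2"],
}

ordo_to_icd9 = compose_mappings(ordo_to_icd10, icd10_to_icd9)
print(ordo_to_icd9)  # {'ORDO_A': ['ICD9_1', 'ICD9_2']}
```

Using sets for the intermediate result removes duplicate targets that arise when several ICD-10 codes map to the same ICD-9 code; `ORDO_B` is dropped because its only ICD-10 code has no ICD-9 entry in the toy table.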
Note: This is mainly a research implementation rather than well-engineered software, but we hope that the code, data, and results provide more detail on this work and are useful.
The data files and BERT models are placed according to the structure below. The SemEHR outputs for MIMIC-III discharge summaries (`mimic-semehr-smp-outputs\outputs`) and MIMIC-III radiology reports (`mimic-rad-semehr-outputs\outputs`) were obtained by running SemEHR.
└───bert-models
| | run_get_bluebert.sh
| | NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12
| | | ... (model files)
└───data/
| | NOTEEVENTS.csv (from MIMIC-III)
| | DIAGNOSES_ICD.csv (from MIMIC-III)
| | PROCEDURES_ICD.csv (from MIMIC-III)
| | mimic-semehr-smp-outputs
| | | outputs
| | | | ... (SemEHR output files of MIMIC-III DS)
| | mimic-rad-semehr-outputs
| | | outputs
| | | | ... (SemEHR output files of MIMIC-III rad)
└───models/
| | ... (phenotype confirmation model `.pik` files)
└───ontology/
| | ORDO2UMLS_ICD10_ICD9+titles_final_v3.xlsx (ontology concept matching file)
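Before running anything, it can help to confirm that the layout above is in place. The following is a minimal sketch, assuming the script is run from the repository root and that the SemEHR output folders sit under `data/` as read from the tree above; the names `EXPECTED_PATHS` and `missing_paths` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Paths read from the layout above, relative to the repository root.
EXPECTED_PATHS = [
    "bert-models/NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12",
    "data/NOTEEVENTS.csv",
    "data/DIAGNOSES_ICD.csv",
    "data/PROCEDURES_ICD.csv",
    "data/mimic-semehr-smp-outputs/outputs",
    "data/mimic-rad-semehr-outputs/outputs",
    "models",
    "ontology/ORDO2UMLS_ICD10_ICD9+titles_final_v3.xlsx",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing files or folders:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected files and folders are present.")
```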
- Weakly supervised data creation: `main_scripts/step1_tr_data_creat_ment_disamb.py`.
- Weakly supervised data representation and model training: `main_scripts/step3.4` for MIMIC-III discharge summaries, `main_scripts/step3.6` for MIMIC-III (and Tayside) radiology reports.
  - static BERT-based encoding is implemented in `def encode_data_tuple()` in `main_scripts/sent_bert_emb_viz_util.py` using BERT-as-service;
  - a fine-tuning approach with Huggingface Transformers is in `other_scripts/step3.8_fine_tune_bert_with_trainer.py`.
If all files are set (MIMIC-III data, SemEHR outputs, BERT models), the main steps of the whole pipeline can be run with `python run_main_steps.py`.
The steps below do not require running the pipeline above, as they are based on the prediction scores.
Move all the files inside `main_scripts` (and `other_scripts`) to the upper folder.
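The relocation step can also be scripted. The sketch below is one possible way to do it, assuming it is run from the repository root; the helper name `copy_scripts_up` is hypothetical. Note that it copies rather than moves the files, so the original folders stay intact, and it never overwrites a file that already exists in the upper folder.

```python
import shutil
from pathlib import Path


def copy_scripts_up(repo_root=".", folders=("main_scripts", "other_scripts")):
    """Copy every file in the given script folders into `repo_root`.

    Copying (rather than moving) keeps the original folders intact.
    Files already present in `repo_root` under the same name are skipped.
    """
    repo_root = Path(repo_root)
    copied = []
    for folder in folders:
        src_dir = repo_root / folder
        if not src_dir.is_dir():
            continue  # Folder absent: nothing to copy.
        for src in sorted(src_dir.iterdir()):
            dest = repo_root / src.name
            if src.is_file() and not dest.exists():
                shutil.copy2(src, dest)
                copied.append(src.name)
    return copied


if __name__ == "__main__":
    for name in copy_scripts_up():
        print("copied", name)
```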
MIMIC-III discharge summaries: python step4_further_results_from_annotations.py
MIMIC-III radiology reports: python step4.1_further_results_from_annotations_for_rad.py
Error analysis: python error_analysis.py
UMLS-to-ORDO: calculated from results in raw annotations (with model predictions).
Text-to-ORDO, mention-level: see `step7` and `step7.1` in `other_scripts`.
Text-to-ORDO, admission-level: see `step8` and `step8.1` in `other_scripts`.
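To make the difference between mention-level and admission-level results concrete, the sketch below rolls mention-level predictions up to admissions. It rests on an assumption that is not confirmed by this README: that an admission is assigned the union of the ORDO concepts of its positively predicted mentions. The function name, the record layout, and all identifiers are hypothetical placeholders; the actual logic is in the `step8` scripts.

```python
from collections import defaultdict


def aggregate_to_admissions(mention_predictions):
    """Union the ORDO concepts of positively predicted mentions per admission.

    `mention_predictions` is an iterable of (admission_id, ordo_id, is_positive)
    tuples; this record layout is hypothetical.
    """
    per_admission = defaultdict(set)
    for admission_id, ordo_id, is_positive in mention_predictions:
        if is_positive:
            per_admission[admission_id].add(ordo_id)
    return dict(per_admission)


# Toy records with placeholder identifiers (not real data).
mentions = [
    ("adm_1", "ORDO_A", True),
    ("adm_1", "ORDO_A", True),   # repeated mention, counted once
    ("adm_1", "ORDO_B", False),  # rejected mention, ignored
    ("adm_2", "ORDO_C", True),
]
print(aggregate_to_admissions(mentions))  # {'adm_1': {'ORDO_A'}, 'adm_2': {'ORDO_C'}}
```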
This work was carried out by members of KnowLab, with thanks to the EdiE-ClinicalNLP research group.
Acknowledgement to the icons used:
- MIMIC icon from https://mimic.physionet.org/
- UMLS icon from https://uts.nlm.nih.gov/uts/umls/
- ORDO icon from http://www.orphadata.org/cgi-bin/index.php
- Ministry of Health, New Zealand icon from https://www.health.govt.nz/nz-health-statistics/data-references/mapping-tools/mapping-between-icd-10-and-icd-9
- ICD 10 icon from https://icd.who.int/browse10/Content/ICD10.png