Rare-disease-identification

This repository presents an approach using ontologies and weak supervision to identify rare diseases from clinical notes. The idea is illustrated below and the data annotation for rare disease entity linking and ontology matching is available for download.

The latest preprint, "Ontology-Driven and Weakly Supervised Rare Disease Identification from Clinical Notes", is available on arXiv and has been accepted by BMC Medical Informatics and Decision Making. It extends the previous work published at IEEE EMBC 2021.

Entity linking and ontology matching

A graphical illustration of the entity linking and ontology matching process (see the image file Graph representation.PNG in the repository):

Weak supervision (WS)

The process to create weakly labelled data with contextual representation is illustrated below (see the image file Weak supervision illustrated.PNG in the repository):

Rare disease mention annotations

The annotations of rare disease mentions created from this research are available in the folder data annotation.

Implementation sources

Main packages: see requirements.txt (with conda scripts inside) for the full list. They include BERT-as-service (follow its guide to install), scikit-learn, Huggingface Transformers, numpy, nltk, gensim, pandas, medcat, etc.
SemEHR can be installed from https://github.com/CogStack/CogStack-SemEHR
- A minimised SemEHR version was used to process the MIMIC-III radiology reports.
BlueBERT (Base, Uncased, PubMed+MIMIC-III) models are from https://github.com/ncbi-nlp/bluebert or https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12
Ontology matching:
- ORDO to ICD-10 or UMLS https://www.ebi.ac.uk/ols/ontologies/ordo;
- ICD-10 to ICD-9 https://www.health.govt.nz/nz-health-statistics/data-references/mapping-tools/mapping-between-icd-10-and-icd-9;
- UMLS to ICD-9-CM https://bioportal.bioontology.org/ontologies/ICD9CM
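
One way to read the list above is as a chain of lookups: an ORDO concept is first mapped to ICD-10 (or UMLS), and the result is then mapped to ICD-9. The sketch below composes two such one-to-many mappings. It is only an illustration under stated assumptions: the function name `chain_mappings` is hypothetical, the codes are placeholders rather than real ORDO or ICD identifiers, and the repository's actual matching file may be built differently.

```python
from typing import Dict, Set


def chain_mappings(
    first: Dict[str, Set[str]], second: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Compose two one-to-many code mappings (e.g. ORDO to ICD-10, then ICD-10 to ICD-9)."""
    composed: Dict[str, Set[str]] = {}
    for source_code, mid_codes in first.items():
        targets: Set[str] = set()
        for mid_code in mid_codes:
            # Codes with no entry in the second mapping contribute nothing
            targets |= second.get(mid_code, set())
        composed[source_code] = targets
    return composed


# Placeholder identifiers only (hypothetical, not real ORDO or ICD codes)
ordo_to_icd10 = {"ORDO_A": {"ICD10_X", "ICD10_Y"}, "ORDO_B": {"ICD10_Z"}}
icd10_to_icd9 = {"ICD10_X": {"ICD9_1"}, "ICD10_Y": {"ICD9_1", "ICD9_2"}}

ordo_to_icd9 = chain_mappings(ordo_to_icd10, icd10_to_icd9)
print(sorted(ordo_to_icd9["ORDO_A"]))  # ['ICD9_1', 'ICD9_2']
print(sorted(ordo_to_icd9["ORDO_B"]))  # [] because ICD10_Z has no ICD-9 entry
```

Keeping an empty set for unmatched concepts, rather than dropping them, makes it easy to see which concepts have no match after chaining.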

Pipeline

Note: This is mainly a research implementation rather than well-engineered software, but we hope that the code, data, and results add useful detail to this work.

Data and models

The data files and BERT models are placed according to the structure below. The SemEHR outputs for MIMIC-III discharge summaries (mimic-semehr-smp-outputs\outputs) and MIMIC-III radiology reports (mimic-rad-semehr-outputs\outputs) were obtained by running SemEHR.

└───bert-models
|   |   run_get_bluebert.sh
|   |   NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12
|   |   |   ... (model files)
└───data/
|   |   NOTEEVENTS.csv (from MIMIC-III)
|   |   DIAGNOSES_ICD.csv (from MIMIC-III)
|   |   PROCEDURES_ICD.csv (from MIMIC-III)
|   |   mimic-semehr-smp-outputs
|   |   |   outputs
|   |   |   |   ... (SemEHR output files of MIMIC-III DS)
|   |   mimic-rad-semehr-outputs
|   |   |   outputs
|   |   |   |   ... (SemEHR output files of MIMIC-III rad)
└───models/
|   |   ... (phenotype confirmation model `.pik` files)
└───ontology/
|   |   ORDO2UMLS_ICD10_ICD9+titles_final_v3.xlsx (ontology concept matching file)
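
Before running anything, it can help to confirm that the files are where the tree above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the path list is simply transcribed from the tree (the `models` folder is checked only for existence, since its `.pik` file names are not listed here).

```python
from pathlib import Path
from typing import List

# Relative paths transcribed from the directory tree above
EXPECTED_PATHS = [
    "bert-models/NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12",
    "data/NOTEEVENTS.csv",
    "data/DIAGNOSES_ICD.csv",
    "data/PROCEDURES_ICD.csv",
    "data/mimic-semehr-smp-outputs/outputs",
    "data/mimic-rad-semehr-outputs/outputs",
    "models",
    "ontology/ORDO2UMLS_ICD10_ICD9+titles_final_v3.xlsx",
]


def missing_paths(root: str = ".") -> List[str]:
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


# Report anything missing relative to the current working directory
for path in missing_paths("."):
    print("missing:", path)
```

Run from the repository root, this prints one line per missing file or folder and nothing when the layout is complete.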

Key pipeline scripts

Weakly supervised data creation: main_scripts/step1_tr_data_creat_ment_disamb.py.
Weakly supervised data representation and model training: main_scripts/step3.4 for MIMIC-III discharge summaries, main_scripts/step3.6 for MIMIC-III (and Tayside) radiology reports.
- static BERT-based encoding is implemented in def encode_data_tuple() in main_scripts/sent_bert_emb_viz_util.py using BERT-as-service;
- a fine-tuning approach with Huggingface Transformers is in other_scripts/step3.8_fine_tune_bert_with_trainer.py.
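
The repository's `encode_data_tuple()` is not reproduced here. As a rough illustration of what encoding a mention in its context can involve, the sketch below cuts a token window around a mention and passes it to any sentence encoder. Everything in it is an assumption made for illustration: the function names `context_window` and `encode_mentions`, the window size, the whitespace tokenisation, and the toy encoder, which only stands in for a real BERT encoder.

```python
from typing import Callable, List, Sequence, Tuple


def context_window(text: str, start: int, end: int, window: int = 5) -> str:
    """Return the mention plus up to `window` whitespace tokens on each side."""
    # Guard window == 0, because a [-0:] slice would return the whole list
    left = text[:start].split()[-window:] if window > 0 else []
    right = text[end:].split()[:window]
    return " ".join(left + [text[start:end]] + right)


def encode_mentions(
    text: str,
    spans: Sequence[Tuple[int, int]],
    encode_fn: Callable[[List[str]], Sequence],
    window: int = 5,
) -> Sequence:
    """Encode each (start, end) mention span together with its context window."""
    contexts = [context_window(text, s, e, window) for s, e in spans]
    return encode_fn(contexts)


def toy_encoder(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a BERT encoder: character and token counts."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


# With a running BERT-as-service server, encode_fn could instead be
# BertClient().encode (from bert_serving.client), which takes a list of strings.
note = "Patient with a history of cystic fibrosis admitted for cough ."
start = note.index("cystic fibrosis")
end = start + len("cystic fibrosis")

print(context_window(note, start, end, window=2))
# history of cystic fibrosis admitted for
vectors = encode_mentions(note, [(start, end)], toy_encoder, window=2)
print(len(vectors))  # 1
```

Passing the encoder in as a function keeps the windowing logic testable without a BERT server.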

Once all files are in place (MIMIC-III data, SemEHR outputs, BERT models), the main steps of the whole pipeline can be run with python run_main_steps.py.

Reproducing results from the paper

Reproducing the results does not require running the pipeline above, as it is based on the prediction scores.

Move all the files inside main_scripts (and other_scripts) up one level, into their parent folder.
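
The move can be done by hand or scripted. The sketch below is one possible way to do it, assuming that the "upper folder" is the repository root that contains main_scripts and other_scripts. The helper name `move_scripts_up` is hypothetical, and it skips any file whose name already exists at the destination instead of overwriting it.

```python
import shutil
from pathlib import Path
from typing import List, Sequence


def move_scripts_up(
    repo_root: str, folders: Sequence[str] = ("main_scripts", "other_scripts")
) -> List[str]:
    """Move every item inside `folders` into `repo_root`; return the moved names."""
    root = Path(repo_root)
    moved: List[str] = []
    for folder in folders:
        src_dir = root / folder
        if not src_dir.is_dir():
            continue  # folder absent, nothing to move
        for item in sorted(src_dir.iterdir()):
            target = root / item.name
            if target.exists():
                # Assumption: never overwrite an existing file at the root
                print(f"skipped (already exists): {item.name}")
                continue
            shutil.move(str(item), str(target))
            moved.append(item.name)
    return moved


# Example call from the repository root (left commented out because it moves files):
# move_scripts_up(".")
```

Working on a copy of the repository is the safer option, since the move changes the folder layout.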

Main results: Text-to-UMLS

MIMIC-III discharge summaries: python step4_further_results_from_annotations.py

MIMIC-III radiology reports: python step4.1_further_results_from_annotations_for_rad.py

Error analysis: python error_analysis.py

Other results: UMLS-to-ORDO, Text-to-ORDO

UMLS-to-ORDO: calculated from results in raw annotations (with model predictions).

Text-to-ORDO, mention-level: see step7 and step7.1 in other_scripts.

Text-to-ORDO, admission-level: see step8 and step8.1 in other_scripts.

Acknowledgement

This work has been carried out by members of KnowLab, with thanks also to the EdiE-ClinicalNLP research group.

Acknowledgement to the icons used:

MIMIC icon from https://mimic.physionet.org/
UMLS icon from https://uts.nlm.nih.gov/uts/umls/
ORDO icon from http://www.orphadata.org/cgi-bin/index.php
Ministry of Health, New Zealand icon from https://www.health.govt.nz/nz-health-statistics/data-references/mapping-tools/mapping-between-icd-10-and-icd-9
ICD 10 icon from https://icd.who.int/browse10/Content/ICD10.png

