ROCOv2: Radiology Object in COntext version 2

TL;DR

Ready to use ROCOv2 with Transformers in our Hugging Face repository.

from datasets import load_dataset

ds = load_dataset("eltorio/ROCOv2-radiology")

Introduction

ROCOv2 is a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access Subset. It is an updated version of the ROCO dataset, adding 35,705 new images and improving concept extraction and filtering.

Dataset Overview

The ROCOv2 dataset contains 79,789 radiological images, each with a corresponding caption and medical concepts. The images are sourced from openly available publications in the PMC Open Access Subset, licensed under CC BY or CC BY-NC.

Dataset Statistics

79,789 radiological images
59,958 images in the training set
9,904 images in the validation set
9,927 images in the test set
1,947 unique CUIs overall
1,947 CUIs in the training set
1,760 CUIs in the validation set
1,754 CUIs in the test set

Dataset Creation

The dataset was created by downloading the full PMC Open Access Subset via FTP, extracting the images and captions, and filtering the images using two binary classification models. The models achieved accuracies of about 90% and 98.6%, respectively.

Filtering Steps

Non-compound image filtering: removed 15,315,657 images
Radiological image filtering: removed 64,831 images
License filtering: removed 10,392 images from papers not licensed under CC BY or CC BY-NC
Duplicate removal: removed 2,056 duplicates
Caption filtering: removed 1,528 images with non-English captions and very short captions without relevant information

Transformers Dataset generation

The dataset hosted in Hugging Face hub was generated with this notebook
All the source images and code can be found on our GitHub repo

Dataset Labels and Concepts

The dataset labels and concepts were generated using the Medical Concept Annotation Toolkit v1.10.0 (MedCAT) and manually curated concepts for modality (all images), body region (X-ray only), and directionality (X-ray only).

Labeling and Concept Generation Workflow

The labeling and concept generation workflow consisted of the following steps:

Image caption extraction
Concept extraction using MedCAT
Manual curation of concepts for modality, body region, and directionality
Combination of automatically generated and manually curated concepts

Use Cases

The ROCOv2 dataset can be used for various applications, including:

Training image annotation models based on image-caption pairs
Multi-label image classification using UMLS concepts
Pre-training of medical domain models
Evaluation of deep learning models for multi-task learning
Image retrieval and caption generation tasks

Citation

If you use the ROCOv2 dataset in your research, please cite the following paper:

Pelka, O., Menze, B. H., & Rexhausen, S. E. (2023). Radiology Objects in COntext version 2 (ROCOv2): A multimodal dataset for medical image analysis. arXiv preprint arXiv:2405.10004.

@misc {ronan_l.m._2024,
 author       = { {Ronan L.M.} },
 title        = { ROCOv2-radiology (Revision 5d66908) },
 year         = 2024,
 url          = { https://huggingface.co/datasets/eltorio/ROCOv2-radiology },
 doi          = { 10.57967/hf/3489 },
 publisher    = { Hugging Face }
}

License

The ROCOv2 dataset is licensed under the CC BY-NC-SA 4.0 license.

Acknowledgments

We acknowledge the National Library of Medicine (NLM) for providing access to the PMC Open Access Subset. We also acknowledge the creators of the Medical Concept Annotation Toolkit (MedCAT) for providing a valuable tool for concept extraction and annotation.

Name		Name	Last commit message	Last commit date
Latest commit History 165 Commits
source_dataset		source_dataset
.gitignore		.gitignore
README.md		README.md
autopush.sh		autopush.sh
generate.ipynb		generate.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ROCOv2: Radiology Object in COntext version 2

TL;DR

Introduction

Dataset Overview

Dataset Statistics

Dataset Creation

Filtering Steps

Transformers Dataset generation

Dataset Labels and Concepts

Labeling and Concept Generation Workflow

Use Cases

Citation

License

Acknowledgments

About

Releases

Packages

Languages

sctg-development/ROCOv2-radiology

Folders and files

Latest commit

History

Repository files navigation

ROCOv2: Radiology Object in COntext version 2

TL;DR

Introduction

Dataset Overview

Dataset Statistics

Dataset Creation

Filtering Steps

Transformers Dataset generation

Dataset Labels and Concepts

Labeling and Concept Generation Workflow

Use Cases

Citation

License

Acknowledgments

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages