Ready to use ROCOv2 with Transformers in our Hugging Face repository.
from datasets import load_dataset
ds = load_dataset("eltorio/ROCOv2-radiology")
ROCOv2 is a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access Subset. It is an updated version of the ROCO dataset, adding 35,705 new images and improving concept extraction and filtering.
The ROCOv2 dataset contains 79,789 radiological images, each with a corresponding caption and medical concepts. The images are sourced from openly available publications in the PMC Open Access Subset, licensed under CC BY or CC BY-NC.
- 79,789 radiological images
- 59,958 images in the training set
- 9,904 images in the validation set
- 9,927 images in the test set
- 1,947 unique CUIs overall
- 1,947 CUIs in the training set
- 1,760 CUIs in the validation set
- 1,754 CUIs in the test set
The dataset was created by downloading the full PMC Open Access Subset via FTP, extracting the images and captions, and filtering the images using two binary classification models. The models achieved accuracies of about 90% and 98.6%, respectively.
- Non-compound image filtering: removed 15,315,657 images
- Radiological image filtering: removed 64,831 images
- License filtering: removed 10,392 images from papers not licensed under CC BY or CC BY-NC
- Duplicate removal: removed 2,056 duplicates
- Caption filtering: removed 1,528 images with non-English captions and very short captions without relevant information
The dataset hosted in Hugging Face hub was generated with this notebook
All the source images and code can be found on our GitHub repo
The dataset labels and concepts were generated using the Medical Concept Annotation Toolkit v1.10.0 (MedCAT) and manually curated concepts for modality (all images), body region (X-ray only), and directionality (X-ray only).
The labeling and concept generation workflow consisted of the following steps:
- Image caption extraction
- Concept extraction using MedCAT
- Manual curation of concepts for modality, body region, and directionality
- Combination of automatically generated and manually curated concepts
The ROCOv2 dataset can be used for various applications, including:
- Training image annotation models based on image-caption pairs
- Multi-label image classification using UMLS concepts
- Pre-training of medical domain models
- Evaluation of deep learning models for multi-task learning
- Image retrieval and caption generation tasks
If you use the ROCOv2 dataset in your research, please cite the following paper:
Pelka, O., Menze, B. H., & Rexhausen, S. E. (2023). Radiology Objects in COntext version 2 (ROCOv2): A multimodal dataset for medical image analysis. arXiv preprint arXiv:2405.10004.
@misc {ronan_l.m._2024,
author = { {Ronan L.M.} },
title = { ROCOv2-radiology (Revision 5d66908) },
year = 2024,
url = { https://huggingface.co/datasets/eltorio/ROCOv2-radiology },
doi = { 10.57967/hf/3489 },
publisher = { Hugging Face }
}
The ROCOv2 dataset is licensed under the CC BY-NC-SA 4.0 license.
We acknowledge the National Library of Medicine (NLM) for providing access to the PMC Open Access Subset. We also acknowledge the creators of the Medical Concept Annotation Toolkit (MedCAT) for providing a valuable tool for concept extraction and annotation.