Kimia Path24 is a dataset for image classification and retrieval in digital pathology. It contains 24 whole scan images (WSIs) of different tissue textures. The dataset contains 1,325 test patches of size 1000 x 1000 (0.5mm × 0.5mm) extracted from all 24 WSIs. Patches for training algorithms can be generated according to preferences of the algorithm designer using provided WSIs (test patches are whitened in the scans). Approximately 27,000 to over 50,000 training patches (1000 x 1000) can be extracted if the preset parameters are adopted.
If you are interested in using the KIMIA Path24 data set, please send an email to Hamid Tizhoosh (last_name AT uwaterloo DOT ca).
Please read the end-user agreement carefully.
http://kimia.uwaterloo.ca/tizhoosh/PDF/KIMIA_Path24_EndUserAgreement.pdf
Entire dataset is provided in a single HDF5 file Kimia_Path24.h5
.
File Name | Kimia_Path24.h5 |
File Size | 29409M |
MD5SUM | aeb54db89aaf28455e06770d20bd1307 |
Data is organized in a hierarchy as follows.
|
+ -- scan0
| |
| + -- data (x, y)
| |
| + -- thumbnail
|
+ -- scan1
| |
| + -- data (x, y)
| |
| + -- thumbnail
.
.
.
|
+ -- scan23
| |
| + -- data (m, n)
| |
| + -- thumbnail
|
+ -- test_data
|
+ -- patches (1325, 1000, 1000)
|
+ -- targets (1325, )
Scans are labeled from 0 to 23 and can be accessed via \scan{id}\data
. For
convenience, the thumbnail of each scan is provided and accessed via
\scan{id}\thumbnail
. However, no scaling standard is used for thumbnail and it
must not be used for interpolating back to original scan data. The test data
contains 1325 patches from all 24 scans, accessed via \test_data\patches
and
their corresponding targets \test_data\targets
for evaluation purposes.
Location from where test patches are extracted are whitened in the original
WSIs, therefore such patches must be identified during extraction of training
patches.
Python (h5py required):
import h5py as h5
data = h5.File('Kimia_Path24.h5', 'r')
I = data['/scan1/data']
MATLAB:
I=h5read('Kimia_Path24.h5', '/scan1/data');
I=I'; %The images must be transposed to be same as images in Phyton.
Python (h5py required):
# This is simple code snippet showing how to read patches without any overlap.
# Algorithm designer may want to read patches with overlap and also be able to
# ignore less interesting patches (not containing useless information).
import h5py as h5
data = h5.File('Kimia_Path24.h5', 'r')
scan1_data = data['/scan1/data']
x_max, y_max = scan1_data.shape
# creating patches of size (1000, 1000)
patch_size = 1000
x_max_iter = int(x_max/patch_size) - 1
y_max_iter = int(y_max/patch_size) - 1
for x in range(x_max_iter):
for y in range(y_max_iter):
x_loc = x*patch_size
y_loc = y*patch_size
# read the patch
patch = scan1_data[x_loc:x_loc+patch_size, y_loc:y_loc+patch_size]
MATLAB: Currently not done!!
There are 1325 test patches each of size (1000, 1000). A classification algorithm will predict their label from 0 to 23. For evaluation, create the CSV file containing prediction labels for all 1325 patches in exact same order as given in H%PY file. Python:
import h5py as h5
data = h5.File('Kimia_Path24.py', 'r')
test_patches = data['/test_data/patches']
predictions = map(lambda test_patch: predict(test_patch), test_patches)
# write predictions to csv file for evaluation purpose
Evaluation script is provided in both MATLAB and Python.
Authors of scientific papers including results generated using KimiaPath24 dataset are encouraged to cite the following paper.
@article{kimiapath24_2017,
title = {Classification and {{Retrieval}} of {{Digital Pathology Scans}}: {{A New Dataset}}},
url = {http://arxiv.org/abs/1705.07522},
shorttitle = {Classification and {{Retrieval}} of {{Digital Pathology Scans}}},
abstract = {In this paper, we introduce a new dataset, $\backslash$textbf\{Kimia Path24\}, for image classification and retrieval in digital pathology. We use the whole scan images of 24 different tissue textures to generate 1,325 test patches of size 1000\$$\backslash$times\$1000 (0.5mm\$$\backslash$times\$0.5mm). Training data can be generated according to preferences of algorithm designer and can range from approximately 27,000 to over 50,000 patches if the preset parameters are adopted. We propose a compound patch-and-scan accuracy measurement that makes achieving high accuracies quite challenging. In addition, we set the benchmarking line by applying LBP, dictionary approach and convolutional neural nets (CNNs) and report their results. The highest accuracy was 41.80$\backslash$\% for CNN.},
archivePrefix = {arXiv},
eprinttype = {arxiv},
eprint = {1705.07522},
author = {Babaie, Morteza and Kalra, Shivam and Sriram, Aditya and Mitcheltree, Christopher and Zhu, Shujin and Khatami, Amin and Rahnamayan, Shahryar and Tizhoosh, H. R.},
}