Tumor proliferation speed (tumor growth rate) is an important biomarker for predicting patient outcomes, and its proper assessment is crucial for informing treatment decisions. In a clinical setting, the most common assessment method is for a pathologist to count mitotic figures under a microscope. The manual nature and subjectivity of this process pose a reproducibility challenge, which has been the main motivation for efforts to automate the task with advanced ML techniques.
One of the main challenges in automating this task, however, is that whole slide images (WSIs) are very large: a single WSI can range from roughly 0.5 to 3.5 GB, which slows down the image pre-processing step required for any downstream ML application.
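Because a full slide rarely fits comfortably in memory, pre-processing typically reads small regions on demand. Below is a minimal sketch, assuming the `openslide-python` package is installed (the init script generated by the config notebook handles this) and using a hypothetical slide path, of how a single patch can be read from a WSI without loading the entire file:

```python
import openslide

# Hypothetical path to a Camelyon16 slide; adjust to your own storage location.
slide = openslide.OpenSlide("/dbfs/camelyon16/tumor_001.tif")
print(slide.dimensions)    # full-resolution (width, height) in pixels
print(slide.level_count)   # number of downsampled levels in the image pyramid

# read_region only loads the requested tile, so even multi-GB slides can be
# processed patch by patch; the result is an RGBA PIL image.
patch = slide.read_region(location=(50_000, 60_000), level=0, size=(299, 299))
patch = patch.convert("RGB")   # drop the alpha channel before feeding a model
slide.close()
```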
In this solution accelerator, we walk you through a step-by-step process that uses Databricks capabilities to perform image segmentation and pre-processing on WSIs and to train a binary classifier that produces a metastasis probability map over a whole slide image.
The data used in this solution accelerator is from the Camelyon16 Grand Challenge, along with annotations based on hand-drawn metastasis outlines. We use curated annotations for this dataset obtained from the Baidu Research GitHub repository.
We use Apache Spark's parallelization capabilities, via `pandas_udf`, both to generate tumor/normal patches based on the annotation data and to extract features with a pre-trained InceptionV3 network. We use the resulting embeddings to explore clusters of patches by visualizing them in 2D and 3D with UMAP. We then use transfer learning with PyTorch to train a convolutional network to classify tumor vs. normal patches, and finally use the resulting model to overlay a metastasis heatmap on a new slide.
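As an illustration of the distributed feature-extraction pattern, here is a minimal sketch of a pandas UDF that embeds patch images with a pre-trained InceptionV3. The table name `patches`, its binary `content` column, and the torchvision backbone are assumptions for the sketch, not the accelerator's exact code:

```python
import io
import pandas as pd
import torch
import torchvision.transforms as T
from torchvision.models import inception_v3
from PIL import Image
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

# Standard ImageNet preprocessing; InceptionV3 expects 299x299 inputs.
preprocess = T.Compose([
    T.Resize((299, 299)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@pandas_udf(ArrayType(FloatType()))
def featurize(content: pd.Series) -> pd.Series:
    model = inception_v3(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Identity()        # keep the 2048-d pooled features
    model.eval()
    embeddings = []
    with torch.no_grad():
        for raw in content:
            img = Image.open(io.BytesIO(raw)).convert("RGB")
            features = model(preprocess(img).unsqueeze(0)).squeeze(0)
            embeddings.append(features.numpy().tolist())
    return pd.Series(embeddings)

# Hypothetical usage: `patches` is a Delta table holding patch bytes.
features_df = spark.table("patches").withColumn("features", featurize("content"))
```

In practice the model weights would typically be loaded once per worker (for example with the iterator variant of `pandas_udf`) rather than once per batch, but the structure above is enough to show how Spark distributes the inference.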
This solution accelerator contains the following notebooks:
- `config`: configures paths and other settings. When setting up a cluster for patch generation for the first time, use the init script generated by the config notebook to install OpenSlide on your cluster.
- `1-create-annotation-deltalake`: downloads the annotations and writes them to Delta.
- `2-patch-generation`: generates patches from WSIs based on the annotations.
- `3-feature-extraction`: extracts image embeddings with `InceptionV3` in a distributed manner.
- `4-unsupervised-learning`: dimensionality reduction and cluster inspection with UMAP (see the first sketch after this list).
- `5-training`: tunes and trains a binary classifier for tumor/normal patches with PyTorch and logs the model with MLflow (see the second sketch after this list).
- `6-metastasis-heatmap`: uses the model trained in the previous step to generate a metastasis probability heatmap for a given slide (see the third sketch after this list).
- `definitions`: contains definitions of functions that are used in multiple notebooks (for example, patch generation and pre-processing).
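The dimensionality-reduction step in `4-unsupervised-learning` follows the usual umap-learn pattern. The sketch below uses random placeholder embeddings in place of the InceptionV3 features written by the feature-extraction notebook:

```python
import numpy as np
import umap

# Placeholder embeddings; in the accelerator these would be the 2048-d
# InceptionV3 features produced by the feature-extraction notebook.
embeddings = np.random.rand(1000, 2048)

# Project to 2-d for cluster inspection; n_components=3 gives the 3-d view.
coords_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
print(coords_2d.shape)   # (1000, 2)
```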
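For the training step in `5-training`, a minimal transfer-learning sketch looks like the following. The ResNet-18 backbone, hyperparameters, and random placeholder data are illustrative assumptions rather than the accelerator's actual choices:

```python
import mlflow
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

# Placeholder patches and labels; the accelerator reads real patches from Delta.
dummy = TensorDataset(torch.randn(32, 3, 224, 224), torch.randint(0, 2, (32,)))
train_loader = DataLoader(dummy, batch_size=8)

model = resnet18(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False                   # freeze the pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, 2)     # new tumor/normal head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

with mlflow.start_run():
    mlflow.log_param("lr", 1e-3)
    for epoch in range(2):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    mlflow.log_metric("train_loss", loss.item())
    mlflow.pytorch.log_model(model, "model")      # logged model can be reloaded for scoring
```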
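Finally, the heatmap in `6-metastasis-heatmap` is conceptually a sliding-window scoring loop over the slide. The sketch below uses a stand-in classifier and a dummy patch reader purely to show the shape of the computation; grid size, stride, and the model are assumptions:

```python
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as T
from PIL import Image

# Stand-in classifier; in the accelerator this would be the model trained above.
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))
to_tensor = T.Compose([T.Resize((64, 64)), T.ToTensor()])

def heatmap(read_patch, grid=(16, 16), patch_size=299):
    """read_patch(x, y, size) -> PIL.Image, e.g. a thin wrapper around
    openslide.OpenSlide.read_region for a real slide."""
    heat = np.zeros(grid)
    classifier.eval()
    with torch.no_grad():
        for i in range(grid[0]):
            for j in range(grid[1]):
                patch = read_patch(i * patch_size, j * patch_size, patch_size)
                logits = classifier(to_tensor(patch).unsqueeze(0))
                heat[i, j] = torch.softmax(logits, dim=1)[0, 1].item()  # P(tumor)
    return heat

# Hypothetical usage with a dummy reader that returns blank tiles.
dummy_reader = lambda x, y, size: Image.new("RGB", (size, size))
print(heatmap(dummy_reader).shape)   # (16, 16) probability map
```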
Copyright / License info of the notebook. Copyright [2021] the Notebook Authors. The source in this notebook is provided subject to the Apache 2.0 License. All included or referenced third party libraries are subject to the licenses set forth below.
| Author |
|---|
| Databricks Inc. |
Databricks Inc. (“Databricks”) does not dispense medical, diagnosis, or treatment advice. This Solution Accelerator (“tool”) is for informational purposes only and may not be used as a substitute for professional medical advice, treatment, or diagnosis. This tool may not be used within Databricks to process Protected Health Information (“PHI”) as defined in the Health Insurance Portability and Accountability Act of 1996, unless you have executed with Databricks a contract that allows for processing PHI, an accompanying Business Associate Agreement (BAA), and are running this notebook within a HIPAA Account. Please note that if you run this notebook within Azure Databricks, your contract with Microsoft applies.
To run this accelerator, clone this repo into a Databricks workspace. Attach the RUNME notebook to any cluster running a DBR 11.0 or later runtime, and execute the notebook via Run All. A multi-step job describing the accelerator pipeline will be created, and a link to it will be provided. Execute the multi-step job to see how the pipeline runs.
The job configuration is written in the RUNME notebook in JSON format. The cost associated with running the accelerator is the user's responsibility.