Tumor proliferation speed (tumor growth rate) is an important biomarker for predicting patient outcomes, and its proper assessment is crucial for informing treatment decisions. In a clinical setting, the most common assessment method is for a pathologist to count mitotic figures under a microscope. The manual nature and subjectivity of this process pose a reproducibility challenge, which has been the main motivation for efforts to automate the task with advanced ML techniques.
One of the main challenges in automating this task, however, is that whole slide images (WSIs) are very large: a single WSI can range from roughly 0.5 to 3.5 GB, which slows down the image pre-processing step required for any downstream ML application.
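Because a full slide rarely fits comfortably in memory, pre-processing typically reads small regions on demand. Below is a minimal sketch, assuming the `openslide-python` package is installed (the init script generated by the config notebook handles this) and using a hypothetical slide path, of how a single patch can be read from a WSI without loading the entire file:

```python
import openslide

# Hypothetical path to a Camelyon16 slide; adjust to your own storage location.
slide = openslide.OpenSlide("/dbfs/camelyon16/tumor_001.tif")
print(slide.dimensions)    # full-resolution (width, height) in pixels
print(slide.level_count)   # number of downsampled levels in the image pyramid

# read_region only loads the requested tile, so even multi-GB slides can be
# processed patch by patch; the result is an RGBA PIL image.
patch = slide.read_region(location=(50_000, 60_000), level=0, size=(299, 299))
patch = patch.convert("RGB")   # drop the alpha channel before feeding a model
slide.close()
```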
In this solution accelerator, we walk you through a step-by-step process that uses Databricks capabilities to perform image segmentation and pre-processing on WSIs and to train a binary classifier that produces a metastasis probability map over a whole slide image.
The data used in this solution accelerator is from the Camelyon16 Grand Challenge, along with annotations based on hand-drawn metastasis outlines. We use curated annotations for this dataset obtained from the Baidu Research GitHub repository.
We use Apache Spark's parallelization capabilities, via `pandas_udf`, both to generate tumor/normal patches based on the annotation data and to extract features with a pre-trained InceptionV3 network. We use the resulting embeddings to explore clusters of patches by visualizing them in 2D and 3D with UMAP. We then use transfer learning with PyTorch to train a convolutional network to classify tumor vs. normal patches, and finally use the resulting model to overlay a metastasis heatmap on a new slide.
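As an illustration of the distributed feature-extraction pattern, here is a minimal sketch of a pandas UDF that embeds patch images with a pre-trained InceptionV3. The table name `patches`, its binary `content` column, and the torchvision backbone are assumptions for the sketch, not the accelerator's exact code:

```python
import io
import pandas as pd
import torch
import torchvision.transforms as T
from torchvision.models import inception_v3
from PIL import Image
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

# Standard ImageNet preprocessing; InceptionV3 expects 299x299 inputs.
preprocess = T.Compose([
    T.Resize((299, 299)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@pandas_udf(ArrayType(FloatType()))
def featurize(content: pd.Series) -> pd.Series:
    model = inception_v3(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Identity()        # keep the 2048-d pooled features
    model.eval()
    embeddings = []
    with torch.no_grad():
        for raw in content:
            img = Image.open(io.BytesIO(raw)).convert("RGB")
            features = model(preprocess(img).unsqueeze(0)).squeeze(0)
            embeddings.append(features.numpy().tolist())
    return pd.Series(embeddings)

# Hypothetical usage: `patches` is a Delta table holding patch bytes.
features_df = spark.table("patches").withColumn("features", featurize("content"))
```

In practice the model weights would typically be loaded once per worker (for example with the iterator variant of `pandas_udf`) rather than once per batch, but the structure above is enough to show how Spark distributes the inference.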
This solution accelerator contains the following notebooks:
- `config`: configures paths and other settings. When setting up a cluster for patch generation for the first time, use the init script generated by the config notebook to install OpenSlide on your cluster.
- `1-create-annotation-deltalake`: downloads the annotations and writes them to Delta.
- `2-patch-generation`: generates patches from WSIs based on the annotations.
- `3-feature-extraction`: extracts image embeddings with `InceptionV3` in a distributed manner.
- `4-unsupervised-learning`: dimensionality reduction and cluster inspection with UMAP (see the first sketch after this list).
- `5-training`: tunes and trains a binary classifier for tumor/normal patches with PyTorch and logs the model with MLflow (see the second sketch after this list).
- `6-metastasis-heatmap`: uses the model trained in the previous step to generate a metastasis probability heatmap for a given slide (see the third sketch after this list).
- `definitions`: contains definitions of functions that are used in multiple notebooks (for example, patch generation and pre-processing).
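The dimensionality-reduction step in `4-unsupervised-learning` follows the usual umap-learn pattern. The sketch below uses random placeholder embeddings in place of the InceptionV3 features written by the feature-extraction notebook:

```python
import numpy as np
import umap

# Placeholder embeddings; in the accelerator these would be the 2048-d
# InceptionV3 features produced by the feature-extraction notebook.
embeddings = np.random.rand(1000, 2048)

# Project to 2-d for cluster inspection; n_components=3 gives the 3-d view.
coords_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
print(coords_2d.shape)   # (1000, 2)
```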
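For the training step in `5-training`, a minimal transfer-learning sketch looks like the following. The ResNet-18 backbone, hyperparameters, and random placeholder data are illustrative assumptions rather than the accelerator's actual choices:

```python
import mlflow
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

# Placeholder patches and labels; the accelerator reads real patches from Delta.
dummy = TensorDataset(torch.randn(32, 3, 224, 224), torch.randint(0, 2, (32,)))
train_loader = DataLoader(dummy, batch_size=8)

model = resnet18(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False                   # freeze the pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, 2)     # new tumor/normal head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

with mlflow.start_run():
    mlflow.log_param("lr", 1e-3)
    for epoch in range(2):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    mlflow.log_metric("train_loss", loss.item())
    mlflow.pytorch.log_model(model, "model")      # logged model can be reloaded for scoring
```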
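Finally, the heatmap in `6-metastasis-heatmap` is conceptually a sliding-window scoring loop over the slide. The sketch below uses a stand-in classifier and a dummy patch reader purely to show the shape of the computation; grid size, stride, and the model are assumptions:

```python
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as T
from PIL import Image

# Stand-in classifier; in the accelerator this would be the model trained above.
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))
to_tensor = T.Compose([T.Resize((64, 64)), T.ToTensor()])

def heatmap(read_patch, grid=(16, 16), patch_size=299):
    """read_patch(x, y, size) -> PIL.Image, e.g. a thin wrapper around
    openslide.OpenSlide.read_region for a real slide."""
    heat = np.zeros(grid)
    classifier.eval()
    with torch.no_grad():
        for i in range(grid[0]):
            for j in range(grid[1]):
                patch = read_patch(i * patch_size, j * patch_size, patch_size)
                logits = classifier(to_tensor(patch).unsqueeze(0))
                heat[i, j] = torch.softmax(logits, dim=1)[0, 1].item()  # P(tumor)
    return heat

# Hypothetical usage with a dummy reader that returns blank tiles.
dummy_reader = lambda x, y, size: Image.new("RGB", (size, size))
print(heatmap(dummy_reader).shape)   # (16, 16) probability map
```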
Copyright / License info of the notebook. Copyright [2021] the Notebook Authors. The source in this notebook is provided subject to the Apache 2.0 License. All included or referenced third party libraries are subject to the licenses set forth below.
| Author |
|---|
| Databricks Inc. |
Databricks Inc. (“Databricks”) does not dispense medical, diagnosis, or treatment advice. This Solution Accelerator (“tool”) is for informational purposes only and may not be used as a substitute for professional medical advice, treatment, or diagnosis. This tool may not be used within Databricks to process Protected Health Information (“PHI”) as defined in the Health Insurance Portability and Accountability Act of 1996, unless you have executed with Databricks a contract that allows for processing PHI, an accompanying Business Associate Agreement (BAA), and are running this notebook within a HIPAA Account. Please note that if you run this notebook within Azure Databricks, your contract with Microsoft applies.
To run this accelerator, clone this repo into a Databricks workspace. Attach the RUNME notebook to any cluster running a DBR 11.0 or later runtime, and execute the notebook via Run All. A multi-step job describing the accelerator pipeline will be created, and a link to it will be provided. Execute the multi-step job to see how the pipeline runs.
The job configuration is written in the RUNME notebook in JSON format. The cost associated with running the accelerator is the user's responsibility.