*Note: Biomarker Agents can now be deployed with Bedrock multi-agent collaboration in your AWS account!
Navigate to the multi-agent-collaboration branch of this repository and follow the instructions there to deploy the agents
Read more about these agents in this article: https://aws.amazon.com/blogs/machine-learning/accelerate-analysis-and-discovery-of-cancer-biomarkers-with-amazon-bedrock-agents/
The success rate for Phase I oncology clinical trials is significantly low. According to a study published in Nature Reviews Drug Discovery, the overall success rate for oncology drugs from Phase I to approval is around 5%, indicating a high failure rate of approximately 95%.
According to the National Cancer Institute, cancer biomarkers, also known as tumor markers, are biological molecules found in blood, other body fluids, or tissues that indicate the presence of cancer. Biomarkers for patient stratification can improve the probability of success in clinical development; the average is between double and triple, but it can be as high as five-fold.
In this solution, we show you how agentic workflows with LLMs from Amazon Bedrock leverage planning, tool-use, and self-reflection to transform complex oncology research queries into actionable insights. We define an example analysis pipeline, specifically for lung cancer survival with clinical, genomics and radiology modalities of biomarkers. We showcase a variety of tools including database retrieval with Text2SQL, statistical models and visual charts with scientific libraries, biomedical literature search with public APIs and internal evidence, and medical image processing with Amazon SageMaker jobs. We demonstrate advanced capabilities of agents for self-review and planning that help build trust with end users by breaking down complex tasks into a series of question and answers and showing the chain of thought to generate the final answer.
The biomarker analysis workflow is illustrated below that incorporates multimodal data, including clinical, genomic, and CT scan imaging data. We augment this pipeline with Agents for Amazon Bedrock.
The solution architecture is illustrated below. Agents for Amazon Bedrock enable generative AI applications to execute multistep tasks across company systems and data sources. We define our solution to include planning and reasoning with multiple tools including
- Biomarker query engine: Convert natural language questions to SQL statements and execute on an Amazon Redshift database of biomarkers.
- Scientific analysis and plotting engine: Use lifelines library with clinical and genomics data to build survival regression models like COX proportional hazard model and visualization like Kaplan Meier charts for survival analysis.
- External data API: Use PubMed apis to search biomedical literature for specific evidence.
- Internal literature: Use Knowledge Bases for Bedrock to give agents contextual information from internal literature evidence for RAG to deliver more relevant, accurate, and customized responses
- Medical imaging: Use Amazon SageMaker Jobs to augment agents with the capability to trigger asynchronous jobs with an ephemeral cluster to process CT scan images that include 2D slices to 3D volumes conversion, CT and segmentation masks alignment, radiomic feature extraction within the tumor region.
We reuse the multimodal data analysis pipeline from this solution(https://github.com/aws-samples/machine-learning-pipelines-for-multimodal-health-data/tree/sagemaker-soln-lcsp) for Non-Small Cell Lung Cancer (NSCLC). The Non-Small Cell Lung Cancer (NSCLC) Radiogenomics dataset consists a cohort of early stage NSCLC patients referred for surgical treatment. Prior to surgical procedures, Computed Tomography (CT) and Positron Emission Tomography/CT (PET/CT) are performed. Samples of tumor tissues were used to obtain mutation data and gene expresssion data by RNA sequencing technology. Clinical and demographic information were recorded for the patients as well. Each data modality (imaging, genomic, clincal) presents different view of a patient.
Details for each modality is described below.
Total RNA was extracted from the tumor tissue and analyzed with RNA sequencing technology. The dataset file that is available from the source was pre-processed using open-source tools including STAR v.2.3 for alignment and Cufflinks v.2.0.2 for expression calls. Further details can be found in [3]. The original dataset (GSE103584_R01_NSCLC_RNAseq.txt.gz) is also available in https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103584.
While the original data contains more than 22,000 genes, we keep 21 genes from 10 highly coexpressed gene clusters (metagenes) that were identified, validated in publicly available gene-expression cohorts, and correlated with prognosis [3]. These are genes corresponding to Metagenes 19, 10, 9, 4, 3, 21 in Table 2 in [3].
After gene selection from the source data, the dataset looks like
Case_ID | LRIG1 | HPGD | GDF15 | CDH2 | POSTN | ...... |
---|---|---|---|---|---|---|
R01-024 | 26.7037 | 3.12635 | 13.0269 | 0 | 36.4332 | ...... |
R01-153 | 15.2133 | 5.0693 | 0.908663 | 0 | 32.8595 | ...... |
R01-031 | 5.54082 | 1.23083 | 29.8832 | 1.13549 | 34.8544 | ...... |
R01-032 | 12.8391 | 7.21931 | 12.0701 | 0 | 7.77297 | ...... |
R01-033 | 33.7975 | 3.19058 | 5.43418 | 0 | 9.84029 | ...... |
The values denote expression level for each gene per patient. A higher number means that that specific gene is highly expressed in that specific tumor sample.
The clinical records are stored in CSV format. Each row corresponds to a patient and the columns represent information about the patients, including demographics, tumor stage, and survival status.
Case ID | Survival Status | Age at Histological Diagnosis | Weight (lbs) | Smoking status | Pack Years | Quit Smoking Year | Chemotherapy | Adjuvant Treatment | EGFR mutation status | ...... |
---|---|---|---|---|---|---|---|---|---|---|
R01-005 | Dead | 84 | 145 | Former | 20 | 1951 | No | No | Wildtype | ...... |
R01-006 | Alive | 62 | Not Collected | Former | Not Collected | nan | No | No | Wildtype | ...... |
R01-007 | Dead | 68 | Not Collected | Former | 15 | 1968 | Yes | Yes | Wildtype | ...... |
R01-008 | Alive | 73 | 102 | Nonsmoker | nan | nan | No | No | Wildtype | ...... |
R01-009 | Dead | 59 | 133 | Current | 100 | nan | No | No | Wildtype | ...... |
Medical imaging biomarkers of cancer promise improvements in patient care through advances in precision medicine. Compared to genomic biomarkers, imaging biomarkers provide the advantages of being non-invasive, and characterizing a heterogeneous tumor in its entirety, as opposed to limited tissue available via biopsy [2]. In this dataset, CT and PET/CT imaging sequences were acquired for patients prior to surgical procedures. Segmentation of tumor regions were annotated by two expert thoracic radiologists. Below is an example overlay of a tumor segmentation onto a lung CT scan (case R01-093).
Here is an example chain of thought sequence with the agent. 11 questions are listed in the image with their expected responses.
Important
Access to Amazon Bedrock foundation models (not granted by default). To gain access, follow the official documentation.
- Upload the
Infra_cfn.yaml
file from the amazon-bedrock-agents-cancer-biomarker-discovery repository to AWS CloudFormation. This template will set up:
Warning
Launching this stack will create 2 VPCs (Infrastructure and UI).
- Networking infrastructure (VPC, Subnets, etc.)
- Amazon Redshift database
- Bedrock Agent with Actions
- Knowledgebase
- Streamlit UI frontend
- Deploy the
Infra_cfn.yaml
template:- Default parameter values can remain unchanged
- Parameter descriptions:
BedrockModelId
: ID of the Foundation Model for the Agent (permissions scoped to Anthropic Claude 3 Sonnet model)EnvironmentName
: Differentiates the application if launched in the same AWS account (lowercase, one number, max 5 characters)RedshiftDatabaseName
: Name for the Redshift databaseRedshiftUserName
: Username for Redshift database loginRedshiftPassword
: Password for Redshift database loginGithubLink
: Default repository for the Agent (do not change)ImageTag
: Tag of the Docker image for Streamlit UI deployment
Note
Full deployment takes approximately 10-15 minutes. Stack can also be launched in us-east-1 or us-west-2 by clicking launch stack below
Region | Infra_cfn.yaml |
---|---|
us-east-1 | |
us-west-2 |
-
After stack launch is successful manually sync the Knowledgebase:
- Navigate to the Bedrock dashboard via AWS Console search
- Click the option icon (top left) to open the navigation bar
- Select "Knowledge bases" under the "Builder tools" tab
- Choose the Knowledgebase created by the CloudFormation template
- Scroll to the "Data Source" option box
- Select the data source (radio button) and click "Sync"
-
Access the UI:
- Navigate to AWS CloudFormation via AWS Console search
- Click the Streamlit nested stack (format:
<stackname>-StreamlitBuildNestedStack-<123ABCXXXX>
) - In the Outputs tab, find and click the StreamlitAppURL link
-
Fork the repository to your GitHub account. Ensure the fork remains public during development and testing.
-
Clone your forked repository to your local machine.
-
Update the GitHub URL in the following configuration files to point to your forked repository:
infra_cfn.yaml
agent_build.yaml
streamlit_build.yaml
-
For testing purposes, deploy the
infra_cfn.yaml
template to AWS CloudFormation.
-
Ensure you have forked the main repository: amazon-bedrock-agents-cancer-biomarker-discovery
-
Create a new branch in your forked repository for your changes.
-
Implement your changes, following the project's coding standards and guidelines.
-
Commit your changes with clear, concise commit messages.
-
Push your branch to your forked repository on GitHub.
-
Open a pull request from your branch to the main repository's
main
branch. -
Provide a clear description of your changes in the pull request, including any relevant issue numbers.
-
Be prepared to address any feedback or questions during the code review process.
This project is licensed under the MIT License. The open-souce packages used in this project are under these licenses.
Users of the dataset must abide by TCIA's Data Usage Policy and the Creative Commons Attribution 3.0 Unported License under which it has been published.
We reuse the multimodal data analysis pipeline from this solution(https://github.com/aws-samples/machine-learning-pipelines-for-multimodal-health-data/tree/sagemaker-soln-lcsp)
Important: This solution is for demonstrative purposes only. It is not for clinical use and is not a substitute for professional medical advice, diagnosis, or treatment. The associated notebooks, including the trained model and sample data, are not intended for production. It is each customers’ responsibility to determine whether they are subject to HIPAA, and if so, how best to comply with HIPAA and its implementing regulations. Before using AWS in connection with protected health information, customers must enter an AWS Business Associate Addendum (BAA) and follow its configuration requirements.