Copy Number Alteration Prediction from gene Expression in human cancers
Active development
Copy number alterations (CNAs) are important features of human cancer. The standard methods for CNA detection (CGH arrays, SNP arrays, DNA sequencing) rely on DNA, but DNA data are sometimes unavailable, especially in cancer studies (e.g. biopsies, legacy data). CNAPE fills this gap by predicting CNAs from RNA-seq gene expression data.
Before installing CNAPE, make sure that R is installed and that Rscript
is available on your system path ($PATH).
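One quick way to confirm the prerequisite is to ask the shell whether it can resolve Rscript. The sketch below is only an illustration: the helper name check_tool is hypothetical and is not part of CNAPE.

```shell
# Hypothetical helper (not part of CNAPE): report whether a command is on $PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# CNAPE is launched through Rscript, so check for it before running anything.
check_tool Rscript || echo "Install R, or add its bin directory to PATH, then retry."
```

If Rscript is reported as missing even though R is installed, the R bin directory is probably not on $PATH.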
A simple clone of the repository is enough for installation, since the necessary packages will be installed automatically when you run CNAPE.
git clone https://github.com/WangLabHKUST/CNAPE
CNAPE.R takes the gene expression matrix of the human cancer samples as input. RNA-seq data can be processed with TCGA's RNA-seq processing pipeline (i.e., reads are aligned to the human genome using MapSplice, and expression is quantified and normalized using RSEM against UCSC genes).
An example input file demonstrating the format of the input gene expression matrix can be found in the example/ folder.
The main function of CNAPE is packaged in cnape.R. Get your gene expression profile prepared, and run it like this:
Rscript cnape.R expressionMatrix outputPrefix
The output consists of two files, prefix.chromosome_level.cna.txt and prefix.arm_level.cna.txt, where prefix is the outputPrefix given on the command line. In both files, 1 means amplified, -1 means deleted, and 0 means no copy number change.
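To get a quick overview of a result file, the three call values can be tallied with awk. This is a minimal sketch under stated assumptions: it assumes the output is tab-delimited with one header row and one leading column of identifiers, which this README does not specify, so adjust the field separator and offsets to the actual layout. The function name count_calls is hypothetical.

```shell
# Hypothetical helper (not part of CNAPE): tally 1 / -1 / 0 calls in a result file.
# ASSUMPTION: tab-delimited, first row is a header, first column holds identifiers.
count_calls() {
    awk -F'\t' 'NR > 1 {
        for (i = 2; i <= NF; i++) {
            if ($i == 1) amp++          # amplified
            else if ($i == -1) del++    # deleted
            else if ($i == 0) neu++     # no copy number change
        }
    }
    END { printf "amplified=%d deleted=%d neutral=%d\n", amp, del, neu }' "$1"
}

# Usage: only runs if the (assumed) result file is present.
if [ -f prefix.arm_level.cna.txt ]; then
    count_calls prefix.arm_level.cna.txt
fi
```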
For chromosome- and arm-level CNAs, models trained on TCGA pan-cancer data are provided. After cloning CNAPE, go to the CNAPE folder and run:
./run_example.sh
Your result files, named example.chromosome_level.cna.txt and example.arm_level.cna.txt, should appear in the example folder. You can compare the results with the provided example.chromosome_level.cna.origional.txt and example.arm_level.cna.origional.txt.
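The comparison can be scripted rather than done by eye. The sketch below assumes it is run from the CNAPE folder and that the file names are exactly those listed above. The helper compare_results is hypothetical and not part of the repository. A byte-for-byte match is also an assumption: small differences may not indicate a problem.

```shell
# Hypothetical helper (not part of CNAPE): compare a new result with a reference file.
compare_results() {
    new="$1"
    ref="$2"
    if [ ! -f "$new" ] || [ ! -f "$ref" ]; then
        echo "missing: $new or $ref"
        return 2
    fi
    if cmp -s "$new" "$ref"; then
        echo "identical: $new"
    else
        echo "differs: $new"
        return 1
    fi
}

# Check both levels; "|| true" keeps the loop going when a pair differs or is missing.
for level in chromosome_level arm_level; do
    compare_results "example/example.$level.cna.txt" \
                    "example/example.$level.cna.origional.txt" || true
done
```

When a pair differs, running diff on the two files shows which calls changed.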
A more detailed example of gene-level CNA prediction is provided, using the open-access TCGA pan-glioma data. It shows how the models are formulated and trained, how they perform in testing, and how to extract the feature genes from the models.
The models are trained on the TCGA Pancancer Atlas data, using the glmnet package in R. Dependencies are installed automatically when the program runs.
For technical issues please send an email to [email protected] or [email protected].