Legacy Setup Instructions

Jason Walker edited this page Feb 4, 2020 · 1 revision

Initialization

Shell Variables

  • NOTE: The directory where you download the following config files will become the base directory for this tutorial.
  • You will need to update the $SOMATIC_HOME environment variable in Somatic_base_config.sh to match your local directory.
curl -L -k -o Somatic_config.sh https://raw.githubusercontent.com/genome/arvados_trial_wiki/master/Somatic_config.sh
curl -L -k -o Somatic_base_config.sh https://raw.githubusercontent.com/genome/arvados_trial_wiki/master/Somatic_base_config.sh
  • NOTE: This must be run each time a new terminal window is opened. These shell variables will NOT persist in your environment.
source Somatic_config.sh
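
Because these variables do not persist between terminals, a quick check before running later steps can catch a forgotten `source`. The following is a minimal bash sketch. `require_vars` is a hypothetical helper name (it is not part of the config scripts), and the example assumes Somatic_config.sh defines the variables it lists.

```shell
# Report which of the named variables are unset or empty in the current shell.
# Returns 0 when all are set, 1 otherwise. Requires bash (indirect expansion).
require_vars() {
    local name missing=0
    for name in "$@"; do
        # ${!name} expands to the value of the variable whose name is stored in $name
        if [ -z "${!name:-}" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   require_vars SOMATIC_HOME VEP_CACHE JAVA_EIGHT || echo "run: source Somatic_config.sh"
```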

Working Directories

mkdir -p $SOMATIC_HOME/software
mkdir -p $SOMATIC_HOME/logs
mkdir -p $SOMATIC_HOME/refseq
mkdir -p $SOMATIC_HOME/fastq
mkdir -p $SOMATIC_HOME/alignments
mkdir -p $SOMATIC_HOME/varscan
mkdir -p $SOMATIC_HOME/strelka/exome
mkdir -p $SOMATIC_HOME/strelka/wgs
mkdir -p $SOMATIC_HOME/mutect/exome
mkdir -p $SOMATIC_HOME/mutect/wgs
mkdir -p $SOMATIC_HOME/pindel
mkdir -p $VEP_CACHE
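
The same directory layout can be created in a loop, which keeps the list in one place and quotes the base path in case it contains spaces. This is only a sketch; `make_workdirs` is a hypothetical helper name.

```shell
# Create the tutorial's working directories under the base directory given as $1.
make_workdirs() {
    local base=$1 sub
    for sub in software logs refseq fastq alignments varscan \
               strelka/exome strelka/wgs mutect/exome mutect/wgs pindel; do
        mkdir -p "$base/$sub" || return 1
    done
}

# Example:
#   make_workdirs "$SOMATIC_HOME" && mkdir -p "$VEP_CACHE"
```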

Installs and Downloads

Software

gsutil is required to download files from the Broad Google Cloud Platform. For install instructions for your specific architecture, see the gsutil documentation.

NOTE: This did not work for me on a Linux workstation; I installed it on my Mac.

sudo easy_install -U pip
sudo pip install gsutil
cd $SOMATIC_HOME/software

curl -L -k -o htslib-1.3.2.tar.bz2 https://github.com/samtools/htslib/releases/download/1.3.2/htslib-1.3.2.tar.bz2
tar --bzip2 -xvf htslib-1.3.2.tar.bz2
cd htslib-1.3.2
./configure  --enable-plugins --enable-libcurl --prefix=$SOMATIC_HOME/software
make
make install

$SOMATIC_HOME/software/bin/tabix 
cd $SOMATIC_HOME/software
curl -L -k -o samtools-1.3.1.tar.bz2  https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
tar --bzip2 -xvf samtools-1.3.1.tar.bz2
cd samtools-1.3.1
./configure --with-htslib=$SOMATIC_HOME/software/htslib-1.3.2 --prefix=$SOMATIC_HOME/software
make
make install

$SOMATIC_HOME/software/bin/samtools
cd $SOMATIC_HOME/software
curl -L -k -o bwa-0.7.15.tar.bz2 https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.15.tar.bz2/download
tar --bzip2 -xvf bwa-0.7.15.tar.bz2
cd  bwa-0.7.15
make

ln -s $SOMATIC_HOME/software/bwa-0.7.15/bwa $SOMATIC_HOME/software/bin/bwa

bwa
cd $SOMATIC_HOME/software
curl -L -k -o samblaster-v.0.1.24.tar.gz https://github.com/GregoryFaust/samblaster/releases/download/v.0.1.24/samblaster-v.0.1.24.tar.gz
tar -xzvf samblaster-v.0.1.24.tar.gz
cd samblaster-v.0.1.24
make

ln -s $SOMATIC_HOME/software/samblaster-v.0.1.24/samblaster $SOMATIC_HOME/software/bin/samblaster

samblaster -h
cd $SOMATIC_HOME/software
curl -L -k -o sambamba_v0.6.4_linux.tar.bz2 https://github.com/lomereiter/sambamba/releases/download/v0.6.4/sambamba_v0.6.4_linux.tar.bz2
tar --bzip2 -xvf sambamba_v0.6.4_linux.tar.bz2

ln -s $SOMATIC_HOME/software/sambamba_v0.6.4 $SOMATIC_HOME/software/bin/sambamba

sambamba -h
cd $SOMATIC_HOME/software
curl -L -k -o picard-tools-2.4.1.zip https://github.com/broadinstitute/picard/releases/download/2.4.1/picard-tools-2.4.1.zip
unzip picard-tools-2.4.1.zip

$JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar -h 
  • NOTE: Replace localhost in step 2 with the name of your host.
  • Download instructions may differ depending on the location of your host and configuration of your network.
  • Steps 3 and 4 should be run from the localhost.
  1. Manually download GATK after accepting the license : https://www.broadinstitute.org/gatk/download/auth?package=GATK
  2. Copy the download : scp Downloads/GenomeAnalysisTK-3.6.tar.bz2 localhost:$SOMATIC_HOME/software/.
  3. Unzip the archive : tar --bzip2 -xvf GenomeAnalysisTK-3.6.tar.bz2
  4. Test with JAVA8 : $JAVA_EIGHT -jar GenomeAnalysisTK.jar -h
cd $SOMATIC_HOME/software
curl -L -k -o verifyBamIDLibStatGen.1.1.3.tgz https://github.com/statgen/verifyBamID/releases/download/v1.1.3/verifyBamIDLibStatGen.1.1.3.tgz
tar -xzvf verifyBamIDLibStatGen.1.1.3.tgz
cd verifyBamID_1.1.3
make

ln -s $SOMATIC_HOME/software/verifyBamID_1.1.3/verifyBamID/bin/verifyBamID $SOMATIC_HOME/software/bin/verifyBamID

verifyBamID
cd $SOMATIC_HOME/software
curl -L -k -o VarScan.v2.4.2.jar https://github.com/dkoboldt/varscan/releases/download/2.4.2/VarScan.v2.4.2.jar
$JAVA_EIGHT -jar $SOMATIC_HOME/software/VarScan.v2.4.2.jar
cd $SOMATIC_HOME/software

curl -L -k -o strelka-2.7.1.centos5_x86_64.tar.bz2 https://github.com/Illumina/strelka/releases/download/v2.7.1/strelka-2.7.1.centos5_x86_64.tar.bz2
tar --bzip2 -xvf strelka-2.7.1.centos5_x86_64.tar.bz2
bash $SOMATIC_HOME/software/strelka-2.7.1.centos5_x86_64/bin/runStrelkaSomaticWorkflowDemo.bash
cd $SOMATIC_HOME/software
curl -L -k -o bcftools-1.3.1.tar.bz2 https://github.com/samtools/bcftools/releases/download/1.3.1/bcftools-1.3.1.tar.bz2
tar --bzip2 -xvf bcftools-1.3.1.tar.bz2
cd bcftools-1.3.1
make prefix=$SOMATIC_HOME/software install
cd ../
bcftools -h
cd $SOMATIC_HOME/software
curl -L -k -o pindel-0.2.5b8.tar.gz https://github.com/genome/pindel/archive/v0.2.5b8.tar.gz
tar -xzvf pindel-0.2.5b8.tar.gz
cd pindel-0.2.5b8/
./INSTALL $SOMATIC_HOME/software/htslib-1.3.2
ln -s $SOMATIC_HOME/software/pindel-0.2.5b8/pindel $SOMATIC_HOME/software/bin/pindel
ln -s $SOMATIC_HOME/software/pindel-0.2.5b8/pindel2vcf $SOMATIC_HOME/software/bin/pindel2vcf
cd $SOMATIC_HOME
pindel -h
cd $SOMATIC_HOME/software/
git clone https://github.com/Ensembl/ensembl-vep.git --branch release/87 --single-branch
cd $SOMATIC_HOME/software/ensembl-vep
unset PERL5LIB
/usr/bin/perl INSTALL.pl --NO_HTSLIB --CACHEDIR $VEP_CACHE
#During the installation make sure to accept (y) when asked whether you'd like to install cache files, fastas, and plugins.
#Install all homo sapiens build 38 cache files (options 42 44 46), the homo sapiens fasta (option 28), and the Downstream plugin (option 24).
wget -O $VEP_CACHE/Plugins/Wildtype.pm https://raw.githubusercontent.com/griffithlab/pVAC-Seq/master/pvacseq/VEP_plugins/Wildtype.pm --no-check-certificate

  • NOTE: This requires cmake 2.8.3 or higher
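
One way to check the cmake requirement before building is a small version comparison. This sketch assumes a `sort` that supports `-V` (version sort, as in GNU coreutils); `version_ge` is a hypothetical helper name.

```shell
# True when version $1 is greater than or equal to version $2.
# Relies on "sort -V", which orders 2.8.3 before 2.8.12.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (assumes the first line of "cmake --version" ends with the version number):
#   have=$(cmake --version | head -n 1 | awk '{print $NF}')
#   version_ge "$have" 2.8.3 || echo "cmake $have is older than 2.8.3" >&2
```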
cd $SOMATIC_HOME/software/
mkdir git
cd git
git clone --recursive https://github.com/genome/bam-readcount.git

cd $SOMATIC_HOME/software/
mkdir bam-readcount
cd bam-readcount
export SAMTOOLS_ROOT=$SOMATIC_HOME/software/samtools-1.3.1
cmake $SOMATIC_HOME/software/git/bam-readcount
make
./bin/bam-readcount
cd $SOMATIC_HOME/software/
curl -L -k -o fpfilter.pl https://raw.githubusercontent.com/genome/fpfilter-tool/master/fpfilter.pl
/usr/bin/perl $SOMATIC_HOME/software/fpfilter.pl
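
Several of the test commands above (`bwa`, `samblaster -h`, `pindel -h`, and so on) only work if `$SOMATIC_HOME/software/bin` is on your PATH, which presumably the config script arranges. A sketch for checking that the expected binaries and symlinks are in place; `missing_tools` is a hypothetical helper name.

```shell
# Print the name of every listed tool that is not an executable regular file
# in directory $1 (symlinks are followed). Returns 0 when nothing is missing.
missing_tools() {
    local bindir=$1 tool status=0
    shift
    for tool in "$@"; do
        if [ ! -f "$bindir/$tool" ] || [ ! -x "$bindir/$tool" ]; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example:
#   missing_tools "$SOMATIC_HOME/software/bin" tabix samtools bwa samblaster \
#       sambamba verifyBamID bcftools pindel pindel2vcf
```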

Reference Sequence

mkdir -p $SOMATIC_REFSEQ_DIR

cd $SOMATIC_REFSEQ_DIR

curl -L -k -o $SOMATIC_REFSEQ_FASTA  $GENOME_URI

TODO: Make WGS intervals using Picard tools

cd $SOMATIC_REFSEQ_DIR

gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/$REFSEQ_DBSNP $SOMATIC_REFSEQ_DIR
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/$REFSEQ_KNOWN_INDELS $SOMATIC_REFSEQ_DIR
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/$REFSEQ_MILLS_INDELS $SOMATIC_REFSEQ_DIR

gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/wgs_calling_regions.hg38.interval_list $SOMATIC_REFSEQ_DIR
gsutil cp -r gs://genomics-public-data/resources/broad/hg38/v0/scattered_calling_intervals/ $SOMATIC_REFSEQ_DIR
cd $SOMATIC_REFSEQ_DIR

sftp "your_email_address"@sftp-cancer.sanger.ac.uk
get /files/grch38/cosmic/v79/VCF/CosmicCodingMuts.vcf.gz
get /files/grch38/cosmic/v79/VCF/CosmicNonCodingVariants.vcf.gz

zgrep "^#" CosmicCodingMuts.vcf.gz > VCF_Header
zgrep -v "^#" CosmicCodingMuts.vcf.gz | awk '{print "chr"$0}' | sed 's/^chrMT/chrM/' > CosmicCodingMuts.clean
zgrep -v "^#" CosmicNonCodingVariants.vcf.gz | awk '{print "chr"$0}' | sed 's/^chrMT/chrM/' > CosmicNonCodingVariants.clean

cat CosmicCodingMuts.clean CosmicNonCodingVariants.clean | sort -gk 2,2 > Cosmic_v79
cat VCF_Header Cosmic_v79 > Cosmic_v79.vcf

rm VCF_Header CosmicCodingMuts.clean CosmicNonCodingVariants.clean Cosmic_v79

$JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar SortVcf I=$SOMATIC_REFSEQ_DIR/Cosmic_v79.vcf O=$SOMATIC_REFSEQ_DIR/Cosmic_v79.dictsorted.vcf SEQUENCE_DICTIONARY=$SOMATIC_REFSEQ_DICT
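
The renaming step in the COSMIC commands above (prefix every record with `chr`, then map `chrMT` to `chrM`) can be tried on a few lines before running it on the full files. This sketch is a single-pass variant that also passes header lines through unchanged, rather than splitting the header off first. `add_chr_prefix` is a hypothetical helper name.

```shell
# Read VCF text on stdin. Header lines (starting with "#") are printed as-is;
# every other line gets a "chr" prefix, and the contig chrMT is renamed chrM.
add_chr_prefix() {
    awk '/^#/ { print; next } { print "chr" $0 }' | sed 's/^chrMT/chrM/'
}

# Example:
#   zcat CosmicCodingMuts.vcf.gz | add_chr_prefix | head
```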

Exome Targets/Probes

cd $SOMATIC_HOME/refseq/

curl -L -k -o xgen-exome-research-panel-probes.bed http://www.idtdna.com/pages/docs/default-source/xgen-libraries/xGen-Lockdown-Panels/xgen-exome-research-panel-probes.bed?sfvrsn=4

curl -L -k -o xgen-exome-research-panel-targets.bed http://www.idtdna.com/pages/docs/default-source/xgen-libraries/xGen-Lockdown-Panels/xgen-exome-research-panel-targets.bed?sfvrsn=6

Download the UCSC hg19ToHg38 chain file


cd $SOMATIC_HOME/refseq/

curl -L -k -o hg19ToHg38.over.chain.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
gunzip hg19ToHg38.over.chain.gz

Index

Reference Sequence

BWA Aligner Index

bsub -q $LSB_QUEUE -o $SOMATIC_HOME/logs/bwa_index.out -e $SOMATIC_HOME/logs/bwa_index.err -M ${INDEX_RAM_GB}000000 -R "select[mem>=${INDEX_RAM_GB}000] rusage[mem=${INDEX_RAM_GB}000]" $SOMATIC_HOME/software/bwa-0.7.15/bwa index $SOMATIC_REFSEQ_FASTA
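
The memory arguments in the command above are built by appending zeros to `$INDEX_RAM_GB`. The sketch below prints what they expand to for a given value, which makes the request easy to inspect before submitting. `lsf_mem_args` is a hypothetical helper name. How LSF interprets the numbers depends on the cluster configuration; on many installations `-M` is read in KB and `rusage[mem=...]` in MB, but treat that as an assumption to confirm locally.

```shell
# Print the bsub memory arguments for a whole number of gigabytes, using the
# same "append zeros" expansion as the command above.
lsf_mem_args() {
    local gb=$1
    printf -- '-M %s000000 -R "select[mem>=%s000] rusage[mem=%s000]"\n' "$gb" "$gb" "$gb"
}

# Example:
#   lsf_mem_args 8
#   prints: -M 8000000 -R "select[mem>=8000] rusage[mem=8000]"
```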

Picard Sequence Dictionary

bsub -q $LSB_QUEUE -o $SOMATIC_HOME/logs/picard_dict.out -e $SOMATIC_HOME/logs/picard_dict.err $JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar CreateSequenceDictionary R=$SOMATIC_REFSEQ_FASTA O=$SOMATIC_REFSEQ_DICT GENOME_ASSEMBLY=$GENOME_ASSEMBLY URI=$GENOME_URI SPECIES=$GENOME_SPECIES

Samtools FAIDX

bsub -q $LSB_QUEUE -o $SOMATIC_HOME/logs/samtools_faidx.out -e $SOMATIC_HOME/logs/samtools_faidx.err $SOMATIC_HOME/software/bin/samtools faidx $SOMATIC_REFSEQ_FASTA

GATK Known Sites

dbSNP VCF

bgzip $SOMATIC_REFSEQ_DIR/Homo_sapiens_assembly38.dbsnp138.vcf
tabix -p vcf $SOMATIC_REFSEQ_DIR/Homo_sapiens_assembly38.dbsnp138.vcf.gz

Known Indels

tabix -p vcf $SOMATIC_REFSEQ_DIR/Homo_sapiens_assembly38.known_indels.vcf.gz

Mills Indels

tabix -p vcf $SOMATIC_REFSEQ_DIR/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

COSMIC VCF

bgzip $SOMATIC_REFSEQ_DIR/Cosmic_v79.dictsorted.vcf
tabix -p vcf $SOMATIC_REFSEQ_DIR/Cosmic_v79.dictsorted.vcf.gz

Target Interval Lists

Exome

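NOTE: The hg19 sequence dictionary used here comes from GMS @ MGI. The SD= path below is specific to that site; substitute the path to your own hg19 sequence dictionary.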

$JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar BedToIntervalList I=$SOMATIC_HOME/refseq/xgen-exome-research-panel-probes.bed SD=/gscmnt/gc4096/info/model_data/2871743894/build108563338/seqdict/seqdict.sam O=$SOMATIC_HOME/refseq/xgen-exome-research-panel-probes.interval_list

$JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar BedToIntervalList I=$SOMATIC_HOME/refseq/xgen-exome-research-panel-targets.bed SD=/gscmnt/gc4096/info/model_data/2871743894/build108563338/seqdict/seqdict.sam O=$SOMATIC_HOME/refseq/xgen-exome-research-panel-targets.interval_list
$JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar LiftOverIntervalList CHAIN=$SOMATIC_HOME/refseq/hg19ToHg38.over.chain SD=$SOMATIC_REFSEQ_DICT O=$SOMATIC_HOME/refseq/GRCh38DH/xgen-exome-research-panel-probes.interval_list I=$SOMATIC_HOME/refseq/xgen-exome-research-panel-probes.interval_list

$JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar LiftOverIntervalList CHAIN=$SOMATIC_HOME/refseq/hg19ToHg38.over.chain SD=$SOMATIC_REFSEQ_DICT O=$SOMATIC_HOME/refseq/GRCh38DH/xgen-exome-research-panel-targets.interval_list I=$SOMATIC_HOME/refseq/xgen-exome-research-panel-targets.interval_list

IntervalListToBed using Perl

cat $SOMATIC_HOME/refseq/GRCh38DH/xgen-exome-research-panel-probes.interval_list | /usr/bin/perl -M5.10.0 -ne 'if(substr($_,0,1) ne q{@}) { chomp; my @c = split "\t"; say(join("\t", $c[0], $c[1]-1, $c[2])); }' > $SOMATIC_HOME/refseq/GRCh38DH/xgen-exome-research-panel-probes.bed

cat $SOMATIC_HOME/refseq/GRCh38DH/xgen-exome-research-panel-targets.interval_list | /usr/bin/perl -M5.10.0 -ne 'if(substr($_,0,1) ne q{@}) { chomp; my @c = split "\t"; say(join("\t", $c[0], $c[1]-1, $c[2])); }' > $SOMATIC_HOME/refseq/GRCh38DH/xgen-exome-research-panel-targets.bed
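
The Perl one-liners above drop the `@` header lines and subtract 1 from the start coordinate, because interval lists are 1-based and BED starts are 0-based. For readers more comfortable with awk, an equivalent sketch follows; `interval_list_to_bed` is a hypothetical helper name.

```shell
# Convert a Picard interval list on stdin to three-column BED on stdout:
# skip "@" header lines and shift the 1-based start to a 0-based start.
interval_list_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" } !/^@/ { print $1, $2 - 1, $3 }'
}

# Example:
#   interval_list_to_bed < probes.interval_list > probes.bed
```

An interval `chr1 101 200` becomes the BED record `chr1 100 200`; the end coordinate is unchanged because BED ends are exclusive.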
mkdir -p $SOMATIC_HOME/refseq/GRCh38DH/xgen-exome-research-panel-targets

$JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar IntervalListTools INPUT=$SOMATIC_HOME/refseq/GRCh38DH/xgen-exome-research-panel-targets.interval_list SCATTER_COUNT=$REFERENCE_SEQUENCE_CHUNKS OUTPUT=$SOMATIC_HOME/refseq/GRCh38DH/xgen-exome-research-panel-targets

WGS

Whole-Genome Interval List

awk '{print $1"\t1\t"$2"\t+\t"$1}' ${SOMATIC_REFSEQ_FASTA}.fai | cat $SOMATIC_REFSEQ_DICT - > $SOMATIC_REFSEQ_DIR/$GENOME_BASENAME.interval_list
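
The awk step above turns each `.fai` line (contig name, length, ...) into one interval covering the whole contig on the `+` strand, named after the contig. A sketch for previewing that conversion on its own, without the dictionary header; `fai_to_intervals` is a hypothetical helper name.

```shell
# Read .fai lines on stdin and print one interval-list record per contig:
# name, start 1, end = contig length, strand +, interval name = contig name.
fai_to_intervals() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, 1, $2, "+", $1 }'
}

# Example:
#   fai_to_intervals < "${SOMATIC_REFSEQ_FASTA}.fai" | head -n 3
```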

Autosomal Chromosome Interval List

egrep 'chr[0-9]{1,2}\s' ${SOMATIC_REFSEQ_FASTA}.fai | awk '{print $1"\t1\t"$2"\t+\t"$1}' | cat $SOMATIC_REFSEQ_DICT - > $SOMATIC_REFSEQ_DIR/${GENOME_BASENAME}_autosomal.interval_list

Autosomal + Sex Chromosome Interval List

egrep '^chr([0-9]{1,2}|X|Y)\s' ${SOMATIC_REFSEQ_FASTA}.fai | awk '{print $1"\t1\t"$2"\t+\t"$1}' | cat $SOMATIC_REFSEQ_DICT - > $SOMATIC_REFSEQ_DIR/${GENOME_BASENAME}_autosomal_plus_sex.interval_list
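
To see which contig names a filter of this kind keeps, it can help to run it on a handful of sample `.fai` lines first. The sketch below uses an anchored alternation and the POSIX `[[:space:]]` class, so it does not depend on GNU grep's `\s`. `select_primary_contigs` is a hypothetical helper name, and the sample names in the example are illustrative.

```shell
# Keep only .fai lines whose contig name is chr1..chr99, chrX, or chrY.
# The trailing whitespace match excludes names such as chr1_KI270706v1_random.
select_primary_contigs() {
    grep -E '^chr([0-9]{1,2}|X|Y)[[:space:]]'
}

# Example:
#   select_primary_contigs < "${SOMATIC_REFSEQ_FASTA}.fai" | cut -f1
```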

NOTE: Broad appears to use a higher value for Ns (>1) when making its list of calling regions: its list has only 356 regions, compared with 592 when using 1. Setting MAX_TO_MERGE to 100 resulted in 324 regions, and 99 resulted in 533; the difference is still unexplained. Next step: compare the lists with kdiff3.


$JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar ScatterIntervalsByNs R=$SOMATIC_REFSEQ_FASTA O=$SOMATIC_REFSEQ_DIR/${GENOME_BASENAME}_excludeNs.interval_list OT=ACGT N=100
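
The region counts discussed in the note can be reproduced by counting the non-header lines of an interval list. This sketch assumes the usual layout in which header lines start with `@`; `count_intervals` is a hypothetical helper name.

```shell
# Print the number of interval records (lines not starting with "@") in file $1.
count_intervals() {
    grep -vc '^@' "$1"
}

# Example: compare the generated list with the one downloaded from Broad.
#   count_intervals "$SOMATIC_REFSEQ_DIR/${GENOME_BASENAME}_excludeNs.interval_list"
#   count_intervals "$SOMATIC_REFSEQ_DIR/wgs_calling_regions.hg38.interval_list"
```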

Picard IntervalListTools Intersect ExcludeNs with Autosomal + Sex Interval List

$JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar IntervalListTools I=$SOMATIC_REFSEQ_DIR/${GENOME_BASENAME}_excludeNs.interval_list I=$SOMATIC_REFSEQ_DIR/${GENOME_BASENAME}_autosomal_plus_sex.interval_list ACTION=INTERSECT O=$SOMATIC_REFSEQ_DIR/${GENOME_BASENAME}_calling_regions.interval_list
$JAVA_EIGHT -jar $SOMATIC_HOME/software/picard-tools-2.4.1/picard.jar IntervalListTools I=$SOMATIC_REFSEQ_DIR/${GENOME_BASENAME}_calling_regions.interval_list SCATTER_COUNT=$REFERENCE_SEQUENCE_CHUNKS OUTPUT=$SOMATIC_REFSEQ_DIR