Code for analyzing RNAseq data from breast cell lines expressing PcTF, and code for making figures.
Olney, K. C., Nyer, D. B., Wilson Sayres, M. A. & Haynes, K. Activation of tumor suppressor genes in breast cancer cells by a synthetic chromatin effector. (2017). bioRxiv 186056; doi: https://doi.org/10.1101/186056
Fastq and CuffDiff output files are available at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database (accession GSE103520)
RNA-seq reads were quality-checked using FastQC before and after trimming and filtering. Reads shorter than 50 bases, and base reads below the PHRED-scaled threshold quality of 10 at the 5’ end or 25 at the trailing 3’ end as well as below the average quality of 30 (within a sliding window of 4 bases) were filtered or clipped using TrimmomaticSE (33). A combined reference genome index and dictionary for GRCH38.p7 (1-22, X, MT, and non-chromosomal sequences) plus the full coding region of PcTF was generated using Spliced Transcripts Alignment to Reference (STARv2.5.2b)and the picard tools (version 1.1.19). Trimmed RNA-seq reads were mapped, and splice junctions extracted, using STARv2.5.2b read aligner. Bamtools2.4.0 was used to determine alignment quality (‘stats’ command), sort reads, mark duplicates, add read groups, and to index the BAM read files. CuffDiff (Cufflinks package), was used to run pairwise comparisons to identify significant (q ≤ 0.05) differential expression. CummeRbund was used to calculate distances between features (JSD plots). R ggplot2 and VennDiagrams were used to generate heat maps and Venn diagrams respectively
- Download or obtain data
- FastQC to check the quality of the raw reads
- Trim fastq files for quality and to remove adaptors.
- FastQC to check the quality of the trimmed reads
- Obtain reference genome and gene annotation files
- Generate genome indexes
- STAR: map transcript reads to the reference genome and identify splice junctions
- Check quality of raw BAM files
- Sort BAM files, to be in the same order as the reference for downstream analysis
- Mark duplicates, duplicates will not be removed but will be marked for quality checks
- Add read groups to BAM files, this done to keep the sample ids organized when creating the merged vcf file
- Check quality of add read group BAM files
- Merge lanes into one BAM file
- Index BAM files
- Check quality of merged BAM file
- Identify genes that differentially expressed
- Figure 1. Comparisons of transcription profiles of three model breast cancer lines (MCF7, BT-549, BT-474) and a control non-cancer line (MCF10A).
- Figure 2. PcTF-expressing breast tissue-derived cell lines show upregulation of PRC-repressed genes and other genes.
- Figure 3. QRT-PCR analysis of gene expression in stable, transgenic PcTF-expressing cells.
- Figure 4. Expression profiles of PcTF-responsive and nonresponsive genes in MCF7.
fastqc http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic
STAR https://github.com/alexdobin/STAR
bamtools https://github.com/pezmaster31/bamtools
cuffdiff http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/
cummeRbund http://compbio.mit.edu/cummeRbund/manual_2_0.html
ggplot2 http://ggplot2.tidyverse.org/reference/
VennDiagram https://cran.r-project.org/web/packages/VennDiagram/VennDiagram.pdf
GEO accession number Fastq and CuffDiff output files are available at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database (accession GSE103520)
Fastqc reads raw sequence data from high throughput sequencers and runs a set of quality checks to produce a report. Best reports are those whose "per base sequence quality" are included in the green area of the graph & kmer content is good or average.
Example command: $fastqc sampleID.fastq
Reports were saved in fastq_files directory in Project /Project/fastq_files This command will create two outputs: an .html file & an .zip file. Will output sampleID_fastqc.html and sampleID_fastqc.zip files
The selection of trimming steps and their associated parameters are supplied on the command line. The parameters selected were slidingwindow:4:30 leading10 trailing25 minlen50 phred33. They were chosen based on a better per sequence quality base and kmer content.
Example command: $java -jar trimmomatic-0.36.jar SE -phred33 input.fq.gz output.fq.gz ILLUMINACLIP:/adapters/TruSeq3-SE.fa:2:30:10 LEADING:10 TRAILING:25 SLIDINGWINDOW:4:30 MINLEN:50
Example command: $fastqc sampleID_minlen50_sliding430_leading30_trailing40.fastq
Obtain reference genome and gene annotation file to be used for mapping reads. Version GRCh38.p7 from gencode. http://www.gencodegenes.org/releases/current.html Nucleotide sequence of the GRCh38.p7 genome assembly version on all regions, including reference chromosomes, scaffolds, assembly patches and haplotypes
Example command $wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh38.p7.genome.fa.gz
Comprehensive gene annotation gencode.v25.chr_patch_hapl_scaff.annotation.gtf.gz Contains the comprehensive gene annotation on the reference chromosomes, scaffolds, assembly patches and alternate loci (haplotypes)
Example command $wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.chr_patch_hapl_scaff.annotation.gtf.gz
A .dict dictionary of the contig names and sizes and a .fai fasta index file allow efficient random access to the reference bases for downstream analysis and mapping
Index reference genome using STAR
Example command: $STAR --runMode genomeGenerate --genomeDir /GRCh38.p7/ --genomeFastaFiles GRCh38.p7.genome.fa.gz --sjdbGTFfile gencode.v25.chr_patch_hapl_scaff.annotation.gtf.gz --runThreadN 8
Create reference dictionrary using Picard tools CreateSequenceDictionary
Example command: $java -Xmx14g -jar picard.jar CreateSequenceDictionary R=GRCh38.p7.genome.fa O=GRCh38.p7.genome.fa.dict
The user supplies the genome files generated in the pervious step (generate genome indexes), as well as the RNA-seq reads (sequences) in the form of FASTA or FASTQ files. STAR maps the reads to the genome, and writes several output files, such as alignments (SAM/BAM), mapping summary statistics, splice junctions, unmapped reads, signal (wiggle) tracks etc. Mapping is controlled by a variety of input parameters (options)
STAR highly recommends using --sjdbGTFfile which specifies the path to the file with annotated transcripts in the standard GTF format. STAR will extract splice junctions from this file and use them to greatly improve accuracy of the mapping. Compatibility with Cufflinks/Cuffdiff.For unstranded RNA-seq data, Cufflinks/Cuffdiff require spliced alignments with XS strand attribute, which STAR will generate with --outSAMstrandField intronMotif option. As required, the XS strand attribute will be generated for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e. containing only non-canonical unannotated junctions) will be suppressed.
STAR: maps fastq files to the reference genome and identifies splice junctions.
Example comamnd: $STAR --genomeDir Project_genome --genomeLoad LoadAndKeep --readFilesIn sampleID.fastq --outSAMtype BAM Unsorted --outFileNamePrefix sampleID_pass1. --runThreadN 8
Will print basic statistics from input BAM file(s)
Example command: $bamtools stats -in sampleID_pass2.Aligned.out.bam > sampleID_pass2.txt
An appropriate @HD-SO sort order header tag will be added
Example command: $bamtools sort -in sampleID.bam -out sampleID.sorted.bam
Mark duplicates: "Flags" where the duplicate reads are
Example command: $java -Xmx8g -jar picard.jar MarkDuplicates INPUT=sampleID.sorted.bam OUTPUT=sampleID.sorted.markdup.bam METRICS_FILE=sampleID.markdup.picardMetrics.txt REMOVE_DUPLICATES=false ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT
For each sample, add a read group to the mark duplicate BAM files (a read group is a "tag" such as a sample ID)
Example command: $java -Xmx8g -jar picard.jar AddOrReplaceReadGroups INPUT=sampleID.sorted.rmdup.bam OUTPUT=sampleID.sorted.rmdup.addReadGr.bam RGLB=sampleID RGPL=machineUsed RGPU=laneUsed RGSM=sampleName RGCN=location RGDS=species VALIDATION_STRINGENCY=LENIENT
For each sample get the read stats for the remove duplicates and add read groups BAM files. Statistics on the BAM files should be the same as before the previous step when read groups were modified. Compare stat results for each sample to the markdup.bam file (sanity check: is there the same number of reads as the original BAM file?) Compare stat results for each sample to the original bam file (sanity check: is there the same number of reads as the original BAM file?)If there is more than 15% of the reads being marked as duplicates may need to consider removing that sample
Example command: bamtools stats -in sampleID.sorted.markdup.addReadGr.bam
Merge the multiple lane files into 1 bam file for each sample.
Example command: $bamtools merge -in KH1_S1_L001.sort.markdup.readgrps.bam -in KH1_S1_L002.sort.markdup.readgrps.bam -in KH1_S1_L003.sort.markdup.readgrps.bam -in KH1_S1_L004.sort.markdup.readgrps.bam -out KH1_S1.sort.markdup.readgrps.merge.bam
For each sample index the processed BAM files that are sorted, have marked duplicates, and have read groups added. These will be used to identify callable loci. Indexing is used to "sort" by chromosome and region Output will be sampleID.sorted.markdup.addReadGr.bam.bai
Example command: $bamtools index -in sampleID.sorted.markdup.addReadGr.bam
For each sample get the read stats for the merged BAM files.
Example command: bamtools stats -in sampleID.sorted.markdup.addReadGr.merge.bam
Identify differentially expressed genes and transcripts using cuffdiff. Cuffdiff is used to find significant changes in transcript expression, splicing, and promoter use. Cuffdiff uses a gene annotation file downloaded with a sbatch script from Gencode GRCh38
Example command: $cuffdiff -use-sample-sheet -o diff_out -b reference.fa -p 8 --library-type fr-firststrand -L set1,set2 -u refence.gtf sampleSet.txt