title

author

date

output

RNA Seq for STAT736

Alex Soupir

9/10/2019

html_document

keep_md
true

Background

For STAT736-Fall-2019, we are analyzing the RNA-Seq from the publication Genome-wide analysis of p53 transcriptional programs in B cells upon exposure to genotoxic stress in vivo. We are only using the sequences B cells from spleen and not the non-B cells from spleen from the SRA Run Selector on NCBI.

The mice were exposed to whole-body ionizing radiation and sequences were extracted from both Bcells and non-B cells from the spleens of the mice. Two genotypes of mice were used: mice with p53 knocked out and the wild-type C57/Bl6. There were 4 different group combinations including the 2 different genotypes; each genotype was subjected to the ionizing radiation as well as control/mock.

Treatment groups of the mice that were either controls or treated with ionizing radiation to determine reaction of p53.

	Genotype	Treatment
Group 1	p53	Mock
Group 2	C57/Bl6	Mock
Group 3	p53	IR
Group 4	C57/Bl6	IR

This document will contain 2 different pipelines: The first one is going to be using the genome to map the reads too, and the secod is going to be de novo.

Genome De novo

#Genome Mapping {#genome-mapping}

The pipeline used in this analysis used conda on South Dakota State University's High Performance Computing cluster to run the programs FastQC, Trimmomatic, and Tophat.

This is different than previous RNA-Seq analyses where I used my workstation pc with Ubuntu 18.04 to run FastQC, Trimmomatic, HiSat2, HTSeq, and DESeq2 locally. Also, the previous RNA-Seq alayses were of Soybean with treatment combinations of mycorrhizae and rhizobia inoculation.

Cleaning Data

Programs used?

FastQC
Trimmomatic 0.39
Bowtie 2.2.5.0
Tophat 2.1.1
STAR
Cufflinks
Kraken2
MultiQC
featureCount

Picking the right node

To find a node that we can use on our own, we need to see which nodes are already allocated to jobs and which ones are idle. To do this, we can run sinfo. We want to pick one of the nodes that are marked 'idle' so we get the whole thing and we aren't interrupting someone elses job. For the sake of this exercise, lets work on big-mem.

Once a node that is idle has been found, you can ssh into it by typing ssh -X big-mem00# where # is the node number.

ssh -X big-mem005

Once on the node, the modules will have to be pulled from the shared folder again, otherwise we will be left with very basic ones. NOTE: if running programs that are in a personal folder such as miniconda (these examples), it is not necessary to add the other modules.

module use /cm/shared/modulefiles_local/

After loading the modules you can use it just as you would any other command line.

Creating slurm scripts

When running on a cluster, it can sometime be difficult to find open nodes with the resources needed to run the jobs that we have. Making a slurm script is really easy. Fist we make a new file with the touch command.

touch commands.slurm

Now in our directory we have the file commands.slurm which we can edit to hold our code in. We can edit it with the vi command.

vi commands.slurm

We have a few things that we need to put in the file header so slurm knows what to do with our commands.

#!/bin/bash

#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --output=job-%j-%N.log
#SBATCH --partition=bigmem
#SBATCH --time=10:00:00

When we break this down, we see --job-name which is what we will see when we look at whats running later, --nodes is the number of nodes we have, --ntasks-per-node is the number of cores that we are requesting to have allocated, --output is the output log file of the job (here it names the output file with the job number and the node that we used), --partition here is requesting a big-mem node but compute can also be used, and finally --time is how long we are requesting the allocation for.

If the time runs out before the job is done I believe that it just kills the job even if not finished so we need to think a little about how much time to set. If the time is set too low, the job is killed and if the time is set too long, we may face issues with getting the node allocated to us.

To submit a job we can use sbatch commands.slurm and then we have the job ID. To check the status of our submission we use sbatch and then it shows all of the submitted jobs and how long they have been running along with the name that we set in the script.

Acquiring sequences

To download the sequences from the sequence read archive (SRA), the SRA Toolkit was used. The downloading of the files took a very long time, so this was left to run over night. The --gzip was used to keep the files a relatively small, although this can be left out to download uncompressed files, and --split-files was used to split the forward read from the reverse read for paired end read trimming through Trimmomatic.

~/tools/sratoolkit.2.9.6-1-centos_linux64/bin/fastq-dump --gzip --split-files SRR2121770

This is an example of the single file, but the above code needed to be ran for all of the following SRA numbers:

SRR2121770
SRR2121771
SRR2121774
SRR2121775
SRR2121778
SRR2121779
SRR2121780
SRR2121781
SRR2121786
SRR2121787
SRR2121788
SRR2121789

The results from downloading with --split-files gives 2 files per SRR, as mentioned before, one forward and one reverse. The suffix of the split files is one with _1.fastq.gz and another with _2.fastq.gz.

FastQC

FastQC can be run on all of the read files by using the wild card (*) as in *.fastq.gz. This prevents the need to hard code each individual read file into a FastQC command, which saves a lot of time since there are 24 read files in total for these 12 samples.

~/miniconda2/bin/fastqc *.fastq.gz

The output from running FastQC is a zipped folder and an HTML file for each of the \fastq.gz files in the folder. The HTML document looks something like this:

This is just the top of the file, and every category under the Summary heading has a graph that shows how the read quality looks for that particular metric. These reports can give insight into whether the reads are of decent quality or if the quality is poor.

The raw reads we have here all passed for adapter content and sequence length distribution and everything failed per base sequence content. SRR2121770, SRR2121771, SRR2121774, SRR2121775, SRR2121788, SRR2121781-2, and SRR2121789-1 were fairly decent quality reads. SRR2121778, SRR2121779, SRR2121780, SRR2121786, SRR2121787, SRR2121781-1, and SRR2121789-2 were of fairly lower quality (failing 3 or more in both reads. All of them failed both per base sequence quality and per tile sequence quality.

MultiQC

First lets install multiqc with conda. The command for this is conda install -c bioconda multiqc.

When that is finished, we can run MultiQC in the folder with the QC files (they should be moved into a folder alone so things don't get cluttered later on in the analyses).

~/miniconda/bin/multiqc .

When MultiQC is finished running, there will be a new folder called multiqc_data where the summaries are stored. Now lets go back up a level where our raw data folder and fastqc folder is and make a new folder for all of our MultiQC data. We will copy the FastQC output from MultiQC to this new folder.

mkdir MultiQC_All

cp RawQC/multiqc_data/multiqc_fastqc.txt MultiQC_All/

Trimming with Trimmomatic

Conda was used again to run Trimmomatic. This isn't as easy as using the wildcard like with FastQC because each output has to be personalized for the read files that are input into Trimmomatic. Also, we have to make sure that the adapter sequences are in the same folder that we are running so we can refer to them easily when calling the Trimmomatic program. In this case, we are using the TruSeq3-PE-2.fa adapter sequences For example:

~/miniconda2/bin/trimmomatic PE SRR2121770_1.fastq.gz SRR2121770_2.fastq.gz 770_fp.fq.gz 770_fu.fq.gz 770_rp.fq.gz 770_ru.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRALING:3 MINLEN:36 &

This would be repeated for each of the pairs (12 in total). We are trimming paired-end reads with the TruSeq3-PE-2 adapters. We are chopping off the first and last 3 bases and if we end up with a sequence less than 36 bases, we get rid of it. We want to make sure that there are enough bases in a read to work with. These parameters can be tweaked for possibly better end results with less being discarded. When Trimmomatic is finished running, it will out put the total number of reads, the total number from both the forward and reverse reads that are kept, the number of only forward reads kept, the number of only reverse reads kept, and the number of discarded reads.

The highest number of reads dropped was from trimming SRR2121786, where 20.55% dropped. Most reads were between 5% and 10% dropped. SRR2121786, SRR2121787, and SRR2121779 had sequence drops greater than 15%.

The trimmed reads can be analyzed again with FastQC to see how well the trimming worked to make the file better quality. After running FastQC on the trimmed files we see that the quality of those that were really bad quality were improved. There were a few different metrics throughout all of the files that bounced from a warning before the failing, or from passing before to a warning, and so forth, overall creating better quality read files.

Alignment

Using Tophat

Tophat can be installed using the same conda install (conda install -c bioconda tophat). When this is finished installing, then we will need to get the mouse genome from the Johns Hopkins Univeristy Center for computational BIology. The version of the mouse genome that I am using here is the NCBI build37.2. Instead of downloading this from the website and having to move it to the cluser, I will just download it using wget into the folder that has the raw reads, trimmed reads, and the FastQC files.

wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/NCBI/build37.2/Mus_musculus_NCBI_build37.2.tar.gz

This will take a long time to download because the file is a little less than 16GB zipped.

We notice here that we have a zipped tar file. To make this file easier to use, lets unzip it.

tar zxvf Mus_muculus_NCBI_build37.2.tar.gz

Since Tophat is requiring *.bt21 files (large index) and the files downloaded for the genome above are only small index files, we have to create a large index using bowtie2-build. For this, lets navigate to the WholeGenomeFasta folder within the extracted folder and then run bowtie2-build.

~/miniconda2/bin/bowtie2-build --large-index genome.fa genome

This process took about 26 minutes to run. Now lets copy the index files to a folder close to our reads so we can access them easier, rather than having to refer to the longer path where we build them. After they are copied to a new folder closer to our working directory, I went ahead and unzipped the trimmed read files to try and make the Tophat faster but it turned out not to work. The multicore call with -p didn't use more cores than 1 until bowtie2-align-s, then 20 cores were used.

~/miniconda2/bin/tophat --no-converage-search -p 20 -G Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf -0 770_thout ./Index/genome 770_fp.fq.gz 770_rp.fq.gz 770_fu.fq 770_ru.fq

This run took almost 3 hours to complete.. Running with 80 cores rather than 20 cores took just 4 minutes less, so the whole process must be limited by a single core and the core clock speed. The process does use close to 8,000% at its peak so there is a benefit to multicore, just isn't very scalable.

Using STAR

STAR can be installed the same way as the previous programs with conda install (conda install -c bioconda star). In order to run STAR, we need to creaate indices just like with tophat, but STAR has this built in. I'm going to be using the same genome and GTF file as previously downloaded, but Dr. Ge uses a different zipped genome from the gencode database.

~/miniconda2/bin/STAR \
--runThreadN 80 \
--runMode genomeGenerate \
--genomeDir starIndex \
--genomeFastaFiles Index/genome.fa \ #same when we made the bowtie indices 
--sjdbGTFfile Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf

With the index files made, we can start aligning with STAR. It's important here than we only pick the paired end reads and not use all of the reads. Tophat is able to use all 4 reads but STAR doesn't allow that, so we need to make sure that we feed in the large files from trimming.

~/miniconda2/bin/STAR --runThreadN 80 --genomeDir starIndex --readFilesIn 770_fp.fq 770_rp.fq --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix 2121770 --outSAMtype BAM SortedByCoordinate

Assembling transcripts with Cufflinks

Once STAR is done running, we can assemble the transcripts with Cufflinks. This can also be installed with conda install (conda install -c bioconda cufflinks).

~/miniconda2/bin/cufflinks -p 20 -o SRR2121771_clout --library-type fr-firststrand 2121770Aligned.sortedByCoord.out.bam

Checking for Contamination

PhiX contamination

Now we will look at what kind of contamination we are looking at. When samples are sequenced with Illumina, a PhiX control is run along side them. This control is for cluster generation, sequencing, alignment, and calibration for cross-talk matrix generation. We will use Bowtie to create a file to determine the PhiX contamination level.

~/miniconda2/bin/bowtie2 -p 20 -x PhiX/Illumina/RTA/Sequence/Bowtie2Index/genome \
-1 TrimmedReads/770_fp.fq  -2 TrimmedReads/770_rp.fq -S phix.sam &> PhiXout/SRR2121770_phix.out

When the job is done running, the output file will show how much PhiX contamination we have. For example, lookin at the SRR2121770_phix.out created above, we see that 0.11% of the reads aligned with PhiX. The lower this value the better.

rRNA Sequences

To retreive the rRNA sequences for mouse, we need to search the taxonomy database on NCBI for Mus musculus. Click on Mus musculus on the next page, and then the top Mus musculus at the head of the list. Now, select the top subtree link in the Nucleotide database. Select rRNA sequences on the left side of the page and download full list just downloading with Send > Complete Record > File > FASTA > Create File. Drag the file using WinSCP to the raw folder on the cluster and rename it to rRNA.fa.

We are going to need to install bwa with conda in order to get the alignments to work. This can be done with conda install -c bioconda bwa. Following this, we will need to make indixes for the rRNA that we downloaded. To make this more clean, lets make a directory for the rRNA sequences that we downloaded and the indices that we make.

mkdir rRNA

Then we move the rRNA.fa to the new rRNA folder with WinSCP and then we can run the bwa.

time ~/miniconda2/bin/bwa mem -t 20 rRNA/rRNA.fa TrimmedReads/770_fp.fq TrimmedReads/770_rp.fq > rnaAlign/770_rna.sam

When we are done creating the new *.sam files for all of the forward/reverse read combinations, we can use samtools to convert the *.sam file to *.bam files which are essentially the same file just that sam is easier for us to look at while bam is binary. Samtools can be installed with conda install -c bioconda samtools.

~/miniconda2/bin/samtools view -@ 10 -bS -o rnaAlign/770_rna.bam rnaAlign/770.sam

Now in the rnaAlign folder we have our sam and bam file for each of the libraries. Lets create an output file with flagstat.

~/miniconda2/bin/samtools flagstat -@ 10 rnaAlign/770_rna.out

Wihtin this file we will be able to see the summary of our alignments to the rRNA file that we downloaded from NCBI.

#From 770_rna.out
205559289 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
85 + 0 supplementary
0 + 0 duplicates
4265179 + 0 mapped (2.07% : N/A)
205559204 + 0 paired in sequencing
102779602 + 0 read1
102779602 + 0 read2
4151684 + 0 properly paired (2.02% : N/A)
4187608 + 0 with itself and mate mapped
77486 + 0 singletons (0.04% : N/A)
4222 + 0 with mate mapped to a different chr
1026 + 0 with mate mapped to a different chr (mapQ>=5)

Bacterial contamination

In order to find out the contamination, we need to install Kraken2 with conda install -c bioconda kraken2 and download a pre-built database containing bacteria, archaea, and viral sequences. The database we are going to download only contains about 5% of k-mers from the original database (but directions are sort of lacking to build an entirely new database). More information can be found at https://ccb.jhu.edu/software/kraken/ for the pre-built databases.

Using the code in the next chunk will download the 8GB database and then extract the files so we can use them with the Kraken2 program. Lets do this in the main project folder.

wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz

tar xzf minikraken2_v2_8GB_201904_UPDATE.tgz

Now lets make a directory for the output.

mkdir krakenOut

We can call Kraken with the extracted database folder and point it to the location of out paired end reads from trimming and to the output folder that we just created for the outputs.

~/miniconda2/bin/kraken2 --db minikraken2_v2_8GB_201904_UPDATE/ --output krakenOut/770.out --threads 10 --paired TrimmedReads/770_fp.fq TrimmedReads/770_rp.fq

When Kraken is done running, it will print out the number (and percentage) of reads that were classified. In this case, we have used 102779602 sequences, of which 19142843 sequences were classified (18.63%) and 83636759 sequences were unclassified (81.37%). My interpretation of this is that 18.63% of the reads are possibly from microbial cell contamination.

Counting Transcripts

Since we have the bam files from the alignments of the different samples, we can count the features for each and get the transcipt counts using featureCounts form conda install -c bioconda/label/cf201901 subread. The genome and annotations that we previously downloaded were from genome mm9 so we have to specify to featureCounts what we want to actually count. FeatureCounts defaults to using gene_id which our output bam files don't have described correctly for featureCounts to read them. This is a single line of code because we can use a wildcard to run through all of the bam files.

#Move to the Star Alignment output folder for a working directory
cd StarOut

~/miniconda2/bin/featureCounts -a /gpfs/scratch/alex.soupir/Mus/raw/Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf -g 'transcript_id' -o readCounds.txt *bam

With the files that we are working with, this will take between 3.5 minutes to 5 minutes per bam file. The output will be a file that can be imported into excel and saved as csv which we then can work with in R.

Final QC of cleaning the data

Lets look at the data that we have collected from all of the MultiQC runs that we had with initial FastQC, Trimmomatic, STAR alignment, PhiX contamination, rRNA contamination, and the final feature counts.

qc = read.csv('Whole Data QC.csv', header=TRUE, na.strings="")
kable(qc) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "800px")

X	Sample	SRR2121770_1	SRR2121770_2	SRR2121771_1	SRR2121771_2	SRR2121774_1	SRR2121774_2	SRR2121775_1	SRR2121775_2	SRR2121778_1	SRR2121778_2	SRR2121779_1	SRR2121779_2	SRR2121780_1	SRR2121780_2	SRR2121781_1	SRR2121781_2	SRR2121786_1	SRR2121786_2	SRR2121787_1	SRR2121787_2	SRR2121788_1	SRR2121788_2	SRR2121789_1	SRR2121789_2
FastQC	adapter_content	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass
NA	Sequences flagged as poor quality	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
NA	sequence_duplication_levels	warn	warn	pass	pass	pass	pass	pass	pass	warn	pass	pass	pass	pass	pass	warn	warn	warn	pass	pass	pass	fail	fail	fail	warn
NA	avg_sequence_length	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51
NA	Encoding	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9	Sanger / Illumina 1.9
NA	per_base_sequence_quality	pass	pass	pass	pass	pass	pass	pass	pass	fail	fail	fail	fail	fail	fail	fail	pass	fail	fail	fail	fail	pass	pass	pass	fail
NA	sequence_length_distribution	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass
NA	Sequence length	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51	51
NA	File type	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls	Conventional base calls
NA	basic_statistics	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass
NA	per_sequence_gc_content	pass	pass	pass	pass	pass	pass	warn	warn	warn	warn	pass	fail	warn	fail	pass	pass	warn	fail	warn	fail	fail	fail	fail	fail
NA	Total Sequences	118323219	118323219	103127231	103127231	91225885	91225885	108661623	108661623	136766553	136766553	124265478	124265478	99685598	99685598	106360532	106360532	99988292	99988292	75410310	75410310	82750449	82750449	92454136	92454136
NA	per_base_n_content	pass	pass	warn	pass	warn	pass	pass	pass	fail	warn	fail	warn	fail	warn	pass	pass	fail	fail	fail	warn	pass	pass	pass	fail
NA	per_base_sequence_content	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail
NA	overrepresented_sequences	warn	warn	warn	warn	warn	warn	warn	warn	warn	warn	pass	warn	pass	pass	warn	warn	warn	warn	pass	pass	warn	warn	warn	warn
NA	%GC	47	47	46	46	47	47	47	47	46	47	46	46	46	46	46	46	47	47	46	46	47	47	47	48
NA	total_deduplicated_percentage	52.20958673	55.28743506	90.47297502	97.22714143	91.11411337	97.47431986	72.31423101	95.57923295	65.27895305	83.85617925	73.08810256	79.50169478	92.46127029	94.09779679	54.09833163	52.4832411	68.88309902	74.63550918	93.00930552	94.9666922	43.56442006	44.95245995	42.6643141	57.36052687
NA	Filename	SRR2121770_1.fastq.gz	SRR2121770_2.fastq.gz	SRR2121771_1.fastq.gz	SRR2121771_2.fastq.gz	SRR2121774_1.fastq.gz	SRR2121774_2.fastq.gz	SRR2121775_1.fastq.gz	SRR2121775_2.fastq.gz	SRR2121778_1.fastq.gz	SRR2121778_2.fastq.gz	SRR2121779_1.fastq.gz	SRR2121779_2.fastq.gz	SRR2121780_1.fastq.gz	SRR2121780_2.fastq.gz	SRR2121781_1.fastq.gz	SRR2121781_2.fastq.gz	SRR2121786_1.fastq.gz	SRR2121786_2.fastq.gz	SRR2121787_1.fastq.gz	SRR2121787_2.fastq.gz	SRR2121788_1.fastq.gz	SRR2121788_2.fastq.gz	SRR2121789_1.fastq.gz	SRR2121789_2.fastq.gz
NA	per_tile_sequence_quality	warn	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	fail	warn	fail	fail	fail	fail	warn	pass	pass	fail
NA	per_sequence_quality_scores	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	pass	fail	pass	fail	pass	pass	pass	pass
Trimming	surviving	102779602	NA	88509413	NA	77240029	NA	94179097	NA	108337271	NA	92924182	NA	76172416	NA	95488961	NA	69906417	NA	54427039	NA	73771492	NA	82764590	NA
NA	surviving_pct	86.86	NA	85.83	NA	84.67	NA	86.67	NA	79.21	NA	74.78	NA	76.41	NA	89.78	NA	69.91	NA	72.17	NA	89.15	NA	89.52	NA
NA	reverse_only_surviving_pct	4.48	NA	2.52	NA	2.53	NA	2.29	NA	2.11	NA	2.44	NA	2.52	NA	1.8	NA	2.59	NA	2.65	NA	1.85	NA	1.77	NA
NA	forward_only_surviving	3202267	NA	5041955	NA	4667583	NA	4832339	NA	9375417	NA	7916665	NA	6350522	NA	3894412	NA	6944977	NA	5349296	NA	3489058	NA	3797776	NA
NA	dropped	7036792	NA	6973727	NA	7013028	NA	7160952	NA	16166923	NA	20393774	NA	14647690	NA	5067412	NA	20551506	NA	13638339	NA	3962243	NA	4252397	NA
NA	dropped_pct	5.95	NA	6.76	NA	7.69	NA	6.59	NA	11.82	NA	16.41	NA	14.69	NA	4.76	NA	20.55	NA	18.09	NA	4.79	NA	4.6	NA
NA	forward_only_surviving_pct	2.71	NA	4.89	NA	5.12	NA	4.45	NA	6.86	NA	6.37	NA	6.37	NA	3.66	NA	6.95	NA	7.09	NA	4.22	NA	4.11	NA
NA	input_read_pairs	118323219	NA	103127231	NA	91225885	NA	108661623	NA	136766553	NA	124265478	NA	99685598	NA	106360532	NA	99988292	NA	75410310	NA	82750449	NA	92454136	NA
NA	reverse_only_surviving	5304558	NA	2602136	NA	2305245	NA	2489235	NA	2886942	NA	3030857	NA	2514970	NA	1909747	NA	2585392	NA	1995636	NA	1527656	NA	1639373	NA
STAR Alignment	uniquely_mapped_percent	81.73	NA	82.35	NA	81.67	NA	83.1	NA	83.43	NA	84.55	NA	82.48	NA	86.82	NA	80.75	NA	80.27	NA	80.81	NA	81.62	NA
NA	num_splices	8669080	NA	8000968	NA	10613774	NA	11755566	NA	10864591	NA	8361644	NA	6866296	NA	7991948	NA	6183074	NA	4335214	NA	7285454	NA	8581902	NA
NA	num_GCAG_splices	90012	NA	83679	NA	116149	NA	124779	NA	116447	NA	86763	NA	74268	NA	80836	NA	65893	NA	44765	NA	80392	NA	97422	NA
NA	insertion_length	1.11	NA	1.1	NA	1.12	NA	1.17	NA	1.13	NA	1.11	NA	1.11	NA	1.13	NA	1.16	NA	1.15	NA	1.13	NA	1.15	NA
NA	deletion_length	1.35	NA	1.33	NA	1.37	NA	1.26	NA	1.36	NA	1.35	NA	1.35	NA	1.4	NA	1.2	NA	1.22	NA	1.3	NA	1.25	NA
NA	unmapped_tooshort_percent	4.08	NA	4.21	NA	3.42	NA	4.09	NA	3.51	NA	3.87	NA	3.66	NA	3.45	NA	5.83	NA	6.42	NA	5.61	NA	4.53	NA
NA	avg_mapped_read_length	100.73	NA	101.11	NA	101.1	NA	101.14	NA	100.6	NA	100.51	NA	100.51	NA	100.81	NA	100.41	NA	100.33	NA	101.18	NA	101.22	NA
NA	deletion_rate	0	NA	0	NA	0	NA	0	NA	0.01	NA	0.01	NA	0.01	NA	0.01	NA	0.01	NA	0.01	NA	0.01	NA	0.01	NA
NA	mismatch_rate	0.23	NA	0.29	NA	0.29	NA	0.17	NA	0.17	NA	0.22	NA	0.22	NA	0.17	NA	0.26	NA	0.28	NA	0.16	NA	0.15	NA
NA	avg_input_read_length	101	NA	101	NA	101	NA	101	NA	100	NA	100	NA	100	NA	101	NA	100	NA	100	NA	101	NA	101	NA
NA	num_ATAC_splices	8039	NA	7497	NA	10782	NA	11677	NA	9722	NA	7455	NA	6186	NA	7544	NA	5492	NA	3832	NA	6602	NA	7593	NA
NA	num_annotated_splices	8581504	NA	7919864	NA	10513157	NA	11643271	NA	10755797	NA	8273449	NA	6799294	NA	7913548	NA	6103902	NA	4260450	NA	7202217	NA	8487777	NA
NA	num_GTAG_splices	8571029	NA	7909792	NA	10486843	NA	11619110	NA	10738422	NA	8267426	NA	6785842	NA	7903568	NA	6111689	NA	4286617	NA	7198460	NA	8476887	NA
NA	uniquely_mapped	83997598	NA	72886249	NA	63080382	NA	78265974	NA	90382084	NA	78565331	NA	62830380	NA	82902461	NA	56450144	NA	43685904	NA	59617353	NA	67556146	NA
NA	multimapped_toomany	1852495	NA	1405409	NA	1242903	NA	1061446	NA	1736807	NA	1362362	NA	1374461	NA	846193	NA	1009216	NA	955877	NA	703236	NA	890704	NA
NA	unmapped_mismatches	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	unmapped_mismatches_percent	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	total_reads	102779602	NA	88509413	NA	77240029	NA	94179097	NA	108337271	NA	92924182	NA	76172416	NA	95488961	NA	69906417	NA	54427039	NA	73771492	NA	82764590	NA
NA	unmapped_other	432340	NA	370957	NA	254880	NA	377196	NA	476580	NA	454868	NA	335571	NA	468726	NA	370629	NA	359051	NA	317451	NA	405340	NA
NA	insertion_rate	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	unmapped_other_percent	0.42	NA	0.42	NA	0.33	NA	0.4	NA	0.44	NA	0.49	NA	0.44	NA	0.49	NA	0.53	NA	0.66	NA	0.43	NA	0.49	NA
NA	multimapped_percent	11.96	NA	11.44	NA	12.97	NA	11.27	NA	11.02	NA	9.63	NA	11.61	NA	8.35	NA	11.44	NA	10.9	NA	12.19	NA	12.28	NA
NA	multimapped	12297290	NA	10128397	NA	10020381	NA	10617654	NA	11939994	NA	8949092	NA	8840667	NA	7971370	NA	7999509	NA	5933625	NA	8991826	NA	10165077	NA
NA	num_noncanonical_splices	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	unmapped_tooshort	4199879	NA	3718401	NA	2641483	NA	3856827	NA	3801806	NA	3592529	NA	2791337	NA	3300211	NA	4076919	NA	3492582	NA	4141626	NA	3747323	NA
NA	multimapped_toomany_percent	1.8	NA	1.59	NA	1.61	NA	1.13	NA	1.6	NA	1.47	NA	1.8	NA	0.89	NA	1.44	NA	1.76	NA	0.95	NA	1.08	NA
PhiX Contamination	overall_alignment_rate	0.11	NA	0.13	NA	0.18	NA	0.2	NA	0.12	NA	0.12	NA	0.1	NA	0.15	NA	0.18	NA	0.16	NA	0.2	NA	0.2	NA
NA	paired_aligned_mate_multi_halved	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	paired_aligned_none	102674701	NA	88400163	NA	77108986	NA	93998943	NA	108212089	NA	92820803	NA	76103405	NA	95350795	NA	69784944	NA	54344793	NA	73629819	NA	82605166	NA
NA	paired_aligned_mate_multi	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	paired_aligned_discord_one	5081	NA	5514	NA	6603	NA	9196	NA	6325	NA	5222	NA	3475	NA	7190	NA	6424	NA	4218	NA	7405	NA	8139	NA
NA	paired_aligned_mate_one	2761	NA	2466	NA	2825	NA	3974	NA	2693	NA	2277	NA	1515	NA	2999	NA	2696	NA	1961	NA	2913	NA	3585	NA
NA	paired_total	102779602	NA	88509413	NA	77240029	NA	94179097	NA	108337271	NA	92924182	NA	76172416	NA	95488961	NA	69906417	NA	54427039	NA	73771492	NA	82764590	NA
NA	paired_aligned_mate_none_halved	102668239.5	NA	88393416	NA	77100970.5	NA	93987760	NA	108204417.5	NA	92814442.5	NA	76099172.5	NA	95342105.5	NA	69777172	NA	54339594.5	NA	73620957.5	NA	82595234.5	NA
NA	paired_aligned_multi	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	paired_aligned_one	104901	NA	109250	NA	131043	NA	180154	NA	125182	NA	103379	NA	69011	NA	138166	NA	121473	NA	82246	NA	141673	NA	159424	NA
NA	total_reads	102779602	NA	88509413	NA	77240029	NA	94179097	NA	108337271	NA	92924182	NA	76172416	NA	95488961	NA	69906417	NA	54427039	NA	73771492	NA	82764590	NA
NA	paired_aligned_mate_one_halved	1380.5	NA	1233	NA	1412.5	NA	1987	NA	1346.5	NA	1138.5	NA	757.5	NA	1499.5	NA	1348	NA	980.5	NA	1456.5	NA	1792.5	NA
NA	paired_aligned_mate_none	205336479	NA	176786832	NA	154201941	NA	187975520	NA	216408835	NA	185628885	NA	152198345	NA	190684211	NA	139554344	NA	108679189	NA	147241915	NA	165190469	NA
rRNA Contamination	mapped_passed	4265172	NA	3371478	NA	1366577	NA	4031209	NA	5947728	NA	5093417	NA	4109793	NA	5042021	NA	7103552	NA	4879843	NA	5080844	NA	7013490	NA
NA	duplicates_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	secondary_passed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	paired in sequencing_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	duplicates_passed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	read2_passed	102779602	NA	88509413	NA	77240029	NA	94179097	NA	108337271	NA	92924182	NA	76172416	NA	95488961	NA	69906417	NA	54427039	NA	73771492	NA	82764590	NA
NA	read1_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	read1_passed	102779602	NA	88509413	NA	77240029	NA	94179097	NA	108337271	NA	92924182	NA	76172416	NA	95488961	NA	69906417	NA	54427039	NA	73771492	NA	82764590	NA
NA	with mate mapped to a different chr_passed	4240	NA	3672	NA	692	NA	2742	NA	7850	NA	5990	NA	4782	NA	5840	NA	5348	NA	2434	NA	2250	NA	3646	NA
NA	total_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	properly paired_passed_pct	2.02	NA	1.84	NA	0.86	NA	2.07	NA	2.7	NA	2.69	NA	2.65	NA	2.59	NA	4.98	NA	4.39	NA	3.36	NA	4.15	NA
NA	singletons_passed	77493	NA	85229	NA	21005	NA	88202	NA	48177	NA	47851	NA	41299	NA	53429	NA	60929	NA	55645	NA	58038	NA	57600	NA
NA	supplementary_passed	85	NA	65	NA	16	NA	23	NA	43	NA	44	NA	44	NA	72	NA	19	NA	18	NA	58	NA	50	NA
NA	singletons_passed_pct	0.04	NA	0.05	NA	0.01	NA	0.05	NA	0.02	NA	0.03	NA	0.03	NA	0.03	NA	0.04	NA	0.05	NA	0.04	NA	0.03	NA
NA	mapped_failed_pct	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA
NA	mapped_passed_pct	2.07	NA	1.9	NA	0.88	NA	2.14	NA	2.75	NA	2.74	NA	2.7	NA	2.64	NA	5.08	NA	4.48	NA	3.44	NA	4.24	NA
NA	supplementary_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	with itself and mate mapped_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	mapped_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	total_passed	205559289	NA	177018891	NA	154480074	NA	188358217	NA	216674585	NA	185848408	NA	152344876	NA	190977994	NA	139812853	NA	108854096	NA	147543042	NA	165529230	NA
NA	properly paired_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	flagstat_total	205559289	NA	177018891	NA	154480074	NA	188358217	NA	216674585	NA	185848408	NA	152344876	NA	190977994	NA	139812853	NA	108854096	NA	147543042	NA	165529230	NA
NA	with mate mapped to a different chr (mapQ >= 5)_passed	1026	NA	854	NA	118	NA	442	NA	1093	NA	940	NA	801	NA	1206	NA	964	NA	487	NA	823	NA	951	NA
NA	properly paired_failed_pct	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA
NA	with mate mapped to a different chr (mapQ >= 5)_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	with itself and mate mapped_passed	4187594	NA	3286184	NA	1345556	NA	3942984	NA	5899508	NA	5045522	NA	4068450	NA	4988520	NA	7042604	NA	4824180	NA	5022748	NA	6955840	NA
NA	read2_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	with mate mapped to a different chr_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	properly paired_passed	4151444	NA	3256926	NA	1334312	NA	3903816	NA	5846848	NA	5001956	NA	4035832	NA	4945218	NA	6955726	NA	4775228	NA	4958480	NA	6869584	NA
NA	paired in sequencing_passed	205559204	NA	177018826	NA	154480058	NA	188358194	NA	216674542	NA	185848364	NA	152344832	NA	190977922	NA	139812834	NA	108854078	NA	147542984	NA	165529180	NA
NA	singletons_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	secondary_failed	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	singletons_failed_pct	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA	nan	NA
Feature Counts	Unassigned_Ambiguity	5553184	NA	4245795	NA	4120211	NA	4873854	NA	5225069	NA	4143370	NA	3730746	NA	3744259	NA	3208052	NA	2154031	NA	3756137	NA	4158307	NA
NA	Unassigned_MappingQuality	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	percent_assigned	38.92490502	NA	41.43212911	NA	45.0933741	NA	46.82554793	NA	45.79057473	NA	42.0896936	NA	43.45341423	NA	41.01237187	NA	46.15918368	NA	44.71355742	NA	45.65729376	NA	46.21716065	NA
NA	Unassigned_Nonjunction	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	Unassigned_Duplicate	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	Unassigned_Chimera	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	Unassigned_Unmapped	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	Assigned	92630635	NA	83632332	NA	81770639	NA	99643999	NA	112432428	NA	87231124	NA	76094464	NA	86121025	NA	71761506	NA	53877618	NA	75554738	NA	86661345	NA
NA	Unassigned_MultiMapping	69977472	NA	56081314	NA	55175498	NA	56266414	NA	64772012	NA	50119894	NA	49456582	NA	44182994	NA	42564996	NA	33123230	NA	46247590	NA	52396728	NA
NA	Unassigned_Overlapping_Length	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	Unassigned_Secondary	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	Unassigned_FragmentLength	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA	0	NA
NA	Unassigned_NoFeatures	69811377	NA	57894371	NA	40269914	NA	52014095	NA	63106671	NA	65756168	NA	45835550	NA	75939638	NA	37930730	NA	31340159	NA	39923831	NA	44292640	NA
NA	Total	237972668	NA	201853812	NA	181336262	NA	212798362	NA	245536180	NA	207250556	NA	175117342	NA	209987916	NA	155465284	NA	120495038	NA	165482296	NA	187509020	NA

Differentially Expressed Sequence Identification

Programs Used

R
RStudio

Packages

readr (install.packages('readr'))
limma (BiocManager::install('limma'))
DESeq2 (BiocManager::install('DESeq2'))
dplyr (install.packages("dplyr"))
ggplot2 (install.packages("ggplot2"))
gplots (install.packages("gplots"))
Annotations (BiocManager::install('AnnotationDbi'))
org.Hs.eg.db (BiocManager::install('org.Hs.eg.db'))
- This is for Human
org.Mm.eg.db (BiocManager::install('org.Mm.eg.db'))
- This is for Mouse
ggrepel (install.packages("ggrepel"))
ReportingTools (BiocManager::install('ReportingTools'))
GO.db (BiocManager::install('GO.db'))
GOstats (BiocManager::install('GOstats'))
pathview (BiocManager::install('pathview'))
gage (BiocManager::install('gage'))
gageData (BiocManager::install('gageData'))
select (BiocManager::install('Select'))

With these, you most certainly will have to step through each and install extra things when you start calling the packages. Take it step by step to ensure that each dependency is installed.

Analyzing Reads Counts

When the count file is completed, we can import it into R and start working with it to determine differentially expressed genes. First we will import it into R

library(limma)
library(DESeq2)
library(dplyr)
library(readr)

countData = read_csv("readCounts.csv", skip = 1)

This gives us our dataframe from out featureCounts program, but if we look at the data we see that featureCounts added some extra information that characterizes each gene_id.

kable(head(countData)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "320px")

Geneid	Chr	Start	End	Strand	Length	SRR2121775Aligned.sortedByCoord.out.bam	SRR2121786Aligned.sortedByCoord.out.bam	SRR2121787Aligned.sortedByCoord.out.bam
ENSMUSG00000102693.1	chr1	3073253	3074322	+	1070	0	4	2
ENSMUSG00000064842.1	chr1	3102016	3102125	+	110	0	0	0
ENSMUSG00000051951.5	chr1;chr1;chr1;chr1;chr1;chr1;chr1	3205901;3206523;3213439;3213609;3214482;3421702;3670552	3207317;3207317;3215632;3216344;3216968;3421901;3671498	-;-;-;-;-;-;-	6094	4	4	2
ENSMUSG00000102851.1	chr1	3252757	3253236	+	480	0	2	0
ENSMUSG00000103377.1	chr1	3365731	3368549	-	2819	2	6	14
ENSMUSG00000104017.1	chr1	3375556	3377788	-	2233	0	0	0

We also need to set out row names to the gene_id. We will do some data frame manipulation and then look at the data again.

countData = as.data.frame(countData)
rownames(countData) = countData$Geneid
countData = countData[,-c(1:6)]
kable(head(countData)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	SRR2121775Aligned.sortedByCoord.out.bam	SRR2121786Aligned.sortedByCoord.out.bam	SRR2121787Aligned.sortedByCoord.out.bam
ENSMUSG00000102693.1	0	4	2
ENSMUSG00000064842.1	0	0	0
ENSMUSG00000051951.5	4	4	2
ENSMUSG00000102851.1	0	2	0
ENSMUSG00000103377.1	2	6	14
ENSMUSG00000104017.1	0	0	0

Quick Data Exploration

dim(countData)

## [1] 55487    12

kable(summary(countData)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

SRR2121770Aligned.sortedByCoord.out.bam	SRR2121771Aligned.sortedByCoord.out.bam	SRR2121774Aligned.sortedByCoord.out.bam	SRR2121775Aligned.sortedByCoord.out.bam	SRR2121778Aligned.sortedByCoord.out.bam	SRR2121779Aligned.sortedByCoord.out.bam	SRR2121780Aligned.sortedByCoord.out.bam	SRR2121781Aligned.sortedByCoord.out.bam	SRR2121786Aligned.sortedByCoord.out.bam	SRR2121787Aligned.sortedByCoord.out.bam	SRR2121788Aligned.sortedByCoord.out.bam	SRR2121789Aligned.sortedByCoord.out.bam
Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0
1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.0
Median : 0.0	Median : 0.0	Median : 0.0	Median : 0.0	Median : 0.0	Median : 0.0	Median : 0.0	Median : 0.0	Median : 3.0	Median : 4.0	Median : 0.0	Median : 1.0
Mean : 273.8	Mean : 235.7	Mean : 229.6	Mean : 269.6	Mean : 286.6	Mean : 257.7	Mean : 193.7	Mean : 268.2	Mean : 160.2	Mean : 119.4	Mean : 184.3	Mean : 211.1
3rd Qu.: 25.0	3rd Qu.: 20.0	3rd Qu.: 15.0	3rd Qu.: 24.0	3rd Qu.: 25.0	3rd Qu.: 24.0	3rd Qu.: 17.0	3rd Qu.: 26.0	3rd Qu.: 21.0	3rd Qu.: 21.0	3rd Qu.: 16.0	3rd Qu.: 18.0
Max. :1085454.0	Max. :834016.0	Max. :783768.0	Max. :1032589.0	Max. :1233905.0	Max. :1232421.0	Max. :805463.0	Max. :1210716.0	Max. :641911.0	Max. :516752.0	Max. :706363.0	Max. :756086.0

Let's also go ahead and change the names to describe out data a little better.

columns = c('Trp53m_mock_1', 'Trp53m_mock_2', 'Trp53m_4h7Gy_1', 'Trp53m_4h7Gy_2', 'Trp53p_mock_1', 'Trp53p_mock_2', 'Trp53p_mock_3', 'Trp53p_mock_4', 'Trp53p_4h7Gy_1', 'Trp53p_4h7Gy_2', 'Trp53p_4h7Gy_3', 'Trp53p_4h7Gy_4')
colnames(countData) = columns

Here we have to make sure that we convert the +/+ and -/- to characters. These characters

par(mar=c(8,4,4,1)+0.1)
barplot( colSums(countData)/1e6, col="green",las=3,main="Total read counts (millions)", ylab="Total read counts in millions")

hist(countData[,1], br=200, xlab="Number of Reads Counts per Feature", main="Histogram of Read Counts for Trp53-/- Mock")

We can see that our count data is highly skewed to the right. This is a great case for using log transformation!

logCountData = log2(1+countData)
par(mfrow = c(1, 2), mar=c(8,4,4,1))  # two columns
hist(logCountData[,1], main="Histogram of Log Read Counts", xlab="Log transformed counts")
boxplot(logCountData,las=3, main="Boxplot of Log Read Counts")

x <- logCountData
myColors = rainbow(dim(x)[2])
plot(density(x[,1]),col = myColors[1], lwd=2,
     xlab="Expresson values", ylab="Density", main= "Distribution of transformed data",
     ylim=c(0, max(density(x[,1])$y)+.02 ) )
  
for( i in 2:dim(x)[2] )
lines(density(x[,i]),col=myColors[i], lwd=2)
legend("topright", cex=1.1,colnames(x), lty=rep(1,dim(x)[2]), col=myColors )

plot(logCountData[,1],logCountData[,2], xlab="Trp53-/- mock replication 1", ylab="Trp53-/- mock replication 2")

Filtering, Normalization, and Trasformation using DESeq2

We have to make the experiment design into a small dataframe so we can tell DESeq how we want to analyze the data. Here will will make a small table that has the rep names that we changed the column names to previously, and then a column for which columns are Trp53+/+ or Trp53-/-, and which columns were control mice and which columns were treated with ionizing radiation.

detectGroups <- function (x){  # x are col names
  tem <- gsub("[0-9]*$","",x) # Remove all numbers from end
  #tem = gsub("_Rep|_rep|_REP","",tem)
  tem <- gsub("_$","",tem); # remove "_" from end
  tem <- gsub("_Rep$","",tem); # remove "_Rep" from end
  tem <- gsub("_rep$","",tem); # remove "_rep" from end
  tem <- gsub("_REP$","",tem)  # remove "_REP" from end
  return( tem )
}

groups = as.character ( detectGroups( colnames( countData ) ) )
groups

##  [1] "Trp53m_mock"  "Trp53m_mock"  "Trp53m_4h7Gy" "Trp53m_4h7Gy" "Trp53p_mock" 
##  [6] "Trp53p_mock"  "Trp53p_mock"  "Trp53p_mock"  "Trp53p_4h7Gy" "Trp53p_4h7Gy"
## [11] "Trp53p_4h7Gy" "Trp53p_4h7Gy"

p53 = c("m", "m", "m", "m",
        "p", "p", "p", "p", "p", "p", "p", "p")
treatment = c("control", "control", "IR", "IR", "control", "control", "control", "control",
              "IR", "IR", "IR", "IR")

colData = cbind(colnames(countData), p53 )
colData

##                        p53
##  [1,] "Trp53m_mock_1"  "m"
##  [2,] "Trp53m_mock_2"  "m"
##  [3,] "Trp53m_4h7Gy_1" "m"
##  [4,] "Trp53m_4h7Gy_2" "m"
##  [5,] "Trp53p_mock_1"  "p"
##  [6,] "Trp53p_mock_2"  "p"
##  [7,] "Trp53p_mock_3"  "p"
##  [8,] "Trp53p_mock_4"  "p"
##  [9,] "Trp53p_4h7Gy_1" "p"
## [10,] "Trp53p_4h7Gy_2" "p"
## [11,] "Trp53p_4h7Gy_3" "p"
## [12,] "Trp53p_4h7Gy_4" "p"

colData = as.data.frame(cbind(colnames(countData), p53, treatment))
colData

##                V1 p53 treatment
## 1   Trp53m_mock_1   m   control
## 2   Trp53m_mock_2   m   control
## 3  Trp53m_4h7Gy_1   m        IR
## 4  Trp53m_4h7Gy_2   m        IR
## 5   Trp53p_mock_1   p   control
## 6   Trp53p_mock_2   p   control
## 7   Trp53p_mock_3   p   control
## 8   Trp53p_mock_4   p   control
## 9  Trp53p_4h7Gy_1   p        IR
## 10 Trp53p_4h7Gy_2   p        IR
## 11 Trp53p_4h7Gy_3   p        IR
## 12 Trp53p_4h7Gy_4   p        IR

str(colData)

## 'data.frame':	12 obs. of  3 variables:
##  $ V1       : Factor w/ 12 levels "Trp53m_4h7Gy_1",..: 3 4 1 2 9 10 11 12 5 6 ...
##  $ p53      : Factor w/ 2 levels "m","p": 1 1 1 1 2 2 2 2 2 2 ...
##  $ treatment: Factor w/ 2 levels "control","IR": 1 1 2 2 1 1 1 1 2 2 ...

Creating a DESeq Dataset

dds = DESeqDataSetFromMatrix(countData=countData,
                             colData=colData,
                             design= ~ p53+treatment+p53*treatment)   # note that the study design is changed.
dds = DESeq(dds)  # main function
nrow(dds)

## [1] 55487

Filtering: we will only keep rows that have a sum count between all samples greater than 5. This will remove most of the genes that mostly have "0" counts.

dds <- dds[ rowSums(counts(dds)) > 5, ]
nrow(dds)

## [1] 34139

Regularized log transformation - used for clustering

rld <- rlog(dds, blind = FALSE)
kable(head(assay(rld), 6)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	Trp53m_mock_1	Trp53m_mock_2	Trp53m_4h7Gy_1	Trp53m_4h7Gy_2	Trp53p_mock_1	Trp53p_mock_2	Trp53p_mock_3	Trp53p_mock_4	Trp53p_4h7Gy_1	Trp53p_4h7Gy_2	Trp53p_4h7Gy_3	Trp53p_4h7Gy_4
ENSMUSG00000102693.1	-0.7873810	-0.7869647	-0.7864123	-0.7872251	-0.7873917	-0.7872901	-0.7863979	-0.7874606	-0.6313032	-0.7027829	-0.7854203	-0.7864622
ENSMUSG00000051951.5	-0.1689760	-0.1684855	-0.1678308	-0.0729011	-0.1689887	-0.1688691	-0.1678136	-0.1690696	-0.0298983	-0.0789877	-0.1675075	-0.1678901
ENSMUSG00000103377.1	0.9367497	0.9385253	0.9408618	1.0007595	0.9367037	0.9371385	0.9409228	0.9364087	1.1895736	1.5284052	0.9420021	0.9406517
ENSMUSG00000103201.1	-0.0694969	-0.0689413	-0.0682009	-0.0692892	-0.0695112	-0.0693758	-0.0681815	-0.0696029	0.0406074	0.1775249	-0.0678357	-0.0682680
ENSMUSG00000102592.1	0.6079270	0.6090832	0.6106138	0.6083601	0.6078971	0.6081797	0.6106538	0.6077056	0.9928379	0.8039315	0.6113644	0.6829781
ENSMUSG00000025900.12	1.7329730	1.7166717	1.6536060	1.6483113	1.6766519	1.6478854	1.6536997	1.7812828	2.0563917	2.2880207	1.6553593	1.7301108

Variance Stabilizing Transformation

Interactions cause a difference between the lfc betwen pooled data, e.g. p53+/+ (control and IR) and p53-/- (control and IR)

vsd <- vst(dds, blind = FALSE)
kable(head(assay(vsd), 6)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	Trp53m_mock_1	Trp53m_mock_2	Trp53m_4h7Gy_1	Trp53m_4h7Gy_2	Trp53p_mock_1	Trp53p_mock_2	Trp53p_mock_3	Trp53p_mock_4	Trp53p_4h7Gy_1	Trp53p_4h7Gy_2	Trp53p_4h7Gy_3	Trp53p_4h7Gy_4
ENSMUSG00000102693.1	6.433022	6.433022	6.433022	6.433022	6.433022	6.433022	6.433022	6.433022	6.784352	6.714627	6.433022	6.433022
ENSMUSG00000051951.5	6.433022	6.433022	6.433022	6.715020	6.433022	6.433022	6.433022	6.433022	6.784352	6.714627	6.433022	6.433022
ENSMUSG00000103377.1	6.433022	6.433022	6.433022	6.632583	6.433022	6.433022	6.433022	6.433022	6.862784	7.171182	6.433022	6.433022
ENSMUSG00000103201.1	6.433022	6.433022	6.433022	6.433022	6.433022	6.433022	6.433022	6.433022	6.737470	6.919246	6.433022	6.433022
ENSMUSG00000102592.1	6.433022	6.433022	6.433022	6.433022	6.433022	6.433022	6.433022	6.433022	7.013160	6.830644	6.433022	6.662579
ENSMUSG00000025900.12	6.669513	6.643045	6.433022	6.433022	6.569347	6.433022	6.433022	6.732883	7.013160	7.220955	6.433022	6.662579

For the log2 approach, we need to first estimate size factors to account for sequencing depth, and then specify normalized=TRUE. Sequencing depth correction is done automatically for the rlog and the vst.

Size Factor

dds <- estimateSizeFactors(dds)
kable(sizeFactors(dds)) %>%
  kable_styling() %>%
  scroll_box(width = "300px", height = "520px")

	x
Trp53m_mock_1	1.2892566
Trp53m_mock_2	1.0903014
Trp53m_4h7Gy_1	0.8973654
Trp53m_4h7Gy_2	1.2078255
Trp53p_mock_1	1.2952327
Trp53p_mock_2	1.2406213
Trp53p_mock_3	0.8931001
Trp53p_mock_4	1.3347016
Trp53p_4h7Gy_1	0.7767888
Trp53p_4h7Gy_2	0.6056051
Trp53p_4h7Gy_3	0.8227361
Trp53p_4h7Gy_4	0.9123249

We will first look at the log transformed data

slog <- log2(counts(dds, normalized=TRUE)+1)
kable(head(slog)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	Trp53m_mock_1	Trp53m_mock_2	Trp53m_4h7Gy_2	Trp53p_mock_1	Trp53p_mock_4	Trp53p_4h7Gy_1	Trp53p_4h7Gy_2	Trp53p_4h7Gy_4
ENSMUSG00000102693.1	0.000000	0.000000	0.000000	0.0000000	0.000000	2.620447	2.105169	0.000000
ENSMUSG00000051951.5	0.000000	0.000000	2.108269	0.0000000	0.000000	2.620447	2.105169	0.000000
ENSMUSG00000103377.1	0.000000	0.000000	1.409184	0.0000000	0.000000	3.125008	4.592001	0.000000
ENSMUSG00000103201.1	0.000000	0.000000	0.000000	0.0000000	0.000000	2.281566	3.447241	0.000000
ENSMUSG00000102592.1	0.000000	0.000000	0.000000	0.0000000	0.000000	3.922280	2.926941	1.674552
ENSMUSG00000025900.12	1.734188	1.503021	0.000000	0.8254291	2.246759	3.922280	4.777149	1.674552

par(mfrow = c(1, 3))  # 3 columns
plot(slog[,1],slog[,2])
plot(assay(rld)[,1],assay(rld)[,2])
plot(assay(vsd)[,1],assay(vsd)[,2])

As the log transformation constant increases, the information of the data is lost.

par(mfrow = c(1, 3))  # 3 columns
slog <- log2(counts(dds, normalized=TRUE)+1)
plot(slog[,1],slog[,2])
slog <- log2(counts(dds, normalized=TRUE)+4)
plot(slog[,1],slog[,2], xlim=c(0,20))
slog <- log2(counts(dds, normalized=TRUE)+20)
plot(slog[,1],slog[,2], xlim=c(0,20))

library("dplyr")
library("ggplot2")

df <- bind_rows(
  as_data_frame(slog[,1:2]) %>%
         mutate(transformation = "log2(x + 1)"),
  as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"),
  as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"))

## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.

colnames(df)[1:2] <- c("x", "y")  

ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
  coord_fixed() + facet_grid( . ~ transformation)

Exploratory Data Analysis

PCA plot

plotPCA(rld, intgroup = c("p53", "treatment")) + theme(aspect.ratio=1)

A prettier PCA plot created with GGPlot2

pca.object <- prcomp(t(assay(rld))) # PCA 
pcaData = as.data.frame(pca.object$x[,1:2]); 
pcaData = cbind(pcaData,detectGroups(colnames(assay(rld)) ))
colnames(pcaData) = c("PC1", "PC2", "Type")
percentVar=round(100*summary(pca.object)$importance[2,1:2],0)
#plot
p=ggplot(pcaData, aes(PC1, PC2, color=Type, shape = Type)) + geom_point(size=5) 
p=p+xlab(paste0("PC1: ",percentVar[1],"% variance")) 
p=p+ylab(paste0("PC2: ",percentVar[2],"% variance")) 
p=p+ggtitle("Principal component analysis (PCA)")+coord_fixed(ratio=1.0)+ 
    theme(plot.title = element_text(size = 16,hjust = 0.5)) + theme(aspect.ratio=1) +
    theme(axis.text.x = element_text( size = 16),
    axis.text.y = element_text( size = 16),
    axis.title.x = element_text( size = 16),
    axis.title.y = element_text( size = 16) ) +
  theme(legend.text=element_text(size=16))
print(p)

Multidimensional Scaling Plot

dist2 <- function(x, ...)   # distance function = 1-PCC (Pearson's correlation coefficient)
  as.dist(1-cor(t(x), method="pearson"))

fit = cmdscale( dist2(t(assay(rld))) , eig=T, k=2)
mdsData <- as.data.frame(fit$points[,1:2]); 
mdsData <- cbind(mdsData,detectGroups(colnames(assay(rld))) )
colnames(mdsData) = c("x1", "x2", "Type")
	
p<-ggplot(mdsData, aes(x1, x2, color=Type, shape = Type)) + geom_point(size=5) 
p=p+xlab("Dimension 1") 
p=p+ylab("Dimension 2") 
p=p+ggtitle("Multidimensional scaling (MDS)")+ coord_fixed(ratio=1.)+ 
     theme(plot.title = element_text(hjust = 0.5)) + theme(aspect.ratio=1) +
	 	 theme(axis.text.x = element_text( size = 16),
        axis.text.y = element_text( size = 16),
        axis.title.x = element_text( size = 16),
        axis.title.y = element_text( size = 16) ) +
	   theme(legend.text=element_text(size=16))
print(p)

Creating a heatmap

library(gplots)

hclust2 <- function(x, method="average", ...)  # average linkage in hierarchical clustering
  hclust(x, method=method, ...)

n=100 # number of top genes by standard deviation

x = assay(rld)
if(n>dim(x)[1]) n = dim(x)[1] # max	as data

x = x[order(apply(x,1,sd),decreasing=TRUE),]  # sort genes by standard deviation

x = x[1:n,]   # only keep the n genes

# this will cutoff very large values, which could skew the color 
x=as.matrix(x[1:n,])-apply(x[1:n,],1,mean)
cutoff = median(unlist(x)) + 4*sd (unlist(x)) 
x[x>cutoff] <- cutoff
cutoff = median(unlist(x)) - 4*sd (unlist(x)) 
x[x< cutoff] <- cutoff
	
groups = detectGroups(colnames(x) )
groups.colors = rainbow(length(unique(groups) ) )


	lmat = rbind(c(5,4),c(0,1),c(3,2))
	lwid = c(1.5,4)
	lhei = c(1,.2,4)


heatmap.2(x, distfun = dist2,hclustfun=hclust2,
	 col=greenred(75), density.info="none", trace="none", scale="none", keysize=.5
	,key=T, symkey=F
	,ColSideColors=groups.colors[ as.factor(groups)]
	,margins=c(8,12)
	,cexRow=1
	,srtCol=45
	,cexCol=1.  # size of font for sample names
	,lmat = lmat, lwid = lwid, lhei = lhei
	)

Identification of Differentially Expressed Genes

dds <- DESeq(dds)
res <- results(dds)

kable(head(res)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
ENSMUSG00000102693.1	0.7043239	3.5623773	6.556337	0.5433487	0.5868897	NA
ENSMUSG00000051951.5	0.9803019	0.3244198	6.442842	0.0503535	0.9598407	NA
ENSMUSG00000103377.1	2.7081125	3.1757639	5.419836	0.5859520	0.5579078	NA
ENSMUSG00000103201.1	1.1474583	4.2647949	6.546806	0.6514314	0.5147681	NA
ENSMUSG00000102592.1	1.9131690	5.0084508	5.437877	0.9210305	0.3570345	NA
ENSMUSG00000025900.12	4.2877013	6.7470978	3.112343	2.1678514	0.0301700	0.1442115

DESeq2 uses the Benjamini-Hochberg (BH) adjustment (Benjamini and Hochberg 1995) as implemented in the base R p.adjust function

res <- results(dds, alpha = 0.5, lfcThreshold=0.01)
summary(res)

## 
## out of 34139 with nonzero total read count
## adjusted p-value < 0.5
## LFC > 0.01 (up)    : 5794, 17%
## LFC < -0.01 (down) : 4094, 12%
## outliers [1]       : 3, 0.0088%
## low counts [2]     : 8605, 25%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Now lets sort genes by fold change

res <- res[order(abs( res$log2FoldChange), decreasing=TRUE),]
kable(head(res)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
ENSMUSG00000107277.1	178.134530	12.82857	1.906777	6.722639	0.0000000	0.0000000
ENSMUSG00000036281.13	17.898231	11.90953	1.997989	5.955753	0.0000000	0.0000002
ENSMUSG00000034189.5	21.363195	11.19200	2.064077	5.417433	0.0000001	0.0000035
ENSMUSG00000111241.1	26.215073	10.91199	1.885793	5.781118	0.0000000	0.0000005
ENSMUSG00000027479.14	4.844616	10.82957	3.264449	3.314364	0.0009185	0.0130936
ENSMUSG00000014932.15	77.883535	10.65416	1.686891	6.309929	0.0000000	0.0000000

MA Plot

Show the significant genes. The lower the average read counts for all samples and the higher the variation between the samples, the less significant those genes are.

DESeq2::plotMA(res,  ylim = c(-5, 5))

Volcano plot

library(dplyr)
res1 = as.data.frame(res)
# add a new column using the mutate function in dplyr
res1 = mutate(res1, sig=ifelse(res1$padj<0.05, "FDR<0.05", "Not Sig"))
res1[which(abs(res1$log2FoldChange)<0.5),'sig'] <- "Not Sig"


p = ggplot(res1, aes(log2FoldChange, -log10(padj))) +
  geom_point(aes(col=sig)) +
  scale_color_manual(values=c("red", "black"))
p

## Warning: Removed 8608 rows containing missing values (geom_point).

Gene Annotations

Plot counts of top gene

topGene <- rownames(res)[1]
plotCounts(dds, gene = topGene, intgroup=c("p53", "treatment"))

Here we see an interesting point or our normalized counts under the Trp53p_4h7Gy group that seems to be extremely high, while the other 3 replicates are around 0.5. I cannot get this portion to work, however. I get an error stating that None of the keys entered are valid keys for 'SYMBOL'.

Let's look at the keys that we have to work with for our res file from DESeq2

head(row.names(res))

## [1] "ENSMUSG00000107277.1"  "ENSMUSG00000036281.13" "ENSMUSG00000034189.5" 
## [4] "ENSMUSG00000111241.1"  "ENSMUSG00000027479.14" "ENSMUSG00000014932.15"

Now we need to find the same key in the Mm database.

library(AnnotationDbi)
library(org.Mm.eg.db)

columns(org.Mm.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"    
## [16] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"     
## [21] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"

#key = gsub("\\..*","", row.names(res))
res$symbol <- gsub("\\..*","", row.names(res))
#res$symbol <- gsub(" ","",row.names(res))

message("Ensembl IDs")

## Ensembl IDs

key.en = keys(org.Mm.eg.db, keytype="ENSEMBL")
head(key.en)

## [1] "ENSMUSG00000030359" "ENSMUSG00000020804" "ENSMUSG00000025375"
## [4] "ENSMUSG00000015243" "ENSMUSG00000028125" "ENSMUSG00000026944"

cat("\n\n")

message("SYMBOL names")

## SYMBOL names

key.sy = keys(org.Mm.eg.db, keytype="SYMBOL")
head(key.sy)

## [1] "Pzp"   "Aanat" "Aatk"  "Abca1" "Abca4" "Abca2"

These are ENSEMBL symbols, so we need to designate that when looking for the genes that we have.

res$ensembl <- gsub("\\..*","", row.names(res))
res$entrez <- mapIds(org.Mm.eg.db,
                     keys= res$ensembl,
                     column="ENTREZID",
                     keytype="ENSEMBL", #Out ID is ENSMBL
                     multiVals="first")

## 'select()' returned 1:many mapping between keys and columns

res$symbol <- mapIds(org.Mm.eg.db,
                     keys= res$ensembl,
                     column="SYMBOL",
                     keytype="ENSEMBL", #Out ID is ENSMBL
                     multiVals="first")

## 'select()' returned 1:many mapping between keys and columns

write.csv(res, file = "results.csv")

kable(head(res)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	symbol	ensembl	entrez
ENSMUSG00000107277.1	178.134530	12.82857	1.906777	6.722639	0.0000000	0.0000000	NA	ENSMUSG00000107277	NA
ENSMUSG00000036281.13	17.898231	11.90953	1.997989	5.955753	0.0000000	0.0000002	Snapc4	ENSMUSG00000036281	227644
ENSMUSG00000034189.5	21.363195	11.19200	2.064077	5.417433	0.0000001	0.0000035	Hsdl1	ENSMUSG00000034189	72552
ENSMUSG00000111241.1	26.215073	10.91199	1.885793	5.781118	0.0000000	0.0000005	NA	ENSMUSG00000111241	NA
ENSMUSG00000027479.14	4.844616	10.82957	3.264449	3.314364	0.0009185	0.0130936	Mapre1	ENSMUSG00000027479	13589
ENSMUSG00000014932.15	77.883535	10.65416	1.686891	6.309929	0.0000000	0.0000000	Yes1	ENSMUSG00000014932	22612

Let's make a file with just the genes with an adjusted p-value < 0.5

resSig = as.data.frame(subset(res,padj<0.5) )
resSig = resSig[order(resSig$log2FoldChange,decreasing=TRUE),]
head(resSig)

##                         baseMean log2FoldChange    lfcSE     stat       pvalue
## ENSMUSG00000107277.1  178.134530       12.82857 1.906777 6.722639 1.784620e-11
## ENSMUSG00000036281.13  17.898231       11.90953 1.997989 5.955753 2.588774e-09
## ENSMUSG00000034189.5   21.363195       11.19200 2.064077 5.417433 6.046073e-08
## ENSMUSG00000111241.1   26.215073       10.91199 1.885793 5.781118 7.420598e-09
## ENSMUSG00000027479.14   4.844616       10.82957 3.264449 3.314364 9.185180e-04
## ENSMUSG00000014932.15  77.883535       10.65416 1.686891 6.309929 2.791627e-10
##                               padj symbol            ensembl entrez
## ENSMUSG00000107277.1  2.190535e-09   <NA> ENSMUSG00000107277   <NA>
## ENSMUSG00000036281.13 2.052608e-07 Snapc4 ENSMUSG00000036281 227644
## ENSMUSG00000034189.5  3.540420e-06  Hsdl1 ENSMUSG00000034189  72552
## ENSMUSG00000111241.1  5.248069e-07   <NA> ENSMUSG00000111241   <NA>
## ENSMUSG00000027479.14 1.309363e-02 Mapre1 ENSMUSG00000027479  13589
## ENSMUSG00000014932.15 2.795021e-08   Yes1 ENSMUSG00000014932  22612

write.csv(resSig,"SigGenes.csv")

Here is a volcano plot that shows the symbol that we created at each point.

library(dplyr)
res1 = as.data.frame(res)
# add a new column using the mutate function in dplyr
res1 = mutate(res1, sig=ifelse(res1$padj<0.5, "FDR<0.05", "Not Sig"))
res1[which(abs(res1$log2FoldChange)<1),'sig'] <- "Not Sig"
p = ggplot(res1, aes(log2FoldChange, -log10(pvalue))) +
  geom_point(aes(col=sig)) +
  scale_color_manual(values=c("red", "black"))
p+geom_text(data=filter(res1, padj<1e-50), aes(label=symbol))

## Warning: Removed 7357 rows containing missing values (geom_point).

7. GO Enrichment analysis using GOstats

Here we will do a GO Enrichment analysis for genes that have a decreased fold-change of 5 or more

library(GO.db)
library(GOstats)
selectedGenes = unique(resSig[resSig$log2FoldChange>5,'entrez'])  # upregulated genes
universeGenes =  unique( mapIds(org.Mm.eg.db,
                     keys= res$ensembl,
                     column="ENTREZID",
                     keytype="ENSEMBL", #Out ID is ENSMBL
                     multiVals="first")
                    )


 hgCutoff <- 0.001
 params <- new("GOHyperGParams",
     geneIds=selectedGenes,
     universeGeneIds=universeGenes,
     annotation="org.Mm.eg.db",
     ontology="BP",
     pvalueCutoff=hgCutoff,
     conditional=FALSE,
     testDirection="over")

hgOver <- hyperGTest(params)
summary(hgOver)[1:10,]

##        GOBPID       Pvalue OddsRatio   ExpCount Count Size
## 1  GO:0019236 2.643566e-07  5.062236   4.423515    18   78
## 2  GO:0016049 1.730359e-05  2.028026  26.144107    49  461
## 3  GO:0048588 2.580339e-05  2.442482  13.951085    31  246
## 4  GO:1990138 3.035698e-05  2.725414  10.208111    25  180
## 5  GO:0050789 3.546769e-05  1.302786 562.863914   624 9925
## 6  GO:0001558 5.334949e-05  2.045100  22.174286    42  391
## 7  GO:0032879 5.962358e-05  1.395721 147.223650   191 2596
## 8  GO:0060560 1.073049e-04  2.275476  14.348068    30  253
## 9  GO:0061564 1.204488e-04  1.884894  26.711225    47  471
## 10 GO:0050794 1.250445e-04  1.272536 527.135525   584 9295
##                                              Term
## 1                           response to pheromone
## 2                                     cell growth
## 3                       developmental cell growth
## 4                     neuron projection extension
## 5                regulation of biological process
## 6                       regulation of cell growth
## 7                      regulation of localization
## 8  developmental growth involved in morphogenesis
## 9                                axon development
## 10                 regulation of cellular process

summary(hgOver)[1:10,c("GOBPID","Pvalue","Term")]

##        GOBPID       Pvalue                                           Term
## 1  GO:0019236 2.643566e-07                          response to pheromone
## 2  GO:0016049 1.730359e-05                                    cell growth
## 3  GO:0048588 2.580339e-05                      developmental cell growth
## 4  GO:1990138 3.035698e-05                    neuron projection extension
## 5  GO:0050789 3.546769e-05               regulation of biological process
## 6  GO:0001558 5.334949e-05                      regulation of cell growth
## 7  GO:0032879 5.962358e-05                     regulation of localization
## 8  GO:0060560 1.073049e-04 developmental growth involved in morphogenesis
## 9  GO:0061564 1.204488e-04                               axon development
## 10 GO:0050794 1.250445e-04                 regulation of cellular process

params1 <- params
ontology(params1) <- "CC"
hgOver <- hyperGTest(params1)
summary(hgOver)[1:10,c("GOCCID","Pvalue","Term")]

##        GOCCID       Pvalue                                        Term
## 1  GO:0031226 2.194852e-10      intrinsic component of plasma membrane
## 2  GO:0005887 2.497976e-10       integral component of plasma membrane
## 3  GO:0030424 3.300807e-07                                        axon
## 4  GO:0098978 7.089583e-07                       glutamatergic synapse
## 5  GO:0033267 1.725126e-06                                   axon part
## 6  GO:0044459 1.872288e-06                        plasma membrane part
## 7  GO:0099055 3.042517e-06 integral component of postsynaptic membrane
## 8  GO:0099699 4.152767e-06     integral component of synaptic membrane
## 9  GO:0045211 6.111879e-06                       postsynaptic membrane
## 10 GO:0016021 7.169574e-06              integral component of membrane

params1 <- params
ontology(params1) <- "MF"
hgOver <- hyperGTest(params1)
summary(hgOver)[1:10,c("GOMFID","Pvalue","Term")]

##        GOMFID       Pvalue                                         Term
## 1  GO:0016503 4.685420e-07                  pheromone receptor activity
## 2  GO:0005550 1.234794e-05                            pheromone binding
## 3  GO:0005549 2.293309e-05                              odorant binding
## 4  GO:0004888 3.280256e-05    transmembrane signaling receptor activity
## 5  GO:0038023 5.222020e-05                  signaling receptor activity
## 6  GO:0060089 8.100909e-05                molecular transducer activity
## 7  GO:0005488 6.931421e-04                                      binding
## 8  GO:0015267 8.237763e-04                             channel activity
## 9  GO:0022803 8.237763e-04   passive transmembrane transporter activity
## 10 GO:0046873 8.631819e-04 metal ion transmembrane transporter activity

GO Enrichment analysis of downregulated genes

Next we will have a look at the genes that are upregulated by a fold-change of 5 or greater.

selectedGenes = unique(resSig[resSig$log2FoldChange<5,'entrez'])  # upregulated genes

 params <- new("GOHyperGParams",
     geneIds=selectedGenes,
     universeGeneIds=universeGenes,
     annotation="org.Mm.eg.db",
     ontology="BP",
     pvalueCutoff=hgCutoff,
     conditional=FALSE,
     testDirection="over")

hgOver <- hyperGTest(params)
summary(hgOver)[1:10,c("GOBPID","Pvalue","Term")]

##        GOBPID       Pvalue                                          Term
## 1  GO:0016043 9.317561e-12               cellular component organization
## 2  GO:0008152 1.757957e-11                             metabolic process
## 3  GO:0071840 4.299006e-11 cellular component organization or biogenesis
## 4  GO:0044237 7.869800e-11                    cellular metabolic process
## 5  GO:0044238 1.013489e-09                     primary metabolic process
## 6  GO:0006807 1.956109e-09           nitrogen compound metabolic process
## 7  GO:0071704 2.363963e-09           organic substance metabolic process
## 8  GO:0006996 2.779804e-09                        organelle organization
## 9  GO:0033036 1.835538e-08                    macromolecule localization
## 10 GO:0051179 2.179269e-08                                  localization

8. Pathway analysis using expression data

# bioconductor packages
# source("https://bioconductor.org/biocLite.R");
# biocLite(c("pathview","gage","gageData"))
library(pathview) 
library(gage)

Prepare data

foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)

##     <NA>   227644    72552     <NA>    13589    22612 
## 12.82857 11.90953 11.19200 10.91199 10.82957 10.65416

library(gageData)
data(go.sets.mm)
data(go.subs.mm)
gobpsets = go.sets.mm[go.subs.mm$BP]
gobpres = gage(foldchanges, gsets=gobpsets, same.dir=TRUE)
#lapply(gobpres, head)
message("Greater")

## Greater

kable(head(gobpres$greater)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	p.geomean	stat.mean	p.val	q.val	set.size	exp1
GO:0019236 response to pheromone	0.0000000	7.637046	0.0000000	0.0000000	76	0.0000000
GO:0007420 brain development	0.0012778	3.026272	0.0012778	0.7925518	400	0.0012778
GO:0045664 regulation of neuron differentiation	0.0021154	2.869812	0.0021154	0.7925518	355	0.0021154
GO:0007156 homophilic cell adhesion	0.0023079	2.879945	0.0023079	0.7925518	70	0.0023079
GO:0001944 vasculature development	0.0030643	2.746850	0.0030643	0.7925518	487	0.0030643
GO:0033555 multicellular organismal response to stress	0.0038427	2.719970	0.0038427	0.7925518	53	0.0038427

message("Less")

## Less

kable(head(gobpres$less)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	p.geomean	stat.mean	p.val	q.val	set.size	exp1
GO:0007283 spermatogenesis	0.0094832	-2.352919	0.0094832	0.992104	292	0.0094832
GO:0048232 male gamete generation	0.0100757	-2.330049	0.0100757	0.992104	293	0.0100757
GO:0007266 Rho protein signal transduction	0.0180131	-2.109476	0.0180131	0.992104	117	0.0180131
GO:0007094 mitotic cell cycle spindle assembly checkpoint	0.0263331	-2.065183	0.0263331	0.992104	14	0.0263331
GO:0016441 posttranscriptional gene silencing	0.0274536	-1.957280	0.0274536	0.992104	35	0.0274536
GO:0007276 gamete generation	0.0283921	-1.907876	0.0283921	0.992104	388	0.0283921

message("Stats")

## Stats

kable(head(gobpres$stats)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	stat.mean	exp1
GO:0019236 response to pheromone	7.637046	7.637046
GO:0007420 brain development	3.026272	3.026272
GO:0045664 regulation of neuron differentiation	2.869812	2.869812
GO:0007156 homophilic cell adhesion	2.879945	2.879945
GO:0001944 vasculature development	2.746850	2.746850
GO:0033555 multicellular organismal response to stress	2.719970	2.719970

KEGG pathways

library(gageData)
data(kegg.sets.mm)
data(sigmet.idx.mm)
kegg.sets.mm = kegg.sets.mm[sigmet.idx.mm]
#head(kegg.sets.mm, 3)

message("Greater")

## Greater

kable(head(kegg.sets.mm$greater)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

message("Less")

## Less

kable(head(kegg.sets.mm$less)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

message("Stats")

## Stats

kable(head(kegg.sets.mm$stats)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

# Get the results
keggres = gage(foldchanges, gsets=kegg.sets.mm, same.dir=TRUE)

# Look at both up (greater), down (less), and statatistics.
#lapply(keggres, head, n=10)
message("Greater")

## Greater

kable(head(keggres$greater)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	p.geomean	stat.mean	p.val	q.val	set.size	exp1
mmu04360 Axon guidance	0.0285970	1.910570	0.0285970	0.8131848	128	0.0285970
mmu03410 Base excision repair	0.0355168	1.842982	0.0355168	0.8131848	28	0.0355168
mmu00591 Linoleic acid metabolism	0.0537749	1.628039	0.0537749	0.8131848	40	0.0537749
mmu04974 Protein digestion and absorption	0.0596166	1.567001	0.0596166	0.8131848	76	0.0596166
mmu01040 Biosynthesis of unsaturated fatty acids	0.0662476	1.531834	0.0662476	0.8131848	24	0.0662476
mmu03022 Basal transcription factors	0.0674085	1.514546	0.0674085	0.8131848	33	0.0674085

message("Less")

## Less

kable(head(keggres$less)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	p.geomean	stat.mean	p.val	q.val	set.size	exp1
mmu03020 RNA polymerase	0.0095316	-2.445444	0.0095316	0.9111323	23	0.0095316
mmu03450 Non-homologous end-joining	0.0341877	-1.999438	0.0341877	0.9111323	11	0.0341877
mmu00190 Oxidative phosphorylation	0.0669600	-1.505167	0.0669600	0.9111323	99	0.0669600
mmu03420 Nucleotide excision repair	0.0747360	-1.456591	0.0747360	0.9111323	39	0.0747360
mmu00563 Glycosylphosphatidylinositol(GPI)-anchor biosynthesis	0.0757982	-1.464592	0.0757982	0.9111323	25	0.0757982
mmu00240 Pyrimidine metabolism	0.1116636	-1.222140	0.1116636	0.9111323	87	0.1116636

message("Stats")

## Stats

kable(head(keggres$stats)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

	stat.mean	exp1
mmu04360 Axon guidance	1.910570	1.910570
mmu03410 Base excision repair	1.842982	1.842982
mmu00591 Linoleic acid metabolism	1.628039	1.628039
mmu04974 Protein digestion and absorption	1.567001	1.567001
mmu01040 Biosynthesis of unsaturated fatty acids	1.531834	1.531834
mmu03022 Basal transcription factors	1.514546	1.514546

# Get the pathways
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>% 
  tbl_df() %>% 
  filter(row_number()<=5) %>% 
  .$id %>% 
  as.character()
keggrespathways

## [1] "mmu03020 RNA polymerase"                                       
## [2] "mmu03450 Non-homologous end-joining"                           
## [3] "mmu00190 Oxidative phosphorylation"                            
## [4] "mmu03420 Nucleotide excision repair"                           
## [5] "mmu00563 Glycosylphosphatidylinositol(GPI)-anchor biosynthesis"

# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids

## [1] "mmu03020" "mmu03450" "mmu00190" "mmu03420" "mmu00563"

# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mmu", new.signature=FALSE)

# plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mmu"))

Pathway and regulation of genes for Oxidative phosphorylation.

Pathway and regulation of genes for Glycosylphosphatidylinositol(GPI)-anchor biosynthesis.

Pathway and regulation of genes for RNA polymerase.

Pathway and regulation of genes for Nucleotide excision repair.

Pathway and regulation of genes for Non-homologous end-joining.

#De novo Assembly {#denovo-assembly}

AT THIS TIME, MINICONDA3 WAS INSTALLED SO FURTHER TOOLS ARE ALL INSTALLED UNDER THIS

Having the genome for RNA-sequencing analysis is very useful, but sometimes it is not available so de novo assembly is used. de novo assembly is using the reads from sequencing to create longer seqeunces called contigs (contiguous sequences). These sequences are then compared to a protein database to get an idea of what proteins are possibly present. The raw sequences are also mapped back to the contigs, treating the contigs as the "genome", much like we do when we have the genome from the host available.

Lets jump right into it. We are going to be using Trinity for the de novo assembly and analysis because there are many tools built into the Trinity tool that allow for the building of contigs, counting of sequences, and even DeSeq/edgeR analyses.

To install Trinity, all I did was ran conda install -c bioconda trinity. This installs all of the previously mentioned tools in one go. In addition to Trinity, BLAST (conda install -c bioconda blast) and RSEM (conda install -c bioconda rsem) were installed.

Trinity --seqType fq \
--left trimmedReads/SRR2121770_Trimmed_1P.fq.gz,trimmedReads/SRR2121771_Trimmed_1P.fq.gz,trimmedReads/SRR2121774_Trimmed_1P.fq.gz,trimmedReads/SRR2121775_Trimmed_1P.fq.gz,trimmedReads/SRR2121778_Trimmed_1P.fq.gz,trimmedReads/SRR2121779_Trimmed_1P.fq.gz,trimmedReads/SRR2121780_Trimmed_1P.fq.gz,trimmedReads/SRR2121781_Trimmed_1P.fq.gz,trimmedReads/SRR2121786_Trimmed_1P.fq.gz,trimmedReads/SRR2121787_Trimmed_1P.fq.gz,trimmedReads/SRR2121788_Trimmed_1P.fq.gz,trimmedReads/SRR2121789_Trimmed_1P.fq.gz \
--right trimmedReads/SRR2121770_Trimmed_2P.fq.gz,trimmedReads/SRR2121771_Trimmed_2P.fq.gz,trimmedReads/SRR2121774_Trimmed_2P.fq.gz,trimmedReads/SRR2121775_Trimmed_2P.fq.gz,trimmedReads/SRR2121778_Trimmed_2P.fq.gz,trimmedReads/SRR2121779_Trimmed_2P.fq.gz,trimmedReads/SRR2121780_Trimmed_2P.fq.gz,trimmedReads/SRR2121781_Trimmed_2P.fq.gz,trimmedReads/SRR2121786_Trimmed_2P.fq.gz,trimmedReads/SRR2121787_Trimmed_2P.fq.gz,trimmedReads/SRR2121788_Trimmed_2P.fq.gz,trimmedReads/SRR2121789_Trimmed_2P.fq.gz \
--CPU 80 --max_memory 2000G --min_contig_length 150

TrinityStats.pl trinity_out_dir/Trinity.fasta

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  602870
Total trinity transcripts:      730297
Percent GC: 45.89

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 8589
        Contig N20: 5955
        Contig N30: 4400
        Contig N40: 3276
        Contig N50: 2413

        Median contig length: 413
        Average contig: 1028.24
        Total assembled bases: 750921120


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 5470
        Contig N20: 3434
        Contig N30: 2391
        Contig N40: 1741
        Contig N50: 1285

        Median contig length: 343
        Average contig: 703.04
        Total assembled bases: 423841627

bowtie2-build trinity_out_dir/Trinity.fasta trinity_out_dir/Trinity.fasta
bowtie2 --local --no-unal -x trinity_out

#download uniprot swiss-prot db #ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

gunzip uniprot_sprot.fasta.gz
mkdir -p blast_protdb
mv uniprot_sprot.fasta blast_protdb/.
makeblastdb -in uniprot_sprot.fasta -dbtype prot
cd ..

blastx -query trinity_out_dir/Trinity.fasta -db blast_protdb/uniprot_sprot.fasta -out blastx.outfmt6 -evalue 1e-20 -num_threads 80 -max_target_seqs 1 -outfmt 6
analyze_blastPlus_topHit_coverage.pl blastx.outfmt6 trinity_out_dir/Trinity.fasta blast_protdb/uniprot_sprot.fasta | column -t

#hit_pct_cov_bin  count_in_bin  >bin_below
100               7361          7361
90                1117          8478
80                981           9459
70                971           10430
60                1003          11433
50                865           12298
40                924           13222
30                1102          14324
20                1478          15802
10                932           16734

$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121770_1.fastq.gz -2 rawReads/SRR2121770_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121770.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121771_1.fastq.gz -2 rawReads/SRR2121771_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121771.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121774_1.fastq.gz -2 rawReads/SRR2121774_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121774.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121775_1.fastq.gz -2 rawReads/SRR2121775_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121775.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121778_1.fastq.gz -2 rawReads/SRR2121778_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121778.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121779_1.fastq.gz -2 rawReads/SRR2121779_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121779.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121780_1.fastq.gz -2 rawReads/SRR2121780_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121780.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121781_1.fastq.gz -2 rawReads/SRR2121781_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121781.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121786_1.fastq.gz -2 rawReads/SRR2121786_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121786.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121787_1.fastq.gz -2 rawReads/SRR2121787_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121787.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121788_1.fastq.gz -2 rawReads/SRR2121788_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121788.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121789_1.fastq.gz -2 rawReads/SRR2121789_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121789.coordSorted.bam

$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121770_1.fastq.gz --right rawReads/SRR2121770_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121770.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121771_1.fastq.gz --right rawReads/SRR2121771_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121771.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121774_1.fastq.gz --right rawReads/SRR2121774_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121774.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121775_1.fastq.gz --right rawReads/SRR2121775_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121775.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121778_1.fastq.gz --right rawReads/SRR2121778_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121778.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121779_1.fastq.gz --right rawReads/SRR2121779_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121779.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121780_1.fastq.gz --right rawReads/SRR2121780_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121780.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121781_1.fastq.gz --right rawReads/SRR2121781_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121781.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121786_1.fastq.gz --right rawReads/SRR2121786_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121786.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121787_1.fastq.gz --right rawReads/SRR2121787_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121787.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121788_1.fastq.gz --right rawReads/SRR2121788_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121788.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121789_1.fastq.gz --right rawReads/SRR2121789_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121789.RSEM

time abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix Trinity_trans SRR2121770.RSEM/SRR2121770.RSEM.trans.results \
SRR2121771.RSEM/SRR2121771.RSEM.trans.results \
SRR2121774.RSEM/SRR2121774.RSEM.trans.results \
SRR2121775.RSEM/SRR2121775.RSEM.trans.results \
SRR2121778.RSEM/SRR2121778.RSEM.trans.results \
SRR2121779.RSEM/SRR2121779.RSEM.trans.results \
SRR2121780.RSEM/SRR2121780.RSEM.trans.results \
SRR2121781.RSEM/SRR2121781.RSEM.trans.results \
SRR2121786.RSEM/SRR2121786.RSEM.trans.results \
SRR2121787.RSEM/SRR2121787.RSEM.trans.results \
SRR2121788.RSEM/SRR2121788.RSEM.trans.results \
SRR2121789.RSEM/SRR2121789.RSEM.trans.results --gene_trans_map none

time abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix Trinity_genes SRR2121770.RSEM/SRR2121770.RSEM.genes.results \
SRR2121771.RSEM/SRR2121771.RSEM.genes.results \
SRR2121774.RSEM/SRR2121774.RSEM.genes.results \
SRR2121775.RSEM/SRR2121775.RSEM.genes.results \
SRR2121778.RSEM/SRR2121778.RSEM.genes.results \
SRR2121779.RSEM/SRR2121779.RSEM.genes.results \
SRR2121780.RSEM/SRR2121780.RSEM.genes.results \
SRR2121781.RSEM/SRR2121781.RSEM.genes.results \
SRR2121786.RSEM/SRR2121786.RSEM.genes.results \
SRR2121787.RSEM/SRR2121787.RSEM.genes.results \
SRR2121788.RSEM/SRR2121788.RSEM.genes.results \
SRR2121789.RSEM/SRR2121789.RSEM.genes.results --gene_trans_map none

contig_Exn50_statistic.pl Trinity_trans.TMM.EXPR.matrix trinity_out_dir/Trinity.fasta > ExN50_trans.stats

Make samples.txt

mock-	SRR2121770.RSEM
mock-	SRR2121771.RSEM
IR-	SRR2121774.RSEM
IR-	SRR2121775.RSEM
mock+	SRR2121778.RSEM
mock+	SRR2121779.RSEM
mock+	SRR2121780.RSEM
mock+	SRR2121781.RSEM
IR+	SRR2121786.RSEM
IR+	SRR2121787.RSEM
IR+	SRR2121788.RSEM
IR+	SRR2121789.RSEM

The way that the installed version of run_DE_analysis.pl tries to install or check for edgeR doesn't work for the new versions of R (3.6.1). In the perl script I just had to edit a few lines under the edgeR section for it to run smoothly, as well as install *BiocManager before hand.

I changed the perl script, under sub run_edgeR_sample_pair, when writing the R script to look like:

    print $ofh "if (! require(edgeR)) {\n";
    print $ofh "   install.packages(\"BiocManager\")\n";
    print $ofh "   BiocManager::install(\"edgeR\")\n";
    print $ofh "   library(edgeR)\n";
    print $ofh "}\n\n";

To make sure BiocManager and edgeR work, they were installed in the R terminal.

R #runs the installed version of R in terminal mode
install.packages("BiocManager")
BiocManager::install("edgeR")

Since we don't have at least 3 replicates for each of the treatments, edgeR wants a dispersion parameter. 0.035 seems to be pretty common. If the dispersion parameter has the squareroot taken of it we get ~ 0.19, meaning that we are saying the true abundance for each gene can vary up or down by 19% between replicates.

run_DE_analysis.pl --matrix Trinity_trans.counts.matrix --method edgeR --output edgeR_trans --dispersion 0.035 --samples_file samples.txt

head edgeR_trans/Trinity_trans.counts.matrix.IR-_vs_mock-.edgeR.DE_results | column -t

                          sampleA  sampleB  logFC           logCPM             PValue                FDR
TRINITY_DN2468_c0_g4_i1   IR-      mock-  -14.7753615825754  4.45315041918     5.51021995613131e-115  5.85185359341145e-110
TRINITY_DN2505_c0_g1_i2   IR-      mock-  -10.1817621187735  4.11815268306374  8.54695600746168e-88   4.53843363996215e-83
TRINITY_DN3231_c0_g2_i15  IR-      mock-  -13.5922145319911  3.27307743751541  6.29068575237937e-81   2.2269027563423e-76
TRINITY_DN4975_c0_g1_i7   IR-      mock-  13.4941229058571   3.17132827495398  5.25978730838885e-79   1.39647353037724e-74
TRINITY_DN4360_c0_g1_i5   IR-      mock-  -13.5215746751431  3.20272600532575  1.0222845072488e-70    2.17133229339645e-66
TRINITY_DN128_c0_g1_i20   IR-      mock-  13.1582688371709   2.83599676163376  1.8443766792831e-69    3.26454672233108e-65
TRINITY_DN644_c1_g1_i6    IR-      mock-  -5.91850038396836  4.31456225100214  8.01440426808572e-67   1.21589961895815e-62
TRINITY_DN2043_c0_g1_i13  IR-      mock-  13.7803059024088   3.45719012794474  3.30519380237759e-66   4.38764477265625e-62
TRINITY_DN3321_c0_g1_i2   IR-      mock-  12.8215942182179   2.49995284585527  9.5059376972195e-61    1.1217006482719e-56

Files

Mouse_RNA_Seq_p53_genotoxic.md

Latest commit

History