Skip to content

Latest commit

 

History

History
6493 lines (5888 loc) · 264 KB

Mouse_RNA_Seq_p53_genotoxic.md

File metadata and controls

6493 lines (5888 loc) · 264 KB
title author date output
RNA Seq for STAT736
Alex Soupir
9/10/2019
html_document
keep_md
true

Background

For STAT736-Fall-2019, we are analyzing the RNA-Seq from the publication Genome-wide analysis of p53 transcriptional programs in B cells upon exposure to genotoxic stress in vivo. We are only using the sequences B cells from spleen and not the non-B cells from spleen from the SRA Run Selector on NCBI.

The mice were exposed to whole-body ionizing radiation and sequences were extracted from both Bcells and non-B cells from the spleens of the mice. Two genotypes of mice were used: mice with p53 knocked out and the wild-type C57/Bl6. There were 4 different group combinations including the 2 different genotypes; each genotype was subjected to the ionizing radiation as well as control/mock.

Treatment groups of the mice that were either controls or treated with ionizing radiation to determine reaction of p53.
Genotype Treatment
Group 1 p53 Mock
Group 2 C57/Bl6 Mock
Group 3 p53 IR
Group 4 C57/Bl6 IR

This document will contain 2 different pipelines: The first one is going to be using the genome to map the reads too, and the secod is going to be de novo.

Genome De novo

#Genome Mapping {#genome-mapping}

The pipeline used in this analysis used conda on South Dakota State University's High Performance Computing cluster to run the programs FastQC, Trimmomatic, and Tophat.

This is different than previous RNA-Seq analyses where I used my workstation pc with Ubuntu 18.04 to run FastQC, Trimmomatic, HiSat2, HTSeq, and DESeq2 locally. Also, the previous RNA-Seq alayses were of Soybean with treatment combinations of mycorrhizae and rhizobia inoculation.

Cleaning Data

Programs used?

  • FastQC
  • Trimmomatic 0.39
  • Bowtie 2.2.5.0
  • Tophat 2.1.1
  • STAR
  • Cufflinks
  • Kraken2
  • MultiQC
  • featureCount

Picking the right node

To find a node that we can use on our own, we need to see which nodes are already allocated to jobs and which ones are idle. To do this, we can run sinfo. We want to pick one of the nodes that are marked 'idle' so we get the whole thing and we aren't interrupting someone elses job. For the sake of this exercise, lets work on big-mem.

Once a node that is idle has been found, you can ssh into it by typing ssh -X big-mem00# where # is the node number.

ssh -X big-mem005

Once on the node, the modules will have to be pulled from the shared folder again, otherwise we will be left with very basic ones. NOTE: if running programs that are in a personal folder such as miniconda (these examples), it is not necessary to add the other modules.

module use /cm/shared/modulefiles_local/

After loading the modules you can use it just as you would any other command line.

Creating slurm scripts

When running on a cluster, it can sometime be difficult to find open nodes with the resources needed to run the jobs that we have. Making a slurm script is really easy. Fist we make a new file with the touch command.

touch commands.slurm

Now in our directory we have the file commands.slurm which we can edit to hold our code in. We can edit it with the vi command.

vi commands.slurm

We have a few things that we need to put in the file header so slurm knows what to do with our commands.

#!/bin/bash

#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --output=job-%j-%N.log
#SBATCH --partition=bigmem
#SBATCH --time=10:00:00

When we break this down, we see --job-name which is what we will see when we look at whats running later, --nodes is the number of nodes we have, --ntasks-per-node is the number of cores that we are requesting to have allocated, --output is the output log file of the job (here it names the output file with the job number and the node that we used), --partition here is requesting a big-mem node but compute can also be used, and finally --time is how long we are requesting the allocation for.

If the time runs out before the job is done I believe that it just kills the job even if not finished so we need to think a little about how much time to set. If the time is set too low, the job is killed and if the time is set too long, we may face issues with getting the node allocated to us.

To submit a job we can use sbatch commands.slurm and then we have the job ID. To check the status of our submission we use sbatch and then it shows all of the submitted jobs and how long they have been running along with the name that we set in the script.

Acquiring sequences

To download the sequences from the sequence read archive (SRA), the SRA Toolkit was used. The downloading of the files took a very long time, so this was left to run over night. The --gzip was used to keep the files a relatively small, although this can be left out to download uncompressed files, and --split-files was used to split the forward read from the reverse read for paired end read trimming through Trimmomatic.

~/tools/sratoolkit.2.9.6-1-centos_linux64/bin/fastq-dump --gzip --split-files SRR2121770

This is an example of the single file, but the above code needed to be ran for all of the following SRA numbers:

  • SRR2121770
  • SRR2121771
  • SRR2121774
  • SRR2121775
  • SRR2121778
  • SRR2121779
  • SRR2121780
  • SRR2121781
  • SRR2121786
  • SRR2121787
  • SRR2121788
  • SRR2121789

The results from downloading with --split-files gives 2 files per SRR, as mentioned before, one forward and one reverse. The suffix of the split files is one with _1.fastq.gz and another with _2.fastq.gz.

FastQC

FastQC can be run on all of the read files by using the wild card (*) as in *.fastq.gz. This prevents the need to hard code each individual read file into a FastQC command, which saves a lot of time since there are 24 read files in total for these 12 samples.

~/miniconda2/bin/fastqc *.fastq.gz

The output from running FastQC is a zipped folder and an HTML file for each of the \fastq.gz files in the folder. The HTML document looks something like this:

FastQC of Raw SRR2121770_1.fastq.gz Read

This is just the top of the file, and every category under the Summary heading has a graph that shows how the read quality looks for that particular metric. These reports can give insight into whether the reads are of decent quality or if the quality is poor.

The raw reads we have here all passed for adapter content and sequence length distribution and everything failed per base sequence content. SRR2121770, SRR2121771, SRR2121774, SRR2121775, SRR2121788, SRR2121781-2, and SRR2121789-1 were fairly decent quality reads. SRR2121778, SRR2121779, SRR2121780, SRR2121786, SRR2121787, SRR2121781-1, and SRR2121789-2 were of fairly lower quality (failing 3 or more in both reads. All of them failed both per base sequence quality and per tile sequence quality.

MultiQC

First lets install multiqc with conda. The command for this is conda install -c bioconda multiqc.

When that is finished, we can run MultiQC in the folder with the QC files (they should be moved into a folder alone so things don't get cluttered later on in the analyses).

~/miniconda/bin/multiqc .

When MultiQC is finished running, there will be a new folder called multiqc_data where the summaries are stored. Now lets go back up a level where our raw data folder and fastqc folder is and make a new folder for all of our MultiQC data. We will copy the FastQC output from MultiQC to this new folder.

mkdir MultiQC_All

cp RawQC/multiqc_data/multiqc_fastqc.txt MultiQC_All/

Trimming with Trimmomatic

Conda was used again to run Trimmomatic. This isn't as easy as using the wildcard like with FastQC because each output has to be personalized for the read files that are input into Trimmomatic. Also, we have to make sure that the adapter sequences are in the same folder that we are running so we can refer to them easily when calling the Trimmomatic program. In this case, we are using the TruSeq3-PE-2.fa adapter sequences For example:

~/miniconda2/bin/trimmomatic PE SRR2121770_1.fastq.gz SRR2121770_2.fastq.gz 770_fp.fq.gz 770_fu.fq.gz 770_rp.fq.gz 770_ru.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRALING:3 MINLEN:36 &

This would be repeated for each of the pairs (12 in total). We are trimming paired-end reads with the TruSeq3-PE-2 adapters. We are chopping off the first and last 3 bases and if we end up with a sequence less than 36 bases, we get rid of it. We want to make sure that there are enough bases in a read to work with. These parameters can be tweaked for possibly better end results with less being discarded. When Trimmomatic is finished running, it will out put the total number of reads, the total number from both the forward and reverse reads that are kept, the number of only forward reads kept, the number of only reverse reads kept, and the number of discarded reads.

The highest number of reads dropped was from trimming SRR2121786, where 20.55% dropped. Most reads were between 5% and 10% dropped. SRR2121786, SRR2121787, and SRR2121779 had sequence drops greater than 15%.

The trimmed reads can be analyzed again with FastQC to see how well the trimming worked to make the file better quality. After running FastQC on the trimmed files we see that the quality of those that were really bad quality were improved. There were a few different metrics throughout all of the files that bounced from a warning before the failing, or from passing before to a warning, and so forth, overall creating better quality read files.

Alignment

Using Tophat

Tophat can be installed using the same conda install (conda install -c bioconda tophat). When this is finished installing, then we will need to get the mouse genome from the Johns Hopkins Univeristy Center for computational BIology. The version of the mouse genome that I am using here is the NCBI build37.2. Instead of downloading this from the website and having to move it to the cluser, I will just download it using wget into the folder that has the raw reads, trimmed reads, and the FastQC files.

wget ftp://igenome:[email protected]/Mus_musculus/NCBI/build37.2/Mus_musculus_NCBI_build37.2.tar.gz

This will take a long time to download because the file is a little less than 16GB zipped.

We notice here that we have a zipped tar file. To make this file easier to use, lets unzip it.

tar zxvf Mus_muculus_NCBI_build37.2.tar.gz

Since Tophat is requiring *.bt21 files (large index) and the files downloaded for the genome above are only small index files, we have to create a large index using bowtie2-build. For this, lets navigate to the WholeGenomeFasta folder within the extracted folder and then run bowtie2-build.

~/miniconda2/bin/bowtie2-build --large-index genome.fa genome

This process took about 26 minutes to run. Now lets copy the index files to a folder close to our reads so we can access them easier, rather than having to refer to the longer path where we build them. After they are copied to a new folder closer to our working directory, I went ahead and unzipped the trimmed read files to try and make the Tophat faster but it turned out not to work. The multicore call with -p didn't use more cores than 1 until bowtie2-align-s, then 20 cores were used.

~/miniconda2/bin/tophat --no-converage-search -p 20 -G Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf -0 770_thout ./Index/genome 770_fp.fq.gz 770_rp.fq.gz 770_fu.fq 770_ru.fq

This run took almost 3 hours to complete.. Running with 80 cores rather than 20 cores took just 4 minutes less, so the whole process must be limited by a single core and the core clock speed. The process does use close to 8,000% at its peak so there is a benefit to multicore, just isn't very scalable.

Using STAR

STAR can be installed the same way as the previous programs with conda install (conda install -c bioconda star). In order to run STAR, we need to creaate indices just like with tophat, but STAR has this built in. I'm going to be using the same genome and GTF file as previously downloaded, but Dr. Ge uses a different zipped genome from the gencode database.

~/miniconda2/bin/STAR \
--runThreadN 80 \
--runMode genomeGenerate \
--genomeDir starIndex \
--genomeFastaFiles Index/genome.fa \ #same when we made the bowtie indices 
--sjdbGTFfile Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf

With the index files made, we can start aligning with STAR. It's important here than we only pick the paired end reads and not use all of the reads. Tophat is able to use all 4 reads but STAR doesn't allow that, so we need to make sure that we feed in the large files from trimming.

~/miniconda2/bin/STAR --runThreadN 80 --genomeDir starIndex --readFilesIn 770_fp.fq 770_rp.fq --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix 2121770 --outSAMtype BAM SortedByCoordinate

Assembling transcripts with Cufflinks

Once STAR is done running, we can assemble the transcripts with Cufflinks. This can also be installed with conda install (conda install -c bioconda cufflinks).

~/miniconda2/bin/cufflinks -p 20 -o SRR2121771_clout --library-type fr-firststrand 2121770Aligned.sortedByCoord.out.bam

Checking for Contamination

PhiX contamination

Now we will look at what kind of contamination we are looking at. When samples are sequenced with Illumina, a PhiX control is run along side them. This control is for cluster generation, sequencing, alignment, and calibration for cross-talk matrix generation. We will use Bowtie to create a file to determine the PhiX contamination level.

~/miniconda2/bin/bowtie2 -p 20 -x PhiX/Illumina/RTA/Sequence/Bowtie2Index/genome \
-1 TrimmedReads/770_fp.fq  -2 TrimmedReads/770_rp.fq -S phix.sam &> PhiXout/SRR2121770_phix.out

When the job is done running, the output file will show how much PhiX contamination we have. For example, lookin at the SRR2121770_phix.out created above, we see that 0.11% of the reads aligned with PhiX. The lower this value the better.

rRNA Sequences

To retreive the rRNA sequences for mouse, we need to search the taxonomy database on NCBI for Mus musculus. Click on Mus musculus on the next page, and then the top Mus musculus at the head of the list. Now, select the top subtree link in the Nucleotide database. Select rRNA sequences on the left side of the page and download full list just downloading with Send > Complete Record > File > FASTA > Create File. Drag the file using WinSCP to the raw folder on the cluster and rename it to rRNA.fa.

We are going to need to install bwa with conda in order to get the alignments to work. This can be done with conda install -c bioconda bwa. Following this, we will need to make indixes for the rRNA that we downloaded. To make this more clean, lets make a directory for the rRNA sequences that we downloaded and the indices that we make.

mkdir rRNA

Then we move the rRNA.fa to the new rRNA folder with WinSCP and then we can run the bwa.

time ~/miniconda2/bin/bwa mem -t 20 rRNA/rRNA.fa TrimmedReads/770_fp.fq TrimmedReads/770_rp.fq > rnaAlign/770_rna.sam

When we are done creating the new *.sam files for all of the forward/reverse read combinations, we can use samtools to convert the *.sam file to *.bam files which are essentially the same file just that sam is easier for us to look at while bam is binary. Samtools can be installed with conda install -c bioconda samtools.

~/miniconda2/bin/samtools view -@ 10 -bS -o rnaAlign/770_rna.bam rnaAlign/770.sam

Now in the rnaAlign folder we have our sam and bam file for each of the libraries. Lets create an output file with flagstat.

~/miniconda2/bin/samtools flagstat -@ 10 rnaAlign/770_rna.out

Wihtin this file we will be able to see the summary of our alignments to the rRNA file that we downloaded from NCBI.

#From 770_rna.out
205559289 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
85 + 0 supplementary
0 + 0 duplicates
4265179 + 0 mapped (2.07% : N/A)
205559204 + 0 paired in sequencing
102779602 + 0 read1
102779602 + 0 read2
4151684 + 0 properly paired (2.02% : N/A)
4187608 + 0 with itself and mate mapped
77486 + 0 singletons (0.04% : N/A)
4222 + 0 with mate mapped to a different chr
1026 + 0 with mate mapped to a different chr (mapQ>=5)

Bacterial contamination

In order to find out the contamination, we need to install Kraken2 with conda install -c bioconda kraken2 and download a pre-built database containing bacteria, archaea, and viral sequences. The database we are going to download only contains about 5% of k-mers from the original database (but directions are sort of lacking to build an entirely new database). More information can be found at https://ccb.jhu.edu/software/kraken/ for the pre-built databases.

Using the code in the next chunk will download the 8GB database and then extract the files so we can use them with the Kraken2 program. Lets do this in the main project folder.

wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz

tar xzf minikraken2_v2_8GB_201904_UPDATE.tgz

Now lets make a directory for the output.

mkdir krakenOut

We can call Kraken with the extracted database folder and point it to the location of out paired end reads from trimming and to the output folder that we just created for the outputs.

~/miniconda2/bin/kraken2 --db minikraken2_v2_8GB_201904_UPDATE/ --output krakenOut/770.out --threads 10 --paired TrimmedReads/770_fp.fq TrimmedReads/770_rp.fq

When Kraken is done running, it will print out the number (and percentage) of reads that were classified. In this case, we have used 102779602 sequences, of which 19142843 sequences were classified (18.63%) and 83636759 sequences were unclassified (81.37%). My interpretation of this is that 18.63% of the reads are possibly from microbial cell contamination.

Counting Transcripts

Since we have the bam files from the alignments of the different samples, we can count the features for each and get the transcipt counts using featureCounts form conda install -c bioconda/label/cf201901 subread. The genome and annotations that we previously downloaded were from genome mm9 so we have to specify to featureCounts what we want to actually count. FeatureCounts defaults to using gene_id which our output bam files don't have described correctly for featureCounts to read them. This is a single line of code because we can use a wildcard to run through all of the bam files.

#Move to the Star Alignment output folder for a working directory
cd StarOut

~/miniconda2/bin/featureCounts -a /gpfs/scratch/alex.soupir/Mus/raw/Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf -g 'transcript_id' -o readCounds.txt *bam

With the files that we are working with, this will take between 3.5 minutes to 5 minutes per bam file. The output will be a file that can be imported into excel and saved as csv which we then can work with in R.

Final QC of cleaning the data

Lets look at the data that we have collected from all of the MultiQC runs that we had with initial FastQC, Trimmomatic, STAR alignment, PhiX contamination, rRNA contamination, and the final feature counts.

qc = read.csv('Whole Data QC.csv', header=TRUE, na.strings="")
kable(qc) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "800px")
X Sample SRR2121770_1 SRR2121770_2 SRR2121771_1 SRR2121771_2 SRR2121774_1 SRR2121774_2 SRR2121775_1 SRR2121775_2 SRR2121778_1 SRR2121778_2 SRR2121779_1 SRR2121779_2 SRR2121780_1 SRR2121780_2 SRR2121781_1 SRR2121781_2 SRR2121786_1 SRR2121786_2 SRR2121787_1 SRR2121787_2 SRR2121788_1 SRR2121788_2 SRR2121789_1 SRR2121789_2
FastQC adapter_content pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass
NA Sequences flagged as poor quality 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NA sequence_duplication_levels warn warn pass pass pass pass pass pass warn pass pass pass pass pass warn warn warn pass pass pass fail fail fail warn
NA avg_sequence_length 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51
NA Encoding Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9 Sanger / Illumina 1.9
NA per_base_sequence_quality pass pass pass pass pass pass pass pass fail fail fail fail fail fail fail pass fail fail fail fail pass pass pass fail
NA sequence_length_distribution pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass
NA Sequence length 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51
NA File type Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls Conventional base calls
NA basic_statistics pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass
NA per_sequence_gc_content pass pass pass pass pass pass warn warn warn warn pass fail warn fail pass pass warn fail warn fail fail fail fail fail
NA Total Sequences 118323219 118323219 103127231 103127231 91225885 91225885 108661623 108661623 136766553 136766553 124265478 124265478 99685598 99685598 106360532 106360532 99988292 99988292 75410310 75410310 82750449 82750449 92454136 92454136
NA per_base_n_content pass pass warn pass warn pass pass pass fail warn fail warn fail warn pass pass fail fail fail warn pass pass pass fail
NA per_base_sequence_content fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail
NA overrepresented_sequences warn warn warn warn warn warn warn warn warn warn pass warn pass pass warn warn warn warn pass pass warn warn warn warn
NA %GC 47 47 46 46 47 47 47 47 46 47 46 46 46 46 46 46 47 47 46 46 47 47 47 48
NA total_deduplicated_percentage 52.20958673 55.28743506 90.47297502 97.22714143 91.11411337 97.47431986 72.31423101 95.57923295 65.27895305 83.85617925 73.08810256 79.50169478 92.46127029 94.09779679 54.09833163 52.4832411 68.88309902 74.63550918 93.00930552 94.9666922 43.56442006 44.95245995 42.6643141 57.36052687
NA Filename SRR2121770_1.fastq.gz SRR2121770_2.fastq.gz SRR2121771_1.fastq.gz SRR2121771_2.fastq.gz SRR2121774_1.fastq.gz SRR2121774_2.fastq.gz SRR2121775_1.fastq.gz SRR2121775_2.fastq.gz SRR2121778_1.fastq.gz SRR2121778_2.fastq.gz SRR2121779_1.fastq.gz SRR2121779_2.fastq.gz SRR2121780_1.fastq.gz SRR2121780_2.fastq.gz SRR2121781_1.fastq.gz SRR2121781_2.fastq.gz SRR2121786_1.fastq.gz SRR2121786_2.fastq.gz SRR2121787_1.fastq.gz SRR2121787_2.fastq.gz SRR2121788_1.fastq.gz SRR2121788_2.fastq.gz SRR2121789_1.fastq.gz SRR2121789_2.fastq.gz
NA per_tile_sequence_quality warn fail fail fail fail fail fail fail fail fail fail fail fail fail fail warn fail fail fail fail warn pass pass fail
NA per_sequence_quality_scores pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass pass fail pass fail pass pass pass pass
Trimming surviving 102779602 NA 88509413 NA 77240029 NA 94179097 NA 108337271 NA 92924182 NA 76172416 NA 95488961 NA 69906417 NA 54427039 NA 73771492 NA 82764590 NA
NA surviving_pct 86.86 NA 85.83 NA 84.67 NA 86.67 NA 79.21 NA 74.78 NA 76.41 NA 89.78 NA 69.91 NA 72.17 NA 89.15 NA 89.52 NA
NA reverse_only_surviving_pct 4.48 NA 2.52 NA 2.53 NA 2.29 NA 2.11 NA 2.44 NA 2.52 NA 1.8 NA 2.59 NA 2.65 NA 1.85 NA 1.77 NA
NA forward_only_surviving 3202267 NA 5041955 NA 4667583 NA 4832339 NA 9375417 NA 7916665 NA 6350522 NA 3894412 NA 6944977 NA 5349296 NA 3489058 NA 3797776 NA
NA dropped 7036792 NA 6973727 NA 7013028 NA 7160952 NA 16166923 NA 20393774 NA 14647690 NA 5067412 NA 20551506 NA 13638339 NA 3962243 NA 4252397 NA
NA dropped_pct 5.95 NA 6.76 NA 7.69 NA 6.59 NA 11.82 NA 16.41 NA 14.69 NA 4.76 NA 20.55 NA 18.09 NA 4.79 NA 4.6 NA
NA forward_only_surviving_pct 2.71 NA 4.89 NA 5.12 NA 4.45 NA 6.86 NA 6.37 NA 6.37 NA 3.66 NA 6.95 NA 7.09 NA 4.22 NA 4.11 NA
NA input_read_pairs 118323219 NA 103127231 NA 91225885 NA 108661623 NA 136766553 NA 124265478 NA 99685598 NA 106360532 NA 99988292 NA 75410310 NA 82750449 NA 92454136 NA
NA reverse_only_surviving 5304558 NA 2602136 NA 2305245 NA 2489235 NA 2886942 NA 3030857 NA 2514970 NA 1909747 NA 2585392 NA 1995636 NA 1527656 NA 1639373 NA
STAR Alignment uniquely_mapped_percent 81.73 NA 82.35 NA 81.67 NA 83.1 NA 83.43 NA 84.55 NA 82.48 NA 86.82 NA 80.75 NA 80.27 NA 80.81 NA 81.62 NA
NA num_splices 8669080 NA 8000968 NA 10613774 NA 11755566 NA 10864591 NA 8361644 NA 6866296 NA 7991948 NA 6183074 NA 4335214 NA 7285454 NA 8581902 NA
NA num_GCAG_splices 90012 NA 83679 NA 116149 NA 124779 NA 116447 NA 86763 NA 74268 NA 80836 NA 65893 NA 44765 NA 80392 NA 97422 NA
NA insertion_length 1.11 NA 1.1 NA 1.12 NA 1.17 NA 1.13 NA 1.11 NA 1.11 NA 1.13 NA 1.16 NA 1.15 NA 1.13 NA 1.15 NA
NA deletion_length 1.35 NA 1.33 NA 1.37 NA 1.26 NA 1.36 NA 1.35 NA 1.35 NA 1.4 NA 1.2 NA 1.22 NA 1.3 NA 1.25 NA
NA unmapped_tooshort_percent 4.08 NA 4.21 NA 3.42 NA 4.09 NA 3.51 NA 3.87 NA 3.66 NA 3.45 NA 5.83 NA 6.42 NA 5.61 NA 4.53 NA
NA avg_mapped_read_length 100.73 NA 101.11 NA 101.1 NA 101.14 NA 100.6 NA 100.51 NA 100.51 NA 100.81 NA 100.41 NA 100.33 NA 101.18 NA 101.22 NA
NA deletion_rate 0 NA 0 NA 0 NA 0 NA 0.01 NA 0.01 NA 0.01 NA 0.01 NA 0.01 NA 0.01 NA 0.01 NA 0.01 NA
NA mismatch_rate 0.23 NA 0.29 NA 0.29 NA 0.17 NA 0.17 NA 0.22 NA 0.22 NA 0.17 NA 0.26 NA 0.28 NA 0.16 NA 0.15 NA
NA avg_input_read_length 101 NA 101 NA 101 NA 101 NA 100 NA 100 NA 100 NA 101 NA 100 NA 100 NA 101 NA 101 NA
NA num_ATAC_splices 8039 NA 7497 NA 10782 NA 11677 NA 9722 NA 7455 NA 6186 NA 7544 NA 5492 NA 3832 NA 6602 NA 7593 NA
NA num_annotated_splices 8581504 NA 7919864 NA 10513157 NA 11643271 NA 10755797 NA 8273449 NA 6799294 NA 7913548 NA 6103902 NA 4260450 NA 7202217 NA 8487777 NA
NA num_GTAG_splices 8571029 NA 7909792 NA 10486843 NA 11619110 NA 10738422 NA 8267426 NA 6785842 NA 7903568 NA 6111689 NA 4286617 NA 7198460 NA 8476887 NA
NA uniquely_mapped 83997598 NA 72886249 NA 63080382 NA 78265974 NA 90382084 NA 78565331 NA 62830380 NA 82902461 NA 56450144 NA 43685904 NA 59617353 NA 67556146 NA
NA multimapped_toomany 1852495 NA 1405409 NA 1242903 NA 1061446 NA 1736807 NA 1362362 NA 1374461 NA 846193 NA 1009216 NA 955877 NA 703236 NA 890704 NA
NA unmapped_mismatches 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA unmapped_mismatches_percent 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA total_reads 102779602 NA 88509413 NA 77240029 NA 94179097 NA 108337271 NA 92924182 NA 76172416 NA 95488961 NA 69906417 NA 54427039 NA 73771492 NA 82764590 NA
NA unmapped_other 432340 NA 370957 NA 254880 NA 377196 NA 476580 NA 454868 NA 335571 NA 468726 NA 370629 NA 359051 NA 317451 NA 405340 NA
NA insertion_rate 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA unmapped_other_percent 0.42 NA 0.42 NA 0.33 NA 0.4 NA 0.44 NA 0.49 NA 0.44 NA 0.49 NA 0.53 NA 0.66 NA 0.43 NA 0.49 NA
NA multimapped_percent 11.96 NA 11.44 NA 12.97 NA 11.27 NA 11.02 NA 9.63 NA 11.61 NA 8.35 NA 11.44 NA 10.9 NA 12.19 NA 12.28 NA
NA multimapped 12297290 NA 10128397 NA 10020381 NA 10617654 NA 11939994 NA 8949092 NA 8840667 NA 7971370 NA 7999509 NA 5933625 NA 8991826 NA 10165077 NA
NA num_noncanonical_splices 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA unmapped_tooshort 4199879 NA 3718401 NA 2641483 NA 3856827 NA 3801806 NA 3592529 NA 2791337 NA 3300211 NA 4076919 NA 3492582 NA 4141626 NA 3747323 NA
NA multimapped_toomany_percent 1.8 NA 1.59 NA 1.61 NA 1.13 NA 1.6 NA 1.47 NA 1.8 NA 0.89 NA 1.44 NA 1.76 NA 0.95 NA 1.08 NA
PhiX Contamination overall_alignment_rate 0.11 NA 0.13 NA 0.18 NA 0.2 NA 0.12 NA 0.12 NA 0.1 NA 0.15 NA 0.18 NA 0.16 NA 0.2 NA 0.2 NA
NA paired_aligned_mate_multi_halved 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA paired_aligned_none 102674701 NA 88400163 NA 77108986 NA 93998943 NA 108212089 NA 92820803 NA 76103405 NA 95350795 NA 69784944 NA 54344793 NA 73629819 NA 82605166 NA
NA paired_aligned_mate_multi 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA paired_aligned_discord_one 5081 NA 5514 NA 6603 NA 9196 NA 6325 NA 5222 NA 3475 NA 7190 NA 6424 NA 4218 NA 7405 NA 8139 NA
NA paired_aligned_mate_one 2761 NA 2466 NA 2825 NA 3974 NA 2693 NA 2277 NA 1515 NA 2999 NA 2696 NA 1961 NA 2913 NA 3585 NA
NA paired_total 102779602 NA 88509413 NA 77240029 NA 94179097 NA 108337271 NA 92924182 NA 76172416 NA 95488961 NA 69906417 NA 54427039 NA 73771492 NA 82764590 NA
NA paired_aligned_mate_none_halved 102668239.5 NA 88393416 NA 77100970.5 NA 93987760 NA 108204417.5 NA 92814442.5 NA 76099172.5 NA 95342105.5 NA 69777172 NA 54339594.5 NA 73620957.5 NA 82595234.5 NA
NA paired_aligned_multi 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA paired_aligned_one 104901 NA 109250 NA 131043 NA 180154 NA 125182 NA 103379 NA 69011 NA 138166 NA 121473 NA 82246 NA 141673 NA 159424 NA
NA total_reads 102779602 NA 88509413 NA 77240029 NA 94179097 NA 108337271 NA 92924182 NA 76172416 NA 95488961 NA 69906417 NA 54427039 NA 73771492 NA 82764590 NA
NA paired_aligned_mate_one_halved 1380.5 NA 1233 NA 1412.5 NA 1987 NA 1346.5 NA 1138.5 NA 757.5 NA 1499.5 NA 1348 NA 980.5 NA 1456.5 NA 1792.5 NA
NA paired_aligned_mate_none 205336479 NA 176786832 NA 154201941 NA 187975520 NA 216408835 NA 185628885 NA 152198345 NA 190684211 NA 139554344 NA 108679189 NA 147241915 NA 165190469 NA
rRNA Contamination mapped_passed 4265172 NA 3371478 NA 1366577 NA 4031209 NA 5947728 NA 5093417 NA 4109793 NA 5042021 NA 7103552 NA 4879843 NA 5080844 NA 7013490 NA
NA duplicates_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA secondary_passed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA paired in sequencing_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA duplicates_passed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA read2_passed 102779602 NA 88509413 NA 77240029 NA 94179097 NA 108337271 NA 92924182 NA 76172416 NA 95488961 NA 69906417 NA 54427039 NA 73771492 NA 82764590 NA
NA read1_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA read1_passed 102779602 NA 88509413 NA 77240029 NA 94179097 NA 108337271 NA 92924182 NA 76172416 NA 95488961 NA 69906417 NA 54427039 NA 73771492 NA 82764590 NA
NA with mate mapped to a different chr_passed 4240 NA 3672 NA 692 NA 2742 NA 7850 NA 5990 NA 4782 NA 5840 NA 5348 NA 2434 NA 2250 NA 3646 NA
NA total_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA properly paired_passed_pct 2.02 NA 1.84 NA 0.86 NA 2.07 NA 2.7 NA 2.69 NA 2.65 NA 2.59 NA 4.98 NA 4.39 NA 3.36 NA 4.15 NA
NA singletons_passed 77493 NA 85229 NA 21005 NA 88202 NA 48177 NA 47851 NA 41299 NA 53429 NA 60929 NA 55645 NA 58038 NA 57600 NA
NA supplementary_passed 85 NA 65 NA 16 NA 23 NA 43 NA 44 NA 44 NA 72 NA 19 NA 18 NA 58 NA 50 NA
NA singletons_passed_pct 0.04 NA 0.05 NA 0.01 NA 0.05 NA 0.02 NA 0.03 NA 0.03 NA 0.03 NA 0.04 NA 0.05 NA 0.04 NA 0.03 NA
NA mapped_failed_pct nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA
NA mapped_passed_pct 2.07 NA 1.9 NA 0.88 NA 2.14 NA 2.75 NA 2.74 NA 2.7 NA 2.64 NA 5.08 NA 4.48 NA 3.44 NA 4.24 NA
NA supplementary_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA with itself and mate mapped_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA mapped_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA total_passed 205559289 NA 177018891 NA 154480074 NA 188358217 NA 216674585 NA 185848408 NA 152344876 NA 190977994 NA 139812853 NA 108854096 NA 147543042 NA 165529230 NA
NA properly paired_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA flagstat_total 205559289 NA 177018891 NA 154480074 NA 188358217 NA 216674585 NA 185848408 NA 152344876 NA 190977994 NA 139812853 NA 108854096 NA 147543042 NA 165529230 NA
NA with mate mapped to a different chr (mapQ >= 5)_passed 1026 NA 854 NA 118 NA 442 NA 1093 NA 940 NA 801 NA 1206 NA 964 NA 487 NA 823 NA 951 NA
NA properly paired_failed_pct nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA
NA with mate mapped to a different chr (mapQ >= 5)_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA with itself and mate mapped_passed 4187594 NA 3286184 NA 1345556 NA 3942984 NA 5899508 NA 5045522 NA 4068450 NA 4988520 NA 7042604 NA 4824180 NA 5022748 NA 6955840 NA
NA read2_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA with mate mapped to a different chr_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA properly paired_passed 4151444 NA 3256926 NA 1334312 NA 3903816 NA 5846848 NA 5001956 NA 4035832 NA 4945218 NA 6955726 NA 4775228 NA 4958480 NA 6869584 NA
NA paired in sequencing_passed 205559204 NA 177018826 NA 154480058 NA 188358194 NA 216674542 NA 185848364 NA 152344832 NA 190977922 NA 139812834 NA 108854078 NA 147542984 NA 165529180 NA
NA singletons_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA secondary_failed 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA singletons_failed_pct nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA nan NA
Feature Counts Unassigned_Ambiguity 5553184 NA 4245795 NA 4120211 NA 4873854 NA 5225069 NA 4143370 NA 3730746 NA 3744259 NA 3208052 NA 2154031 NA 3756137 NA 4158307 NA
NA Unassigned_MappingQuality 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA percent_assigned 38.92490502 NA 41.43212911 NA 45.0933741 NA 46.82554793 NA 45.79057473 NA 42.0896936 NA 43.45341423 NA 41.01237187 NA 46.15918368 NA 44.71355742 NA 45.65729376 NA 46.21716065 NA
NA Unassigned_Nonjunction 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA Unassigned_Duplicate 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA Unassigned_Chimera 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA Unassigned_Unmapped 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA Assigned 92630635 NA 83632332 NA 81770639 NA 99643999 NA 112432428 NA 87231124 NA 76094464 NA 86121025 NA 71761506 NA 53877618 NA 75554738 NA 86661345 NA
NA Unassigned_MultiMapping 69977472 NA 56081314 NA 55175498 NA 56266414 NA 64772012 NA 50119894 NA 49456582 NA 44182994 NA 42564996 NA 33123230 NA 46247590 NA 52396728 NA
NA Unassigned_Overlapping_Length 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA Unassigned_Secondary 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA Unassigned_FragmentLength 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA 0 NA
NA Unassigned_NoFeatures 69811377 NA 57894371 NA 40269914 NA 52014095 NA 63106671 NA 65756168 NA 45835550 NA 75939638 NA 37930730 NA 31340159 NA 39923831 NA 44292640 NA
NA Total 237972668 NA 201853812 NA 181336262 NA 212798362 NA 245536180 NA 207250556 NA 175117342 NA 209987916 NA 155465284 NA 120495038 NA 165482296 NA 187509020 NA

Differentially Expressed Sequence Identification

Programs Used

  • R
  • RStudio

Packages

  • readr (install.packages('readr'))
  • limma (BiocManager::install('limma'))
  • DESeq2 (BiocManager::install('DESeq2'))
  • dplyr (install.packages("dplyr"))
  • ggplot2 (install.packages("ggplot2"))
  • gplots (install.packages("gplots"))
  • Annotations (BiocManager::install('AnnotationDbi'))
  • org.Hs.eg.db (BiocManager::install('org.Hs.eg.db'))
    • This is for Human
  • org.Mm.eg.db (BiocManager::install('org.Mm.eg.db'))
    • This is for Mouse
  • ggrepel (install.packages("ggrepel"))
  • ReportingTools (BiocManager::install('ReportingTools'))
  • GO.db (BiocManager::install('GO.db'))
  • GOstats (BiocManager::install('GOstats'))
  • pathview (BiocManager::install('pathview'))
  • gage (BiocManager::install('gage'))
  • gageData (BiocManager::install('gageData'))
  • select (BiocManager::install('Select'))

With these, you most certainly will have to step through each and install extra things when you start calling the packages. Take it step by step to ensure that each dependency is installed.

Analyzing Reads Counts

When the count file is completed, we can import it into R and start working with it to determine differentially expressed genes. First we will import it into R

library(limma)
library(DESeq2)
library(dplyr)
library(readr)

countData = read_csv("readCounts.csv", skip = 1)

This gives us our dataframe from out featureCounts program, but if we look at the data we see that featureCounts added some extra information that characterizes each gene_id.

kable(head(countData)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "320px")
Geneid Chr Start End Strand Length SRR2121770Aligned.sortedByCoord.out.bam SRR2121771Aligned.sortedByCoord.out.bam SRR2121774Aligned.sortedByCoord.out.bam SRR2121775Aligned.sortedByCoord.out.bam SRR2121778Aligned.sortedByCoord.out.bam SRR2121779Aligned.sortedByCoord.out.bam SRR2121780Aligned.sortedByCoord.out.bam SRR2121781Aligned.sortedByCoord.out.bam SRR2121786Aligned.sortedByCoord.out.bam SRR2121787Aligned.sortedByCoord.out.bam SRR2121788Aligned.sortedByCoord.out.bam SRR2121789Aligned.sortedByCoord.out.bam
ENSMUSG00000102693.1 chr1 3073253 3074322 + 1070 0 0 0 0 0 0 0 0 4 2 0 0
ENSMUSG00000064842.1 chr1 3102016 3102125 + 110 0 0 0 0 0 0 0 0 0 0 0 0
ENSMUSG00000051951.5 chr1;chr1;chr1;chr1;chr1;chr1;chr1 3205901;3206523;3213439;3213609;3214482;3421702;3670552 3207317;3207317;3215632;3216344;3216968;3421901;3671498 -;-;-;-;-;-;- 6094 0 0 0 4 0 0 0 0 4 2 0 0
ENSMUSG00000102851.1 chr1 3252757 3253236 + 480 0 0 0 0 0 0 0 0 2 0 0 0
ENSMUSG00000103377.1 chr1 3365731 3368549 - 2819 0 0 0 2 0 0 0 0 6 14 0 0
ENSMUSG00000104017.1 chr1 3375556 3377788 - 2233 0 0 0 0 0 0 0 0 0 0 0 0

We also need to set out row names to the gene_id. We will do some data frame manipulation and then look at the data again.

countData = as.data.frame(countData)
rownames(countData) = countData$Geneid
countData = countData[,-c(1:6)]
kable(head(countData)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
SRR2121770Aligned.sortedByCoord.out.bam SRR2121771Aligned.sortedByCoord.out.bam SRR2121774Aligned.sortedByCoord.out.bam SRR2121775Aligned.sortedByCoord.out.bam SRR2121778Aligned.sortedByCoord.out.bam SRR2121779Aligned.sortedByCoord.out.bam SRR2121780Aligned.sortedByCoord.out.bam SRR2121781Aligned.sortedByCoord.out.bam SRR2121786Aligned.sortedByCoord.out.bam SRR2121787Aligned.sortedByCoord.out.bam SRR2121788Aligned.sortedByCoord.out.bam SRR2121789Aligned.sortedByCoord.out.bam
ENSMUSG00000102693.1 0 0 0 0 0 0 0 0 4 2 0 0
ENSMUSG00000064842.1 0 0 0 0 0 0 0 0 0 0 0 0
ENSMUSG00000051951.5 0 0 0 4 0 0 0 0 4 2 0 0
ENSMUSG00000102851.1 0 0 0 0 0 0 0 0 2 0 0 0
ENSMUSG00000103377.1 0 0 0 2 0 0 0 0 6 14 0 0
ENSMUSG00000104017.1 0 0 0 0 0 0 0 0 0 0 0 0

Quick Data Exploration

dim(countData)
## [1] 55487    12
kable(summary(countData)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
SRR2121770Aligned.sortedByCoord.out.bam SRR2121771Aligned.sortedByCoord.out.bam SRR2121774Aligned.sortedByCoord.out.bam SRR2121775Aligned.sortedByCoord.out.bam SRR2121778Aligned.sortedByCoord.out.bam SRR2121779Aligned.sortedByCoord.out.bam SRR2121780Aligned.sortedByCoord.out.bam SRR2121781Aligned.sortedByCoord.out.bam SRR2121786Aligned.sortedByCoord.out.bam SRR2121787Aligned.sortedByCoord.out.bam SRR2121788Aligned.sortedByCoord.out.bam SRR2121789Aligned.sortedByCoord.out.bam
Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
Median : 0.0 Median : 0.0 Median : 0.0 Median : 0.0 Median : 0.0 Median : 0.0 Median : 0.0 Median : 0.0 Median : 3.0 Median : 4.0 Median : 0.0 Median : 1.0
Mean : 273.8 Mean : 235.7 Mean : 229.6 Mean : 269.6 Mean : 286.6 Mean : 257.7 Mean : 193.7 Mean : 268.2 Mean : 160.2 Mean : 119.4 Mean : 184.3 Mean : 211.1
3rd Qu.: 25.0 3rd Qu.: 20.0 3rd Qu.: 15.0 3rd Qu.: 24.0 3rd Qu.: 25.0 3rd Qu.: 24.0 3rd Qu.: 17.0 3rd Qu.: 26.0 3rd Qu.: 21.0 3rd Qu.: 21.0 3rd Qu.: 16.0 3rd Qu.: 18.0
Max. :1085454.0 Max. :834016.0 Max. :783768.0 Max. :1032589.0 Max. :1233905.0 Max. :1232421.0 Max. :805463.0 Max. :1210716.0 Max. :641911.0 Max. :516752.0 Max. :706363.0 Max. :756086.0

Let's also go ahead and change the names to describe out data a little better.

columns = c('Trp53m_mock_1', 'Trp53m_mock_2', 'Trp53m_4h7Gy_1', 'Trp53m_4h7Gy_2', 'Trp53p_mock_1', 'Trp53p_mock_2', 'Trp53p_mock_3', 'Trp53p_mock_4', 'Trp53p_4h7Gy_1', 'Trp53p_4h7Gy_2', 'Trp53p_4h7Gy_3', 'Trp53p_4h7Gy_4')
colnames(countData) = columns

Here we have to make sure that we convert the +/+ and -/- to characters. These characters

par(mar=c(8,4,4,1)+0.1)
barplot( colSums(countData)/1e6, col="green",las=3,main="Total read counts (millions)", ylab="Total read counts in millions")

hist(countData[,1], br=200, xlab="Number of Reads Counts per Feature", main="Histogram of Read Counts for Trp53-/- Mock")

We can see that our count data is highly skewed to the right. This is a great case for using log transformation!

logCountData = log2(1+countData)
par(mfrow = c(1, 2), mar=c(8,4,4,1))  # two columns
hist(logCountData[,1], main="Histogram of Log Read Counts", xlab="Log transformed counts")
boxplot(logCountData,las=3, main="Boxplot of Log Read Counts")

x <- logCountData
myColors = rainbow(dim(x)[2])
plot(density(x[,1]),col = myColors[1], lwd=2,
     xlab="Expresson values", ylab="Density", main= "Distribution of transformed data",
     ylim=c(0, max(density(x[,1])$y)+.02 ) )
  
for( i in 2:dim(x)[2] )
lines(density(x[,i]),col=myColors[i], lwd=2)
legend("topright", cex=1.1,colnames(x), lty=rep(1,dim(x)[2]), col=myColors )	

plot(logCountData[,1],logCountData[,2], xlab="Trp53-/- mock replication 1", ylab="Trp53-/- mock replication 2")

Filtering, Normalization, and Trasformation using DESeq2

We have to make the experiment design into a small dataframe so we can tell DESeq how we want to analyze the data. Here will will make a small table that has the rep names that we changed the column names to previously, and then a column for which columns are Trp53+/+ or Trp53-/-, and which columns were control mice and which columns were treated with ionizing radiation.

detectGroups <- function (x){  # x are col names
  tem <- gsub("[0-9]*$","",x) # Remove all numbers from end
  #tem = gsub("_Rep|_rep|_REP","",tem)
  tem <- gsub("_$","",tem); # remove "_" from end
  tem <- gsub("_Rep$","",tem); # remove "_Rep" from end
  tem <- gsub("_rep$","",tem); # remove "_rep" from end
  tem <- gsub("_REP$","",tem)  # remove "_REP" from end
  return( tem )
}

groups = as.character ( detectGroups( colnames( countData ) ) )
groups
##  [1] "Trp53m_mock"  "Trp53m_mock"  "Trp53m_4h7Gy" "Trp53m_4h7Gy" "Trp53p_mock" 
##  [6] "Trp53p_mock"  "Trp53p_mock"  "Trp53p_mock"  "Trp53p_4h7Gy" "Trp53p_4h7Gy"
## [11] "Trp53p_4h7Gy" "Trp53p_4h7Gy"
p53 = c("m", "m", "m", "m",
        "p", "p", "p", "p", "p", "p", "p", "p")
treatment = c("control", "control", "IR", "IR", "control", "control", "control", "control",
              "IR", "IR", "IR", "IR")
colData = cbind(colnames(countData), p53 )
colData
##                        p53
##  [1,] "Trp53m_mock_1"  "m"
##  [2,] "Trp53m_mock_2"  "m"
##  [3,] "Trp53m_4h7Gy_1" "m"
##  [4,] "Trp53m_4h7Gy_2" "m"
##  [5,] "Trp53p_mock_1"  "p"
##  [6,] "Trp53p_mock_2"  "p"
##  [7,] "Trp53p_mock_3"  "p"
##  [8,] "Trp53p_mock_4"  "p"
##  [9,] "Trp53p_4h7Gy_1" "p"
## [10,] "Trp53p_4h7Gy_2" "p"
## [11,] "Trp53p_4h7Gy_3" "p"
## [12,] "Trp53p_4h7Gy_4" "p"
colData = as.data.frame(cbind(colnames(countData), p53, treatment))
colData
##                V1 p53 treatment
## 1   Trp53m_mock_1   m   control
## 2   Trp53m_mock_2   m   control
## 3  Trp53m_4h7Gy_1   m        IR
## 4  Trp53m_4h7Gy_2   m        IR
## 5   Trp53p_mock_1   p   control
## 6   Trp53p_mock_2   p   control
## 7   Trp53p_mock_3   p   control
## 8   Trp53p_mock_4   p   control
## 9  Trp53p_4h7Gy_1   p        IR
## 10 Trp53p_4h7Gy_2   p        IR
## 11 Trp53p_4h7Gy_3   p        IR
## 12 Trp53p_4h7Gy_4   p        IR
str(colData)
## 'data.frame':	12 obs. of  3 variables:
##  $ V1       : Factor w/ 12 levels "Trp53m_4h7Gy_1",..: 3 4 1 2 9 10 11 12 5 6 ...
##  $ p53      : Factor w/ 2 levels "m","p": 1 1 1 1 2 2 2 2 2 2 ...
##  $ treatment: Factor w/ 2 levels "control","IR": 1 1 2 2 1 1 1 1 2 2 ...

Creating a DESeq Dataset

dds = DESeqDataSetFromMatrix(countData=countData,
                             colData=colData,
                             design= ~ p53+treatment+p53*treatment)   # note that the study design is changed.
dds = DESeq(dds)  # main function
nrow(dds)
## [1] 55487

Filtering: we will only keep rows that have a sum count between all samples greater than 5. This will remove most of the genes that mostly have "0" counts.

dds <- dds[ rowSums(counts(dds)) > 5, ]
nrow(dds)
## [1] 34139

Regularized log transformation - used for clustering

rld <- rlog(dds, blind = FALSE)
kable(head(assay(rld), 6)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
Trp53m_mock_1 Trp53m_mock_2 Trp53m_4h7Gy_1 Trp53m_4h7Gy_2 Trp53p_mock_1 Trp53p_mock_2 Trp53p_mock_3 Trp53p_mock_4 Trp53p_4h7Gy_1 Trp53p_4h7Gy_2 Trp53p_4h7Gy_3 Trp53p_4h7Gy_4
ENSMUSG00000102693.1 -0.7873810 -0.7869647 -0.7864123 -0.7872251 -0.7873917 -0.7872901 -0.7863979 -0.7874606 -0.6313032 -0.7027829 -0.7854203 -0.7864622
ENSMUSG00000051951.5 -0.1689760 -0.1684855 -0.1678308 -0.0729011 -0.1689887 -0.1688691 -0.1678136 -0.1690696 -0.0298983 -0.0789877 -0.1675075 -0.1678901
ENSMUSG00000103377.1 0.9367497 0.9385253 0.9408618 1.0007595 0.9367037 0.9371385 0.9409228 0.9364087 1.1895736 1.5284052 0.9420021 0.9406517
ENSMUSG00000103201.1 -0.0694969 -0.0689413 -0.0682009 -0.0692892 -0.0695112 -0.0693758 -0.0681815 -0.0696029 0.0406074 0.1775249 -0.0678357 -0.0682680
ENSMUSG00000102592.1 0.6079270 0.6090832 0.6106138 0.6083601 0.6078971 0.6081797 0.6106538 0.6077056 0.9928379 0.8039315 0.6113644 0.6829781
ENSMUSG00000025900.12 1.7329730 1.7166717 1.6536060 1.6483113 1.6766519 1.6478854 1.6536997 1.7812828 2.0563917 2.2880207 1.6553593 1.7301108

Variance Stabilizing Transformation

Interactions cause a difference between the lfc betwen pooled data, e.g. p53+/+ (control and IR) and p53-/- (control and IR)

vsd <- vst(dds, blind = FALSE)
kable(head(assay(vsd), 6)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
Trp53m_mock_1 Trp53m_mock_2 Trp53m_4h7Gy_1 Trp53m_4h7Gy_2 Trp53p_mock_1 Trp53p_mock_2 Trp53p_mock_3 Trp53p_mock_4 Trp53p_4h7Gy_1 Trp53p_4h7Gy_2 Trp53p_4h7Gy_3 Trp53p_4h7Gy_4
ENSMUSG00000102693.1 6.433022 6.433022 6.433022 6.433022 6.433022 6.433022 6.433022 6.433022 6.784352 6.714627 6.433022 6.433022
ENSMUSG00000051951.5 6.433022 6.433022 6.433022 6.715020 6.433022 6.433022 6.433022 6.433022 6.784352 6.714627 6.433022 6.433022
ENSMUSG00000103377.1 6.433022 6.433022 6.433022 6.632583 6.433022 6.433022 6.433022 6.433022 6.862784 7.171182 6.433022 6.433022
ENSMUSG00000103201.1 6.433022 6.433022 6.433022 6.433022 6.433022 6.433022 6.433022 6.433022 6.737470 6.919246 6.433022 6.433022
ENSMUSG00000102592.1 6.433022 6.433022 6.433022 6.433022 6.433022 6.433022 6.433022 6.433022 7.013160 6.830644 6.433022 6.662579
ENSMUSG00000025900.12 6.669513 6.643045 6.433022 6.433022 6.569347 6.433022 6.433022 6.732883 7.013160 7.220955 6.433022 6.662579

For the log2 approach, we need to first estimate size factors to account for sequencing depth, and then specify normalized=TRUE. Sequencing depth correction is done automatically for the rlog and the vst.

Size Factor

dds <- estimateSizeFactors(dds)
kable(sizeFactors(dds)) %>%
  kable_styling() %>%
  scroll_box(width = "300px", height = "520px")
x
Trp53m_mock_1 1.2892566
Trp53m_mock_2 1.0903014
Trp53m_4h7Gy_1 0.8973654
Trp53m_4h7Gy_2 1.2078255
Trp53p_mock_1 1.2952327
Trp53p_mock_2 1.2406213
Trp53p_mock_3 0.8931001
Trp53p_mock_4 1.3347016
Trp53p_4h7Gy_1 0.7767888
Trp53p_4h7Gy_2 0.6056051
Trp53p_4h7Gy_3 0.8227361
Trp53p_4h7Gy_4 0.9123249

We will first look at the log transformed data

slog <- log2(counts(dds, normalized=TRUE)+1)
kable(head(slog)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
Trp53m_mock_1 Trp53m_mock_2 Trp53m_4h7Gy_1 Trp53m_4h7Gy_2 Trp53p_mock_1 Trp53p_mock_2 Trp53p_mock_3 Trp53p_mock_4 Trp53p_4h7Gy_1 Trp53p_4h7Gy_2 Trp53p_4h7Gy_3 Trp53p_4h7Gy_4
ENSMUSG00000102693.1 0.000000 0.000000 0 0.000000 0.0000000 0 0 0.000000 2.620447 2.105169 0 0.000000
ENSMUSG00000051951.5 0.000000 0.000000 0 2.108269 0.0000000 0 0 0.000000 2.620447 2.105169 0 0.000000
ENSMUSG00000103377.1 0.000000 0.000000 0 1.409184 0.0000000 0 0 0.000000 3.125008 4.592001 0 0.000000
ENSMUSG00000103201.1 0.000000 0.000000 0 0.000000 0.0000000 0 0 0.000000 2.281566 3.447241 0 0.000000
ENSMUSG00000102592.1 0.000000 0.000000 0 0.000000 0.0000000 0 0 0.000000 3.922280 2.926941 0 1.674552
ENSMUSG00000025900.12 1.734188 1.503021 0 0.000000 0.8254291 0 0 2.246759 3.922280 4.777149 0 1.674552
par(mfrow = c(1, 3))  # 3 columns
plot(slog[,1],slog[,2])
plot(assay(rld)[,1],assay(rld)[,2])
plot(assay(vsd)[,1],assay(vsd)[,2])

As the log transformation constant increases, the information of the data is lost.

par(mfrow = c(1, 3))  # 3 columns
slog <- log2(counts(dds, normalized=TRUE)+1)
plot(slog[,1],slog[,2])
slog <- log2(counts(dds, normalized=TRUE)+4)
plot(slog[,1],slog[,2], xlim=c(0,20))
slog <- log2(counts(dds, normalized=TRUE)+20)
plot(slog[,1],slog[,2], xlim=c(0,20))

library("dplyr")
library("ggplot2")

df <- bind_rows(
  as_data_frame(slog[,1:2]) %>%
         mutate(transformation = "log2(x + 1)"),
  as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"),
  as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"))
## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
colnames(df)[1:2] <- c("x", "y")  

ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
  coord_fixed() + facet_grid( . ~ transformation)

Exploratory Data Analysis

PCA plot

plotPCA(rld, intgroup = c("p53", "treatment")) + theme(aspect.ratio=1)

A prettier PCA plot created with GGPlot2

pca.object <- prcomp(t(assay(rld))) # PCA 
pcaData = as.data.frame(pca.object$x[,1:2]); 
pcaData = cbind(pcaData,detectGroups(colnames(assay(rld)) ))
colnames(pcaData) = c("PC1", "PC2", "Type")
percentVar=round(100*summary(pca.object)$importance[2,1:2],0)
#plot
p=ggplot(pcaData, aes(PC1, PC2, color=Type, shape = Type)) + geom_point(size=5) 
p=p+xlab(paste0("PC1: ",percentVar[1],"% variance")) 
p=p+ylab(paste0("PC2: ",percentVar[2],"% variance")) 
p=p+ggtitle("Principal component analysis (PCA)")+coord_fixed(ratio=1.0)+ 
    theme(plot.title = element_text(size = 16,hjust = 0.5)) + theme(aspect.ratio=1) +
    theme(axis.text.x = element_text( size = 16),
    axis.text.y = element_text( size = 16),
    axis.title.x = element_text( size = 16),
    axis.title.y = element_text( size = 16) ) +
  theme(legend.text=element_text(size=16))
print(p)

Multidimensional Scaling Plot

dist2 <- function(x, ...)   # distance function = 1-PCC (Pearson's correlation coefficient)
  as.dist(1-cor(t(x), method="pearson"))

fit = cmdscale( dist2(t(assay(rld))) , eig=T, k=2)
mdsData <- as.data.frame(fit$points[,1:2]); 
mdsData <- cbind(mdsData,detectGroups(colnames(assay(rld))) )
colnames(mdsData) = c("x1", "x2", "Type")
	
p<-ggplot(mdsData, aes(x1, x2, color=Type, shape = Type)) + geom_point(size=5) 
p=p+xlab("Dimension 1") 
p=p+ylab("Dimension 2") 
p=p+ggtitle("Multidimensional scaling (MDS)")+ coord_fixed(ratio=1.)+ 
     theme(plot.title = element_text(hjust = 0.5)) + theme(aspect.ratio=1) +
	 	 theme(axis.text.x = element_text( size = 16),
        axis.text.y = element_text( size = 16),
        axis.title.x = element_text( size = 16),
        axis.title.y = element_text( size = 16) ) +
	   theme(legend.text=element_text(size=16))
print(p)

Creating a heatmap

library(gplots)

hclust2 <- function(x, method="average", ...)  # average linkage in hierarchical clustering
  hclust(x, method=method, ...)

n=100 # number of top genes by standard deviation

x = assay(rld)
if(n>dim(x)[1]) n = dim(x)[1] # max	as data

x = x[order(apply(x,1,sd),decreasing=TRUE),]  # sort genes by standard deviation

x = x[1:n,]   # only keep the n genes

# this will cutoff very large values, which could skew the color 
x=as.matrix(x[1:n,])-apply(x[1:n,],1,mean)
cutoff = median(unlist(x)) + 4*sd (unlist(x)) 
x[x>cutoff] <- cutoff
cutoff = median(unlist(x)) - 4*sd (unlist(x)) 
x[x< cutoff] <- cutoff
	
groups = detectGroups(colnames(x) )
groups.colors = rainbow(length(unique(groups) ) )


	lmat = rbind(c(5,4),c(0,1),c(3,2))
	lwid = c(1.5,4)
	lhei = c(1,.2,4)


heatmap.2(x, distfun = dist2,hclustfun=hclust2,
	 col=greenred(75), density.info="none", trace="none", scale="none", keysize=.5
	,key=T, symkey=F
	,ColSideColors=groups.colors[ as.factor(groups)]
	,margins=c(8,12)
	,cexRow=1
	,srtCol=45
	,cexCol=1.  # size of font for sample names
	,lmat = lmat, lwid = lwid, lhei = lhei
	)

Identification of Differentially Expressed Genes

dds <- DESeq(dds)
res <- results(dds)

kable(head(res)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
baseMean log2FoldChange lfcSE stat pvalue padj
ENSMUSG00000102693.1 0.7043239 3.5623773 6.556337 0.5433487 0.5868897 NA
ENSMUSG00000051951.5 0.9803019 0.3244198 6.442842 0.0503535 0.9598407 NA
ENSMUSG00000103377.1 2.7081125 3.1757639 5.419836 0.5859520 0.5579078 NA
ENSMUSG00000103201.1 1.1474583 4.2647949 6.546806 0.6514314 0.5147681 NA
ENSMUSG00000102592.1 1.9131690 5.0084508 5.437877 0.9210305 0.3570345 NA
ENSMUSG00000025900.12 4.2877013 6.7470978 3.112343 2.1678514 0.0301700 0.1442115

DESeq2 uses the Benjamini-Hochberg (BH) adjustment (Benjamini and Hochberg 1995) as implemented in the base R p.adjust function

res <- results(dds, alpha = 0.5, lfcThreshold=0.01)
summary(res)
## 
## out of 34139 with nonzero total read count
## adjusted p-value < 0.5
## LFC > 0.01 (up)    : 5794, 17%
## LFC < -0.01 (down) : 4094, 12%
## outliers [1]       : 3, 0.0088%
## low counts [2]     : 8605, 25%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Now lets sort genes by fold change

res <- res[order(abs( res$log2FoldChange), decreasing=TRUE),]
kable(head(res)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
baseMean log2FoldChange lfcSE stat pvalue padj
ENSMUSG00000107277.1 178.134530 12.82857 1.906777 6.722639 0.0000000 0.0000000
ENSMUSG00000036281.13 17.898231 11.90953 1.997989 5.955753 0.0000000 0.0000002
ENSMUSG00000034189.5 21.363195 11.19200 2.064077 5.417433 0.0000001 0.0000035
ENSMUSG00000111241.1 26.215073 10.91199 1.885793 5.781118 0.0000000 0.0000005
ENSMUSG00000027479.14 4.844616 10.82957 3.264449 3.314364 0.0009185 0.0130936
ENSMUSG00000014932.15 77.883535 10.65416 1.686891 6.309929 0.0000000 0.0000000

MA Plot

Show the significant genes. The lower the average read counts for all samples and the higher the variation between the samples, the less significant those genes are.

DESeq2::plotMA(res,  ylim = c(-5, 5))

Volcano plot

library(dplyr)
res1 = as.data.frame(res)
# add a new column using the mutate function in dplyr
res1 = mutate(res1, sig=ifelse(res1$padj<0.05, "FDR<0.05", "Not Sig"))
res1[which(abs(res1$log2FoldChange)<0.5),'sig'] <- "Not Sig"


p = ggplot(res1, aes(log2FoldChange, -log10(padj))) +
  geom_point(aes(col=sig)) +
  scale_color_manual(values=c("red", "black"))
p
## Warning: Removed 8608 rows containing missing values (geom_point).

Gene Annotations

Plot counts of top gene

topGene <- rownames(res)[1]
plotCounts(dds, gene = topGene, intgroup=c("p53", "treatment"))

Here we see an interesting point or our normalized counts under the Trp53p_4h7Gy group that seems to be extremely high, while the other 3 replicates are around 0.5. I cannot get this portion to work, however. I get an error stating that None of the keys entered are valid keys for 'SYMBOL'.

Let's look at the keys that we have to work with for our res file from DESeq2

head(row.names(res))
## [1] "ENSMUSG00000107277.1"  "ENSMUSG00000036281.13" "ENSMUSG00000034189.5" 
## [4] "ENSMUSG00000111241.1"  "ENSMUSG00000027479.14" "ENSMUSG00000014932.15"

Now we need to find the same key in the Mm database.

library(AnnotationDbi)
library(org.Mm.eg.db)

columns(org.Mm.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"    
## [16] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"     
## [21] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"
#key = gsub("\\..*","", row.names(res))
res$symbol <- gsub("\\..*","", row.names(res))
#res$symbol <- gsub(" ","",row.names(res)) 
message("Ensembl IDs")
## Ensembl IDs
key.en = keys(org.Mm.eg.db, keytype="ENSEMBL")
head(key.en)
## [1] "ENSMUSG00000030359" "ENSMUSG00000020804" "ENSMUSG00000025375"
## [4] "ENSMUSG00000015243" "ENSMUSG00000028125" "ENSMUSG00000026944"
cat("\n\n")
message("SYMBOL names")
## SYMBOL names
key.sy = keys(org.Mm.eg.db, keytype="SYMBOL")
head(key.sy)
## [1] "Pzp"   "Aanat" "Aatk"  "Abca1" "Abca4" "Abca2"

These are ENSEMBL symbols, so we need to designate that when looking for the genes that we have.

res$ensembl <- gsub("\\..*","", row.names(res))
res$entrez <- mapIds(org.Mm.eg.db,
                     keys= res$ensembl,
                     column="ENTREZID",
                     keytype="ENSEMBL", #Out ID is ENSMBL
                     multiVals="first")
## 'select()' returned 1:many mapping between keys and columns
res$symbol <- mapIds(org.Mm.eg.db,
                     keys= res$ensembl,
                     column="SYMBOL",
                     keytype="ENSEMBL", #Out ID is ENSMBL
                     multiVals="first")
## 'select()' returned 1:many mapping between keys and columns
write.csv(res, file = "results.csv")
kable(head(res)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
baseMean log2FoldChange lfcSE stat pvalue padj symbol ensembl entrez
ENSMUSG00000107277.1 178.134530 12.82857 1.906777 6.722639 0.0000000 0.0000000 NA ENSMUSG00000107277 NA
ENSMUSG00000036281.13 17.898231 11.90953 1.997989 5.955753 0.0000000 0.0000002 Snapc4 ENSMUSG00000036281 227644
ENSMUSG00000034189.5 21.363195 11.19200 2.064077 5.417433 0.0000001 0.0000035 Hsdl1 ENSMUSG00000034189 72552
ENSMUSG00000111241.1 26.215073 10.91199 1.885793 5.781118 0.0000000 0.0000005 NA ENSMUSG00000111241 NA
ENSMUSG00000027479.14 4.844616 10.82957 3.264449 3.314364 0.0009185 0.0130936 Mapre1 ENSMUSG00000027479 13589
ENSMUSG00000014932.15 77.883535 10.65416 1.686891 6.309929 0.0000000 0.0000000 Yes1 ENSMUSG00000014932 22612

Let's make a file with just the genes with an adjusted p-value < 0.5

resSig = as.data.frame(subset(res,padj<0.5) )
resSig = resSig[order(resSig$log2FoldChange,decreasing=TRUE),]
head(resSig)
##                         baseMean log2FoldChange    lfcSE     stat       pvalue
## ENSMUSG00000107277.1  178.134530       12.82857 1.906777 6.722639 1.784620e-11
## ENSMUSG00000036281.13  17.898231       11.90953 1.997989 5.955753 2.588774e-09
## ENSMUSG00000034189.5   21.363195       11.19200 2.064077 5.417433 6.046073e-08
## ENSMUSG00000111241.1   26.215073       10.91199 1.885793 5.781118 7.420598e-09
## ENSMUSG00000027479.14   4.844616       10.82957 3.264449 3.314364 9.185180e-04
## ENSMUSG00000014932.15  77.883535       10.65416 1.686891 6.309929 2.791627e-10
##                               padj symbol            ensembl entrez
## ENSMUSG00000107277.1  2.190535e-09   <NA> ENSMUSG00000107277   <NA>
## ENSMUSG00000036281.13 2.052608e-07 Snapc4 ENSMUSG00000036281 227644
## ENSMUSG00000034189.5  3.540420e-06  Hsdl1 ENSMUSG00000034189  72552
## ENSMUSG00000111241.1  5.248069e-07   <NA> ENSMUSG00000111241   <NA>
## ENSMUSG00000027479.14 1.309363e-02 Mapre1 ENSMUSG00000027479  13589
## ENSMUSG00000014932.15 2.795021e-08   Yes1 ENSMUSG00000014932  22612
write.csv(resSig,"SigGenes.csv")

Here is a volcano plot that shows the symbol that we created at each point.

library(dplyr)
res1 = as.data.frame(res)
# add a new column using the mutate function in dplyr
res1 = mutate(res1, sig=ifelse(res1$padj<0.5, "FDR<0.05", "Not Sig"))
res1[which(abs(res1$log2FoldChange)<1),'sig'] <- "Not Sig"
p = ggplot(res1, aes(log2FoldChange, -log10(pvalue))) +
  geom_point(aes(col=sig)) +
  scale_color_manual(values=c("red", "black"))
p+geom_text(data=filter(res1, padj<1e-50), aes(label=symbol))
## Warning: Removed 7357 rows containing missing values (geom_point).

7. GO Enrichment analysis using GOstats

Here we will do a GO Enrichment analysis for genes that have a decreased fold-change of 5 or more

library(GO.db)
library(GOstats)
selectedGenes = unique(resSig[resSig$log2FoldChange>5,'entrez'])  # upregulated genes
universeGenes =  unique( mapIds(org.Mm.eg.db,
                     keys= res$ensembl,
                     column="ENTREZID",
                     keytype="ENSEMBL", #Out ID is ENSMBL
                     multiVals="first")
                    )


 hgCutoff <- 0.001
 params <- new("GOHyperGParams",
     geneIds=selectedGenes,
     universeGeneIds=universeGenes,
     annotation="org.Mm.eg.db",
     ontology="BP",
     pvalueCutoff=hgCutoff,
     conditional=FALSE,
     testDirection="over")

hgOver <- hyperGTest(params)
summary(hgOver)[1:10,]
##        GOBPID       Pvalue OddsRatio   ExpCount Count Size
## 1  GO:0019236 2.643566e-07  5.062236   4.423515    18   78
## 2  GO:0016049 1.730359e-05  2.028026  26.144107    49  461
## 3  GO:0048588 2.580339e-05  2.442482  13.951085    31  246
## 4  GO:1990138 3.035698e-05  2.725414  10.208111    25  180
## 5  GO:0050789 3.546769e-05  1.302786 562.863914   624 9925
## 6  GO:0001558 5.334949e-05  2.045100  22.174286    42  391
## 7  GO:0032879 5.962358e-05  1.395721 147.223650   191 2596
## 8  GO:0060560 1.073049e-04  2.275476  14.348068    30  253
## 9  GO:0061564 1.204488e-04  1.884894  26.711225    47  471
## 10 GO:0050794 1.250445e-04  1.272536 527.135525   584 9295
##                                              Term
## 1                           response to pheromone
## 2                                     cell growth
## 3                       developmental cell growth
## 4                     neuron projection extension
## 5                regulation of biological process
## 6                       regulation of cell growth
## 7                      regulation of localization
## 8  developmental growth involved in morphogenesis
## 9                                axon development
## 10                 regulation of cellular process
summary(hgOver)[1:10,c("GOBPID","Pvalue","Term")]
##        GOBPID       Pvalue                                           Term
## 1  GO:0019236 2.643566e-07                          response to pheromone
## 2  GO:0016049 1.730359e-05                                    cell growth
## 3  GO:0048588 2.580339e-05                      developmental cell growth
## 4  GO:1990138 3.035698e-05                    neuron projection extension
## 5  GO:0050789 3.546769e-05               regulation of biological process
## 6  GO:0001558 5.334949e-05                      regulation of cell growth
## 7  GO:0032879 5.962358e-05                     regulation of localization
## 8  GO:0060560 1.073049e-04 developmental growth involved in morphogenesis
## 9  GO:0061564 1.204488e-04                               axon development
## 10 GO:0050794 1.250445e-04                 regulation of cellular process
params1 <- params
ontology(params1) <- "CC"
hgOver <- hyperGTest(params1)
summary(hgOver)[1:10,c("GOCCID","Pvalue","Term")]
##        GOCCID       Pvalue                                        Term
## 1  GO:0031226 2.194852e-10      intrinsic component of plasma membrane
## 2  GO:0005887 2.497976e-10       integral component of plasma membrane
## 3  GO:0030424 3.300807e-07                                        axon
## 4  GO:0098978 7.089583e-07                       glutamatergic synapse
## 5  GO:0033267 1.725126e-06                                   axon part
## 6  GO:0044459 1.872288e-06                        plasma membrane part
## 7  GO:0099055 3.042517e-06 integral component of postsynaptic membrane
## 8  GO:0099699 4.152767e-06     integral component of synaptic membrane
## 9  GO:0045211 6.111879e-06                       postsynaptic membrane
## 10 GO:0016021 7.169574e-06              integral component of membrane
params1 <- params
ontology(params1) <- "MF"
hgOver <- hyperGTest(params1)
summary(hgOver)[1:10,c("GOMFID","Pvalue","Term")]
##        GOMFID       Pvalue                                         Term
## 1  GO:0016503 4.685420e-07                  pheromone receptor activity
## 2  GO:0005550 1.234794e-05                            pheromone binding
## 3  GO:0005549 2.293309e-05                              odorant binding
## 4  GO:0004888 3.280256e-05    transmembrane signaling receptor activity
## 5  GO:0038023 5.222020e-05                  signaling receptor activity
## 6  GO:0060089 8.100909e-05                molecular transducer activity
## 7  GO:0005488 6.931421e-04                                      binding
## 8  GO:0015267 8.237763e-04                             channel activity
## 9  GO:0022803 8.237763e-04   passive transmembrane transporter activity
## 10 GO:0046873 8.631819e-04 metal ion transmembrane transporter activity

GO Enrichment analysis of downregulated genes

Next we will have a look at the genes that are upregulated by a fold-change of 5 or greater.

selectedGenes = unique(resSig[resSig$log2FoldChange<5,'entrez'])  # upregulated genes

 params <- new("GOHyperGParams",
     geneIds=selectedGenes,
     universeGeneIds=universeGenes,
     annotation="org.Mm.eg.db",
     ontology="BP",
     pvalueCutoff=hgCutoff,
     conditional=FALSE,
     testDirection="over")

hgOver <- hyperGTest(params)
summary(hgOver)[1:10,c("GOBPID","Pvalue","Term")]
##        GOBPID       Pvalue                                          Term
## 1  GO:0016043 9.317561e-12               cellular component organization
## 2  GO:0008152 1.757957e-11                             metabolic process
## 3  GO:0071840 4.299006e-11 cellular component organization or biogenesis
## 4  GO:0044237 7.869800e-11                    cellular metabolic process
## 5  GO:0044238 1.013489e-09                     primary metabolic process
## 6  GO:0006807 1.956109e-09           nitrogen compound metabolic process
## 7  GO:0071704 2.363963e-09           organic substance metabolic process
## 8  GO:0006996 2.779804e-09                        organelle organization
## 9  GO:0033036 1.835538e-08                    macromolecule localization
## 10 GO:0051179 2.179269e-08                                  localization

8. Pathway analysis using expression data

# bioconductor packages
# source("https://bioconductor.org/biocLite.R");
# biocLite(c("pathview","gage","gageData"))
library(pathview) 
library(gage) 

Prepare data

foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)
##     <NA>   227644    72552     <NA>    13589    22612 
## 12.82857 11.90953 11.19200 10.91199 10.82957 10.65416
library(gageData)
data(go.sets.mm)
data(go.subs.mm)
gobpsets = go.sets.mm[go.subs.mm$BP]
gobpres = gage(foldchanges, gsets=gobpsets, same.dir=TRUE)
#lapply(gobpres, head)
message("Greater")
## Greater
kable(head(gobpres$greater)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
p.geomean stat.mean p.val q.val set.size exp1
GO:0019236 response to pheromone 0.0000000 7.637046 0.0000000 0.0000000 76 0.0000000
GO:0007420 brain development 0.0012778 3.026272 0.0012778 0.7925518 400 0.0012778
GO:0045664 regulation of neuron differentiation 0.0021154 2.869812 0.0021154 0.7925518 355 0.0021154
GO:0007156 homophilic cell adhesion 0.0023079 2.879945 0.0023079 0.7925518 70 0.0023079
GO:0001944 vasculature development 0.0030643 2.746850 0.0030643 0.7925518 487 0.0030643
GO:0033555 multicellular organismal response to stress 0.0038427 2.719970 0.0038427 0.7925518 53 0.0038427
message("Less")
## Less
kable(head(gobpres$less)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
p.geomean stat.mean p.val q.val set.size exp1
GO:0007283 spermatogenesis 0.0094832 -2.352919 0.0094832 0.992104 292 0.0094832
GO:0048232 male gamete generation 0.0100757 -2.330049 0.0100757 0.992104 293 0.0100757
GO:0007266 Rho protein signal transduction 0.0180131 -2.109476 0.0180131 0.992104 117 0.0180131
GO:0007094 mitotic cell cycle spindle assembly checkpoint 0.0263331 -2.065183 0.0263331 0.992104 14 0.0263331
GO:0016441 posttranscriptional gene silencing 0.0274536 -1.957280 0.0274536 0.992104 35 0.0274536
GO:0007276 gamete generation 0.0283921 -1.907876 0.0283921 0.992104 388 0.0283921
message("Stats")
## Stats
kable(head(gobpres$stats)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
stat.mean exp1
GO:0019236 response to pheromone 7.637046 7.637046
GO:0007420 brain development 3.026272 3.026272
GO:0045664 regulation of neuron differentiation 2.869812 2.869812
GO:0007156 homophilic cell adhesion 2.879945 2.879945
GO:0001944 vasculature development 2.746850 2.746850
GO:0033555 multicellular organismal response to stress 2.719970 2.719970

KEGG pathways

library(gageData)
data(kegg.sets.mm)
data(sigmet.idx.mm)
kegg.sets.mm = kegg.sets.mm[sigmet.idx.mm]
#head(kegg.sets.mm, 3)

message("Greater")
## Greater
kable(head(kegg.sets.mm$greater)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
message("Less")
## Less
kable(head(kegg.sets.mm$less)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
message("Stats")
## Stats
kable(head(kegg.sets.mm$stats)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
# Get the results
keggres = gage(foldchanges, gsets=kegg.sets.mm, same.dir=TRUE)

# Look at both up (greater), down (less), and statatistics.
#lapply(keggres, head, n=10)
message("Greater")
## Greater
kable(head(keggres$greater)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
p.geomean stat.mean p.val q.val set.size exp1
mmu04360 Axon guidance 0.0285970 1.910570 0.0285970 0.8131848 128 0.0285970
mmu03410 Base excision repair 0.0355168 1.842982 0.0355168 0.8131848 28 0.0355168
mmu00591 Linoleic acid metabolism 0.0537749 1.628039 0.0537749 0.8131848 40 0.0537749
mmu04974 Protein digestion and absorption 0.0596166 1.567001 0.0596166 0.8131848 76 0.0596166
mmu01040 Biosynthesis of unsaturated fatty acids 0.0662476 1.531834 0.0662476 0.8131848 24 0.0662476
mmu03022 Basal transcription factors 0.0674085 1.514546 0.0674085 0.8131848 33 0.0674085
message("Less")
## Less
kable(head(keggres$less)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
p.geomean stat.mean p.val q.val set.size exp1
mmu03020 RNA polymerase 0.0095316 -2.445444 0.0095316 0.9111323 23 0.0095316
mmu03450 Non-homologous end-joining 0.0341877 -1.999438 0.0341877 0.9111323 11 0.0341877
mmu00190 Oxidative phosphorylation 0.0669600 -1.505167 0.0669600 0.9111323 99 0.0669600
mmu03420 Nucleotide excision repair 0.0747360 -1.456591 0.0747360 0.9111323 39 0.0747360
mmu00563 Glycosylphosphatidylinositol(GPI)-anchor biosynthesis 0.0757982 -1.464592 0.0757982 0.9111323 25 0.0757982
mmu00240 Pyrimidine metabolism 0.1116636 -1.222140 0.1116636 0.9111323 87 0.1116636
message("Stats")
## Stats
kable(head(keggres$stats)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
stat.mean exp1
mmu04360 Axon guidance 1.910570 1.910570
mmu03410 Base excision repair 1.842982 1.842982
mmu00591 Linoleic acid metabolism 1.628039 1.628039
mmu04974 Protein digestion and absorption 1.567001 1.567001
mmu01040 Biosynthesis of unsaturated fatty acids 1.531834 1.531834
mmu03022 Basal transcription factors 1.514546 1.514546
# Get the pathways
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>% 
  tbl_df() %>% 
  filter(row_number()<=5) %>% 
  .$id %>% 
  as.character()
keggrespathways
## [1] "mmu03020 RNA polymerase"                                       
## [2] "mmu03450 Non-homologous end-joining"                           
## [3] "mmu00190 Oxidative phosphorylation"                            
## [4] "mmu03420 Nucleotide excision repair"                           
## [5] "mmu00563 Glycosylphosphatidylinositol(GPI)-anchor biosynthesis"
# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids
## [1] "mmu03020" "mmu03450" "mmu00190" "mmu03420" "mmu00563"
# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mmu", new.signature=FALSE)

# plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mmu"))

Pathway and regulation of genes for Oxidative phosphorylation.

Oxidative Phosphorylation

Pathway and regulation of genes for Glycosylphosphatidylinositol(GPI)-anchor biosynthesis.

Glycosylphosphatidylinositol(GPI)-anchor biosynthesis

Pathway and regulation of genes for RNA polymerase.

RNA polymerase

Pathway and regulation of genes for Nucleotide excision repair.

Nucleotide excision repair

Pathway and regulation of genes for Non-homologous end-joining.

Non-homologous end-joining

#De novo Assembly {#denovo-assembly}

AT THIS TIME, MINICONDA3 WAS INSTALLED SO FURTHER TOOLS ARE ALL INSTALLED UNDER THIS

Having the genome for RNA-sequencing analysis is very useful, but sometimes it is not available so de novo assembly is used. de novo assembly is using the reads from sequencing to create longer seqeunces called contigs (contiguous sequences). These sequences are then compared to a protein database to get an idea of what proteins are possibly present. The raw sequences are also mapped back to the contigs, treating the contigs as the "genome", much like we do when we have the genome from the host available.

Lets jump right into it. We are going to be using Trinity for the de novo assembly and analysis because there are many tools built into the Trinity tool that allow for the building of contigs, counting of sequences, and even DeSeq/edgeR analyses.

To install Trinity, all I did was ran conda install -c bioconda trinity. This installs all of the previously mentioned tools in one go. In addition to Trinity, BLAST (conda install -c bioconda blast) and RSEM (conda install -c bioconda rsem) were installed.

Trinity --seqType fq \
--left trimmedReads/SRR2121770_Trimmed_1P.fq.gz,trimmedReads/SRR2121771_Trimmed_1P.fq.gz,trimmedReads/SRR2121774_Trimmed_1P.fq.gz,trimmedReads/SRR2121775_Trimmed_1P.fq.gz,trimmedReads/SRR2121778_Trimmed_1P.fq.gz,trimmedReads/SRR2121779_Trimmed_1P.fq.gz,trimmedReads/SRR2121780_Trimmed_1P.fq.gz,trimmedReads/SRR2121781_Trimmed_1P.fq.gz,trimmedReads/SRR2121786_Trimmed_1P.fq.gz,trimmedReads/SRR2121787_Trimmed_1P.fq.gz,trimmedReads/SRR2121788_Trimmed_1P.fq.gz,trimmedReads/SRR2121789_Trimmed_1P.fq.gz \
--right trimmedReads/SRR2121770_Trimmed_2P.fq.gz,trimmedReads/SRR2121771_Trimmed_2P.fq.gz,trimmedReads/SRR2121774_Trimmed_2P.fq.gz,trimmedReads/SRR2121775_Trimmed_2P.fq.gz,trimmedReads/SRR2121778_Trimmed_2P.fq.gz,trimmedReads/SRR2121779_Trimmed_2P.fq.gz,trimmedReads/SRR2121780_Trimmed_2P.fq.gz,trimmedReads/SRR2121781_Trimmed_2P.fq.gz,trimmedReads/SRR2121786_Trimmed_2P.fq.gz,trimmedReads/SRR2121787_Trimmed_2P.fq.gz,trimmedReads/SRR2121788_Trimmed_2P.fq.gz,trimmedReads/SRR2121789_Trimmed_2P.fq.gz \
--CPU 80 --max_memory 2000G --min_contig_length 150
TrinityStats.pl trinity_out_dir/Trinity.fasta

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  602870
Total trinity transcripts:      730297
Percent GC: 45.89

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 8589
        Contig N20: 5955
        Contig N30: 4400
        Contig N40: 3276
        Contig N50: 2413

        Median contig length: 413
        Average contig: 1028.24
        Total assembled bases: 750921120


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 5470
        Contig N20: 3434
        Contig N30: 2391
        Contig N40: 1741
        Contig N50: 1285

        Median contig length: 343
        Average contig: 703.04
        Total assembled bases: 423841627
bowtie2-build trinity_out_dir/Trinity.fasta trinity_out_dir/Trinity.fasta
bowtie2 --local --no-unal -x trinity_out

#download uniprot swiss-prot db #ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

gunzip uniprot_sprot.fasta.gz
mkdir -p blast_protdb
mv uniprot_sprot.fasta blast_protdb/.
makeblastdb -in uniprot_sprot.fasta -dbtype prot
cd ..
blastx -query trinity_out_dir/Trinity.fasta -db blast_protdb/uniprot_sprot.fasta -out blastx.outfmt6 -evalue 1e-20 -num_threads 80 -max_target_seqs 1 -outfmt 6
analyze_blastPlus_topHit_coverage.pl blastx.outfmt6 trinity_out_dir/Trinity.fasta blast_protdb/uniprot_sprot.fasta | column -t

#hit_pct_cov_bin  count_in_bin  >bin_below
100               7361          7361
90                1117          8478
80                981           9459
70                971           10430
60                1003          11433
50                865           12298
40                924           13222
30                1102          14324
20                1478          15802
10                932           16734
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121770_1.fastq.gz -2 rawReads/SRR2121770_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121770.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121771_1.fastq.gz -2 rawReads/SRR2121771_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121771.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121774_1.fastq.gz -2 rawReads/SRR2121774_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121774.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121775_1.fastq.gz -2 rawReads/SRR2121775_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121775.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121778_1.fastq.gz -2 rawReads/SRR2121778_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121778.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121779_1.fastq.gz -2 rawReads/SRR2121779_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121779.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121780_1.fastq.gz -2 rawReads/SRR2121780_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121780.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121781_1.fastq.gz -2 rawReads/SRR2121781_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121781.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121786_1.fastq.gz -2 rawReads/SRR2121786_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121786.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121787_1.fastq.gz -2 rawReads/SRR2121787_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121787.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121788_1.fastq.gz -2 rawReads/SRR2121788_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121788.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121789_1.fastq.gz -2 rawReads/SRR2121789_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121789.coordSorted.bam
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121770_1.fastq.gz --right rawReads/SRR2121770_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121770.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121771_1.fastq.gz --right rawReads/SRR2121771_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121771.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121774_1.fastq.gz --right rawReads/SRR2121774_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121774.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121775_1.fastq.gz --right rawReads/SRR2121775_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121775.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121778_1.fastq.gz --right rawReads/SRR2121778_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121778.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121779_1.fastq.gz --right rawReads/SRR2121779_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121779.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121780_1.fastq.gz --right rawReads/SRR2121780_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121780.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121781_1.fastq.gz --right rawReads/SRR2121781_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121781.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121786_1.fastq.gz --right rawReads/SRR2121786_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121786.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121787_1.fastq.gz --right rawReads/SRR2121787_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121787.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121788_1.fastq.gz --right rawReads/SRR2121788_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121788.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121789_1.fastq.gz --right rawReads/SRR2121789_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121789.RSEM
time abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix Trinity_trans SRR2121770.RSEM/SRR2121770.RSEM.trans.results \
SRR2121771.RSEM/SRR2121771.RSEM.trans.results \
SRR2121774.RSEM/SRR2121774.RSEM.trans.results \
SRR2121775.RSEM/SRR2121775.RSEM.trans.results \
SRR2121778.RSEM/SRR2121778.RSEM.trans.results \
SRR2121779.RSEM/SRR2121779.RSEM.trans.results \
SRR2121780.RSEM/SRR2121780.RSEM.trans.results \
SRR2121781.RSEM/SRR2121781.RSEM.trans.results \
SRR2121786.RSEM/SRR2121786.RSEM.trans.results \
SRR2121787.RSEM/SRR2121787.RSEM.trans.results \
SRR2121788.RSEM/SRR2121788.RSEM.trans.results \
SRR2121789.RSEM/SRR2121789.RSEM.trans.results --gene_trans_map none
time abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix Trinity_genes SRR2121770.RSEM/SRR2121770.RSEM.genes.results \
SRR2121771.RSEM/SRR2121771.RSEM.genes.results \
SRR2121774.RSEM/SRR2121774.RSEM.genes.results \
SRR2121775.RSEM/SRR2121775.RSEM.genes.results \
SRR2121778.RSEM/SRR2121778.RSEM.genes.results \
SRR2121779.RSEM/SRR2121779.RSEM.genes.results \
SRR2121780.RSEM/SRR2121780.RSEM.genes.results \
SRR2121781.RSEM/SRR2121781.RSEM.genes.results \
SRR2121786.RSEM/SRR2121786.RSEM.genes.results \
SRR2121787.RSEM/SRR2121787.RSEM.genes.results \
SRR2121788.RSEM/SRR2121788.RSEM.genes.results \
SRR2121789.RSEM/SRR2121789.RSEM.genes.results --gene_trans_map none
contig_Exn50_statistic.pl Trinity_trans.TMM.EXPR.matrix trinity_out_dir/Trinity.fasta > ExN50_trans.stats

Make samples.txt

mock-	SRR2121770.RSEM
mock-	SRR2121771.RSEM
IR-	SRR2121774.RSEM
IR-	SRR2121775.RSEM
mock+	SRR2121778.RSEM
mock+	SRR2121779.RSEM
mock+	SRR2121780.RSEM
mock+	SRR2121781.RSEM
IR+	SRR2121786.RSEM
IR+	SRR2121787.RSEM
IR+	SRR2121788.RSEM
IR+	SRR2121789.RSEM

The way that the installed version of run_DE_analysis.pl tries to install or check for edgeR doesn't work for the new versions of R (3.6.1). In the perl script I just had to edit a few lines under the edgeR section for it to run smoothly, as well as install *BiocManager before hand.

I changed the perl script, under sub run_edgeR_sample_pair, when writing the R script to look like:

    print $ofh "if (! require(edgeR)) {\n";
    print $ofh "   install.packages(\"BiocManager\")\n";
    print $ofh "   BiocManager::install(\"edgeR\")\n";
    print $ofh "   library(edgeR)\n";
    print $ofh "}\n\n";

To make sure BiocManager and edgeR work, they were installed in the R terminal.

R #runs the installed version of R in terminal mode
install.packages("BiocManager")
BiocManager::install("edgeR")

Since we don't have at least 3 replicates for each of the treatments, edgeR wants a dispersion parameter. 0.035 seems to be pretty common. If the dispersion parameter has the squareroot taken of it we get ~ 0.19, meaning that we are saying the true abundance for each gene can vary up or down by 19% between replicates.

run_DE_analysis.pl --matrix Trinity_trans.counts.matrix --method edgeR --output edgeR_trans --dispersion 0.035 --samples_file samples.txt
head edgeR_trans/Trinity_trans.counts.matrix.IR-_vs_mock-.edgeR.DE_results | column -t

                          sampleA  sampleB  logFC           logCPM             PValue                FDR
TRINITY_DN2468_c0_g4_i1   IR-      mock-  -14.7753615825754  4.45315041918     5.51021995613131e-115  5.85185359341145e-110
TRINITY_DN2505_c0_g1_i2   IR-      mock-  -10.1817621187735  4.11815268306374  8.54695600746168e-88   4.53843363996215e-83
TRINITY_DN3231_c0_g2_i15  IR-      mock-  -13.5922145319911  3.27307743751541  6.29068575237937e-81   2.2269027563423e-76
TRINITY_DN4975_c0_g1_i7   IR-      mock-  13.4941229058571   3.17132827495398  5.25978730838885e-79   1.39647353037724e-74
TRINITY_DN4360_c0_g1_i5   IR-      mock-  -13.5215746751431  3.20272600532575  1.0222845072488e-70    2.17133229339645e-66
TRINITY_DN128_c0_g1_i20   IR-      mock-  13.1582688371709   2.83599676163376  1.8443766792831e-69    3.26454672233108e-65
TRINITY_DN644_c1_g1_i6    IR-      mock-  -5.91850038396836  4.31456225100214  8.01440426808572e-67   1.21589961895815e-62
TRINITY_DN2043_c0_g1_i13  IR-      mock-  13.7803059024088   3.45719012794474  3.30519380237759e-66   4.38764477265625e-62
TRINITY_DN3321_c0_g1_i2   IR-      mock-  12.8215942182179   2.49995284585527  9.5059376972195e-61    1.1217006482719e-56