title | author | date | output | ||||
---|---|---|---|---|---|---|---|
RNA Seq for STAT736 |
Alex Soupir |
9/10/2019 |
|
For STAT736-Fall-2019, we are analyzing the RNA-Seq from the publication Genome-wide analysis of p53 transcriptional programs in B cells upon exposure to genotoxic stress in vivo. We are only using the sequences B cells from spleen and not the non-B cells from spleen from the SRA Run Selector on NCBI.
The mice were exposed to whole-body ionizing radiation and sequences were extracted from both Bcells and non-B cells from the spleens of the mice. Two genotypes of mice were used: mice with p53 knocked out and the wild-type C57/Bl6. There were 4 different group combinations including the 2 different genotypes; each genotype was subjected to the ionizing radiation as well as control/mock.
Genotype | Treatment | |
---|---|---|
Group 1 | p53 | Mock |
Group 2 | C57/Bl6 | Mock |
Group 3 | p53 | IR |
Group 4 | C57/Bl6 | IR |
This document will contain 2 different pipelines: The first one is going to be using the genome to map the reads too, and the secod is going to be de novo.
#Genome Mapping {#genome-mapping}
The pipeline used in this analysis used conda on South Dakota State University's High Performance Computing cluster to run the programs FastQC, Trimmomatic, and Tophat.
This is different than previous RNA-Seq analyses where I used my workstation pc with Ubuntu 18.04 to run FastQC, Trimmomatic, HiSat2, HTSeq, and DESeq2 locally. Also, the previous RNA-Seq alayses were of Soybean with treatment combinations of mycorrhizae and rhizobia inoculation.
- FastQC
- Trimmomatic 0.39
- Bowtie 2.2.5.0
- Tophat 2.1.1
- STAR
- Cufflinks
- Kraken2
- MultiQC
- featureCount
To find a node that we can use on our own, we need to see which nodes are already allocated to jobs and which ones are idle. To do this, we can run sinfo. We want to pick one of the nodes that are marked 'idle' so we get the whole thing and we aren't interrupting someone elses job. For the sake of this exercise, lets work on big-mem.
Once a node that is idle has been found, you can ssh into it by typing ssh -X big-mem00# where # is the node number.
ssh -X big-mem005
Once on the node, the modules will have to be pulled from the shared folder again, otherwise we will be left with very basic ones. NOTE: if running programs that are in a personal folder such as miniconda (these examples), it is not necessary to add the other modules.
module use /cm/shared/modulefiles_local/
After loading the modules you can use it just as you would any other command line.
When running on a cluster, it can sometime be difficult to find open nodes with the resources needed to run the jobs that we have. Making a slurm script is really easy. Fist we make a new file with the touch command.
touch commands.slurm
Now in our directory we have the file commands.slurm which we can edit to hold our code in. We can edit it with the vi command.
vi commands.slurm
We have a few things that we need to put in the file header so slurm knows what to do with our commands.
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --output=job-%j-%N.log
#SBATCH --partition=bigmem
#SBATCH --time=10:00:00
When we break this down, we see --job-name which is what we will see when we look at whats running later, --nodes is the number of nodes we have, --ntasks-per-node is the number of cores that we are requesting to have allocated, --output is the output log file of the job (here it names the output file with the job number and the node that we used), --partition here is requesting a big-mem node but compute can also be used, and finally --time is how long we are requesting the allocation for.
If the time runs out before the job is done I believe that it just kills the job even if not finished so we need to think a little about how much time to set. If the time is set too low, the job is killed and if the time is set too long, we may face issues with getting the node allocated to us.
To submit a job we can use sbatch commands.slurm and then we have the job ID. To check the status of our submission we use sbatch and then it shows all of the submitted jobs and how long they have been running along with the name that we set in the script.
To download the sequences from the sequence read archive (SRA), the SRA Toolkit was used. The downloading of the files took a very long time, so this was left to run over night. The --gzip was used to keep the files a relatively small, although this can be left out to download uncompressed files, and --split-files was used to split the forward read from the reverse read for paired end read trimming through Trimmomatic.
~/tools/sratoolkit.2.9.6-1-centos_linux64/bin/fastq-dump --gzip --split-files SRR2121770
This is an example of the single file, but the above code needed to be ran for all of the following SRA numbers:
- SRR2121770
- SRR2121771
- SRR2121774
- SRR2121775
- SRR2121778
- SRR2121779
- SRR2121780
- SRR2121781
- SRR2121786
- SRR2121787
- SRR2121788
- SRR2121789
The results from downloading with --split-files gives 2 files per SRR, as mentioned before, one forward and one reverse. The suffix of the split files is one with _1.fastq.gz and another with _2.fastq.gz.
FastQC can be run on all of the read files by using the wild card (*) as in *.fastq.gz. This prevents the need to hard code each individual read file into a FastQC command, which saves a lot of time since there are 24 read files in total for these 12 samples.
~/miniconda2/bin/fastqc *.fastq.gz
The output from running FastQC is a zipped folder and an HTML file for each of the \fastq.gz files in the folder. The HTML document looks something like this:
This is just the top of the file, and every category under the Summary heading has a graph that shows how the read quality looks for that particular metric. These reports can give insight into whether the reads are of decent quality or if the quality is poor.
The raw reads we have here all passed for adapter content and sequence length distribution and everything failed per base sequence content. SRR2121770, SRR2121771, SRR2121774, SRR2121775, SRR2121788, SRR2121781-2, and SRR2121789-1 were fairly decent quality reads. SRR2121778, SRR2121779, SRR2121780, SRR2121786, SRR2121787, SRR2121781-1, and SRR2121789-2 were of fairly lower quality (failing 3 or more in both reads. All of them failed both per base sequence quality and per tile sequence quality.
First lets install multiqc with conda. The command for this is conda install -c bioconda multiqc.
When that is finished, we can run MultiQC in the folder with the QC files (they should be moved into a folder alone so things don't get cluttered later on in the analyses).
~/miniconda/bin/multiqc .
When MultiQC is finished running, there will be a new folder called multiqc_data where the summaries are stored. Now lets go back up a level where our raw data folder and fastqc folder is and make a new folder for all of our MultiQC data. We will copy the FastQC output from MultiQC to this new folder.
mkdir MultiQC_All
cp RawQC/multiqc_data/multiqc_fastqc.txt MultiQC_All/
Conda was used again to run Trimmomatic. This isn't as easy as using the wildcard like with FastQC because each output has to be personalized for the read files that are input into Trimmomatic. Also, we have to make sure that the adapter sequences are in the same folder that we are running so we can refer to them easily when calling the Trimmomatic program. In this case, we are using the TruSeq3-PE-2.fa adapter sequences For example:
~/miniconda2/bin/trimmomatic PE SRR2121770_1.fastq.gz SRR2121770_2.fastq.gz 770_fp.fq.gz 770_fu.fq.gz 770_rp.fq.gz 770_ru.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRALING:3 MINLEN:36 &
This would be repeated for each of the pairs (12 in total). We are trimming paired-end reads with the TruSeq3-PE-2 adapters. We are chopping off the first and last 3 bases and if we end up with a sequence less than 36 bases, we get rid of it. We want to make sure that there are enough bases in a read to work with. These parameters can be tweaked for possibly better end results with less being discarded. When Trimmomatic is finished running, it will out put the total number of reads, the total number from both the forward and reverse reads that are kept, the number of only forward reads kept, the number of only reverse reads kept, and the number of discarded reads.
The highest number of reads dropped was from trimming SRR2121786, where 20.55% dropped. Most reads were between 5% and 10% dropped. SRR2121786, SRR2121787, and SRR2121779 had sequence drops greater than 15%.
The trimmed reads can be analyzed again with FastQC to see how well the trimming worked to make the file better quality. After running FastQC on the trimmed files we see that the quality of those that were really bad quality were improved. There were a few different metrics throughout all of the files that bounced from a warning before the failing, or from passing before to a warning, and so forth, overall creating better quality read files.
Tophat can be installed using the same conda install (conda install -c bioconda tophat). When this is finished installing, then we will need to get the mouse genome from the Johns Hopkins Univeristy Center for computational BIology. The version of the mouse genome that I am using here is the NCBI build37.2. Instead of downloading this from the website and having to move it to the cluser, I will just download it using wget into the folder that has the raw reads, trimmed reads, and the FastQC files.
wget ftp://igenome:[email protected]/Mus_musculus/NCBI/build37.2/Mus_musculus_NCBI_build37.2.tar.gz
This will take a long time to download because the file is a little less than 16GB zipped.
We notice here that we have a zipped tar file. To make this file easier to use, lets unzip it.
tar zxvf Mus_muculus_NCBI_build37.2.tar.gz
Since Tophat is requiring *.bt21 files (large index) and the files downloaded for the genome above are only small index files, we have to create a large index using bowtie2-build. For this, lets navigate to the WholeGenomeFasta folder within the extracted folder and then run bowtie2-build.
~/miniconda2/bin/bowtie2-build --large-index genome.fa genome
This process took about 26 minutes to run. Now lets copy the index files to a folder close to our reads so we can access them easier, rather than having to refer to the longer path where we build them. After they are copied to a new folder closer to our working directory, I went ahead and unzipped the trimmed read files to try and make the Tophat faster but it turned out not to work. The multicore call with -p didn't use more cores than 1 until bowtie2-align-s, then 20 cores were used.
~/miniconda2/bin/tophat --no-converage-search -p 20 -G Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf -0 770_thout ./Index/genome 770_fp.fq.gz 770_rp.fq.gz 770_fu.fq 770_ru.fq
This run took almost 3 hours to complete.. Running with 80 cores rather than 20 cores took just 4 minutes less, so the whole process must be limited by a single core and the core clock speed. The process does use close to 8,000% at its peak so there is a benefit to multicore, just isn't very scalable.
STAR can be installed the same way as the previous programs with conda install (conda install -c bioconda star). In order to run STAR, we need to creaate indices just like with tophat, but STAR has this built in. I'm going to be using the same genome and GTF file as previously downloaded, but Dr. Ge uses a different zipped genome from the gencode database.
~/miniconda2/bin/STAR \
--runThreadN 80 \
--runMode genomeGenerate \
--genomeDir starIndex \
--genomeFastaFiles Index/genome.fa \ #same when we made the bowtie indices
--sjdbGTFfile Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf
With the index files made, we can start aligning with STAR. It's important here than we only pick the paired end reads and not use all of the reads. Tophat is able to use all 4 reads but STAR doesn't allow that, so we need to make sure that we feed in the large files from trimming.
~/miniconda2/bin/STAR --runThreadN 80 --genomeDir starIndex --readFilesIn 770_fp.fq 770_rp.fq --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix 2121770 --outSAMtype BAM SortedByCoordinate
Once STAR is done running, we can assemble the transcripts with Cufflinks. This can also be installed with conda install (conda install -c bioconda cufflinks).
~/miniconda2/bin/cufflinks -p 20 -o SRR2121771_clout --library-type fr-firststrand 2121770Aligned.sortedByCoord.out.bam
Now we will look at what kind of contamination we are looking at. When samples are sequenced with Illumina, a PhiX control is run along side them. This control is for cluster generation, sequencing, alignment, and calibration for cross-talk matrix generation. We will use Bowtie to create a file to determine the PhiX contamination level.
~/miniconda2/bin/bowtie2 -p 20 -x PhiX/Illumina/RTA/Sequence/Bowtie2Index/genome \
-1 TrimmedReads/770_fp.fq -2 TrimmedReads/770_rp.fq -S phix.sam &> PhiXout/SRR2121770_phix.out
When the job is done running, the output file will show how much PhiX contamination we have. For example, lookin at the SRR2121770_phix.out created above, we see that 0.11% of the reads aligned with PhiX. The lower this value the better.
To retreive the rRNA sequences for mouse, we need to search the taxonomy database on NCBI for Mus musculus. Click on Mus musculus on the next page, and then the top Mus musculus at the head of the list. Now, select the top subtree link in the Nucleotide database. Select rRNA sequences on the left side of the page and download full list just downloading with Send > Complete Record > File > FASTA > Create File. Drag the file using WinSCP to the raw folder on the cluster and rename it to rRNA.fa.
We are going to need to install bwa with conda in order to get the alignments to work. This can be done with conda install -c bioconda bwa. Following this, we will need to make indixes for the rRNA that we downloaded. To make this more clean, lets make a directory for the rRNA sequences that we downloaded and the indices that we make.
mkdir rRNA
Then we move the rRNA.fa to the new rRNA folder with WinSCP and then we can run the bwa.
time ~/miniconda2/bin/bwa mem -t 20 rRNA/rRNA.fa TrimmedReads/770_fp.fq TrimmedReads/770_rp.fq > rnaAlign/770_rna.sam
When we are done creating the new *.sam files for all of the forward/reverse read combinations, we can use samtools to convert the *.sam file to *.bam files which are essentially the same file just that sam is easier for us to look at while bam is binary. Samtools can be installed with conda install -c bioconda samtools.
~/miniconda2/bin/samtools view -@ 10 -bS -o rnaAlign/770_rna.bam rnaAlign/770.sam
Now in the rnaAlign folder we have our sam and bam file for each of the libraries. Lets create an output file with flagstat.
~/miniconda2/bin/samtools flagstat -@ 10 rnaAlign/770_rna.out
Wihtin this file we will be able to see the summary of our alignments to the rRNA file that we downloaded from NCBI.
#From 770_rna.out
205559289 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
85 + 0 supplementary
0 + 0 duplicates
4265179 + 0 mapped (2.07% : N/A)
205559204 + 0 paired in sequencing
102779602 + 0 read1
102779602 + 0 read2
4151684 + 0 properly paired (2.02% : N/A)
4187608 + 0 with itself and mate mapped
77486 + 0 singletons (0.04% : N/A)
4222 + 0 with mate mapped to a different chr
1026 + 0 with mate mapped to a different chr (mapQ>=5)
In order to find out the contamination, we need to install Kraken2 with conda install -c bioconda kraken2 and download a pre-built database containing bacteria, archaea, and viral sequences. The database we are going to download only contains about 5% of k-mers from the original database (but directions are sort of lacking to build an entirely new database). More information can be found at https://ccb.jhu.edu/software/kraken/ for the pre-built databases.
Using the code in the next chunk will download the 8GB database and then extract the files so we can use them with the Kraken2 program. Lets do this in the main project folder.
wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz
tar xzf minikraken2_v2_8GB_201904_UPDATE.tgz
Now lets make a directory for the output.
mkdir krakenOut
We can call Kraken with the extracted database folder and point it to the location of out paired end reads from trimming and to the output folder that we just created for the outputs.
~/miniconda2/bin/kraken2 --db minikraken2_v2_8GB_201904_UPDATE/ --output krakenOut/770.out --threads 10 --paired TrimmedReads/770_fp.fq TrimmedReads/770_rp.fq
When Kraken is done running, it will print out the number (and percentage) of reads that were classified. In this case, we have used 102779602 sequences, of which 19142843 sequences were classified (18.63%) and 83636759 sequences were unclassified (81.37%). My interpretation of this is that 18.63% of the reads are possibly from microbial cell contamination.
Since we have the bam files from the alignments of the different samples, we can count the features for each and get the transcipt counts using featureCounts form conda install -c bioconda/label/cf201901 subread. The genome and annotations that we previously downloaded were from genome mm9 so we have to specify to featureCounts what we want to actually count. FeatureCounts defaults to using gene_id which our output bam files don't have described correctly for featureCounts to read them. This is a single line of code because we can use a wildcard to run through all of the bam files.
#Move to the Star Alignment output folder for a working directory
cd StarOut
~/miniconda2/bin/featureCounts -a /gpfs/scratch/alex.soupir/Mus/raw/Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf -g 'transcript_id' -o readCounds.txt *bam
With the files that we are working with, this will take between 3.5 minutes to 5 minutes per bam file. The output will be a file that can be imported into excel and saved as csv which we then can work with in R.
Lets look at the data that we have collected from all of the MultiQC runs that we had with initial FastQC, Trimmomatic, STAR alignment, PhiX contamination, rRNA contamination, and the final feature counts.
qc = read.csv('Whole Data QC.csv', header=TRUE, na.strings="")
kable(qc) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "800px")
X | Sample | SRR2121770_1 | SRR2121770_2 | SRR2121771_1 | SRR2121771_2 | SRR2121774_1 | SRR2121774_2 | SRR2121775_1 | SRR2121775_2 | SRR2121778_1 | SRR2121778_2 | SRR2121779_1 | SRR2121779_2 | SRR2121780_1 | SRR2121780_2 | SRR2121781_1 | SRR2121781_2 | SRR2121786_1 | SRR2121786_2 | SRR2121787_1 | SRR2121787_2 | SRR2121788_1 | SRR2121788_2 | SRR2121789_1 | SRR2121789_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FastQC | adapter_content | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass |
NA | Sequences flagged as poor quality | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
NA | sequence_duplication_levels | warn | warn | pass | pass | pass | pass | pass | pass | warn | pass | pass | pass | pass | pass | warn | warn | warn | pass | pass | pass | fail | fail | fail | warn |
NA | avg_sequence_length | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 |
NA | Encoding | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 |
NA | per_base_sequence_quality | pass | pass | pass | pass | pass | pass | pass | pass | fail | fail | fail | fail | fail | fail | fail | pass | fail | fail | fail | fail | pass | pass | pass | fail |
NA | sequence_length_distribution | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass |
NA | Sequence length | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 | 51 |
NA | File type | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls | Conventional base calls |
NA | basic_statistics | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass |
NA | per_sequence_gc_content | pass | pass | pass | pass | pass | pass | warn | warn | warn | warn | pass | fail | warn | fail | pass | pass | warn | fail | warn | fail | fail | fail | fail | fail |
NA | Total Sequences | 118323219 | 118323219 | 103127231 | 103127231 | 91225885 | 91225885 | 108661623 | 108661623 | 136766553 | 136766553 | 124265478 | 124265478 | 99685598 | 99685598 | 106360532 | 106360532 | 99988292 | 99988292 | 75410310 | 75410310 | 82750449 | 82750449 | 92454136 | 92454136 |
NA | per_base_n_content | pass | pass | warn | pass | warn | pass | pass | pass | fail | warn | fail | warn | fail | warn | pass | pass | fail | fail | fail | warn | pass | pass | pass | fail |
NA | per_base_sequence_content | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail |
NA | overrepresented_sequences | warn | warn | warn | warn | warn | warn | warn | warn | warn | warn | pass | warn | pass | pass | warn | warn | warn | warn | pass | pass | warn | warn | warn | warn |
NA | %GC | 47 | 47 | 46 | 46 | 47 | 47 | 47 | 47 | 46 | 47 | 46 | 46 | 46 | 46 | 46 | 46 | 47 | 47 | 46 | 46 | 47 | 47 | 47 | 48 |
NA | total_deduplicated_percentage | 52.20958673 | 55.28743506 | 90.47297502 | 97.22714143 | 91.11411337 | 97.47431986 | 72.31423101 | 95.57923295 | 65.27895305 | 83.85617925 | 73.08810256 | 79.50169478 | 92.46127029 | 94.09779679 | 54.09833163 | 52.4832411 | 68.88309902 | 74.63550918 | 93.00930552 | 94.9666922 | 43.56442006 | 44.95245995 | 42.6643141 | 57.36052687 |
NA | Filename | SRR2121770_1.fastq.gz | SRR2121770_2.fastq.gz | SRR2121771_1.fastq.gz | SRR2121771_2.fastq.gz | SRR2121774_1.fastq.gz | SRR2121774_2.fastq.gz | SRR2121775_1.fastq.gz | SRR2121775_2.fastq.gz | SRR2121778_1.fastq.gz | SRR2121778_2.fastq.gz | SRR2121779_1.fastq.gz | SRR2121779_2.fastq.gz | SRR2121780_1.fastq.gz | SRR2121780_2.fastq.gz | SRR2121781_1.fastq.gz | SRR2121781_2.fastq.gz | SRR2121786_1.fastq.gz | SRR2121786_2.fastq.gz | SRR2121787_1.fastq.gz | SRR2121787_2.fastq.gz | SRR2121788_1.fastq.gz | SRR2121788_2.fastq.gz | SRR2121789_1.fastq.gz | SRR2121789_2.fastq.gz |
NA | per_tile_sequence_quality | warn | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | warn | fail | fail | fail | fail | warn | pass | pass | fail |
NA | per_sequence_quality_scores | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | pass | fail | pass | fail | pass | pass | pass | pass |
Trimming | surviving | 102779602 | NA | 88509413 | NA | 77240029 | NA | 94179097 | NA | 108337271 | NA | 92924182 | NA | 76172416 | NA | 95488961 | NA | 69906417 | NA | 54427039 | NA | 73771492 | NA | 82764590 | NA |
NA | surviving_pct | 86.86 | NA | 85.83 | NA | 84.67 | NA | 86.67 | NA | 79.21 | NA | 74.78 | NA | 76.41 | NA | 89.78 | NA | 69.91 | NA | 72.17 | NA | 89.15 | NA | 89.52 | NA |
NA | reverse_only_surviving_pct | 4.48 | NA | 2.52 | NA | 2.53 | NA | 2.29 | NA | 2.11 | NA | 2.44 | NA | 2.52 | NA | 1.8 | NA | 2.59 | NA | 2.65 | NA | 1.85 | NA | 1.77 | NA |
NA | forward_only_surviving | 3202267 | NA | 5041955 | NA | 4667583 | NA | 4832339 | NA | 9375417 | NA | 7916665 | NA | 6350522 | NA | 3894412 | NA | 6944977 | NA | 5349296 | NA | 3489058 | NA | 3797776 | NA |
NA | dropped | 7036792 | NA | 6973727 | NA | 7013028 | NA | 7160952 | NA | 16166923 | NA | 20393774 | NA | 14647690 | NA | 5067412 | NA | 20551506 | NA | 13638339 | NA | 3962243 | NA | 4252397 | NA |
NA | dropped_pct | 5.95 | NA | 6.76 | NA | 7.69 | NA | 6.59 | NA | 11.82 | NA | 16.41 | NA | 14.69 | NA | 4.76 | NA | 20.55 | NA | 18.09 | NA | 4.79 | NA | 4.6 | NA |
NA | forward_only_surviving_pct | 2.71 | NA | 4.89 | NA | 5.12 | NA | 4.45 | NA | 6.86 | NA | 6.37 | NA | 6.37 | NA | 3.66 | NA | 6.95 | NA | 7.09 | NA | 4.22 | NA | 4.11 | NA |
NA | input_read_pairs | 118323219 | NA | 103127231 | NA | 91225885 | NA | 108661623 | NA | 136766553 | NA | 124265478 | NA | 99685598 | NA | 106360532 | NA | 99988292 | NA | 75410310 | NA | 82750449 | NA | 92454136 | NA |
NA | reverse_only_surviving | 5304558 | NA | 2602136 | NA | 2305245 | NA | 2489235 | NA | 2886942 | NA | 3030857 | NA | 2514970 | NA | 1909747 | NA | 2585392 | NA | 1995636 | NA | 1527656 | NA | 1639373 | NA |
STAR Alignment | uniquely_mapped_percent | 81.73 | NA | 82.35 | NA | 81.67 | NA | 83.1 | NA | 83.43 | NA | 84.55 | NA | 82.48 | NA | 86.82 | NA | 80.75 | NA | 80.27 | NA | 80.81 | NA | 81.62 | NA |
NA | num_splices | 8669080 | NA | 8000968 | NA | 10613774 | NA | 11755566 | NA | 10864591 | NA | 8361644 | NA | 6866296 | NA | 7991948 | NA | 6183074 | NA | 4335214 | NA | 7285454 | NA | 8581902 | NA |
NA | num_GCAG_splices | 90012 | NA | 83679 | NA | 116149 | NA | 124779 | NA | 116447 | NA | 86763 | NA | 74268 | NA | 80836 | NA | 65893 | NA | 44765 | NA | 80392 | NA | 97422 | NA |
NA | insertion_length | 1.11 | NA | 1.1 | NA | 1.12 | NA | 1.17 | NA | 1.13 | NA | 1.11 | NA | 1.11 | NA | 1.13 | NA | 1.16 | NA | 1.15 | NA | 1.13 | NA | 1.15 | NA |
NA | deletion_length | 1.35 | NA | 1.33 | NA | 1.37 | NA | 1.26 | NA | 1.36 | NA | 1.35 | NA | 1.35 | NA | 1.4 | NA | 1.2 | NA | 1.22 | NA | 1.3 | NA | 1.25 | NA |
NA | unmapped_tooshort_percent | 4.08 | NA | 4.21 | NA | 3.42 | NA | 4.09 | NA | 3.51 | NA | 3.87 | NA | 3.66 | NA | 3.45 | NA | 5.83 | NA | 6.42 | NA | 5.61 | NA | 4.53 | NA |
NA | avg_mapped_read_length | 100.73 | NA | 101.11 | NA | 101.1 | NA | 101.14 | NA | 100.6 | NA | 100.51 | NA | 100.51 | NA | 100.81 | NA | 100.41 | NA | 100.33 | NA | 101.18 | NA | 101.22 | NA |
NA | deletion_rate | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0.01 | NA | 0.01 | NA | 0.01 | NA | 0.01 | NA | 0.01 | NA | 0.01 | NA | 0.01 | NA | 0.01 | NA |
NA | mismatch_rate | 0.23 | NA | 0.29 | NA | 0.29 | NA | 0.17 | NA | 0.17 | NA | 0.22 | NA | 0.22 | NA | 0.17 | NA | 0.26 | NA | 0.28 | NA | 0.16 | NA | 0.15 | NA |
NA | avg_input_read_length | 101 | NA | 101 | NA | 101 | NA | 101 | NA | 100 | NA | 100 | NA | 100 | NA | 101 | NA | 100 | NA | 100 | NA | 101 | NA | 101 | NA |
NA | num_ATAC_splices | 8039 | NA | 7497 | NA | 10782 | NA | 11677 | NA | 9722 | NA | 7455 | NA | 6186 | NA | 7544 | NA | 5492 | NA | 3832 | NA | 6602 | NA | 7593 | NA |
NA | num_annotated_splices | 8581504 | NA | 7919864 | NA | 10513157 | NA | 11643271 | NA | 10755797 | NA | 8273449 | NA | 6799294 | NA | 7913548 | NA | 6103902 | NA | 4260450 | NA | 7202217 | NA | 8487777 | NA |
NA | num_GTAG_splices | 8571029 | NA | 7909792 | NA | 10486843 | NA | 11619110 | NA | 10738422 | NA | 8267426 | NA | 6785842 | NA | 7903568 | NA | 6111689 | NA | 4286617 | NA | 7198460 | NA | 8476887 | NA |
NA | uniquely_mapped | 83997598 | NA | 72886249 | NA | 63080382 | NA | 78265974 | NA | 90382084 | NA | 78565331 | NA | 62830380 | NA | 82902461 | NA | 56450144 | NA | 43685904 | NA | 59617353 | NA | 67556146 | NA |
NA | multimapped_toomany | 1852495 | NA | 1405409 | NA | 1242903 | NA | 1061446 | NA | 1736807 | NA | 1362362 | NA | 1374461 | NA | 846193 | NA | 1009216 | NA | 955877 | NA | 703236 | NA | 890704 | NA |
NA | unmapped_mismatches | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | unmapped_mismatches_percent | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | total_reads | 102779602 | NA | 88509413 | NA | 77240029 | NA | 94179097 | NA | 108337271 | NA | 92924182 | NA | 76172416 | NA | 95488961 | NA | 69906417 | NA | 54427039 | NA | 73771492 | NA | 82764590 | NA |
NA | unmapped_other | 432340 | NA | 370957 | NA | 254880 | NA | 377196 | NA | 476580 | NA | 454868 | NA | 335571 | NA | 468726 | NA | 370629 | NA | 359051 | NA | 317451 | NA | 405340 | NA |
NA | insertion_rate | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | unmapped_other_percent | 0.42 | NA | 0.42 | NA | 0.33 | NA | 0.4 | NA | 0.44 | NA | 0.49 | NA | 0.44 | NA | 0.49 | NA | 0.53 | NA | 0.66 | NA | 0.43 | NA | 0.49 | NA |
NA | multimapped_percent | 11.96 | NA | 11.44 | NA | 12.97 | NA | 11.27 | NA | 11.02 | NA | 9.63 | NA | 11.61 | NA | 8.35 | NA | 11.44 | NA | 10.9 | NA | 12.19 | NA | 12.28 | NA |
NA | multimapped | 12297290 | NA | 10128397 | NA | 10020381 | NA | 10617654 | NA | 11939994 | NA | 8949092 | NA | 8840667 | NA | 7971370 | NA | 7999509 | NA | 5933625 | NA | 8991826 | NA | 10165077 | NA |
NA | num_noncanonical_splices | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | unmapped_tooshort | 4199879 | NA | 3718401 | NA | 2641483 | NA | 3856827 | NA | 3801806 | NA | 3592529 | NA | 2791337 | NA | 3300211 | NA | 4076919 | NA | 3492582 | NA | 4141626 | NA | 3747323 | NA |
NA | multimapped_toomany_percent | 1.8 | NA | 1.59 | NA | 1.61 | NA | 1.13 | NA | 1.6 | NA | 1.47 | NA | 1.8 | NA | 0.89 | NA | 1.44 | NA | 1.76 | NA | 0.95 | NA | 1.08 | NA |
PhiX Contamination | overall_alignment_rate | 0.11 | NA | 0.13 | NA | 0.18 | NA | 0.2 | NA | 0.12 | NA | 0.12 | NA | 0.1 | NA | 0.15 | NA | 0.18 | NA | 0.16 | NA | 0.2 | NA | 0.2 | NA |
NA | paired_aligned_mate_multi_halved | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | paired_aligned_none | 102674701 | NA | 88400163 | NA | 77108986 | NA | 93998943 | NA | 108212089 | NA | 92820803 | NA | 76103405 | NA | 95350795 | NA | 69784944 | NA | 54344793 | NA | 73629819 | NA | 82605166 | NA |
NA | paired_aligned_mate_multi | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | paired_aligned_discord_one | 5081 | NA | 5514 | NA | 6603 | NA | 9196 | NA | 6325 | NA | 5222 | NA | 3475 | NA | 7190 | NA | 6424 | NA | 4218 | NA | 7405 | NA | 8139 | NA |
NA | paired_aligned_mate_one | 2761 | NA | 2466 | NA | 2825 | NA | 3974 | NA | 2693 | NA | 2277 | NA | 1515 | NA | 2999 | NA | 2696 | NA | 1961 | NA | 2913 | NA | 3585 | NA |
NA | paired_total | 102779602 | NA | 88509413 | NA | 77240029 | NA | 94179097 | NA | 108337271 | NA | 92924182 | NA | 76172416 | NA | 95488961 | NA | 69906417 | NA | 54427039 | NA | 73771492 | NA | 82764590 | NA |
NA | paired_aligned_mate_none_halved | 102668239.5 | NA | 88393416 | NA | 77100970.5 | NA | 93987760 | NA | 108204417.5 | NA | 92814442.5 | NA | 76099172.5 | NA | 95342105.5 | NA | 69777172 | NA | 54339594.5 | NA | 73620957.5 | NA | 82595234.5 | NA |
NA | paired_aligned_multi | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | paired_aligned_one | 104901 | NA | 109250 | NA | 131043 | NA | 180154 | NA | 125182 | NA | 103379 | NA | 69011 | NA | 138166 | NA | 121473 | NA | 82246 | NA | 141673 | NA | 159424 | NA |
NA | total_reads | 102779602 | NA | 88509413 | NA | 77240029 | NA | 94179097 | NA | 108337271 | NA | 92924182 | NA | 76172416 | NA | 95488961 | NA | 69906417 | NA | 54427039 | NA | 73771492 | NA | 82764590 | NA |
NA | paired_aligned_mate_one_halved | 1380.5 | NA | 1233 | NA | 1412.5 | NA | 1987 | NA | 1346.5 | NA | 1138.5 | NA | 757.5 | NA | 1499.5 | NA | 1348 | NA | 980.5 | NA | 1456.5 | NA | 1792.5 | NA |
NA | paired_aligned_mate_none | 205336479 | NA | 176786832 | NA | 154201941 | NA | 187975520 | NA | 216408835 | NA | 185628885 | NA | 152198345 | NA | 190684211 | NA | 139554344 | NA | 108679189 | NA | 147241915 | NA | 165190469 | NA |
rRNA Contamination | mapped_passed | 4265172 | NA | 3371478 | NA | 1366577 | NA | 4031209 | NA | 5947728 | NA | 5093417 | NA | 4109793 | NA | 5042021 | NA | 7103552 | NA | 4879843 | NA | 5080844 | NA | 7013490 | NA |
NA | duplicates_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | secondary_passed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | paired in sequencing_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | duplicates_passed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | read2_passed | 102779602 | NA | 88509413 | NA | 77240029 | NA | 94179097 | NA | 108337271 | NA | 92924182 | NA | 76172416 | NA | 95488961 | NA | 69906417 | NA | 54427039 | NA | 73771492 | NA | 82764590 | NA |
NA | read1_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | read1_passed | 102779602 | NA | 88509413 | NA | 77240029 | NA | 94179097 | NA | 108337271 | NA | 92924182 | NA | 76172416 | NA | 95488961 | NA | 69906417 | NA | 54427039 | NA | 73771492 | NA | 82764590 | NA |
NA | with mate mapped to a different chr_passed | 4240 | NA | 3672 | NA | 692 | NA | 2742 | NA | 7850 | NA | 5990 | NA | 4782 | NA | 5840 | NA | 5348 | NA | 2434 | NA | 2250 | NA | 3646 | NA |
NA | total_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | properly paired_passed_pct | 2.02 | NA | 1.84 | NA | 0.86 | NA | 2.07 | NA | 2.7 | NA | 2.69 | NA | 2.65 | NA | 2.59 | NA | 4.98 | NA | 4.39 | NA | 3.36 | NA | 4.15 | NA |
NA | singletons_passed | 77493 | NA | 85229 | NA | 21005 | NA | 88202 | NA | 48177 | NA | 47851 | NA | 41299 | NA | 53429 | NA | 60929 | NA | 55645 | NA | 58038 | NA | 57600 | NA |
NA | supplementary_passed | 85 | NA | 65 | NA | 16 | NA | 23 | NA | 43 | NA | 44 | NA | 44 | NA | 72 | NA | 19 | NA | 18 | NA | 58 | NA | 50 | NA |
NA | singletons_passed_pct | 0.04 | NA | 0.05 | NA | 0.01 | NA | 0.05 | NA | 0.02 | NA | 0.03 | NA | 0.03 | NA | 0.03 | NA | 0.04 | NA | 0.05 | NA | 0.04 | NA | 0.03 | NA |
NA | mapped_failed_pct | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA |
NA | mapped_passed_pct | 2.07 | NA | 1.9 | NA | 0.88 | NA | 2.14 | NA | 2.75 | NA | 2.74 | NA | 2.7 | NA | 2.64 | NA | 5.08 | NA | 4.48 | NA | 3.44 | NA | 4.24 | NA |
NA | supplementary_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | with itself and mate mapped_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | mapped_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | total_passed | 205559289 | NA | 177018891 | NA | 154480074 | NA | 188358217 | NA | 216674585 | NA | 185848408 | NA | 152344876 | NA | 190977994 | NA | 139812853 | NA | 108854096 | NA | 147543042 | NA | 165529230 | NA |
NA | properly paired_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | flagstat_total | 205559289 | NA | 177018891 | NA | 154480074 | NA | 188358217 | NA | 216674585 | NA | 185848408 | NA | 152344876 | NA | 190977994 | NA | 139812853 | NA | 108854096 | NA | 147543042 | NA | 165529230 | NA |
NA | with mate mapped to a different chr (mapQ >= 5)_passed | 1026 | NA | 854 | NA | 118 | NA | 442 | NA | 1093 | NA | 940 | NA | 801 | NA | 1206 | NA | 964 | NA | 487 | NA | 823 | NA | 951 | NA |
NA | properly paired_failed_pct | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA |
NA | with mate mapped to a different chr (mapQ >= 5)_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | with itself and mate mapped_passed | 4187594 | NA | 3286184 | NA | 1345556 | NA | 3942984 | NA | 5899508 | NA | 5045522 | NA | 4068450 | NA | 4988520 | NA | 7042604 | NA | 4824180 | NA | 5022748 | NA | 6955840 | NA |
NA | read2_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | with mate mapped to a different chr_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | properly paired_passed | 4151444 | NA | 3256926 | NA | 1334312 | NA | 3903816 | NA | 5846848 | NA | 5001956 | NA | 4035832 | NA | 4945218 | NA | 6955726 | NA | 4775228 | NA | 4958480 | NA | 6869584 | NA |
NA | paired in sequencing_passed | 205559204 | NA | 177018826 | NA | 154480058 | NA | 188358194 | NA | 216674542 | NA | 185848364 | NA | 152344832 | NA | 190977922 | NA | 139812834 | NA | 108854078 | NA | 147542984 | NA | 165529180 | NA |
NA | singletons_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | secondary_failed | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | singletons_failed_pct | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA | nan | NA |
Feature Counts | Unassigned_Ambiguity | 5553184 | NA | 4245795 | NA | 4120211 | NA | 4873854 | NA | 5225069 | NA | 4143370 | NA | 3730746 | NA | 3744259 | NA | 3208052 | NA | 2154031 | NA | 3756137 | NA | 4158307 | NA |
NA | Unassigned_MappingQuality | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | percent_assigned | 38.92490502 | NA | 41.43212911 | NA | 45.0933741 | NA | 46.82554793 | NA | 45.79057473 | NA | 42.0896936 | NA | 43.45341423 | NA | 41.01237187 | NA | 46.15918368 | NA | 44.71355742 | NA | 45.65729376 | NA | 46.21716065 | NA |
NA | Unassigned_Nonjunction | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | Unassigned_Duplicate | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | Unassigned_Chimera | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | Unassigned_Unmapped | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | Assigned | 92630635 | NA | 83632332 | NA | 81770639 | NA | 99643999 | NA | 112432428 | NA | 87231124 | NA | 76094464 | NA | 86121025 | NA | 71761506 | NA | 53877618 | NA | 75554738 | NA | 86661345 | NA |
NA | Unassigned_MultiMapping | 69977472 | NA | 56081314 | NA | 55175498 | NA | 56266414 | NA | 64772012 | NA | 50119894 | NA | 49456582 | NA | 44182994 | NA | 42564996 | NA | 33123230 | NA | 46247590 | NA | 52396728 | NA |
NA | Unassigned_Overlapping_Length | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | Unassigned_Secondary | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | Unassigned_FragmentLength | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA |
NA | Unassigned_NoFeatures | 69811377 | NA | 57894371 | NA | 40269914 | NA | 52014095 | NA | 63106671 | NA | 65756168 | NA | 45835550 | NA | 75939638 | NA | 37930730 | NA | 31340159 | NA | 39923831 | NA | 44292640 | NA |
NA | Total | 237972668 | NA | 201853812 | NA | 181336262 | NA | 212798362 | NA | 245536180 | NA | 207250556 | NA | 175117342 | NA | 209987916 | NA | 155465284 | NA | 120495038 | NA | 165482296 | NA | 187509020 | NA |
Programs Used
- R
- RStudio
Packages
- readr (install.packages('readr'))
- limma (BiocManager::install('limma'))
- DESeq2 (BiocManager::install('DESeq2'))
- dplyr (install.packages("dplyr"))
- ggplot2 (install.packages("ggplot2"))
- gplots (install.packages("gplots"))
- Annotations (BiocManager::install('AnnotationDbi'))
- org.Hs.eg.db (BiocManager::install('org.Hs.eg.db'))
- This is for Human
- org.Mm.eg.db (BiocManager::install('org.Mm.eg.db'))
- This is for Mouse
- ggrepel (install.packages("ggrepel"))
- ReportingTools (BiocManager::install('ReportingTools'))
- GO.db (BiocManager::install('GO.db'))
- GOstats (BiocManager::install('GOstats'))
- pathview (BiocManager::install('pathview'))
- gage (BiocManager::install('gage'))
- gageData (BiocManager::install('gageData'))
- select (BiocManager::install('Select'))
With these, you most certainly will have to step through each and install extra things when you start calling the packages. Take it step by step to ensure that each dependency is installed.
When the count file is completed, we can import it into R and start working with it to determine differentially expressed genes. First we will import it into R
library(limma)
library(DESeq2)
library(dplyr)
library(readr)
countData = read_csv("readCounts.csv", skip = 1)
This gives us our dataframe from out featureCounts program, but if we look at the data we see that featureCounts added some extra information that characterizes each gene_id.
kable(head(countData)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "320px")
Geneid | Chr | Start | End | Strand | Length | SRR2121770Aligned.sortedByCoord.out.bam | SRR2121771Aligned.sortedByCoord.out.bam | SRR2121774Aligned.sortedByCoord.out.bam | SRR2121775Aligned.sortedByCoord.out.bam | SRR2121778Aligned.sortedByCoord.out.bam | SRR2121779Aligned.sortedByCoord.out.bam | SRR2121780Aligned.sortedByCoord.out.bam | SRR2121781Aligned.sortedByCoord.out.bam | SRR2121786Aligned.sortedByCoord.out.bam | SRR2121787Aligned.sortedByCoord.out.bam | SRR2121788Aligned.sortedByCoord.out.bam | SRR2121789Aligned.sortedByCoord.out.bam |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSMUSG00000102693.1 | chr1 | 3073253 | 3074322 | + | 1070 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 0 |
ENSMUSG00000064842.1 | chr1 | 3102016 | 3102125 | + | 110 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSMUSG00000051951.5 | chr1;chr1;chr1;chr1;chr1;chr1;chr1 | 3205901;3206523;3213439;3213609;3214482;3421702;3670552 | 3207317;3207317;3215632;3216344;3216968;3421901;3671498 | -;-;-;-;-;-;- | 6094 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 0 |
ENSMUSG00000102851.1 | chr1 | 3252757 | 3253236 | + | 480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
ENSMUSG00000103377.1 | chr1 | 3365731 | 3368549 | - | 2819 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 6 | 14 | 0 | 0 |
ENSMUSG00000104017.1 | chr1 | 3375556 | 3377788 | - | 2233 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We also need to set out row names to the gene_id. We will do some data frame manipulation and then look at the data again.
countData = as.data.frame(countData)
rownames(countData) = countData$Geneid
countData = countData[,-c(1:6)]
kable(head(countData)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
SRR2121770Aligned.sortedByCoord.out.bam | SRR2121771Aligned.sortedByCoord.out.bam | SRR2121774Aligned.sortedByCoord.out.bam | SRR2121775Aligned.sortedByCoord.out.bam | SRR2121778Aligned.sortedByCoord.out.bam | SRR2121779Aligned.sortedByCoord.out.bam | SRR2121780Aligned.sortedByCoord.out.bam | SRR2121781Aligned.sortedByCoord.out.bam | SRR2121786Aligned.sortedByCoord.out.bam | SRR2121787Aligned.sortedByCoord.out.bam | SRR2121788Aligned.sortedByCoord.out.bam | SRR2121789Aligned.sortedByCoord.out.bam | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSMUSG00000102693.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 0 |
ENSMUSG00000064842.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSMUSG00000051951.5 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 0 |
ENSMUSG00000102851.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
ENSMUSG00000103377.1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 6 | 14 | 0 | 0 |
ENSMUSG00000104017.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
dim(countData)
## [1] 55487 12
kable(summary(countData)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
SRR2121770Aligned.sortedByCoord.out.bam | SRR2121771Aligned.sortedByCoord.out.bam | SRR2121774Aligned.sortedByCoord.out.bam | SRR2121775Aligned.sortedByCoord.out.bam | SRR2121778Aligned.sortedByCoord.out.bam | SRR2121779Aligned.sortedByCoord.out.bam | SRR2121780Aligned.sortedByCoord.out.bam | SRR2121781Aligned.sortedByCoord.out.bam | SRR2121786Aligned.sortedByCoord.out.bam | SRR2121787Aligned.sortedByCoord.out.bam | SRR2121788Aligned.sortedByCoord.out.bam | SRR2121789Aligned.sortedByCoord.out.bam | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | |
1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | |
Median : 0.0 | Median : 0.0 | Median : 0.0 | Median : 0.0 | Median : 0.0 | Median : 0.0 | Median : 0.0 | Median : 0.0 | Median : 3.0 | Median : 4.0 | Median : 0.0 | Median : 1.0 | |
Mean : 273.8 | Mean : 235.7 | Mean : 229.6 | Mean : 269.6 | Mean : 286.6 | Mean : 257.7 | Mean : 193.7 | Mean : 268.2 | Mean : 160.2 | Mean : 119.4 | Mean : 184.3 | Mean : 211.1 | |
3rd Qu.: 25.0 | 3rd Qu.: 20.0 | 3rd Qu.: 15.0 | 3rd Qu.: 24.0 | 3rd Qu.: 25.0 | 3rd Qu.: 24.0 | 3rd Qu.: 17.0 | 3rd Qu.: 26.0 | 3rd Qu.: 21.0 | 3rd Qu.: 21.0 | 3rd Qu.: 16.0 | 3rd Qu.: 18.0 | |
Max. :1085454.0 | Max. :834016.0 | Max. :783768.0 | Max. :1032589.0 | Max. :1233905.0 | Max. :1232421.0 | Max. :805463.0 | Max. :1210716.0 | Max. :641911.0 | Max. :516752.0 | Max. :706363.0 | Max. :756086.0 |
Let's also go ahead and change the names to describe out data a little better.
columns = c('Trp53m_mock_1', 'Trp53m_mock_2', 'Trp53m_4h7Gy_1', 'Trp53m_4h7Gy_2', 'Trp53p_mock_1', 'Trp53p_mock_2', 'Trp53p_mock_3', 'Trp53p_mock_4', 'Trp53p_4h7Gy_1', 'Trp53p_4h7Gy_2', 'Trp53p_4h7Gy_3', 'Trp53p_4h7Gy_4')
colnames(countData) = columns
Here we have to make sure that we convert the +/+ and -/- to characters. These characters
par(mar=c(8,4,4,1)+0.1)
barplot( colSums(countData)/1e6, col="green",las=3,main="Total read counts (millions)", ylab="Total read counts in millions")
hist(countData[,1], br=200, xlab="Number of Reads Counts per Feature", main="Histogram of Read Counts for Trp53-/- Mock")
We can see that our count data is highly skewed to the right. This is a great case for using log transformation!
logCountData = log2(1+countData)
par(mfrow = c(1, 2), mar=c(8,4,4,1)) # two columns
hist(logCountData[,1], main="Histogram of Log Read Counts", xlab="Log transformed counts")
boxplot(logCountData,las=3, main="Boxplot of Log Read Counts")
x <- logCountData
myColors = rainbow(dim(x)[2])
plot(density(x[,1]),col = myColors[1], lwd=2,
xlab="Expresson values", ylab="Density", main= "Distribution of transformed data",
ylim=c(0, max(density(x[,1])$y)+.02 ) )
for( i in 2:dim(x)[2] )
lines(density(x[,i]),col=myColors[i], lwd=2)
legend("topright", cex=1.1,colnames(x), lty=rep(1,dim(x)[2]), col=myColors )
plot(logCountData[,1],logCountData[,2], xlab="Trp53-/- mock replication 1", ylab="Trp53-/- mock replication 2")
We have to make the experiment design into a small dataframe so we can tell DESeq how we want to analyze the data. Here will will make a small table that has the rep names that we changed the column names to previously, and then a column for which columns are Trp53+/+ or Trp53-/-, and which columns were control mice and which columns were treated with ionizing radiation.
detectGroups <- function (x){ # x are col names
tem <- gsub("[0-9]*$","",x) # Remove all numbers from end
#tem = gsub("_Rep|_rep|_REP","",tem)
tem <- gsub("_$","",tem); # remove "_" from end
tem <- gsub("_Rep$","",tem); # remove "_Rep" from end
tem <- gsub("_rep$","",tem); # remove "_rep" from end
tem <- gsub("_REP$","",tem) # remove "_REP" from end
return( tem )
}
groups = as.character ( detectGroups( colnames( countData ) ) )
groups
## [1] "Trp53m_mock" "Trp53m_mock" "Trp53m_4h7Gy" "Trp53m_4h7Gy" "Trp53p_mock"
## [6] "Trp53p_mock" "Trp53p_mock" "Trp53p_mock" "Trp53p_4h7Gy" "Trp53p_4h7Gy"
## [11] "Trp53p_4h7Gy" "Trp53p_4h7Gy"
p53 = c("m", "m", "m", "m",
"p", "p", "p", "p", "p", "p", "p", "p")
treatment = c("control", "control", "IR", "IR", "control", "control", "control", "control",
"IR", "IR", "IR", "IR")
colData = cbind(colnames(countData), p53 )
colData
## p53
## [1,] "Trp53m_mock_1" "m"
## [2,] "Trp53m_mock_2" "m"
## [3,] "Trp53m_4h7Gy_1" "m"
## [4,] "Trp53m_4h7Gy_2" "m"
## [5,] "Trp53p_mock_1" "p"
## [6,] "Trp53p_mock_2" "p"
## [7,] "Trp53p_mock_3" "p"
## [8,] "Trp53p_mock_4" "p"
## [9,] "Trp53p_4h7Gy_1" "p"
## [10,] "Trp53p_4h7Gy_2" "p"
## [11,] "Trp53p_4h7Gy_3" "p"
## [12,] "Trp53p_4h7Gy_4" "p"
colData = as.data.frame(cbind(colnames(countData), p53, treatment))
colData
## V1 p53 treatment
## 1 Trp53m_mock_1 m control
## 2 Trp53m_mock_2 m control
## 3 Trp53m_4h7Gy_1 m IR
## 4 Trp53m_4h7Gy_2 m IR
## 5 Trp53p_mock_1 p control
## 6 Trp53p_mock_2 p control
## 7 Trp53p_mock_3 p control
## 8 Trp53p_mock_4 p control
## 9 Trp53p_4h7Gy_1 p IR
## 10 Trp53p_4h7Gy_2 p IR
## 11 Trp53p_4h7Gy_3 p IR
## 12 Trp53p_4h7Gy_4 p IR
str(colData)
## 'data.frame': 12 obs. of 3 variables:
## $ V1 : Factor w/ 12 levels "Trp53m_4h7Gy_1",..: 3 4 1 2 9 10 11 12 5 6 ...
## $ p53 : Factor w/ 2 levels "m","p": 1 1 1 1 2 2 2 2 2 2 ...
## $ treatment: Factor w/ 2 levels "control","IR": 1 1 2 2 1 1 1 1 2 2 ...
Creating a DESeq Dataset
dds = DESeqDataSetFromMatrix(countData=countData,
colData=colData,
design= ~ p53+treatment+p53*treatment) # note that the study design is changed.
dds = DESeq(dds) # main function
nrow(dds)
## [1] 55487
Filtering: we will only keep rows that have a sum count between all samples greater than 5. This will remove most of the genes that mostly have "0" counts.
dds <- dds[ rowSums(counts(dds)) > 5, ]
nrow(dds)
## [1] 34139
Regularized log transformation - used for clustering
rld <- rlog(dds, blind = FALSE)
kable(head(assay(rld), 6)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
Trp53m_mock_1 | Trp53m_mock_2 | Trp53m_4h7Gy_1 | Trp53m_4h7Gy_2 | Trp53p_mock_1 | Trp53p_mock_2 | Trp53p_mock_3 | Trp53p_mock_4 | Trp53p_4h7Gy_1 | Trp53p_4h7Gy_2 | Trp53p_4h7Gy_3 | Trp53p_4h7Gy_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSMUSG00000102693.1 | -0.7873810 | -0.7869647 | -0.7864123 | -0.7872251 | -0.7873917 | -0.7872901 | -0.7863979 | -0.7874606 | -0.6313032 | -0.7027829 | -0.7854203 | -0.7864622 |
ENSMUSG00000051951.5 | -0.1689760 | -0.1684855 | -0.1678308 | -0.0729011 | -0.1689887 | -0.1688691 | -0.1678136 | -0.1690696 | -0.0298983 | -0.0789877 | -0.1675075 | -0.1678901 |
ENSMUSG00000103377.1 | 0.9367497 | 0.9385253 | 0.9408618 | 1.0007595 | 0.9367037 | 0.9371385 | 0.9409228 | 0.9364087 | 1.1895736 | 1.5284052 | 0.9420021 | 0.9406517 |
ENSMUSG00000103201.1 | -0.0694969 | -0.0689413 | -0.0682009 | -0.0692892 | -0.0695112 | -0.0693758 | -0.0681815 | -0.0696029 | 0.0406074 | 0.1775249 | -0.0678357 | -0.0682680 |
ENSMUSG00000102592.1 | 0.6079270 | 0.6090832 | 0.6106138 | 0.6083601 | 0.6078971 | 0.6081797 | 0.6106538 | 0.6077056 | 0.9928379 | 0.8039315 | 0.6113644 | 0.6829781 |
ENSMUSG00000025900.12 | 1.7329730 | 1.7166717 | 1.6536060 | 1.6483113 | 1.6766519 | 1.6478854 | 1.6536997 | 1.7812828 | 2.0563917 | 2.2880207 | 1.6553593 | 1.7301108 |
Variance Stabilizing Transformation
Interactions cause a difference between the lfc betwen pooled data, e.g. p53+/+ (control and IR) and p53-/- (control and IR)
vsd <- vst(dds, blind = FALSE)
kable(head(assay(vsd), 6)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
Trp53m_mock_1 | Trp53m_mock_2 | Trp53m_4h7Gy_1 | Trp53m_4h7Gy_2 | Trp53p_mock_1 | Trp53p_mock_2 | Trp53p_mock_3 | Trp53p_mock_4 | Trp53p_4h7Gy_1 | Trp53p_4h7Gy_2 | Trp53p_4h7Gy_3 | Trp53p_4h7Gy_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSMUSG00000102693.1 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.784352 | 6.714627 | 6.433022 | 6.433022 |
ENSMUSG00000051951.5 | 6.433022 | 6.433022 | 6.433022 | 6.715020 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.784352 | 6.714627 | 6.433022 | 6.433022 |
ENSMUSG00000103377.1 | 6.433022 | 6.433022 | 6.433022 | 6.632583 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.862784 | 7.171182 | 6.433022 | 6.433022 |
ENSMUSG00000103201.1 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.737470 | 6.919246 | 6.433022 | 6.433022 |
ENSMUSG00000102592.1 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 6.433022 | 7.013160 | 6.830644 | 6.433022 | 6.662579 |
ENSMUSG00000025900.12 | 6.669513 | 6.643045 | 6.433022 | 6.433022 | 6.569347 | 6.433022 | 6.433022 | 6.732883 | 7.013160 | 7.220955 | 6.433022 | 6.662579 |
For the log2 approach, we need to first estimate size factors to account for sequencing depth, and then specify normalized=TRUE. Sequencing depth correction is done automatically for the rlog and the vst.
Size Factor
dds <- estimateSizeFactors(dds)
kable(sizeFactors(dds)) %>%
kable_styling() %>%
scroll_box(width = "300px", height = "520px")
x | |
---|---|
Trp53m_mock_1 | 1.2892566 |
Trp53m_mock_2 | 1.0903014 |
Trp53m_4h7Gy_1 | 0.8973654 |
Trp53m_4h7Gy_2 | 1.2078255 |
Trp53p_mock_1 | 1.2952327 |
Trp53p_mock_2 | 1.2406213 |
Trp53p_mock_3 | 0.8931001 |
Trp53p_mock_4 | 1.3347016 |
Trp53p_4h7Gy_1 | 0.7767888 |
Trp53p_4h7Gy_2 | 0.6056051 |
Trp53p_4h7Gy_3 | 0.8227361 |
Trp53p_4h7Gy_4 | 0.9123249 |
We will first look at the log transformed data
slog <- log2(counts(dds, normalized=TRUE)+1)
kable(head(slog)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
Trp53m_mock_1 | Trp53m_mock_2 | Trp53m_4h7Gy_1 | Trp53m_4h7Gy_2 | Trp53p_mock_1 | Trp53p_mock_2 | Trp53p_mock_3 | Trp53p_mock_4 | Trp53p_4h7Gy_1 | Trp53p_4h7Gy_2 | Trp53p_4h7Gy_3 | Trp53p_4h7Gy_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSMUSG00000102693.1 | 0.000000 | 0.000000 | 0 | 0.000000 | 0.0000000 | 0 | 0 | 0.000000 | 2.620447 | 2.105169 | 0 | 0.000000 |
ENSMUSG00000051951.5 | 0.000000 | 0.000000 | 0 | 2.108269 | 0.0000000 | 0 | 0 | 0.000000 | 2.620447 | 2.105169 | 0 | 0.000000 |
ENSMUSG00000103377.1 | 0.000000 | 0.000000 | 0 | 1.409184 | 0.0000000 | 0 | 0 | 0.000000 | 3.125008 | 4.592001 | 0 | 0.000000 |
ENSMUSG00000103201.1 | 0.000000 | 0.000000 | 0 | 0.000000 | 0.0000000 | 0 | 0 | 0.000000 | 2.281566 | 3.447241 | 0 | 0.000000 |
ENSMUSG00000102592.1 | 0.000000 | 0.000000 | 0 | 0.000000 | 0.0000000 | 0 | 0 | 0.000000 | 3.922280 | 2.926941 | 0 | 1.674552 |
ENSMUSG00000025900.12 | 1.734188 | 1.503021 | 0 | 0.000000 | 0.8254291 | 0 | 0 | 2.246759 | 3.922280 | 4.777149 | 0 | 1.674552 |
par(mfrow = c(1, 3)) # 3 columns
plot(slog[,1],slog[,2])
plot(assay(rld)[,1],assay(rld)[,2])
plot(assay(vsd)[,1],assay(vsd)[,2])
As the log transformation constant increases, the information of the data is lost.
par(mfrow = c(1, 3)) # 3 columns
slog <- log2(counts(dds, normalized=TRUE)+1)
plot(slog[,1],slog[,2])
slog <- log2(counts(dds, normalized=TRUE)+4)
plot(slog[,1],slog[,2], xlim=c(0,20))
slog <- log2(counts(dds, normalized=TRUE)+20)
plot(slog[,1],slog[,2], xlim=c(0,20))
library("dplyr")
library("ggplot2")
df <- bind_rows(
as_data_frame(slog[,1:2]) %>%
mutate(transformation = "log2(x + 1)"),
as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"),
as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"))
## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
colnames(df)[1:2] <- c("x", "y")
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
coord_fixed() + facet_grid( . ~ transformation)
PCA plot
plotPCA(rld, intgroup = c("p53", "treatment")) + theme(aspect.ratio=1)
A prettier PCA plot created with GGPlot2
pca.object <- prcomp(t(assay(rld))) # PCA
pcaData = as.data.frame(pca.object$x[,1:2]);
pcaData = cbind(pcaData,detectGroups(colnames(assay(rld)) ))
colnames(pcaData) = c("PC1", "PC2", "Type")
percentVar=round(100*summary(pca.object)$importance[2,1:2],0)
#plot
p=ggplot(pcaData, aes(PC1, PC2, color=Type, shape = Type)) + geom_point(size=5)
p=p+xlab(paste0("PC1: ",percentVar[1],"% variance"))
p=p+ylab(paste0("PC2: ",percentVar[2],"% variance"))
p=p+ggtitle("Principal component analysis (PCA)")+coord_fixed(ratio=1.0)+
theme(plot.title = element_text(size = 16,hjust = 0.5)) + theme(aspect.ratio=1) +
theme(axis.text.x = element_text( size = 16),
axis.text.y = element_text( size = 16),
axis.title.x = element_text( size = 16),
axis.title.y = element_text( size = 16) ) +
theme(legend.text=element_text(size=16))
print(p)
Multidimensional Scaling Plot
dist2 <- function(x, ...) # distance function = 1-PCC (Pearson's correlation coefficient)
as.dist(1-cor(t(x), method="pearson"))
fit = cmdscale( dist2(t(assay(rld))) , eig=T, k=2)
mdsData <- as.data.frame(fit$points[,1:2]);
mdsData <- cbind(mdsData,detectGroups(colnames(assay(rld))) )
colnames(mdsData) = c("x1", "x2", "Type")
p<-ggplot(mdsData, aes(x1, x2, color=Type, shape = Type)) + geom_point(size=5)
p=p+xlab("Dimension 1")
p=p+ylab("Dimension 2")
p=p+ggtitle("Multidimensional scaling (MDS)")+ coord_fixed(ratio=1.)+
theme(plot.title = element_text(hjust = 0.5)) + theme(aspect.ratio=1) +
theme(axis.text.x = element_text( size = 16),
axis.text.y = element_text( size = 16),
axis.title.x = element_text( size = 16),
axis.title.y = element_text( size = 16) ) +
theme(legend.text=element_text(size=16))
print(p)
Creating a heatmap
library(gplots)
hclust2 <- function(x, method="average", ...) # average linkage in hierarchical clustering
hclust(x, method=method, ...)
n=100 # number of top genes by standard deviation
x = assay(rld)
if(n>dim(x)[1]) n = dim(x)[1] # max as data
x = x[order(apply(x,1,sd),decreasing=TRUE),] # sort genes by standard deviation
x = x[1:n,] # only keep the n genes
# this will cutoff very large values, which could skew the color
x=as.matrix(x[1:n,])-apply(x[1:n,],1,mean)
cutoff = median(unlist(x)) + 4*sd (unlist(x))
x[x>cutoff] <- cutoff
cutoff = median(unlist(x)) - 4*sd (unlist(x))
x[x< cutoff] <- cutoff
groups = detectGroups(colnames(x) )
groups.colors = rainbow(length(unique(groups) ) )
lmat = rbind(c(5,4),c(0,1),c(3,2))
lwid = c(1.5,4)
lhei = c(1,.2,4)
heatmap.2(x, distfun = dist2,hclustfun=hclust2,
col=greenred(75), density.info="none", trace="none", scale="none", keysize=.5
,key=T, symkey=F
,ColSideColors=groups.colors[ as.factor(groups)]
,margins=c(8,12)
,cexRow=1
,srtCol=45
,cexCol=1. # size of font for sample names
,lmat = lmat, lwid = lwid, lhei = lhei
)
dds <- DESeq(dds)
res <- results(dds)
kable(head(res)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
---|---|---|---|---|---|---|
ENSMUSG00000102693.1 | 0.7043239 | 3.5623773 | 6.556337 | 0.5433487 | 0.5868897 | NA |
ENSMUSG00000051951.5 | 0.9803019 | 0.3244198 | 6.442842 | 0.0503535 | 0.9598407 | NA |
ENSMUSG00000103377.1 | 2.7081125 | 3.1757639 | 5.419836 | 0.5859520 | 0.5579078 | NA |
ENSMUSG00000103201.1 | 1.1474583 | 4.2647949 | 6.546806 | 0.6514314 | 0.5147681 | NA |
ENSMUSG00000102592.1 | 1.9131690 | 5.0084508 | 5.437877 | 0.9210305 | 0.3570345 | NA |
ENSMUSG00000025900.12 | 4.2877013 | 6.7470978 | 3.112343 | 2.1678514 | 0.0301700 | 0.1442115 |
DESeq2 uses the Benjamini-Hochberg (BH) adjustment (Benjamini and Hochberg 1995) as implemented in the base R p.adjust function
res <- results(dds, alpha = 0.5, lfcThreshold=0.01)
summary(res)
##
## out of 34139 with nonzero total read count
## adjusted p-value < 0.5
## LFC > 0.01 (up) : 5794, 17%
## LFC < -0.01 (down) : 4094, 12%
## outliers [1] : 3, 0.0088%
## low counts [2] : 8605, 25%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Now lets sort genes by fold change
res <- res[order(abs( res$log2FoldChange), decreasing=TRUE),]
kable(head(res)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
---|---|---|---|---|---|---|
ENSMUSG00000107277.1 | 178.134530 | 12.82857 | 1.906777 | 6.722639 | 0.0000000 | 0.0000000 |
ENSMUSG00000036281.13 | 17.898231 | 11.90953 | 1.997989 | 5.955753 | 0.0000000 | 0.0000002 |
ENSMUSG00000034189.5 | 21.363195 | 11.19200 | 2.064077 | 5.417433 | 0.0000001 | 0.0000035 |
ENSMUSG00000111241.1 | 26.215073 | 10.91199 | 1.885793 | 5.781118 | 0.0000000 | 0.0000005 |
ENSMUSG00000027479.14 | 4.844616 | 10.82957 | 3.264449 | 3.314364 | 0.0009185 | 0.0130936 |
ENSMUSG00000014932.15 | 77.883535 | 10.65416 | 1.686891 | 6.309929 | 0.0000000 | 0.0000000 |
MA Plot
Show the significant genes. The lower the average read counts for all samples and the higher the variation between the samples, the less significant those genes are.
DESeq2::plotMA(res, ylim = c(-5, 5))
Volcano plot
library(dplyr)
res1 = as.data.frame(res)
# add a new column using the mutate function in dplyr
res1 = mutate(res1, sig=ifelse(res1$padj<0.05, "FDR<0.05", "Not Sig"))
res1[which(abs(res1$log2FoldChange)<0.5),'sig'] <- "Not Sig"
p = ggplot(res1, aes(log2FoldChange, -log10(padj))) +
geom_point(aes(col=sig)) +
scale_color_manual(values=c("red", "black"))
p
## Warning: Removed 8608 rows containing missing values (geom_point).
Plot counts of top gene
topGene <- rownames(res)[1]
plotCounts(dds, gene = topGene, intgroup=c("p53", "treatment"))
Here we see an interesting point or our normalized counts under the Trp53p_4h7Gy group that seems to be extremely high, while the other 3 replicates are around 0.5. I cannot get this portion to work, however. I get an error stating that None of the keys entered are valid keys for 'SYMBOL'.
Let's look at the keys that we have to work with for our res file from DESeq2
head(row.names(res))
## [1] "ENSMUSG00000107277.1" "ENSMUSG00000036281.13" "ENSMUSG00000034189.5"
## [4] "ENSMUSG00000111241.1" "ENSMUSG00000027479.14" "ENSMUSG00000014932.15"
Now we need to find the same key in the Mm database.
library(AnnotationDbi)
library(org.Mm.eg.db)
columns(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GO" "GOALL" "IPI" "MGI" "ONTOLOGY"
## [16] "ONTOLOGYALL" "PATH" "PFAM" "PMID" "PROSITE"
## [21] "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"
#key = gsub("\\..*","", row.names(res))
res$symbol <- gsub("\\..*","", row.names(res))
#res$symbol <- gsub(" ","",row.names(res))
message("Ensembl IDs")
## Ensembl IDs
key.en = keys(org.Mm.eg.db, keytype="ENSEMBL")
head(key.en)
## [1] "ENSMUSG00000030359" "ENSMUSG00000020804" "ENSMUSG00000025375"
## [4] "ENSMUSG00000015243" "ENSMUSG00000028125" "ENSMUSG00000026944"
cat("\n\n")
message("SYMBOL names")
## SYMBOL names
key.sy = keys(org.Mm.eg.db, keytype="SYMBOL")
head(key.sy)
## [1] "Pzp" "Aanat" "Aatk" "Abca1" "Abca4" "Abca2"
These are ENSEMBL symbols, so we need to designate that when looking for the genes that we have.
res$ensembl <- gsub("\\..*","", row.names(res))
res$entrez <- mapIds(org.Mm.eg.db,
keys= res$ensembl,
column="ENTREZID",
keytype="ENSEMBL", #Out ID is ENSMBL
multiVals="first")
## 'select()' returned 1:many mapping between keys and columns
res$symbol <- mapIds(org.Mm.eg.db,
keys= res$ensembl,
column="SYMBOL",
keytype="ENSEMBL", #Out ID is ENSMBL
multiVals="first")
## 'select()' returned 1:many mapping between keys and columns
write.csv(res, file = "results.csv")
kable(head(res)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | symbol | ensembl | entrez | |
---|---|---|---|---|---|---|---|---|---|
ENSMUSG00000107277.1 | 178.134530 | 12.82857 | 1.906777 | 6.722639 | 0.0000000 | 0.0000000 | NA | ENSMUSG00000107277 | NA |
ENSMUSG00000036281.13 | 17.898231 | 11.90953 | 1.997989 | 5.955753 | 0.0000000 | 0.0000002 | Snapc4 | ENSMUSG00000036281 | 227644 |
ENSMUSG00000034189.5 | 21.363195 | 11.19200 | 2.064077 | 5.417433 | 0.0000001 | 0.0000035 | Hsdl1 | ENSMUSG00000034189 | 72552 |
ENSMUSG00000111241.1 | 26.215073 | 10.91199 | 1.885793 | 5.781118 | 0.0000000 | 0.0000005 | NA | ENSMUSG00000111241 | NA |
ENSMUSG00000027479.14 | 4.844616 | 10.82957 | 3.264449 | 3.314364 | 0.0009185 | 0.0130936 | Mapre1 | ENSMUSG00000027479 | 13589 |
ENSMUSG00000014932.15 | 77.883535 | 10.65416 | 1.686891 | 6.309929 | 0.0000000 | 0.0000000 | Yes1 | ENSMUSG00000014932 | 22612 |
Let's make a file with just the genes with an adjusted p-value < 0.5
resSig = as.data.frame(subset(res,padj<0.5) )
resSig = resSig[order(resSig$log2FoldChange,decreasing=TRUE),]
head(resSig)
## baseMean log2FoldChange lfcSE stat pvalue
## ENSMUSG00000107277.1 178.134530 12.82857 1.906777 6.722639 1.784620e-11
## ENSMUSG00000036281.13 17.898231 11.90953 1.997989 5.955753 2.588774e-09
## ENSMUSG00000034189.5 21.363195 11.19200 2.064077 5.417433 6.046073e-08
## ENSMUSG00000111241.1 26.215073 10.91199 1.885793 5.781118 7.420598e-09
## ENSMUSG00000027479.14 4.844616 10.82957 3.264449 3.314364 9.185180e-04
## ENSMUSG00000014932.15 77.883535 10.65416 1.686891 6.309929 2.791627e-10
## padj symbol ensembl entrez
## ENSMUSG00000107277.1 2.190535e-09 <NA> ENSMUSG00000107277 <NA>
## ENSMUSG00000036281.13 2.052608e-07 Snapc4 ENSMUSG00000036281 227644
## ENSMUSG00000034189.5 3.540420e-06 Hsdl1 ENSMUSG00000034189 72552
## ENSMUSG00000111241.1 5.248069e-07 <NA> ENSMUSG00000111241 <NA>
## ENSMUSG00000027479.14 1.309363e-02 Mapre1 ENSMUSG00000027479 13589
## ENSMUSG00000014932.15 2.795021e-08 Yes1 ENSMUSG00000014932 22612
write.csv(resSig,"SigGenes.csv")
Here is a volcano plot that shows the symbol that we created at each point.
library(dplyr)
res1 = as.data.frame(res)
# add a new column using the mutate function in dplyr
res1 = mutate(res1, sig=ifelse(res1$padj<0.5, "FDR<0.05", "Not Sig"))
res1[which(abs(res1$log2FoldChange)<1),'sig'] <- "Not Sig"
p = ggplot(res1, aes(log2FoldChange, -log10(pvalue))) +
geom_point(aes(col=sig)) +
scale_color_manual(values=c("red", "black"))
p+geom_text(data=filter(res1, padj<1e-50), aes(label=symbol))
## Warning: Removed 7357 rows containing missing values (geom_point).
Here we will do a GO Enrichment analysis for genes that have a decreased fold-change of 5 or more
library(GO.db)
library(GOstats)
selectedGenes = unique(resSig[resSig$log2FoldChange>5,'entrez']) # upregulated genes
universeGenes = unique( mapIds(org.Mm.eg.db,
keys= res$ensembl,
column="ENTREZID",
keytype="ENSEMBL", #Out ID is ENSMBL
multiVals="first")
)
hgCutoff <- 0.001
params <- new("GOHyperGParams",
geneIds=selectedGenes,
universeGeneIds=universeGenes,
annotation="org.Mm.eg.db",
ontology="BP",
pvalueCutoff=hgCutoff,
conditional=FALSE,
testDirection="over")
hgOver <- hyperGTest(params)
summary(hgOver)[1:10,]
## GOBPID Pvalue OddsRatio ExpCount Count Size
## 1 GO:0019236 2.643566e-07 5.062236 4.423515 18 78
## 2 GO:0016049 1.730359e-05 2.028026 26.144107 49 461
## 3 GO:0048588 2.580339e-05 2.442482 13.951085 31 246
## 4 GO:1990138 3.035698e-05 2.725414 10.208111 25 180
## 5 GO:0050789 3.546769e-05 1.302786 562.863914 624 9925
## 6 GO:0001558 5.334949e-05 2.045100 22.174286 42 391
## 7 GO:0032879 5.962358e-05 1.395721 147.223650 191 2596
## 8 GO:0060560 1.073049e-04 2.275476 14.348068 30 253
## 9 GO:0061564 1.204488e-04 1.884894 26.711225 47 471
## 10 GO:0050794 1.250445e-04 1.272536 527.135525 584 9295
## Term
## 1 response to pheromone
## 2 cell growth
## 3 developmental cell growth
## 4 neuron projection extension
## 5 regulation of biological process
## 6 regulation of cell growth
## 7 regulation of localization
## 8 developmental growth involved in morphogenesis
## 9 axon development
## 10 regulation of cellular process
summary(hgOver)[1:10,c("GOBPID","Pvalue","Term")]
## GOBPID Pvalue Term
## 1 GO:0019236 2.643566e-07 response to pheromone
## 2 GO:0016049 1.730359e-05 cell growth
## 3 GO:0048588 2.580339e-05 developmental cell growth
## 4 GO:1990138 3.035698e-05 neuron projection extension
## 5 GO:0050789 3.546769e-05 regulation of biological process
## 6 GO:0001558 5.334949e-05 regulation of cell growth
## 7 GO:0032879 5.962358e-05 regulation of localization
## 8 GO:0060560 1.073049e-04 developmental growth involved in morphogenesis
## 9 GO:0061564 1.204488e-04 axon development
## 10 GO:0050794 1.250445e-04 regulation of cellular process
params1 <- params
ontology(params1) <- "CC"
hgOver <- hyperGTest(params1)
summary(hgOver)[1:10,c("GOCCID","Pvalue","Term")]
## GOCCID Pvalue Term
## 1 GO:0031226 2.194852e-10 intrinsic component of plasma membrane
## 2 GO:0005887 2.497976e-10 integral component of plasma membrane
## 3 GO:0030424 3.300807e-07 axon
## 4 GO:0098978 7.089583e-07 glutamatergic synapse
## 5 GO:0033267 1.725126e-06 axon part
## 6 GO:0044459 1.872288e-06 plasma membrane part
## 7 GO:0099055 3.042517e-06 integral component of postsynaptic membrane
## 8 GO:0099699 4.152767e-06 integral component of synaptic membrane
## 9 GO:0045211 6.111879e-06 postsynaptic membrane
## 10 GO:0016021 7.169574e-06 integral component of membrane
params1 <- params
ontology(params1) <- "MF"
hgOver <- hyperGTest(params1)
summary(hgOver)[1:10,c("GOMFID","Pvalue","Term")]
## GOMFID Pvalue Term
## 1 GO:0016503 4.685420e-07 pheromone receptor activity
## 2 GO:0005550 1.234794e-05 pheromone binding
## 3 GO:0005549 2.293309e-05 odorant binding
## 4 GO:0004888 3.280256e-05 transmembrane signaling receptor activity
## 5 GO:0038023 5.222020e-05 signaling receptor activity
## 6 GO:0060089 8.100909e-05 molecular transducer activity
## 7 GO:0005488 6.931421e-04 binding
## 8 GO:0015267 8.237763e-04 channel activity
## 9 GO:0022803 8.237763e-04 passive transmembrane transporter activity
## 10 GO:0046873 8.631819e-04 metal ion transmembrane transporter activity
Next we will have a look at the genes that are upregulated by a fold-change of 5 or greater.
selectedGenes = unique(resSig[resSig$log2FoldChange<5,'entrez']) # upregulated genes
params <- new("GOHyperGParams",
geneIds=selectedGenes,
universeGeneIds=universeGenes,
annotation="org.Mm.eg.db",
ontology="BP",
pvalueCutoff=hgCutoff,
conditional=FALSE,
testDirection="over")
hgOver <- hyperGTest(params)
summary(hgOver)[1:10,c("GOBPID","Pvalue","Term")]
## GOBPID Pvalue Term
## 1 GO:0016043 9.317561e-12 cellular component organization
## 2 GO:0008152 1.757957e-11 metabolic process
## 3 GO:0071840 4.299006e-11 cellular component organization or biogenesis
## 4 GO:0044237 7.869800e-11 cellular metabolic process
## 5 GO:0044238 1.013489e-09 primary metabolic process
## 6 GO:0006807 1.956109e-09 nitrogen compound metabolic process
## 7 GO:0071704 2.363963e-09 organic substance metabolic process
## 8 GO:0006996 2.779804e-09 organelle organization
## 9 GO:0033036 1.835538e-08 macromolecule localization
## 10 GO:0051179 2.179269e-08 localization
# bioconductor packages
# source("https://bioconductor.org/biocLite.R");
# biocLite(c("pathview","gage","gageData"))
library(pathview)
library(gage)
foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)
## <NA> 227644 72552 <NA> 13589 22612
## 12.82857 11.90953 11.19200 10.91199 10.82957 10.65416
library(gageData)
data(go.sets.mm)
data(go.subs.mm)
gobpsets = go.sets.mm[go.subs.mm$BP]
gobpres = gage(foldchanges, gsets=gobpsets, same.dir=TRUE)
#lapply(gobpres, head)
message("Greater")
## Greater
kable(head(gobpres$greater)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
p.geomean | stat.mean | p.val | q.val | set.size | exp1 | |
---|---|---|---|---|---|---|
GO:0019236 response to pheromone | 0.0000000 | 7.637046 | 0.0000000 | 0.0000000 | 76 | 0.0000000 |
GO:0007420 brain development | 0.0012778 | 3.026272 | 0.0012778 | 0.7925518 | 400 | 0.0012778 |
GO:0045664 regulation of neuron differentiation | 0.0021154 | 2.869812 | 0.0021154 | 0.7925518 | 355 | 0.0021154 |
GO:0007156 homophilic cell adhesion | 0.0023079 | 2.879945 | 0.0023079 | 0.7925518 | 70 | 0.0023079 |
GO:0001944 vasculature development | 0.0030643 | 2.746850 | 0.0030643 | 0.7925518 | 487 | 0.0030643 |
GO:0033555 multicellular organismal response to stress | 0.0038427 | 2.719970 | 0.0038427 | 0.7925518 | 53 | 0.0038427 |
message("Less")
## Less
kable(head(gobpres$less)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
p.geomean | stat.mean | p.val | q.val | set.size | exp1 | |
---|---|---|---|---|---|---|
GO:0007283 spermatogenesis | 0.0094832 | -2.352919 | 0.0094832 | 0.992104 | 292 | 0.0094832 |
GO:0048232 male gamete generation | 0.0100757 | -2.330049 | 0.0100757 | 0.992104 | 293 | 0.0100757 |
GO:0007266 Rho protein signal transduction | 0.0180131 | -2.109476 | 0.0180131 | 0.992104 | 117 | 0.0180131 |
GO:0007094 mitotic cell cycle spindle assembly checkpoint | 0.0263331 | -2.065183 | 0.0263331 | 0.992104 | 14 | 0.0263331 |
GO:0016441 posttranscriptional gene silencing | 0.0274536 | -1.957280 | 0.0274536 | 0.992104 | 35 | 0.0274536 |
GO:0007276 gamete generation | 0.0283921 | -1.907876 | 0.0283921 | 0.992104 | 388 | 0.0283921 |
message("Stats")
## Stats
kable(head(gobpres$stats)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
stat.mean | exp1 | |
---|---|---|
GO:0019236 response to pheromone | 7.637046 | 7.637046 |
GO:0007420 brain development | 3.026272 | 3.026272 |
GO:0045664 regulation of neuron differentiation | 2.869812 | 2.869812 |
GO:0007156 homophilic cell adhesion | 2.879945 | 2.879945 |
GO:0001944 vasculature development | 2.746850 | 2.746850 |
GO:0033555 multicellular organismal response to stress | 2.719970 | 2.719970 |
library(gageData)
data(kegg.sets.mm)
data(sigmet.idx.mm)
kegg.sets.mm = kegg.sets.mm[sigmet.idx.mm]
#head(kegg.sets.mm, 3)
message("Greater")
## Greater
kable(head(kegg.sets.mm$greater)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
message("Less")
## Less
kable(head(kegg.sets.mm$less)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
message("Stats")
## Stats
kable(head(kegg.sets.mm$stats)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
# Get the results
keggres = gage(foldchanges, gsets=kegg.sets.mm, same.dir=TRUE)
# Look at both up (greater), down (less), and statatistics.
#lapply(keggres, head, n=10)
message("Greater")
## Greater
kable(head(keggres$greater)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
p.geomean | stat.mean | p.val | q.val | set.size | exp1 | |
---|---|---|---|---|---|---|
mmu04360 Axon guidance | 0.0285970 | 1.910570 | 0.0285970 | 0.8131848 | 128 | 0.0285970 |
mmu03410 Base excision repair | 0.0355168 | 1.842982 | 0.0355168 | 0.8131848 | 28 | 0.0355168 |
mmu00591 Linoleic acid metabolism | 0.0537749 | 1.628039 | 0.0537749 | 0.8131848 | 40 | 0.0537749 |
mmu04974 Protein digestion and absorption | 0.0596166 | 1.567001 | 0.0596166 | 0.8131848 | 76 | 0.0596166 |
mmu01040 Biosynthesis of unsaturated fatty acids | 0.0662476 | 1.531834 | 0.0662476 | 0.8131848 | 24 | 0.0662476 |
mmu03022 Basal transcription factors | 0.0674085 | 1.514546 | 0.0674085 | 0.8131848 | 33 | 0.0674085 |
message("Less")
## Less
kable(head(keggres$less)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
p.geomean | stat.mean | p.val | q.val | set.size | exp1 | |
---|---|---|---|---|---|---|
mmu03020 RNA polymerase | 0.0095316 | -2.445444 | 0.0095316 | 0.9111323 | 23 | 0.0095316 |
mmu03450 Non-homologous end-joining | 0.0341877 | -1.999438 | 0.0341877 | 0.9111323 | 11 | 0.0341877 |
mmu00190 Oxidative phosphorylation | 0.0669600 | -1.505167 | 0.0669600 | 0.9111323 | 99 | 0.0669600 |
mmu03420 Nucleotide excision repair | 0.0747360 | -1.456591 | 0.0747360 | 0.9111323 | 39 | 0.0747360 |
mmu00563 Glycosylphosphatidylinositol(GPI)-anchor biosynthesis | 0.0757982 | -1.464592 | 0.0757982 | 0.9111323 | 25 | 0.0757982 |
mmu00240 Pyrimidine metabolism | 0.1116636 | -1.222140 | 0.1116636 | 0.9111323 | 87 | 0.1116636 |
message("Stats")
## Stats
kable(head(keggres$stats)) %>%
kable_styling() %>%
scroll_box(width = "1000px", height = "300px")
stat.mean | exp1 | |
---|---|---|
mmu04360 Axon guidance | 1.910570 | 1.910570 |
mmu03410 Base excision repair | 1.842982 | 1.842982 |
mmu00591 Linoleic acid metabolism | 1.628039 | 1.628039 |
mmu04974 Protein digestion and absorption | 1.567001 | 1.567001 |
mmu01040 Biosynthesis of unsaturated fatty acids | 1.531834 | 1.531834 |
mmu03022 Basal transcription factors | 1.514546 | 1.514546 |
# Get the pathways
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number()<=5) %>%
.$id %>%
as.character()
keggrespathways
## [1] "mmu03020 RNA polymerase"
## [2] "mmu03450 Non-homologous end-joining"
## [3] "mmu00190 Oxidative phosphorylation"
## [4] "mmu03420 Nucleotide excision repair"
## [5] "mmu00563 Glycosylphosphatidylinositol(GPI)-anchor biosynthesis"
# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids
## [1] "mmu03020" "mmu03450" "mmu00190" "mmu03420" "mmu00563"
# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mmu", new.signature=FALSE)
# plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mmu"))
#De novo Assembly {#denovo-assembly}
AT THIS TIME, MINICONDA3 WAS INSTALLED SO FURTHER TOOLS ARE ALL INSTALLED UNDER THIS
Having the genome for RNA-sequencing analysis is very useful, but sometimes it is not available so de novo assembly is used. de novo assembly is using the reads from sequencing to create longer seqeunces called contigs (contiguous sequences). These sequences are then compared to a protein database to get an idea of what proteins are possibly present. The raw sequences are also mapped back to the contigs, treating the contigs as the "genome", much like we do when we have the genome from the host available.
Lets jump right into it. We are going to be using Trinity for the de novo assembly and analysis because there are many tools built into the Trinity tool that allow for the building of contigs, counting of sequences, and even DeSeq/edgeR analyses.
To install Trinity, all I did was ran conda install -c bioconda trinity. This installs all of the previously mentioned tools in one go. In addition to Trinity, BLAST (conda install -c bioconda blast) and RSEM (conda install -c bioconda rsem) were installed.
Trinity --seqType fq \
--left trimmedReads/SRR2121770_Trimmed_1P.fq.gz,trimmedReads/SRR2121771_Trimmed_1P.fq.gz,trimmedReads/SRR2121774_Trimmed_1P.fq.gz,trimmedReads/SRR2121775_Trimmed_1P.fq.gz,trimmedReads/SRR2121778_Trimmed_1P.fq.gz,trimmedReads/SRR2121779_Trimmed_1P.fq.gz,trimmedReads/SRR2121780_Trimmed_1P.fq.gz,trimmedReads/SRR2121781_Trimmed_1P.fq.gz,trimmedReads/SRR2121786_Trimmed_1P.fq.gz,trimmedReads/SRR2121787_Trimmed_1P.fq.gz,trimmedReads/SRR2121788_Trimmed_1P.fq.gz,trimmedReads/SRR2121789_Trimmed_1P.fq.gz \
--right trimmedReads/SRR2121770_Trimmed_2P.fq.gz,trimmedReads/SRR2121771_Trimmed_2P.fq.gz,trimmedReads/SRR2121774_Trimmed_2P.fq.gz,trimmedReads/SRR2121775_Trimmed_2P.fq.gz,trimmedReads/SRR2121778_Trimmed_2P.fq.gz,trimmedReads/SRR2121779_Trimmed_2P.fq.gz,trimmedReads/SRR2121780_Trimmed_2P.fq.gz,trimmedReads/SRR2121781_Trimmed_2P.fq.gz,trimmedReads/SRR2121786_Trimmed_2P.fq.gz,trimmedReads/SRR2121787_Trimmed_2P.fq.gz,trimmedReads/SRR2121788_Trimmed_2P.fq.gz,trimmedReads/SRR2121789_Trimmed_2P.fq.gz \
--CPU 80 --max_memory 2000G --min_contig_length 150
TrinityStats.pl trinity_out_dir/Trinity.fasta
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 602870
Total trinity transcripts: 730297
Percent GC: 45.89
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 8589
Contig N20: 5955
Contig N30: 4400
Contig N40: 3276
Contig N50: 2413
Median contig length: 413
Average contig: 1028.24
Total assembled bases: 750921120
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 5470
Contig N20: 3434
Contig N30: 2391
Contig N40: 1741
Contig N50: 1285
Median contig length: 343
Average contig: 703.04
Total assembled bases: 423841627
bowtie2-build trinity_out_dir/Trinity.fasta trinity_out_dir/Trinity.fasta
bowtie2 --local --no-unal -x trinity_out
#download uniprot swiss-prot db #ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
mkdir -p blast_protdb
mv uniprot_sprot.fasta blast_protdb/.
makeblastdb -in uniprot_sprot.fasta -dbtype prot
cd ..
blastx -query trinity_out_dir/Trinity.fasta -db blast_protdb/uniprot_sprot.fasta -out blastx.outfmt6 -evalue 1e-20 -num_threads 80 -max_target_seqs 1 -outfmt 6
analyze_blastPlus_topHit_coverage.pl blastx.outfmt6 trinity_out_dir/Trinity.fasta blast_protdb/uniprot_sprot.fasta | column -t
#hit_pct_cov_bin count_in_bin >bin_below
100 7361 7361
90 1117 8478
80 981 9459
70 971 10430
60 1003 11433
50 865 12298
40 924 13222
30 1102 14324
20 1478 15802
10 932 16734
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121770_1.fastq.gz -2 rawReads/SRR2121770_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121770.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121771_1.fastq.gz -2 rawReads/SRR2121771_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121771.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121774_1.fastq.gz -2 rawReads/SRR2121774_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121774.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121775_1.fastq.gz -2 rawReads/SRR2121775_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121775.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121778_1.fastq.gz -2 rawReads/SRR2121778_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121778.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121779_1.fastq.gz -2 rawReads/SRR2121779_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121779.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121780_1.fastq.gz -2 rawReads/SRR2121780_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121780.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121781_1.fastq.gz -2 rawReads/SRR2121781_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121781.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121786_1.fastq.gz -2 rawReads/SRR2121786_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121786.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121787_1.fastq.gz -2 rawReads/SRR2121787_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121787.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121788_1.fastq.gz -2 rawReads/SRR2121788_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121788.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121789_1.fastq.gz -2 rawReads/SRR2121789_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121789.coordSorted.bam
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121770_1.fastq.gz --right rawReads/SRR2121770_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121770.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121771_1.fastq.gz --right rawReads/SRR2121771_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121771.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121774_1.fastq.gz --right rawReads/SRR2121774_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121774.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121775_1.fastq.gz --right rawReads/SRR2121775_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121775.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121778_1.fastq.gz --right rawReads/SRR2121778_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121778.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121779_1.fastq.gz --right rawReads/SRR2121779_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121779.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121780_1.fastq.gz --right rawReads/SRR2121780_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121780.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121781_1.fastq.gz --right rawReads/SRR2121781_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121781.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121786_1.fastq.gz --right rawReads/SRR2121786_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121786.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121787_1.fastq.gz --right rawReads/SRR2121787_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121787.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121788_1.fastq.gz --right rawReads/SRR2121788_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121788.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121789_1.fastq.gz --right rawReads/SRR2121789_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121789.RSEM
time abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix Trinity_trans SRR2121770.RSEM/SRR2121770.RSEM.trans.results \
SRR2121771.RSEM/SRR2121771.RSEM.trans.results \
SRR2121774.RSEM/SRR2121774.RSEM.trans.results \
SRR2121775.RSEM/SRR2121775.RSEM.trans.results \
SRR2121778.RSEM/SRR2121778.RSEM.trans.results \
SRR2121779.RSEM/SRR2121779.RSEM.trans.results \
SRR2121780.RSEM/SRR2121780.RSEM.trans.results \
SRR2121781.RSEM/SRR2121781.RSEM.trans.results \
SRR2121786.RSEM/SRR2121786.RSEM.trans.results \
SRR2121787.RSEM/SRR2121787.RSEM.trans.results \
SRR2121788.RSEM/SRR2121788.RSEM.trans.results \
SRR2121789.RSEM/SRR2121789.RSEM.trans.results --gene_trans_map none
time abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix Trinity_genes SRR2121770.RSEM/SRR2121770.RSEM.genes.results \
SRR2121771.RSEM/SRR2121771.RSEM.genes.results \
SRR2121774.RSEM/SRR2121774.RSEM.genes.results \
SRR2121775.RSEM/SRR2121775.RSEM.genes.results \
SRR2121778.RSEM/SRR2121778.RSEM.genes.results \
SRR2121779.RSEM/SRR2121779.RSEM.genes.results \
SRR2121780.RSEM/SRR2121780.RSEM.genes.results \
SRR2121781.RSEM/SRR2121781.RSEM.genes.results \
SRR2121786.RSEM/SRR2121786.RSEM.genes.results \
SRR2121787.RSEM/SRR2121787.RSEM.genes.results \
SRR2121788.RSEM/SRR2121788.RSEM.genes.results \
SRR2121789.RSEM/SRR2121789.RSEM.genes.results --gene_trans_map none
contig_Exn50_statistic.pl Trinity_trans.TMM.EXPR.matrix trinity_out_dir/Trinity.fasta > ExN50_trans.stats
Make samples.txt
mock- SRR2121770.RSEM
mock- SRR2121771.RSEM
IR- SRR2121774.RSEM
IR- SRR2121775.RSEM
mock+ SRR2121778.RSEM
mock+ SRR2121779.RSEM
mock+ SRR2121780.RSEM
mock+ SRR2121781.RSEM
IR+ SRR2121786.RSEM
IR+ SRR2121787.RSEM
IR+ SRR2121788.RSEM
IR+ SRR2121789.RSEM
The way that the installed version of run_DE_analysis.pl tries to install or check for edgeR doesn't work for the new versions of R (3.6.1). In the perl script I just had to edit a few lines under the edgeR section for it to run smoothly, as well as install *BiocManager before hand.
I changed the perl script, under sub run_edgeR_sample_pair, when writing the R script to look like:
print $ofh "if (! require(edgeR)) {\n";
print $ofh " install.packages(\"BiocManager\")\n";
print $ofh " BiocManager::install(\"edgeR\")\n";
print $ofh " library(edgeR)\n";
print $ofh "}\n\n";
To make sure BiocManager and edgeR work, they were installed in the R terminal.
R #runs the installed version of R in terminal mode
install.packages("BiocManager")
BiocManager::install("edgeR")
Since we don't have at least 3 replicates for each of the treatments, edgeR wants a dispersion parameter. 0.035 seems to be pretty common. If the dispersion parameter has the squareroot taken of it we get ~ 0.19, meaning that we are saying the true abundance for each gene can vary up or down by 19% between replicates.
run_DE_analysis.pl --matrix Trinity_trans.counts.matrix --method edgeR --output edgeR_trans --dispersion 0.035 --samples_file samples.txt
head edgeR_trans/Trinity_trans.counts.matrix.IR-_vs_mock-.edgeR.DE_results | column -t
sampleA sampleB logFC logCPM PValue FDR
TRINITY_DN2468_c0_g4_i1 IR- mock- -14.7753615825754 4.45315041918 5.51021995613131e-115 5.85185359341145e-110
TRINITY_DN2505_c0_g1_i2 IR- mock- -10.1817621187735 4.11815268306374 8.54695600746168e-88 4.53843363996215e-83
TRINITY_DN3231_c0_g2_i15 IR- mock- -13.5922145319911 3.27307743751541 6.29068575237937e-81 2.2269027563423e-76
TRINITY_DN4975_c0_g1_i7 IR- mock- 13.4941229058571 3.17132827495398 5.25978730838885e-79 1.39647353037724e-74
TRINITY_DN4360_c0_g1_i5 IR- mock- -13.5215746751431 3.20272600532575 1.0222845072488e-70 2.17133229339645e-66
TRINITY_DN128_c0_g1_i20 IR- mock- 13.1582688371709 2.83599676163376 1.8443766792831e-69 3.26454672233108e-65
TRINITY_DN644_c1_g1_i6 IR- mock- -5.91850038396836 4.31456225100214 8.01440426808572e-67 1.21589961895815e-62
TRINITY_DN2043_c0_g1_i13 IR- mock- 13.7803059024088 3.45719012794474 3.30519380237759e-66 4.38764477265625e-62
TRINITY_DN3321_c0_g1_i2 IR- mock- 12.8215942182179 2.49995284585527 9.5059376972195e-61 1.1217006482719e-56