Skip to content

nkarimi/plot-ks

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 

Repository files navigation

plot-ks.pl

This script searches for paralogous genes in a given transcriptome and then creates a histogram of the estimated number of synonymous substitions per synonymous site (Ks) between paralogs.

Dependencies

  1. TransDecoder
    • Used to determine likely open reading frames in a given transcriptome and its corresponding protein translation.
  2. Blat
    • Used to determine likely paralogs.
  3. KaKs_Calculator
    • Used to estimate the number of synonymous substitions per synonymous site (Ks) between paralogs.
  4. R
    • Plotting and statistics software used to generate the actual Ks plot.

Script Workflow

Open reading frames and their corresponding translations are first determined using TransDecoder. Once the likely proteome has been obtained, a self-blat (proteome is both the query and the database) is then performed to identify likely paralogs present within the transcriptome. The sequences of each likely set of paralogs are then reverse translated back to their original nucleotide sequence. These nucleotide alignments are then input into KaKs_Calculator to estimate Ks. Once all Ks values have been calculated, R is used to create a histogram which depicts the distribution of Ks between paralogs present in the transcriptome.

Script Usage & Settings

Usage

This script requires a single transcriptome in FASTA format in order to run, the following is an example of proper script invocation:

plot-ks.pl transcriptome.fa --out-dir my_ks_plot

To allow for faster modifications of an output plot, redundant calculations can be avoided by specifying the name of a directory previously generated by usage. For instance, if you previously had run the script with the above invocation but realized after its creation that you would likely to narrow down the Ks of the output plot, you could use the following invocation to avoid unneccessary recalculations:

plot-ks.pl my_ks_plot --ks-max 1

Command Line Options

For further fine-tuning of the script, the following options can also be specified:

Option Flag(s) Option Descripton Default
-m, --model model used by KaKs_Calculator to determine Ks YN
-l, --min-length the minimum alignment length of paralogous sequences 300 nucleotides
-x, --exclude-zero used to exclude Ks = 0 from plot, useful for unfiltered Trinity assemblies N/A
-b, --bin-size size of bins used in histogram of Ks plot, set to 0 for a density plot Ks = 0.05
-o, --out-dir name of the directory to store output files in "plot-ks-" + Unix time of script invocation)
--ks-min lower boundary for x-axis of Ks plot Ks = 0
--ks-max upper boundary for x-axis of Ks plot Ks = 3
--pfam-cpus the number of CPUs to let hmmscan using during pfam analysis current number of free CPUs
--pfam-search full path to pfam binary for usage in pfam search none
-T, --n-threads the maximum number of CPUs to use concurrently current number of free CPUs
-h, --help display help and exit N/A

Output Files

The names of the output files of this script varies depending on the settings it was run with, the following identifiers will be used as placeholders which represent the actual values of the specified command line options at script run time:

  • $MODEL = -m, --model
  • $MIN_LENGTH = -l, --min-length
  • $BIN_SIZE = -b, --bin-size
  • $KS_MIN = --ks-min
  • $KS_MAX = --ks-max

Assuming that an input file with the name 'transcriptome.fa' was specified to the script, the following files can be found in the output directory upon successful completion of the script:

  • TransDecoder output:
    • transcriptome.fa.transdecoder.bed:
    • transcriptome.fa.transdecoder.cds: contains the nucleotide sequences of detected ORFS
    • transcriptome.fa.transdecoder.gff3:
    • transcriptome.fa.transdecoder.mRNA: essentially the input transcriptome
    • transcriptome.fa.transdecoder.pep: contains the protein translation of the .cds file, this file is the one used in the self-blat
  • self-blat.pslx: contains the output of the self-blat
  • paralogs-t$MIN_LENGTH.atx: contains the sequences of identified paralogs, used as input for KaKs_Calculator
  • paralogs-t$MIN_LENGTH-m$MODEL.kaks: contains the estimated Ks as well as additional information calculated for each identified pair of paralogs by KaKs_Calculator
  • paralogs-t$MIN_LENGTH-m$MODEL.csv: contains only the estimated Ks from the previous file
  • transcriptome-t$MIN_LENGTH-m$MODEL-b$BIN_SIZE-range[$KS_MIN-$KS_MAX].pdf: the actual histogram generated by this script.
    • if run with --bin-size 0, '-b$BIN_SIZE-' will appear as '-density-'
    • if run with --exclude-zero (and assuming $KS_MIN still equals 0), '-range[0-$KS_MAX]' will appear as '-range(0-$KS_MAX]'
  • symlink to the input FASTA files

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Perl 100.0%