This script searches for paralogous genes in a given transcriptome and then creates a histogram of the estimated number of synonymous substitions per synonymous site (Ks) between paralogs.
- TransDecoder
- Used to determine likely open reading frames in a given transcriptome and its corresponding protein translation.
- Blat
- Used to determine likely paralogs.
- KaKs_Calculator
- Used to estimate the number of synonymous substitions per synonymous site (Ks) between paralogs.
- R
- Plotting and statistics software used to generate the actual Ks plot.
Open reading frames and their corresponding translations are first determined using TransDecoder. Once the likely proteome has been obtained, a self-blat (proteome is both the query and the database) is then performed to identify likely paralogs present within the transcriptome. The sequences of each likely set of paralogs are then reverse translated back to their original nucleotide sequence. These nucleotide alignments are then input into KaKs_Calculator to estimate Ks. Once all Ks values have been calculated, R is used to create a histogram which depicts the distribution of Ks between paralogs present in the transcriptome.
This script requires a single transcriptome in FASTA format in order to run, the following is an example of proper script invocation:
plot-ks.pl transcriptome.fa --out-dir my_ks_plot
To allow for faster modifications of an output plot, redundant calculations can be avoided by specifying the name of a directory previously generated by usage. For instance, if you previously had run the script with the above invocation but realized after its creation that you would likely to narrow down the Ks of the output plot, you could use the following invocation to avoid unneccessary recalculations:
plot-ks.pl my_ks_plot --ks-max 1
For further fine-tuning of the script, the following options can also be specified:
Option Flag(s) | Option Descripton | Default |
---|---|---|
-m, --model | model used by KaKs_Calculator to determine Ks | YN |
-l, --min-length | the minimum alignment length of paralogous sequences | 300 nucleotides |
-x, --exclude-zero | used to exclude Ks = 0 from plot, useful for unfiltered Trinity assemblies | N/A |
-b, --bin-size | size of bins used in histogram of Ks plot, set to 0 for a density plot | Ks = 0.05 |
-o, --out-dir | name of the directory to store output files in | "plot-ks-" + Unix time of script invocation) |
--ks-min | lower boundary for x-axis of Ks plot | Ks = 0 |
--ks-max | upper boundary for x-axis of Ks plot | Ks = 3 |
--pfam-cpus | the number of CPUs to let hmmscan using during pfam analysis | current number of free CPUs |
--pfam-search | full path to pfam binary for usage in pfam search | none |
-T, --n-threads | the maximum number of CPUs to use concurrently | current number of free CPUs |
-h, --help | display help and exit | N/A |
The names of the output files of this script varies depending on the settings it was run with, the following identifiers will be used as placeholders which represent the actual values of the specified command line options at script run time:
- $MODEL = -m, --model
- $MIN_LENGTH = -l, --min-length
- $BIN_SIZE = -b, --bin-size
- $KS_MIN = --ks-min
- $KS_MAX = --ks-max
Assuming that an input file with the name 'transcriptome.fa' was specified to the script, the following files can be found in the output directory upon successful completion of the script:
- TransDecoder output:
- transcriptome.fa.transdecoder.bed:
- transcriptome.fa.transdecoder.cds: contains the nucleotide sequences of detected ORFS
- transcriptome.fa.transdecoder.gff3:
- transcriptome.fa.transdecoder.mRNA: essentially the input transcriptome
- transcriptome.fa.transdecoder.pep: contains the protein translation of the .cds file, this file is the one used in the self-blat
- self-blat.pslx: contains the output of the self-blat
- paralogs-t$MIN_LENGTH.atx: contains the sequences of identified paralogs, used as input for KaKs_Calculator
- paralogs-t$MIN_LENGTH-m$MODEL.kaks: contains the estimated Ks as well as additional information calculated for each identified pair of paralogs by KaKs_Calculator
- paralogs-t$MIN_LENGTH-m$MODEL.csv: contains only the estimated Ks from the previous file
- transcriptome-t$MIN_LENGTH-m$MODEL-b$BIN_SIZE-range[$KS_MIN-$KS_MAX].pdf: the actual histogram generated by this script.
- if run with --bin-size 0, '-b$BIN_SIZE-' will appear as '-density-'
- if run with --exclude-zero (and assuming $KS_MIN still equals 0), '-range[0-$KS_MAX]' will appear as '-range(0-$KS_MAX]'
- symlink to the input FASTA files