LTR_Stream is designed to cluster LTR-RTs at the sub-lineage level in closely related species, uncovering genetic markers that are valuable for genome comparison and for studying the modular evolution of LTR-RTs in host genomes. It takes as input the nucleotide sequences of intact LTR-RTs belonging to the same LTR-lineage. A mix of LTR-RTs from different LTR-lineages is theoretically acceptable but not recommended. LTR_Stream assigns each LTR-RT a cluster label and automatically evaluates the reliability of each cluster.
Conda should be installed with version >=23.1.0.
Mamba is recommended for speeding up conda.
Please install git with version >=2.34.1.
Please configure an SSH key for git and make sure that git clone works.
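The version requirements above can be checked with a short script. This is only a minimal sketch: the helper names (`parse_version`, `meets_minimum`, `check_tool`) are hypothetical and not part of LTR_Stream, and it assumes that `conda --version` and `git --version` print a dotted version number somewhere in their output.

```python
import re
import shutil
import subprocess

# Minimum versions stated in this README.
MINIMUM_VERSIONS = {"conda": (23, 1, 0), "git": (2, 34, 1)}


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found, minimum):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    padded_found = tuple(found) + (0,) * (width - len(found))
    padded_minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return padded_found >= padded_minimum


def check_tool(name):
    """Run '<name> --version' and return (installed_version, is_recent_enough)."""
    if shutil.which(name) is None:
        return None, False  # tool is not on PATH
    result = subprocess.run([name, "--version"], capture_output=True, text=True)
    found = parse_version(result.stdout + result.stderr)
    ok = found is not None and meets_minimum(found, MINIMUM_VERSIONS[name])
    return found, ok


if __name__ == "__main__":
    for tool in MINIMUM_VERSIONS:
        version, ok = check_tool(tool)
        print(tool, version, "OK" if ok else "missing or too old")
```

The script only reports what it finds; it does not install or upgrade anything.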
ltrStreamInstallPath=path_you_want_to_install_LTR_Stream
cd ${ltrStreamInstallPath} && git clone git@github.com:xjtu-omics/LTR_Stream.git
If mamba is not available, please run:
cd ${ltrStreamInstallPath}/LTR_Stream && bash Init_LTR_Stream_Env.sh
For a faster installation with mamba, please run:
cd ${ltrStreamInstallPath}/LTR_Stream && bash Init_LTR_Stream_Env.sh mamba
conda activate ltrStream
cd ${ltrStreamInstallPath}/LTR_Stream/src
snakemake -s LTR_Stream.smk -f stream --config ltrParaFile=path_of_ltrPara.tsv -j {threadsNumber}
LTR_Stream runs automatically according to the parameters set in the TSV (tab-separated values) file passed as ltrParaFile, so please make sure all parameters are set in that file before you start LTR_Stream.smk. (You can choose the file name and path freely; in this documentation, we refer to this parameter file as ltrPara.tsv.) The following is an example of the file. Lines beginning with # are comments. The values of the optional parameters in this example are their defaults in LTR_Stream. To make parameter tuning easier, the parameters that most strongly affect the clustering results are introduced first. A standard example of this file is under examples/.
# An example for ltrPara.tsv
# All fields are tab-separated.
# Mandatory parameters
# workDir: A blank directory for running LTR_Stream
# The outputs of LTR_Stream are in workDir/figure
workDir /xx/xx/xx
# ltrFasta: The nucleotide sequences of the LTR-RT set you want to
# analyze. Please ensure it is in standard FASTA format. Names of these
# sequences should follow the format like 'chrxx:stPos-endPos(strand)'. It
# is recommended to use bedtools to extract sequences from the genome.
ltrFasta /xx/xx/xx.fa
# Optional parameters
# Important parameters
# minOverLapForNovelModule: Controls the number and dispersion of module sequences in the 3-D space.
# It is used in a disjoint-set data structure to decide whether there should be an edge between two
# alignment regions. It can be set in the range 0 to 1. A greater minOverLapForNovelModule leads to
# more module sequences and a more dispersed result. Default is 0.8.
minOverLapForNovelModule 0.8
# topModNum: Controls the number and dispersion of module sequences together with
# minOverLapForNovelModule. A greater topModNum leads to more module sequences and a more dispersed
# result. LTR_Stream outputs a plot of module number versus covered LTR-RTs (named coverLine.pdf
# under workDir/figure). topModNum needs to be large enough that about 80% of LTR-RTs have 2-3
# modules. It is estimated that topModNum should be in the range 200-800. A larger
# minOverLapForNovelModule usually calls for a larger topModNum, so adjust the two parameters
# together. Default is 250.
topModNum 250
# tsneEarlyExaggeration: A crucial parameter in t-SNE dimensionality reduction that directly affects
# the results. An excessively large tsneEarlyExaggeration produces a linear shape in the
# three-dimensional space, while an excessively small tsneEarlyExaggeration leads to a dispersed
# distribution, hindering sub-lineage identification. It is estimated that tsneEarlyExaggeration
# should be in the range 6-9. Default is 6.
tsneEarlyExaggeration 6
# tsnePerplexity: Larger tsnePerplexity will provide more robust results, while a smaller
# tsnePerplexity will yield more detailed clustering results. Depending on the size of the dataset,
# it is not recommended to set tsnePerplexity to less than 3% of the module sequence count for larger
# datasets, or less than 15 for smaller datasets. Default is 100.
tsnePerplexity 100
# cluCentCut: A parameter used to assess the degree of intra-class distribution aggregation in 3D
# space. A larger cluCentCut will result in coarser clustering. If LTR_Stream indicates clustering
# failure, please increase this parameter within the range of 0-1. Default is 0.1.
cluCentCut 0.1
# maxZoomInLevel: LTR_Stream achieves fine clustering of LTR-RT in complex scenarios through
# iterative expansion. This parameter controls the maximum depth of iterative expansion. If you find
# that the number of clusters is too large or some categories within subviews are verified as
# unreliable, you can set a maximum limit. The default value is -1, which means no limit is set.
maxZoomInLevel -1
# Other parameters
# tsneLearningRate: For t-SNE dimensionality reduction, LTR_Stream requires a very small learning
# rate, with a default value of 6. It is not recommended to set this value higher than 8.
tsneLearningRate 6
# blastEvalue: Used for homology searching in BLASTn. Default is 1e-10. If the LTR-RT sequence set to
# be analyzed has particularly high similarity, you can reduce this parameter accordingly.
blastEvalue 1e-10
# Parameters used in ElPiGraph
epgLambda 0.01
epgMu 0.01
epgAlpha 0.05
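Before launching the workflow, it can help to sanity-check the parameter file and the sequence names. The following is a minimal sketch, not part of LTR_Stream: the function names are hypothetical, the defaults are copied from the example above, the 0-1 ranges are assumed to be inclusive, and the name pattern is one assumed reading of 'chrxx:stPos-endPos(strand)' in which the strand is '+' or '-'.

```python
import re

# Defaults copied from the example ltrPara.tsv above.
DEFAULTS = {
    "minOverLapForNovelModule": 0.8,
    "topModNum": 250,
    "tsneEarlyExaggeration": 6.0,
    "tsnePerplexity": 100.0,
    "cluCentCut": 0.1,
    "maxZoomInLevel": -1,
    "tsneLearningRate": 6.0,
    "blastEvalue": 1e-10,
    "epgLambda": 0.01,
    "epgMu": 0.01,
    "epgAlpha": 0.05,
}
MANDATORY = ("workDir", "ltrFasta")

# Assumption: names look like 'chr1:100-5200(+)' with strand '+' or '-'.
NAME_PATTERN = re.compile(r"^[^\s:]+:\d+-\d+\([+-]\)$")


def parse_ltr_para(text):
    """Parse ltrPara.tsv text into a dict, skipping blank lines and '#' comments."""
    params = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("\t")
        value = value.strip()
        if key in DEFAULTS:
            # Convert to the same type as the default (int or float).
            params[key] = type(DEFAULTS[key])(value)
        else:
            params[key] = value  # workDir, ltrFasta, or an unrecognized key
    return params


def check_ltr_para(params):
    """Return a list of problems found; an empty list means none were detected."""
    problems = [
        f"missing mandatory parameter: {key}"
        for key in MANDATORY
        if not params.get(key)
    ]
    if not 0 <= params["minOverLapForNovelModule"] <= 1:
        problems.append("minOverLapForNovelModule should be between 0 and 1")
    if not 0 <= params["cluCentCut"] <= 1:
        problems.append("cluCentCut should be between 0 and 1")
    if params["tsneLearningRate"] > 8:
        problems.append("tsneLearningRate above 8 is not recommended")
    return problems


def bad_fasta_names(fasta_text):
    """Return the FASTA header names that do not match NAME_PATTERN."""
    names = [
        line[1:].strip()
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]
    return [name for name in names if not NAME_PATTERN.match(name)]


example = "# comment\nworkDir\t/xx/work\nltrFasta\t/xx/ltr.fa\ncluCentCut\t1.5\n"
print(check_ltr_para(parse_ltr_para(example)))
print(bad_fasta_names(">chr1:100-5200(+)\nACGT\n>ltr_42\nACGT\n"))
```

For a real run, pass the contents of your own ltrPara.tsv and of the file named by ltrFasta. Keys and values must be separated by a tab, otherwise the whole line is treated as the key.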
All outputs will be saved in workDir/figure.
GIF files showing clustering results in each 3D-subview.
TSV file recording final cluster results.
TSV file recording details of clustering including coordinate information in each subview.
TSV file recording the fold change between inter- and intra-cluster distances, with the corresponding significance, for each cluster. A fold change significantly greater than one indicates a reliable cluster.
Line plot showing module number and the corresponding percentage of covered LTR-RTs. Used to guide parameter adjustment.
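To give an intuition for the fold-change value in the reliability table, the sketch below computes, for each cluster, the mean distance to points outside the cluster divided by the mean pairwise distance inside it. This is an illustrative assumption about what such a ratio can look like, not LTR_Stream's actual implementation: the function name, the use of Euclidean distance on 3D coordinates, and the averaging scheme are all hypothetical, and the significance test is not reproduced.

```python
import math
from itertools import combinations


def mean(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def fold_changes(points, labels):
    """Map each cluster label to mean inter-cluster / mean intra-cluster distance."""
    result = {}
    for label in sorted(set(labels)):
        inside = [p for p, lab in zip(points, labels) if lab == label]
        outside = [p for p, lab in zip(points, labels) if lab != label]
        if len(inside) < 2 or not outside:
            continue  # the ratio is undefined for this cluster
        intra = mean(math.dist(a, b) for a, b in combinations(inside, 2))
        inter = mean(math.dist(a, b) for a in inside for b in outside)
        result[label] = inter / intra if intra > 0 else math.inf
    return result


# Two tight, well-separated toy clusters in 3D.
points = [(0, 0, 0), (0, 1, 0), (10, 0, 0), (10, 1, 0)]
labels = ["A", "A", "B", "B"]
print(fold_changes(points, labels))
```

In this toy example both clusters get a ratio of about 10, well above one, which corresponds to the "reliable" case described above.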