Update Feb-19th-2021
-------------------------------
Please refer to the Wiki for an updated manual.
Update June-7th-2019
-------------------------------
Improved the function that enables parallel running of MCScanX.
Code is better organized and explained.
Four key parameters: k (# tophits), s (# anchors), m (# gaps), and p (# CPUs)
The -p setting is used for parallelizing both DIAMOND and MCScanX.
Notes:
Change Line70 according to your genome list
Options:
Line127: duplicate_gene_classifier
Line132: detect_collinear_tandem_arrays [intra-species]
Line180-183: detect_collinear_tandem_arrays [inter-species]
Update June 3rd 2019
-------------------------------
Phylogenomic Profiling
Once you have constructed a synteny network database for your genomes of interest, you can either cluster the entire network (for example with the Infomap algorithm), or first filter out subnetworks (for a certain gene family) and then cluster those (using Infomap, CFinder, etc.).
Next, we summarize the clusters by their node composition. This lets us infer which clusters are conserved and which are specific (for example, shared only by a certain group of species). Roughly, the process is as follows: we first generate a matrix in which rows are clusters, columns are species, and each value is the number of nodes from that species in that cluster. We then calculate a distance matrix between all pairs of clusters, and finally perform hierarchical clustering to group clusters with similar patterns.
| | Species 1 | Species 2 | Species 3 | … | Species n |
|---|---|---|---|---|---|
| Cluster 1 | 1 | 2 | 1 | … | 1 |
| ….. | 0 | 0 | 1 | … | 1 |
| Cluster n | 0 | 1 | 1 | … | 0 |
Now let’s start. Suppose you clustered the network with the Infomap script infomap.r; the result looks like this:
names mem
aar_AA31G00673 1
aar_AA32G00725 2
aar_AA39G00041 1
aar_AA29G00273 3
ach_Achn050361 1
ach_Achn168171 1
ach_Achn330591 1
ach_Achn198651 1
ach_Achn060901 1
…….
Results in this format can be fed directly into Phylogenomic_Profiling.r. This R script analyzes cluster composition, calculates distances, and performs hierarchical clustering. Please read the notes within the code. Also remember to change the content of Line 22 according to the genomes you are using.
Usage: Rscript Phylogenomic_Profiling.r infomap_clustering_result cluster_profiled cluster_profiled_clustered
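The profiling idea can also be illustrated outside R. The following is a minimal Python sketch, not a port of Phylogenomic_Profiling.r: it assumes the species abbreviation is the part of each gene name before the first underscore, and it uses Euclidean distance purely for illustration (the metric used by the R script is not stated here). The function names `build_profile` and `pairwise_distances` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_profile(lines):
    """Count, for every cluster, how many member genes each species contributes."""
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) != 2 or fields[0] == "names":
            continue  # skip the header row and malformed lines
        gene, cluster = fields
        # Assumption: the species abbreviation precedes the first "_".
        species = gene.split("_", 1)[0]
        counts[cluster][species] += 1
    species = sorted({s for c in counts.values() for s in c})
    # One row per cluster, one column per species (0 when a species is absent).
    matrix = {cl: [cnt[s] for s in species] for cl, cnt in counts.items()}
    return species, matrix


def pairwise_distances(matrix):
    """Euclidean distance between every pair of cluster rows (illustrative choice)."""
    clusters = sorted(matrix)
    return {
        (a, b): math.dist(matrix[a], matrix[b])
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    }


# The Infomap result excerpt shown above.
sample = """names mem
aar_AA31G00673 1
aar_AA32G00725 2
aar_AA39G00041 1
aar_AA29G00273 3
ach_Achn050361 1
ach_Achn168171 1
ach_Achn330591 1
ach_Achn198651 1
ach_Achn060901 1
""".splitlines()

species, matrix = build_profile(sample)
distances = pairwise_distances(matrix)
print(species)
for cluster, row in sorted(matrix.items()):
    print(cluster, row)
```

Clusters with identical species composition end up at distance 0, which is what lets a subsequent hierarchical clustering step group similarly patterned clusters together.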
UPDATES 5 Jul 2018
-------------------------------
Two scripts are attached that use DIAMOND (Buchfink et al., 2015) for faster genome comparisons in the first step.
SynetBuilding-Diamond.sh: Use this the first time, when you want to construct a synteny network for your genomes of interest.
SynetAdding-Diamond.sh: Use this when you want to add new genomes to existing results.
-Preparations
- Whole genome protein files in fasta format.
- GFF/BED file for each genome (Example: http://chibba.pgml.uga.edu/mcscan2/examples/at.gff)
-Install DIAMOND and MCScanX (Wang et al., 2012)
-DIAMOND: https://github.com/bbuchfink/diamond
-MCScanX: http://chibba.pgml.uga.edu/mcscan2/
-Notes / Tips
- Index your genome files with abbreviations of 3-5 letters. For example, for "Arabidopsis thaliana", rename the genome file and GFF file to "Ath.pep" and "Ath.bed", and for "Oryza sativa" to "Osa.pep" and "Osa.bed".
-To run
- Put all pairs of "*.pep" and "*.bed" files for your genomes in one folder, and copy the script (SynetBuilding-Diamond.sh) to the same folder.
- Change line 41 in the script (the content of the array): enter the genome indexes of your own selection.
- for example: array=(ath osa oth Alyr)
- The array can be of any length, depending on the genomes you want to compare.
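Before launching the script, it can help to confirm that every genome index in the array has both of its input files. The snippet below is a small Python sketch of such a check; `missing_inputs` is a hypothetical helper and is not part of SynetBuilding-Diamond.sh. It compares file names exactly as given, so the case of each index must match the file names on disk.

```python
from pathlib import Path


def missing_inputs(genome_ids, folder="."):
    """Return the expected '<id>.pep' / '<id>.bed' files that are not in folder."""
    folder = Path(folder)
    return [
        f"{gid}.{ext}"
        for gid in genome_ids
        for ext in ("pep", "bed")
        if not (folder / f"{gid}.{ext}").is_file()
    ]


if __name__ == "__main__":
    # Same indexes as in the array example above.
    for name in missing_inputs(["ath", "osa", "oth", "Alyr"]):
        print("missing:", name)
```

An empty result means every listed genome has its peptide and BED file in the folder.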
======================================
Synteny network construction consists of five primary steps: (1) Annotated genome data preparation, (2) pairwise whole-genome comparisons, (3) syntenic block detection and data merging, (4) sub-network extraction (optional), and (5) network data analysis and visualization.
For Step 1, plant genomes can be downloaded from Phytozome, NCBI, Plaza, CoGe, etc. For each genome two files are needed: peptide sequences for all annotated/predicted genes (primary transcripts only) and a bed/GFF file indicating the genomic location of each gene. Users can prepare any number of genomes for synteny network construction; the more genomes, the longer the computation time.
>>> Fifty-one plant genomes used in the study of Tao Zhao et al., 2017b are listed and available for download below (Table 1).
For Steps 2 and 3, we provide a bash script (SynNet.sh) that automatically performs pairwise inter- and intra-species comparisons, trims the outputs for synteny detection, and merges the outputs containing all synteny blocks into a final network file. This network database contains four columns: Block_ID, Block_Score, Gene1, and Gene2 (Gene1 and Gene2 are a syntenic gene pair).
- Users have to pre-install RAPSearch2 (BLAST-like program, but much faster) and MCScanX (for pairwise synteny block detection).
- Put all required genome files and the bash script in the same directory. Then edit the first line of the script, which is a bracketed list of species abbreviations (tab separated, consistent with the names used in the genome files).
- Run the program and get the result file called “Final_Network”, which contains all pairwise synteny blocks of your input species.
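To show how the four-column network file can be consumed downstream, here is a minimal Python sketch that reads such lines into an adjacency map. It assumes whitespace-separated columns in the order Block_ID, Block_Score, Gene1, Gene2 and a numeric score; the delimiter is an assumption, and the example rows and gene names are made up.

```python
from collections import defaultdict


def read_network(lines):
    """Yield (block_id, score, gene1, gene2) tuples from four-column lines."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore blank or malformed lines
        try:
            score = float(fields[1])
        except ValueError:
            continue  # e.g. a header row
        yield fields[0], score, fields[2], fields[3]


def build_adjacency(records):
    """Map every gene to the set of genes it is syntenic with."""
    neighbours = defaultdict(set)
    for _block, _score, gene1, gene2 in records:
        neighbours[gene1].add(gene2)
        neighbours[gene2].add(gene1)
    return neighbours


# Hypothetical rows, for illustration only.
example = [
    "B1 350.0 ath_G1 osa_G7",
    "B1 350.0 ath_G2 osa_G8",
    "B2 120.5 ath_G1 aly_G3",
]
adjacency = build_adjacency(read_network(example))
print(sorted(adjacency["ath_G1"]))
```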
>>> The synteny network of the fifty-one plant genomes used in the study of Zhao et al., 2017b is available for download (“51_Genomes_Blocks”).
At Step 4, for specific gene family studies you may need to extract sub-networks from the database. To do this, you need to first identify all gene family members from the genomes and then query this gene candidate list against the synteny block database.
- We use HMMER for gene family identification. HMMs (Hidden Markov Models) for specific gene families can be obtained from Pfam. Users can use Pfam Search or NCBI BLAST to help identify the feature domain(s) in the protein sequence. For example, a plant MADS-box protein is characterized by a core DNA binding domain (PF00319).
Brief Guidelines of HMMER Usage:
- Install HMMER following the instructions at: http://hmmer.org/documentation.html
- Download the protein sequence alignment for PF00319 in Stockholm format (default name: “PF00319_seed.txt”): http://pfam.xfam.org/family/PF00319#tabview=tab3
- hmmbuild: builds a model from the alignment.
  - Usage: hmmbuild [-options] <hmmfile output> <alignment file input>
  - Example: hmmbuild MADS.hmm PF00319_seed.txt
  - MADS.hmm is the output model for characterizing MADS-box genes.
- hmmsearch: identifies all candidate members in the peptide database.
  - Usage: hmmsearch [options] <query hmmfile> <target seqfile>
  - Example: hmmsearch MADS.hmm 51_Genomes_Peps > MADS_Results
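One way to turn hmmsearch results into a plain gene list is to parse HMMER's tabular output. The Python sketch below assumes hmmsearch was run with the `--tblout` option (instead of redirecting standard output as in the example above), where comment lines start with "#", the first column is the target name, and the fifth column is the full-sequence E-value. The cutoff of 1e-5, the function name `hits_from_tblout`, and the sample rows are illustrative assumptions, not values used in the study.

```python
def hits_from_tblout(lines, max_evalue=1e-5):
    """Return unique target names whose full-sequence E-value passes the cutoff."""
    hits, seen = [], set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and tblout comment lines
        fields = line.split()
        if len(fields) < 5:
            continue
        target, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue and target not in seen:
            seen.add(target)
            hits.append(target)
    return hits


# Made-up rows (truncated to the first seven columns) for illustration.
tblout = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "ath_GENE1 - SRF-TF PF00319.21 1.2e-20 70.1 0.1",
    "osa_GENE2 - SRF-TF PF00319.21 3.0e-08 30.5 0.0",
    "osa_GENE3 - SRF-TF PF00319.21 0.5 5.1 0.0",
]
hits = hits_from_tblout(tblout)
print("\n".join(hits))
```

Writing the returned names one per line gives a list in the same shape as the gene list used for sub-network extraction.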
>>> Peptides for 51 plant genomes are merged and available for download, which can be used for an easier identification of gene family members of all 51 genomes. (“51_Genomes_Peps”).
>>> The gene list of candidate MADS-box genes from the 51 Genomes (“MADS_list”)
- Extract subnetwork from the synteny network database as needed, using a list containing all HMMER-identified family members.
- Command: grep -f MADS_list 51_Genomes_Blocks > MADS.SynNet
- Now we obtain all syntenic relationships for all MADS-box genes.
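Note that `grep -f` treats each listed ID as a pattern that may match anywhere in a line, so an ID that is a prefix of another ID can pull in unrelated rows. The following Python sketch shows an exact-match alternative; it assumes the four whitespace-separated columns described earlier, and the gene names and rows are made up. Like the grep command, it keeps a row when either gene is in the list.

```python
def extract_subnetwork(network_lines, family_ids):
    """Keep rows where Gene1 or Gene2 exactly equals an ID in family_ids."""
    wanted = {g.strip() for g in family_ids if g.strip()}
    kept = []
    for line in network_lines:
        fields = line.split()
        # Columns assumed: Block_ID, Block_Score, Gene1, Gene2.
        if len(fields) == 4 and (fields[2] in wanted or fields[3] in wanted):
            kept.append(line.rstrip("\n"))
    return kept


# Hypothetical data: "ath_G1" is a prefix of "ath_G10".
network = [
    "B1 350.0 ath_G1 osa_G7",
    "B2 120.5 ath_G10 osa_G9",
    "B3 200.0 aly_G3 ath_G1",
]
subnetwork = extract_subnetwork(network, ["ath_G1"])
print("\n".join(subnetwork))
```

Here the row containing ath_G10 is excluded, whereas a substring match on "ath_G1" would have kept it. Requiring both genes to be family members is a stricter variant that only needs `and` in place of `or`.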
>>> Synteny network of MADS-box genes across 51 genomes (MADS.SynNet)
Step 5:
The subnetwork file (MADS.SynNet) can be trimmed into several formats for clustering and visualization, which can be performed in different ways.
Clustering algorithms: K-clique percolation (e.g., CFinder, SNAP), Infomap, CNM, k-core decomposition, etc.
Visualization tools: Cytoscape, Gephi, etc.
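As one example of trimming the subnetwork file, the sketch below reduces the four-column rows to a deduplicated, undirected two-column edge list, a shape that many clustering and visualization tools can import. This is an illustrative Python sketch under the same column assumptions as before; the exact input format each tool expects should be checked in its own documentation, and the example rows are made up.

```python
def to_edge_list(network_lines):
    """Return sorted unique undirected (gene_a, gene_b) pairs."""
    edges = set()
    for line in network_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # expect Block_ID, Block_Score, Gene1, Gene2
        a, b = sorted(fields[2:4])  # order-independent key for the pair
        if a != b:
            edges.add((a, b))
    return sorted(edges)


# Hypothetical rows: the first two describe the same gene pair.
rows = [
    "B1 350.0 ath_G1 osa_G7",
    "B4 90.0 osa_G7 ath_G1",
    "B3 200.0 aly_G3 ath_G1",
]
edge_list = to_edge_list(rows)
for a, b in edge_list:
    print(f"{a}\t{b}")
```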
>>> Example networks from Tao Zhao et al., 2017b are available for download and visualization in Cytoscape (MADS.cys), Cytoscape version 3.4.0.
Table 1: Genomes used in the study of Tao Zhao et al., 2017b
| No | Species | Order | Peptides | BED/GFF | Version | #Genes | Reference |
|---|---|---|---|---|---|---|---|
| 1 | Phaseolus vulgaris (Common bean) | Rosids | | | Version 1.0 | 27082 | Schmutz et al., 2014 |
| 2 | Glycine max (Soybean) | Rosids | | | Wm82.a2.v1 | 56044 | Schmutz et al., 2010 |
| 3 | Cajanus cajan (Pigeonpea) | Rosids | | | Nov_2011 | 48680 | Varshney et al., 2012 |
| 4 | Medicago truncatula (Barrel medic) | Rosids | | | Mt4.0v1 | 50894 | Young et al., 2011 |
| 5 | Cicer arietinum (Chickpea) | Rosids | | | Version 1.0 | 28269 | Varshney et al., 2013 |
| 6 | Lotus japonicus (Lotus) | Rosids | | | Version 2.5 | 42399 | Sato et al., 2008 |
| 7 | Citrullus lanatus (Watermelon) | Rosids | | | Version 1.0 | 23440 | Guo et al., 2013 |
| 8 | Cucumis sativus (Cucumber) | Rosids | | | Version 1.0 | 21491 | Huang et al., 2009 |
| 9 | Populus trichocarpa (Western poplar) | Rosids | | | Version 3.0 | 41335 | Tuskan et al., 2006 |
| 10 | Ricinus communis (Castor bean) | Rosids | | | Version 0.1 | 38613 | Chan et al., 2010 |
| 11 | Malus x domestica (Apple) | Rosids | | | Version 1.0 | 63514 | Velasco et al., 2010 |
| 12 | Pyrus x bretschneideri (Pear) | Rosids | | | Version 1.0 | 42812 | Wu et al., 2013 |
| 13 | Prunus persica (Peach) | Rosids | | | Version 1.0 | 28689 | International Peach Genome et al., 2013 |
| 14 | Prunus mume (Mei) | Rosids | | | Version 1.0 | 31390 | Zhang et al., 2012 |
| 15 | Fragaria vesca (Strawberry) | Rosids | | | Version 1.1 | 32831 | Shulaev et al., 2011 |
| 16 | Arabidopsis thaliana (Arabidopsis) | Rosids | | | TAIR10 | 27416 | Arabidopsis Genome, 2000 |
| 17 | Arabidopsis lyrata (Lyrate rockcress) | Rosids | | | Version 1.0 | 32670 | Hu et al., 2011 |
| 18 | Capsella rubella (Capsella) | Rosids | | | Version 1.0 | 26521 | Slotte et al., 2013 |
| 19 | Brassica oleracea (Kale) | Rosids | | | Version 2.1 | 59225 | Liu et al., 2014 |
| 20 | Brassica rapa (Chinese cabbage) | Rosids | | | Version 1.3 | 40492 | Wang et al., 2011 |
| 21 | Aethionema | Rosids | | | Version 2.5 | 22230 | Haudry et al., 2013 |
| 22 | Tarenaya | Rosids | | | Version 5 | 31580 | Cheng et al., 2013 |
| 23 | Carica papaya (Papaya) | Rosids | | | ASGPBv0.4 | 24782 | Ming et al., 2008 |
| 24 | Gossypium raimondii (Cotton) | Rosids | | | Version 2.1 | 37505 | Paterson et al., 2012 |
| 25 | Theobroma cacao (Cacao) | Rosids | | | Version 1.1 | 29452 | Argout et al., 2011 |
| 26 | Citrus sinensis (Sweet orange) | Rosids | | | Version 1.1 | 25379 | Xu et al., 2013 |
| 27 | Eucalyptus grandis (Eucalyptus) | Rosids | | | Version 1.1 | 36376 | Myburg et al., 2014 |
| 28 | Vitis vinifera (Grape vine) | Rosids | | | Genoscope (Aug 2007) | 26346 | Jaillon et al., 2007 |
| 29 | Solanum tuberosum (Potato) | Solanace | | | Version 3.4 | 39031 | Potato Genome Sequencing et al., 2011 |
| 30 | Solanum lycopersicum (Tomato) | Solanace | | | Version 2.4 | 34727 | Tomato Genome, 2012 |
| 31 | Capsicum annuum (Hot pepper) | Solanace | | | Version 1.55 | 34899 | Kim et al., 2014 |
| 32 | Utricularia gibba (Humped bladderwort) | Solanace | | | CoGe (Jun 2013) | 28494 | Ibarra-Laclette et al., 2013 |
| 33 | Actinidia chinensis (Kiwifruit) | Solanace | | | May_2013 | 32670 | Huang et al., 2013 |
| 34 | Beta vulgaris (Sugar beet) | Eudicots | | | RefBeet-1.1 | 27421 | Dohm et al., 2014 |
| 35 | Nelumbo nucifera (Sacred lotus) | Eudicots | | | Version 1.0 | 26685 | Ming et al., 2013 |
| 36 | Triticum urartu (Wheat A-genome) | Monocots | | | Version 1.0 | 34879 | Ling et al., 2013 |
| 37 | Hordeum vulgare (Barley) | Monocots | | | Version 1.0 | 16598 | International Barley Genome Sequencing et al., 2012 |
| 38 | Brachypodium distachyon (Purple false brome) | Monocots | | | Version 2.1 | 31694 | International Brachypodium, 2010 |
| 39 | Oryza sativa (Rice) | Monocots | | | Version 7.0 | 39049 | International Rice Genome Sequencing, 2005 |
| 40 | Zea mays (Maize) | Monocots | | | Version 6a | 63480 | Schnable et al., 2009 |
| 41 | Sorghum bicolor (Sorghum) | Monocots | | | Version 2.1 | 33032 | Paterson et al., 2009 |
| 42 | Setaria italica | Monocots | | | Version 2.1 | 35471 | Bennetzen et al., 2012 |
| 43 | Elaeis guineensis (Oil palm) | Monocots | | | Version 2.0 | 30752 | Singh et al., 2013 |
| 44 | Musa acuminata (Banana) | Monocots | | | July_2012 | 36542 | D'Hont et al., 2012 |
| 45 | Phalaenopsis equestris (Orchid) | Monocots | | | Version 5.0 | 42293 | Cai et al., 2015 |
| 46 | Zostera muelleri (Seagrass) | Monocots | | | Version 1.0 | 33245 | Golicz et al., 2015 |
| 47 | Amborella trichopoda (Amborella) | Basal Angiosperm | | | Version 1.0 | 26846 | Chamala et al., 2013 |
| 48 | Picea abies (Norway spruce) | Gymnosperm | | | Version 1.0 | 66632 | Nystedt et al., 2013 |
| 49 | Selaginella moellendorffii (Selaginella) | Moss | | | Version 1.0 | 22273 | Banks et al., 2011 |
| 50 | Physcomitrella patens (Moss) | Moss | | | Version 3.0 | 26610 | Rensing et al., 2008 |
| 51 | Chlamydomonas reinhardtii (Green algae) | Green algae | | | Version 5.5 | 17741 | Merchant et al., 2007 |
Citations:
Zhao, T., and Schranz, E. (2019). Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. Proceedings of the National Academy of Sciences 116(6), 2165-2174.
Zhao, T., Holmer, R., de Bruijn, S., Angenent, G.C., van den Burg, H.A., and Schranz, M.E. (2017b). Phylogenomic synteny network analysis of MADS-box transcription factor genes reveals lineage-specific transpositions, ancient tandem duplications, and deep positional conservation. The Plant Cell 29, 1278-1292.
Zhao, T., and Schranz, E. (2017a). Network Approaches for Plant Phylogenomic Synteny Analysis. Current Opinion in Plant Biology 36, 129-134.