OVERVIEW
ADDA is a method for finding protein domains in protein sequences.
Briefly, ADDA attempts to split sequences into segments (domains) that
correspond as closely as possible to the alignments from an all-on-all
pairwise sequence comparison. A detailed description of the method can be
found in:
Heger A, Holm L. (2003)
Exhaustive enumeration of protein domain families.
J Mol Biol. May 2;328(3):749-67.
PMID: 12706730
USAGE INSTRUCTIONS
ADDA is controlled with the script adda.py. The script expects a file adda.ini
with configuration options in the directory from which it is called. An example
is in the directory ./test.
INPUT DATA
ADDA requires three input files.
1. sequences in fasta format
2. the results from an all-on-all sequence comparison (sequence alignment graph)
3. domain assignments from a reference domain assignment
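As an illustration of the first input, here is a minimal sketch of a FASTA reader. It is not part of ADDA; the function name read_fasta is hypothetical, and taking the first word of the header line as the sequence identifier is an assumption.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    identifier = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier is the first word of the header.
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Made-up records for illustration only.
example = [">seq1 a toy protein", "ABC", "DE", ">seq2", "MKV"]
records = list(read_fasta(example))
print(records)
```

For a gzip-compressed file, the lines can be supplied by gzip.open(path, "rt") from the Python standard library.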
ADDA proceeds in stages. Each stage corresponds to a command to
the script adda.py. To run all stages, run adda.py as
python adda.py --steps=all
The specific stages are:
1. Pre-processing the input. These steps can be performed in
   parallel.
   1. indexing the sequence database - "index"
   2. building sequence profiles - "profiles"
   3. formatting and filtering the alignment graph - "graph"
   4. indexing the alignment graph - "index"
   5. estimating the error parameters - "fit"
2. Decomposing sequences into domains - "optimise"
3. Converting the sequence alignment graph to a domain alignment graph - "convert"
4. Building a minimum spanning tree of domains - "mst"
5. Aligning domains - "align"
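The stages above can be sketched as a series of command lines. Only --steps=all is documented in this README; that --steps also accepts the individual stage names quoted in the list is an assumption, and the helper adda_command is hypothetical.

```python
# Stage names exactly as quoted in the list above. "index" is listed twice
# there (sequence database and alignment graph), so it appears once here.
STAGES = ["index", "profiles", "graph", "fit",
          "optimise", "convert", "mst", "align"]


def adda_command(steps="all"):
    """Return the argument list for one adda.py invocation."""
    return ["python", "adda.py", "--steps=" + steps]


# Documented form: run every stage in one call.
run_all = adda_command("all")

# Assumption: one invocation per stage name, in the order listed above.
per_stage = [adda_command(name) for name in STAGES]

for command in [run_all] + per_stage:
    print(" ".join(command))
```

Each argument list could be handed to subprocess.run from the directory that contains adda.ini.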
EXAMPLE
A toy example can be found at http://genserv.anat.ox.ac.uk/downloads/contrib/adda.
The files are:
nrdb.fasta.gz: a file with protein sequences in fasta format. The sequences
have been filtered so that any two share less than 40% sequence identity
(Park et al. 2000).
pairsdb.links.gz: a list of pairwise alignments. These were obtained by
running BLASTP (Altschul et al. 1997) all-on-all and parsing the output into a
tab-separated table. The columns are:
1. query: identifier of the query sequence
2. sbjct: identifier of the sbjct sequence
3. evalue: natural log of the E-Value
4-6. query_start, query_end, query_ali: alignment of the query
7-9. sbjct_start, sbjct_end, sbjct_ali: alignment of the sbjct
10. alignment score (not used)
11. percent identity (not used)
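A minimal sketch of reading one row of this table follows. The names Link and parse_link are hypothetical, the sample row is made up, and treating columns 10 and 11 as floating-point numbers is an assumption.

```python
import math
from typing import NamedTuple


class Link(NamedTuple):
    query: str
    sbjct: str
    log_evalue: float        # column 3: natural log of the E-value
    query_start: int
    query_end: int
    query_ali: str
    sbjct_start: int
    sbjct_end: int
    sbjct_ali: str
    score: float             # column 10 (not used by ADDA)
    percent_identity: float  # column 11 (not used by ADDA)


def parse_link(line: str) -> Link:
    """Parse one tab-separated row with the 11 columns listed above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 columns, found %d" % len(fields))
    q, s, ev, qs, qe, qa, ss, se, sa, score, pid = fields
    return Link(q, s, float(ev), int(qs), int(qe), qa,
                int(ss), int(se), sa, float(score), float(pid))


# Made-up row for illustration only; not taken from pairsdb.links.gz.
row = "\t".join(["seqA", "seqB", "-23.0", "0", "5", "+5",
                 "2", "7", "+5", "42", "80.0"])
link = parse_link(row)

# Column 3 stores ln(E); exponentiate to recover the E-value itself.
evalue = math.exp(link.log_evalue)
print(link.query, link.sbjct, evalue)
```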
The alignment coordinates are 0-based and half-open, "[start, end)": the
start is inclusive and the end is exclusive.
The alignment is stored in compressed form as a series of integers, each
prefixed with "+" or "-". A number prefixed with "+" emits that many
sequence characters; a number prefixed with "-" inserts that many gap
characters. For example, "+3-3+2" with the sequence ABCDE will result in
"ABC---DE".
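The compressed alignment format can be expanded with a few lines of code. This is a sketch, not ADDA's own implementation; the function name expand_alignment is hypothetical, and it assumes the sequence passed in is the aligned segment, that is, the slice sequence[start:end] given by the alignment coordinates.

```python
import re


def expand_alignment(ali: str, sequence: str, gap: str = "-") -> str:
    """Expand a compressed alignment such as "+3-3+2" over a sequence."""
    # The whole string must consist of signed integer tokens.
    if not re.fullmatch(r"(?:[+-]\d+)*", ali):
        raise ValueError("malformed alignment string: %r" % ali)
    out = []
    pos = 0  # next unread position in the sequence
    for sign, digits in re.findall(r"([+-])(\d+)", ali):
        n = int(digits)
        if sign == "+":
            # Emit the next n characters of the sequence.
            if pos + n > len(sequence):
                raise ValueError("alignment is longer than the sequence")
            out.append(sequence[pos:pos + n])
            pos += n
        else:
            # Insert n gap characters.
            out.append(gap * n)
    return "".join(out)


# The example from the text above.
print(expand_alignment("+3-3+2", "ABCDE"))  # prints ABC---DE
```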
references.domains.gz: a list of domains for a subset of sequences. The
domains in this file were derived from structural domain definitions in
SCOP (Andreeva et al. 2008).
REFERENCES
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ.
(1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. Sep 1;25(17):3389-402. Review.
Park J, Holm L, Heger A, Chothia C. (2000) RSDB: representative protein
sequence databases have high information content. Bioinformatics. May;16(5):458-64.
Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG.
(2008) Data growth and its impact on the SCOP database: new developments.
Nucleic Acids Res. Jan;36(Database issue):D419-25. Epub 2007 Nov 13.
TODO
1. auto-calibrate the alignment score threshold
2. speed up the initial graph parsing using Cython