forked from oxpig/ANARCI
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
228 lines (183 loc) · 15.1 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
ANARCI_VC is modified from ANARCI ([github page](https://github.com/oxpig/ANARCI)), to enable the numbering for both V and C regions. In order to number C regions, `c_schemes.py` is added based on the original framework of ANARCI.
The following are original readme file of ANARCI:
ANARCI \\ //
Antibody Numbering and Antigen Receptor ClassIfication \\ //
||
(c) Oxford Protein Informatics Group (OPIG). 2015-20 ||
Usage:
ANARCI -i <inputsequence or fasta file>
Requirements:
- HMMER3 version 3.1b1 or higher - http://hmmer.janelia.org/
e.g.
ANARCI -i Example_sequence_files/12e8.fasta
This will number the files in 12e8.fasta with imgt numbering scheme and print to stdout.
ANARCI -i Example_sequence_files/sequences.fasta -o Numbered_sequences.anarci -ht hit_tables.txt -s chothia -r ig
This will number the files in sequences.fasta with chothia numbering scheme only if they are an antibody chain (ignore TCRs).
It will put the numbered sequences in Numbered_sequences.anarci and the alignment statistics in hit_tables.txt
ANARCI -i Example_sequence_files/lysozyme.fasta
No antigen receptors should be found. The program will just list the names of the sequences.
ANARCI -i EVQLQQSGAEVVRSGASVKLSCTASGFNIKDYYIHWVKQRPEKGLEWIGWIDPEIGDTEYVPKFQGKATMTADTSSNTAYLQLSSLTSEDTAVYYCNAGHDYDRGRFPYWGQGTLVTVSA
Or just give a single sequence to be numbered.
optional arguments:
-h, --help show this help message and exit
--sequence INPUTSEQUENCE, -i INPUTSEQUENCE
A sequence or an input fasta file
--outfile OUTFILE, -o OUTFILE
The output file to use. Default is stdout
--scheme {kabat,aho,wolfguy,imgt,a,c,chothia,i,k,m,w,martin}, -s {kabat,aho,wolfguy,imgt,a,c,chothia,i,k,m,w,martin}
Which numbering scheme should be used. i, k, c, m, w
and a are shorthand for IMGT, Kabat, Chothia, Martin
(Extended Chothia), Wolfguy and Aho respectively.
Default IMGT
--restrict {ig,tr,heavy,light,H,K,L,A,B} [{ig,tr,heavy,light,H,K,L,A,B} ...], -r {ig,tr,heavy,light,H,K,L,A,B} [{ig,tr,heavy,light,H,K,L,A,B} ...]
Restrict ANARCI to only recognise certain types of
receptor chains.
--csv Write the output in csv format. Outfile must be
specified. A csv file is written for each chain type
<outfile>_<chain_type>.csv. Kappa and lambda are
considered together.
--outfile_hits HITFILE, -ht HITFILE
Output file for domain hit tables for each sequence.
Otherwise not output.
--hmmerpath HMMERPATH, -hp HMMERPATH
The path to the directory containing hmmer programs.
(including hmmscan)
--ncpu NCPU, -p NCPU Number of parallel processes to use. Default is 1.
--assign_germline Assign the v and j germlines to the sequence. The most
sequence identical germline is assigned.
--use_species {alpaca,rabbit,rhesus,pig,rat,human,mouse}
Use a specific species in the germline assignment.
--bit_score_threshold BIT_SCORE_THRESHOLD
Change the bit score threshold used to confirm an
alignment should be used.
Author: James Dunbar ([email protected])
Charlotte Deane ([email protected])
Contact: [email protected]
--------------------------------------------------------------------------------------
Output files:
The numbering file.
The numbering file (--outfile or stdout) reports the numbering for all sequences given in the sequence file. Each record is separated by "//".
Those chains for which no significant alignment was found report the name as in the fasta file. e.g:
# 1A14:N|PDBID|CHAIN|SEQUENCE
//
Those sequences where a significant alignment has been found report as below:
# 1A14:H|PDBID|CHAIN|SEQUENCE
# ANARCI numbered
# Domain 1 of 1
# Most significant HMM hit
#|species|chain_type|e-value|score|seqstart_index|seqend_index|
#|mouse|H|8.6e-58|184.9|0|119|
# Scheme = imgt
H 1 Q
H 2 V
H 3 Q
H 4 L
H 5 Q
.
.
.
//
where:
species = The species of the most significant aligned HMM
chain_type = The chain type of the most significant aligned HMM
e-value = The e-value of the alignment to the most significant aligned HMM
score = The bit-score of the alignment to the most significant aligned HMM
seqstart_index = The index in the sequence of the first numbered residue
seqend_index = The index in the sequence of the last numbered residue
Scheme = The numbering scheme used to number the sequence
Then follows the numbering. Chain type (H, L (for both kappa(K) and lambda(L) chain types) , A (alpha), B (beta))
If the "assign_germline" option has been specified the further following lines are added to the header. e.g.
# Most sequence-identical germlines
#|species|v_gene|v_identity|j_gene|j_identity|
#|mouse|IGHV1-12*01|0.86|IGHJ2*01|0.79|
where:
species = The species of the most sequence identical germline
v_gene = The identifier of the most sequence identical germline over the v-region
v_identity = The sequence identity over the v-region to the most sequence identical germline
j_gene = The identifier of the most sequence identical germline over the j-region
j_identity = The sequence identity over the j-region to the most sequence identical germline
The csv format output file.
When the --csv option is specified, numbered sequences are output into separate comma separated value files depending on their
chain type. This provides a horizontal output format and contains all the properties detailed above. In addition, sequences
are aligned according to the numbering scheme.
The hit file.
The hit file reports the statistics for all alignments to each HMM in the database even if the sequence was not numbered.
Each record is separated by "//".
The corresponding hit table for the numbered entry above looks like:
"""
NAME 1a14_H mol:protein length:120 NC10 FV (HEAVY CHAIN)
SEQUENCE QVQLQQSGAELVKPGASVRMSCKASGYTFTNYNMYWVKQSPGQGLEWIGIFYPGNGDTSYNQKFKDKATLT
SEQUENCE ADKSSNTAYMQLSSLTSEDSAVYYCARSGGSYRYDGGFDYWGQGTTVTV
id description evalue bitscore bias best_dom_evalue best_dom_bitscore best_dom_bias domain_exp_num domain_obs_num
mouse_H 1.1e-57 184.5 1.5 1.3e-57 184.4 1.5 1.0 1
human_H 7.8e-53 169.0 1.9 8.7e-53 168.8 1.9 1.0 1
rat_H 4.7e-47 150.2 2.2 5.2e-47 150.0 2.2 1.0 1
rabbit_H 3.7e-37 118.2 0.7 4e-37 118.1 0.7 1.0 1
pig_H 1.5e-35 113.3 2.7 1.6e-35 113.1 2.7 1.0 1
rhesus_H 4.4e-32 101.5 1.8 4.9e-32 101.4 1.8 1.0 1
mouse_B 2.4e-19 60.6 0.7 2.6e-19 60.5 0.7 1.0 1
human_B 4.2e-19 59.7 0.9 4.6e-19 59.5 0.9 1.0 1
mouse_A 8.7e-19 58.5 1.1 9.6e-19 58.4 1.1 1.0 1
human_A 1.7e-18 57.6 0.9 1.9e-18 57.5 0.9 1.0 1
mouse_D 5.1e-17 53.3 0.7 5.9e-17 53.1 0.7 1.1 1
rhesus_L 1.6e-16 51.7 2.7 1.9e-16 51.4 2.7 1.1 1
human_L 1.7e-15 48.3 3.5 2e-15 48.0 3.5 1.1 1
human_D 6.7e-15 46.1 0.2 7.4e-15 45.9 0.2 1.0 1
rhesus_K 3.9e-13 40.6 1.7 5.1e-13 40.2 1.7 1.2 1
mouse_G 4.1e-13 40.3 0.0 4.3e-13 40.2 0.0 1.0 1
rabbit_L 6.1e-13 40.0 2.8 8.1e-13 39.6 2.8 1.2 1
rat_K 3.9e-12 37.4 1.4 4.4e-12 37.2 1.4 1.1 1
pig_L 4.2e-12 37.5 1.0 4.7e-12 37.3 1.0 1.1 1
mouse_K 1.2e-11 35.7 2.6 1.3e-11 35.6 2.6 1.1 1
human_K 2.2e-11 34.8 2.9 3.5e-11 34.2 2.9 1.4 1
mouse_L 1.9e-10 31.8 2.2 3.4e-10 30.9 2.2 1.4 1
rat_L 2.5e-10 31.7 1.2 2.9e-10 31.5 1.2 1.1 1
pig_K 3.2e-10 31.1 1.9 4.5e-10 30.6 1.9 1.3 1
human_G 2.9e-09 27.8 0.8 4.9e-09 27.1 0.8 1.4 1
rabbit_K 2.5e-06 18.4 5.8 4.2e-06 17.7 5.8 1.4 1
//
"""
We therefore get a ranking of the alignments to each chain type.
--------------------------------------------------------------------------------------
Schemes:
Currently implemented schemes:
IMGT
Chothia (IGs only)
Kabat (IGs only)
Martin / Enhanced Chothia (IGs only)
AHo
Wolfguy (IGs only)
Currently recognisable species (chains):
Human (heavy, kappa, lambda, alpha, beta)
Mouse (heavy, kappa, lambda, alpha, beta)
Rat (heavy, kappa, lambda)
Rabbit (heavy, kappa, lambda)
Pig (heavy, kappa, lambda)
Rhesus Monkey (heavy, kappa)
Other species may still be numbered correctly and the chain type recognised but the species be incorrect. e.g. llama VHH.
IMGT - has 128 possible positions for *all* antigen receptor types. These are supposed to be structurally equivalent.
In theory these are supposed to account for all possible positions. However, insertions are possible especially
at CDR3. ANARCI gives the insertion codes as letters. Insertions at CDR3 are placed symmetrically about imgt
positions 111 and 112. e.g. 111-ABCD DCBA-112.
Kabat - is defined for heavy and light chain antibody chains only (in ANARCI). Positions in the two chain types are not
equivalent. Insertions occur at specific positions and can occur in both the framework and the CDRs. They are
annotated from A->Z. e.g 100ABCDEFGH 101.
Chothia - is defined for heavy and light chain antibody chains only (in ANARCI). Numbering in the two chain types are not
equivalent. Insertions occur at specific positions and can occur in both the framework and the CDRs. They are
annotated from A->Z. e.g 100ABCDEFGH 101. It differs to the Kabat scheme by the position it places the insertions
at CDRH1.
Martin - is defined for heavy and light chain antibody chains only. Numbering in the two chain types are not equivalent.
Insertions occur at specific positions and can occur in both the framework and the CDRs. They are annotated from
A->Z. e.g 100ABCDEFGH 101. It differs to the Chothia scheme by the position it places the certain insertions in
the framework. It is also referred to as the enhanced Chothia scheme.
AHo - has 149 possible for *all* antigen receptor types. These are supposed to be structurally equivalent. The AHo
scheme's large number of positions is supposed to account for all possible positions without the need for
specifying insertion positions. In ANARCI, extra residues in the framework may be represented by insertions
although these are unlikely to occur in natural sequences.
Wolfguy - is defined for heavy and light antibody chains. Numbering in the two chain types are not equivalent. Different
regions of the domain are denoted by a range of numbers. Many possible positions in the CDRs mean that insertion
codes are not required. In ANARCI, extra residues in the framework may be represented by insertions
although these are unlikely to occur in natural sequences. The CDRs are numbered in an 'up' and 'down' direction.
The annotations of CDRL1 is defined according to the canonical structure. In ANARCI this is recognised by taking
a sequence similarity to hard coded sequence motifs for different lengths.
-------------------------------------------------------------------------------------