This tool extract variable regions from the 16S rRNA gene.
- Installation
- Simple example
- Mode 1: extract variable regions
- Mode 2: extract from primers
- Notes and advance usage
- Details of implementation
- Options list
git clone https://github.com/AlessioMilanese/extract_regions_16s.git
cd extract_regions_16s
Note: in the following examples we assume that the python script extract_regions
is in the system path.
Pre-requisites:
- python >= 3.6
- cmalign (Infernal)
The expected input is a fasta file with one (or more) 16S sequences, example:
cat my_16S_seq.fasta
>seq1
CAGTATTAGCGGGGATCATCGATCGATTACGATCGAGCTAGC....
>seq2
CAGTATGATCGCGGATCATCGATCGATTACGATCCAGCTAGG....
Running:
extract_regions -i my_16S_seq.fasta
results is:
>seq1__V1
ACACAGCAUGCAUAACACGAGCUAUCGACGACUACGACGGCA
>seq1__V2
CAUCAGUACCGAUCAUCGGAAUCAGCGAGGCAGGCGAAGGCGAGAGAGCAUAC
...
>seq1__V8
ACUACGUGCACGAGUACGUAUACGGACAUCGAUUUACGAGCAGCGA
>seq1__V9
ACAGAUCGGAUUCGAUCGGCAUCGGACCCAUCAGGAGGAUCGUCAAUCAUG
If you don't specify any primer (with -f
and -r
), then all the variable regions will be extracted.
We extract the variable regions defined in Yarza et al. Nat. Rev. Microbiol. (2014):
Variable region | Start | End |
---|---|---|
V1 | 69 | 99 |
V2 | 137 | 242 |
V3 | 433 | 497 |
V4 | 576 | 682 |
V5 | 822 | 879 |
V6 | 986 | 1043 |
V7 | 1117 | 1173 |
V8 | 1243 | 1294 |
V9 | 1435 | 1465 |
These positions refers to the 16S rRNA gene of Escherichia coli str. K-12. The output is a fasta file with 9 times more sequences (one for each variable region) than the input fasta file.
The regions are indicated with __V[1-9]
at the end of the fasta header (see also Simple example).
You can specify which primers to use with -f
and -r
.
For Primer | Start | End | Relative position | Rev Primer | Start | End | Relative position | |
---|---|---|---|---|---|---|---|---|
8F | 8 | 27 | before V1 | 338R | 338 | 355 | after V2 | |
27F | 8 | 27 | before V1 | 519R | 519 | 536 | after V3 | |
68F | 49 | 68 | before V1 | 785R | 785 | 805 | after V4 | |
341F | 341 | 357 | before V3 | 806R | 787 | 806 | after V4 | |
515F | 515 | 533 | before V4 | 907R | 907 | 926 | after V5 | |
967F | 967 | 985 | before V6 | 926R | 907 | 926 | after V5 | |
1237F | 1220 | 1237 | before V8 | 1100R | 1100 | 1115 | after V6 | |
1391R | 1391 | 1407 | after V8 | |||||
1492R | 1492 | 1510 | after V9 |
If you run on the fasta file defined before (my_16S_seq.fasta
with two sequences) using extract_regions -i my_16S_seq.fasta -f 8F -r 338R
you will obtain:
>seq1
AGAGUUUGAUCAUGGCUCAGAUUGAACGCUGGCGGCAGGCCUAACACAUGCAAGUCGAACGGUAACAGGAAGAAGCUUGCUUCUUUGCUGACGAGUGGCGGACGGGUGAGUAAUGUCUGGGAAACUGCCUGAUGGAGGGGGAUAACUACUGGAAACGGUAGCUAAUACCGCAUAACGUCGCAAGACCAAAGAGGGGUACCUUCGGGCCUCUUGCCAUCGGAUGUGCCCAGAUGGGAUUAGCUAGUAGGUGGGGUAACGGCUCACCUAGGCGACGAUCCCUAGCUGGUCUGAGAGGAUGACCAGCCACACUGGAACUGAGACACGGUCCAGA
>seq2
CGAGUUUGAUCAUGGCUCAGAUUGAACGCUGGCGGCAGGCCUAACACAUGCAAGUCGAACGGUAACAGGAAGAAGCUUGCUUCUUUGCUGACGAGUGGCGGACGGGUGAGUAAUGUCUGGGAAACUGCCUGAUGGAGGGGGAUAACUACUGGAAACGGUAGCUAAUACCGCAUAACGUCGCAAGACCAAAGAGGGGUACCUUCGGGCCUCUUGCCAUCGGAUGUGCCCAGAUGGGAUUAGCUAGUAGGUGGGGUAACGGCUCACCUAGGCGACGAUCCCUAGCUGGUCUGAGAGGAUGACCAGCCACACUGGAACUGAGACACGGUCCAGA
Where the header is the same as the original fasta file.
You can specify the exact position that you want to extract with -f
and -r
.
For example, if you want to extract all the sequences of primer 8F
, you can run extract_regions -i my_16S_seq.fasta -f 8 -r 27
, which will result in:
>seq1
AGAGUUUGAUCAUGGCUCAG
>seq2
CGAGUUUGAUCAUGGCUUCAG
Note that the positions are based on the E.coli K12 16S gene.