-
Notifications
You must be signed in to change notification settings - Fork 15
Error in convertGenbank2table.py if /pseudo in genbank #81
Comments
Hi Gregg, I think we could do something to handle this issue, but I don't have any ideas as to what the correct behaviour should be. Being that the broader ITEP infrastructure requires translations to be associated with genes, we would somehow have to predict a translation for that pseudogene. I'm open to ideas/suggestions. |
Random thoughts: We could look for empty protein sequences and call Translate() on the seq object that comes out of the Genbank parsing if it is missing. However, this won't always be a good idea, in some cases the pseudogene would be useful in analysis (e.g. in cases where you're interested in the gene's evolution. If the pseudogene was not called you would use tblastn to find it but the existing annotation removes the need to do that) but in other cases it may not be (if we're only interested in what are likely to be real genes, such as in metabolic modeling, the presence of these genes means the user has to go through and filter them out later. But if the pseudogenes are given translations the user would have to do that anyway). My guess is we would want to add something to the existing annotation to indicate that it was marked as a pseudogene? That way at least if they show up in clusters you could see that they are there and exclude them in post-processing. Thoughts? |
One other comment: I do not know how well ITEP and the tools it depends on (raxml, blast, etc) will behave if you have a stop codon (*) character in the middle of some of your sequences ... |
@mattb112885 It should be pretty straightforward to add an additional column to the database indicating that the translation was predicted rather than read. Should this be an additional column that is displayed when using "db_getGeneinformation.py"? Also, what are your thoughts for the default behaviour when creating metabolic models? Clearly, if the gene was missed by an annotation pipeline, but is a homolog of a metabolic protein (e.g. pyk) wouldn't you want it to be added to the model? Finally, if I were to add this functionality, do you have any idea what other scripts might break in the meantime? |
I believe as long as you add the extra column to the end of the table, the database change itself wouldn't break any existing scripts... however, if you do this, any scripts you modify to require the extra column would no longer work with old versions of the database, and they would require a rebuild (as you know, this takes a lot of time), unless you were careful to check for this case in the scripts. An alternative would be to add something to the existing annotation field (like [MACHINE-TRANSLATED]) to indicate it and just look for that in your scripts. I would exclude these from metabolic models - usually when genes are marked as pseudogenes explicitly, the annotation software thinks there is a change like a frameshift or nonsense mutation that made it no longer functional. |
Dear All,
I was parsing a genbank file (more precisely a gbff file (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/365/GCF_000009365.1_ASM936v1/GCF_000009365.1_ASM936v1_genomic.gbff.gz)) and due to the fact that some genes are annotated as /pseudo there are several errors such as :
WARNING: CDS found with no translation
Qualifiers:
locus_tag ['ABO_RS05435']
inference ['EXISTENCE: similar to AA sequence:RefSeq:WP_011588335.1']
old_locus_tag ['ABO_1051']
codon_start ['1']
pseudo ['']
exception ['unextendable partial coding region']
transl_table ['11']
note ['Derived by automated computational analysis using gene prediction method: Protein Homology.']
with coordinates : complement(<1191967..1192467)
and then
BAD - the gene in the following location did not have a match in the table file!
<1191966:1192467
ATGCCAAAACGGATATTCCGCAAATACCTGCCAACACCCGAACGTATTCGACAAACCAAGTCGTTGAGTTTTCTCGGGGAAGTGCTATCGGACCCAAACCTTTGGCACATTAACCGTCGTTCACTGGCCGGCGCAGCCTTTATCGGTATCTTCTCCGGTCTGTTACCTATCCCCCTGCAAATGGGGTTGGCCGCCCTGTTGGCGGTGCGCTTCCACTGCAACCTGCCGCTTTCGATTATGCTGGTATGGATTTCCAATCCAGTGACTTATGTGCCGATTTTCTATTTCACCTACCGCATTGGCGCCTGGCTGCTGGGAATGCCGCCCCACAGCGGTGAAGGCATCACCGTCGCCTGGTTTGTGGAACAGCTCATCCCATTATGGGTAGGTTCAATGCTATGCGCATTCGGGTTTGGCGGATTAGCGTACATGGCAGTAAAAGTCAGCTGGCGGCTAGCTGTAATTCGAAGTTGGAATCTACGCGCGCACCGGCGCGCACGG
Best
Greg
The text was updated successfully, but these errors were encountered: