Error when sequence ID is too long #53

Open
jcmckerral opened this issue Jun 28, 2021 · 7 comments

@jcmckerral

There is a small issue: one of the Biopython functions has a character length limit on sequence IDs, and a more informative error message would be useful. A FASTA ID such as

>SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT

results in a GenBank file that makes PhiSpy fail with the following traceback:

[USERID]$ PhiSpy.py testgenome.gb -o phispyTest
Traceback (most recent call last):
  File "$PATH/anaconda3/bin/PhiSpy.py", line 125, in <module>
    main(sys.argv)
  File "$PATH/anaconda3/bin/PhiSpy.py", line 48, in main
    args_parser.record = PhiSpyModules.SeqioFilter(filter(lambda x: len(x.seq) > args_parser.min_contig_size, SeqIO.parse(handle, "genbank")))
  File "$PATH/anaconda3/lib/python3.8/site-packages/PhiSpyModules/seqio_filter.py", line 33, in __init__
    for n, item in enumerate(content):
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 73, in __next__
    return next(self.records)
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records
    record = self.parse(handle, do_features)
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 499, in parse
    if self.feed(handle, consumer, do_features):
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 465, in feed
    self._feed_first_line(consumer, self.line)
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 1572, in _feed_first_line
    raise ValueError("Did not recognise the LOCUS line layout:\n" + line)
ValueError: Did not recognise the LOCUS line layout:
LOCUS       SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT bp   DNA linear

Changing the ID to

>SEQID_SHORT

resolves the problem.
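
In the LOCUS line quoted in the error above, there is no separate numeric length token between the sequence name and `bp`. A minimal sketch of a pre-check for that symptom follows. It is a whitespace-token heuristic, not Biopython's actual parsing logic, and the function names are hypothetical:

```python
def locus_line_looks_valid(line: str) -> bool:
    """Heuristic: expect 'LOCUS <name> <length> bp|aa ...' as separate tokens.

    Assumption: this only checks the token layout; it does not reproduce
    the rules Biopython applies when it parses a LOCUS line.
    """
    tokens = line.split()
    return (
        len(tokens) >= 4
        and tokens[0] == "LOCUS"
        and tokens[2].isdigit()          # sequence length as its own token
        and tokens[3] in ("bp", "aa")    # unit follows the length
    )


def find_suspect_locus_lines(lines):
    """Return (line_number, text) for each LOCUS line that fails the check."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if line.startswith("LOCUS") and not locus_line_looks_valid(line)
    ]
```

Passing an open file handle for each GenBank file to `find_suspect_locus_lines` would list the records worth checking before PhiSpy is run on a large batch.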

@liaochenlanruo

Traceback (most recent call last):
  File "/home/liu/miniconda3/envs/component/bin/PhiSpy.py", line 10, in <module>
    sys.exit(run())
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/main.py", line 122, in run
    main(sys.argv)
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/main.py", line 44, in main
    args_parser.record = PhiSpyModules.SeqioFilter(filter(lambda x: len(x.seq) > args_parser.min_contig_size, SeqIO.parse(handle, "genbank")))
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/seqio_filter.py", line 33, in __init__
    for n, item in enumerate(content):
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 74, in __next__
    return next(self.records)
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records
    record = self.parse(handle, do_features)
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 499, in parse
    if self.feed(handle, consumer, do_features):
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 465, in feed
    self._feed_first_line(consumer, self.line)
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 1571, in _feed_first_line
    raise ValueError("Did not recognise the LOCUS line layout:\n" + line)
ValueError: Did not recognise the LOCUS line layout:
LOCUS NODE_52_length_15591_cov_14.37480715591 bp DNA linear

@qianxin-kxy

I have also encountered this issue, but I have hundreds of gbk files to process. Is there any way to batch-shorten the IDs in the files?

@ShanlinKe

> I have also encountered this issue, but I have hundreds of gbk files to process, so is there any way to batch shorten the IDs in the files

I met the same issue. Any clues on this?

@linsalrob
Owner

Can you point me to a file where this issue occurs so that I can fix it?

@TSZUoE

TSZUoE commented Jun 27, 2023

Hi, I also had this issue. I initially tried to add the whitespace manually, but that didn't work. My GenBank files were annotated with PROKKA. Re-annotating with PROKKA's --compliant flag fixed the issue for me, as it parses the locus line in a different way.

@ghost

ghost commented Aug 26, 2024

@linsalrob @qianxin-kxy @jcmckerral Thank you. An easy way would be to do this before running:

# replace every space with a pipe (edits the FASTA files in place)
for i in *.fasta; do sed -i -e "s/ /|/g" "${i}"; done
# keep only the text before the first pipe; cut prints to stdout, so save it to a new file
for i in *.fasta; do cut -f 1 -d "|" "${i}" > "${i%.fasta}.short.fa"; done
# all headers are now shortened in the *.short.fa files
Thank you 
Gaurav
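
The commands above cut each header at its first space, so an ID that is long without containing any space stays long. A minimal Python sketch that truncates the ID itself and keeps the shortened IDs unique is shown here. The function names and the default `max_len=20` are hypothetical placeholders, not a limit documented by Biopython or PhiSpy, and any description after the ID is dropped:

```python
def shorten_fasta_ids(lines, max_len=20):
    """Yield FASTA lines with every header ID cut to at most max_len characters.

    Assumption: max_len is an arbitrary illustrative value. Text after the
    first whitespace in a header is discarded.
    """
    used = set()
    for line in lines:
        if not line.startswith(">"):
            yield line                      # sequence lines pass through unchanged
            continue
        parts = line[1:].split()
        seq_id = parts[0] if parts else ""  # ID = text up to the first whitespace
        candidate = seq_id[:max_len]
        n = 1
        while candidate in used:            # add a numeric suffix until the ID is unique
            suffix = f"_{n}"
            candidate = seq_id[: max_len - len(suffix)] + suffix
            n += 1
        used.add(candidate)
        yield f">{candidate}\n"


def shorten_fasta_file(src, dst, max_len=20):
    """Write a copy of the FASTA file src to dst with shortened header IDs."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(shorten_fasta_ids(fin, max_len))
```

Calling `shorten_fasta_file` in a loop over the input files handles a batch. Like the shell commands, this works on the FASTA files, so the GenBank files would still need to be regenerated from the renamed sequences.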

@ghost

ghost commented Aug 26, 2024

@ShanlinKe @TSZUoE see my response in this thread above.

If you have C++ code (a pointer declaration snippet), paste it here and I will do the conversion for it as well.

Thank you
Gaurav
