Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs: blastn does search the full database every time #207

Open
colinbrislawn opened this issue Nov 1, 2024 · 4 comments
Open

Docs: blastn does search the full database every time #207

colinbrislawn opened this issue Nov 1, 2024 · 4 comments
Assignees

Comments

@colinbrislawn
Copy link

colinbrislawn commented Nov 1, 2024

Maximum number of hits to keep for each query.
BLAST will choose the first N hits in the reference
database that exceed perc-identity similarity to
query. NOTE: the database is not sorted by
similarity to query, so these are the first N hits
that pass the threshold, not necessarily the top N
hits. [default: 10]

Note that maxaccepts selects the first N
hits with > perc_identity similarity to query, not the top N matches.

I think that's the case for usearch / vsearch, but not for blast

@colinbrislawn
Copy link
Author

colinbrislawn commented Nov 1, 2024

It's hard to share a benchmark in a comment, but:

Using a small max_target_seqs, parameter sweep low values of perc_identity

blastn -query rep_seq/dna-sequences.fasta -db ../dbs/ncbi_16S_db/16S_ribosomal_RNA \
  -outfmt '6' -max_target_seqs 10 -perc_identity ?? > blast_output_??.txt

We would expect poor results for -perc_identity 50 but, the top 10 hits are stable all the way up to 83%

md5sum blast_output_*

afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_50.txt
afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_60.txt
afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_70.txt
afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_80.txt
afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_81.txt
afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_82.txt
f143ffcc108ffdb6a8843d52d9f401bc  blast_output_83.txt

And what's different once we request >83% similar?

git diff --no-index blast_output_50.txt blast_output_83.txt

image

We are missing just 4 hits... all of which are <83% similar.

@colinbrislawn
Copy link
Author

https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_

C. Loop over every sequence in the database, performing the following actions:

This exhaustive search explain why blast has a slow reputation compared to usearch's fail-fast heuristics

@colinbrislawn colinbrislawn changed the title blastn returns sorted results Docs: blastn does search the full database every time Nov 2, 2024
@nbokulich
Copy link
Member

Hi @colinbrislawn ,
Does it search the full database every time? As of a few years ago it did not, and the discrepancy between the BLAST manual (which is a bit ambiguous) and reality was noted in a few places, e.g. see some good references here,
https://doi.org/10.1093/bioinformatics/bty833
https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html

So the parameter description given here is based on this behavior. This might have quietly been changed in the past 6 years, but at the time this was evidently seen by the devs as a feature, not a bug, and such a change probably would have made a bit more noise, but maybe I was just not paying enough attention 😁

Do you have any evidence (e.g., release notes from NCBI?) that this behavior has changed?

@colinbrislawn
Copy link
Author

colinbrislawn commented Nov 3, 2024

Do you have any evidence (e.g., release notes from NCBI?) that this behavior has changed?

None. I'm a surprised as you! Maybe in response to that 2019 paper?

It may not be searching the full database. I first noticed when I went to sort blast results and discovered that the hits were already sorted by something. When I dropped perc_identity to 50% and the top ten hits didn't change, that's when I knew something was up.

edit: My interpretation is that yes, the full database is being searched and -perc_identity is an output filter applied after.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

3 participants