fastmultigather produces different results when searching against a normal zip-type database and a Rocksdb inverted index database #483
Comments
hi @JiaweiShen1116, I had to dive deeply into this when working on validating gather across the various implementations we have over in https://github.com/ctb/2024-debug-gather-difference. It turns out to be remarkably difficult 😂. See #331 (comment) for an example. The short version is that there may be multiple perfectly valid and equivalently "good" sets of gather matches to a particular metagenome; this is because the order in which gather matches are chosen can be somewhat arbitrary, and that arbitrary choice then affects the downstream results 😅. I'm traveling right now, but I'll give you some code to validate the different sets of gather results when I get back. I am reasonably confident that both sets of results are equally correct (and should give the same taxonomic results, for example) but it would be good to have a way to check! |
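To make the order-dependence concrete, here is a toy sketch of a greedy min-set-cover over plain Python sets. It is not the sourmash or branchwater implementation; `greedy_gather`, the reference names `A` and `B`, and the integer "hashes" are all hypothetical. It only shows that when two references tie on overlap, the tie-break order changes how many unique hashes each match is credited with, while the total covered stays the same.

```python
def greedy_gather(query, refs, order):
    """Toy greedy set cover. Ties in overlap are broken by position in `order`.

    Illustration only: this is NOT how sourmash or branchwater is implemented.
    """
    remaining = set(query)
    picked = []
    while True:
        # max() returns the first maximal element, so `order` decides ties.
        best = max(order, key=lambda name: len(refs[name] & remaining))
        overlap = refs[best] & remaining
        if not overlap:
            break
        picked.append((best, len(overlap)))
        remaining -= overlap
    return picked


# Hypothetical toy data: A and B each share 3 hashes with the query.
query = {1, 2, 3, 4, 5}
refs = {"A": {1, 2, 3}, "B": {3, 4, 5}}

run_a = greedy_gather(query, refs, ["A", "B"])
run_b = greedy_gather(query, refs, ["B", "A"])
print(run_a)
print(run_b)
```

Both runs cover the same five hashes, but the match that happens to be picked second is credited with fewer unique hashes.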
@ctb Thanks for the explanation. I do have another question regarding the Can you give a brief explanation of what this potential_false_negative column is aiming to imply about the matches? Is this column going to be added back to the |
@bluegenes mentioned to me that the potential_false_negative column is not a relevant column in practice, but I do not have a clear memory of why. I thought we'd documented it somewhere. In any case, it is deprecated and not suggested - we won't be adding it into branchwater. |
@ctb Thanks for explaining. Upon further testing, I noticed that searching against For reference, here are some of the gather output csv files of the same sample searched against It seems to be more of a systematic issue with searching against |
sorry, still haven't gotten to writing up detailed evaluation stuff - hope to do so soon! in the meantime, two thoughts - first, the output of the second, one approach to validation (that I will go into in more detail soon, promise) is to use one set of gather output as a picklist for sourmash gather. Basically you would take the fastmultigather CSV and do something like last, you can check out our validation script/repo here, https://github.com/ctb/2024-debug-gather-difference/blob/main/do-compare-gather.py, where we cross-compare results from different gather implementations on a specific data set. There are Reasons why this needs to be done carefully so if it doesn't just work, it's not you, it's my instructions 😭 |
@ctb I have run the zr17780.all.standard.summary.k51.10000.csv However, when I run the |
dipping my toes into this - I can confirm that fastmultigather against rocksdb will return equal results just fine. So that's not intrinsically a problem in the code. More to do here ;) |
hi @JiaweiShen1116 I finally tracked down at least one problem that could be causing what you see 😰 - See #505 for a more detailed bug description. |
Hi @ctb. Yes, I can confirm that I was specifying the |
fantastic - then that at least gives me hope that it was #505 that caused the problem, and not some shiny new bug 😆 . I'll ping you when I release a new version with a fix! |
whoops, did not mean to close that! I've just released sourmash_plugin_branchwater v0.9.10, which contains the fix to #505 that's in #504. If you get a chance to try it out, please let me know! Note that gather works on the unweighted k-mers, so the |
@ctb Hi. I have updated the sourmash plugin to v0.9.10 and run zr17780.all.gather.k51.4000.csv To better understand the issue here, please check the two output files I generated with I also tried to run Please let me know what further information I can provide here to debug this issue. Thank you very much! |
Digging into this a bit, let me say a few things first 😅
I'm not saying there's not a bug, just that I'm not immediately finding one :). |
Hi,
I was comparing the performance of `fastmultigather` when searching against a normal zip-type database and against a `Rocksdb` inverted index database. While running `fastmultigather` with the `Rocksdb` database was much faster than with the normal zip database, I noticed that running with the `Rocksdb` database generated slightly fewer matches in my samples. More specifically, while most of the matches agreed between the two runs, the number of matches with low unique_intersect_bp values differed (e.g., I set `--bp_threshold` at 5000 for `fastmultigather`, and searching against the `Rocksdb` database then produced fewer matches whose unique_intersect_bp values are 5000 than searching against the normal zip database).

I wonder why this happened. Was it due to the different searching algorithms?
Please let me know what further information I can provide for this case to better investigate this issue. Thank you.
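One way to pin down exactly which matches differ between the two runs is to diff the two gather CSVs by match name. The following is a minimal sketch, assuming both files have `match_name` and `unique_intersect_bp` columns (the column names are an assumption and may need adjusting). The helper names are hypothetical, and the rows are made-up toy data, not real results.

```python
import csv
import io


def load_matches(handle, name_col="match_name", bp_col="unique_intersect_bp"):
    """Map match name -> unique_intersect_bp for one gather CSV.

    The column names are assumptions; change them to match the real header.
    """
    return {row[name_col]: int(float(row[bp_col])) for row in csv.DictReader(handle)}


def diff_matches(a, b):
    """Return the matches found only in `a` and only in `b`."""
    only_a = {name: a[name] for name in a.keys() - b.keys()}
    only_b = {name: b[name] for name in b.keys() - a.keys()}
    return only_a, only_b


# Made-up toy CSVs standing in for the zip and RocksDB outputs.
zip_csv = io.StringIO(
    "match_name,unique_intersect_bp\ng1,90000\ng2,5000\ng3,5000\n"
)
rocksdb_csv = io.StringIO(
    "match_name,unique_intersect_bp\ng1,90000\ng2,5000\n"
)

only_zip, only_rocksdb = diff_matches(load_matches(zip_csv), load_matches(rocksdb_csv))
print("only in zip run:", only_zip)
print("only in RocksDB run:", only_rocksdb)
```

With real files, replace the `io.StringIO` objects with `open(path, newline="")` handles. Listing the `unique_intersect_bp` value next to each unshared match makes it easy to see whether the differences cluster at the threshold.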