Clarification: GTDB mixed orientation warning only applies to full length refs? #16

Open
mestaki opened this issue Oct 19, 2021 · 5 comments

Comments

@mestaki
Contributor

mestaki commented Oct 19, 2021

Howdy!

I noticed the disclaimer on the GTDB data page:

Warning: files in this directory are experimental. Many of the reference sequences appear to be in mixed orientations, which currently are not handled well by q2-feature-classifier and may yield misleading results. Use at your own risk.

Which is of course a valid concern, but this only applies to the full-length refs, right? The V4 ones go through the extract-reads process first, which corrects these mixed orientations (as per @nbokulich's note here).
Or... does extract-reads' --p-read-orientation both only search in both orientations, without actually re-orienting the reads in the output?

Thanks!

@nbokulich
Collaborator

howdy!

yes that's correct — extract reads will orient the sequences (as long as the primers hit the F/RC sequences).
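
To make the orientation step concrete, here is a minimal sketch in plain Python. It is not the q2-feature-classifier implementation: it assumes exact primer matches with no degenerate bases or mismatch tolerance, and the function names are hypothetical. It only illustrates why a primer hit on the reverse complement yields an output sequence in the forward orientation.

```python
# Illustrative sketch only: exact-match primer search, hypothetical names.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_oriented(seq, fwd_primer, rev_primer):
    """Return the region between the primers, always in forward orientation.

    The sequence is searched as given and then as its reverse complement.
    Returns None if the primers are not found in either orientation.
    """
    # On the forward strand the reverse primer appears reverse-complemented.
    rev_rc = reverse_complement(rev_primer)
    for candidate in (seq, reverse_complement(seq)):
        start = candidate.find(fwd_primer)
        if start == -1:
            continue
        amplicon_start = start + len(fwd_primer)
        end = candidate.find(rev_rc, amplicon_start)
        if end != -1:
            return candidate[amplicon_start:end]
    return None
```

Under these assumptions, a reference stored as its reverse complement produces the same extracted region as the forward copy, and a sequence the primers do not hit is dropped rather than re-oriented.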

would you like to modify this warning to clarify? Personally I don't see much harm in keeping the "experimental" label (since to my knowledge we have not really tested the GTDB bespoke weights extensively), but it would be good to clarify.

Another future option (for the FL seqs) would be to use RESCRIPt to re-orient the reads.

@mestaki
Contributor Author

mestaki commented Oct 21, 2021

Sounds good, I added an extra line to this warning in a PR.

As for using RESCRIPt to fix the full-length reads, would that need another set of reference reads to align against? If so, that may need some benchmarking to fine-tune the alignment parameters, right?
An alternative approach that Ben suggested some time ago was to create a new database with all reads in both orientations. It would take twice as long but wouldn't need benchmarking and fine-tuning. Unless reads can be in reverse, reverse-complement, or some other combination.
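
A rough sketch of what the both-orientations database could look like, in plain Python. The "_rc" ID suffix and the function names are made up for illustration, and only the reverse complement is added; reversed-only or complemented-only sequences would not be covered.

```python
# Illustrative sketch only: the "_rc" suffix is a hypothetical naming choice.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def both_orientations(records):
    """Yield each (id, sequence) pair followed by its reverse complement."""
    for seq_id, seq in records:
        yield seq_id, seq
        yield f"{seq_id}_rc", reverse_complement(seq)
```

The taxonomy file would need a matching entry for every new ID, so both the sequence and taxonomy tables double in size.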

@nbokulich
Collaborator

I agree, orienting in the same direction might need a little bit of testing to establish a working protocol, but one could use fairly loose %id and coverage settings to re-align against a small reference db of sequences in a known orientation. I am not sure that I would call it benchmarking per se.
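
As a toy illustration of orienting against a small reference in a known orientation, the sketch below swaps the alignment for shared k-mer counting, so it is not how RESCRIPt does it. The k-mer size, the min_fraction threshold (a loose stand-in for the %id and coverage settings), and the function names are all assumptions.

```python
# Illustrative sketch only: k-mer sharing stands in for a real alignment.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient(seq, ref_kmers, k=8, min_fraction=0.1):
    """Return seq in the reference orientation, or None if it matches neither.

    ref_kmers is the k-mer set of reference sequences in a known orientation.
    min_fraction is a deliberately loose, hypothetical match threshold.
    """
    fwd = kmers(seq, k)
    rev = kmers(reverse_complement(seq), k)
    fwd_hits = len(fwd & ref_kmers)
    rev_hits = len(rev & ref_kmers)
    # Too short to have any k-mers, or too few shared k-mers either way.
    if not fwd or max(fwd_hits, rev_hits) / len(fwd) < min_fraction:
        return None
    return seq if fwd_hits >= rev_hits else reverse_complement(seq)
```

Sequences that come back as None match the reference in neither orientation and would need to be inspected or discarded separately.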

A database in both orientations might actually require more benchmarking in my opinion than attempting to orient all in the same direction, since this could lead to changes in classifier performance.

@mestaki
Contributor Author

mestaki commented Oct 22, 2021

Gotcha! I didn't realize that would change classifier performance. What do you reckon would be a good starting database and %id? Something like 65% GG at 65% coverage?

@nbokulich
Collaborator

yeah that sounds reasonable... I think that %id is approx what deblur uses for pre-filtering reads, so maybe we can use that as precedent.


2 participants