Clarification: GTDB mixed orientation warning only applies to full length refs? #16

mestaki · 2021-10-19T17:29:06Z

Howdy!

I noticed the disclaimer on the GTDB data page:

Warning: files in this directory are experimental. Many of the reference sequences appear to be in mixed orientations, which currently are not handled well by q2-feature-classifier and may yield misleading results. Use at your own risk.

That is of course a valid concern, but it only applies to the full-length refs, right? The V4 ones initially go through the extract-reads process, which corrects these mixed orientations (as per @nbokulich's note here).
Or... does extract-reads' --p-read-orientation both only search in both orientations, without actually correcting the reads in the output?

Thanks!

nbokulich · 2021-10-20T08:36:25Z

howdy!

yes that's correct — extract reads will orient the sequences (as long as the primers hit the F/RC sequences).
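To illustrate why primer-based extraction ends up orienting the output, here is a minimal sketch in plain Python. It is not the q2-feature-classifier implementation: it assumes exact primer matches (no degenerate bases, no mismatch tolerance), and the function names are made up for this example. The idea is that the forward primer is searched for in the sequence as given and in its reverse complement, and the amplicon is taken from whichever strand matched, so it always comes out in the forward orientation.

```python
# Illustrative sketch only; NOT the q2-feature-classifier implementation.
# Assumes exact primer matches (no degenerate bases, no mismatches).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_oriented(seq: str, f_primer: str, r_primer: str):
    """Return the region between the primers in forward orientation.

    Both the sequence as given and its reverse complement are searched.
    Returns None if the primers are not found on either strand.
    """
    # On the forward strand the reverse primer appears reverse-complemented.
    r_primer_rc = reverse_complement(r_primer)
    for candidate in (seq, reverse_complement(seq)):
        start = candidate.find(f_primer)
        if start == -1:
            continue
        end = candidate.find(r_primer_rc, start + len(f_primer))
        if end == -1:
            continue
        # The slice comes from the strand where the primers hit,
        # so it is in forward orientation whichever way seq was stored.
        return candidate[start + len(f_primer):end]
    return None
```

With a sketch like this, a reference stored in either orientation gives the same extracted region, and a sequence where the primers hit neither strand is dropped rather than passed through unoriented.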

would you like to modify this warning to clarify? Personally I don't see much harm in keeping the "experimental" label (since to my knowledge we have not really tested the GTDB bespoke weights extensively), but it would be good to clarify.

Another future option (for the FL seqs) would be to use RESCRIPt to re-orient the reads.

mestaki · 2021-10-21T02:31:43Z

Sounds good, I added an extra line to this warning in a PR.

As for using RESCRIPt to fix the full-length reads, would that need another set of reference reads to align against? In that case, it may need some benchmarking to fine-tune the alignment parameters, right?
Another alternative approach that Ben suggested some time ago was to create a new database with all reads in both orientations. It would take twice as long but wouldn't need benchmarking or fine-tuning, unless reads can be in reverse, reverse-complement, or some other combination.
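For reference, the possible orientations of a sequence and the "both orientations" idea can be sketched as follows. This is plain Python for illustration only, not part of any QIIME 2 plugin; the `_rc` ID suffix and the function names are hypothetical. A sequence read from the opposite strand shows up as the reverse complement, so that is the case the doubled database covers; a plain reverse or a plain complement would not be covered by it.

```python
# Illustrative sketch only. The "_rc" suffix is a hypothetical naming choice.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orientations(seq: str) -> dict:
    """Return the four ways a sequence can be written."""
    comp = seq.translate(COMPLEMENT)
    return {
        "forward": seq,
        "reverse": seq[::-1],
        "complement": comp,
        "reverse_complement": comp[::-1],
    }


def both_orientations(records: dict) -> dict:
    """Double an {id: sequence} reference by adding each reverse complement."""
    doubled = {}
    for seq_id, seq in records.items():
        doubled[seq_id] = seq
        doubled[seq_id + "_rc"] = orientations(seq)["reverse_complement"]
    return doubled
```

Each added entry would need the same taxonomy label as its original, which is part of why the doubled database is twice the size to train on.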

nbokulich · 2021-10-21T04:23:03Z

I agree, orienting in the same direction might need a little bit of testing to establish a working protocol, but one could use fairly loose %id and coverage settings to re-align against a small reference db of sequences in a known orientation. I am not sure that I would call it benchmarking per se.
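A rough sketch of the idea, in plain Python: score a sequence and its reverse complement against a reference of known orientation, keep whichever strand matches better, and drop the sequence if neither passes a loose threshold. This is not how RESCRIPt does it. Shared k-mer fraction stands in for a real alignment here and is not the same thing as percent identity, and the `k` and `min_shared` defaults are arbitrary placeholders.

```python
# Illustrative sketch only; NOT RESCRIPt's method. Shared k-mer fraction is a
# crude stand-in for alignment identity, and the defaults are placeholders.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> set:
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient(seq: str, reference: str, k: int = 8, min_shared: float = 0.5):
    """Return seq in the reference's orientation.

    Returns None if neither strand shares at least min_shared of its
    k-mers with the reference.
    """
    ref_kmers = kmers(reference, k)
    best, best_score = None, 0.0
    for candidate in (seq, reverse_complement(seq)):
        cand_kmers = kmers(candidate, k)
        if not cand_kmers:
            continue  # sequence shorter than k
        score = len(cand_kmers & ref_kmers) / len(cand_kmers)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= min_shared else None
```

Sequences that come back as None are the ones worth inspecting by hand when settling on a threshold.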

A database in both orientations might actually require more benchmarking in my opinion than attempting to orient all in the same direction, since this could lead to changes in classifier performance.

mestaki · 2021-10-22T04:40:01Z

Gotcha! I didn't realize that would change classifier performance. What do you reckon would be a good starting database and %id? Something like 65% GG at 65% coverage?

nbokulich · 2021-10-22T05:34:48Z

yeah that sounds reasonable... I think that %id is approx what deblur uses for pre-filtering reads, so maybe we can use that as precedent.

mestaki mentioned this issue Oct 21, 2021

Update README to clarify orientation warning specific to V4 #18

Merged
