-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
get_unite_data functions #134
Conversation
…into get_unite
Good evening, Now that we have a plan for the API, I'm ready to dive into implementation. I'm very new to this sort of Python dev, so all feedback is greatly appreciated! Is there a recommended linter / code-style I should use? |
Thanks @colinbrislawn ! Exciting to see this coming together! I have answered some of your questions inline and left a few additional comments. Please just ask questions as they come up and I can answer. One general comment: could you move this code over to a new module, @mikerobeson any chance you could do a code review when the time is ripe? |
I agree. Separate this out into
Yarp! I'll look into this today and throughout the week. Again thank you so much for working on this @colinbrislawn! 🌟 |
OK! I've moved this into a single new file as suggested. I've also renamed and organized the helper functions so they chain together linearly! They now go:
I'm still struggling to import fasta_file = open("/tmp/tmpduape95o/sh_refs_qiime_ver9_97_25.07.2023_dev.fasta", 'r')
qiime2.Artifact.import_data('FeatureData[Sequence]', fasta_file) Error and traceback
|
Hi @colinbrislawn regarding the error you get when trying to import sequences, it looks your branch is very far out of date, this is why this confusing error occurs. Could you please update your branch and try again? |
Hi @colinbrislawn , the edits look good, thanks! I have done some user testing now and it is working like a charm 😉 . However, it looks like there is a problem with version 8.2, which fails to download with all confugurations, due to lowercase characters in the sequences. An easy fix would be to just drop version 8.2 as an option... but this issue could theoretically come up again in the future so this could be an opportunity to already fix this issue. I think that it should be as simple as using I am a little suprised that I did not get this same error with versions 8.3 or 9.0, as I thought that these mixed-case DNA sequences were a common feature of all UNITE releases that contain data pulled from INSDC... but maybe this is something that they have started mofifying on their own more recently? Do you happen to know, @colinbrislawn ? |
I'm glad it's working for you, Nick!
The newest releases have had only normal (
I had this implemented at one point. This PR is so old I've introduced regressions 😿 I'll switch over to this importing model and retest. |
Done. Using I want to limit scope creep, so here's some cool things we shouldn't do (right now)
|
thanks @colinbrislawn ! if this causes a significant slowdown and uppercase is the norm from 8.3 onward, then maybe we just drop support for 8.2 and go back to But otherwise I am happy if you want to just open an issue to save that idea for a rainy day 😁 by the way, looks like the linter is failing now. |
I have gotta get some pre-commit hooks added edit: done. I'll consider opening a new issue for more linting and pre-commit hooks |
Hi @colinbrislawn & @nbokulich. I think I might have suggested to go with |
I like this scope. We can build it when we need it! 🛠️ |
Before we merge, we should pick which test of _unite_get_tgz() to use. |
I just re-ran and tested the latest version of your PR. it all looks good! Actually, I think the download and parsing of the data is quite "fast" despite using This looks ready to go, unless there is something I'm missing. |
looks good to me too, thanks @colinbrislawn and @mikerobeson ! @mikerobeson please merge once you are ready (and the CI tests check out). |
Hey @colinbrislawn Perhaps you can write up a short Community Contribution similar to these: |
Yeah a short community tutorial would be great! @colinbrislawn is this something that you would be interested in putting together? This could be planned to coincide with the next QIIME 2 release (next month I think). I would also be happy to put one together if you and @mikerobeson are not interested, but I don't want to steal your thunder 😉 I put together a few commands when testing out your new action in case you want to use these. I found that in version 9.0 the accession IDs at included in the taxonomy string (something like k__Fungi;...;s__species;sh__R12345). This is fine for the alignment-based classifiers, but really slows down the process of training a Naive Bayes classifier (and hitting a single accession is highly unlikely so this is a waste of time) so the accession IDs should be removed from the taxonomy string. Additionally, I would recommend removing sequences that are not classified to at least class level. qiime rescript get-unite-data --output-dir unite-v9.0-fungi-dynamic-singletonsFalse --p-version 9.0 --p-cluster-id dynamic --p-singletons False --p-taxon-group Fungi qiime taxa filter-seqs --p-exclude Fungi_sp,mycota_sp,mycetes_sp --i-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy.qza --i-sequences unite-v9.0-fungi-dynamic-singletonsFalse/sequences.qza --o-filtered-sequences unite-v9.0-fungi-dynamic-singletonsFalse/sequences-filtered.qza qiime rescript edit-taxonomy --i-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy.qza --o-edited-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy-no-SH.qza --p-search-strings ';sh__.*' --p-replacement-strings '' --p-use-regex qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads unite-v9.0-fungi-dynamic-singletonsFalse/sequences-filtered.qza --i-reference-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy-no-SH.qza --o-classifier unite-v9.0-fungi-dynamic-singletonsFalse/classifier.qza |
That is awesome @nbokulich! That is a great point about the accession labels! Makes me even happier for But yeah, just having a simple explanation at each step would be great! For example:
I've not worked with ITS data in a while, so I'm happy to leave it to you both. I'll Obviously, send any comments and revision ideas your way. :-) |
Something is a little strange... After the merge, get_unite_data() no longer has defaults set for version and taxon_group 2bee235?diff=unified#diff-e1476be0ac3e9aab06544dd8563f72d7b6e38ff952c233a0abb3d213107db243R148 What's up with that? |
Oh, weird. Not sure how that happened. Didi you want to do a minor patch PR? Then I'll merge that. |
Early code to close #123