get_unite_data functions #134

colinbrislawn · 2022-04-10T17:03:41Z

Early code to close #123

…into get_unite

colinbrislawn · 2023-10-07T00:37:11Z

Good evening,

Now that we have a plan for the API, I'm ready to dive into implementation.

I'm very new to this sort of Python dev, so all feedback is greatly appreciated!

Is there a recommended linter / code-style I should use?

rescript/get_data.py

nbokulich · 2023-10-07T06:17:49Z

Thanks @colinbrislawn ! Exciting to see this coming together!

I have answered some of your questions inline and left a few additional comments. Please just ask questions as they come up and I can answer.

One general comment: could you move this code over to a new module, get_unite.py? Just to keep the code more organized for the different get-* actions. Thanks!

@mikerobeson any chance you could do a code review when the time is ripe?

mikerobeson · 2023-10-09T14:54:21Z

One general comment: could you move this code over to a new module, get_unite.py? Just to keep the code more organized for the different get-* actions. Thanks!

I agree. Separate this out into get_unite.py or get_unite_data.py, or something similar. For example, we have separate modules for:

get_data.py : SILVA
get_gtdb_data.py: GTDB
ncbi.py : GenBank
...

@mikerobeson any chance you could do a code review when the time is ripe?

Yarp! I'll look into this today and throughout the week.

Again thank you so much for working on this @colinbrislawn! 🌟

rescript/get_data.py

colinbrislawn · 2023-10-12T01:37:42Z

OK! I've moved this into a single new file as suggested.

I've also renamed and organized the helper functions so they chain together linearly! They now go:

doi = _unite_get_doi(version, taxon_group, singletons)
url = _unite_get_url(doi)
tgz = _unite_get_tgz(url, download_path)
...

I'm still struggling to import .fasta files, which makes me think I'm missing something obvious / fundamental about Q2 Artifacts.

fasta_file = open("/tmp/tmpduape95o/sh_refs_qiime_ver9_97_25.07.2023_dev.fasta", 'r')

qiime2.Artifact.import_data('FeatureData[Sequence]', fasta_file)

Error and traceback

qiime2.Artifact.import_data('FeatureData[Sequence]', fasta_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/result.py", line 327, in import_data
    return cls._from_view(type_, view, view_type, provenance_capture,
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/result.py", line 342, in _from_view
    pm = qiime2.sdk.PluginManager()
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 67, in __new__
    self._init(add_plugins=add_plugins)
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 105, in _init
    plugin = entry_point.load()
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2518, in load
    return self.resolve()
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2524, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/biouser/bin/github-repos/RESCRIPt/rescript/plugin_setup.py", line 27, in <module>
    from .cross_validate import (evaluate_cross_validate,
  File "/home/biouser/bin/github-repos/RESCRIPt/rescript/cross_validate.py", line 14, in <module>
    from q2_feature_classifier._consensus_assignment import (
ImportError: cannot import name '_consensus_assignments' from 'q2_feature_classifier._consensus_assignment' (/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_feature_classifier/_consensus_assignment.py)

nbokulich · 2023-10-12T04:49:11Z

Hi @colinbrislawn regarding the error you get when trying to import sequences, it looks your branch is very far out of date, this is why this confusing error occurs. Could you please update your branch and try again?

rescript/tests/test_get_unite.py

…into get_unite

nbokulich · 2023-11-10T12:20:28Z

Hi @colinbrislawn ,

the edits look good, thanks! I have done some user testing now and it is working like a charm 😉 . However, it looks like there is a problem with version 8.2, which fails to download with all confugurations, due to lowercase characters in the sequences. An easy fix would be to just drop version 8.2 as an option... but this issue could theoretically come up again in the future so this could be an opportunity to already fix this issue. I think that it should be as simple as using MixedCaseDNAFASTAFormat instead of DNAFASTAFormat, but I have not worked with that format before so cannot guarantee. Fortunately, we have the global expert on that format with us, @mikerobeson, in case you run into issues making that switch 😉

I am a little suprised that I did not get this same error with versions 8.3 or 9.0, as I thought that these mixed-case DNA sequences were a common feature of all UNITE releases that contain data pulled from INSDC... but maybe this is something that they have started mofifying on their own more recently? Do you happen to know, @colinbrislawn ?

colinbrislawn · 2023-11-10T13:41:19Z

I'm glad it's working for you, Nick!

I am a little surprised that I did not get this same error with versions 8.3 or 9.0, as I thought that these mixed-case DNA sequences were a common feature of all UNITE releases that contain data pulled from INSDC

The newest releases have had only normal ([ATCG]) DNA, so I suppose that fixes it.

as simple as using MixedCaseDNAFASTAFormat

I had this implemented at one point. This PR is so old I've introduced regressions 😿

I'll switch over to this importing model and retest.

colinbrislawn · 2023-11-10T14:27:33Z

Done. Using MixedCaseDNAFASTAFormat is a little slower when not needed, but it works with 8.2.

I want to limit scope creep, so here's some cool things we shouldn't do (right now)

For specific database versions, switch to DNAFASTAFormat because it's faster
Support older databases by automatically handling .zip files
Refactor UNITE_DOIS from a table to data.frame and add more db metadata to it

nbokulich · 2023-11-10T14:49:20Z

thanks @colinbrislawn ! if this causes a significant slowdown and uppercase is the norm from 8.3 onward, then maybe we just drop support for 8.2 and go back to DNAFASTAFormat? Or if the switch was at 8.3 then it would also be easy to just specify the format with a conditional.

But otherwise I am happy if you want to just open an issue to save that idea for a rainy day 😁

by the way, looks like the linter is failing now.

colinbrislawn · 2023-11-10T14:52:28Z

I have gotta get some pre-commit hooks added

edit: done. I'll consider opening a new issue for more linting and pre-commit hooks

mikerobeson · 2023-11-10T18:16:01Z

Hi @colinbrislawn & @nbokulich. I think I might have suggested to go with DNAFASTAFormat, as I thought all the recent DBs where upper case. I am not too concerned about runtime, as most users will just have to run this once. But, I suppose we can add a try statement to import as DNAFASTAFormat, and if that fails, use MixedCaseDNAFASTAFormat? This way there is no need to explicitly define a format for each database version? This might also help future proof, in case UNITE goes back to mixed case. Or some other conditional? 🤷 But as Nick said, it might be good enough to not include ver 8.2 for now, and record this as an issue to be fixed for a rainy day and stick with DNAFASTAFormat. But I am fine if we just want to leave it as MixedCaseDNAFASTAFormat and call it good. :-)

colinbrislawn · 2023-11-10T18:30:41Z

record this as an issue to be fixed for a rainy day. :-)

I like this scope. We can build it when we need it! 🛠️

colinbrislawn · 2023-11-11T16:20:39Z

Before we merge, we should pick which test of _unite_get_tgz() to use.

rescript/tests/test_get_unite.py

mikerobeson · 2023-11-13T20:36:39Z

I just re-ran and tested the latest version of your PR. it all looks good! Actually, I think the download and parsing of the data is quite "fast" despite using MixedCaseDNAFASTAFormat. It took ~3 minutes using --p-version 9.0 --p-cluster-id dynamic --p-singletons. So, in light of that, I think we should just keep everything as you have it, using MixedCaseDNAFASTAFormat.

This looks ready to go, unless there is something I'm missing.

nbokulich · 2023-11-13T21:13:15Z

looks good to me too, thanks @colinbrislawn and @mikerobeson ! @mikerobeson please merge once you are ready (and the CI tests check out).

mikerobeson · 2023-11-13T22:25:04Z

Hey @colinbrislawn Perhaps you can write up a short Community Contribution similar to these:

GTDB
RDP

nbokulich · 2023-11-14T07:52:47Z

Yeah a short community tutorial would be great! @colinbrislawn is this something that you would be interested in putting together? This could be planned to coincide with the next QIIME 2 release (next month I think). I would also be happy to put one together if you and @mikerobeson are not interested, but I don't want to steal your thunder 😉

I put together a few commands when testing out your new action in case you want to use these. I found that in version 9.0 the accession IDs at included in the taxonomy string (something like k__Fungi;...;s__species;sh__R12345). This is fine for the alignment-based classifiers, but really slows down the process of training a Naive Bayes classifier (and hitting a single accession is highly unlikely so this is a waste of time) so the accession IDs should be removed from the taxonomy string. Additionally, I would recommend removing sequences that are not classified to at least class level.

qiime rescript get-unite-data --output-dir unite-v9.0-fungi-dynamic-singletonsFalse --p-version 9.0 --p-cluster-id dynamic --p-singletons False --p-taxon-group Fungi

qiime taxa filter-seqs --p-exclude Fungi_sp,mycota_sp,mycetes_sp --i-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy.qza --i-sequences unite-v9.0-fungi-dynamic-singletonsFalse/sequences.qza --o-filtered-sequences unite-v9.0-fungi-dynamic-singletonsFalse/sequences-filtered.qza

qiime rescript edit-taxonomy --i-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy.qza --o-edited-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy-no-SH.qza --p-search-strings ';sh__.*' --p-replacement-strings '' --p-use-regex

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads unite-v9.0-fungi-dynamic-singletonsFalse/sequences-filtered.qza --i-reference-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy-no-SH.qza --o-classifier unite-v9.0-fungi-dynamic-singletonsFalse/classifier.qza

mikerobeson · 2023-11-14T15:34:45Z

That is awesome @nbokulich! That is a great point about the accession labels! Makes me even happier for edit-taxonomy. :-)

But yeah, just having a simple explanation at each step would be great! For example:

bring up the importance of using eukaryotes to ensure we have outgroups for better classification of fungi / not-fungi.
simply mention amplicon region extraction sub-examples (as I've done for GTDB, and RDP tutorial), just point to the other RESCRIPt tutorials on how to do this.
Be sure to list the appropriate UNITE and RESCRIPt bibliography.

I've not worked with ITS data in a while, so I'm happy to leave it to you both. I'll Obviously, send any comments and revision ideas your way. :-)

colinbrislawn · 2023-11-14T16:09:14Z

Something is a little strange...

After the merge, get_unite_data() no longer has defaults set for version and taxon_group

2bee235?diff=unified#diff-e1476be0ac3e9aab06544dd8563f72d7b6e38ff952c233a0abb3d213107db243R148

What's up with that?

mikerobeson · 2023-11-14T20:02:21Z

Oh, weird. Not sure how that happened. Didi you want to do a minor patch PR? Then I'll merge that.

colinbrislawn added 7 commits April 10, 2022 13:01

Add _assemble_unite_data_urls and testing code

4e59f17

update unite url lookup

2d17273

add _assemble_unite_data_urls() and DOI dict

91de774

Delete .vscode/settings.json

0c8e1b2

Add draft _unite_download_targz()'

97c23ab

Merge branch 'get_unite' of https://github.com/colinbrislawn/RESCRIPt …

98de886

…into get_unite

Add examples for questions

78414b1

colinbrislawn commented Oct 7, 2023

View reviewed changes