Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_unite_data functions #134

Merged
merged 64 commits into from
Nov 13, 2023
Merged

Conversation

colinbrislawn
Copy link
Contributor

Early code to close #123

@colinbrislawn
Copy link
Contributor Author

colinbrislawn commented Oct 7, 2023

Good evening,

Now that we have a plan for the API, I'm ready to dive into implementation.

I'm very new to this sort of Python dev, so all feedback is greatly appreciated!

Is there a recommended linter / code-style I should use?

rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
@nbokulich
Copy link
Collaborator

nbokulich commented Oct 7, 2023

Thanks @colinbrislawn ! Exciting to see this coming together!

I have answered some of your questions inline and left a few additional comments. Please just ask questions as they come up and I can answer.

One general comment: could you move this code over to a new module, get_unite.py? Just to keep the code more organized for the different get-* actions. Thanks!

@mikerobeson any chance you could do a code review when the time is ripe?

@mikerobeson
Copy link
Collaborator

One general comment: could you move this code over to a new module, get_unite.py? Just to keep the code more organized for the different get-* actions. Thanks!

I agree. Separate this out into get_unite.py or get_unite_data.py, or something similar. For example, we have separate modules for:

  • get_data.py : SILVA
  • get_gtdb_data.py: GTDB
  • ncbi.py : GenBank
  • ...

@mikerobeson any chance you could do a code review when the time is ripe?

Yarp! I'll look into this today and throughout the week.

Again thank you so much for working on this @colinbrislawn! 🌟

rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
rescript/get_data.py Outdated Show resolved Hide resolved
@colinbrislawn
Copy link
Contributor Author

colinbrislawn commented Oct 12, 2023

OK! I've moved this into a single new file as suggested.

I've also renamed and organized the helper functions so they chain together linearly! They now go:

doi = _unite_get_doi(version, taxon_group, singletons)
url = _unite_get_url(doi)
tgz = _unite_get_tgz(url, download_path)
...

I'm still struggling to import .fasta files, which makes me think I'm missing something obvious / fundamental about Q2 Artifacts.

fasta_file = open("/tmp/tmpduape95o/sh_refs_qiime_ver9_97_25.07.2023_dev.fasta", 'r')

qiime2.Artifact.import_data('FeatureData[Sequence]', fasta_file)
Error and traceback

qiime2.Artifact.import_data('FeatureData[Sequence]', fasta_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/result.py", line 327, in import_data
    return cls._from_view(type_, view, view_type, provenance_capture,
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/result.py", line 342, in _from_view
    pm = qiime2.sdk.PluginManager()
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 67, in __new__
    self._init(add_plugins=add_plugins)
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 105, in _init
    plugin = entry_point.load()
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2518, in load
    return self.resolve()
  File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2524, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/biouser/bin/github-repos/RESCRIPt/rescript/plugin_setup.py", line 27, in <module>
    from .cross_validate import (evaluate_cross_validate,
  File "/home/biouser/bin/github-repos/RESCRIPt/rescript/cross_validate.py", line 14, in <module>
    from q2_feature_classifier._consensus_assignment import (
ImportError: cannot import name '_consensus_assignments' from 'q2_feature_classifier._consensus_assignment' (/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_feature_classifier/_consensus_assignment.py)

@nbokulich
Copy link
Collaborator

Hi @colinbrislawn regarding the error you get when trying to import sequences, it looks your branch is very far out of date, this is why this confusing error occurs. Could you please update your branch and try again?

@nbokulich
Copy link
Collaborator

Hi @colinbrislawn ,

the edits look good, thanks! I have done some user testing now and it is working like a charm 😉 . However, it looks like there is a problem with version 8.2, which fails to download with all confugurations, due to lowercase characters in the sequences. An easy fix would be to just drop version 8.2 as an option... but this issue could theoretically come up again in the future so this could be an opportunity to already fix this issue. I think that it should be as simple as using MixedCaseDNAFASTAFormat instead of DNAFASTAFormat, but I have not worked with that format before so cannot guarantee. Fortunately, we have the global expert on that format with us, @mikerobeson, in case you run into issues making that switch 😉

I am a little suprised that I did not get this same error with versions 8.3 or 9.0, as I thought that these mixed-case DNA sequences were a common feature of all UNITE releases that contain data pulled from INSDC... but maybe this is something that they have started mofifying on their own more recently? Do you happen to know, @colinbrislawn ?

@colinbrislawn
Copy link
Contributor Author

I'm glad it's working for you, Nick!

I am a little surprised that I did not get this same error with versions 8.3 or 9.0, as I thought that these mixed-case DNA sequences were a common feature of all UNITE releases that contain data pulled from INSDC

The newest releases have had only normal ([ATCG]) DNA, so I suppose that fixes it.

as simple as using MixedCaseDNAFASTAFormat

I had this implemented at one point. This PR is so old I've introduced regressions 😿

I'll switch over to this importing model and retest.

@colinbrislawn
Copy link
Contributor Author

Done. Using MixedCaseDNAFASTAFormat is a little slower when not needed, but it works with 8.2.

I want to limit scope creep, so here's some cool things we shouldn't do (right now)

  • For specific database versions, switch to DNAFASTAFormat because it's faster
  • Support older databases by automatically handling .zip files
  • Refactor UNITE_DOIS from a table to data.frame and add more db metadata to it

@nbokulich
Copy link
Collaborator

nbokulich commented Nov 10, 2023

thanks @colinbrislawn ! if this causes a significant slowdown and uppercase is the norm from 8.3 onward, then maybe we just drop support for 8.2 and go back to DNAFASTAFormat? Or if the switch was at 8.3 then it would also be easy to just specify the format with a conditional.

But otherwise I am happy if you want to just open an issue to save that idea for a rainy day 😁

by the way, looks like the linter is failing now.

@colinbrislawn
Copy link
Contributor Author

colinbrislawn commented Nov 10, 2023

I have gotta get some pre-commit hooks added

edit: done. I'll consider opening a new issue for more linting and pre-commit hooks

@mikerobeson
Copy link
Collaborator

mikerobeson commented Nov 10, 2023

Hi @colinbrislawn & @nbokulich. I think I might have suggested to go with DNAFASTAFormat, as I thought all the recent DBs where upper case. I am not too concerned about runtime, as most users will just have to run this once. But, I suppose we can add a try statement to import as DNAFASTAFormat, and if that fails, use MixedCaseDNAFASTAFormat? This way there is no need to explicitly define a format for each database version? This might also help future proof, in case UNITE goes back to mixed case. Or some other conditional? 🤷 But as Nick said, it might be good enough to not include ver 8.2 for now, and record this as an issue to be fixed for a rainy day and stick with DNAFASTAFormat. But I am fine if we just want to leave it as MixedCaseDNAFASTAFormat and call it good. :-)

@colinbrislawn
Copy link
Contributor Author

record this as an issue to be fixed for a rainy day. :-)

I like this scope. We can build it when we need it! 🛠️

@colinbrislawn
Copy link
Contributor Author

Before we merge, we should pick which test of _unite_get_tgz() to use.

@mikerobeson
Copy link
Collaborator

I just re-ran and tested the latest version of your PR. it all looks good! Actually, I think the download and parsing of the data is quite "fast" despite using MixedCaseDNAFASTAFormat. It took ~3 minutes using --p-version 9.0 --p-cluster-id dynamic --p-singletons. So, in light of that, I think we should just keep everything as you have it, using MixedCaseDNAFASTAFormat.

This looks ready to go, unless there is something I'm missing.

@nbokulich
Copy link
Collaborator

looks good to me too, thanks @colinbrislawn and @mikerobeson ! @mikerobeson please merge once you are ready (and the CI tests check out).

@mikerobeson mikerobeson merged commit 2bee235 into bokulich-lab:master Nov 13, 2023
4 checks passed
@mikerobeson
Copy link
Collaborator

Hey @colinbrislawn Perhaps you can write up a short Community Contribution similar to these:

@nbokulich
Copy link
Collaborator

Yeah a short community tutorial would be great! @colinbrislawn is this something that you would be interested in putting together? This could be planned to coincide with the next QIIME 2 release (next month I think). I would also be happy to put one together if you and @mikerobeson are not interested, but I don't want to steal your thunder 😉

I put together a few commands when testing out your new action in case you want to use these. I found that in version 9.0 the accession IDs at included in the taxonomy string (something like k__Fungi;...;s__species;sh__R12345). This is fine for the alignment-based classifiers, but really slows down the process of training a Naive Bayes classifier (and hitting a single accession is highly unlikely so this is a waste of time) so the accession IDs should be removed from the taxonomy string. Additionally, I would recommend removing sequences that are not classified to at least class level.

qiime rescript get-unite-data --output-dir unite-v9.0-fungi-dynamic-singletonsFalse --p-version 9.0 --p-cluster-id dynamic --p-singletons False --p-taxon-group Fungi

qiime taxa filter-seqs --p-exclude Fungi_sp,mycota_sp,mycetes_sp --i-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy.qza --i-sequences unite-v9.0-fungi-dynamic-singletonsFalse/sequences.qza --o-filtered-sequences unite-v9.0-fungi-dynamic-singletonsFalse/sequences-filtered.qza

qiime rescript edit-taxonomy --i-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy.qza --o-edited-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy-no-SH.qza --p-search-strings ';sh__.*' --p-replacement-strings '' --p-use-regex

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads unite-v9.0-fungi-dynamic-singletonsFalse/sequences-filtered.qza --i-reference-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy-no-SH.qza --o-classifier unite-v9.0-fungi-dynamic-singletonsFalse/classifier.qza

@colinbrislawn colinbrislawn deleted the get_unite branch November 14, 2023 14:04
@mikerobeson
Copy link
Collaborator

mikerobeson commented Nov 14, 2023

That is awesome @nbokulich! That is a great point about the accession labels! Makes me even happier for edit-taxonomy. :-)

But yeah, just having a simple explanation at each step would be great! For example:

  1. bring up the importance of using eukaryotes to ensure we have outgroups for better classification of fungi / not-fungi.
  2. simply mention amplicon region extraction sub-examples (as I've done for GTDB, and RDP tutorial), just point to the other RESCRIPt tutorials on how to do this.
  3. Be sure to list the appropriate UNITE and RESCRIPt bibliography.

I've not worked with ITS data in a while, so I'm happy to leave it to you both. I'll Obviously, send any comments and revision ideas your way. :-)

@colinbrislawn
Copy link
Contributor Author

Something is a little strange...

After the merge, get_unite_data() no longer has defaults set for version and taxon_group

2bee235?diff=unified#diff-e1476be0ac3e9aab06544dd8563f72d7b6e38ff952c233a0abb3d213107db243R148

What's up with that?

@mikerobeson
Copy link
Collaborator

Oh, weird. Not sure how that happened. Didi you want to do a minor patch PR? Then I'll merge that.

@colinbrislawn colinbrislawn restored the get_unite branch November 14, 2023 22:35
@colinbrislawn colinbrislawn mentioned this pull request Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: add action for get_unite_data
3 participants