Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: action to build custom Diamond database #103

Merged
merged 32 commits into from
Jan 15, 2024

Conversation

Sann5
Copy link
Contributor

@Sann5 Sann5 commented Nov 27, 2023

Closes #102 (ENH: action to build custom Diamond database)

  • Action to create a DIAMOND formatted reference database from a FASTA input file.
  • Why is this useful? Until now the only functionality supported in Qiime is to download the complete Diamond DB. It might be useful to allow users to create their own reference databases for downstream use.

Run it locally

Assuming you have a working environment, run the following.

  1. First, clone the repo and checkout the PR branch:
git clone [email protected]:bokulich-lab/q2-moshpit.git
gh pr checkout 103
  1. Now let's get you some data (the one from the Diamond tutorial) so you can run the action.
cd wherever_you_want_to_download_the_data_to
wget https://scop.berkeley.edu/downloads/scopeseq-2.07/astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa
mkdir sequences
awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $0}}' astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa > sequences/protein-sequences.fasta
qiime tools import --input-path sequences --output-path sequences.qza --type FeatureData\[ProteinSequence\]
  1. Now we are ready to run the action locally.
qiime moshpit build-diamond-db --i-sequences sequences.qza --o-diamond-db custome_diamond.qza --verbose
  1. Alternatively, you can run the tests for this action (nothing will be downloaded).
cd <you_local_path_to>/q2-moshpit/q2_moshpit/eggnog/tests
pytest -vv -W "ignore" -k build_diamond_db

@Sann5 Sann5 requested a review from misialq November 27, 2023 17:40
@Sann5 Sann5 marked this pull request as draft November 27, 2023 17:40
@Sann5 Sann5 self-assigned this Nov 27, 2023
q2_moshpit/plugin_setup.py Outdated Show resolved Hide resolved
Copy link

codecov bot commented Nov 27, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (0668991) 97.74% compared to head (b337326) 97.70%.
Report is 9 commits behind head on main.

❗ Current head b337326 differs from pull request most recent head ab14978. Consider uploading reports for the commit ab14978 to get more accurate results

Files Patch % Lines
q2_moshpit/eggnog/_utils.py 40.00% 3 Missing ⚠️
q2_moshpit/_utils.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #103      +/-   ##
==========================================
- Coverage   97.74%   97.70%   -0.04%     
==========================================
  Files          35       39       +4     
  Lines        2172     2267      +95     
==========================================
+ Hits         2123     2215      +92     
- Misses         49       52       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Sann5 Sann5 added stat:blocked enhancement New feature or request labels Dec 1, 2023
@Sann5
Copy link
Contributor Author

Sann5 commented Dec 1, 2023

q2_moshpit/_examples.py Outdated Show resolved Hide resolved
@Sann5 Sann5 marked this pull request as ready for review December 5, 2023 13:41
Copy link
Contributor

@misialq misialq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Sann5, here's the first round of code comments. I have not tried running it out yet though - did you get a chance to see what's up with those taxonomy annotations? Do they really get added to the final output?

q2_moshpit/citations.bib Outdated Show resolved Hide resolved
q2_moshpit/eggnog/_method.py Outdated Show resolved Hide resolved
q2_moshpit/eggnog/_method.py Outdated Show resolved Hide resolved
q2_moshpit/eggnog/tests/test_method.py Outdated Show resolved Hide resolved
q2_moshpit/plugin_setup.py Outdated Show resolved Hide resolved
q2_moshpit/plugin_setup.py Outdated Show resolved Hide resolved
q2_moshpit/plugin_setup.py Outdated Show resolved Hide resolved
q2_moshpit/plugin_setup.py Outdated Show resolved Hide resolved
q2_moshpit/eggnog/_method.py Outdated Show resolved Hide resolved
q2_moshpit/eggnog/_method.py Outdated Show resolved Hide resolved
@Sann5 Sann5 requested a review from misialq December 15, 2023 09:48
@Sann5
Copy link
Contributor Author

Sann5 commented Dec 20, 2023

❗️❗️❗️❗️❗️❗️

  • So I checked whether the output diamond database was different when building it with the taxonomy information.... It turns out it's not.
  • More precisely the sequence headers and the sequence themselves are the same when calling diamond getseq diamond_db_with_taxainfo.dmdn than when calling diamond getseq diamond_db.dmdn.
  • It could be that the database still contains some taxa information somewhere and that the diamond getseq method is not accessing/displaying it.
  • I checked the diamond methods (e.g. diamond getseq, diamond makedb) available and none seem to be able to provide this information (whether a diamond database contains taxonomy info).
  • It could be that the sequence header needs to contain an accession ID to be mapped to the taxonomy info. So this is what I will try next.

@Sann5
Copy link
Contributor Author

Sann5 commented Dec 21, 2023

  • I was right. The sequence headers (or IDs in Diamond terminology) are parsed and if found then the taxonomy procedure is triggered. In other words, unless the headers have access numbers the taxonomy thing won't work.
  • Taxonomy info is still not visible when retrieving the sequences via diamond getseq, but the output from running the program suggests that the procedure is, in fact, taking place. This is the output for a set of sequences that contain accession numbers in the sequence headers.
qiime moshpit build-custom-diamond-db --i-seqs sequences.qza --i-taxonomy ncbi_tax_data.qza --o-diamond-db diamond_db_w_taxa.qza --verbose                                                                          
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: diamond makedb --verbose --in /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/90568ca3-e55d-481f-9826-97eb787be9b4/data/protein-sequences.fasta --db /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/q2-DiamondDatabaseDirFmt-jpf19f91/ref_db.dmnd --threads 1 --file-buffer-size 67108864 --taxonmap /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/prot.accession2taxid.gz --taxonnodes /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/nodes.dmp --taxonnames /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/names.dmp

diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 1
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/90568ca3-e55d-481f-9826-97eb787be9b4/data/protein-sequences.fasta
Opening the database file...  [0s]
Loading sequences... Sequences = 4, letters = 843, average length = 210
 [0s]
Masking sequences...  [0s]
Writing sequences...  [0s]
Writing accessions...  [0s]
Hashing sequences...  [0s]
Loading sequences...  [0s]
Writing trailer...  [0s]

Accession parsing rules triggered for database seqids (use --no-parse-seqids to disable):
 UniRef prefix  0
gi|xxx| prefix  0
   xxx| prefix  4
   |xxx suffix  4
   .xxx suffix  0
  :PDB= suffix  0

Loading taxonomy names...  [0.959s]
Loaded taxonomy names for 2544505 taxon ids.
Loading taxonomy mapping file...  [273.777s]
Joining accession mapping...  [1.121s]
Writing taxon id list...  [0s]

Accession parsing rules triggered for mapping file seqids (use --no-parse-seqids to disable):
 UniRef prefix  0
gi|xxx| prefix  0
   xxx| prefix  0
   |xxx suffix  0
   .xxx suffix  1229668180
  :PDB= suffix  0

Building taxonomy nodes...  [0.019s]
3101783 taxonomy nodes processed.
Number of nodes assigned to rank:
no rank           796315
superkingdom      4
kingdom           13
subkingdom        1
superphylum       1
phylum            309
subphylum         31
superclass        6
class             520
subclass          169
infraclass        19
cohort            5
subcohort         3
superorder        57
order             1882
suborder          374
infraorder        135
parvorder         26
superfamily       896
family            10281
subfamily         3254
tribe             2357
subtribe          586
genus             108123
subgenus          1811
section           534
subsection        41
series            9
species group     360
species subgroup  134
species           2083492
subspecies        28923
varietas          9819
forma             682
strain            46154
biotype           17
clade             957
forma specialis   750
genotype          22
isolate           1305
morph             11
pathogroup        5
serogroup         153
serotype          1237
subvariety        0

Closing the input file...  [0s]
Closing the database file...  [0.002s]

Database sequences                   4
Database letters                     843
Accessions in database               4
Entries in accession to taxid file   1229668180
Database accessions mapped to taxid  0
Database sequences mapped to taxid   0
Database hash                        55bd90f89749443d4b0fb107c33a615f
Total time                           277s
  • And this is the output for a set of sequences that do not contain accession numbers in the sequence headers.
qiime moshpit build-custom-diamond-db --i-seqs simple_ids.qza --i-taxonomy ncbi_tax_data.qza --o-diamond-db diamond_db_w_taxa_simpleids.qza --verbose
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: diamond makedb --verbose --in /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/100f6a9f-c9a6-4dfb-8440-a9c310d1d4f2/data/protein-sequences.fasta --db /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/q2-DiamondDatabaseDirFmt-n6apnk1f/ref_db.dmnd --threads 1 --file-buffer-size 67108864 --taxonmap /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/prot.accession2taxid.gz --taxonnodes /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/nodes.dmp --taxonnames /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/names.dmp

diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 1
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/100f6a9f-c9a6-4dfb-8440-a9c310d1d4f2/data/protein-sequences.fasta
Opening the database file...  [0s]
Loading sequences... Sequences = 3, letters = 303, average length = 101
 [0s]
Masking sequences...  [0s]
Writing sequences...  [0s]
Writing accessions...  [0s]
Hashing sequences...  [0s]
Loading sequences...  [0s]
Writing trailer...  [0s]

Accession parsing rules triggered for database seqids (use --no-parse-seqids to disable):
 UniRef prefix  0
gi|xxx| prefix  0
   xxx| prefix  0
   |xxx suffix  0
   .xxx suffix  0
  :PDB= suffix  0

Loading taxonomy names...  [0.978s]
Loaded taxonomy names for 2544505 taxon ids.
Loading taxonomy mapping file...  [267.32s]
Joining accession mapping...  [134.995s]
Writing taxon id list...  [0s]

Accession parsing rules triggered for mapping file seqids (use --no-parse-seqids to disable):
 UniRef prefix  0
gi|xxx| prefix  0
   xxx| prefix  0
   |xxx suffix  0
   .xxx suffix  1229668180
  :PDB= suffix  0

Building taxonomy nodes...  [0.01s]
3101783 taxonomy nodes processed.
Number of nodes assigned to rank:
no rank           796315
superkingdom      4
kingdom           13
subkingdom        1
superphylum       1
phylum            309
subphylum         31
superclass        6
class             520
subclass          169
infraclass        19
cohort            5
subcohort         3
superorder        57
order             1882
suborder          374
infraorder        135
parvorder         26
superfamily       896
family            10281
subfamily         3254
tribe             2357
subtribe          586
genus             108123
subgenus          1811
section           534
subsection        41
series            9
species group     360
species subgroup  134
species           2083492
subspecies        28923
varietas          9819
forma             682
strain            46154
biotype           17
clade             957
forma specialis   750
genotype          22
isolate           1305
morph             11
pathogroup        5
serogroup         153
serotype          1237
subvariety        0

Closing the input file...  [0s]
Closing the database file...  [0.001s]

Database sequences                   3
Database letters                     303
Accessions in database               3
Entries in accession to taxid file   1229668180
Database accessions mapped to taxid  0
Database sequences mapped to taxid   0
Database hash                        78c88ee380db113880e4d2ecac2eb3d6
Total time                           404s
  • Just for reference the output looks very different when building the DB without taxa info (and its much faster).

qiime moshpit build-custom-diamond-db --i-seqs sequences.qza --o-diamond-db diamond_db.qza --verbose
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: diamond makedb --verbose --in /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/90568ca3-e55d-481f-9826-97eb787be9b4/data/protein-sequences.fasta --db /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/q2-DiamondDatabaseDirFmt-81r4s79z/ref_db.dmnd --threads 1 --file-buffer-size 67108864

diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 1
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/90568ca3-e55d-481f-9826-97eb787be9b4/data/protein-sequences.fasta
Opening the database file...  [0s]
Loading sequences... Sequences = 4, letters = 843, average length = 210
 [0s]
Masking sequences...  [0s]
Writing sequences...  [0s]
Hashing sequences...  [0s]
Loading sequences...  [0s]
Writing trailer...  [0s]
Closing the input file...  [0s]
Closing the database file...  [0s]

Database sequences  4
  Database letters  843
     Database hash  55bd90f89749443d4b0fb107c33a615f
        Total time  0.001000s

New files for dbs

New files for dbs in eggnog. Further adjustments.
@Sann5 Sann5 closed this Dec 22, 2023
@Sann5 Sann5 deleted the build_diamond_db branch December 22, 2023 09:45
@Sann5 Sann5 restored the build_diamond_db branch December 22, 2023 09:49
@Sann5 Sann5 reopened this Dec 22, 2023
q2_moshpit/plugin_setup.py Outdated Show resolved Hide resolved
q2_moshpit/plugin_setup.py Outdated Show resolved Hide resolved
@misialq
Copy link
Contributor

misialq commented Jan 3, 2024

Hey @Sann5, thanks for testing out the different options for including the taxonomy annotations. I'm only wondering now, what that really brings if you cannot see any difference when running the getseq command... Maybe we should use both of those databases (with and without taxonomy annotations) to perform some small functional annotation task to see whether there's any difference there?

@misialq
Copy link
Contributor

misialq commented Jan 3, 2024

Also, when you look carefully at the output of both commands you will notice that in neither of those the tax IDs were mapped to your query sequences, which may explain why the getseq command returns the same for both:
case 1:

Database sequences                   4
Database letters                     843
Accessions in database               4
Entries in accession to taxid file   1229668180
Database accessions mapped to taxid  0
Database sequences mapped to taxid   0

case 2:

Database sequences                   3
Database letters                     303
Accessions in database               3
Entries in accession to taxid file   1229668180
Database accessions mapped to taxid  0
Database sequences mapped to taxid   0

What do your database headers look like in both scenarios?

@Sann5
Copy link
Contributor Author

Sann5 commented Jan 11, 2024

Hey @misialq :). So... I tried with sequences whose accession IDs are in fact in the database as we discussed.

  • ✅ The good news is that now the output shows that indeed the sequences were mapped to the NCBI database:
Database sequences                   19
Database letters                     9707
Accessions in database               19
Entries in accession to taxid file   1229668180
Database accessions mapped to taxid  19
Database sequences mapped to taxid   19
Database hash                        1914c45d75c4d85b4dbfab7672dff245
Total time                           285s
  • ❌ The bad news is that the diamond getseq your_custome_diamond_DB.dmnd command still returns the sequences with their original headers (they are not modified by the fact of constructing the database with the taxonomy information). As I mentioned before I think that the taxonomy info might still be there even if the getseq command does not reflect it.

  • ❓ What should I do? Perform some a functional annotation task to see whether there's any difference (as you suggested).

@Sann5 Sann5 requested a review from misialq January 11, 2024 15:52
@misialq
Copy link
Contributor

misialq commented Jan 12, 2024

Hey @Sann5, thanks for investigating! Finally, that makes sense now 😅 What I would suggest is that we just carry on with this PR and in parallel maybe you could test the annotation (as you suggested) when you're working on the tutorials to see what the difference is (if any). How does that sound?

Copy link
Contributor

@misialq misialq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go! 🚀

@Sann5
Copy link
Contributor Author

Sann5 commented Jan 12, 2024

Hey @Sann5, thanks for investigating! Finally, that makes sense now 😅 What I would suggest is that we just carry on with this PR and in parallel maybe you could test the annotation (as you suggested) when you're working on the tutorials to see what the difference is (if any). How does that sound?

Sounds good @misialq, I will do so :) Again I cannot merge this one because the test doesn't pass 😅. Can you do it?

@misialq misialq merged commit 492dc37 into bokulich-lab:main Jan 15, 2024
3 of 5 checks passed
@Sann5 Sann5 deleted the build_diamond_db branch February 27, 2024 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stat:blocked
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: action to build custom Diamond database
2 participants