Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Parallelize classify kraken2 #62

Merged
merged 33 commits into from
Dec 14, 2023

Conversation

Oddant1
Copy link
Contributor

@Oddant1 Oddant1 commented Sep 6, 2023

Work on parallelizing classify_kraken2 in the same manner as assembly_megahit in bokulich-lab/q2-assembly#46

seqs, common_args
) -> (Kraken2ReportDirectoryFormat, Kraken2OutputDirectoryFormat):
if isinstance(seqs, MAGSequencesDirFmt):
manifest = None
else:
manifest: Optional[pd.DataFrame] = seqs.manifest.view(pd.DataFrame)

base_cmd = ["kraken2", *common_args]
base_cmd = ["kraken2", *common_args, '--memory-mapping']
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This --memory-mapping flag prevents the database from being loaded into RAM for every thread. It slows things down compared to loading the database for every thread, but it prevents us from needing to load a ~60GB database into RAM a million times. For now we'll have it here to be safe, but we can definitely be smarter about when we do and do not need it and how we handle the database (/dev/shm for instance).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the action's current default is False (no memory mapping) - do you think it should stay that way with the addition of parallel support? I could see arguments either way.

@Oddant1
Copy link
Contributor Author

Oddant1 commented Sep 21, 2023

Blocked by qiime2/q2-demux#145 and qiime2/q2cli#306

@Oddant1
Copy link
Contributor Author

Oddant1 commented Oct 10, 2023

Blocked by qiime2/qiime2#723

@misialq
Copy link
Contributor

misialq commented Nov 8, 2023

Hey @Oddant1, I just invited you to our moshpit GitHub team - if you accept the invite the workflows should self-approve from now on :)

Copy link

codecov bot commented Nov 8, 2023

Codecov Report

Attention: 40 lines in your changes are missing coverage. Please review.

Comparison is base (0668991) 97.74% compared to head (fe7d2b5) 95.84%.
Report is 7 commits behind head on main.

❗ Current head fe7d2b5 differs from pull request most recent head 7cce468. Consider uploading reports for the commit 7cce468 to get more accurate results

Files Patch % Lines
q2_moshpit/kraken2/classification.py 29.41% 24 Missing ⚠️
q2_moshpit/kraken2/helpers.py 27.27% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #62      +/-   ##
==========================================
- Coverage   97.74%   95.84%   -1.90%     
==========================================
  Files          35       33       -2     
  Lines        2172     2094      -78     
==========================================
- Hits         2123     2007     -116     
- Misses         49       87      +38     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Oddant1
Copy link
Contributor Author

Oddant1 commented Nov 8, 2023

blocked by bokulich-lab/q2-assembly#46

@misialq misialq added the enhancement New feature or request label Nov 27, 2023
@misialq misialq added this to the 2023.11 milestone Nov 27, 2023
@gregcaporaso gregcaporaso self-assigned this Nov 30, 2023
@gregcaporaso
Copy link
Contributor

@Oddant1, quick question as I'm starting to review: I see that the classify-kraken2 action still has a --p-threads parameter. How does that interact with --parallel? Should we be removing the --p-threads (and related parameters) when adding parallel support to these actions?

@misialq misialq changed the title Parallelize classify kraken2 ENH: Parallelize classify kraken2 Dec 8, 2023
@Oddant1
Copy link
Contributor Author

Oddant1 commented Dec 8, 2023

@gregcaporaso the threads parameter is how many threads each invocation of the kraken2 binary will use. Our parallelization is basically how many times we will invoke the binary. We should be able to have our parallelizastion sitting on top of theirs without issue. I'm not sure if that will be true for all actions with existing parallel parameters, but it should be for this one I think.

@colinvwood
Copy link
Contributor

Passing locally on both a Mac and linux.

@colinvwood
Copy link
Contributor

@misialq we're ready for a merge on this, CI is not syncing as previously discussed

@misialq misialq merged commit 9180084 into bokulich-lab:main Dec 14, 2023
3 of 5 checks passed
@gregcaporaso
Copy link
Contributor

Thanks for all of the work on this @Oddant1, and for doing the review @colinvwood and @cherman2!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stat:blocked
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants