-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Parallelize classify kraken2 #62
ENH: Parallelize classify kraken2 #62
Conversation
q2_moshpit/kraken2/classification.py
Outdated
seqs, common_args | ||
) -> (Kraken2ReportDirectoryFormat, Kraken2OutputDirectoryFormat): | ||
if isinstance(seqs, MAGSequencesDirFmt): | ||
manifest = None | ||
else: | ||
manifest: Optional[pd.DataFrame] = seqs.manifest.view(pd.DataFrame) | ||
|
||
base_cmd = ["kraken2", *common_args] | ||
base_cmd = ["kraken2", *common_args, '--memory-mapping'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This --memory-mapping
flag prevents the database from being loaded into RAM for every thread. It slows things down compared to loading the database for every thread, but it prevents us from needing to load a ~60GB database into RAM a million times. For now we'll have it here to be safe, but we can definitely be smarter about when we do and do not need it and how we handle the database (/dev/shm for instance).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the action's current default is False (no memory mapping) - do you think it should stay that way with the addition of parallel support? I could see arguments either way.
|
|
Hey @Oddant1, I just invited you to our moshpit GitHub team - if you accept the invite the workflows should self-approve from now on :) |
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #62 +/- ##
==========================================
- Coverage 97.74% 95.84% -1.90%
==========================================
Files 35 33 -2
Lines 2172 2094 -78
==========================================
- Hits 2123 2007 -116
- Misses 49 87 +38 ☔ View full report in Codecov by Sentry. |
|
@Oddant1, quick question as I'm starting to review: I see that the |
@gregcaporaso the threads parameter is how many threads each invocation of the kraken2 binary will use. Our parallelization is basically how many times we will invoke the binary. We should be able to have our parallelizastion sitting on top of theirs without issue. I'm not sure if that will be true for all actions with existing parallel parameters, but it should be for this one I think. |
Passing locally on both a Mac and linux. |
@misialq we're ready for a merge on this, CI is not syncing as previously discussed |
Thanks for all of the work on this @Oddant1, and for doing the review @colinvwood and @cherman2! |
Work on parallelizing
classify_kraken2
in the same manner asassembly_megahit
in bokulich-lab/q2-assembly#46