Fix srun invocation on Great Lakes. #779
Codecov Report
@@           Coverage Diff           @@
##             main     #779   +/-   ##
=======================================
  Coverage   69.11%   69.11%
=======================================
  Files          45       45
  Lines        4348     4348
  Branches     1055      979    -76
=======================================
  Hits         3005     3005
  Misses       1137     1137
  Partials      206      206
Wait, this needs more testing on GPU nodes, where srun configures the appropriate cgroups and mpirun does not. I may need to find a fix for the environment variable issue. Edit: Never mind, I was misinterpreting the output.
Upon further testing with a 4 V100 GPU job (for which Slurm assigned 3 GPUs on one node and 1 GPU on another), I found that
fcd7444 to 0692cdc
This works on both V100 and A100 nodes on Great Lakes.
@b-butler This PR now includes both fixes to the Great Lakes environment necessary to make the template functional for the next release.
This formulation works only with homogeneous jobs. How do I call I did not fix
Due to planned refactoring of directives this makes more sense.
I made some changes to the implementation (moving it to the template) since we are refactoring and nranks will be removed. The code is equivalent.
Description
Switch back to mpirun on Great Lakes.
Motivation and Context
The switch to srun was made in #722. My workflows (hoomd-validation) work correctly with the most recent release of flow, but not with the main branch. When flow forks to srun, somehow my environment variables are all lost. For example, one of the errors I got while testing this was "singularity not found", even though I checked in the shell that singularity was in the PATH and that the filesystem where singularity was stored was accessible.
Strangely, everything works if I call the srun command directly from the shell.
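One way to narrow down this kind of environment loss is to compare the parent's environment with what a child process actually sees. The following is a minimal sketch using only the Python standard library. It is not part of flow, and the helper names child_environment and missing_variables, as well as the launcher prefix, are hypothetical and chosen for illustration.

```python
import json
import os
import subprocess
import sys

# Child-side snippet: print the environment the child actually sees as JSON.
_DUMP_ENV = "import json, os; print(json.dumps(dict(os.environ)))"


def child_environment(launcher=(), env=None):
    """Return the environment seen by a Python child started via `launcher`.

    `launcher` is a hypothetical command prefix (for example a job launcher);
    `env` is forwarded to subprocess.run (None means inherit from the parent).
    """
    result = subprocess.run(
        [*launcher, sys.executable, "-c", _DUMP_ENV],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    # Parse only the last line in case the launcher prints its own messages.
    return json.loads(result.stdout.strip().splitlines()[-1])


def missing_variables(parent, child):
    """Return the sorted names that are set in `parent` but absent in `child`."""
    return sorted(set(parent) - set(child))


if __name__ == "__main__":
    parent = dict(os.environ)
    # Plain fork with no launcher: variables lost on the way to the child.
    print(missing_variables(parent, child_environment()))
    # Simulated loss: the child is started with only PATH set.
    stripped = {"PATH": os.environ.get("PATH", "")}
    print(missing_variables(parent, child_environment(env=stripped)))
```

On a cluster, the same check could be repeated with a launcher prefix such as launcher=("srun", "--ntasks=1") from inside a job, once from an interactive shell and once from the process that submits the work, to see whether the two calls really start from the same environment. That prefix is an assumption for illustration, not the command used in this PR.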
I don't know what causes this behavior, but the expedient solution is to switch back to mpirun, which we have used for a long time on Great Lakes with fewer issues.
Checklist: