Fix srun invocation on Great Lakes. #779

Merged
b-butler merged 11 commits into main from fix/great-lakes-mpi-launcher on Nov 8, 2023

Conversation

joaander
Member

Description

Switch back to mpirun on Great Lakes.

Motivation and Context

This change was made in #722. My workflows (hoomd-validation) work correctly with the most recent release of flow, but not with the main branch. When flow forks to srun, my environment variables are somehow all lost. For example, one of the errors I got while testing was singularity not found, even though I checked in the shell: singularity was in the PATH, and the filesystem where singularity is stored was accessible.

Strangely, everything works if I call the srun command directly from the shell:

srun -n 4  /home/joaander/miniforge3/bin/python /gpfs/accounts/sglotzer_root/sglotzer0/joaander/hoomd-validation/hoomd_validation/project.py exec hard_disk_nvt_gpu agg-88ab2b6305a8b249743764fc16e5c22e

I don't know what causes this behavior, but the expedient solution is to switch back to mpirun, which we have used for a long time on Great Lakes with fewer issues.
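
For reference, one way to observe the symptom is to print the environment each rank receives: called directly, srun forwards the expected entries; forked from flow, they were lost. A minimal sketch (the rank count and the variable inspected are illustrative, not part of this PR):

srun -n 4 bash -c 'echo "rank $SLURM_PROCID: PATH=$PATH"'

The single quotes matter: SLURM_PROCID and PATH are expanded by each rank rather than by the submitting shell.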


@joaander joaander requested review from a team as code owners October 31, 2023 15:38
@joaander joaander requested review from vyasr and SchoeniPhlippsn and removed request for a team October 31, 2023 15:38

codecov bot commented Oct 31, 2023

Codecov Report

Merging #779 (b65d02a) into main (9ba71de) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #779   +/-   ##
=======================================
  Coverage   69.11%   69.11%           
=======================================
  Files          45       45           
  Lines        4348     4348           
  Branches     1055      979   -76     
=======================================
  Hits         3005     3005           
  Misses       1137     1137           
  Partials      206      206           
Files                         Coverage           Δ
flow/environments/umich.py    85.71% <100.00%>   (ø)

@joaander joaander marked this pull request as draft November 1, 2023 14:56
@joaander
Member Author

joaander commented Nov 1, 2023

Wait, this needs more testing on GPU nodes where srun configures the appropriate cgroups and mpirun does not. I may need to find a fix for the environment variable issue.

Edit: Never mind, I was misinterpreting the output. srun and mpirun on Great Lakes both allocate all assigned GPUs to all ranks on a given node. cgroups limit users to the assigned nodes, but do not assign individual GPUs to ranks.
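
As a sanity check, one can inspect which GPUs each rank sees. A sketch, assuming CUDA-style device masking (the rank count is illustrative):

srun -n 4 bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

Without --gpus-per-task, every rank on a node reports the full set of GPUs assigned to that node.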

@joaander joaander marked this pull request as ready for review November 1, 2023 15:05
@joaander joaander marked this pull request as draft November 1, 2023 15:57
@joaander
Member Author

joaander commented Nov 1, 2023

Upon further testing with a 4 V100 GPU job (for which Slurm assigned 3 GPUs to one node and 1 GPU to another), mpirun assigns multiple tasks to the same GPU while srun correctly distributes the tasks. Note: we still need --gpus-per-task, which I will add separately when working on #777.

I found that srun --export=ALL solved the environment variable forwarding issue. I also found that srun buffered stdout/stderr until the job completed, so I also added -u to disable buffering.

srun -u --export=ALL correctly runs hoomd-validation jobs both when hoomd is compiled from source and when it is run from the glotzerlab-software container. I am waiting for one more test on the 4-GPU A100 node.
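
For concreteness, applying these flags to the srun command from the description gives:

srun -u --export=ALL -n 4 /home/joaander/miniforge3/bin/python /gpfs/accounts/sglotzer_root/sglotzer0/joaander/hoomd-validation/hoomd_validation/project.py exec hard_disk_nvt_gpu agg-88ab2b6305a8b249743764fc16e5c22e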

@joaander joaander force-pushed the fix/great-lakes-mpi-launcher branch from fcd7444 to 0692cdc Compare November 1, 2023 16:50
@joaander joaander marked this pull request as ready for review November 1, 2023 16:53
@joaander
Member Author

joaander commented Nov 1, 2023

This works on both V100 and A100 nodes on Great Lakes.

@joaander joaander changed the title from "Use mpirun on Great Lakes." to "Fix srun invocation on Great Lakes." Nov 1, 2023
@joaander
Member Author

joaander commented Nov 2, 2023

@b-butler This PR now includes both fixes to the Great Lakes environment necessary to make the template functional for the next release.

  • Use --cpus-per-task to ensure all processes requested via np are placed on one node.
  • Use --gpus-per-task to allow multi-GPU jobs to correctly distribute tasks to the assigned GPUs (see the sketch at the end of this comment).

This formulation works only with homogeneous jobs. How do I call homogeneous_openmp_mpi_config to ensure this?

I did not fix --cpus-per-task for all environments, as we are planning a more general fix in #777.
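
To make the two flags concrete, here is an illustrative submission-script fragment combining them with the fixed srun invocation (the resource numbers are hypothetical; the actual change lives in the Great Lakes template):

#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --gpus-per-task=1
srun -u --export=ALL python project.py exec <operation> <aggregate-id>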

Member

@b-butler b-butler left a comment


I made some changes to the implementation (moving it to the template) since we are refactoring and nranks will be removed. The code is equivalent.
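
For intuition, a rough sketch of what the template-side invocation could look like (the Jinja variable names below are hypothetical, not flow's actual template API):

{# hypothetical fragment of a Great Lakes submission template #}
srun -u --export=ALL --ntasks={{ np }} --cpus-per-task={{ cpus_per_task }} --gpus-per-task={{ gpus_per_task }} {{ cmd }}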

@b-butler b-butler enabled auto-merge (squash) November 8, 2023 18:51
@b-butler b-butler merged commit 8477337 into main Nov 8, 2023
14 checks passed
@b-butler b-butler deleted the fix/great-lakes-mpi-launcher branch November 8, 2023 19:33
@b-butler b-butler added this to the v0.27.0 milestone Jan 15, 2024