Background information
I'm trying to start an MPI program on a particular subset of resources within an allocation (or locally; the results are the same). I'm mapping with a rankfile and binding the MPI processes to a specific set of CPUs that way. My software is set up to specify logical CPUs, which are hwthreads on my 8C16T laptop.
My test suite includes OpenMPI 3.1.6 and 4.1.6, which both work fine, but I can't get OpenMPI 5.0.x to bind to hwthreads: it insists on interpreting the numbers in the rankfile as core ids. As a result, everything runs fine, but in the wrong place.
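For context, my software generates the rankfile from per-rank lists of logical CPU ids, roughly along these lines (a simplified sketch; write_rankfile and the hard-coded CPU lists are illustrative, not my actual code):

# Simplified sketch: write an Open MPI rankfile from per-rank lists of
# logical CPU (hwthread) ids. The slot numbers are meant to be hwthreads.
def write_rankfile(path, host, cpus_per_rank):
    with open(path, 'w') as f:
        for rank, cpus in enumerate(cpus_per_rank):
            f.write(f'rank {rank}={host} slot={",".join(map(str, cpus))}\n')

# Two ranks with two hwthreads each; matches the rankfile used below.
write_rankfile('rankfile', 'localhost', [[0, 1], [2, 3]])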
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
I tested 5.0.1, 5.0.3 and 5.0.5, which don't work, and 4.1.6 and 3.1.6, which do. Versions 5.0.x use hwloc 2.11.1, while the older ones use hwloc 1.11.13.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
I used Spack v0.22 and v0.23, with SLURM support and external PMIx, hwloc and libevent.
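Roughly like this (an illustrative sketch, not my exact site config; the external versions and prefixes are placeholders, except hwloc 2.11.1, which is the version actually in use):

# packages.yaml: point Spack at the system-installed dependencies
packages:
  pmix:
    externals: [{spec: pmix@4.2.2, prefix: /usr}]       # placeholder version
    buildable: false
  hwloc:
    externals: [{spec: hwloc@2.11.1, prefix: /usr}]
    buildable: false
  libevent:
    externals: [{spec: libevent@2.1.12, prefix: /usr}]  # placeholder version
    buildable: false

shell$ spack install openmpi@5.0.5 schedulers=slurm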
Please describe the system on which you are running
Operating system/version:
Kubuntu 22.04
I'm running inside of a Docker container, and can reproduce the problem both in a SLURM allocation and outside of one, but only on 5.0.x.
Computer hardware:
Lenovo P14s, AMD Ryzen 7 PRO 5850U CPU with 8 cores and 16 hwthreads
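For reference, the core-to-hwthread numbering can be checked with a short script like this (a minimal sketch reading the standard Linux sysfs topology files); on this machine, core k holds hwthreads 2k and 2k+1:

# Print which hwthreads (logical CPUs) the kernel assigns to each core,
# based on the sysfs topology files.
import glob

cores = {}
for path in sorted(glob.glob('/sys/devices/system/cpu/cpu[0-9]*')):
    cpu = int(path.rsplit('cpu', 1)[1])
    with open(path + '/topology/core_id') as f:
        cores.setdefault(int(f.read()), []).append(cpu)

for core_id in sorted(cores):
    print(f'core {core_id}: hwthreads {sorted(cores[core_id])}')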
(output of ompi_info --all omitted because the message is too long)
shell$ mpirun --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc hostname
[headnode:46785] mca: base: component_find: searching NULL for plm components
[headnode:46785] mca: base: find_dyn_components: checking NULL for plm components
[headnode:46785] pmix:mca: base: components_register: registering framework plm components
[headnode:46785] pmix:mca: base: components_register: found loaded component slurm
[headnode:46785] pmix:mca: base: components_register: component slurm register function successful
[headnode:46785] pmix:mca: base: components_register: found loaded component ssh
[headnode:46785] pmix:mca: base: components_register: component ssh register function successful
[headnode:46785] mca: base: components_open: opening plm components
[headnode:46785] mca: base: components_open: found loaded component slurm
[headnode:46785] mca: base: components_open: component slurm open function successful
[headnode:46785] mca: base: components_open: found loaded component ssh
[headnode:46785] mca: base: components_open: component ssh open function successful
[headnode:46785] mca:base:select: Auto-selecting plm components
[headnode:46785] mca:base:select:( plm) Querying component [slurm]
[headnode:46785] mca:base:select:( plm) Querying component [ssh]
[headnode:46785] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[headnode:46785] mca:base:select:( plm) Query of component [ssh] set priority to 10
[headnode:46785] mca:base:select:( plm) Selected component [ssh]
[headnode:46785] mca: base: close: component slurm closed
[headnode:46785] mca: base: close: unloading component slurm
[headnode:46785] [prterun-headnode-46785@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive start comm
[headnode:46785] mca: base: component_find: searching NULL for rmaps components
[headnode:46785] mca: base: find_dyn_components: checking NULL for rmaps components
[headnode:46785] pmix:mca: base: components_register: registering framework rmaps components
[headnode:46785] pmix:mca: base: components_register: found loaded component ppr
[headnode:46785] pmix:mca: base: components_register: component ppr register function successful
[headnode:46785] pmix:mca: base: components_register: found loaded component rank_file
[headnode:46785] pmix:mca: base: components_register: component rank_file has no register or open function
[headnode:46785] pmix:mca: base: components_register: found loaded component round_robin
[headnode:46785] pmix:mca: base: components_register: component round_robin register function successful
[headnode:46785] pmix:mca: base: components_register: found loaded component seq
[headnode:46785] pmix:mca: base: components_register: component seq register function successful
[headnode:46785] mca: base: components_open: opening rmaps components
[headnode:46785] mca: base: components_open: found loaded component ppr
[headnode:46785] mca: base: components_open: component ppr open function successful
[headnode:46785] mca: base: components_open: found loaded component rank_file
[headnode:46785] mca: base: components_open: found loaded component round_robin
[headnode:46785] mca: base: components_open: component round_robin open function successful
[headnode:46785] mca: base: components_open: found loaded component seq
[headnode:46785] mca: base: components_open: component seq open function successful
[headnode:46785] mca:rmaps:select: checking available component ppr
[headnode:46785] mca:rmaps:select: Querying component [ppr]
[headnode:46785] mca:rmaps:select: checking available component rank_file
[headnode:46785] mca:rmaps:select: Querying component [rank_file]
[headnode:46785] mca:rmaps:select: checking available component round_robin
[headnode:46785] mca:rmaps:select: Querying component [round_robin]
[headnode:46785] mca:rmaps:select: checking available component seq
[headnode:46785] mca:rmaps:select: Querying component [seq]
[headnode:46785] [prterun-headnode-46785@0,0]: Final mapper priorities
[headnode:46785] Mapper: rank_file Priority: 100
[headnode:46785] Mapper: ppr Priority: 90
[headnode:46785] Mapper: seq Priority: 60
[headnode:46785] Mapper: round_robin Priority: 10
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_vm
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_vm creating map
[headnode:46785] [prterun-headnode-46785@0,0] setup:vm: working unmanaged allocation
[headnode:46785] [prterun-headnode-46785@0,0] using default hostfile /opt/spack/opt/spack/linux-ubuntu22.04-zen3/gcc-11.4.0/openmpi-5.0.5-um6gykzrcb4d2xkmsf53ce5eswpj42zz/etc/prte-default-hostfile
====================== ALLOCATED NODES ======================
headnode: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
aliases: headnode
=================================================================
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_vm only HNP in allocation
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setting slots for node headnode by core
====================== ALLOCATED NODES ======================
headnode: slots=8 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: headnode
=================================================================
[headnode:46785] [prterun-headnode-46785@0,0] rmaps:base set policy with ppr:1:node
[headnode:46785] [prterun-headnode-46785@0,0] rmaps:base policy ppr modifiers 1:node provided
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive processing msg
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive job launch command from [prterun-headnode-46785@0,0]
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive adding hosts
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive calling spawn
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive done processing commands
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_job
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_vm
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:setup_vm no new daemons required
[headnode:46785] mca:rmaps: mapping job prterun-headnode-46785@1
====================== ALLOCATED NODES ======================
headnode: slots=8 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: headnode
=================================================================
[headnode:46785] mca:rmaps: setting mapping policies for job prterun-headnode-46785@1 inherit TRUE hwtcpus FALSE
[headnode:46785] [prterun-headnode-46785@0,0] using known nodes
[headnode:46785] [prterun-headnode-46785@0,0] Starting with 1 nodes in list
[headnode:46785] [prterun-headnode-46785@0,0] Filtering thru apps
[headnode:46785] [prterun-headnode-46785@0,0] Retained 1 nodes in list
[headnode:46785] [prterun-headnode-46785@0,0] node headnode has 8 slots available
[headnode:46785] AVAILABLE NODES FOR MAPPING:
[headnode:46785] node: headnode daemon: 0 slots_available: 8
[headnode:46785] setdefaultbinding[366] binding not given - using bycore
[headnode:46785] mca:rmaps:rf: job prterun-headnode-46785@1 not using rankfile policy
[headnode:46785] mca:rmaps:ppr: mapping job prterun-headnode-46785@1 with ppr 1:node
[headnode:46785] mca:rmaps:ppr: job prterun-headnode-46785@1 assigned policy BYNODE:SLOT
[headnode:46785] [prterun-headnode-46785@0,0] using known nodes
[headnode:46785] [prterun-headnode-46785@0,0] Starting with 1 nodes in list
[headnode:46785] [prterun-headnode-46785@0,0] Filtering thru apps
[headnode:46785] [prterun-headnode-46785@0,0] Retained 1 nodes in list
[headnode:46785] [prterun-headnode-46785@0,0] node headnode has 8 slots available
[headnode:46785] AVAILABLE NODES FOR MAPPING:
[headnode:46785] node: headnode daemon: 0 slots_available: 8
[headnode:46785] [prterun-headnode-46785@0,0] get_avail_ncpus: node headnode has 0 procs on it
[headnode:46785] mca:rmaps: compute bindings for job prterun-headnode-46785@1 with policy CORE:IF-SUPPORTED[1007]
[headnode:46785] mca:rmaps: bind [prterun-headnode-46785@1,INVALID] with policy CORE:IF-SUPPORTED
[headnode:46785] [prterun-headnode-46785@0,0] BOUND PROC [prterun-headnode-46785@1,INVALID][headnode] TO package[0][core:0]
[headnode:46785] [prterun-headnode-46785@0,0] complete_setup on job prterun-headnode-46785@1
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:launch_apps for job prterun-headnode-46785@1
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:send launch msg for job prterun-headnode-46785@1
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:launch wiring up iof for job prterun-headnode-46785@1
headnode
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:prted_cmd sending prted_exit commands
[headnode:46785] [prterun-headnode-46785@0,0] plm:base:receive stop comm
[headnode:46785] mca: base: close: component ssh closed
[headnode:46785] mca: base: close: unloading component ssh
Network type:
No special hardware, connections are either local or over TCP between different Docker containers representing fake cluster nodes.
Details of the problem
When I use mpirun to start a program and ask for OpenMPI to map processes using a rankfile and hwthreads, it assigns whole cores to each process instead of individual threads. That is, the slots in the rankfile are always interpreted as core ids, not as hwthread (logical CPU) ids:
shell$ cat rankfile
rank 0=localhost slot=0,1
rank 1=localhost slot=2,3
shell$ mpirun -n 2 -rankfile rankfile python -c 'import os; print(os.sched_getaffinity(0))'
--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.
Deprecated option: rankfile
Corrected option: --map-by rankfile:file=rankfile
We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.
Deprecated option: rankfile
Corrected option: --map-by rankfile:file=rankfile
We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------
{0, 1, 2, 3}
{4, 5, 6, 7}
This is as expected, because OpenMPI interprets the numbers as core ids by default, and cores 0 and 1 map to hwthreads {0, 1} and {2, 3} respectively. So let's use --use-hwthread-cpus to fix that:
shell$ mpirun -n 2 --use-hwthread-cpus --rankfile rankfile python -c 'import os; print(os.sched_getaffinity(0))'
--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.
Deprecated option: rankfile
Corrected option: --map-by rankfile:file=rankfile
We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.
Deprecated option: rankfile
Corrected option: --map-by rankfile:file=rankfile
We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------
{4, 5, 6, 7}
{0, 1, 2, 3}
On pre-5.0.x, the above prints {0, 1} and {2, 3}, but 5.0.x seems to persist in using cores rather than hwthreads. It does complain about me using the old syntax, however, so let's try the newer one:
shell$ mpirun -n 2 --map-by rankfile:file=rankfile:hwtcpus python -c 'import os; print(os.sched_getaffinity(0))'
{0, 1, 2, 3}
{4, 5, 6, 7}
That fixes the warning, but not the problem. Let's see what it's actually doing:
shell$ mpirun -n 2 --map-by rankfile:file=rankfile:hwtcpus -v python -c 'import os; print(os.sched_getaffinity(0))'
--------------------------------------------------------------------------
ERROR: The "map-by" command line option was listed more than once on the command line.
Only one instance of this option is permitted.
Please correct your command line.
--------------------------------------------------------------------------
Okay, I don't think that was supposed to happen. Let's try a different way:
shell$ mpirun -n 2 --map-by rankfile:file=rankfile:hwtcpus --display-map --report-bindings python -c 'import os; print(os.sched_getaffinity(0))'
======================== JOB MAP ========================
Data for JOB prterun-headnode-49764@1 offset 0 Total slots allocated 8
Mapping policy: BYUSER:NOOVERSUBSCRIBE Ranking policy: BYUSER Binding policy: CORE:IF-SUPPORTED
Cpu set: N/A PPR: N/A Cpus-per-rank: N/A Cpu Type: CORE
Data for node: headnode Num slots: 8 Max slots: 0 Num procs: 2
Process jobid: prterun-headnode-49764@1 App: 0 Process rank: 0 Bound: package[0][core:0-1]
Process jobid: prterun-headnode-49764@1 App: 0 Process rank: 1 Bound: package[0][core:2-3]
=============================================================
[headnode:49764] Rank 0 bound to package[0][core:0-1]
[headnode:49764] Rank 1 bound to package[0][core:2-3]
{4, 5, 6, 7}
{0, 1, 2, 3}
Ah, it's trying to bind to cores, maybe that's it?
shell$ mpirun -n 2 --map-by rankfile:file=rankfile:hwtcpus --bind-to hwthread --display-map --report-bindings python -c 'import os; print(os.sched_getaffinity(0))'
======================== JOB MAP ========================
Data for JOB prterun-headnode-50283@1 offset 0 Total slots allocated 8
Mapping policy: BYUSER:NOOVERSUBSCRIBE Ranking policy: BYUSER Binding policy: HWTHREAD
Cpu set: N/A PPR: N/A Cpus-per-rank: N/A Cpu Type: CORE
Data for node: headnode Num slots: 8 Max slots: 0 Num procs: 2
Process jobid: prterun-headnode-50283@1 App: 0 Process rank: 0 Bound: package[0][core:0-1]
Process jobid: prterun-headnode-50283@1 App: 0 Process rank: 1 Bound: package[0][core:2-3]
=============================================================
[headnode:50283] Rank 0 bound to package[0][core:0-1]
[headnode:50283] Rank 1 bound to package[0][core:2-3]
{4, 5, 6, 7}
{0, 1, 2, 3}
Nope. Maybe try the old syntax again?
shell$ mpirun -n 2 --use-hwthread-cpus --rankfile rankfile --bind-to hwthread --display-map --report-bindings python -c 'import os; print(os.sched_getaffinity(0))'
--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.
Deprecated option: rankfile
Corrected option: --map-by rankfile:file=rankfile
We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A deprecated command line option was used.
Deprecated option: rankfile
Corrected option: --map-by rankfile:file=rankfile
We have updated this for you and will proceed. However, this will be treated
as an error in a future release. Please update your command line.
--------------------------------------------------------------------------
======================== JOB MAP ========================
Data for JOB prterun-headnode-51016@1 offset 0 Total slots allocated 16
Mapping policy: BYUSER:NOOVERSUBSCRIBE Ranking policy: BYUSER Binding policy: HWTHREAD
Cpu set: N/A PPR: N/A Cpus-per-rank: N/A Cpu Type: CORE
Data for node: headnode Num slots: 16 Max slots: 0 Num procs: 2
Process jobid: prterun-headnode-51016@1 App: 0 Process rank: 0 Bound: package[0][core:0-1]
Process jobid: prterun-headnode-51016@1 App: 0 Process rank: 1 Bound: package[0][core:2-3]
=============================================================
[headnode:51016] Rank 0 bound to package[0][core:0-1]
[headnode:51016] Rank 1 bound to package[0][core:2-3]
{4, 5, 6, 7}
{0, 1, 2, 3}
That Cpu Type: CORE may be the problem, but how do I convince OpenMPI that I have hwthreads to bind to? And why does it work on earlier versions?
I can't find anything in the 5.0.x docs suggesting that this is intended, so I think it's a bug, either in the code (maybe the ORTE-to-PRRTE switch?) or in the docs. Or perhaps in my brain, if I missed something. At any rate, any help in fixing it would be much appreciated!
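For completeness, a possible stopgap until this is resolved would be to re-apply the intended affinity from inside each process, overriding whatever mpirun computed (a sketch; MY_RANK_CPUS is a hypothetical variable that my own launcher would export per rank, not anything set by OpenMPI):

# Stopgap sketch: re-bind this process to the intended hwthreads.
# MY_RANK_CPUS is a hypothetical per-rank variable, e.g. "0,1";
# it is not provided by Open MPI.
import os

cpus = os.environ.get('MY_RANK_CPUS')
if cpus:
    os.sched_setaffinity(0, {int(c) for c in cpus.split(',')})
print(os.sched_getaffinity(0))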