Please submit all the information below so that we can understand the working environment that is the context for your question.
## Background information
CentOS 8.5 + Slurm 24.05.04 + hwloc-2.11.2 + libevent-2.1.12-stable + PMIx 5.0.3 + UCX 1.17.0 + Open MPI 5.0.6.
```
[user@master public]$ systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)
```
### What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
```
[root@master ~]# ompi_info
Package: Open MPI root@master Distribution
Open MPI: 5.0.6
Open MPI repo revision: v5.0.6
Open MPI release date: Nov 15, 2024
```
### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
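The build method is not stated explicitly in the report. Given the component versions listed under Background information, a from-source build would typically be configured along the following lines; the install prefixes below are placeholders, not the paths actually used on this cluster.

```shell
# Hypothetical configure line for Open MPI 5.0.6 against external hwloc,
# libevent, PMIx, and UCX; every --prefix/--with path here is an assumption.
./configure --prefix=/opt/openmpi-5.0.6 \
    --with-hwloc=/opt/hwloc-2.11.2 \
    --with-libevent=/opt/libevent-2.1.12 \
    --with-pmix=/opt/pmix-5.0.3 \
    --with-ucx=/opt/ucx-1.17.0 \
    --with-slurm
make -j && make install
```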
The submitted batch script (`test.sh`):

```bash
#!/bin/bash
#SBATCH -p compute
#SBATCH --ntasks-per-node=64
#SBATCH --nodelist=master,node01
echo chkpt1
source /home/...
echo chkpt2
source /home/...
echo chkpt3
export PATH=/home/test/cp2k-2024.3/exe/local:$PATH
echo chkpt4
mpirun --mca btl_tcp_if_include ib0 --prefix /usr/local/lib/openmpi -np 128....
```
If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.
### Please describe the system on which you are running
Operating system/version: CentOS 8.5

Computer hardware (from `slurmd -C`):

```
NodeName=node01 CPUs=152 Boards=1 SocketsPerBoard=2 CoresPerSocket=38 ThreadsPerCore=2 RealMemory=257271
```
Network type: `ib0` (two NICs directly connected, no switch). Interface addresses were checked with `ip a` on node01, and the job is submitted with `sbatch test.sh`.
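Because the two `ib0` interfaces are directly connected with no switch, a quick sanity check is to confirm from the master node that the peer address reported in the error below is reachable over that link; the address 172.16.0.193 is taken from the error message, and the same checks would be run in the other direction as well.

```shell
# Which route/interface the kernel would pick to reach the peer address
ip route get 172.16.0.193

# Basic reachability over the IPoIB interface
ping -c 3 -I ib0 172.16.0.193

# firewalld is inactive, but confirm no other packet filtering is loaded
nft list ruleset
```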
### Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

A run on a single node works properly; when the job runs in parallel across the two nodes, the `slurm-100.out` file contains the following error:
```
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
  Local host: master
  PID: 322039
  Message: connect() to 172.16.0.193:1042 failed
  Error: Connection timed out (110)
```
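To see which addresses and ports the TCP BTL actually tries when this timeout occurs, the same launch can be repeated with BTL verbosity enabled; a sketch, where `./app` stands in for the real executable from `test.sh`:

```shell
# Same launch options as test.sh, plus TCP BTL debug output;
# ./app is a placeholder for the real binary.
mpirun --mca btl_tcp_if_include ib0 \
       --mca btl_base_verbose 100 \
       -np 128 ./app
```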