
The DP distillation model (model.pth produced by dp --pt freeze) encounters an MPI error when run with LAMMPS for a system of 32,000 atoms, but works well for a 2-atom system #4421

Closed
Jeremy1189 opened this issue Nov 26, 2024 · 7 comments

@Jeremy1189 commented Nov 26, 2024

Bug summary

The model.pth obtained using the PyTorch backend of DeePMD-kit 3.0.0 encounters an MPI error when run with LAMMPS for a system of 32,000 atoms, but works well for a 2-atom system.
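
For context, model.pth was exported with the PyTorch backend's freeze command named in the title; a minimal sketch, assuming it is run in the training directory (the output name matches the one used in the LAMMPS input below):

# freeze the latest PyTorch checkpoint into a deployable model file
dp --pt freeze -o model.pth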

DeePMD-kit Version

3.0.0

Backend and its version

PyTorch

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[2036,0],0]
Errorcode: 1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
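
The three warnings at the top of the log concern thread tuning rather than the crash itself. A minimal sketch of setting them before launching (the values shown are assumptions; tune per https://deepmd.rtfd.io/parallelism/):

# silence the DeePMD-kit threading warnings; actual values should be tuned
export OMP_NUM_THREADS=1
export DP_INTRA_OP_PARALLELISM_THREADS=1
export DP_INTER_OP_PARALLELISM_THREADS=1
lmp -in in.zbl > out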

Steps to Reproduce

Attachment: 32000atoms_lmp.zip
Run on the Bohrium platform with the following job configuration:
{
  "job_name": "7alloy",
  "command": "lmp -in in.zbl > out",
  "log_file": "run_log",
  "job_type": "container",
  "backward_files": [],
  "project_id": 190380,
  "platform": "ali",
  "disk_size": 200,
  "machine_type": "1 * NVIDIA V100_32g",
  "image_address": "registry.dp.tech/dptech/prod-12166/deepmd-kit-v3-dpgen2-zbl:v3"
}

Further Information, Files, and Links

Example input for the 2-atom system that works well.

LAMMPS input in.zbl:

# LAMMPS input script to calculate the energy of two atoms

units metal
dimension 3
boundary p p p
atom_style atomic

read_data 03.dat

mass 2 180.95 #Ta
mass 4 51.996 #Cr
mass 5 55.845 #Fe
mass 6 58.963 #Ni
mass 1 47.867 #Ti
mass 3 26.982 #Al
mass 7 58.933 #Co

###--------------------Force Field-------------------------------
pair_style deepmd model.pth
pair_coeff * * Ti Ta Al Cr Fe Ni Co
thermo 1
thermo_style custom step pe
run 0 #
variable energy equal pe
print "${energy}" append energies.txt

Data file 03.data:

position data for Lammps generated by PYTHON

2 atoms

7 atom types

-15.800000 15.800000 xlo xhi
-15.800000 15.800000 ylo yhi
-15.800000 15.800000 zlo zhi

Atoms

1 2 0.000000 0.000000 0.000000
2 2 0.000000 0.000000 0.300000

job.json:
{
  "job_name": "7alloycompress",
  "command": "lmp -in in.zbl > out",
  "log_file": "run_log",
  "job_type": "container",
  "backward_files": [],
  "project_id": 190380,
  "platform": "ali",
  "disk_size": 200,
  "machine_type": "1 * NVIDIA V100_32g",
  "image_address": "registry.dp.tech/dptech/prod-12166/deepmd-kit-v3-dpgen2-zbl:v3"
}

@Jeremy1189 Jeremy1189 added the bug label Nov 26, 2024
@wanghan-iapcm (Collaborator) commented:
We do NOT accept Chinese issues. Please translate it properly into English.

@Jeremy1189 (Author) commented:
> We do NOT accept Chinese issues. Please translate it properly into English.

Updated.

@njzjz (Member) commented Nov 26, 2024

Could you post your log.lammps file?

@njzjz (Member) commented Nov 26, 2024

32000 atoms may trigger out-of-memory.
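
One way to confirm whether GPU memory is the limit is to watch usage while the job runs; a minimal sketch using standard nvidia-smi options, assuming interactive access to the compute node:

# poll used vs. total GPU memory once per second during the run
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1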

@Jeremy1189 (Author) commented:
The following files are log.lammps and out_lmp:
out_lmp.txt
log.lammps.txt

Regarding the suggestion that 32,000 atoms may trigger out-of-memory issues: I have not seen any warnings or errors indicating such a problem. In my experience, a V100 32 GB GPU can handle nearly 90,000 atoms without out-of-memory errors when using a TensorFlow backend model.

@anyangml (Collaborator) commented:
My suggestion is to test on a smaller system to check the correctness of the algorithm.

@Jeremy1189 (Author) commented:
Thanks @njzjz @anyangml, you are right. The problem was solved by parallelizing the run across 4 V100 GPU cards.
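
For reference, a minimal sketch of such a multi-GPU launch, assuming one MPI rank per V100:

# 4 MPI ranks, one per GPU
mpirun -np 4 lmp -in in.zbl > out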
