
The DP distillation model (model.pth produced by dp --pt freeze) encounters an MPI error when run with LAMMPS for a system of 32,000 atoms, but works well for a 2-atom system #4421

Closed
Jeremy1189 opened this issue Nov 26, 2024 · 7 comments

@Jeremy1189 commented Nov 26, 2024

Bug summary

The model.pth obtained using the PyTorch backend of DeePMD-kit 3.0.0 encounters an MPI error when run with LAMMPS for a system of 32,000 atoms, but works well for a 2-atom system.
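
For context, model.pth was exported with the PyTorch backend's freeze command named in the title; a minimal sketch, assuming it is run in the training directory (the output name matches the one used in the LAMMPS input below):

# freeze the latest PyTorch checkpoint into a deployable model file
dp --pt freeze -o model.pth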

DeePMD-kit Version

3.0.0

Backend and its version

PyTorch

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[2036,0],0]
Errorcode: 1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
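
The three warnings at the top of the log concern thread tuning rather than the crash itself. A minimal sketch of setting them before launching (the values shown are assumptions; tune per https://deepmd.rtfd.io/parallelism/):

# silence the DeePMD-kit threading warnings; actual values should be tuned
export OMP_NUM_THREADS=1
export DP_INTRA_OP_PARALLELISM_THREADS=1
export DP_INTER_OP_PARALLELISM_THREADS=1
lmp -in in.zbl > out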

Steps to Reproduce

Attachment: 32000atoms_lmp.zip
Run on the Bohrium platform with the following job configuration:
{
  "job_name": "7alloy",
  "command": "lmp -in in.zbl > out",
  "log_file": "run_log",
  "job_type": "container",
  "backward_files": [],
  "project_id": 190380,
  "platform": "ali",
  "disk_size": 200,
  "machine_type": "1 * NVIDIA V100_32g",
  "image_address": "registry.dp.tech/dptech/prod-12166/deepmd-kit-v3-dpgen2-zbl:v3"
}

Further Information, Files, and Links

Example input for the 2-atom system that works well.

LAMMPS input in.zbl:

# LAMMPS input script to calculate the energy of two atoms

units metal
dimension 3
boundary p p p
atom_style atomic

read_data 03.dat

mass 2 180.95 #Ta
mass 4 51.996 #Cr
mass 5 55.845 #Fe
mass 6 58.963 #Ni
mass 1 47.867 #Ti
mass 3 26.982 #Al
mass 7 58.933 #Co

###--------------------Force Field-------------------------------
pair_style deepmd model.pth
pair_coeff * * Ti Ta Al Cr Fe Ni Co
thermo 1
thermo_style custom step pe
run 0 #
variable energy equal pe
print "${energy}" append energies.txt

Data file 03.data:

position data for Lammps generated by PYTHON

2 atoms

7 atom types

-15.800000 15.800000 xlo xhi
-15.800000 15.800000 ylo yhi
-15.800000 15.800000 zlo zhi

Atoms

1 2 0.000000 0.000000 0.000000
2 2 0.000000 0.000000 0.300000

job.json:
{
  "job_name": "7alloycompress",
  "command": "lmp -in in.zbl > out",
  "log_file": "run_log",
  "job_type": "container",
  "backward_files": [],
  "project_id": 190380,
  "platform": "ali",
  "disk_size": 200,
  "machine_type": "1 * NVIDIA V100_32g",
  "image_address": "registry.dp.tech/dptech/prod-12166/deepmd-kit-v3-dpgen2-zbl:v3"
}

@Jeremy1189 Jeremy1189 added the bug label Nov 26, 2024
@wanghan-iapcm (Collaborator) commented:
We do NOT accept Chinese issues. Please translate it properly into English.

@Jeremy1189 (Author) commented:
> We do NOT accept Chinese issues. Please translate it properly into English.

Updated.

@njzjz (Member) commented Nov 26, 2024

Could you post your log.lammps file?

@njzjz (Member) commented Nov 26, 2024

32000 atoms may trigger out-of-memory.
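
One way to confirm whether GPU memory is the limit is to watch usage while the job runs; a minimal sketch using standard nvidia-smi options, assuming interactive access to the compute node:

# poll used vs. total GPU memory once per second during the run
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1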

@Jeremy1189 (Author) commented:
The following files are log.lammps and out_lmp:
out_lmp.txt
log.lammps.txt

Regarding the suggestion that 32,000 atoms may trigger out-of-memory issues: I have not seen any warnings or errors indicating such a problem. In my experience, a V100 32 GB GPU can handle nearly 90,000 atoms without out-of-memory errors when using a TensorFlow backend model.

@anyangml (Collaborator) commented:
My suggestion is to test on a smaller system to check the correctness of the algorithm.

@Jeremy1189 (Author) commented:
Thanks @njzjz @anyangml, you are right. The problem was solved by parallelizing the run across 4 V100 GPU cards.
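
For reference, a minimal sketch of such a multi-GPU launch, assuming one MPI rank per V100:

# 4 MPI ranks, one per GPU
mpirun -np 4 lmp -in in.zbl > out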
