Encountered problems during dpmd training process #4029
Unanswered
xkxsconfused asked this question in Q&A
-
This is a duplicate of #3215. I haven't found anything wrong in deepmd-kit, so I guess it's a bug in TensorFlow. Are you training your model on CPUs?
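One way to answer the CPU question is to ask TensorFlow which devices it can see. The sketch below is only an illustration: `visible_gpus` is a hypothetical helper name, and it assumes a TensorFlow 2.x install that provides `tf.config.list_physical_devices`.

```python
import importlib.util


def visible_gpus():
    """Return the GPU device names TensorFlow can see.

    Returns None when TensorFlow is not importable in this environment.
    (Hypothetical helper, not part of deepmd-kit.)
    """
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif gpus:
    print("GPUs visible to TensorFlow:", gpus)
else:
    print("No GPU visible; training would run on CPU")
```

Run it inside the same environment that provides `dp`, so that it reports on the TensorFlow build actually used for training.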
-
Dear developers,
I ran into the error shown in the log below while training a model. Training started normally, but the error occurred partway through the run.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From /home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[localhost.localdomain:175892] mca_base_component_repository_open: unable to open mca_btl_openib: librdmacm.so.1: cannot open shared object file: No such file or directory (ignored)
…………
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
…………
WARNING:tensorflow:From /home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/train/trainer.py:1191: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
…………
WARNING:tensorflow:From /home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/train/trainer.py:1191: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
…………
DEEPMD INFO saved checkpoint model.ckpt
DEEPMD INFO batch 510 training time 24.51 s, testing time 2.38 s, total wall time 27.09 s
DEEPMD INFO batch 520 training time 24.97 s, testing time 2.42 s, total wall time 27.40 s
DEEPMD INFO batch 530 training time 24.38 s, testing time 2.45 s, total wall time 26.84 s
DEEPMD INFO batch 540 training time 23.91 s, testing time 2.42 s, total wall time 26.34 s
Traceback (most recent call last):
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1402, in _do_call
return fn(*args)
^^^^^^^^^
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1385, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1478, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected begin[1] in [0, 5], but got 1034915320
[[{{node Slice_1}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/changsk/anaconda3/deepmd/bin/dp", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd_utils/main.py", line 657, in main
deepmd_main(args)
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/entrypoints/main.py", line 74, in main
train_dp(**dict_args)
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/entrypoints/train.py", line 168, in train
_do_work(jdata, run_opt, is_compress)
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
model.train(train_data, valid_data)
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/train/trainer.py", line 722, in train
_, next_train_batch_list = run_sess(
^^^^^^^^^
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/utils/sess.py", line 31, in run_sess
return sess.run(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 972, in run
result = self._run(None, fetches, feed_dict, options_ptr,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1215, in _run
results = self._do_run(handle, final_targets, final_fetches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1395, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1421, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Detected at node 'Slice_1' defined at (most recent call last):
File "/home/changsk/anaconda3/deepmd/bin/dp", line 10, in <module>
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd_utils/main.py", line 657, in main
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/entrypoints/main.py", line 74, in main
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/entrypoints/train.py", line 168, in train
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/entrypoints/train.py", line 280, in _do_work
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/train/trainer.py", line 308, in build
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/train/trainer.py", line 385, in _build_network
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/model/ener.py", line 222, in build
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/model/model.py", line 290, in build_descrpt
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/descriptor/se_a.py", line 673, in build
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/descriptor/se_a.py", line 762, in _pass_filter
Node: 'Slice_1'
Expected begin[1] in [0, 5], but got 1034915320
[[{{node Slice_1}}]]
Original stack trace for 'Slice_1':
File "/home/changsk/anaconda3/deepmd/bin/dp", line 10, in <module>
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd_utils/main.py", line 657, in main
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/entrypoints/main.py", line 74, in main
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/entrypoints/train.py", line 168, in train
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/entrypoints/train.py", line 280, in _do_work
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/train/trainer.py", line 308, in build
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/train/trainer.py", line 385, in _build_network
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/model/ener.py", line 222, in build
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/model/model.py", line 290, in build_descrpt
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/descriptor/se_a.py", line 673, in build
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/deepmd/descriptor/se_a.py", line 762, in _pass_filter
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/ops/array_ops.py", line 1232, in slice
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/ops/gen_array_ops.py", line 9824, in _slice
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/framework/ops.py", line 2657, in _create_op_internal
File "/home/changsk/anaconda3/deepmd/lib/python3.11/site-packages/tensorflow/python/framework/ops.py", line 1161, in from_node_def
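The `begin[1]` value in the error, 1034915320, is far outside the reported valid range [0, 5]. A quick way to inspect such a value is to look at its bit pattern, as sketched below. This is only a diagnostic aid under the assumption that the index was read as a 32-bit integer; `as_float32` is a hypothetical helper, and the result does not by itself identify the cause.

```python
import struct

bad_begin = 1034915320  # value reported in the InvalidArgumentError
valid_max = 5           # upper bound reported in the same message


def as_float32(value):
    """Reinterpret the bits of a 32-bit signed integer as an IEEE-754 float32."""
    return struct.unpack("<f", struct.pack("<i", value))[0]


print(f"within valid range: {0 <= bad_begin <= valid_max}")
print(f"hex bit pattern:    {bad_begin:#010x}")
print(f"read as float32:    {as_float32(bad_begin):.6f}")
```

Read as a float32, the same bits decode to a small ordinary number (about 0.086) rather than anything resembling an atom offset. That is consistent with, but does not confirm, the index having been filled from unrelated memory instead of from the training data.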
input.json
{
    "_comment": " model parameters",
    "model": {
        "type_map": ["C", "H"],
        "descriptor": {
            "type": "se_e2_a",
            "sel": [16, 4],
            "rcut_smth": 0.20,
            "rcut": 7.00,
            "neuron": [25, 50, 100],
            "resnet_dt": false,
            "axis_neuron": 4,
            "seed": 1,
            "_comment": " that's all"
        },
        "fitting_net": {
            "neuron": [240, 240, 240],
            "resnet_dt": true,
            "seed": 1,
            "_comment": " that's all"
        },
        "_comment": " that's all"
    },
}
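Before suspecting the framework, it can help to rule out simple inconsistencies in the descriptor settings. The sketch below is a minimal illustration, not deepmd-kit's own validation: `check_descriptor` is a hypothetical helper, and it assumes only that `sel` should list one neighbor count per entry in `type_map` and that `rcut_smth` should be smaller than `rcut`.

```python
import json

# Descriptor-related keys copied from the input.json above. The trailing comma
# after the "model" block is dropped so that json.loads accepts the fragment.
MODEL_JSON = """
{
    "model": {
        "type_map": ["C", "H"],
        "descriptor": {
            "type": "se_e2_a",
            "sel": [16, 4],
            "rcut_smth": 0.20,
            "rcut": 7.00
        }
    }
}
"""


def check_descriptor(model):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    type_map = model["type_map"]
    descriptor = model["descriptor"]
    sel = descriptor["sel"]

    # One neighbor count is expected per atom type when sel is given as a list.
    if isinstance(sel, list):
        if len(sel) != len(type_map):
            problems.append(
                f"sel has {len(sel)} entries but type_map has {len(type_map)}"
            )
        if any(count <= 0 for count in sel):
            problems.append("sel contains a non-positive neighbor count")

    # Smoothing must start inside the cutoff radius.
    if not 0 <= descriptor["rcut_smth"] < descriptor["rcut"]:
        problems.append("rcut_smth must satisfy 0 <= rcut_smth < rcut")

    return problems


model = json.loads(MODEL_JSON)["model"]
print(check_descriptor(model) or "no obvious inconsistencies found")
```

The settings posted here pass these checks, so they do not point to a configuration mistake of this kind. Passing them says nothing about the training data or the TensorFlow build.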