BERT pretrain example not working #64
Comments
Thanks for reporting this. Let me try to reproduce on my end and get back to you. Could you please let me know what hardware (e.g. CPUs/GPUs) you are trying to run the BERT workloads on?
To apply the patch, please first move the patch file from the cloned GitHub repo into the BERT directory (DeepLearningExamples/TensorFlow2/LanguageModeling/BERT).
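For reference, a minimal sketch of that step, assuming the patch file is named itex-bert.patch (the actual filename shipped with the example may differ):

# Hypothetical patch filename; use the one provided with examples/pretrain_bert.
cp itex-bert.patch DeepLearningExamples/TensorFlow2/LanguageModeling/BERT/
cd DeepLearningExamples/TensorFlow2/LanguageModeling/BERT
git apply itex-bert.patch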
Sure. We are looking at 4x Intel(R) Data Center GPU Max 1100 GPUs, and the CPU is an Intel(R) Xeon(R) Platinum 8480+.
After copying the patch to the right place I think it worked, though I did get this warning. The rest of the errors are the same.
You don't need to worry about it. As the manual explains, by default the command outputs warning messages but still applies the patch.
And when preparing the dataset, you need to manually change the environment variable in the data/create_datasets_from_start.sh script, following the instructions in our README.md file.
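For illustration, a minimal sketch of that edit (the exact target directory is whatever the README prescribes; the local path below is an assumption):

# data/create_datasets_from_start.sh originally hard-codes the Docker path:
#   export BERT_PREP_WORKING_DIR=/workspace/bert_tf2/data
# Point it at a local directory instead:
export BERT_PREP_WORKING_DIR=$(pwd)/data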
I did make that modification, which got me this. Specifically, this is the modified script, and running it gives this error:
Looking more into it, this
should be followed by a
Also, my machine did not have python on the PATH for some reason, just python3, so I changed that in the script. The other issue, ModuleNotFoundError: No module named 'nltk', was solved by installing the missing package. I could make a pull request with the fixes and also try to make it work nicer with oneAPI.
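A quick sketch of the two workarounds above (the sed-based interpreter swap is only one option and assumes GNU sed):

# Install the package behind the ModuleNotFoundError.
pip install nltk
# If only python3 is on the PATH, point the script at it explicitly:
sed -i 's/\bpython /python3 /g' data/create_datasets_from_start.sh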
I am going to keep documenting things as I run into them (hope that's okay), in the hope that this will save someone else time if they hit the same issues. In lines 176 and 211 of the bertPrep.py script there is a reference to something that only exists inside the Docker image, 'python /workspace/bert_tf2/create_pretraining_data.py'. Switching it to create_pretraining_data.py, and verifying that the code runs from a directory that contains that script, fixed this bug.
It does have a weird interaction with TensorFlow, since it spawns a LOT of worker processes, each of which prints a lot to standard out, so I will update once I am sure this actually makes things work smoothly.
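A minimal sketch of that fix (lines 176 and 211 as reported above; doing the replacement with sed, and the file living under BERT/data, are my assumptions):

# Drop the Docker-only path so bertPrep.py invokes the script relative to its own directory
# (python3 rather than python, per the PATH note above).
cd DeepLearningExamples/TensorFlow2/LanguageModeling/BERT/data
sed -i 's|python /workspace/bert_tf2/create_pretraining_data.py|python3 create_pretraining_data.py|g' bertPrep.py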
Ran into a 'Traceback (most recent call last):' error, so I ran pip install absl-py and then pip install --upgrade intel-extension-for-tensorflow[xpu]. Going to see what I can do.
@nevakrien Thanks for your comment about the missing pip packages for dataset preparation; I will update README.md soon. By the way, what is your conflicting absl-py version?
@nevakrien The current hyperparameters in examples/pretrain_bert/README.md are verified on Intel GPU Max 1550.
For more details about hyperparameter settings, I recommend having a look at pretrain_config.sh and squad_config.sh.
The absl issue ended up being a missing pip install intel-optimization-for-horovod. The way I fixed my environment is as follows:
I am now stuck in what seems to be a deadlock; I am not really sure what that's about. The last message is a warning ending with '…values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.' I am going to try running it as a background job and we will see if that magically solves it.
Side note to anyone who accidentally ran pip install horovod on the main oneAPI Python: for me that caused a nasty bug where TensorFlow was broken everywhere with a crash in native code (valgrind traced it to an invalid read). This is solvable by just making new envs; a sketch follows. I am not sure what it's about, since I don't have debug symbols to look into it.
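A minimal sketch of the clean-environment route, assuming a plain Python venv and the packages already mentioned in this thread (versions deliberately left unpinned):

# Start from a fresh venv instead of reusing the oneAPI base Python.
python3 -m venv env_itex
source env_itex/bin/activate
pip install --upgrade pip
pip install --upgrade "intel-extension-for-tensorflow[xpu]" intel-optimization-for-horovod nltk absl-py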
@yitingw1 I am running on 4 of these GPUs, so I think I am fine. I just changed NUM_GPUS to 4.
Hi @nevakrien, can you see them? You can use the script below to test whether there are available XPUs (Intel GPUs):
The expected output is a list of the available XPU devices.
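As a rough stand-in for that check (a sketch, not necessarily the exact script referred to above):

# List the XPU devices that Intel Extension for TensorFlow* exposes to TensorFlow.
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('XPU'))"
# With four Max-series cards visible this should print four entries, e.g.
# [PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU'), ...]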
@yitingw1 Sorry for being slow to answer. Yes, that's the case: I am seeing 4 of them, which is what I would expect. Right now I am on a deadlock issue, or so it seems; I think it's because all 4 MPI ranks look for the same GPU with ID=0.
(env_itex) sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ DATATYPE=bf16
(env_itex) sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ TRAIN_BATCH_SIZE_PHASE1=312
TRAIN_BATCH_SIZE_PHASE2=40
EVAL_BATCH_SIZE=8
LEARNING_RATE_PHASE1=8.12e-4
LEARNING_RATE_PHASE2=5e-4
DATATYPE=$DATATYPE
USE_XLA=false
NUM_GPUS=4
WARMUP_STEPS_PHASE1=810
WARMUP_STEPS_PHASE2=81
TRAIN_STEPS=2600
SAVE_CHECKPOINT_STEPS=100
NUM_ACCUMULATION_STEPS_PHASE1=32
NUM_ACCUMULATION_STEPS_PHASE2=96
BERT_MODEL=large
GBS1=$(expr $TRAIN_BATCH_SIZE_PHASE1 \* $NUM_GPUS \* $NUM_ACCUMULATION_STEPS_PHASE1)
GBS2=$(expr $TRAIN_BATCH_SIZE_PHASE2 \* $NUM_GPUS \* $NUM_ACCUMULATION_STEPS_PHASE2)
PRETRAIN_RESULT_DIR=./results/tf_bert_pretraining_lamb_${BERT_MODEL}_${$DATATYPE}_gbs1_${GBS1}_gbs2_${GBS2}
bash scripts/run_pretraining_lamb.sh ... |& tee pretrain_lamb.log
-bash: ./results/tf_bert_pretraining_lamb_${BERT_MODEL}_${$DATATYPE}_gbs1_${GBS1}_gbs2_${GBS2}: bad substitution
Container nvidia build =
Saving checkpoints to ./results/tf_bert_pretraining_lamb_large_bf16_gbs139936_gbs215360
Logs written to ./results/tf_bert_pretraining_lamb_large_bf16_gbs139936_gbs215360/tf_bert_pretraining_lamb_large_bf16_gbs139936_gbs215360.240313104255.log
+ bash scripts/run_pretraining_lamb_phase1.sh 312 40 8 8.120000e-04 5.000000e-04 bf16 false 4 810 81 2600 100 32 96 large
+ tee -a ./results/tf_bert_pretraining_lamb_large_bf16_gbs139936_gbs215360/tf_bert_pretraining_lamb_large_bf16_gbs139936_gbs215360.240313104255.log
Container nvidia build =
[0] 2024-03-13 10:42:55.966934: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[1] 2024-03-13 10:42:55.966929: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[2] 2024-03-13 10:42:55.985172: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[3] 2024-03-13 10:42:55.985172: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[0] 2024-03-13 10:42:56.006373: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[1] 2024-03-13 10:42:56.006372: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[2] 2024-03-13 10:42:56.006388: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[3] 2024-03-13 10:42:56.006378: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[0] 2024-03-13 10:42:56.197288: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[0] 2024-03-13 10:42:56.197354: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[1] 2024-03-13 10:42:56.197292: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[1] 2024-03-13 10:42:56.197355: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[2] 2024-03-13 10:42:56.197296: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[2] 2024-03-13 10:42:56.197361: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[3] 2024-03-13 10:42:56.197293: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[3] 2024-03-13 10:42:56.197361: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[0] 2024-03-13 10:42:56.198255: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2] 2024-03-13 10:42:56.198263: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[3] 2024-03-13 10:42:56.198262: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[1] 2024-03-13 10:42:56.198270: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[0] 2024-03-13 10:42:56.307379: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[1] 2024-03-13 10:42:56.307378: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[2] 2024-03-13 10:42:56.307362: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[3] 2024-03-13 10:42:56.307352: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[2] 2024-03-13 10:42:56.308179: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[2] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[3] 2024-03-13 10:42:56.308177: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[3] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[0] 2024-03-13 10:42:56.308204: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[0] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1] 2024-03-13 10:42:56.308203: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[1] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[0] 2024-03-13 10:42:57.172249: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[1] 2024-03-13 10:42:57.172249: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2] 2024-03-13 10:42:57.172248: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[3] 2024-03-13 10:42:57.172244: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[0] 2024-03-13 10:42:59.589180: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[1] 2024-03-13 10:42:59.589178: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[2] 2024-03-13 10:42:59.589187: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[3] 2024-03-13 10:42:59.589194: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[0] 2024-03-13 10:42:59.757082: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[1] 2024-03-13 10:42:59.757086: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[2] 2024-03-13 10:42:59.757090: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[3] 2024-03-13 10:42:59.757089: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[0] 2024-03-13 10:42:59.943345: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[1] 2024-03-13 10:42:59.943334: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[2] 2024-03-13 10:42:59.943290: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[2] 2024-03-13 10:42:59.943552: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[2] 2024-03-13 10:42:59.943556: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[2] 2024-03-13 10:42:59.943558: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[2] 2024-03-13 10:42:59.943561: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-13 10:42:59.943599: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-13 10:42:59.943603: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-13 10:42:59.943605: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-13 10:42:59.943607: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-13 10:42:59.943591: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-13 10:42:59.943597: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-13 10:42:59.943600: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-13 10:42:59.943603: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-13 10:42:59.943694: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[3] 2024-03-13 10:42:59.943952: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-13 10:42:59.943959: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-13 10:42:59.943961: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-13 10:42:59.943963: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] /home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/env_itex/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:
[0]
[0] TensorFlow Addons (TFA) has ended development and introduction of new features.
[0] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[0] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP).
[0]
[0] For more information see: https://github.com/tensorflow/addons/issues/2807
[0]
[0] warnings.warn(
[2] /home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/env_itex/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:
[2]
[2] TensorFlow Addons (TFA) has ended development and introduction of new features.
[2] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[2] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP).
[2]
[2] For more information see: https://github.com/tensorflow/addons/issues/2807
[2]
[2] warnings.warn(
[1] /home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/env_itex/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:
[1]
[1] TensorFlow Addons (TFA) has ended development and introduction of new features.
[1] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[1] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP).
[1]
[1] For more information see: https://github.com/tensorflow/addons/issues/2807
[1]
[1] warnings.warn(
[3] /home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/env_itex/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:
[3]
[3] TensorFlow Addons (TFA) has ended development and introduction of new features.
[3] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[3] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP).
[3]
[3] For more information see: https://github.com/tensorflow/addons/issues/2807
[3]
[3] warnings.warn(
[0] 2024-03-13 10:43:02,085 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,085 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-13 10:43:02,088 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,088 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,088 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-13 10:43:02,089 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-13 10:43:02,090 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,090 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,090 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-13 10:43:02,090 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-13 10:43:02,091 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,091 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,091 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,093 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,093 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,093 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,094 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,094 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,096 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,097 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,098 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,098 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,099 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,099 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] I0313 10:43:02.113359 140683523730304 run_pretraining.py:130] init_lr = 0.000812
[0] I0313 10:43:02.114231 139897882577792 run_pretraining.py:130] init_lr = 0.000812
[3] I0313 10:43:02.114217 140047589284736 run_pretraining.py:130] init_lr = 0.000812
[2] I0313 10:43:02.115367 139884497890176 run_pretraining.py:130] init_lr = 0.000812
[1] 2024-03-13 10:43:02.116706: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
[1] 2024-03-13 10:43:02.116745: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
[0] 2024-03-13 10:43:02.117326: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
[0] 2024-03-13 10:43:02.117361: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
[3] 2024-03-13 10:43:02.117379: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
[3] 2024-03-13 10:43:02.117417: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
[2] 2024-03-13 10:43:02.119038: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
[2] 2024-03-13 10:43:02.119069: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
This would just stay like that forever and I would then get a broken pipe. This is probably going to take a while to debug, since I am not familiar with TensorFlow distributed training.
Hi @nevakrien. I'm sorry for the inconvenience; the README.md doesn't mention multi-GPU training yet. This part will be added to README.md soon. We use
Well, I changed to the recommended settings from earlier in this thread, set NUM_GPUS=1, and ran the example.
It seems that the TensorFlow code still finds all 4 GPUs and works with them, which causes the issue of being stuck.
@nevakrien Here is the README.md section for multi-GPU: https://github.com/intel/intel-extension-for-tensorflow/tree/main/examples/pretrain_bert#convergence; you can follow it.
@nevakrien You can use the Level Zero environment variables for this. More details can be found in https://spec.oneapi.io/level-zero/latest/core/PROG.html#environment-variables.
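A sketch under the assumption that the variable meant here is ZE_AFFINITY_MASK, which the linked Level Zero page documents for restricting which devices a process can see:

# Expose only device 0 to this process (index choice is an assumption); giving each rank
# a different mask would keep the four ranks from all targeting the same card.
export ZE_AFFINITY_MASK=0
python3 run_pretraining.py ...   # usual arguments elided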
It did not seem to work; we have the same issue... again it allocates GPU memory, gets stuck in what seems like an infinite loop, and nothing meaningful really happens. I ran it from the TensorFlow side and it was very weird: the Python code ran multiple times.

import tensorflow as tf  # needed if this snippet is run standalone

print(100*"!")
print("STATING UNGODLY HACK")
# Assuming you want TensorFlow to see only the first GPU
gpus = tf.config.experimental.list_physical_devices('XPU')
#gpus=[]
if gpus:
    try:
        # Only the first GPU will be visible to TensorFlow
        tf.config.experimental.set_visible_devices(gpus[0], 'XPU')
        logical_gpus = tf.config.experimental.list_logical_devices('XPU')
        print('original devices:')
        print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        print('after mod devices')
        # Visible devices must be set before GPUs have been initialized
        print(e)
print('END UNGODLY HACK', flush=True)

and it showed:
I ran the multi-GPU version: exact same issue.
It seems to be deadlocking somehow; I am not sure what I should do to fix it.
@nevakrien I'm sorry that confused you.
From your code, I see you print the original GPU list rather than the one after set_visible_devices.
As for the multi-GPU version, the log looks OK. Could you please wait a few hours to see if any further logs are printed out?
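For illustration, a hedged sketch of checking the device list after the visibility change (names mirror the snippet above; this is an assumption, not the maintainer's code):

python3 - <<'PY'
# Restrict TensorFlow/ITEX to the first XPU, then print what is actually visible afterwards.
import tensorflow as tf
xpus = tf.config.experimental.list_physical_devices('XPU')
if xpus:
    tf.config.experimental.set_visible_devices(xpus[0], 'XPU')
    logical = tf.config.experimental.list_logical_devices('XPU')
    print(len(xpus), "physical XPUs,", len(logical), "logical XPU visible after set_visible_devices")
PY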
Thank you for clarifying. In both cases I am running into what appears to be a deadlock, with the last message being something about the optimizer registering to CPU (I assume for gradient accumulation), and then the program does nothing: no GPU computation, no CPU preprocessing, just nothing. I would then get a broken pipe because the OS sees no reason to keep it.
I used intel_gpu_top and I can see it does allocate memory, but it does not use any cores, which is the issue.
It seems that ... For example, ... Its usage can be found in https://intel.github.io/xpumanager/smi_user_guide.html#dump-the-device-statistics-in-csv-format
I would get a broken pipe after a while (maybe I need to talk to my system admin to fix that, or run it in the background, but I first want to make sure it runs).
Could you please try the hyperparameters below to verify whether OOM caused this broken pipe or not?
Hi @nevakrien, do you have further questions or concerns on this issue? Can we close this one?
I have been trying to run the BERT pretrain example and I have found some issues.
The apply-patch step breaks because there is no patch to apply.
./pip_set_env.sh
This has an inner script that breaks the Horovod installation because it uses the old sklearn package name, i.e. sklearn instead of scikit-learn (a sketch of a workaround follows below).
But if I don't set up a conda env and oneAPI, it works.
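A hedged sketch of one possible workaround, assuming the failure comes from a requirements entry that still uses the deprecated name (the file name below is an assumption):

# Install the correctly named package up front so the deprecated 'sklearn' shim is not needed,
# and/or rewrite the old name where the inner script lists it:
pip install scikit-learn
sed -i 's/^sklearn\b/scikit-learn/' requirements.txt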
This command:
bash data/create_datasets_from_start.sh all
has the issue that the first line of the script is
export BERT_PREP_WORKING_DIR=/workspace/bert_tf2/data
which overwrites the environment variable we are supposed to be setting up for it, so it breaks like this:
python3: can't open file '/workspace/bert_tf2/data/bertPrep.py': [Errno 2] No such file or directory
Commenting out that line makes it break like this instead:
python3: can't open file '/bertPrep.py': [Errno 2] No such file or directory
Looking in the NVIDIA repo and trying to run the code gives this, even after doing the docker build.
I am not really sure how to go about solving these.
My current running theory is that we need an older git revision of the NVIDIA code, but I don't know which version.