bert pretrain example not working #64

Open
nevakrien opened this issue Mar 4, 2024 · 32 comments

@nevakrien

nevakrien commented Mar 4, 2024

I have been trying to run the BERT pretraining example and have found some issues.

git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/TensorFlow2/LanguageModeling/BERT
git apply patch  # When applying this patch, please move it to the above BERT dir first.

The git apply patch step fails because there is no patch file to apply.

./pip_set_env.sh

This has an inner script that breaks the Horovod installation because it uses the old sklearn package name (sklearn instead of scikit-learn).
But if I don't set up a conda env and oneAPI, it works.

Next, bash data/create_datasets_from_start.sh all

has the issue that its first line is
export BERT_PREP_WORKING_DIR=/workspace/bert_tf2/data
which overwrites the environment variable we are supposed to set up for it,

so it breaks like this:
python3: can't open file '/workspace/bert_tf2/data/bertPrep.py': [Errno 2] No such file or directory

Commenting out that line makes it break like so:
python3: can't open file '/bertPrep.py': [Errno 2] No such file or directory
Looking in the NVIDIA repo and trying to run the code gives this:

sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ bash scripts/data_download.sh
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.
sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ 

even after doing the docker build.

I am not really sure how to go about solving these.
My current running theory is that we need an older git version for the NVIDIA code, but I don't know which version.

@YuningQiu

YuningQiu commented Mar 4, 2024

Thanks for reporting this. Let me try to reproduce it on my end and get back to you. Could you please let me know what hardware (CPUs/GPUs) you are trying to run the BERT workloads on?

@YuningQiu

To apply the patch, please move the patch file from the cloned GitHub repo to the BERT directory (DeepLearningExamples/TensorFlow2/LanguageModeling/BERT) first.

@nevakrien
Author

Sure, we are looking at 4x Intel(R) Data Center GPU Max 1100, and the CPU is an Intel(R) Xeon(R) Platinum 8480+.
oneAPI is on the 2024 version and the whole thing is on Intel cloud.

@nevakrien
Author

nevakrien commented Mar 4, 2024

Copying the patch to the right place, I think it worked. I did get this warning:

TensorFlow2/LanguageModeling/BERT/patch:254: trailing whitespace.
      context_layer = scaled_dot_product_attention(query_tensor, key_tensor, value_tensor, adder, 
TensorFlow2/LanguageModeling/BERT/patch:255: trailing whitespace.
                                                   self._dropout_rate, use_fast_attention=True, 
TensorFlow2/LanguageModeling/BERT/patch:257: trailing whitespace.
      
TensorFlow2/LanguageModeling/BERT/patch:339: trailing whitespace.
              epsilon=1e-12, dtype=tf.float32))   
TensorFlow2/LanguageModeling/BERT/patch:345: trailing whitespace.
              epsilon=1e-12)) 
warning: 5 lines add whitespace errors.

The rest of the errors are the same.

@YuningQiu

Copying the patch to the right place, I think it worked. I did get this warning: [trailing whitespace warnings quoted from the previous comment]

You don't need to worry about it. As the git manual explains:
What are considered whitespace errors is controlled by core.whitespace configuration. By default, trailing whitespaces (including lines that solely consist of whitespaces) and a space character that is immediately followed by a tab character inside the initial indent of the line are considered whitespace errors.

By default, the command outputs warning messages but applies the patch.

@YuningQiu

And when preparing the dataset, you need to manually change the environment variable in the data/create_datasets_from_start.sh script, following this instruction in our README.md file.

@nevakrien
Author

nevakrien commented Mar 5, 2024

I did make that modification, which got me the
python3: can't open file '/workspace/bert_tf2/data/bertPrep.py': [Errno 2] No such file or directory
error...

Specifically, this is the modified script:

export BERT_PREP_WORKING_DIR=/home/sdp/data
#/workspace/bert_tf2/data

and running it gives this error:

sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ bash data/create_datasets_from_start.sh all
python3: can't open file '/home/sdp/data/bertPrep.py': [Errno 2] No such file or directory
python3: can't open file '/home/sdp/data/bertPrep.py': [Errno 2] No such file or directory
data/create_datasets_from_start.sh: line 32: python: command not found
data/create_datasets_from_start.sh: line 39: python: command not found
python3: can't open file '/home/sdp/data/bertPrep.py': [Errno 2] No such file or directory
python3: can't open file '/home/sdp/data/bertPrep.py': [Errno 2] No such file or directory
python3: can't open file '/home/sdp/data/bertPrep.py': [Errno 2] No such file or directory
python3: can't open file '/home/sdp/data/bertPrep.py': [Errno 2] No such file or directory
python3: can't open file '/home/sdp/data/bertPrep.py': [Errno 2] No such file or directory

@nevakrien
Author

Looking more into it:

this

./pip_set_env.sh

should be followed by a

source env_itex/bin/activate

or we need to run it with source instead of plainly running it.

Also, my machine did not have python on the PATH for some reason, just python3, so I changed that in the script.

The other issue was solved via
export BERT_PREP_WORKING_DIR=data
I then ran into an env issue:

ModuleNotFoundError: No module named 'nltk'
ModuleNotFoundError: No module named 'pubmed_parser'

Installing them solved things.

I could make a pull request with the fixes and also try to make it work more nicely with oneAPI.

@nevakrien
Author

I am going to keep documenting things as I run into them (hope that's okay), with the hope that this will save someone else time if they run across the same issue.

In lines 211 and 176 of the bertPrep.py script there is a reference to a path that only exists inside the Docker container: 'python /workspace/bert_tf2/create_pretraining_data.py'. Switching it to create_pretraining_data.py, and verifying that the Python code there indeed runs in a directory that contains that script, fixed this bug (see the sketch after the error output below):

python: can't open file '/workspace/bert_tf2/create_pretraining_data.py': [Errno 2] No such file or directory
python: can't open file '/workspace/bert_tf2/create_pretraining_data.py': [Errno 2] No such file or directory
python: can't open file '/workspace/bert_tf2/create_pretraining_data.py': [Errno 2] No such file or directory
python: can't open file '/workspace/bert_tf2/create_pretraining_data.py': [Errno 2] No such file or directory
python: can't open file '/workspace/bert_tf2/create_pretraining_data.py': [Errno 2] No such file or directory
python: can't open file '/workspace/bert_tf2/create_pretraining_data.py': [Errno 2] No such file or directory
python: can't open file '/workspace/bert_tf2/create_pretraining_data.py': [Errno 2] No such file or directory
python: can't open file '/workspace/bert_tf2/create_pretraining_data.py': [Errno 2] No such file or directory
...keeps going...
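
For reference, the fix amounts to dropping the container-only /workspace/bert_tf2 prefix from the command string that bertPrep.py builds. A rough sketch of the idea (hypothetical names, not the exact upstream code; it assumes bertPrep.py is run from the BERT directory, which contains create_pretraining_data.py):

import os
import subprocess

# Build the preprocessing command with a path relative to the current
# working directory instead of the container-only /workspace/bert_tf2 prefix.
script = os.path.join(os.getcwd(), 'create_pretraining_data.py')
cmd = f'python3 {script} --help'  # '--help' used here only as a smoke test
subprocess.run(cmd, shell=True, check=False)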

It does have a weird interaction with TensorFlow, since it spawns a LOT of worker processes, each of which prints a lot to standard out, so I will update once I am sure that this indeed makes things work smoothly.

@nevakrien
Author

I ran into:

Traceback (most recent call last):
File "/home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT/run_pretraining.py", line 20, in
from absl import app
ModuleNotFoundError: No module named 'absl'

so I ran pip install absl-py,
which weirdly enough removed tensorflow from my environment???
This was only an issue when I tried to make this work on GPU; on CPU it just kind of worked.

So I then ran pip install --upgrade intel-extension-for-tensorflow[xpu],
and now it has uninstalled things, putting me in dependency hell...

Going to see what I can do.
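
A quick way to see what pip's resolver actually left behind (standard library only; the package names below are just the obvious candidates, adjust as needed):

from importlib.metadata import version, PackageNotFoundError

# Print whichever of these distributions ended up installed after the dependency churn.
for pkg in ("tensorflow", "intel-extension-for-tensorflow", "absl-py", "keras"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")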

@yitingw1
Contributor

yitingw1 commented Mar 7, 2024

@nevakrien Thanks for your comment about the missing pip packages for dataset preparation; I will update README.md soon.

By the way, what's your conflicting absl-py version?
I have tried to install ITEX on MAX 1100 through pip install --upgrade intel-extension-for-tensorflow[xpu] and didn't find ModuleNotFoundError: No module named 'absl'. run_pretraining.py can run successfully.

@yitingw1
Contributor

yitingw1 commented Mar 7, 2024

@nevakrien The current hyperparameters in examples/pretrain_bert/README.md are verified on Intel GPU Max 1550.
I'm afraid it will run out of memory (OOM) on a MAX 1100. You can reduce the batch size to avoid OOM.
I recommend you try the hyperparameters below:

TRAIN_BATCH_SIZE_PHASE1=60
TRAIN_BATCH_SIZE_PHASE2=10
LEARNING_RATE_PHASE1=7.5e-4
LEARNING_RATE_PHASE2=5e-4
NUM_ACCUMULATION_STEPS_PHASE1=64
NUM_ACCUMULATION_STEPS_PHASE2=192

For more details about hyperparameter settings, I recommend you have a look at pretrain_config.sh and squad_config.sh.

@nevakrien
Author

nevakrien commented Mar 7, 2024

The absl issue ended up being a missing pip install intel-optimization-for-horovod.
Key point: it raised a warning about a non-matching TF version, so I then manually installed the correct version.

The way I fixed my environment is as follows (packages installed):

re
six
pip install intel-optimization-for-horovod
tensorflow_hub
tensorflow_addons

pip install nvidia-pyindex
pip install nvidia-dllogger

I am now stuck in what seems to be a deadlock; not really sure what that's about:

al values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
warnings.warn(
WARNING:tensorflow:
The following Variables were used a Lambda layer's call (lambda_2), but
are not present in its tracked objects:
<tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
It is possible that this is intended behavior, but it is more likely
an omission. This is a strong indication that this layer should be
formulated as a subclassed Layer rather than a Lambda layer.
W0307 12:58:40.176523 140571263931072 lambda_layer.py:270]
The following Variables were used a Lambda layer's call (lambda_2), but
are not present in its tracked objects:
<tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
It is possible that this is intended behavior, but it is more likely
an omission. This is a strong indication that this layer should be
formulated as a subclassed Layer rather than a Lambda layer.
2024-03-07 12:59:07.291988: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
client_loop: send disconnect: Broken pipe

I am going to try running it as a background job and we will see if that magically solves it.

@nevakrien
Author

Side note to anyone that accidentally ran pip install horovod on the main oneAPI Python: for me that caused a nasty bug where TensorFlow everywhere was broken with a C-level bug (valgrind traced it to an invalid read). This is solvable by just making new envs. I am not sure what it's about, since I don't have debug symbols to look into it.

@nevakrien
Author

@nevakrien Current hyperparameters in examples/pretrain_bert/README.md are verified on Intel GPU Max 1550. I'm afraid that it will be Out-Of-Memory on MAX 1100. You can reduce batch size to avoid OOM. I recommend you try below hyperparameters:

@yitingw1 I am running on 4 of these GPUs, so I think I am fine. I just changed NUM_GPUS to 4.

@yitingw1
Contributor

yitingw1 commented Mar 7, 2024

Hi @nevakrien, can you see Intel Extension for Tensorflow* GPU backend is loaded. and Selected platform: Intel(R) Level-Zero when you import tensorflow?
I see the log you just posted, which shows Plugin optimizer for device_type CPU is enabled. If you install intel-extension-for-tensorflow[xpu] and it runs properly, it should show Plugin optimizer for device_type XPU is enabled.

You can use the script below to test if there are available XPUs (Intel GPUs):

import tensorflow as tf
tf.config.list_physical_devices("XPU")

The expected output is [PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')] rather than [].

@nevakrien
Author

@yitingw1 Sorry for being slow to answer.

Yes, that's the case; I am seeing 4 of them:

tf.config.list_physical_devices("XPU")
[PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU'), PhysicalDevice(name='/physical_device:XPU:1', device_type='XPU'), PhysicalDevice(name='/physical_device:XPU:2', device_type='XPU'), PhysicalDevice(name='/physical_device:XPU:3', device_type='XPU')]

which is what I would expect

Right now I am on what seems to be a deadlock issue.
I run the training script and it gets stuck.

I think it's because all 4 MPI ranks look for the same GPU with ID=0:

(env_itex) sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ DATATYPE=bf16
(env_itex) sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ TRAIN_BATCH_SIZE_PHASE1=312
TRAIN_BATCH_SIZE_PHASE2=40
EVAL_BATCH_SIZE=8
LEARNING_RATE_PHASE1=8.12e-4
LEARNING_RATE_PHASE2=5e-4
DATATYPE=$DATATYPE
USE_XLA=false
NUM_GPUS=4
WARMUP_STEPS_PHASE1=810
WARMUP_STEPS_PHASE2=81
TRAIN_STEPS=2600
SAVE_CHECKPOINT_STEPS=100
NUM_ACCUMULATION_STEPS_PHASE1=32
NUM_ACCUMULATION_STEPS_PHASE2=96
BERT_MODEL=large

GBS1=$(expr $TRAIN_BATCH_SIZE_PHASE1 \* $NUM_GPUS \* $NUM_ACCUMULATION_STEPS_PHASE1)
GBS2=$(expr $TRAIN_BATCH_SIZE_PHASE2 \* $NUM_GPUS \* $NUM_ACCUMULATION_STEPS_PHASE2)

PRETRAIN_RESULT_DIR=./results/tf_bert_pretraining_lamb_${BERT_MODEL}_${$DATATYPE    |& tee pretrain_lamb.logPHASE2 \ \
-bash: ./results/tf_bert_pretraining_lamb_${BERT_MODEL}_${$DATATYPE}_gbs1_${GBS1}_gbs2_${GBS2}: bad substitution
Container nvidia build = 
Saving checkpoints to ./results/tf_bert_pretraining_lamb_large_bf16_gbs139936_gbs215360
Logs written to ./results/tf_bert_pretraining_lamb_large_bf16_gbs139936_gbs215360/tf_bert_pretraining_lamb_large_bf16_gbs139936_gbs215360.240313104255.log
+ bash scripts/run_pretraining_lamb_phase1.sh 312 40 8 8.120000e-04 5.000000e-04 bf16 false 4 810 81 2600 100 32 96 large
+ tee -a ./results/tf_bert_pretraining_lamb_large_bf16_gbs139936_gbs215360/tf_bert_pretraining_lamb_large_bf16_gbs139936_gbs215360.240313104255.log
Container nvidia build = 
[0] 2024-03-13 10:42:55.966934: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[1] 2024-03-13 10:42:55.966929: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[2] 2024-03-13 10:42:55.985172: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[3] 2024-03-13 10:42:55.985172: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[0] 2024-03-13 10:42:56.006373: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[1] 2024-03-13 10:42:56.006372: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[2] 2024-03-13 10:42:56.006388: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[3] 2024-03-13 10:42:56.006378: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[0] 2024-03-13 10:42:56.197288: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[0] 2024-03-13 10:42:56.197354: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[1] 2024-03-13 10:42:56.197292: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[1] 2024-03-13 10:42:56.197355: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[2] 2024-03-13 10:42:56.197296: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[2] 2024-03-13 10:42:56.197361: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[3] 2024-03-13 10:42:56.197293: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[3] 2024-03-13 10:42:56.197361: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[0] 2024-03-13 10:42:56.198255: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2] 2024-03-13 10:42:56.198263: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[3] 2024-03-13 10:42:56.198262: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[1] 2024-03-13 10:42:56.198270: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[0] 2024-03-13 10:42:56.307379: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[1] 2024-03-13 10:42:56.307378: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[2] 2024-03-13 10:42:56.307362: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[3] 2024-03-13 10:42:56.307352: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[2] 2024-03-13 10:42:56.308179: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[2] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[3] 2024-03-13 10:42:56.308177: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[3] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[0] 2024-03-13 10:42:56.308204: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[0] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1] 2024-03-13 10:42:56.308203: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[1] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[0] 2024-03-13 10:42:57.172249: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[1] 2024-03-13 10:42:57.172249: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2] 2024-03-13 10:42:57.172248: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[3] 2024-03-13 10:42:57.172244: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[0] 2024-03-13 10:42:59.589180: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[1] 2024-03-13 10:42:59.589178: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[2] 2024-03-13 10:42:59.589187: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[3] 2024-03-13 10:42:59.589194: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[0] 2024-03-13 10:42:59.757082: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[1] 2024-03-13 10:42:59.757086: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[2] 2024-03-13 10:42:59.757090: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[3] 2024-03-13 10:42:59.757089: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[0] 2024-03-13 10:42:59.943345: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[1] 2024-03-13 10:42:59.943334: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[2] 2024-03-13 10:42:59.943290: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[2] 2024-03-13 10:42:59.943552: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[2] 2024-03-13 10:42:59.943556: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[2] 2024-03-13 10:42:59.943558: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[2] 2024-03-13 10:42:59.943561: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-13 10:42:59.943599: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-13 10:42:59.943603: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-13 10:42:59.943605: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-13 10:42:59.943607: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-13 10:42:59.943591: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-13 10:42:59.943597: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-13 10:42:59.943600: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-13 10:42:59.943603: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-13 10:42:59.943694: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[3] 2024-03-13 10:42:59.943952: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-13 10:42:59.943959: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-13 10:42:59.943961: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-13 10:42:59.943963: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] /home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/env_itex/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 
[0] 
[0] TensorFlow Addons (TFA) has ended development and introduction of new features.
[0] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[0] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 
[0] 
[0] For more information see: https://github.com/tensorflow/addons/issues/2807 
[0] 
[0]   warnings.warn(
[2] /home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/env_itex/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 
[2] 
[2] TensorFlow Addons (TFA) has ended development and introduction of new features.
[2] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[2] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 
[2] 
[2] For more information see: https://github.com/tensorflow/addons/issues/2807 
[2] 
[2]   warnings.warn(
[1] /home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/env_itex/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 
[1] 
[1] TensorFlow Addons (TFA) has ended development and introduction of new features.
[1] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[1] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 
[1] 
[1] For more information see: https://github.com/tensorflow/addons/issues/2807 
[1] 
[1]   warnings.warn(
[3] /home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/env_itex/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 
[3] 
[3] TensorFlow Addons (TFA) has ended development and introduction of new features.
[3] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[3] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 
[3] 
[3] For more information see: https://github.com/tensorflow/addons/issues/2807 
[3] 
[3]   warnings.warn(
[0] 2024-03-13 10:43:02,085 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,085 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-13 10:43:02,088 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,088 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,088 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-13 10:43:02,089 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-13 10:43:02,090 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,090 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,090 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-13 10:43:02,090 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-13 10:43:02,091 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-13 10:43:02,091 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,091 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,093 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,093 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,093 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,094 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-13 10:43:02,094 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,096 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,097 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,098 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,098 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,099 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-13 10:43:02,099 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] I0313 10:43:02.113359 140683523730304 run_pretraining.py:130] init_lr = 0.000812
[0] I0313 10:43:02.114231 139897882577792 run_pretraining.py:130] init_lr = 0.000812
[3] I0313 10:43:02.114217 140047589284736 run_pretraining.py:130] init_lr = 0.000812
[2] I0313 10:43:02.115367 139884497890176 run_pretraining.py:130] init_lr = 0.000812
[1] 2024-03-13 10:43:02.116706: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
[1] 2024-03-13 10:43:02.116745: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
[0] 2024-03-13 10:43:02.117326: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
[0] 2024-03-13 10:43:02.117361: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
[3] 2024-03-13 10:43:02.117379: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
[3] 2024-03-13 10:43:02.117417: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
[2] 2024-03-13 10:43:02.119038: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
[2] 2024-03-13 10:43:02.119069: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)

This would just stay there forever, and I would then get a broken pipe.
Looking at the GPU utilization, since no memory is allocated, I am pretty sure it just does nothing because it waits for an OK from every GPU.

This is probably going to take a while to debug, since I am not familiar with TensorFlow distributed training.

@yitingw1
Contributor

Hi @nevakrien. I'm sorry for the inconvenience; the README.md doesn't mention multi-GPU training. This part will be added to README.md soon.

We use intel-optimization-for-horovod to implement efficient multi-GPU training with OneCCL. If you want to use multi-GPU, please replace horovod with intel-optimization-for-horovod and retry bert-large pretraining.

pip uninstall horovod
pip install intel-optimization-for-horovod
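
With intel-optimization-for-horovod in place, the usual multi-GPU pattern is for each MPI rank to pin itself to its own local XPU before anything initializes the devices. A minimal sketch of that pattern (assuming the standard horovod.tensorflow API; this is not the exact run_pretraining.py code):

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Each rank should only see (and grow memory on) its own local XPU.
# This must run before any op touches a device.
xpus = tf.config.list_physical_devices('XPU')
if xpus:
    tf.config.set_visible_devices(xpus[hvd.local_rank()], 'XPU')
    tf.config.experimental.set_memory_growth(xpus[hvd.local_rank()], True)

print(f"rank {hvd.rank()}/{hvd.size()} pinned to local XPU {hvd.local_rank()}")

Launched with mpirun -np 4, each rank then maps to a different device instead of all four ranks grabbing XPU:0.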

@nevakrien
Author

nevakrien commented Mar 14, 2024

Well, I changed to the recommended settings from earlier in this thread, set NUM_GPUS=1, and ran the example:


2024-03-14 14:22:49.706624: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-03-14 14:22:49.706642: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 1, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-03-14 14:22:49.706646: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 2, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-03-14 14:22:49.706649: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 3, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-03-14 14:22:49.706677: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
2024-03-14 14:22:49.707040: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:1 with 0 MB memory) -> physical PluggableDevice (device: 1, name: XPU, pci bus id: <undefined>)
2024-03-14 14:22:49.707149: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:2 with 0 MB memory) -> physical PluggableDevice (device: 2, name: XPU, pci bus id: <undefined>)
2024-03-14 14:22:49.707263: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:3 with 0 MB memory) -> physical PluggableDevice (device: 3, name: XPU, pci bus id: <undefined>)
/home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer TruncatedNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
  warnings.warn(
WARNING:tensorflow:
The following Variables were used a Lambda layer's call (lambda_2), but
are not present in its tracked objects:
  <tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
It is possible that this is intended behavior, but it is more likely
an omission. This is a strong indication that this layer should be
formulated as a subclassed Layer rather than a Lambda layer.
W0314 14:22:53.167155 140229607912128 lambda_layer.py:270] 
The following Variables were used a Lambda layer's call (lambda_2), but
are not present in its tracked objects:
  <tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
It is possible that this is intended behavior, but it is more likely
an omission. This is a strong indication that this layer should be
formulated as a subclassed Layer rather than a Lambda layer.
2024-03-14 14:23:19.907717: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.

It seems that the TensorFlow code still finds all 4 GPUs and works with them, which causes the issue of being stuck.
I will look into what I can do about limiting device visibility, and if that works properly I will post it here.

@yitingw1
Contributor

@nevakrien Here is README.md for multi-GPU: https://github.com/intel/intel-extension-for-tensorflow/tree/main/examples/pretrain_bert#convergence, you can follow it.
And the patch is updated, too. Please use the latest patch.

@yitingw1
Contributor

@nevakrien You can use export ZE_AFFINITY_MASK=gpu_ids to limit device visibility. For example,

export ZE_AFFINITY_MASK=0 # using XPU:0
export ZE_AFFINITY_MASK=2,3 # using XPU:2 and XPU:3

More details can be found in https://spec.oneapi.io/level-zero/latest/core/PROG.html#environment-variables.
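
One way to confirm the mask took effect (assuming it is exported before Python/ITEX starts) is to check how many XPUs TensorFlow reports:

import tensorflow as tf

# With e.g. ZE_AFFINITY_MASK=0 exported beforehand, only one device should show up here.
xpus = tf.config.list_physical_devices("XPU")
print(len(xpus), "XPU(s) visible:", xpus)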

@nevakrien
Author

It did not seem to work; we have the same issue... again it allocates GPU memory, gets stuck in what seems like an infinite loop, and nothing meaningful really happens.

I ran it from the TensorFlow side and it was very weird: the Python code ran multiple times.

print(100*"!")
print("STATING UNGODLY HACK")
# Assuming you want TensorFlow to see only the first GPU
gpus = tf.config.experimental.list_physical_devices('XPU')
#gpus=[]
if gpus:
  try:
    # Only the first GPU will be visible to TensorFlow
    tf.config.experimental.set_visible_devices(gpus[0], 'XPU')
    logical_gpus = tf.config.experimental.list_logical_devices('XPU')
    print('original devices:')
    print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    print('after mod devices')
    # Visible devices must be set before GPUs have been initialized
    print(e)

print('END UNGODLY HACK',flush=True)

and it showed

(example_pretrain) sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ bash scripts/run_pretraining_lamb.sh     $TRAIN_BATCH_SIZE_PHASE1     $TRAIN_BATCH_SIZE_PHASE2     $EVAL_BATCH_SIZE     $LEARNING_RATE_PHASE1     $LEARNING_RATE_PHASE2     $DATATYPE     $USE_XLA     $NUM_GPUS     $WARMUP_STEPS_PHASE1     $WARMUP_STEPS_PHASE2     $TRAIN_STEPS     $SAVE_CHECKPOINT_STEPS     $NUM_ACCUMULATION_STEPS_PHASE1     $NUM_ACCUMULATION_STEPS_PHASE2     $BERT_MODEL     $DATA_DIR     $PRETRAIN_RESULT_DIR     |& tee pretrain_lamb.log
Container nvidia build = 
Saving checkpoints to ./results/tf_bert_pretraining_lamb_large_bf16_gbs13840_gbs21920
Logs written to ./results/tf_bert_pretraining_lamb_large_bf16_gbs13840_gbs21920/tf_bert_pretraining_lamb_large_bf16_gbs13840_gbs21920.240318063321.log
+ bash scripts/run_pretraining_lamb_phase1.sh 60 10 8 7.500000e-04 5.000000e-04 bf16 false 1 810 81 2600 100 64 192 large
+ tee -a ./results/tf_bert_pretraining_lamb_large_bf16_gbs13840_gbs21920/tf_bert_pretraining_lamb_large_bf16_gbs13840_gbs21920.240318063321.log
Container nvidia build = 
2024-03-18 06:33:21.584108: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-18 06:33:21.586170: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-18 06:33:21.612126: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-18 06:33:21.612147: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-18 06:33:21.612171: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-18 06:33:21.617744: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-18 06:33:21.617892: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-18 06:33:22.114465: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-18 06:33:22.886193: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
2024-03-18 06:33:22.945522: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
2024-03-18 06:33:23.028851: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2024-03-18 06:33:23.029112: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-03-18 06:33:23.029116: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-03-18 06:33:23.029118: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-03-18 06:33:23.029120: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-03-18 06:33:23.132561: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-03-18 06:33:23.132598: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
STATING UNGODLY HACK
original devices:
4 Physical GPU, 1 Logical GPU
END UNGODLY HACK
/home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

  warnings.warn(
2024-03-18 06:33:24,973 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
2024-03-18 06:33:24,974 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
2024-03-18 06:33:24,975 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
2024-03-18 06:33:24,975 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
2024-03-18 06:33:24,975 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
2024-03-18 06:33:24,976 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
Traceback (most recent call last):
  File "/home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT/run_pretraining.py", line 242, in <module>
    app.run(main)
  File "/home/sdp/.local/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/sdp/.local/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT/run_pretraining.py", line 212, in main
    tf.config.experimental.set_memory_growth(gpu, True)
  File "/home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/tensorflow/python/framework/config.py", line 748, in set_memory_growth
    context.context().set_memory_growth(device, enable)
  File "/home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1775, in set_memory_growth
    raise RuntimeError(
RuntimeError: Physical devices cannot be modified after being initialized
+ bash scripts/run_pretraining_lamb_phase2.sh 60 10 8 7.500000e-04 5.000000e-04 bf16 false 1 810 81 2600 100 64 192 large
+ tee -a ./results/tf_bert_pretraining_lamb_large_bf16_gbs13840_gbs21920/tf_bert_pretraining_lamb_large_bf16_gbs13840_gbs21920.240318063321.log
Container nvidia build = 
Container nvidia build = 
2024-03-18 06:33:25.597159: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-18 06:33:25.598992: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-18 06:33:25.624423: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-18 06:33:25.624441: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-18 06:33:25.624460: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-18 06:33:25.629708: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-18 06:33:25.629853: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-18 06:33:26.126341: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-18 06:33:26.887178: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
2024-03-18 06:33:26.944459: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
2024-03-18 06:33:27.030443: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2024-03-18 06:33:27.030703: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-03-18 06:33:27.030706: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-03-18 06:33:27.030708: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-03-18 06:33:27.030710: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
2024-03-18 06:33:27.136119: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-03-18 06:33:27.136159: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
STATING UNGODLY HACK
original devices:
4 Physical GPU, 1 Logical GPU
END UNGODLY HACK
/home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

  warnings.warn(
2024-03-18 06:33:29,005 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
2024-03-18 06:33:29,007 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
2024-03-18 06:33:29,007 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
2024-03-18 06:33:29,007 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
2024-03-18 06:33:29,008 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
2024-03-18 06:33:29,008 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
Traceback (most recent call last):
  File "/home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT/run_pretraining.py", line 242, in <module>
    app.run(main)
  File "/home/sdp/.local/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/sdp/.local/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/sdp/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT/run_pretraining.py", line 212, in main
    tf.config.experimental.set_memory_growth(gpu, True)
  File "/home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/tensorflow/python/framework/config.py", line 748, in set_memory_growth
    context.context().set_memory_growth(device, enable)
  File "/home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1775, in set_memory_growth
    raise RuntimeError(
RuntimeError: Physical devices cannot be modified after being initialized
+ set +x
(example_pretrain) sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ 
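
The RuntimeError above ("Physical devices cannot be modified after being initialized") is the usual symptom of configuring devices too late: listing logical devices (as the hack does) initializes the runtime, so the later set_memory_growth call in run_pretraining.py is rejected. The ordering TensorFlow expects looks roughly like this (a sketch, not the actual run_pretraining.py flow):

import tensorflow as tf

# 1. Configure physical devices first, before any tensor, model, or
#    list_logical_devices call initializes the runtime.
xpus = tf.config.list_physical_devices('XPU')
if xpus:
    tf.config.set_visible_devices(xpus[0], 'XPU')
    tf.config.experimental.set_memory_growth(xpus[0], True)

# 2. Only now create tensors or build the model.
x = tf.constant([1.0, 2.0, 3.0])
print(x.device, tf.config.get_visible_devices('XPU'))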

@nevakrien
Author

Ran the multi-GPU version: exact same issue.

(example_pretrain) sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ DATA_DIR=data
(example_pretrain) sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ bash scripts/run_pretraining_lamb.sh     $TRAIN_BATCH_SIZE_PHASE1     $TRAIN_BATCH_SIZE_PHASE2     $EVAL_BATCH_SIZE     $LEARNING_RATE_PHASE1     $LEARNING_RATE_PHASE2     $DATATYPE     $USE_XLA     $NUM_GPUS     $WARMUP_STEPS_PHASE1     $WARMUP_STEPS_PHASE2     $TRAIN_STEPS     $SAVE_CHECKPOINT_STEPS     $NUM_ACCUMULATION_STEPS_PHASE1     $NUM_ACCUMULATION_STEPS_PHASE2     $BERT_MODEL     $DATA_DIR     $PRETRAIN_RESULT_DIR     |& tee pretrain_lamb.log
Container nvidia build = 
Saving checkpoints to ./results/tf_bert_pretraining_lamb_large_bf16_gbs1_79872_gbs2_30720
Logs written to ./results/tf_bert_pretraining_lamb_large_bf16_gbs1_79872_gbs2_30720/tf_bert_pretraining_lamb_large_bf16_gbs179872_gbs230720.240318064655.log
+ bash scripts/run_pretraining_lamb_phase1.sh 312 40 8 1.222000e-03 1.000000e-03 bf16 false 4 2000 200 6416 100 64 192 large
+ tee -a ./results/tf_bert_pretraining_lamb_large_bf16_gbs1_79872_gbs2_30720/tf_bert_pretraining_lamb_large_bf16_gbs179872_gbs230720.240318064655.log
Container nvidia build = 
I am runing mpi here!!!
[2] 2024-03-18 06:46:55.645398: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[3] 2024-03-18 06:46:55.645398: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[2] 2024-03-18 06:46:55.647635: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[3] 2024-03-18 06:46:55.647634: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[1] 2024-03-18 06:46:55.652348: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[1] 2024-03-18 06:46:55.654106: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[0] 2024-03-18 06:46:55.657348: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[0] 2024-03-18 06:46:55.659068: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[2] 2024-03-18 06:46:55.673964: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[2] 2024-03-18 06:46:55.673987: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[2] 2024-03-18 06:46:55.674029: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[3] 2024-03-18 06:46:55.673967: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[3] 2024-03-18 06:46:55.673988: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[3] 2024-03-18 06:46:55.674029: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[1] 2024-03-18 06:46:55.678937: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[1] 2024-03-18 06:46:55.678957: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[1] 2024-03-18 06:46:55.678975: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2] 2024-03-18 06:46:55.680457: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[3] 2024-03-18 06:46:55.680457: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[2] 2024-03-18 06:46:55.680635: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[2] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[3] 2024-03-18 06:46:55.680635: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[3] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[0] 2024-03-18 06:46:55.683480: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[0] 2024-03-18 06:46:55.683502: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[0] 2024-03-18 06:46:55.683518: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[1] 2024-03-18 06:46:55.683780: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[1] 2024-03-18 06:46:55.683943: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[1] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[0] 2024-03-18 06:46:55.688021: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[0] 2024-03-18 06:46:55.688172: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[0] To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2] 2024-03-18 06:46:56.241994: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[3] 2024-03-18 06:46:56.241997: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[1] 2024-03-18 06:46:56.245705: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[0] 2024-03-18 06:46:56.248388: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[0] 2024-03-18 06:46:57.057189: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[2] 2024-03-18 06:46:57.057189: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[3] 2024-03-18 06:46:57.057184: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[1] 2024-03-18 06:46:57.057579: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[2] 2024-03-18 06:46:57.096906: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[0] 2024-03-18 06:46:57.097851: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[1] 2024-03-18 06:46:57.097852: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[3] 2024-03-18 06:46:57.099657: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
[2] 2024-03-18 06:46:57.178426: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[2] 2024-03-18 06:46:57.178689: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[2] 2024-03-18 06:46:57.178692: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[2] 2024-03-18 06:46:57.178695: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[2] 2024-03-18 06:46:57.178697: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-18 06:46:57.179154: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[0] 2024-03-18 06:46:57.179258: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[1] 2024-03-18 06:46:57.179414: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-18 06:46:57.179417: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-18 06:46:57.179419: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[1] 2024-03-18 06:46:57.179421: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-18 06:46:57.179511: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-18 06:46:57.179514: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-18 06:46:57.179516: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-18 06:46:57.179518: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-18 06:46:57.181573: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[3] 2024-03-18 06:46:57.181831: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-18 06:46:57.181835: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-18 06:46:57.181837: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[3] 2024-03-18 06:46:57.181839: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[2] /home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 
[2] 
[2] TensorFlow Addons (TFA) has ended development and introduction of new features.
[2] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[2] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 
[2] 
[2] For more information see: https://github.com/tensorflow/addons/issues/2807 
[2] 
[2]   warnings.warn(
[0] /home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 
[0] 
[0] TensorFlow Addons (TFA) has ended development and introduction of new features.
[0] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[0] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 
[0] 
[0] For more information see: https://github.com/tensorflow/addons/issues/2807 
[0] 
[0]   warnings.warn(
[2] 2024-03-18 06:46:59,054 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] /home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 
[1] 
[1] TensorFlow Addons (TFA) has ended development and introduction of new features.
[1] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[1] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 
[1] 
[1] For more information see: https://github.com/tensorflow/addons/issues/2807 
[1] 
[1]   warnings.warn(
[0] 2024-03-18 06:46:59,055 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-18 06:46:59,056 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-18 06:46:59,056 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-18 06:46:59,057 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-18 06:46:59,057 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-18 06:46:59,057 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-18 06:46:59,057 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[2] 2024-03-18 06:46:59,057 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-18 06:46:59,058 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-18 06:46:59,058 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[0] 2024-03-18 06:46:59,058 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-18 06:46:59,075 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-18 06:46:59,077 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-18 06:46:59,077 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-18 06:46:59,078 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-18 06:46:59,078 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[1] 2024-03-18 06:46:59,078 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] /home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 
[3] 
[3] TensorFlow Addons (TFA) has ended development and introduction of new features.
[3] TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
[3] Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 
[3] 
[3] For more information see: https://github.com/tensorflow/addons/issues/2807 
[3] 
[3]   warnings.warn(
[3] 2024-03-18 06:46:59,155 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-18 06:46:59,156 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-18 06:46:59,156 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-18 06:46:59,157 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-18 06:46:59,157 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] 2024-03-18 06:46:59,157 - intel_extension_for_tensorflow.python.experimental_ops_override - INFO - itex experimental ops override is enabled.
[3] I0318 06:46:59.774940 140205093315264 run_pretraining.py:154] init_lr = 0.004888
[2] I0318 06:46:59.774971 140320980550336 run_pretraining.py:154] init_lr = 0.004888
[1] I0318 06:46:59.775495 139843780956864 run_pretraining.py:154] init_lr = 0.004888
[0] I0318 06:46:59.775708 139992680649408 run_pretraining.py:154] init_lr = 0.004888
[2] 2024-03-18 06:46:59.779205: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 2, defaulting to 0. Your kernel may not have been built with NUMA support.
[2] 2024-03-18 06:46:59.779265: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 2, name: XPU, pci bus id: <undefined>)
[3] 2024-03-18 06:46:59.779530: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 3, defaulting to 0. Your kernel may not have been built with NUMA support.
[3] 2024-03-18 06:46:59.779576: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 3, name: XPU, pci bus id: <undefined>)
[0] 2024-03-18 06:46:59.779841: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
[1] 2024-03-18 06:46:59.779866: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 1, defaulting to 0. Your kernel may not have been built with NUMA support.
[0] 2024-03-18 06:46:59.779894: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
[1] 2024-03-18 06:46:59.779917: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 1, name: XPU, pci bus id: <undefined>)
[0] /home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer TruncatedNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
[0]   warnings.warn(
[1] /home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer TruncatedNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
[1]   warnings.warn(
[3] /home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer TruncatedNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
[3]   warnings.warn(
[2] /home/sdp/.conda/envs/example_pretrain/lib/python3.9/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer TruncatedNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
[2]   warnings.warn(
[0] WARNING:tensorflow:
[0] The following Variables were used a Lambda layer's call (lambda_2), but
[0] are not present in its tracked objects:
[0]   <tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
[0] It is possible that this is intended behavior, but it is more likely
[0] an omission. This is a strong indication that this layer should be
[0] formulated as a subclassed Layer rather than a Lambda layer.
[0] W0318 06:47:03.185108 139992680649408 lambda_layer.py:270] 
[0] The following Variables were used a Lambda layer's call (lambda_2), but
[0] are not present in its tracked objects:
[0]   <tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
[0] It is possible that this is intended behavior, but it is more likely
[0] an omission. This is a strong indication that this layer should be
[0] formulated as a subclassed Layer rather than a Lambda layer.
[1] WARNING:tensorflow:
[1] The following Variables were used a Lambda layer's call (lambda_2), but
[1] are not present in its tracked objects:
[1]   <tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
[1] It is possible that this is intended behavior, but it is more likely
[1] an omission. This is a strong indication that this layer should be
[1] formulated as a subclassed Layer rather than a Lambda layer.
[1] W0318 06:47:03.280845 139843780956864 lambda_layer.py:270] 
[1] The following Variables were used a Lambda layer's call (lambda_2), but
[1] are not present in its tracked objects:
[1]   <tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
[1] It is possible that this is intended behavior, but it is more likely
[1] an omission. This is a strong indication that this layer should be
[1] formulated as a subclassed Layer rather than a Lambda layer.
[3] WARNING:tensorflow:
[3] The following Variables were used a Lambda layer's call (lambda_2), but
[3] are not present in its tracked objects:
[3]   <tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
[3] It is possible that this is intended behavior, but it is more likely
[3] an omission. This is a strong indication that this layer should be
[3] formulated as a subclassed Layer rather than a Lambda layer.
[3] W0318 06:47:03.323586 140205093315264 lambda_layer.py:270] 
[3] The following Variables were used a Lambda layer's call (lambda_2), but
[3] are not present in its tracked objects:
[3]   <tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
[3] It is possible that this is intended behavior, but it is more likely
[3] an omission. This is a strong indication that this layer should be
[3] formulated as a subclassed Layer rather than a Lambda layer.
[2] WARNING:tensorflow:
[2] The following Variables were used a Lambda layer's call (lambda_2), but
[2] are not present in its tracked objects:
[2]   <tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
[2] It is possible that this is intended behavior, but it is more likely
[2] an omission. This is a strong indication that this layer should be
[2] formulated as a subclassed Layer rather than a Lambda layer.
[2] W0318 06:47:03.330546 140320980550336 lambda_layer.py:270] 
[2] The following Variables were used a Lambda layer's call (lambda_2), but
[2] are not present in its tracked objects:
[2]   <tf.Variable 'word_embeddings/embeddings:0' shape=(30522, 1024) dtype=float32>
[2] It is possible that this is intended behavior, but it is more likely
[2] an omission. This is a strong indication that this layer should be
[2] formulated as a subclassed Layer rather than a Lambda layer.
[0] 2024-03-18 06:47:38.882092: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
[1] 2024-03-18 06:47:39.242306: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
[2] 2024-03-18 06:47:39.809818: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
[3] 2024-03-18 06:47:39.928072: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.

It seems to be deadlocking somehow; I am not sure what I should do to fix it.

@yitingw1
Contributor

@nevakrien I'm sorry the Python code running multiple times confused you; that is because we use the LAMB optimizer for BERT pretraining, following the NVIDIA examples. run_pretraining_lamb.sh contains two phases: if phase 1 fails, phase 2 will run immediately and fail too.
As for limiting device visibility, I recommend using export ZE_AFFINITY_MASK=gpu_ids rather than tf.config.experimental.set_visible_devices(gpus[0], 'XPU').
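For instance, a minimal sketch of the environment-variable approach, assuming device 0 is the card you want to expose (indices follow the Level Zero enumeration on your node); set it in the shell before the training script or mpirun starts:

export ZE_AFFINITY_MASK=0      # expose only GPU 0 to anything launched from this shell
# export ZE_AFFINITY_MASK=0,1  # or list several indices to expose a subset of the cards
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('XPU'))"   # optional: check how many XPU devices are now visible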

@yitingw1
Contributor

print("STATING UNGODLY HACK")
# Assuming you want TensorFlow to see only the first GPU
gpus = tf.config.experimental.list_physical_devices('XPU')
#gpus=[]
if gpus:
  try:
    # Only the first GPU will be visible to TensorFlow
    tf.config.experimental.set_visible_devices(gpus[0], 'XPU')
    logical_gpus = tf.config.experimental.list_logical_devices('XPU')
    print('original devices:')
    print(, "Physical GPU,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    print('after mod devices')
    # Visible devices must be set before GPUs have been initialized
    print(e)

print('END UNGODLY HACK',flush=True)

From your code, I see you print the original gpus list rather than the list after tf.config.experimental.set_visible_devices(gpus[0], 'XPU'), so len(gpus)=4, while logical_gpus is queried after set_visible_devices(gpus[0], 'XPU'), so len(logical_gpus)=1.

@yitingw1
Contributor

As for the multi-GPU version, the log looks OK. Could you please wait a few hours to see whether any further logs are printed?
Alternatively, you can use xpu-smi ps to see GPU memory usage.
More details can be found in https://intel.github.io/xpumanager/smi_user_guide.html#get-the-process-info-which-are-using-gpu-and-their-gpu-memory-usage.

@nevakrien
Author

Thank you for clarifying.
Yes, I went down that route and am using what you recommended for multiple GPUs / limiting visibility.

In both cases I am running into what appears to be a deadlock: the last message is something about the optimizer registering to the CPU (I assume for gradient accumulation), and then the program does nothing. No GPU computation, no CPU preprocessing, just nothing. I then get a broken pipe because the OS sees no reason to keep it.

@nevakrien
Author

I used intel_gpu_top; I can see it does allocate memory, but it does not use any cores, which is the issue.

@yitingw1
Contributor

It seems that intel_gpu_top is better suited to checking game or video performance. Could you please use xpu-smi dump to see Intel data center GPU device statistics?

For example, xpu-smi dump -d 3 -m 0,5,18 for device XPU:3 shows its GPU Utilization (%), GPU Memory Utilization (%), and GPU Memory Used (MiB).

Its usage can be found in https://intel.github.io/xpumanager/smi_user_guide.html#dump-the-device-statistics-in-csv-format

@nevakrien
Author

sdp@gpunode:~$ xpu-smi dump -d 3 -m 0,5,18
Timestamp, DeviceId, GPU Utilization (%), GPU Memory Utilization (%), GPU Memory Used (MiB)
12:39:00.000,    3,  N/A, 87.55, 43020.16
12:39:01.000,    3,  N/A, 87.55, 43020.16
12:39:02.000,    3,  N/A, 87.55, 43020.16
12:39:03.000,    3,  N/A, 87.55, 43020.16
12:39:04.000,    3,  N/A, 87.55, 43020.16
12:39:05.000,    3,  N/A, 87.55, 43020.16
12:39:06.000,    3,  N/A, 87.55, 43020.16

(screenshot attached)

I would get a broken pipe after a while (maybe I need to talk to my system admin to fix that, or run it in the background, but I first want to make sure it runs).

@yitingw1
Contributor

Could you please try the hyperparameters below to verify whether OOM caused this broken pipe?
Another possible reason for the broken pipe is an incomplete dataset.
The most likely reason, though, is an SSH remote connection error. Could you please run with nohup or tmux so a dropped connection does not kill the job?
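For example, a minimal sketch using nohup or tmux (the log file name is just a placeholder, and "..." stands for the same arguments you passed earlier):

nohup bash scripts/run_pretraining_lamb.sh ... > pretrain_nohup.log 2>&1 &   # keeps running after an SSH disconnect
tail -f pretrain_nohup.log                                                   # reattach to the output at any time
# or: tmux new -s bert_pretrain   (run the training inside the session, detach with Ctrl-b d, re-attach with tmux attach)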

@nevakrien The current hyperparameters in examples/pretrain_bert/README.md are verified on the Intel GPU Max 1550; I'm afraid they will run out of memory on the Max 1100. You can reduce the batch size to avoid OOM. I recommend you try the hyperparameters below (a launch sketch follows the list):

TRAIN_BATCH_SIZE_PHASE1=60
TRAIN_BATCH_SIZE_PHASE2=10
LEARNING_RATE_PHASE1=7.5e-4
LEARNING_RATE_PHASE2=5e-4
NUM_ACCUMULATION_STEPS_PHASE1=64
NUM_ACCUMULATION_STEPS_PHASE2=192
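A minimal sketch of plugging these values into the same launch command you ran earlier; the remaining variables (e.g. $EVAL_BATCH_SIZE, $DATATYPE, $NUM_GPUS, $BERT_MODEL, $DATA_DIR, $PRETRAIN_RESULT_DIR) are assumed to be set as before:

export TRAIN_BATCH_SIZE_PHASE1=60
export TRAIN_BATCH_SIZE_PHASE2=10
export LEARNING_RATE_PHASE1=7.5e-4
export LEARNING_RATE_PHASE2=5e-4
export NUM_ACCUMULATION_STEPS_PHASE1=64
export NUM_ACCUMULATION_STEPS_PHASE2=192
bash scripts/run_pretraining_lamb.sh $TRAIN_BATCH_SIZE_PHASE1 $TRAIN_BATCH_SIZE_PHASE2 $EVAL_BATCH_SIZE \
    $LEARNING_RATE_PHASE1 $LEARNING_RATE_PHASE2 $DATATYPE $USE_XLA $NUM_GPUS \
    $WARMUP_STEPS_PHASE1 $WARMUP_STEPS_PHASE2 $TRAIN_STEPS $SAVE_CHECKPOINT_STEPS \
    $NUM_ACCUMULATION_STEPS_PHASE1 $NUM_ACCUMULATION_STEPS_PHASE2 $BERT_MODEL $DATA_DIR $PRETRAIN_RESULT_DIR \
    |& tee pretrain_lamb.log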

For more details about hyperparameter settings, I recommend you have a look at pretrain_config.sh and squad_config.sh.

@YuningQiu

Hi @nevakrien, do you have further questions or concerns on this issue? Can we close this one?
