TPU-V4 #255
Your installed JAX version is wrong for your TPU version (this repo is old).
Also, your libcudart errors mean you need to uninstall TensorFlow and install tensorflow-cpu, since there is no GPU on a TPU VM. I would recommend going through https://github.com/ayaka14732/tpu-starter; it can help with some of the errors you will face. |
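As a quick sanity check, assuming JAX is installed on the TPU VM itself, something like the following confirms whether the TPU runtime was actually picked up:

```python
import jax

# If the TPU runtime initialized correctly, this prints "tpu";
# "cpu" means JAX fell back to the host, as in the logs further down.
print(jax.__version__)
print(jax.default_backend())
print(jax.devices())
```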
I use the v2-alpha-tpuv4 runtime on a TPU v4-8. |
According to the documentation below, the number of TPU devices visible to JAX is 4 on a v4-8, rather than 8 as on v2-8/v3-8 (each v4 chip presents its two cores as a single device).
(source: https://cloud.google.com/tpu/docs/run-calculation-jax) |
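A minimal check of the claim above (the values in the comments assume a v4-8 slice):

```python
import jax

# On a v4-8 slice JAX reports 4 devices (one per chip, megacore);
# a v2-8 or v3-8 slice reports 8 (one per TensorCore).
print(jax.device_count())
print(jax.local_device_count())
```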
Aha, I see. Is there any way to fine-tune GPT-J using only 4 TPU devices? |
I changed the following from 8 to 4 in the configuration file.
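A sketch of the kind of change presumably meant here, assuming the fields involved are the sharding-related ones from the repo's example JSON configs (the actual snippet is not shown in this comment, and the path is a placeholder):

```python
import json

# Hypothetical edit: the field names below come from the repo's example
# configs ("tpu_size", "cores_per_replica"), not from this comment.
with open("configs/my_config.json") as f:
    cfg = json.load(f)

cfg["tpu_size"] = 4           # was 8: number of devices in the slice
cfg["cores_per_replica"] = 4  # was 8: model-parallel shards per replica

with open("configs/my_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```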
|
If I do that, I get an "AssertionError: Incompatible checkpoints" error. |
I forgot to mention that my case is pre-training from scratch. The compatibility concern above seems valid, since it's not clear whether checkpoints sharded for 8 cores can be loaded on 4 cores. |
Is there any way to convert the checkpoints to, let's say, 4 shards? |
No idea, but I guess not; I didn't try. I plan to move forward to TPU v4. |
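For what it's worth, resharding in principle means concatenating each parameter's 8 shards along its partitioned axis and re-splitting them into 4. A generic NumPy sketch, not the repo's checkpoint reader, with the partition axis chosen purely for illustration:

```python
import numpy as np

def reshard(shards, new_count, axis=0):
    """Concatenate per-device shards of one array and re-split them."""
    full = np.concatenate(shards, axis=axis)
    return np.split(full, new_count, axis=axis)

# Toy example: 8 shards of shape (2, 16) -> 4 shards of shape (4, 16).
old_shards = np.split(np.arange(16 * 16).reshape(16, 16), 8, axis=0)
new_shards = reshard(old_shards, 4, axis=0)
print([s.shape for s in new_shards])  # [(4, 16), (4, 16), (4, 16), (4, 16)]
```

Which axis each real parameter is partitioned on depends on the model's partitioning spec, so this would need to be applied per parameter accordingly.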
I'm curious how this attempt turned out. Has anyone succeeded in running GPT-J on TPU v4? |
How can one use this project to fine-tune on a TPU v4 instance?
I have tried everything, but I always get errors.
Most commonly:
UserWarning: cloud_tpu_init failed: KeyError('v4-8')
This a JAX bug; please report an issue at https://github.com/google/jax/issues
_warn(f"cloud_tpu_init failed: {repr(exc)}\n This a JAX bug; please report "
2023-03-05 21:55:43.305762: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-05 21:55:43.941977: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-03-05 21:55:43.942070: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-03-05 21:55:43.942076: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Traceback (most recent call last):
File "device_train.py", line 191, in
raise ValueError(msg)
ValueError: each shard needs a separate device, but device count (1) < shard count (4)
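The final error is consistent with the warnings above it: JAX fell back to CPU, so only 1 device is visible while the config asks for 4 shards. A hedged pre-flight check along these lines (the shard count is taken from the error message) can catch this before device_train.py starts:

```python
import jax

shard_count = 4  # cores_per_replica from the config (assumed)
devices = jax.device_count()
assert devices >= shard_count, (
    f"only {devices} device(s) visible to JAX; the TPU runtime did not "
    "initialize (see the cloud_tpu_init KeyError above), so fix the "
    "JAX install for v4 before running device_train.py"
)
```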