cuda issues with quay.io/pangeo/ml-notebook:2023.02.27 #450

Closed
sebastian-luna-valero opened this issue Apr 11, 2023 · 5 comments
Labels
duplicate (This issue or pull request already exists), question (Further information is requested)

Comments

@sebastian-luna-valero

Hi,

When running this notebook: https://www.tensorflow.org/tutorials/images/cnn

Using quay.io/pangeo/ml-notebook:2023.02.27, in the cell:
https://www.tensorflow.org/tutorials/images/cnn#compile_and_train_the_model

I get:

2023-04-11 14:56:14.664302: W tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:85] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2023-04-11 14:56:14.665682: W tensorflow/compiler/xla/stream_executor/gpu/redzone_allocator.cc:318] INTERNAL: Failed to launch ptxas
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.

which is solved with:

mamba install -c nvidia cuda-nvcc=11.8 # same version as cudatoolkit

Then I also get:

2023-04-11 14:56:14.726269: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.2
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-04-11 14:56:14.727982: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-11 14:56:14.728180: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc

The problem is solved when I replace optimizer='adam' with optimizer=tf.keras.optimizers.legacy.Adam() in:

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
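
With that change applied, the compile call looks like this (a minimal sketch; the small Sequential model only stands in for the tutorial's CNN):

import tensorflow as tf

# Stand-in for the tutorial's CNN; any Keras model reproduces the behaviour.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(10),
])

# Workaround from above: pass the legacy Adam optimizer object instead of the
# 'adam' string, which sidesteps the XLA compilation step that could not find libdevice.
model.compile(optimizer=tf.keras.optimizers.legacy.Adam(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])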

These solutions were found in:
https://discuss.tensorflow.org/t/cant-find-libdevice-directory-cuda-dir-nvvm-libdevice/
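
The log above also points at a third option: setting XLA_FLAGS=--xla_gpu_cuda_data_dir. A minimal sketch of that approach, assuming cuda-nvcc from the nvidia channel has been installed into the active conda environment so that $CONDA_PREFIX/nvvm/libdevice exists (a path assumption I have not verified inside the image):

import os

# Assumption: cuda-nvcc installs libdevice under $CONDA_PREFIX/nvvm/libdevice.
# Point XLA's CUDA search path at the conda environment before importing TensorFlow.
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=" + os.environ["CONDA_PREFIX"]

import tensorflow as tf  # must be imported after XLA_FLAGS is set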

Maybe the ml-notebook image could be updated to solve these issues?

Best regards,
Sebastian

@sebastian-luna-valero
Author

xref: tensorflow/tensorflow#58681

@scottyhq
Member

Thanks for the report and workaround @sebastian-luna-valero. Ideally these tutorials would just work, but unfortunately ptxas/cuda-nvcc isn't installed by default; see this README note:

* Our `ml-notebook` image now contains JAX and TensorFlow with XLA enabled. Due to licensing issues, conda-forge does not have `ptxas`, but `ptxas` is needed for XLA to work correctly. Should you like to use JAX and/or TensorFlow with XLA optimization, please install `ptxas` on your own, for example, by `conda install -c nvidia cuda-nvcc`. At the time of writing (October 2022), JAX throws a compilation error if the `ptxas` version is higher than the driver version. There does not exist an easy solution for K80 GPUs, but in the case of T4 GPUs, you should install `conda install -c nvidia cuda-nvcc==11.6.*` to be safe. Alternatively for any GPU, you could set an environment variable to resolve the error caused by JAX: `XLA_FLAGS="--xla_gpu_force_compilation_parallelism=1"`. The aforementioned error will be removed (and likely turned into a warning) in a future version of JAX. See https://github.com/google/jax/issues/12776#issuecomment-1276649134

I did just tag a more recent image with newer versions of everything in case it's helpful: 2023.03.28...
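
For anyone debugging this, a quick sanity check from inside the notebook that ptxas is actually visible to the kernel (an illustrative sketch, not something shipped in the image):

import shutil
import subprocess

# ptxas is only present after installing cuda-nvcc manually, as noted above.
ptxas = shutil.which("ptxas")
if ptxas is None:
    print("ptxas not found; install it, e.g. conda install -c nvidia cuda-nvcc")
else:
    print(subprocess.run([ptxas, "--version"], capture_output=True, text=True).stdout)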

@sebastian-luna-valero
Author

Thanks for the background!

I just tested the 2023.03.28 image and found the same issues, but now I get why ;)

@ngam
Contributor

ngam commented May 15, 2023

Btw, we should be able to finally resolve this very soon! In practice, we could just resolve it, but we should wait for a migration trigger to avoid unforeseen issues. xref #438

@weiji14 added the duplicate (This issue or pull request already exists) and question (Further information is requested) labels on Jun 27, 2023
@weiji14
Member

weiji14 commented Jun 27, 2023

> I just tested the 2023.03.28 image and found the same issues, but now I get why ;)

Cool, hopefully things are working for you now on the newer image after manually running conda install -c nvidia cuda-nvcc!

Will close this as a duplicate of #438 to better focus the conversation on that thread.

@weiji14 closed this as not planned (won't fix, can't repro, duplicate, stale) on Jun 27, 2023