cuda issues with quay.io/pangeo/ml-notebook:2023.02.27 #450

Closed
sebastian-luna-valero opened this issue Apr 11, 2023 · 5 comments
Labels
duplicate (This issue or pull request already exists), question (Further information is requested)

Comments

@sebastian-luna-valero

Hi,

When running this notebook: https://www.tensorflow.org/tutorials/images/cnn

Using quay.io/pangeo/ml-notebook:2023.02.27, in the cell:
https://www.tensorflow.org/tutorials/images/cnn#compile_and_train_the_model

I get:

2023-04-11 14:56:14.664302: W tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:85] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2023-04-11 14:56:14.665682: W tensorflow/compiler/xla/stream_executor/gpu/redzone_allocator.cc:318] INTERNAL: Failed to launch ptxas
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.

which is solved with:

mamba install -c nvidia cuda-nvcc=11.8 # same version as cudatoolkit

Then I also get:

2023-04-11 14:56:14.726269: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.2
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-04-11 14:56:14.727982: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-11 14:56:14.728180: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc

The problem is solved when I replace optimizer='adam' with optimizer=tf.keras.optimizers.legacy.Adam() in:

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
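
With that change applied, the compile call looks like this (a minimal sketch; the small Sequential model only stands in for the tutorial's CNN):

import tensorflow as tf

# Stand-in for the tutorial's CNN; any Keras model reproduces the behaviour.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(10),
])

# Workaround from above: pass the legacy Adam optimizer object instead of the
# 'adam' string, which sidesteps the XLA compilation step that could not find libdevice.
model.compile(optimizer=tf.keras.optimizers.legacy.Adam(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])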

These solutions were found in:
https://discuss.tensorflow.org/t/cant-find-libdevice-directory-cuda-dir-nvvm-libdevice/
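
The log above also points at a third option: setting XLA_FLAGS=--xla_gpu_cuda_data_dir. A minimal sketch of that approach, assuming cuda-nvcc from the nvidia channel has been installed into the active conda environment so that $CONDA_PREFIX/nvvm/libdevice exists (a path assumption I have not verified inside the image):

import os

# Assumption: cuda-nvcc installs libdevice under $CONDA_PREFIX/nvvm/libdevice.
# Point XLA's CUDA search path at the conda environment before importing TensorFlow.
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=" + os.environ["CONDA_PREFIX"]

import tensorflow as tf  # must be imported after XLA_FLAGS is set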

Maybe the ml-notebook image could be updated to solve these issues?

Best regards,
Sebastian

@sebastian-luna-valero
Author

xref: tensorflow/tensorflow#58681

@scottyhq
Member

Thanks for the report and workaround @sebastian-luna-valero. Ideally these tutorials would just work, but unfortunately ptxas/cuda-nvcc isn't installed by default; see this README note:

* Our `ml-notebook` image now contains JAX and TensorFlow with XLA enabled. Due to licensing issues, conda-forge does not have `ptxas`, but `ptxas` is needed for XLA to work correctly. Should you like to use JAX and/or TensorFlow with XLA optimization, please install `ptxas` on your own, for example, by `conda install -c nvidia cuda-nvcc`. At the time of writing (October 2022), JAX throws a compilation error if the `ptxas` version is higher than the driver version. There does not exist an easy solution for K80 GPUs, but in the case of T4 GPUs, you should install `conda install -c nvidia cuda-nvcc==11.6.*` to be safe. Alternatively for any GPU, you could set an environment variable to resolve the error caused by JAX: `XLA_FLAGS="--xla_gpu_force_compilation_parallelism=1"`. The aforementioned error will be removed (and likely turned into a warning) in a future version of JAX. See https://github.com/google/jax/issues/12776#issuecomment-1276649134

I did just tag a more recent image with newer versions of everything in case it's helpful: 2023.03.28...
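
For anyone debugging this, a quick sanity check from inside the notebook that ptxas is actually visible to the kernel (an illustrative sketch, not something shipped in the image):

import shutil
import subprocess

# ptxas is only present after installing cuda-nvcc manually, as noted above.
ptxas = shutil.which("ptxas")
if ptxas is None:
    print("ptxas not found; install it, e.g. conda install -c nvidia cuda-nvcc")
else:
    print(subprocess.run([ptxas, "--version"], capture_output=True, text=True).stdout)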

@sebastian-luna-valero
Author

Thanks for the background!

I just tested the 2023.03.28 image and found the same issues, but now I get why ;)

@ngam
Contributor

ngam commented May 15, 2023

Btw, we should be able to finally resolve this very soon! In practice, we could just resolve it, but we should wait for a migration trigger to avoid unforeseen issues. xref #438

@weiji14 added the duplicate (This issue or pull request already exists) and question (Further information is requested) labels on Jun 27, 2023
@weiji14
Member

weiji14 commented Jun 27, 2023

> I just tested the 2023.03.28 image and found the same issues, but now I get why ;)

Cool, hopefully things are working for you now on the newer image after manually running conda install -c nvidia cuda-nvcc!

Will close this as a duplicate of #438 to better focus the conversation on that thread.

@weiji14 closed this as not planned (won't fix, can't repro, duplicate, stale) on Jun 27, 2023