cuDNN transient dependency not found #88

lemairecarl · 2021-08-18T15:56:53Z

The TensorFlow 2.5 Python wheel can, in some cases, load libcudnn_adv_train.so directly without first loading libcudnn_ops_train.so, which then triggers the loading of libcudnn_ops_train.so as a dependency. Since it is loaded as a transitive dependency, the dynamic linker does not use the RPATHs, so the library isn't found.

This did not happen with TF 2.4.1, seemingly because libcudnn_ops_train is always loaded before libcudnn_adv_train.

Note that I built TF 2.5 (and the python wheels) from source, and that this issue also happens when I unmanylinuxize the wheel from PyPI.

The solution that Maxime and I suggest is to patch the cuDNN binaries so that their RPATHs contain $ORIGIN. I tested it, and it works.

To reproduce the crash:

1. Use a GPU node (otherwise CPU-only operators will be used, and it won't crash).
2. Install TensorFlow 2.5 (create and activate a virtualenv, then pip install --no-index tensorflow==2.5).
3. Launch python and run:

import tensorflow as tf
inputs = tf.random.normal([32, 10, 8])
lstm = tf.keras.layers.LSTM(4)
output = lstm(inputs)

Here is a version that uses the same feature (LSTM), but doesn't crash:

import tensorflow as tf

# If we execute BN, it loads libcudnn_ops_train.so, which avoids a crash when trying to use LSTM
bn = tf.keras.layers.BatchNormalization()
inputs = tf.random.normal([32, 10, 8, 8])
output = bn(inputs)

inputs = tf.random.normal([32, 10, 8])
lstm = tf.keras.layers.LSTM(4)
output = lstm(inputs)


mboisson · 2021-08-18T16:08:30Z

It seems that adding $ORIGIN to the RPATH of the cuDNN libraries fixes the issue, but I am puzzled as to why it would be needed. Any idea, @bartoldeman?

bartoldeman · 2021-08-18T17:35:35Z

Pasting LD_DEBUG output from Slack:

    131647:     find library=libcudnn_adv_infer.so.8 [0]; searching
    131647:      search path=/cvmfs/soft.computecanada.ca/easybuild/software/2020/CUDA/cuda11.1/cudnn/8.2.0/lib64               (RPATH from file /localscratch/lemc2220.23955448.0/env/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
    131647:       trying file=/cvmfs/soft.computecanada.ca/easybuild/software/2020/CUDA/cuda11.1/cudnn/8.2.0/lib64/libcudnn_adv_infer.so.8
    131647:
    131647:
    131647:     calling init: /cvmfs/soft.computecanada.ca/easybuild/software/2020/CUDA/cuda11.1/cudnn/8.2.0/lib64/libcudnn_adv_infer.so.8
    131647:
    131647:     find library=libcudnn_adv_train.so.8 [0]; searching
    131647:      search path=/cvmfs/soft.computecanada.ca/easybuild/software/2020/CUDA/cuda11.1/cudnn/8.2.0/lib64               (RPATH from file /localscratch/lemc2220.23955448.0/env/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
    131647:       trying file=/cvmfs/soft.computecanada.ca/easybuild/software/2020/CUDA/cuda11.1/cudnn/8.2.0/lib64/libcudnn_adv_train.so.8
    131647:
    131647:     find library=libcudnn_ops_train.so.8 [0]; searching
    131647:      search path=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/lib:/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/bin/../lib            (RPATH from file python)
    131647:       trying file=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/lib/libcudnn_ops_train.so.8
    131647:       trying file=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/bin/libcudnn_ops_train.so.8
    131647:       trying file=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/bin/../lib/libcudnn_ops_train.so.8
    131647:      search cache=/cvmfs/soft.computecanada.ca/gentoo/2020/etc/ld.so.cache
    131647:      search path=/cvmfs/soft.computecanada.ca/gentoo/2020/lib64:/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64:/usr/lib64/nvidia               (system search path)
    131647:       trying file=/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libcudnn_ops_train.so.8
    131647:       trying file=/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libcudnn_ops_train.so.8
    131647:       trying file=/usr/lib64/nvidia/libcudnn_ops_train.so.8
    131647:
Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_adv_train.so.8 is in your library path!
Aborted (core dumped)
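
For longer traces, it can help to list, for each lookup, where the search paths came from. Below is a minimal sketch, assuming the trace has the same line layout as the output above (as produced with LD_DEBUG=libs); `summarize_lookups` is a hypothetical helper name, not part of any existing tool.

```python
import re

# "find library=<soname> [0]; searching" starts a new lookup.
FIND_RE = re.compile(r"find library=(\S+) \[\d+\]; searching")
# "search path=<dirs>   (<origin>)" tells where the search path came from.
PATH_RE = re.compile(r"search path=\S+\s+\((.+)\)\s*$")


def summarize_lookups(trace):
    """Return [(library, [search path origins])] in the order lookups occur."""
    lookups = []
    for line in trace.splitlines():
        found = FIND_RE.search(line)
        if found:
            lookups.append((found.group(1), []))
            continue
        path = PATH_RE.search(line)
        if path and lookups:
            # Attach the origin to the most recent "find library" entry.
            lookups[-1][1].append(path.group(1))
    return lookups
```

Applied to the trace above, the entry for libcudnn_ops_train.so.8 would list "RPATH from file python" and "system search path" as origins, while the two preceding lookups list the RPATH of _pywrap_tensorflow_internal.so.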

bartoldeman · 2021-08-18T18:50:36Z

For some reason the last lookup uses the RPATH from the python executable instead of that of the top-level shared object doing the loading (tensorflow/python/_pywrap_tensorflow_internal.so).
The exact mechanism escapes me (since dlopen is doing the job here; _pywrap_tensorflow_internal.so doesn't directly link to any of the cuDNN libraries), but it's a good idea for ldd never to fail on any shared object, and it does now:

$ ldd $EBROOTCUDNN/lib64/libcudnn_adv_train.so
        linux-vdso.so.1 (0x00007ffe18984000)
        libcudnn_ops_infer.so.8 => not found
        libcudnn_ops_train.so.8 => not found
        libcudnn_adv_infer.so.8 => not found

So I would recommend indeed that $ORIGIN is added to the RPATH of the shared cudnn objects, which is in any case consistent with all software we compile ourselves with easybuild (they have $ORIGIN:$ORIGIN/../lib:$ORIGIN/../lib64 automatically).
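
The "ldd should never fail" check can be automated. The following is a rough sketch, assuming ldd is on the PATH and prints unresolved dependencies in the "name => not found" form shown above; `parse_not_found` and `unresolved` are hypothetical helper names.

```python
import subprocess


def parse_not_found(ldd_output):
    """Return the sonames that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


def unresolved(path):
    """Run ldd on one shared object and return its unresolved dependencies."""
    result = subprocess.run(
        ["ldd", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_not_found(result.stdout)
```

Calling `unresolved()` on each .so file of an installation and flagging any non-empty result would catch this class of problem at install time.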

lemairecarl · 2021-08-18T19:43:38Z

From what I've seen in its code, TensorFlow only includes cudnn.h, and the corresponding libcudnn.so "opportunistically" loads the other cuDNN libs. See the description of cudnn on page 1 here: https://docs.nvidia.com/deeplearning/cudnn/pdf/cuDNN-API.pdf

TF may be loading libraries from Python using LoadLibrary; if so, I haven't seen the related code.

@bartoldeman If all software we build with EB is supposed to have $ORIGIN, why didn't cuDNN have it?

mboisson · 2021-08-18T19:48:54Z

> @bartoldeman If all software we build with EB is supposed to have $ORIGIN, why didn't cuDNN have it?

Because we don't "build" cuDNN, we just unpack it. And our recipe did not patch it because it did not appear needed until now.

mboisson · 2021-08-18T19:50:40Z

So, now that we have established that we should indeed patch, how do we do that without triggering #73?

bartoldeman · 2021-08-18T20:48:45Z

Yes, that's an issue for sure, and it seems #73 can't easily be bypassed using symlinks, since cuDNN ships copies instead, for some strange reason. Let me check a little more what is possible.

bartoldeman · 2021-08-19T01:26:19Z

I found the real issue:

$ readelf -a $EBROOTCUDNN/lib64/libcudnn.so | grep PATH
 0x000000000000001d (RUNPATH)            Library runpath: [$ORIGIN]

Because this has RUNPATH and not RPATH, it is only used to resolve the library's direct dependencies, not indirect ones (see https://www.qt.io/blog/2011/10/28/rpath-and-runpath).

So it should be sufficient to change the RUNPATH to RPATH in just this library:

patchelf --set-rpath '$ORIGIN' --force-rpath $EBROOTCUDNN/lib/libcudnn.so.8.0.3
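
To confirm which tag a library carries before and after patching, the readelf check can be scripted. This is a minimal sketch, assuming readelf is on the PATH and prints the dynamic section in the format shown above; `parse_search_tags` and `search_tags` are hypothetical helper names.

```python
import re
import subprocess

# Matches lines such as:
#  0x000000000000001d (RUNPATH)            Library runpath: [$ORIGIN]
TAG_RE = re.compile(r"\((RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*)\]")


def parse_search_tags(readelf_output):
    """Map 'RPATH' / 'RUNPATH' to its value for each tag found in the output."""
    return {m.group(1): m.group(2) for m in TAG_RE.finditer(readelf_output)}


def search_tags(path):
    """Run `readelf -d` on a shared object and return its RPATH/RUNPATH tags."""
    result = subprocess.run(
        ["readelf", "-d", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_search_tags(result.stdout)
```

Before the patchelf call, `search_tags()` on libcudnn.so would be expected to return a RUNPATH entry; after `--force-rpath`, an RPATH entry instead.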

Then the issue with #73 remains. Is TF 2.5 compatible with newer cuDNN? If so, we can install cuDNN 8.2.1 (for which the CUDA 11.3 tarball is compatible with CUDA 11.0 to 11.3) or perhaps 8.0.5.39, and then link TF to that instead.

We should also add keepsymlinks = True to the easyconfig to keep the symlinks for the shared libraries.

bartoldeman · 2021-08-19T12:11:55Z

I reinstalled cuDNN 8.0.3 and 8.2.0 to dev with fixes applied.

I'll check if I can push to prod safely; in this case a file is replaced by a symlink to a different file which may be ok.

lemairecarl · 2021-08-19T14:53:21Z

Nice find Bart!

I would suggest installing cuDNN 8.1. That's the most recent version listed as tested by the TF maintainers: https://www.tensorflow.org/install/source#gpu

mboisson · 2021-08-19T15:05:30Z

@bartoldeman, could a similar issue be involved in #66?

bartoldeman · 2021-08-19T15:12:17Z

TF 2.5 is already linked to 8.2.0 for CUDA 11.1.1, we can just keep it that way.

I verified that replacing a file with a symlink to a different file is ok for in-place updates (does not crash currently running programs).
Since I did just that (replacing e.g. the file libcudnn.so.8 with a symlink to libcudnn.so.8.0.3) I pushed to prod now.
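
For reference, one way to swap a regular file for a symlink without a window in which the name is missing is to create the link under a temporary name and rename it over the old file. This is only a sketch of that general technique; the thread does not say how the replacement was actually done, and `replace_with_symlink` is a hypothetical helper name.

```python
import os


def replace_with_symlink(link_path, target_name):
    """Replace whatever is at link_path with a symlink to target_name.

    The symlink is first created under a temporary name in the same
    directory, then renamed over link_path. On POSIX the rename is atomic,
    and processes that already have the old file open or mapped keep using
    the old inode.
    """
    tmp_path = link_path + ".tmp-symlink"  # assumed not to exist already
    os.symlink(target_name, tmp_path)
    os.replace(tmp_path, link_path)
```

For example, `replace_with_symlink("libcudnn.so.8", "libcudnn.so.8.0.3")`, run from the library directory, would turn the copy into a relative symlink.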

I'll look into #66 but will close this.

lemairecarl assigned bartoldeman Aug 18, 2021

mboisson added the bug label Aug 18, 2021

bartoldeman closed this as completed Aug 19, 2021
