cuDNN transient dependency not found #88

Closed
lemairecarl opened this issue Aug 18, 2021 · 12 comments

@lemairecarl
Contributor

The TensorFlow 2.5 Python wheel can, in some cases, load libcudnn_adv_train.so directly, without loading libcudnn_ops_train.so first; loading libcudnn_adv_train.so then triggers the loading of libcudnn_ops_train.so as its dependency. Since libcudnn_ops_train.so is loaded as a transitive dependency, the RPATHs are not used by the dynamic linker, so the library isn't found.

This did not happen with TF 2.4.1, seemingly because libcudnn_ops_train.so was always loaded before libcudnn_adv_train.so.

Note that I built TF 2.5 (and the python wheels) from source, and that this issue also happens when I unmanylinuxize the wheel from PyPI.

The solution that Maxime and I suggest is to patch the cuDNN binaries so that their RPATHs contain $ORIGIN. I tested it, and it works.

To reproduce the crash:

  1. Use a GPU node (else, CPU-only operators will be used, and it won't crash)
  2. Install tensorflow 2.5 (create and activate virtualenv, pip install --no-index tensorflow==2.5)
  3. Launch python and run:
import tensorflow as tf
inputs = tf.random.normal([32, 10, 8])
lstm = tf.keras.layers.LSTM(4)
output = lstm(inputs)

Here is a version that uses the same feature (LSTM), but doesn't crash:

import tensorflow as tf

# If we execute BN, it loads libcudnn_ops_train.so, which avoids a crash when trying to use LSTM
bn = tf.keras.layers.BatchNormalization()
inputs = tf.random.normal([32, 10, 8, 8])
output = bn(inputs)

inputs = tf.random.normal([32, 10, 8])
lstm = tf.keras.layers.LSTM(4)
output = lstm(inputs)
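
The BatchNormalization trick above works because libcudnn_ops_train.so is already in the process when libcudnn_adv_train.so asks for it. The same effect can probably be had by preloading the library by full path with ctypes before touching TensorFlow. This is only a sketch of a user-side workaround, not something tested in this thread: the helper name `preload_libraries` is made up, and the lib64 location under $EBROOTCUDNN is an assumption based on the paths shown elsewhere in this issue.

```python
import ctypes
import os


def preload_libraries(paths):
    """Try to dlopen each shared library globally; return the paths that loaded."""
    loaded = []
    for path in paths:
        try:
            # RTLD_GLOBAL keeps the library visible to objects loaded later.
            ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
        except OSError:
            # Library missing or not loadable: skip it, the caller can inspect the result.
            continue
        loaded.append(path)
    return loaded


if __name__ == "__main__":
    # Assumed layout: $EBROOTCUDNN/lib64 contains the cuDNN libraries.
    root = os.environ.get("EBROOTCUDNN", "")
    print(preload_libraries([os.path.join(root, "lib64", "libcudnn_ops_train.so.8")]))
```

A full path is needed here: a bare soname would go through the same search that fails in this issue.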

@mboisson added the bug label on Aug 18, 2021

@mboisson
Member

It seems that adding $ORIGIN to the RPATH of the cuDNN libraries fixes the issue, but I am puzzled as to why it would be needed. Any idea, @bartoldeman?

@bartoldeman
Contributor

Pasting LD_DEBUG output from slack:

    131647:     find library=libcudnn_adv_infer.so.8 [0]; searching
    131647:      search path=/cvmfs/soft.computecanada.ca/easybuild/software/2020/CUDA/cuda11.1/cudnn/8.2.0/lib64               (RPATH from file /localscratch/lemc2220.23955448.0/env/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
    131647:       trying file=/cvmfs/soft.computecanada.ca/easybuild/software/2020/CUDA/cuda11.1/cudnn/8.2.0/lib64/libcudnn_adv_infer.so.8
    131647:
    131647:
    131647:     calling init: /cvmfs/soft.computecanada.ca/easybuild/software/2020/CUDA/cuda11.1/cudnn/8.2.0/lib64/libcudnn_adv_infer.so.8
    131647:
    131647:     find library=libcudnn_adv_train.so.8 [0]; searching
    131647:      search path=/cvmfs/soft.computecanada.ca/easybuild/software/2020/CUDA/cuda11.1/cudnn/8.2.0/lib64               (RPATH from file /localscratch/lemc2220.23955448.0/env/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
    131647:       trying file=/cvmfs/soft.computecanada.ca/easybuild/software/2020/CUDA/cuda11.1/cudnn/8.2.0/lib64/libcudnn_adv_train.so.8
    131647:
    131647:     find library=libcudnn_ops_train.so.8 [0]; searching
    131647:      search path=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/lib:/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/bin:/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/bin/../lib            (RPATH from file python)
    131647:       trying file=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/lib/libcudnn_ops_train.so.8
    131647:       trying file=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/bin/libcudnn_ops_train.so.8
    131647:       trying file=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.6.10/bin/../lib/libcudnn_ops_train.so.8
    131647:      search cache=/cvmfs/soft.computecanada.ca/gentoo/2020/etc/ld.so.cache
    131647:      search path=/cvmfs/soft.computecanada.ca/gentoo/2020/lib64:/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64:/usr/lib64/nvidia               (system search path)
    131647:       trying file=/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libcudnn_ops_train.so.8
    131647:       trying file=/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libcudnn_ops_train.so.8
    131647:       trying file=/usr/lib64/nvidia/libcudnn_ops_train.so.8
    131647:
Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_adv_train.so.8 is in your library path!
Aborted (core dumped)
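
To make a log like the one above easier to read, a small parser can list, for each library the linker searched for, which object's RPATH or RUNPATH was consulted. This is a rough sketch: the function name `rpath_sources` is made up, and it assumes the log has the `find library=` / `(RPATH from file ...)` layout shown above (for example as produced with LD_DEBUG=libs, which may not be the exact setting used here).

```python
import re

FIND_RE = re.compile(r"find library=(\S+)")
PATH_RE = re.compile(r"\((?:RPATH|RUNPATH) from file (.+?)\)")


def rpath_sources(ld_debug_text):
    """Map each searched library to the objects whose RPATH/RUNPATH was consulted."""
    result = {}
    current = None
    for line in ld_debug_text.splitlines():
        found = FIND_RE.search(line)
        if found:
            # A new search starts; later "search path" lines belong to it.
            current = found.group(1)
            result.setdefault(current, [])
            continue
        source = PATH_RE.search(line)
        if source and current is not None:
            result[current].append(source.group(1))
    return result
```

On the log above this would report `python` as the only RPATH source for libcudnn_ops_train.so.8, which is the anomaly discussed in the next comment.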

@bartoldeman
Contributor

For some reason the last lookup uses the RPATH from the python executable instead of the RPATH of the top-level loading shared object (tensorflow/python/_pywrap_tensorflow_internal.so).
The exact mechanism escapes me (dlopen is doing the loading here; _pywrap_tensorflow_internal.so doesn't link directly to any of the cuDNN libraries), but it is a good idea that ldd never fails on any shared object, and it currently does:

$ ldd $EBROOTCUDNN/lib64/libcudnn_adv_train.so
        linux-vdso.so.1 (0x00007ffe18984000)
        libcudnn_ops_infer.so.8 => not found
        libcudnn_ops_train.so.8 => not found
        libcudnn_adv_infer.so.8 => not found

So I would indeed recommend adding $ORIGIN to the RPATH of the cuDNN shared objects, which is in any case consistent with all the software we compile ourselves with EasyBuild (it gets $ORIGIN:$ORIGIN/../lib:$ORIGIN/../lib64 automatically).
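
The "ldd never fails" check can be automated. Here is a minimal sketch that takes ldd output as text and returns the dependencies reported as not found; the function name `missing_dependencies` is made up, and the parsing assumes the `name => not found` layout shown above.

```python
def missing_dependencies(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Lines look like "libfoo.so.8 => not found" or "libfoo.so.8 => /path (0x...)".
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing
```

Feeding it the output of `ldd` for every shared object in an installation and requiring an empty list would have caught this before TF 2.5 did.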

@lemairecarl
Contributor Author

lemairecarl commented Aug 18, 2021

From what I've seen in its code, TensorFlow only includes cudnn.h, and the corresponding libcudnn.so "opportunistically" loads the other cuDNN libraries. See the description of cuDNN on page 1 here: https://docs.nvidia.com/deeplearning/cudnn/pdf/cuDNN-API.pdf

TF may be loading libraries from Python using LoadLibrary; if so, I haven't seen the related code.

@bartoldeman If all software we build with EB is supposed to have $ORIGIN, why didn't cuDNN have it?

@mboisson
Member

> @bartoldeman If all software we build with EB is supposed to have $ORIGIN, why didn't cuDNN have it?

Because we don't "build" cuDNN, we just unpack it. And our recipe did not patch it because it did not appear needed until now.

@mboisson
Member

So, now that we have established that we should indeed patch, how do we do that without triggering #73?

@bartoldeman
Contributor

Yes, that's an issue for sure, and it seems #73 can't easily be bypassed using symlinks, since cuDNN ships copies instead, for some strange reason. Let me check a little more what is possible.

@bartoldeman
Contributor

I found the real issue:

$ readelf -a $EBROOTCUDNN/lib64/libcudnn.so | grep PATH
 0x000000000000001d (RUNPATH)            Library runpath: [$ORIGIN]

Because this library has RUNPATH and not RPATH, the path is only used to resolve its direct shared-library dependencies, not indirect ones (see https://www.qt.io/blog/2011/10/28/rpath-and-runpath).

So it should be sufficient to change the RUNPATH to RPATH in just this library:

patchelf --set-rpath '$ORIGIN' --force-rpath $EBROOTCUDNN/lib/libcudnn.so.8.0.3
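
For checking several libraries at once, the readelf inspection and the patchelf call can be wrapped in a few lines. This is a sketch only: the helper names are made up, the parser assumes the `readelf` line layout shown above, and the command builder just reproduces the patchelf flags from the command above without running anything.

```python
import re

TAG_RE = re.compile(r"\((RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*)\]")


def dynamic_path_tags(readelf_output):
    """Return e.g. {'RUNPATH': ['$ORIGIN']} from 'readelf -a' or 'readelf -d' output."""
    tags = {}
    for line in readelf_output.splitlines():
        match = TAG_RE.search(line)
        if match:
            # The bracketed value is a colon-separated list of directories.
            tags[match.group(1)] = match.group(2).split(":")
    return tags


def force_rpath_command(library_path, rpath="$ORIGIN"):
    """Build the patchelf argument list that stores the path as RPATH, not RUNPATH."""
    return ["patchelf", "--set-rpath", rpath, "--force-rpath", library_path]
```

A library for which `dynamic_path_tags` returns a `RUNPATH` key but no `RPATH` key is a candidate for the patch; the argument list can be passed to `subprocess.run` as-is, with no shell involved, so `$ORIGIN` is not expanded by accident.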

That leaves the issue with #73. Is TF 2.5 compatible with newer cuDNN? If so, we can install cuDNN 8.2.1 (whose CUDA 11.3 tarball is compatible with CUDA 11.0 to 11.3) or perhaps 8.0.5.39, and then link TF to that instead.

We should also add keepsymlinks = True to the easyconfig to keep the symlinks for the shared libraries.

@bartoldeman
Contributor

I reinstalled cuDNN 8.0.3 and 8.2.0 to dev with fixes applied.

I'll check whether I can push to prod safely; in this case a file is replaced by a symlink to a different file, which may be OK.

@lemairecarl
Contributor Author

lemairecarl commented Aug 19, 2021

Nice find Bart!

I would suggest installing cuDNN 8.1. That's the most recent version listed as tested by the TF maintainers: https://www.tensorflow.org/install/source#gpu

@mboisson
Member

@bartoldeman, could a similar issue be involved in #66?

@bartoldeman
Contributor

TF 2.5 is already linked to 8.2.0 for CUDA 11.1.1, we can just keep it that way.

I verified that replacing a file with a symlink to a different file is ok for in-place updates (does not crash currently running programs).
Since I did just that (replacing e.g. the file libcudnn.so.8 with a symlink to libcudnn.so.8.0.3) I pushed to prod now.
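
For reference, one common way to do such a replacement is to create the symlink under a temporary name and rename it over the old file, so the path never disappears; processes that already have the old file open or mapped keep using the old inode. This is a sketch of that general technique, not necessarily how the update above was done, and the function and file names are made up.

```python
import os
import uuid


def replace_with_symlink(path, target):
    """Atomically replace the file at `path` with a symlink pointing to `target`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary name in the same directory, so the rename stays on one filesystem.
    tmp = os.path.join(directory, ".tmp-link-" + uuid.uuid4().hex)
    os.symlink(target, tmp)
    try:
        # rename(2) swaps the directory entry in a single step.
        os.replace(tmp, path)
    except OSError:
        os.unlink(tmp)
        raise
```

A relative `target` such as a versioned file name in the same directory keeps the link valid if the whole installation tree is relocated.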

I'll look into #66 but will close this.
