cuDNN transient dependency not found #88
It seems that adding |
Pasting
|
For some reason the last lookup uses the RPATH from the python executable instead of the top-level loading shared object (
So I would recommend indeed that |
From what I've seen in its code, Tensorflow only imports
TF may be loading libraries from Python using
@bartoldeman If all software we build with EB is supposed to have
Because we don't "build" cuDNN, we just unpack it. And our recipe did not patch it because that did not appear to be needed until now.
So, now that we have established that we should indeed patch, how do we do that without triggering #73?
Yes, that's an issue for sure, and it seems #73 can't be easily bypassed using symlinks, since cuDNN ships copies instead for some strange reason. Let me check a little more what is possible.
I found the real issue:
Because this has RUNPATH and not RPATH, it only resolves direct shared-library dependencies, not indirect ones (see https://www.qt.io/blog/2011/10/28/rpath-and-runpath). So it should be sufficient to change the RUNPATH to RPATH in just this library:
Then the issue with #73 remains. Is TF 2.5 compatible with newer cuDNN? If so, we can install cuDNN 8.2.1 (for which the CUDA 11.3 tarball is compatible with CUDA 11.0 to 11.3) or perhaps 8.0.5.39, and then link TF to that instead. We should also add
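To check which of the two tags a given library carries, one option is to look at the dynamic section printed by `readelf -d`. The sketch below only parses that text; the function names and the sample line are made up for illustration and are not output captured from the cuDNN libraries.

```python
import re

# Matches dynamic-section lines as printed by `readelf -d`, for example:
#  0x000000000000001d (RUNPATH)            Library runpath: [/some/dir]
_TAG_RE = re.compile(r"\((RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*)\]")


def search_path_tags(readelf_output):
    """Return {tag: [dirs]} for the RPATH/RUNPATH entries in `readelf -d` text."""
    tags = {}
    for line in readelf_output.splitlines():
        match = _TAG_RE.search(line)
        if match:
            tag, value = match.groups()
            tags[tag] = value.split(":") if value else []
    return tags


def covers_transitive_deps(tags):
    """True if the search path also applies to dependencies of dependencies.

    The dynamic linker applies DT_RPATH to the whole dependency tree, but
    ignores it when DT_RUNPATH is present; DT_RUNPATH only covers the
    object's own DT_NEEDED entries.
    """
    return "RPATH" in tags and "RUNPATH" not in tags


# Made-up sample line (hypothetical path), not real cuDNN output.
sample = " 0x000000000000001d (RUNPATH)            Library runpath: [/opt/example/lib]\n"
print(covers_transitive_deps(search_path_tags(sample)))
```

In practice the text would come from running `readelf -d` on the library in question, for instance through `subprocess.run(..., capture_output=True, text=True)`.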
I reinstalled cuDNN 8.0.3 and 8.2.0 to dev with fixes applied. I'll check if I can push to prod safely; in this case a file is replaced by a symlink to a different file, which may be ok.
Nice find Bart! I would suggest installing cuDNN 8.1. That's the most recent version that is mentioned to be tested by TF maintainers. https://www.tensorflow.org/install/source#gpu
@bartoldeman, could a similar issue be involved in #66?
TF 2.5 is already linked to 8.2.0 for CUDA 11.1.1, we can just keep it that way. I verified that replacing a file with a symlink to a different file is ok for in-place updates (does not crash currently running programs). I'll look into #66 but will close this.
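For reference, here is a minimal sketch of one way to replace a file with a symlink in place, assuming a POSIX filesystem. This is not necessarily how the update above was done; the helper name and the temporary suffix are hypothetical. The symlink is created under a temporary name and then renamed over the original, so the path always resolves to something, and processes that already have the old file open keep using the old inode.

```python
import os


def replace_with_symlink(path, target):
    """Replace `path` with a symlink to `target` via an atomic rename."""
    tmp = f"{path}.tmp-symlink"  # hypothetical temporary name
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    # rename(2) semantics: atomic on POSIX when both names are on the same
    # filesystem, so no reader ever sees `path` missing.
    os.replace(tmp, path)
```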
The Tensorflow 2.5 python wheel can, in some cases, load `libcudnn_adv_train.so` directly, without loading `libcudnn_ops_train.so` first; this in turn triggers the loading of `libcudnn_ops_train.so`. Since it is loaded as a transitive (indirect) dependency, the RPATHs are not used by the dynamic linker. Thus, the library isn't found.
This did not happen with TF 2.4.1, seemingly because `libcudnn_ops_train` is always loaded before `libcudnn_adv_train`.
Note that I built TF 2.5 (and the python wheels) from source, and that this issue also happens when I unmanylinuxize the wheel from PyPI.
The solution that Maxime and I suggest is to patch the cuDNN binaries so that their RPATHs contain $ORIGIN. I tested it, and it works.
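A minimal sketch of such a patch, assuming the `patchelf` tool is available on PATH; the helper names are made up for illustration and this is not the actual recipe change. `--set-rpath` writes the search path, and `--force-rpath` makes patchelf store it as DT_RPATH rather than the default DT_RUNPATH.

```python
import subprocess


def patchelf_rpath_command(library, rpath="$ORIGIN"):
    """Build a patchelf command that stores `rpath` as DT_RPATH in `library`."""
    # Passing a list (no shell) keeps the literal string $ORIGIN intact.
    return ["patchelf", "--force-rpath", "--set-rpath", rpath, str(library)]


def patch_library(library, rpath="$ORIGIN"):
    """Run the command; needs patchelf installed and write access to the file."""
    subprocess.run(patchelf_rpath_command(library, rpath), check=True)
```

Calling `patch_library("libcudnn_adv_train.so")` would rewrite that file in place, so it is worth trying on a copy first. Note that `--set-rpath` replaces any existing search path rather than appending to it.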
To reproduce the crash, with TensorFlow installed (`pip install --no-index tensorflow==2.5`), start `python` and run:
Here is a version that uses the same feature (LSTM), but doesn't crash: