cuda-nvcc missing again #438
Comments
@dhruvbalwada I thought it was removed intentionally b/c it was no longer needed? See the conversation here: #398 ...
Maybe @yuvipanda, @ngam, or @weiji14 can chip in about why the problem has resurfaced?
It’s a complicated issue with all sorts of moving parts. I think for now the best thing is to keep it out and let the user find a resolution. This is generally a tricky problem, and mismatches are bound to happen. The good news is that cuda-nvcc is coming to conda-forge soon; the bad news is that it’ll be a while before the lengthy migration effort concludes. Xref:
Btw, thanks @dhruvbalwada for keeping an eye on this, and for the detailed report :)
Small update: This is finally getting resolved... hopefully very soon! Xref #450
Looks like
We should likely wait. I am still trying to assess how best to migrate JAX and TensorFlow to the new packaging format. We are in a bit of a bind here... with volunteer maintainers occupied with other tasks... but TensorFlow 2.12 is very close and I am making some progress on jaxlib.
Someone reported on the forum at https://discourse.pangeo.io/t/how-to-run-code-using-gpu-on-pangeo-saying-libdevice-not-found-at-libdevice-10-bc/3672 about missing cuda-nvcc and XLA_FLAGS causing issues. Can we revisit adding cuda-nvcc to the image?
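For context, the workaround that usually comes up in that kind of report is pointing XLA at an existing CUDA toolkit via XLA_FLAGS. A minimal sketch (the toolkit path is an assumption and has to match whatever the image actually ships):

```python
import os

# Point XLA at a CUDA installation that contains ptxas/libdevice.
# /usr/local/cuda is an assumed location; adjust to the actual toolkit path.
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=/usr/local/cuda"

import jax  # must be imported after XLA_FLAGS is set

print(jax.devices())
```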
Quick note to say that once those PRs are merged, users shouldn't have to install cuda-nvcc manually.
@dhruvbalwada Try my/b-data's CUDA-enabled JupyterLab Python docker stack.

On the host (❗ NVIDIA Driver v555.42.02 required):

```bash
docker run --gpus all --rm -ti glcr.b-data.ch/jupyterlab/cuda/python/base bash
```

In the container:

```bash
pip install "jax[cuda12_local]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
python
```

```python
Python 3.12.3 (main, Apr 9 2024, 18:09:17) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> jax.random.PRNGKey(0)
Array([0, 0], dtype=uint32)
>>> jax.devices()
[cuda(id=0)]
>>>
```

What makes my/b-data's images different:

ℹ️ For further explanations, see iot-salzburg/gpu-jupyter#123 (comment) ff.
@dhruvbalwada Or you could use

```bash
docker run --gpus all --rm -ti glcr.b-data.ch/jupyterlab/python/base bash
```

which does not have a CUDA Toolkit pre-installed, and then

```bash
pip install "jax[cuda12]" jaxlib
```

which brings its own CUDA libraries.
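A quick way to confirm which CUDA pieces JAX actually ended up with after either install route (a sketch; `print_environment_info` is available in recent JAX releases):

```python
import jax

# Prints jax/jaxlib versions and the CUDA/cuDNN components JAX detects,
# which helps spot mismatches between bundled and system libraries.
jax.print_environment_info()
print(jax.devices())
```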
Final note: Using
BTW, this issue is resolved with the
Thanks @benz0li for noticing! Yes, it looks like we are using the cuda build pinned at pangeo-docker-images/ml-notebook/conda-lock.yml line 4476 (commit 8be5af2), which pulled in the packages at pangeo-docker-images/ml-notebook/conda-lock.yml lines 1820 to 1833 (commit 8be5af2).
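For anyone wanting to see which CUDA-related pins a lock file like that carries, a small sketch (the path is an assumption and the exact line numbers will differ between commits):

```python
from pathlib import Path

# Print every CUDA-related pin in the lock file, with line numbers,
# to see which package builds pulled the CUDA toolchain in.
lock_file = Path("pangeo-docker-images/ml-notebook/conda-lock.yml")
for lineno, line in enumerate(lock_file.read_text().splitlines(), start=1):
    if "cuda" in line.lower():
        print(f"{lineno}: {line.strip()}")
```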
I'll refactor #549 to update the
@benz0li, this is very impressive work, and I'd love to continue the discussion somewhere, maybe on #345 where I've been thinking about building on top of
It seems that the problem detected and solved in issue #387 has resurfaced. I think this happened after #435 was merged.
The problem:
There is a ptxas-based error that shows up. It can be easily reproduced by running a basic JAX operation on the GPU (see the sketch below), which gives an error indicating that the CUDA compilation tools (ptxas/libdevice) cannot be found.
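A minimal example of the kind of JAX call that exercises the XLA GPU compilation path, and therefore fails when cuda-nvcc/ptxas is absent (a sketch; the exact reproduction may differ):

```python
import jax
import jax.numpy as jnp


@jax.jit
def squared_sum(x):
    # Forces XLA to compile a GPU kernel, which requires ptxas/libdevice.
    return jnp.sum(x * x)


print(jax.devices())                  # e.g. [cuda(id=0)]
print(squared_sum(jnp.arange(8.0)))   # fails with a ptxas/libdevice error if cuda-nvcc is missing
```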
During the last discussion, @ngam had asked to check what version of cuda-nvcc existed. When I check this, it returns nothing, showing that there is no cuda-nvcc in the tensorflow/jax-based ml-notebook.
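One way to verify this from inside the container (a sketch; a `mamba list cuda-nvcc` query would serve the same purpose):

```python
import shutil

# Check whether the CUDA compiler toolchain is visible on PATH.
# Both typically return None in an image that ships without cuda-nvcc.
for tool in ("nvcc", "ptxas"):
    print(tool, "->", shutil.which(tool))
```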
Installing cuda-nvcc by using
mamba install cuda-nvcc==11.6.* -c nvidia
solves the problem. However, it would be good if the user did not have to do this installation manually, and the docker image was properly set up.