Revert "Bump CUDA version to 12.2.0-devel-ubuntu22.04" #176

Closed

terrykong
Contributor

Reverts #161

Proposing to revert the CUDA 12.2 bump. Images built from the CUDA 12.2.0 base image are not functioning, and all CI tests have been failing for a few days now. I have also seen the failure on my workstation and on selene. It might be better to revert until we know what's going on. I have attached an image to the issue below to help debug in parallel.

Related: #165

Collaborator

@yhtang yhtang left a comment

Do you have any information regarding why the failure occurs? I am trying to dig a little deeper.

@terrykong
Contributor Author

@nouiz mentioned something about compat drivers. I see that forward compat drivers are installed:

docker run --rm ghcr.io/nvidia/t5x:nightly-2023-08-20 dpkg -l | grep cuda-compat
ii  cuda-compat-12-2                535.54.03-1                             amd64        CUDA Compatibility Platform

so that checks that box.

I get an error once we add the NVIDIA container runtime:

# ok
docker run -it --rm ghcr.io/nvidia/t5x:nightly-2023-08-20 bash
# errors
docker run --gpus=all -it --rm ghcr.io/nvidia/t5x:nightly-2023-08-20 bash
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.2, please update your driver to a newer version, or use an earlier cuda container: unknown.

So maybe the nvidia container plugin is where we start looking?
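
The error above is raised by nvidia-container-cli when a `cuda>=12.2` condition is not met by the host driver, so one way to narrow this down is to compare the image's constraint with the CUDA version the host driver reports. The following is a minimal sketch, not the toolkit's own logic: the helper names are hypothetical, and `sort -V` assumes GNU coreutils.

```shell
# Sketch only: helper names are hypothetical, and `sort -V` assumes GNU coreutils.

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Prints the minimum CUDA version found in a constraint string like "cuda>=12.2 ...".
required_cuda() {
  printf '%s\n' "$1" | sed -n 's/.*cuda>=\([0-9][0-9.]*\).*/\1/p'
}

# Reports whether a host that supports CUDA version $2 satisfies constraint $1.
check_requirement() {
  need=$(required_cuda "$1")
  if [ -z "$need" ]; then
    echo "no cuda>= constraint found"
  elif version_ge "$2" "$need"; then
    echo "ok: host CUDA $2 satisfies cuda>=$need"
  else
    echo "mismatch: host CUDA $2 is older than required $need"
  fi
}
```

To try it, the constraint string could be read from the image with `printenv NVIDIA_REQUIRE_CUDA` (assuming the image sets that variable, as the nvidia/cuda base images typically do), and the host value taken from the `CUDA Version` field in the `nvidia-smi` header, for example `check_requirement "cuda>=12.2" 12.1`.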

@yhtang yhtang closed this in #182 Aug 25, 2023
yhtang added a commit that referenced this pull request Aug 25, 2023
…2.1.1 (#182)

This PR addresses #176 and #159 in a single shot.

The default CUDA version will be encoded only in Dockerfile.base. A default value of latest will be passed between the various workflows and eventually resolved into the BASE_IMAGE build arg by the _build_base.yaml workflow.

Closes #176
Closes #159
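
One way the `latest` placeholder described in the commit message could be resolved is sketched below. This is an assumption about the mechanism, not the actual contents of _build_base.yaml: the function name `base_image_args` is hypothetical, and the sketch assumes Dockerfile.base declares `BASE_IMAGE` as an `ARG` with a default, so that omitting the build arg falls back to that default.

```shell
# Hypothetical helper, not taken from _build_base.yaml.
# Prints the docker build flag for an explicit base image, or nothing for
# "latest" so that the ARG default in Dockerfile.base takes effect.
base_image_args() {
  if [ "${1:-latest}" = "latest" ]; then
    :  # emit nothing; the Dockerfile's own default is used
  else
    printf '%s\n' "--build-arg BASE_IMAGE=$1"
  fi
}
```

With this shape, a call such as `docker build $(base_image_args "$requested") -f Dockerfile.base .` (where `$requested` is a hypothetical workflow input) keeps the concrete CUDA version in a single place, which matches the stated goal of encoding the default only in Dockerfile.base.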