Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug when using just-in-time CDI spec generation #762

Merged
merged 1 commit into from
Oct 30, 2024

Conversation

elezar
Copy link
Member

@elezar elezar commented Oct 30, 2024

This change fixes a bug when using just-in-time CDI spec generation for the NVIDIA Container Runtime for specific devices (i.e. not 'all').

Instead of unconditionally using the default nvsandboxutils library -- leading
to errors due to undefined symbols -- we check whether the library can be
properly initialised before continuing.

The functionality was added in #629

Before this change:

$ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=runtime.nvidia.com/gpu=0 ubuntu nvidia-smi -L
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: /home/elezar/src/container-toolkit/nvidia-container-runtime did not terminate successfully: exit status 127: /home/elezar/src/container-toolkit/nvidia-container-runtime: symbol lookup error: /home/elezar/src/container-toolkit/nvidia-container-runtime: undefined symbol: nvSandboxUtilsGetGpuResource
: unknown.

With this change:

$ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=runtime.nvidia.com/gpu=0 ubuntu nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB-N (UUID: GPU-edfee158-11c1-52b8-0517-92f30e7fac88)

}
}()

if l.nvsandboxutilslib != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the time being, I have duplicated the code for GetAllDeviceSpecs -- which is exercised more frequently -- and will clean the up more in a post-release follow-up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth to create a func now that we have the same code in 2 funcs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether a functon would work since we need to defer the calls to Shutdown(). It is something we could look into, but I don't want to make that change this close to the release.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, something to workshop for the next release. The patch does as described.

This change fixes a bug when using just-in-time CDI spec generation for the
NVIDIA Container Runtime for specific devices (i.e. not 'all').
Instead of unconditionally using the default nvsandboxutils library -- leading
to errors due to undefined symbols -- we check whether the library can be
properly initialised before continuing.

Signed-off-by: Evan Lezar <[email protected]>
@elezar elezar force-pushed the fix-auto-cdi-runtime-mode branch from e6059d2 to 75376d3 Compare October 30, 2024 11:20
}
}()

if l.nvsandboxutilslib != nil {
if r := l.nvsandboxutilslib.Init(l.driverRoot); r != nvsandboxutils.SUCCESS {
l.logger.Warningf("Failed to init nvsandboxutils: %v; ignoring", r)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning log on line 121 is lowercase and here is upper cased, is there a reason for the diff?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No specific reason. I will update these for consistency once we refactor.

Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My questions are non blockers, looks good

@ArangoGutierrez
Copy link
Collaborator

Tested

$ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=runtime.nvidia.com/gpu=0 ubuntu nvidia-smi -L
GPU 0: NVIDIA GH200 96GB HBM3 (UUID: GPU-0abb8bdb-b6c4-73fc-2b1d-a52db1c353e2)

patch restores the previous (1.16.2) behavior

}
}()

if l.nvsandboxutilslib != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, something to workshop for the next release. The patch does as described.

@elezar elezar merged commit efb18a7 into NVIDIA:main Oct 30, 2024
10 checks passed
@elezar elezar deleted the fix-auto-cdi-runtime-mode branch October 30, 2024 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants