-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug when using just-in-time CDI spec generation #762
Conversation
} | ||
}() | ||
|
||
if l.nvsandboxutilslib != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the time being, I have duplicated the code for GetAllDeviceSpecs
-- which is exercised more frequently -- and will clean the up more in a post-release follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth to create a func now that we have the same code in 2 funcs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether a functon would work since we need to defer the calls to Shutdown()
. It is something we could look into, but I don't want to make that change this close to the release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, something to workshop for the next release. The patch does as described.
This change fixes a bug when using just-in-time CDI spec generation for the NVIDIA Container Runtime for specific devices (i.e. not 'all'). Instead of unconditionally using the default nvsandboxutils library -- leading to errors due to undefined symbols -- we check whether the library can be properly initialised before continuing. Signed-off-by: Evan Lezar <[email protected]>
e6059d2
to
75376d3
Compare
} | ||
}() | ||
|
||
if l.nvsandboxutilslib != nil { | ||
if r := l.nvsandboxutilslib.Init(l.driverRoot); r != nvsandboxutils.SUCCESS { | ||
l.logger.Warningf("Failed to init nvsandboxutils: %v; ignoring", r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warning log on line 121 is lowercase and here is upper cased, is there a reason for the diff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No specific reason. I will update these for consistency once we refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My questions are non blockers, looks good
Tested $ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=runtime.nvidia.com/gpu=0 ubuntu nvidia-smi -L
GPU 0: NVIDIA GH200 96GB HBM3 (UUID: GPU-0abb8bdb-b6c4-73fc-2b1d-a52db1c353e2) patch restores the previous (1.16.2) behavior |
} | ||
}() | ||
|
||
if l.nvsandboxutilslib != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, something to workshop for the next release. The patch does as described.
This change fixes a bug when using just-in-time CDI spec generation for the NVIDIA Container Runtime for specific devices (i.e. not 'all').
Instead of unconditionally using the default nvsandboxutils library -- leading
to errors due to undefined symbols -- we check whether the library can be
properly initialised before continuing.
The functionality was added in #629
Before this change:
With this change: