Generalize VGPU drivers installation #93
ensure_packages(['kernel-devel'], {ensure => 'installed'})
ensure_packages(['dkms'], {
  'require' => Yumrepo['epel']
})
There's something wrong with this because it is not catching this requirement:
Feb 12 13:08:53 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Deps/File[/etc/nvidia]/ensure) created
Feb 12 13:09:07 gpu-node3 yum[2001]: Installed: kernel-devel-3.10.0-1160.15.2.el7.x86_64
Feb 12 13:09:07 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Deps/Package[kernel-devel]/ensure) created
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) % Total % Received % Xferd Average
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Dload U
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: [244B blob data]
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Verifying archive integrity... OK
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Uncompressing NVIDIA Accelerated Graphics
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) ERROR: Unable to find the development too
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) ERROR: Installation has failed. Please s
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Welcome to the NVIDIA Software Installer
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Detected 2 CPUs online; setting concurren
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Tagging shared libraries with chcon -t te
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Installing NVIDIA driver version 450.89.
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) For some distributions, Nouveau can be di
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) One or more modprobe configuration files
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: 'curl -L https://hpsrepo.fz-juelich.de/jusuf/nvidia/NVIDIA-Driver.latest -o /tmp/NVIDIA-driver.run && sh /tmp/NVIDIA-driver.run
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) change from 'notrun' to ['0'] failed: 'cu
It is installing kernel-devel but not recognising that the dkms requirement is not met.
Later I see dkms being installed (which brings in gcc), so the second time around things will succeed.
Strategy shift: I have added an explicit requirement on the dkms and kernel-devel packages to the Exec[vgpu-driver-install-bin] resource and removed the deps class.
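For the record, a minimal sketch of what that ordering could look like. The class name, the `$installer_cmd` parameter and the `creates` guard are assumptions for illustration and are not taken from the actual change; `ensure_packages` comes from puppetlabs-stdlib, and `Yumrepo['epel']` is assumed to be declared elsewhere in the catalog.

```puppet
# Hypothetical sketch: class name, parameter and guard path are assumptions.
class profile::gpu::install::vgpu::bin_sketch (
  String $installer_cmd = '/bin/sh /tmp/NVIDIA-driver.run',
) {
  # Build dependencies for the NVIDIA installer.
  ensure_packages(['kernel-devel'], { 'ensure' => 'installed' })
  ensure_packages(['dkms'], {
    'ensure'  => 'installed',
    'require' => Yumrepo['epel'],
  })

  exec { 'vgpu-driver-install-bin':
    command => $installer_cmd,
    path    => ['/usr/bin', '/bin'],
    # Assumed idempotency guard: skip the install once nvidia-smi exists.
    creates => '/usr/bin/nvidia-smi',
    # The explicit ordering: both packages are applied before the installer runs.
    require => [
      Package['kernel-devel'],
      Package['dkms'],
    ],
  }
}
```

With the `require` on the Exec itself, the ordering no longer depends on a separate deps class being evaluated first.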
Confirmed, that works, the drivers get installed at the right time.
There is one last remaining issue (for me). I see the symlinks being created for the latest nvidia driver (not the one I installed). This happens long before the driver is installed. It looks like this is expected, as after the driver install, the
so most of those links are removed, but not created for the installed driver (450.89).
I'll see what it is like in 30 minutes when puppet runs again.
Yes, these got replaced on the next run, with the caveat that there are now two broken links in that directory:
The symlink issue is a tough one. The driver version is determined by a Puppet fact. Facts are resolved before the Puppet run. On the first puppet run, The following Puppet run, Ideally, we would be able to know on the first Puppet run what the vGPU driver version is, or we create the symlink with an exec script instead of a Puppet
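A rough sketch of the exec-based alternative, in case it helps the discussion. Everything specific here is an assumption: the resource title, the link path `/opt/example/libcuda.so`, the library location under `/usr/lib64`, and the use of `modinfo -F version nvidia` to read the installed version. The point is only that the version is looked up when the command runs, after the driver install, instead of coming from a fact resolved before the run.

```puppet
# Hypothetical sketch: link path, library path and version query are assumptions.
exec { 'vgpu-link-installed-driver':
  # Single quotes keep Puppet from interpolating ${v}; the shell expands it.
  command  => 'v=$(modinfo -F version nvidia) && ln -sfn "/usr/lib64/libcuda.so.${v}" /opt/example/libcuda.so',
  # Do nothing when the link already points at the installed version.
  unless   => 'v=$(modinfo -F version nvidia) && [ "$(readlink /opt/example/libcuda.so)" = "/usr/lib64/libcuda.so.${v}" ]',
  # The shell provider is needed for $(...) and && in the commands.
  provider => shell,
  path     => ['/usr/sbin', '/usr/bin', '/sbin', '/bin'],
  # Run only after the driver installer has been applied.
  require  => Exec['vgpu-driver-install-bin'],
}
```

The trade-off is that Puppet no longer manages the link as a file resource, so a stale link would only be corrected by the `unless` check on a later run.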
Something did remove the previous symlinks; I suspect that may have been the NVIDIA driver installation, which is perhaps why that single broken link
Just to record this somewhere, vGPUs have some limitations when it comes to CUDA (correct as of 11.2, see NVIDIA docs):
These features are only available in pass-through mode.
Avoid creation of broken symlink when dealing with VGPU drivers.
Need MC branch keystone