Generalize VGPU drivers installation #93

Merged
merged 19 commits into from
Feb 24, 2021

Conversation

cmd-ntrf
Member

Need MC branch keystone

@cmd-ntrf cmd-ntrf added the enhancement New feature or request label Feb 10, 2021
@cmd-ntrf cmd-ntrf self-assigned this Feb 10, 2021
ensure_packages(['kernel-devel'], {ensure => 'installed'})
ensure_packages(['dkms'], {
'require' => Yumrepo['epel']
})
Contributor

There's something wrong with this because it is not catching this requirement:

Feb 12 13:08:53 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Deps/File[/etc/nvidia]/ensure) created
Feb 12 13:09:07 gpu-node3 yum[2001]: Installed: kernel-devel-3.10.0-1160.15.2.el7.x86_64
Feb 12 13:09:07 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Deps/Package[kernel-devel]/ensure) created
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns)   % Total    % Received % Xferd  Average 
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns)                                  Dload  U
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: [244B blob data]
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Verifying archive integrity... OK
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Uncompressing NVIDIA Accelerated Graphics
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) ERROR: Unable to find the development too
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) ERROR: Installation has failed.  Please s
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Welcome to the NVIDIA Software Installer 
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Detected 2 CPUs online; setting concurren
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Tagging shared libraries with chcon -t te
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) Installing NVIDIA driver version 450.89.
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) For some distributions, Nouveau can be di
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) One or more modprobe configuration files 
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: 'curl -L https://hpsrepo.fz-juelich.de/jusuf/nvidia/NVIDIA-Driver.latest -o /tmp/NVIDIA-driver.run && sh /tmp/NVIDIA-driver.run
Feb 12 13:09:31 gpu-node3 puppet-agent[1084]: (/Stage[main]/Profile::Gpu::Install::Vgpu::Bin/Exec[vgpu-driver-install-bin]/returns) change from 'notrun' to ['0'] failed: 'cu

It is installing kernel-devel but not recognising that the dkms requirement is not met.

Contributor

Later I see dkms being installed (which brings in gcc), so the second time around things will succeed.

Member Author

Strategy shift: I have added an explicit requirement on the dkms and kernel-devel packages to the Exec[vgpu-driver-install-bin] resource and removed the deps class.

Contributor

Confirmed, that works, the drivers get installed at the right time.

@ocaisa
Contributor

ocaisa commented Feb 12, 2021

There is one last remaining issue (for me). I see the symlinks being created for the latest NVIDIA driver (not the one I installed), and this happens long before the driver is installed. It looks like this is expected, because after the driver install /usr/lib64/nvidia looks like:

[centos@gpu-node1 ~]$ ls /usr/lib64/nvidia/
gridd  libGLX_indirect.so.0  libnvidia-fatbinaryloader.so.460.32.03

so most of those links are removed, but not created for the installed driver (450.89).

@ocaisa
Contributor

ocaisa commented Feb 12, 2021

I'll see what it is like in 30 minutes when puppet runs again

@ocaisa
Contributor

ocaisa commented Feb 12, 2021

Yes, these got replaced on the next run, with the caveat that there are now two broken links in that directory:

lrwxrwxrwx. 1 root root  46 Feb 12 15:19 libnvidia-fatbinaryloader.so.450.89 -> /usr/lib64/libnvidia-fatbinaryloader.so.450.89
lrwxrwxrwx. 1 root root  49 Feb 12 14:51 libnvidia-fatbinaryloader.so.460.32.03 -> /usr/lib64/libnvidia-fatbinaryloader.so.460.32.03
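
For anyone wanting to check a node for leftovers like these, here is a minimal sketch of one way to list dangling symlinks. The helper name `list_broken_links` is made up for illustration and is not part of this PR.

```shell
#!/bin/sh
# List dangling symlinks under a directory.
# With -L, find follows symlinks, so anything still reported as type "l"
# is a link whose target does not exist.
# Note: -L also descends into directories reached through symlinks.
list_broken_links() {
    find -L "$1" -type l
}
```

Calling `list_broken_links /usr/lib64/nvidia` should print one path per broken link and nothing when every link resolves.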

@cmd-ntrf
Member Author

The symlink issue is a tough one.

The driver version is determined by a Puppet fact, and facts are resolved before the Puppet run. On the first Puppet run, nvidia_driver_vers.sh is executed first. Since the drivers are not installed yet, the driver version cannot be determined from dkms or nvidia-smi, so the script defaults to fetching the driver version from the NVIDIA website.

On the following Puppet run, nvidia_driver_vers.sh can use either nvidia-smi or dkms and returns a different version number. By then the symlinks for the other driver version have already been created, and removing those previously created symlinks is complicated.
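
To make the ordering concrete, here is a rough sketch of the fallback chain described above. It is not the actual nvidia_driver_vers.sh: the function names are hypothetical, the exact queries the real script runs are an assumption, and the NVIDIA website lookup is left as a placeholder because its URL and parsing are not shown in this thread.

```shell
#!/bin/sh
# Sketch only: mirrors the fallback order described above,
# not the real nvidia_driver_vers.sh.

# Extract the driver version from `dkms status` output read on stdin.
# Handles both "nvidia, 450.89, ..." and "nvidia/450.89, ..." line formats.
parse_dkms_version() {
    sed -n 's|^nvidia[,/] *\([0-9.][0-9.]*\).*|\1|p' | head -n 1
}

# Hypothetical placeholder for the website lookup used when no driver
# is installed yet. It prints nothing and fails in this sketch.
fetch_latest_version() {
    return 1
}

# Print the first version found: running driver, then dkms, then the web.
detect_driver_version() {
    version=""
    if command -v nvidia-smi >/dev/null 2>&1; then
        version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
    fi
    if [ -z "$version" ] && command -v dkms >/dev/null 2>&1; then
        version=$(dkms status 2>/dev/null | parse_dkms_version)
    fi
    if [ -z "$version" ]; then
        version=$(fetch_latest_version) || version=""
    fi
    # Fail (non-zero exit) when no source produced a version.
    [ -n "$version" ] && printf '%s\n' "$version"
}
```

On a fresh node neither nvidia-smi nor dkms exists, so only the last branch can answer, which is why the first run can report a version that differs from the vGPU driver installed afterwards.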

Ideally, we would know the vGPU driver version on the first Puppet run, or we would create the symlinks with an exec script instead of a Puppet file resource.
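
A minimal sketch of what such an exec script could look like, assuming it runs after the driver install so the version is known. The function name `refresh_nvidia_links` and its three arguments are invented for illustration; this is not the change that was merged.

```shell
#!/bin/sh
# Sketch only: rebuild version-specific symlinks at exec time.
# Usage: refresh_nvidia_links SRC_DIR LINK_DIR VERSION
refresh_nvidia_links() {
    src_dir=$1
    link_dir=$2
    version=$3

    mkdir -p "$link_dir"

    # Drop links whose target no longer exists, e.g. links created
    # for a driver version that was never installed.
    for link in "$link_dir"/*; do
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            rm -f "$link"
        fi
    done

    # Link only libraries that actually exist for the given version,
    # so no dangling link is created.
    for lib in "$src_dir"/*.so."$version"; do
        [ -e "$lib" ] || continue
        ln -sf "$lib" "$link_dir/$(basename "$lib")"
    done
}
```

For example, `refresh_nvidia_links /usr/lib64 /usr/lib64/nvidia 450.89` would clean up stale links and then link whatever `*.so.450.89` files are present. Because the links are derived from the files on disk rather than from a fact resolved before the run, a version mismatch on the first run cannot leave broken links behind.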

@ocaisa
Contributor

ocaisa commented Feb 12, 2021

Something did remove the previous symlinks. I suspect that was the NVIDIA driver installation, which is perhaps why that single broken link libnvidia-fatbinaryloader.so.460.32.03 -> /usr/lib64/libnvidia-fatbinaryloader.so.460.32.03 was left over (since its target doesn't actually exist).

@ocaisa
Contributor

ocaisa commented Feb 16, 2021

Just to record this somewhere, vGPUs have some limitations when it comes to CUDA (correct as of 11.2, see NVIDIA docs):

NVIDIA vGPU does not support the following NVIDIA CUDA Toolkit features:

- Unified Memory
- Development tools such as IDEs, debuggers, profilers, and utilities as listed under CUDA Toolkit Major Components in NVIDIA CUDA Toolkit Release Notes for CUDA 11.2
- Tracing and profiling through the CUDA Profiling Tools Interface (CUPTI)

These features are only available in pass-through mode.

Avoid creation of broken symlink when dealing with VGPU drivers.
@cmd-ntrf cmd-ntrf merged commit fb3c903 into master Feb 24, 2021
@cmd-ntrf cmd-ntrf deleted the dev/vgpu2 branch February 24, 2021 16:08