FR: Remove requirement for libraries to be installed, just to build the application #65

Open
pallaswept opened this issue Oct 24, 2023 · 52 comments

Comments

@pallaswept

I understand why libraries, specifically I'm talking about the nvidia-ml.so lib, are required to run tuxclocker, but why are they needed just to build it? Why aren't the headers enough? I can see that meson is invoking the --no-undefined switch, so it's not going to work with any headers alone but needs the whole library, but WHY? Can this be changed? It's kinda the whole point of a shared library that you can build against the header and not need the actual library in your app.

I ask this because, we have a serious problem where we're unable to package tuxclocker, because it requires a source of the nvidia packages, and those aren't licensed to any public build services, only from nvidia themselves. This means no RPM-based distro can build tuxclocker the way it is now (because RPMs are built with no network connection, so you can't just go download the libs and install them, you have to have them packaged, and because of licensing, we can't). So Fedora and SUSE are out, at the very least. It would be cool to fix that.

@Lurkki14
Owner

We could probably add a meson option that allows to build with only headers present, and call https://mesonbuild.com/Reference-manual_returned_compiler.html#compilerhas_header when it's enabled.
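
A rough sketch of what such an option could look like. The option name `nvidia_headers_only` and the variable `build_nvidia_plugin` are made up for illustration; `cc` and `all_nvidia_linux_libs` are assumed to be defined as in the snippets later in this thread. The option only decides whether the plugin target is attempted; it does not by itself change how the plugin is linked.

```meson
# meson_options.txt (hypothetical option)
option('nvidia_headers_only', type : 'boolean', value : false,
	description : 'Attempt the NVIDIA plugin when only its headers are present')

# src/plugins/meson.build
# Check each header the plugin includes; any missing header disables the plugin
nvidia_headers_found = true
foreach header : ['nvml.h', 'NVCtrl/NVCtrlLib.h', 'NVCtrl/NVCtrl.h']
	if not cc.has_header(header)
		nvidia_headers_found = false
	endif
endforeach

if get_option('nvidia_headers_only')
	# Headers alone are enough to attempt the build
	build_nvidia_plugin = nvidia_headers_found
else
	# Default behaviour: headers and libraries must both be present
	build_nvidia_plugin = nvidia_headers_found and all_nvidia_linux_libs
endif
```

The plugin target would then be wrapped in `if build_nvidia_plugin` instead of the existing check.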

@pallaswept
Author

BTW this comes back to our discussion the other day, about the nvidia plugin missing in my install, and I said I would let the other packager know, but he'd have to build the nvidia libs, and he was told he would be banned if he did: https://build.opensuse.org/request/show/1119517?notification_id=43795733#comment-1838966

It's absolutely stupid (and I'm not even sure it's correct) that I can build it in VM 'X' and it's totally legit, but if I built it in VM 'Y', then it's illegal and we will be punished, but that's what we've been told :(

@tujhen
Contributor

tujhen commented Oct 25, 2023

he would be banned if he did

Maybe that's too harsh an expression =) But yes, you need to compile the NVIDIA libs locally. If you compile it in OBS, you can lose the repo, or the repo along with your account. Anyway, I may be wrong.

And it's not only NVIDIA packages. As far as I know (and as the rules say), repacking any proprietary tool without building it in OBS is bad. But with permission maybe you can.

But anyway, as I said, Stefan asked me not to enable the build and not to add .run files in OBS. I would ask ahjolinna how he/she builds NVIDIA drivers in OBS. It's very interesting.

I ask this because, we have a serious problem where we're unable to package tuxclocker

Not entirely true, we can. But with a condition. My tuxclocker build for openSUSE does not conflict with any license.

The problem may be one of these:

  1. Maybe openSUSE, for example, has a different NVML setup, or it's an NVML problem. In my opinion, still, the proprietary NVIDIA packages are needed for install, not for build
  2. Maybe tuxclocker needs the proprietary devel headers for the build, and if so, we can't fully build it, because of the license.

@pallaswept
Author

If you compile it in OBS, you can lose the repo, or the repo along with your account.

I believe you bro. I 'get it', but I just think it's stupid. Nobody wins out of this scenario; not the end users, not the application devs, not the packagers, not openSUSE, not Nvidia; it's a lose-lose-lose-lose situation, it's just dumb that some people let it get to this stage. openSUSE and nvidia (and other distros, this definitely doesn't just affect openSUSE, probably PPAs would be illegal also, etc) all need to get their shit together and fix problems like this before they happen. But still, here we are, and I have removed my nvidia driver build (so thank you for warning me in case I got a ban for having built it there. I had no idea it was not allowed. You might have saved my ass from a hidden trap! So THANK YOU BRO)

I ask this because, we have a serious problem where we're unable to package tuxclocker

Not entirely true, we can. But with a condition. My tuxclocker build for openSUSE does not conflict with any license.

Yes you're quite right, that was not entirely true, I was being too brief, I'm sorry. We CAN build tuxclocker..... just not the nvidia plugin for tuxclocker.... But unless you have an AMD GPU, that's kinda the most important part of it 😢 It is like, we can build the car. Just no motor or wheels 😆 And while we at OBS and openSUSE are first to discover this problem, I am quite certain that it is going to be a problem for practically every distro. At least, in the legal sense, perhaps other distros won't be so strict about it, but still, it's a thing for everyone, I think, except for MAYBE the build-from-source distros like gentoo and maaaayyybe arch maybe.

Anyway now we have to find a better long term solution.

FWIW, I run a private OBS instance where I can build the nvidia packages (it's legal for me, I don't distribute them, they are not publicly available online) and also tuxclocker, and so in the meantime I would be very happy to provide openSUSE packages to this project, for download by users. I would not be distributing any nvidia binaries, so there's no legal problem there. Or I could just provide the nvidia plugin pre-compiled, which could then be added to your build from OBS (you could use a postinstall script so that it gets it from this repo automatically from the end-user's side, or you could use it as a source for your project and build without it and then just copy it into place). But I understand neither of these kind of things are optimal for many obvious reasons.

The good news is that the nvml.h header we need, is not licensed at all, and is definitely open source, so it should be OK for OBS builds.

In my opinion, still, the proprietary NVIDIA packages are needed for install, not for build

I agree, I think this is really the best answer. It's mostly just a matter of configuring meson correctly. Unfortunately I never used meson in my professional work, just old-fashioned configure && make or just calling cc in sh scripts (I am old, I moved from developer to manager/consultant, more than 10 years before meson even existed), so I don't know how to fix this. I feel confident that @Lurkki14 will find a way 😃

@pallaswept
Author

he would be banned if he did

And one more time I want to take the opportunity to say a big big thank you to you, for warning me!

@tujhen
Contributor

tujhen commented Oct 25, 2023

I don't know, though, how AUR users build the NVIDIA drivers or how that squares with the licenses etc. - https://aur.archlinux.org/packages/tuxclocker. Anyway, NVIDIA is an optdepends in the PKGBUILD.

nvml.h is provided by the nvidia-settings tool: https://github.com/NVIDIA/nvidia-settings/blob/main/src/nvml.h

But then it should work on X11, and I don't know whether it works or not. The Wayland problem is a different problem.

@pallaswept
Author

nvml.h is provided by the nvidia-settings tool: https://github.com/NVIDIA/nvidia-settings/blob/main/src/nvml.h

Right but this is the problem, the way tuxclocker is configured, even though it has those headers, and has the declarations, if the functions are declared but not defined, it generates an error. Normally an app should be able to link against the lib header without the lib or the lib's source being present in order to define the functions, and I think this is what tuxclocker should be doing.

@Lurkki14
Owner

You could do something like this in src/plugins/meson.build

if cc.has_header('nvml.h') and cc.has_header('NVCtrl/NVCtrlLib.h') and cc.has_header('NVCtrl/NVCtrl.h')
	shared_library('nvidia', 'Nvidia.cpp', 'Utils.cpp',
		override_options : ['cpp_std=c++17'],
		include_directories : [incdir, patterns_inc, fplus_inc],
		dependencies : [nvidia_linux_libs, boost_dep],
		install_dir : get_option('libdir') / 'tuxclocker' / 'plugins',
		install : true,
		link_with : libtuxclocker)
endif

and have that patch in the package repo.

You could also make the NVIDIA plugin a separate package, that the end user can build locally.

@pallaswept
Author

pallaswept commented Oct 25, 2023

You could do something like this in src/plugins/meson.build

if cc.has_header('nvml.h') and cc.has_header('NVCtrl/NVCtrlLib.h') and cc.has_header('NVCtrl/NVCtrl.h')
	shared_library('nvidia', 'Nvidia.cpp', 'Utils.cpp',
		override_options : ['cpp_std=c++17'],
		include_directories : [incdir, patterns_inc, fplus_inc],
		dependencies : [nvidia_linux_libs, boost_dep],
		install_dir : get_option('libdir') / 'tuxclocker' / 'plugins',
		install : true,
		link_with : libtuxclocker)
endif

Doesn't work, whenever the code references one of the functions declared in the nvml.h header, it generates an error that the function is not defined. This is because meson is calling gcc and passing the --no-undefined parameter.....I have no idea why it's doing that, and if you could stop it, it should just build as it is, with the header in place, and without the library binary itself

You could also make the NVIDIA plugin a separate package, that the end user can build locally.

Yeh but that really shouldn't be necessary. An app which wants to use the functions declared in nvml.h should be able to link against that header and then the code will generate pointers to those functions in the library, even without the library being present, and then when the app runs, it will go and run the code for those offsets in the library generated by the linker with the header, and call the functions.

@tujhen
Contributor

tujhen commented Oct 25, 2023

@pallaswept I declined your request in OBS, so if you can find a solution (or create a patch for meson.build) that doesn't conflict with the rules, send another. Or if it gets fixed on git, I'll wait for the new version.

@Lurkki14
Owner

Doesn't work, whenever the code references one of the functions declared in the nvml.h header, it generates an error that the function is not defined.

I guess you could remove the check entirely and see what happens? If not, meson might have a way somewhere to unset the --no-undefined flag.

@pallaswept
Author

Doesn't work, whenever the code references one of the functions declared in the nvml.h header, it generates an error that the function is not defined.

I guess you could remove the check entirely and see what happens? If not, meson might have a way somewhere to unset the --no-undefined flag.

Tried that, same result. It's the meson config that needs fixing, so it doesn't set that flag, but I wouldn't know meson from the dark side of the moon. I wish I could help you with that part but I can't.

BTW I actually stopped listening to opensuse saying nvidia didn't allow it, and read the license, and it's opensuse that won't allow it to be built on their server, so it might not be a problem with other distros.

@Lurkki14
Owner

mesonbuild/meson#9777

@pallaswept
Author

Nvidia license:

2.1 Rights and Limitations of Grant. NVIDIA hereby grants Customer a non-exclusive, non-transferable license to install and use the SOFTWARE for use with NVIDIA GeForce or Titan branded hardware products owned by Customer, subject to the following:

2.1.1 Rights. Customer may install and use multiple copies of the SOFTWARE on a shared computer or concurrently on different computers, and make multiple back-up copies of the SOFTWARE, solely for Customer's use within Customer's Enterprise. "Enterprise" shall mean individual use by Customer or any legal entity (such as a corporation or university) and the subsidiaries it owns by more than fifty percent (50%).

2.1.2 Linux/FreeBSD Exception. Notwithstanding the foregoing terms of Section 2.1.1, SOFTWARE designed exclusively for use on the Linux or FreeBSD operating systems, or other operating systems derived from the source code to these operating systems, may be copied and redistributed, provided that the binary files thereof are not modified in any way (except for unzipping of compressed files).

So we're good there.

OBS CoC (you always know things are going to be stupid when there's a CoC) and in particular the part about what you can't build there: https://openSUSE:Build_Service_application_blacklist

In general, only software with an OSI license is allowed for submission. Exceptions can be made via the openSUSE:*:Non-Free projects on request.

Then there is a big list of blacklisted software and nvidia isn't on it, but because it's not OSI-certified, we'd need an exception from openSUSE.... Which they're never going to give out, they'll just say that this software should be able to build from the headers so we don't need an exception.

@tujhen
Contributor

tujhen commented Oct 25, 2023

BTW I actually stopped listening to opensuse saying nvidia didn't allow it, and read the license, and it's opensuse that won't allow it to be built on their server, so it might not be a problem with other distros.

Fedora, for example, doesn't seem to provide it by default; you need a third-party repo (RPM Fusion). In Debian it's in the non-free repo. It seems this is not only an openSUSE problem.

@pallaswept
Author

mesonbuild/meson#9777

I already have this in my browser history but it's a very brief answer that means very little to me. I don't know where to type the things he said to type. I really don't know anything about meson.

@pallaswept
Author

pallaswept commented Oct 25, 2023

BTW I actually stopped listening to opensuse saying nvidia didn't allow it, and read the license, and it's opensuse that won't allow it to be built on their server, so it might not be a problem with other distros.

Fedora, for example, doesn't seem to provide it by default; you need a third-party repo (RPM Fusion). In Debian it's in the non-free repo. It seems this is not only an openSUSE problem.

Yeh, other distros might have similar FOSS-or-GTFO rules. It could go on opensuse's non-free repo, too, but instead they just host it on nvidia's site that nvidia kindly provided.

But my point is, this is nothing to do with nvidia's license, it's the distros that are the problems.

@Lurkki14
Owner

I don't know where to type the things he said to type.

if all_nvidia_linux_libs
	shared_module('nvidia', 'Nvidia.cpp', 'Utils.cpp',
		override_options : ['cpp_std=c++17', 'b_lundef=false'],
		include_directories : [incdir, patterns_inc, fplus_inc],
		dependencies : [nvidia_linux_libs, boost_dep],
		install_dir : get_option('libdir') / 'tuxclocker' / 'plugins',
		install : true,
		link_with : libtuxclocker)
endif

Having both of those would look like this.

@pallaswept
Author

I don't know where to type the things he said to type.

if all_nvidia_linux_libs
	shared_module('nvidia', 'Nvidia.cpp', 'Utils.cpp',
		override_options : ['cpp_std=c++17', 'b_lundef=false'],
		include_directories : [incdir, patterns_inc, fplus_inc],
		dependencies : [nvidia_linux_libs, boost_dep],
		install_dir : get_option('libdir') / 'tuxclocker' / 'plugins',
		install : true,
		link_with : libtuxclocker)
endif

Having both of those would look like this.

I'll fork and try it.

@pallaswept
Author

We also need to do


libxext = cc.find_library('Xext', required : false)
libx = cc.find_library('X11', required : false)
libxnvctrl = cc.find_library('XNVCtrl', required : false)

nvidia_linux_libs = [libx, libxext, libxnvctrl]

To not check for the presence of the .so

I got all excited for a moment there when it built and then I realised it hadn't tried to build the nvidia plugin hahahah
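
The plugin target in this thread is gated on `all_nvidia_linux_libs`. A minimal sketch of how such a flag might be derived from the `find_library` results above; the loop is an assumption about the surrounding meson.build, not a copy of it:

```meson
libxext = cc.find_library('Xext', required : false)
libx = cc.find_library('X11', required : false)
libxnvctrl = cc.find_library('XNVCtrl', required : false)

# libnvidia-ml is deliberately absent: it is not looked up at configure time
nvidia_linux_libs = [libx, libxext, libxnvctrl]

# Assumed helper logic: build the plugin only if every listed library was found
all_nvidia_linux_libs = true
foreach lib : nvidia_linux_libs
	if not lib.found()
		all_nvidia_linux_libs = false
	endif
endforeach
```

If one of the remaining libraries is missing, the flag stays false and the plugin is silently skipped, which would match a build that succeeds without ever attempting the nvidia plugin.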

@pallaswept
Author

It builds :)

I'll test it before I send a PR

@pallaswept
Author

It builds and packages the plugin but doesn't try to load it:


sudo tuxclockerd
found plugin at /usr/lib64/tuxclocker/plugins/libamd.so
found plugin at /usr/lib64/tuxclocker/plugins/libcpu.so
amdgpu_device_initialize: DRM version is 0.0.0 but this driver is only compatible with 3.x.x.
"/0bcaf18136c5c8285d39691ac6a5bcfb"
"/0bcaf18136c5c8285d39691ac6a5bcfb/019ed5e1c7908bd023cdfb2c92bd6742"
"/0bcaf18136c5c8285d39691ac6a5bcfb/019ed5e1c7908bd023cdfb2c92bd6742/966f3d67ced8cfd1dd8e307882eb469a"
"/0bcaf18136c5c8285d39691ac6a5bcfb/019ed5e1c7908bd023cdfb2c92bd6742/73cdcffedd29c9dc054c8e8039ab5921"
"/0bcaf18136c5c8285d39691ac6a5bcfb/019ed5e1c7908bd023cdfb2c92bd6742/7c8bdf5f645a649a683be04025850622"
"/0bcaf18136c5c8285d39691ac6a5bcfb/019ed5e1c7908bd023cdfb2c92bd6742/20376a0655daf5408f90bc661ff16145"

@tujhen
Contributor

tujhen commented Oct 25, 2023

@pallaswept did you use the same spec?

@Lurkki14
Owner

It builds and packages the plugin but doesn't try to load it:

Probably some symbols can't be found, check ldd libnvidia.so

@pallaswept
Author

 ~ ldd libnvidia.so
ldd: ./libnvidia.so: No such file or directory
 ~ ldd /usr/lib64/tuxclocker/plugins/libnvidia.so
	linux-vdso.so.1 (0x00007fff333ed000)
	libtuxclocker.so => /lib64/libtuxclocker.so (0x00007f22c5d9e000)
	libX11.so.6 => /lib64/libX11.so.6 (0x00007f22c5c58000)
	libXNVCtrl.so.0 => /lib64/libXNVCtrl.so.0 (0x00007f22c5c50000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f22c5a00000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f22c5919000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f22c58f4000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f22c5600000)
	libboost_filesystem.so.1.82.0 => /lib64/glibc-hwcaps/x86-64-v3/libboost_filesystem.so.1.82.0 (0x00007f22c58d3000)
	libcrypto.so.3 => /lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.3 (0x00007f22c5000000)
	libxcb.so.1 => /lib64/libxcb.so.1 (0x00007f22c58a7000)
	libXext.so.6 => /lib64/libXext.so.6 (0x00007f22c5892000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f22c5dec000)
	libz.so.1 => /lib64/glibc-hwcaps/x86-64-v3/libz.so.1.2.13 (0x00007f22c5878000)
	libXau.so.6 => /lib64/libXau.so.6 (0x00007f22c5873000)

@pallaswept
Author

Indeed, linux-vdso.so.1 does not exist on my system.

@tujhen
Contributor

tujhen commented Oct 25, 2023

Indeed, linux-vdso.so.1 does not exist on my system.

https://stackoverflow.com/questions/58657036/where-is-linux-vdso-so-1-present-on-the-file-system
TLDR: its normal

@pallaswept
Author

Indeed, linux-vdso.so.1 does not exist on my system.

https://stackoverflow.com/questions/58657036/where-is-linux-vdso-so-1-present-on-the-file-system TLDR: its normal

lol was just about to post that link 😆

@pallaswept
Author

So in that case, it finds everything.... but still, didn't find the plugin so didn't try to load it.

Interesting, I guess whatever made the nvml so, shared, broke this....

@Lurkki14
Owner

 ~ ldd libnvidia.so
ldd: ./libnvidia.so: No such file or directory
 ~ ldd /usr/lib64/tuxclocker/plugins/libnvidia.so
	linux-vdso.so.1 (0x00007fff333ed000)
	libtuxclocker.so => /lib64/libtuxclocker.so (0x00007f22c5d9e000)
	libX11.so.6 => /lib64/libX11.so.6 (0x00007f22c5c58000)
	libXNVCtrl.so.0 => /lib64/libXNVCtrl.so.0 (0x00007f22c5c50000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f22c5a00000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f22c5919000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f22c58f4000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f22c5600000)
	libboost_filesystem.so.1.82.0 => /lib64/glibc-hwcaps/x86-64-v3/libboost_filesystem.so.1.82.0 (0x00007f22c58d3000)
	libcrypto.so.3 => /lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.3 (0x00007f22c5000000)
	libxcb.so.1 => /lib64/libxcb.so.1 (0x00007f22c58a7000)
	libXext.so.6 => /lib64/libXext.so.6 (0x00007f22c5892000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f22c5dec000)
	libz.so.1 => /lib64/glibc-hwcaps/x86-64-v3/libz.so.1.2.13 (0x00007f22c5878000)
	libXau.so.6 => /lib64/libXau.so.6 (0x00007f22c5873000)

Seems that allowing undefined symbols makes the linker not add libnvml.so to DT_NEEDED. You can use patchelf --add-needed or see if you could get the linker to do that.

@pallaswept
Author

pallaswept commented Oct 25, 2023

You can use patchelf --add-needed or see if you could get the linker to do that.

Like I said mate I don't know anything about your build chain, all these tools are about 15-20 years after my time as a dev... Like, the last time I bought K&R was because C99 just came out. I wouldn't know where to start doing that.

I am sorry, I'm trying to help as much as I can, I'm 4 hours past my bedtime, I really am trying to help. I can maybe help with code, but when it comes to that buildchain, it's all a mystery to me

@Lurkki14
Owner

patchelf is just a program you'd call post-install

@pallaswept
Author

patchelf is just a program you'd call post-install

Yeh I did notice it's not installed by default so I'll have to make it a build dep but then I don't know the syntax of it, I don't know where to run it, what to run it on, etc.... And making sure that the libraries are properly linked that kinda strikes me as something the build chain should be doing anyway, no (like you said maybe you can get the linker to do that)?

@Lurkki14
Owner

And making sure that the libraries are properly linked that kinda strikes me as something the build chain should be doing anyway

That's what meson does when you have the library as a dependency
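
For comparison, a sketch of that usual case, where the library is present at build time. `libnvml` is a hypothetical variable name; `cc.find_library` is the same call used elsewhere in this thread. When the lookup succeeds, the linker resolves the NVML symbols against the .so and records its soname as a DT_NEEDED entry in the plugin; no code from the .so is copied into the plugin.

```meson
# Hypothetical: look up libnvidia-ml.so at configure time
libnvml = cc.find_library('nvidia-ml', required : false)

if libnvml.found()
	shared_module('nvidia', 'Nvidia.cpp', 'Utils.cpp',
		override_options : ['cpp_std=c++17'],
		include_directories : [incdir, patterns_inc, fplus_inc],
		# Passing libnvml here puts the library on the link line,
		# which is what makes it show up in ldd libnvidia.so
		dependencies : [nvidia_linux_libs, libnvml, boost_dep],
		install_dir : get_option('libdir') / 'tuxclocker' / 'plugins',
		install : true,
		link_with : libtuxclocker)
endif
```

This is why the build host needs the .so in this configuration: the dependency entry is written by the linker, and the linker only writes it for libraries it was actually given.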

@Lurkki14
Owner

  • Add a declared dependency on a dynamic library (DT_NEEDED):

    $ patchelf --add-needed libfoo.so.1 my-program

@pallaswept
Author

pallaswept commented Oct 25, 2023

so in our case is that

patchelf --add-needed /usr/lib64/tuxclocker/plugins/libnvidia.so libnvidia-ml.so ? I presume it knows the standard path for the libraries so I don't need a fully qualified path to the nvidia .so?

I'll give it a try once I have a clue what you're talking about 😆

But the point is, meson (or whatever build chain you're using) should be configured to do this when it builds the lib. libnvidia.so still has the nvidia-ml.so library as a dependency, a runtime dependency, and the libnvidia.so should know this already once it's built, since it's calling all those functions from the nvml.h header which refer to the defined functions compiled into nvidia-ml.so.

Screw it I'll try it and maybe you'll explain it to me like I don't understand it because I really don't.... but I had about the degree of success I expected when I'm guessing from a foo bar baz example of a tool I've never heard of before:

patchelf --add-needed /usr/lib64/tuxclocker/plugins/libnvidia.so /usr/lib64/libnvidia-ml.so
patchelf: open: Permission denied
 ~sudo killall tuxclockerd
tuxclockerd: no process found
 ~sudo patchelf --add-needed /usr/lib64/tuxclocker/plugins/libnvidia.so /usr/lib64/libnvidia-ml.so
 ~ldd /usr/lib64/tuxclocker/plugins/libnvidia.so
	linux-vdso.so.1 (0x00007fff148aa000)
	libtuxclocker.so => /lib64/libtuxclocker.so (0x00007fc1e84b2000)
	libX11.so.6 => /lib64/libX11.so.6 (0x00007fc1e836c000)
	libXNVCtrl.so.0 => /lib64/libXNVCtrl.so.0 (0x00007fc1e8364000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fc1e8000000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fc1e827b000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fc1e8256000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fc1e7c00000)
	libboost_filesystem.so.1.82.0 => /lib64/glibc-hwcaps/x86-64-v3/libboost_filesystem.so.1.82.0 (0x00007fc1e7fdf000)
	libcrypto.so.3 => /lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.3 (0x00007fc1e7600000)
	libxcb.so.1 => /lib64/libxcb.so.1 (0x00007fc1e7fb3000)
	libXext.so.6 => /lib64/libXext.so.6 (0x00007fc1e7f9e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fc1e8500000)
	libz.so.1 => /lib64/glibc-hwcaps/x86-64-v3/libz.so.1.2.13 (0x00007fc1e7f84000)
	libXau.so.6 => /lib64/libXau.so.6 (0x00007fc1e824f000)
 ~sudo tuxclockerd
found plugin at /usr/lib64/tuxclocker/plugins/libamd.so
found plugin at /usr/lib64/tuxclocker/plugins/libcpu.so
amdgpu_device_initialize: DRM version is 0.0.0 but this driver is only compatible with 3.x.x.

nope OK I keep guessing

sudo patchelf --add-needed /usr/lib64/tuxclocker/plugins/libnvidia.so /usr/bin/tuxclockerd
 ~sudo tuxclockerd
tuxclockerd: symbol lookup error: /usr/lib64/tuxclocker/plugins/libnvidia.so: undefined symbol: nvmlDeviceSetPowerManagementLimit

And that looks to me like it never linked that lib, which surprises me none.
I reinstall now since I know I just broke it, and try one more guess:

>sudo patchelf --add-needed /usr/lib64/libnvidia-ml.so /usr/bin/tuxclockerd
>sudo tuxclockerd
nvidia: Couldn't open X display!
found plugin at /usr/lib64/tuxclocker/plugins/libamd.so
found plugin at /usr/lib64/tuxclocker/plugins/libcpu.so
found plugin at /usr/lib64/tuxclocker/plugins/libnvidia.so
Segmentation fault

At least it tried to start the nvidia plugin this time but it instantly segfaulted.

It's 5 hours past bedtime for me and I have medical procedures I have to consider coming up so I can't really keep working on this I'm sorry, I'll wish you luck figuring it out while I am away.

@Lurkki14
Owner

tuxclockerd doesn't use libnvidia-ml.so, getting it to appear in ldd libnvidia.so should be enough.

@pallaswept
Author

As soon as I posted that I realised why it segfaulted - the first guess I took broke the nvidia lib. So I reinstalled it all and

sudo patchelf --add-needed /usr/lib64/libnvidia-ml.so /usr/bin/tuxclockerd
sudo tuxclockerd
found plugin at /usr/lib64/tuxclocker/plugins/libamd.so
found plugin at /usr/lib64/tuxclocker/plugins/libcpu.so
nvidia: Couldn't open X display!
found plugin at /usr/lib64/tuxclocker/plugins/libnvidia.so
amdgpu_device_initialize: DRM version is 0.0.0 but this driver is only compatible with 3.x.x.
"/0bcaf18136c5c8285d39691ac6a5bcfb"


So yeh, just gonna need to figure out how to get meson to do that at build time.

@pallaswept
Author

tuxclockerd doesn't use libnvidia-ml.so, getting it to appear in ldd libnvidia.so should be enough.

Yeh that's what I was trying to do the first time but you didn't exactly give me solid doco with foo.so and my-application-here and I got the arguments the wrong way around lol

@tujhen
Contributor

tujhen commented Oct 25, 2023

@pallaswept In summary, so it's theoretically possible to build with the nvidia plugin for a distro?

@pallaswept
Author

sudo patchelf --add-needed /usr/lib64/libnvidia-ml.so /usr/lib64/tuxclocker/plugins/libnvidia.so

Works too. So yeh, now to get meson to link it properly so this step isn't required.... And I'm not taking any more guesses at syntax for tools I've never seen before on this one.

I'm really not bullshitting you when I say I don't know what you're talking about with this build chain. Like I don't get it at all. This is not how I code or ever have. I have to leave meson up to you I'm sorry, I really would help with it if I could but I can't.

@pallaswept
Author

@pallaswept In summary, so it's theoretically possible to build with the nvidia plugin for a distro?

Yes of course, just have to get meson to link to the library properly. Before, it was using the header, but statically linking the lib, which is why it needed the lib there to link against. Now we have shown by building without the lib's .so there, and telling meson to not link it properly, that it can treat a shared lib as a shared lib, but I don't know how to make meson do the linking properly without post-modifying it to point at the lib like we just did. That's the last piece in the puzzle. It's really the only piece of the puzzle there ever was, we've just proven that it can be done, in theory. Making meson do it for real, in practice, that's the thing I can't help with.

@Lurkki14
Owner

Do RPM packages not allow you to have post install scripts? Ideally meson itself wouldn't mess with anything after installing the files.

@pallaswept
Author

So we have:

  • Meson will statically link the library when it is installed as a build prerequisite and is present
  • Meson will not link to the library at all but still build using its headers as a reference
  • And then you can manually add the reference afterwards

All we need is for it to dynamically link the file by using its headers as a reference but not actually statically build it into the plugin, instead calling out to the .so when it calls those functions.

Do RPM packages not allow you to have post install scripts? Ideally meson itself wouldn't mess with anything after installing the files.

RPM packages do have post install scripts but we shouldn't need to use this hack to make it work, the file should be correct before it's installed, after meson is done with it.

@pallaswept
Author

Think of it this way: I can build Gnome here on my KDE desktop, including all of Gnome's shared libraries, without needing to install Gnome as a prerequisite, and without having to install Gnome afterwards and then run commands to make it work, just using Gnome's sources. This should be the same. We have the source - we have the nvml.h, that should be enough to build it and link it.

Just to demonstrate that the nvidia-ml.so does not need to exist, to have the tuxclocker plugin link to it:

 ~sudo mv /usr/lib64/libnvidia-ml.so /usr/lib64/libnvidia-ml.so.gone

 Now there is no nvidia-ml.so present. The "build requirement" is not there.

 ~sudo patchelf --add-needed /usr/lib64/libnvidia-ml.so /usr/lib64/tuxclocker/plugins/libnvidia.so

 Now the tuxclocker nvidia plugin is correctly built (by modifying it with this hack)

 ~sudo mv /usr/lib64/libnvidia-ml.so.gone /usr/lib64/libnvidia-ml.so

 Now we put the nvidia so back, like we just installed it somewhere that has this library annnnd:

 ~sudo tuxclockerd
found plugin at /usr/lib64/tuxclocker/plugins/libamd.so
found plugin at /usr/lib64/tuxclocker/plugins/libcpu.so
nvidia: Couldn't open X display!
found plugin at /usr/lib64/tuxclocker/plugins/libnvidia.so

Works!

We don't need the nvidia-ml.so present, to have libnvidia.so link to it. We only need the header file at compile time, so that libnvidia.so has pointers to the code in that nvidia-ml.so, when it finds it.

@tujhen
Contributor

tujhen commented Oct 27, 2023

@pallaswept can you send a PR to my forked repo on GitHub with the edited meson files, so I can try to build an RPM with a patchelf script to test?

@pallaswept
Author

@pallaswept can you send a PR to my forked repo on GitHub with the edited meson files, so I can try to build an RPM with a patchelf script to test?

Bro I would love to help but I don't know the first thing about meson :( But you can build it just as it is here, with the two changes from above:
#65 (comment)
#65 (comment)

And then it will build the libnvidia.so without the nvidia-ml.so being present, and you can test after modifying the resulting libnvidia.so with patchelf, and it will work.....but that is not the way to resolve this issue, and you should not consider doing this in your spec file.

Even if your RPM includes patchelf as a runtime dependency so it can run it as a post-install script, that's a total hack. The way to fix this is to have the build toolchain (meson) build the lib and have the linker refer to nvidia-ml.so, then it has the same result as using patchelf, but without having to do that hack after the linker. It's obvious that meson has a way to link a library by finding it when it is installed, but what we need is to find the way to specify it to meson to link to it in the same way, without needing the file to be present during the build.
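
One possible way to get the linker to record the dependency without the real file, sketched here under several assumptions and not tried against this project: link the plugin against a throwaway stub library that has the same soname as the real one (`libnvidia-ml.so.1`). The stub target, the empty source file `nvml_stub.cpp`, and the per-target `b_asneeded=false` override are all assumptions introduced for this sketch; `b_lundef` and `b_asneeded` are Meson base options.

```meson
# Hypothetical stub: nvml_stub.cpp is an empty source file. The resulting
# library exists only in the build directory and is never installed.
# With soversion '1' its soname is libnvidia-ml.so.1, the same as the real one.
nvml_stub = shared_library('nvidia-ml', 'nvml_stub.cpp',
	soversion : '1',
	install : false)

if all_nvidia_linux_libs
	shared_module('nvidia', 'Nvidia.cpp', 'Utils.cpp',
		# b_lundef=false tolerates the NVML symbols the stub does not define;
		# b_asneeded=false stops the linker from dropping the stub's
		# DT_NEEDED entry because it provides no used symbols
		override_options : ['cpp_std=c++17', 'b_lundef=false', 'b_asneeded=false'],
		include_directories : [incdir, patterns_inc, fplus_inc],
		dependencies : [nvidia_linux_libs, boost_dep],
		install_dir : get_option('libdir') / 'tuxclocker' / 'plugins',
		install : true,
		link_with : [libtuxclocker, nvml_stub])
endif
```

If this behaves as intended, `ldd libnvidia.so` on the installed plugin should list `libnvidia-ml.so.1`, and the dynamic loader would pick up whichever real library the user's driver provides. Turning off `b_asneeded` also keeps DT_NEEDED entries for any other unused libraries on the link line, which is a side effect worth checking.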

@pallaswept
Author

You can use patchelf --add-needed or see if you could get the linker to do that.

Patchelf works, but it is the wrong way

you could get the linker to do that.

This is the right way.

@pallaswept
Author

As discussed in #70, the right way to deal with this is actually to use a wrapper library to call the nvidia library - and there's a very strong reason why, I've just discovered:

How can we know what version of the driver to build against? There are 3 different closed source drivers and 2 different open source drivers from nvidia, all 5 of which will provide this library, and if we build against one of them, it will ONLY work if the user happens to be using that same driver.

It's not a Rube Goldberg machine, it's a fix for a problem.
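
To illustrate the wrapper idea, here is a minimal sketch of run-time loading, written in Python with ctypes for brevity (ctypes.CDLL goes through dlopen; a C or C++ wrapper would use dlopen/dlsym directly). It is not tuxclocker's code. The candidate list and the helper names `load_first_available` and `has_symbol` are assumptions made up for this example; `libnvidia-ml.so.1` and `nvmlInit_v2` are the usual NVML soname and init symbol, but a real wrapper should check what each supported driver actually ships.

```python
import ctypes

# Hypothetical candidate names, for illustration only. A real wrapper would
# list whatever sonames the supported drivers actually install.
CANDIDATE_LIBS = ["libnvidia-ml.so.1", "libnvidia-ml.so"]


def load_first_available(candidates):
    """Return a handle to the first library that loads, or None."""
    for name in candidates:
        try:
            # The library is looked up at run time; nothing here needs it
            # to exist when the application is built or packaged.
            return ctypes.CDLL(name)
        except OSError:
            continue  # not installed under this name, try the next one
    return None


def has_symbol(lib, symbol):
    """Check whether the loaded library exports the given symbol."""
    return lib is not None and hasattr(lib, symbol)


if __name__ == "__main__":
    nvml = load_first_available(CANDIDATE_LIBS)
    if nvml is None:
        print("NVML not found; the nvidia plugin would be disabled")
    elif has_symbol(nvml, "nvmlInit_v2"):
        print("NVML loaded; nvmlInit_v2 is available")
    else:
        print("NVML loaded, but nvmlInit_v2 is missing")
```

Because the lookup happens on the user's machine, the same build can work with whichever driver is installed, and it can degrade gracefully when none is.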

@pallaswept
Author

pallaswept commented Dec 22, 2023

@tujhen I was thinking.... If you wanted to try, perhaps the packman project would be a good option for you to host the package? They don't have the same OSI-approved restrictions as the OpenSUSE OBS, so you could build the driver, and hence the nvidia module, if you built it there.

Although, then we still have the problem of mismatched drivers between the built module and the end-user's machine, so perhaps it's worth waiting for @Lurkki14 to implement the wrapper library to solve that problem. I know there's the workaround added to use flatpak, but that depends on flatpak, and not all systems will have that (eg mine doesn't - flatpaks use too much disk space so I removed it). But then, maybe we could retrieve the flatpak directly using http rather than flatpak install, using the OBS source services, and extract the .so from there, build against that, and then since we can't distribute it from opensuse, add a post-install script for the end-user to download and extract the same .so to the tuxclocker directories, and link to that, rather than the system-installed .so - to enable opensuse and fedora (or any RPM) builds?

It's a bit of a hack to do it this way, a wrapper lib would definitely be cleaner, but it just might work... and hackish, but working, is better than not working at all, I think 😆 What do you think?

@Kyr4l

Kyr4l commented Mar 11, 2024

hi, i was following the whole situation silently and i just wanted to tell you thanks for putting so much effort into this

@pallaswept
Author

pallaswept commented Aug 25, 2024

FYI I tested the new nvidia open cuda packages with this. That isn't released yet, it's beta/NFB at present, but that should give us a FOSS means to build this. I don't want to get too excited yet, but my initial tests have been promising.

https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/#using_package_managers_with_the_cuda_metapackage
