Add xgboost and stan install and test scripts #331
These look good to me! @noamross any thoughts here?
Only that I'm not sure these are updated since @RobJY found the issue in NVIDIA/nvidia-docker#342, so these should likely change to install something other than
Also, these are interactive test scripts, not scripts that return a value or a specific error for failure. Wouldn't it be better to have something that can integrate into a smoke test?
That's right @noamross, I'll have to update the xgboost install script when I've got a fix working.
I've pushed an update based on a fix that I'll be deploying to our systems tomorrow. Let me explain what I think the issue is and a possible solution. The issue we've been having seems to be that, because the CUDA libraries are under such active development, it's very easy for a library to be updated after the CUDA driver has been installed, and then the library versions on the host and in the container no longer match. Keeping the It seems like using this script for Rocker builds has the opposite problem: we will be building images with the current versions of the libraries, but will have no way of knowing what version of the driver users will have on their host machines. It seems very likely that this will often result in this library mismatch error. For this reason I removed the versions from the Is there a better way to handle this?
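A rough sketch of the mismatch condition described above: the check passes only when the version string reported on the driver side and the one reported on the library side are identical. The function name `check_driver_match` is hypothetical, and how each version string is obtained on a given system is deliberately left out, since that is not covered in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): compare two version strings
# and fail when they differ. The caller supplies both strings; how they
# are looked up on the host or in the container is an open assumption.
check_driver_match() {
  local driver_version="$1" library_version="$2"
  if [ "$driver_version" = "$library_version" ]; then
    echo "ok: versions match ($driver_version)"
    return 0
  fi
  echo "mismatch: driver reports $driver_version, libraries report $library_version"
  return 1
}
```

Called as `check_driver_match "$driver" "$library"`, it returns 0 on a match and 1 otherwise, so it could gate a test script before any GPU work is attempted.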
I was thinking we'd collect standard output from running this and get the return code. If the code isn't 0, we would just echo the standard output. I see now that I should do that inside the script, though, and not create more work outside it. I've added the bash scripts that do what I described above. Let me know if you'd like any other changes.
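The behavior described above (capture the output, check the return code, echo the output only on failure) could look roughly like the sketch below. The function name `run_quiet` is hypothetical and this is not the script added in the PR, just a minimal illustration of the pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, stay silent on success, and print
# the captured output only when the command exits with a non-zero code.
run_quiet() {
  local output status=0
  # Capture stdout and stderr together so failure messages are not lost.
  # The "|| status=$?" form records the exit code and is safe under "set -e".
  output="$("$@" 2>&1)" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "FAILED (exit $status): $*"
    echo "$output"
  fi
  # Propagate the original exit code so a smoke test can act on it.
  return "$status"
}
```

A smoke test could then call `run_quiet Rscript path/to/test_script.R` (the path is a placeholder) and rely on the return code alone.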
Shouldn't we not need to install here?
I just took another look at this to see if I could move it along. I could add some text about possible fixes a user could try if they encounter the
We've gone through another deploy and had the same 'version mismatch' issue. To correct it, I modified the script (attached as text file) we run on our host machine to install the CUDA toolkit at version 11.1. Curiously, The block of code at the bottom of the script to install version 11.1 of the toolkit comes from this Nvidia page with the only modification being the change from How does this sound?
@RobJY thanks for the follow-up. The driver mismatch error is definitely an annoying one, and not one that I've completely wrapped my head around either. A few comments/observations so far:
This is as expected. Updating the driver on the host while the container is still running is one way to get the
Thank you @cboettig for the explanation of the CUDA version from nvidia-smi! That makes sense. You're right about suggesting a machine reboot. That should be the first thing we ask users to try. I did see that that was successful for many users on the forums. Unfortunately, it didn't work for me. Let me explain what I think the issue was for our servers and please let me know if you think I've got any of it wrong.
Is there a better way to achieve this?
Thanks @RobJY, apologies, I missed that the script was not part of the commit but just attached to the comment. Very interesting! If it works, it works, but honestly this just deepens the puzzle for me: this script is all being run outside the container, on the host machine, yeah? I don't think you should need a toolkit of any kind installed on the host machine, though? Maybe I'm just confused. All software above the kernel level (i.e. not including the drivers) should be handled in the containerized layer? Or does that assumption go out the window when we're using nvidia-docker? I've also had to mess with apt-pinning to get the nvidia container runtime to play nicely with upgraded drivers, as described here: pop-os/pop#1708 (comment). Maybe this is related to, or an alternative version of, your script above? But I still see containers that have been up a while start to throw the error
Apologies about the attached script. I did have it hidden at the bottom of the post. Yes, we're running the attached script on the host machine. Yeah, I think you're right that the nvidia-docker install loads some packages on the host and those are at a higher version than those in the container. This makes sense to me because without any pinning it's going to install the newest versions available. I also tried pinning versions earlier, but Nvidia doesn't seem to distribute older versions of packages for very long, and when I went back to use the pinned script I had, the older versions of the packages were unavailable. I haven't seen the
Thanks! I now recall a bit more: I think the error I get is related to It seems if I run with explicit
Feels like that shouldn't be necessary, but it needs some software to be fixed upstream first...
That's interesting. It looks like a nice clean solution to me. It probably protects you from other potential issues as well.
Here's code to install and test GPU-supported xgboost and stan. Please let me know if I should move or add anything.
I'm not sure if I've put install_stan.R in the right place, since it needs to be run by users to install stan. Is there a better place for it? I got the xgboost test script from rdrr.io (https://rdrr.io/cran/xgboost/src/demo/gpu_accelerated.R). Please let me know if you think there's something better we could use.