KubeFlow on GPUs
Marco Ceppi edited this page Mar 9, 2018 · 7 revisions
These are the steps currently necessary to enable KubeFlow-compatible GPU support on a worker node. Eventually these steps will be automated by the kubernetes-worker charm.
Use `juju ssh` to connect to each GPU-enabled kubernetes-worker node and perform the following steps:
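If you have several GPU workers, you can iterate over the units; this sketch only prints the `juju ssh` invocations (the unit numbers 0-2 are placeholders for your deployment):

```shell
# Print the juju ssh command for each GPU-enabled worker unit;
# replace the placeholder unit numbers with your own.
for i in 0 1 2; do
  echo "juju ssh kubernetes-worker/$i"
done
```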
# add nvidia-docker repo
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# add docker repo
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
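For reference, on Ubuntu 16.04 (the release the nvidia-docker list above targets) `lsb_release -cs` prints `xenial`, so the repository line expands as this sketch shows:

```shell
# What the add-apt-repository argument expands to; "xenial" is the
# Ubuntu 16.04 codename that `lsb_release -cs` prints on that release.
codename=xenial
echo "deb [arch=amd64] https://download.docker.com/linux/ubuntu $codename stable"
```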
# install all the necessary bits
sudo apt-get update
sudo apt-get remove docker docker-engine docker.io
sudo apt-get install docker-ce nvidia-docker2
# if you get driver/library version mismatches, reload the kernel modules:
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia
sudo modprobe nvidia
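The same unload/reload sequence can be written as a loop; the order matters because `nvidia_drm`, `nvidia_modeset`, and `nvidia_uvm` depend on the core `nvidia` module and must be removed first. This sketch only prints the commands rather than executing them:

```shell
# Unload dependent modules before the core nvidia module, then reload it.
# Printed rather than executed, so it is safe to run anywhere.
for m in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
  echo "sudo rmmod $m"
done
echo "sudo modprobe nvidia"
```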
# now, this command should work:
nvidia-smi -a
# after restarting docker, this command should work as well:
sudo systemctl restart docker.service
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
# to make the nvidia runtime change permanent, change the charm template on the worker nodes:
NODE_NR=0
sudo vi /var/lib/juju/agents/unit-kubernetes-worker-${NODE_NR}/charm/templates/docker.systemd
# and replace the "docker daemon" invocation with "dockerd"
# then do this:
sudo sed -i 's|ExecStart=/usr/bin/docker daemon -H fd:// $DOCKER_OPTS|ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS|' /lib/systemd/system/docker.service
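If you want to sanity-check the `sed` substitution before touching the real unit file, it can be exercised on a scratch copy; the file contents below are illustrative:

```shell
# Demonstrate the ExecStart rewrite on a scratch file instead of
# /lib/systemd/system/docker.service.
tmp=$(mktemp)
echo 'ExecStart=/usr/bin/docker daemon -H fd:// $DOCKER_OPTS' > "$tmp"
sed -i 's|ExecStart=/usr/bin/docker daemon -H fd:// $DOCKER_OPTS|ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS|' "$tmp"
cat "$tmp"
rm -f "$tmp"
```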
# reload the docker daemon
sudo systemctl daemon-reload
sudo systemctl restart docker.service
# you can test whether the runtime changes are permanent by executing the following (only after the juju client changes below have been applied):
docker run --rm nvidia/cuda nvidia-smi
# from the juju client, enable the DevicePlugins feature gate and make nvidia the default docker runtime:
juju config kubernetes-worker kubelet-extra-args="feature-gates=DevicePlugins=true"
juju config kubernetes-worker docker-opts="--default-runtime=nvidia"
# you also need the nvidia k8s daemonset deployed to expose the feature on the worker nodes
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
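With the device plugin running, pods can request GPUs through the `nvidia.com/gpu` extended resource. A minimal smoke-test pod spec might look like this (the pod name, image tag, and limit value are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```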