In the v0.3.0 release, vgpu-server creates a socket at /run/alnair.sock on the host machine, and user containers mount /run/alnair.sock as a hostPath volume to communicate with it.
So far vgpu-server has been deployed directly on the host. Will this communication still work when vgpu-server is deployed in a pod?
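For reference, here is a minimal sketch of how a user pod mounts the host socket under that scheme (the pod/volume names and image are illustrative placeholders, not the actual alnair manifests):

```yaml
# Sketch only: a user pod mounting the host socket created by vgpu-server.
# Pod/volume/container names and the image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: user-workload
spec:
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.0-base        # placeholder image
    volumeMounts:
    - name: alnair-sock
      mountPath: /run/alnair.sock
  volumes:
  - name: alnair-sock
    hostPath:
      path: /run/alnair.sock
      type: Socket                      # fails fast if the socket is missing on the host
```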
In the manifest, if we mount /run from the host into the vgpu-server container as a hostPath volume, the container cannot start. The error message is:
"Error: failed to start container "alnair-vgpu-server": Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/5afe69b7c6a1edf6187c693b8e9876f1e7df11b0b37badf1a61eae16c7694f4f/merged/run/nvidia-persistenced/socket: no such device or address: unknown"
However, only one GPU node reports this error while the other does not, even after setting up the same Docker (20.10.12), nvidia-docker2 (2.9.1), and nvidia-container-cli (1.8.1) versions on both nodes. See NVIDIA/nvidia-docker#885.
According to NVIDIA/nvidia-docker#825, the mount error is caused by nvidia-docker2; the suggestion there is to install nvidia-container-runtime instead of nvidia-docker2. From that thread:
"@3XX0 is this something that changed in version 2? we've been using nvidia-docker for a while now and we do have nvidia-smi as part of our docker images. we started to get the mount error once we switched to nvidia-docker2"
"No this requirement didn't change between the versions, as part of v2 we prevent the container from starting if you have the NVIDIA driver (e.g: nvidia-smi) in your image as this will lead to undefined behavior."
Back to the pod-deployment question: Kubernetes has an example of using an emptyDir volume for communication between containers within the same pod, but that is not exactly what we want, since an emptyDir is shared only among containers of one pod, while the user containers run in separate pods:
https://kubernetes.io/docs/tasks/access-application-cluster/communicate-containers-same-pod-shared-volume/
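For illustration, the pattern from that page boils down to something like the following sketch (container names and images are placeholders); the shared volume is visible only to containers of this one pod, which is why it does not help for host-to-pod or pod-to-pod traffic:

```yaml
# Sketch of the shared-emptyDir pattern from the Kubernetes docs.
# The volume exists only for the lifetime of this pod and is visible
# only to its own containers.
apiVersion: v1
kind: Pod
metadata:
  name: two-containers
spec:
  volumes:
  - name: shared-run
    emptyDir: {}
  containers:
  - name: server
    image: example/server              # placeholder
    volumeMounts:
    - name: shared-run
      mountPath: /run/shared           # server could create its socket here
  - name: client
    image: example/client              # placeholder
    volumeMounts:
    - name: shared-run
      mountPath: /run/shared           # client sees the same directory
```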
References:
Expose a Unix socket to the host system from inside a Docker container:
https://serverfault.com/questions/881875/expose-a-unix-socket-to-the-host-system-from-inside-from-a-docker-container/881895#881895
Use a Unix socket in Docker:
https://web.archive.org/web/20210411145047/https://www.jujens.eu/posts/en/2017/Feb/15/docker-unix-socket/
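Following the idea in those references (bind-mount only the socket's own directory rather than all of /run), one possible arrangement for the vgpu-server pod might look like the sketch below. This is an untested assumption, not a verified fix: it presumes the socket path can be configured to live in its own directory such as /run/alnair, so the mount does not shadow /run/nvidia-persistenced on the host.

```yaml
# Sketch only: expose the server's socket to the host without bind-mounting
# all of /run (which clashed with /run/nvidia-persistenced/socket above).
# Assumes the socket path is configurable, e.g. /run/alnair/alnair.sock;
# paths, names, and the image are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: alnair-vgpu-server
spec:
  containers:
  - name: vgpu-server
    image: example/alnair-vgpu-server   # placeholder image
    volumeMounts:
    - name: alnair-run
      mountPath: /run/alnair            # server creates its socket in here
  volumes:
  - name: alnair-run
    hostPath:
      path: /run/alnair
      type: DirectoryOrCreate           # created on the host if missing
```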