Dear all,

I was wondering whether it would be possible to provide a Dockerfile for building a GraphNeT Docker image and running it inside a container. The idea behind this is that, when running in a container, we have full control of an isolated environment: if we experience issues during training or inference, we can stop the container without disturbing other processes running on the cluster outside it.
It once happened to me that I stopped a training run with Ctrl+C in the terminal where it was running, but the GPUs somehow remained frozen with ghost processes after the training script had stopped. I tried to stop them manually with kill commands, but the processes then showed up with PID N/A in nvtop, and nvidia-smi listed no processes at all even though the GPUs were still in use. The next thing I tried was to shut down the processes holding the GPUs with the following two commands:
fuser -v /dev/nvidia*
kill $(lsof -t /dev/nvidia*)
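For reference, the two commands above can be wrapped into a small cleanup script that kills the holders of each NVIDIA device file one by one and skips cleanly when no devices are present. This is a hypothetical sketch, not part of GraphNeT; it assumes `lsof` is installed:

```shell
#!/bin/sh
# Sketch: kill any process still holding an NVIDIA device file.
# Loops over /dev/nvidia* so it exits quietly on machines without GPUs.
for dev in /dev/nvidia*; do
    [ -e "$dev" ] || continue              # glob did not expand: no NVIDIA devices
    for pid in $(lsof -t "$dev" 2>/dev/null); do
        echo "killing PID $pid holding $dev"
        kill -9 "$pid" 2>/dev/null || true # ignore already-dead or permission errors
    done
done
```

Note that killing ghost processes this way is indiscriminate: it will also terminate healthy jobs of other users holding the devices, which is exactly why container-level isolation is preferable.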
The first command did not always fully work, whereas the second one did. In the cases where this did not work either, the cluster on which I was running GraphNeT needed to be rebooted, making it unusable for my co-workers in the meantime...
Therefore, I would like to encourage the GraphNeT developers to reconsider providing a Dockerfile, as the project did in the past. I will be very happy to help with this, but I am not sure I have all the required knowledge to do it on my own.
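As a starting point for discussion, a minimal sketch of what such a Dockerfile could look like. Everything here is an assumption to be checked against the project's actual requirements: the CUDA base image, the Python version, and installing from the `graphnet-team/graphnet` repository with pip are all placeholders, not the confirmed setup:

```dockerfile
# Hypothetical sketch -- base image, Python setup, and install method
# are assumptions, not GraphNeT's confirmed build recipe.
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip git \
    && rm -rf /var/lib/apt/lists/*

# Assumption: GraphNeT installs from its GitHub repository with pip.
RUN git clone https://github.com/graphnet-team/graphnet.git /opt/graphnet \
    && pip3 install --no-cache-dir /opt/graphnet

WORKDIR /workspace
ENTRYPOINT ["/bin/bash"]
```

One could then build the image with `docker build -t graphnet .` and run it with GPU access via `docker run --gpus all -it graphnet` (this requires the NVIDIA Container Toolkit on the host). The practical benefit for the ghost-process problem above is that `docker stop` terminates everything inside the container, so a stuck training run cannot leave orphaned processes holding the host's GPUs.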
Thank you very much!