[BUG/Discussion] Single RTX 3080 shutdowns computer during training after a few epochs #966
-
@timothylimyl this should be a discussion, not a bug. It is most likely power or heat. Having a power supply with big enough numbers doesn't mean it delivers. EVGA is usually good, but I only trust their G/P/T2 and G3 lines that use Super Flower; any other series is a big question mark.
-
@rwightman, noted with thanks. The PSU is an 850W GR (EVGA Gold), so I think the power should be fine unless there is a sudden spike over the power limit that I am not noticing. I tested it with this script, which draws ~340W consistently, and there was no issue for an hour:
Meanwhile, running the training, where the power fluctuates, causes auto-restarts roughly 5-10 minutes in. The temperature as of
-
Hi all, in case anyone is facing the same issue as me: the company figured out that the ventilation was insufficient and the GPU was overheating. I guess the
The seller fixed it by adding more fans and removing the top of the case to allow more airflow.
-
Describe the bug
The training script, train.py, causes an automatic shutdown.
To Reproduce
Steps to reproduce the behavior:
Tested 10 times; the behavior is roughly the same each time: after 10-20 epochs, the desktop shuts down.
Expected behavior
The computer should not shut down or restart.
Desktop (please complete the following information):
Additional context
This is not really related to this repository; it is a GPU hardware problem. However, after searching the web, it seems to be a common problem for the 3080/3090, potentially caused by power fluctuation, which this repository's training script triggers. Most forum threads are about 3080/3090 issues caused by power fluctuation in games only.
I saw that @rwightman actually managed to train on a 3090, and was mainly wondering: did you face any power fluctuation issues that made your desktop shut down? I am sure it is not a PSU issue (850W is over the requirement by 100W), and the NVIDIA driver is new, which supposedly fixes this issue. I am mainly posting this in case anyone has faced the same issue and fixed it.
Power fluctuation: the steady training state is roughly 340W, but it can go down to 100-200W.
Additional Test
Yes, I tested it with a GTX 1080; the power fluctuates but it works (it never shut down).
a. I monitored the GPU using nvidia-smi and saw that when the power drops by 100-200W, the computer hangs and restarts. Edit: sometimes it goes through a few more fluctuation cycles before shutting down.
b. I ran a script drawing ~340W consistently for an hour without any issues.
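For anyone who wants to repeat the nvidia-smi monitoring in test (a) and still have the readings after a hard restart, here is a minimal sketch, not the script used in this thread. It polls nvidia-smi with its --query-gpu option and appends power draw and temperature to a file, forcing each sample to disk so the last readings before a shutdown survive. The names parse_sample, find_drops, and log_gpu, the file name gpu_log.csv, the 0.5 s interval, and the 100 W threshold are all assumptions chosen for illustration.

```python
import os
import subprocess
import time

# Standard nvidia-smi query: one CSV line per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,power.draw,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one CSV line such as '2021/11/05 10:00:00.000, 338.12, 71'."""
    timestamp, power, temp = [field.strip() for field in line.split(",")]
    return timestamp, float(power), float(temp)


def find_drops(samples, min_drop_w=100.0):
    """Return (timestamp, previous_w, current_w) for every consecutive
    pair of samples where power fell by at least min_drop_w watts."""
    drops = []
    for (_, prev_w, _), (ts, cur_w, _) in zip(samples, samples[1:]):
        if prev_w - cur_w >= min_drop_w:
            drops.append((ts, prev_w, cur_w))
    return drops


def log_gpu(path="gpu_log.csv", interval_s=0.5):
    """Append GPU samples to `path` until interrupted (Ctrl+C)."""
    with open(path, "a") as log:
        while True:
            out = subprocess.run(
                QUERY, capture_output=True, text=True, check=True
            ).stdout
            for line in out.strip().splitlines():
                ts, power_w, temp_c = parse_sample(line)
                log.write(f"{ts},{power_w},{temp_c}\n")
            # Flush and fsync so the sample is on disk before a hard reset.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Run log_gpu() in a second terminal while training. After a restart, the tail of gpu_log.csv shows whether the temperature was climbing or the power had just dropped, and find_drops can flag the 100-200W dips described above. Note that some GPUs report power.draw as unavailable, in which case the float conversion in parse_sample would fail.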