Thanks for this work. I was trying to train the model using the conda environment, but after a few hours it used up all of the system RAM. When I downgraded torch to a 1.x release, training worked. Here is the error message from after the memory got filled:
...
[GPU0] Training epoch: 14
0%| | 0/2948 [00:00<?, ?it/s][GPU1] Training epoch: 14
8%|████████ | 228/2948 [01:49<12:51, 3.53it/s][GPU0] Model saved
25%|█████████████████████████▋ | 728/2948 [04:39<16:02, 2.31it/s][GPU0] Model saved
42%|██████████████████████████████████████████▉ | 1228/2948 [07:49<10:03, 2.85it/s][GPU0] Model saved
59%|████████████████████████████████████████████████████████████▎ | 1728/2948 [10:39<06:38, 3.06it/s][GPU0] Model saved
76%|█████████████████████████████████████████████████████████████████████████████▊ | 2228/2948 [13:54<04:46, 2.51it/s][GPU0] Model saved
93%|███████████████████████████████████████████████████████████████████████████████████████████████▎ | 2728/2948 [17:08<01:15, 2.92it/s][GPU0] Model saved
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2948/2948 [18:27<00:00, 2.66it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2948/2948 [18:27<00:00, 2.66it/s]
[GPU0] Validating at the start of epoch: 15
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2032/2032 [02:59<00:00, 11.34it/s]
[GPU0] Validation set average loss: 0.2255345582962036
[GPU0] Training epoch: 15
[GPU1] Training epoch: 15
0%| | 0/2948 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/xx/DL/Repos/RobustVideoMatting/train_xx.py", line 501, in <module>
mp.spawn(
File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL
/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 138 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
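A small helper like the one below can confirm that it is host RAM, not GPU memory, that keeps growing each epoch. This is only a diagnostic sketch: psutil is not part of the repository's requirements and the helper name is made up for illustration.

```python
import os
import psutil  # not in the repo's requirements; used here only as a diagnostic


def log_host_memory(tag: str) -> None:
    # Resident set size of the current process, in gigabytes.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"[{tag}] host RSS: {rss_gb:.2f} GB")


# Called once per epoch: a steadily growing RSS with flat GPU memory points
# at a host-side leak (e.g. DataLoader workers or tensors kept alive
# across iterations) rather than a CUDA allocation problem.
log_host_memory("epoch end")
```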
Is there any reason why PyTorch 2 would leak memory when combined with distributed training?
I disabled distributed training by modifying the code, which avoids the memory leak while staying on PyTorch 2. PyTorch 2 trains a lot faster than 1.x.
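For reference, here is a minimal, self-contained sketch of the kind of change I mean: running the training loop in a single process instead of launching it through mp.spawn with DistributedDataParallel. The toy model and random tensors are placeholders, not the repository's actual model or dataset.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Toy stand-ins; the real script builds the matting model and the
    # video datasets here instead.
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1)).to(device)
    dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
    loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.MSELoss()

    # Plain single-process loop: no mp.spawn, no process group, no DDP wrapper.
    for epoch in range(3):
        model.train()
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad(set_to_none=True)
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: last batch loss {loss.item():.4f}")


if __name__ == "__main__":
    main()
```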