Hello, sorry to bother you. I have come across the same problem while running the train.py script. It occurs at Epoch[1](1438/4431) when I use --thread=8 and at Epoch[1](1426/4431) when I use --thread=1.
Following the related PyTorch issue, I still could not fix it.
===> Epoch[1](1438/4431): Loss: 2.4208, Error: (2.7224 1.7324 1.4768)
===> Epoch[1](1439/4431): Loss: 7.0227, Error: (7.0972 4.2546 3.7592)
===> Epoch[1](1440/4431): Loss: 6.4023, Error: (6.2473 4.1015 3.3724)
Main traceback:
Traceback (most recent call last):
  File "train.py", line 189, in <module>
    train(epoch)
  File "train.py", line 115, in train
    disp0, disp1, disp2 = model(input1, input2)
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 75, in parallel_apply
    thread.join()
  File "/home/~/anaconda3/envs/former/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/home/~/anaconda3/envs/former/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 7094) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

Concurrent multiprocessing feeder-thread traceback:
Traceback (most recent call last):
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
The code is running under PyTorch 1.1.0, cudatoolkit 10.1, and torchvision 0.3.0 on a four-GPU Titan X machine with Ubuntu 18.04.
Thanks!
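As the RuntimeError itself suggests, the quickest way to surface the real error may be to rerun with DataLoader workers disabled so the exception is raised in the main process with a full traceback. A minimal sketch of that idea, using a stand-in dataset rather than the project's actual train.py (names here are illustrative only):

```python
# Sketch: num_workers=0 loads data in the main process (like --thread=0),
# so a failure inside Dataset.__getitem__ raises with its original traceback
# instead of "DataLoader worker exited unexpectedly".
import torch
from torch.utils.data import DataLoader, Dataset

class DebugDataset(Dataset):
    """Stand-in for the real stereo dataset; replace with the project's Dataset."""
    def __init__(self, n=10):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        # The real __getitem__ would load and decode image files here.
        return torch.randn(3, 256, 512), torch.randn(1, 256, 512)

loader = DataLoader(DebugDataset(), batch_size=2, shuffle=True, num_workers=0)

for left, disp in loader:
    pass  # any exception now carries the worker-side stack trace
```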
I have never encountered this problem. Usually, broken image files cause a similar (but not identical) issue. You could print out the file name of each image to check whether training always stops at the same file (a quick way to scan for broken files is sketched below), or try training on another dataset to see whether the problem happens again.
It seems to be a PyTorch, CUDA, or hardware problem.
Did you try installing PyTorch from source?
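One way to follow the suggestion above is to scan the dataset for unreadable images before training. A rough sketch using Pillow, assuming the dataset is a directory tree of image files; DATA_ROOT and the extension list are placeholders, not paths from this repository:

```python
# Rough sketch: walk a dataset directory and report images that fail to decode.
import os
from PIL import Image

DATA_ROOT = "/path/to/dataset"          # placeholder, adjust to your data
EXTS = (".png", ".jpg", ".jpeg", ".pgm")

bad_files = []
for dirpath, _, filenames in os.walk(DATA_ROOT):
    for name in filenames:
        if not name.lower().endswith(EXTS):
            continue
        path = os.path.join(dirpath, name)
        try:
            with Image.open(path) as img:
                img.verify()            # cheap header/consistency check
            with Image.open(path) as img:
                img.load()              # full decode catches truncated files
        except Exception as exc:        # corrupt or truncated image
            bad_files.append((path, exc))
            print("broken:", path, "->", exc)

print(f"Found {len(bad_files)} broken file(s) under {DATA_ROOT}")
```

If this reports no broken files and the crash still appears at a fixed iteration, that would point more toward an environment or hardware issue, as noted above.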