Training doesn't start - I'm getting an error with the data loader #168
Comments
I am also facing similar issues, but on Ubuntu. I guess this is because of the PyTorch version; I am using the latest 1.10 release, and we should probably stick strictly to the version this repo expects.
I have the same issue, have you got any ideas to solve it? Thanks.
Using 1.0.0 will raise a new problem:
I am also getting the same issue. From what I could gather, it seems to be a problem with pickling a lambda function for multiprocessing, so disabling multiprocessing data loading worked for me.
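For reference, "disabling multiprocessing data loading" just means creating the DataLoader with num_workers=0, so batches are loaded in the main process and nothing has to be pickled and sent to worker processes. The snippet below is only a minimal sketch: the dataset here is a throwaway placeholder, not what train_ssd.py actually builds, and the Namespace in the log further down shows the script has its own num_workers setting (currently 4), so the equivalent command-line option set to 0 should have the same effect.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset just to keep the sketch self-contained; train_ssd.py
# builds its own Open Images dataset instead.
train_dataset = TensorDataset(torch.zeros(10, 3, 300, 300))

train_loader = DataLoader(
    train_dataset,
    batch_size=5,
    shuffle=True,
    num_workers=0,  # 0 = load batches in the main process, no worker processes to pickle for
)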
Hello
I am trying to run it on my Windows machine. My dataset seems to be correct.
When I start training, I get an error.
Here is the call with the arguments:
python train_ssd.py --dataset_type open_images --datasets C:/Users/rsamv/Documents/data/open_images_datasets/apples --net mb1-ssd --pretrained_ssd C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth --scheduler cosine --lr 0.01 --t_max 100 --validation_epochs 5 --num_epochs 100 --base_net_lr 0.01 --batch_size 5
Here is the error I get:
(base) PS C:\Users\rsamv> cd C:\Users\rsamv\Documents\pytorch-ssd
(base) PS C:\Users\rsamv\Documents\pytorch-ssd> python train_ssd.py --dataset_type open_images --datasets C:/Users/rsamv/Documents/data/open_images_datasets/apples --net mb1-ssd --pretrained_ssd C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth --scheduler cosine --lr 0.01 --t_max 100 --validation_epochs 5 --num_epochs 100 --base_net_lr 0.01 --batch_size 5
2021-12-07 23:11:37,702 - root - INFO - Use Cuda.
2021-12-07 23:11:37,703 - root - INFO - Namespace(dataset_type='open_images', datasets=['C:/Users/rsamv/Documents/data/open_images_datasets/apples'], validation_dataset=None, balance_data=False, net='mb1-ssd', freeze_base_net=False, freeze_net=False, mb2_width_mult=1.0, lr=0.01, momentum=0.9, weight_decay=0.0005, gamma=0.1, base_net_lr=0.01, extra_layers_lr=None, base_net=None, pretrained_ssd='C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', milestones='80,100', t_max=100.0, batch_size=5, num_epochs=100, num_workers=4, validation_epochs=5, debug_steps=100, use_cuda=True, checkpoint_folder='models/')
2021-12-07 23:11:37,703 - root - INFO - Prepare training datasets.
2021-12-07 23:11:38,263 - root - INFO - Dataset Summary:Number of Images: 1344
Minimum Number of Images for a Class: -1
Label Distribution:
apple: 5376
2021-12-07 23:11:38,277 - root - INFO - Stored labels into file models/open-images-model-labels.txt.
2021-12-07 23:11:38,278 - root - INFO - Train dataset size: 1344
2021-12-07 23:11:38,279 - root - INFO - Prepare Validation datasets.
2021-12-07 23:11:38,472 - root - INFO - Dataset Summary:Number of Images: 480
Minimum Number of Images for a Class: -1
Label Distribution:
apple: 1920
2021-12-07 23:11:38,476 - root - INFO - validation dataset size: 480
2021-12-07 23:11:38,477 - root - INFO - Build network.
2021-12-07 23:11:38,537 - root - INFO - Init from pretrained ssd C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth
2021-12-07 23:11:38,583 - root - INFO - Took 0.05 seconds to load the model.
2021-12-07 23:11:38,996 - root - INFO - Learning rate: 0.01, Base net learning rate: 0.01, Extra Layers learning rate: 0.01.
2021-12-07 23:11:38,997 - root - INFO - Uses CosineAnnealingLR scheduler.
2021-12-07 23:11:38,997 - root - INFO - Start training from epoch 0.
Traceback (most recent call last):
File "C:\Users\rsamv\Documents\pytorch-ssd\train_ssd.py", line 325, in
train(train_loader, net, criterion, optimizer,
File "C:\Users\rsamv\Documents\pytorch-ssd\train_ssd.py", line 116, in train
for i, data in enumerate(loader):
File "C:\Users\rsamv\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 359, in iter
return self._get_iterator()
File "C:\Users\rsamv\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 305, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\rsamv\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 918, in init
w.start()
File "C:\Users\rsamv\anaconda3\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Users\rsamv\anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\rsamv\anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\Users\rsamv\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in init
reduction.dump(process_obj, to_child)
File "C:\Users\rsamv\anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TrainAugmentation.__init__.<locals>.<lambda>'
(base) PS C:\Users\rsamv\Documents\pytorch-ssd> 2021-12-07 23:11:40,772 - root - INFO - Use Cuda.
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\rsamv\anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Users\rsamv\anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
It seems that the loader variable has a problem. I wonder if it's caused by some incompatibility with Windows, for instance at the Path level?
Any ideas?
Thanks a lot!
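A note on the traceback above: the AttributeError suggests the transform pipeline contains a lambda created inside TrainAugmentation.__init__, and a lambda cannot be pickled when the DataLoader spawns worker processes on Windows. If you want to keep num_workers above 0, that lambda has to be replaced with something picklable. I do not know the exact code inside TrainAugmentation, so the following is only a sketch of the general pattern: a module-level callable class (or functools.partial around a module-level function) instead of a constructor-local lambda.

# Sketch only, not the repo's actual code: objects defined at module level
# can be pickled, so this class can stand in for a lambda such as
#     lambda img, boxes=None, labels=None: (img / std, boxes, labels)
class DivideByStd:
    """Picklable replacement for a constructor-local lambda transform."""

    def __init__(self, std):
        self.std = std

    def __call__(self, img, boxes=None, labels=None):
        # Same behaviour as the hypothetical lambda: scale the image, pass
        # boxes and labels through unchanged.
        return img / self.std, boxes, labels

# Inside TrainAugmentation.__init__, the lambda entry in the transform list
# would then be replaced by DivideByStd(std).

With every transform picklable, the dataset can be shipped to the spawned worker processes and the multiprocessing data loader should start normally; this is an alternative to setting num_workers to 0.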