This is a centralized thread for discussing training-related reproducibility. I noticed variations in the resultant accuracy while developing the model (mean and std are given in TRAINING.md), but there are reports of consistently worse performance when the model is re-trained on a different setup (#68 #60 #50). Granted, those who successfully train the model are unlikely to open an issue.
I tried to investigate, and I can confirm that the reproducibility problem exists, in ways that I do not understand. Here, I share my findings in the hope that they help people who wish to retrain the network. I believe a good network/setup should be stable and not sensitive to small environmental variations, but here we are.
1. Default setting: Two A6000 GPUs, PyTorch 1.11, CUDA 11.3
Environment creation:
Training command (a script-side sketch of this kind of launch is included after this setting's results):
python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-a6000 --stage 03
(I cancelled this run with Ctrl-C when it entered stage 3 and switched to another server with the same GPU model, because someone else needed the GPUs on the first server.)
python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-s0-a6000 --stage 3 --load_network saves/Mar09_12.57.58_retrain-a6000_s0/Mar09_12.57.58_retrain-a6000_s0_150000.pth
DAVIS 2017 val at 107K iterations: 86.8
DAVIS 2017 val at 110K iterations: 86.8
Training log: https://drive.google.com/drive/folders/1qBkgIh5a3PMyrt9FFxKEBTC3kUnBABTX?usp=sharing
pip list
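For readers less familiar with the launcher used in the commands above, here is a minimal sketch of how a script started with torch.distributed.run typically picks up its per-process rank and wraps the model in DDP. This is not the actual train.py of this repository; the model and the print statement are placeholders.

```python
# Minimal DDP entry-point sketch for a `python -m torch.distributed.run --nproc_per_node=N script.py` launch.
# NOT the repository's train.py; the Linear model and the print are placeholders.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torch.distributed.run sets LOCAL_RANK / RANK / WORLD_SIZE (and MASTER_PORT from --master_port)
    # in each spawned process's environment.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(16, 16).cuda()        # placeholder network
    model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced across processes

    # Whether the effective (global) batch size changes with --nproc_per_node depends on
    # whether the per-GPU batch size is kept fixed or the global batch is divided by WORLD_SIZE.
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()} on GPU {local_rank}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```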
2. V100 2-GPU setting: Two V100 GPUs, PyTorch 1.11, CUDA 10.2
Environment creation:
Training command (we start from the pretrained s0 weights):
DAVIS 2017 val at 107K iterations: 86.1
DAVIS 2017 val at 110K iterations: 86.0
Training log: https://drive.google.com/drive/folders/1SDpsbpfnz4rRRNTFrXWr3h3D1-20Vj6s?usp=sharing
pip list
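Since the environment clearly matters here, a small companion snippet to the pip list dumps above: it records the PyTorch, CUDA, and cuDNN versions plus the GPU models, using only standard PyTorch attributes, so that setups can be compared at a glance.

```python
# Record the framework/driver side of the environment alongside `pip list`.
# Uses only standard PyTorch attributes.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```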
3. V100 2-GPU setting: Two V100 GPUs, PyTorch 1.12.1, CUDA 10.2
(No environment creation commands are available because this is my default development environment, which I have used for a long time.)
Training command:
DAVIS 2017 val at 107K iterations: 86.1
DAVIS 2017 val at 110K iterations: 86.1
Training log: https://drive.google.com/drive/folders/1lKnkKywkOqqBJaRMdei06Z_Cs3ynLIgp?usp=sharing
4. V100 4-GPU setting: Four V100 GPUs, PyTorch 1.11, CUDA 10.2
Environment creation same as (2)
Training command (we start from the pretrained s0 weights):
DAVIS 2017 val at 107K iterations: 85.3
DAVIS 2017 val at 110K iterations: 84.2
5. V100 8-GPU setting: Eight V100 GPUs, PyTorch 1.11, CUDA 10.2
Environment creation same as (2)
Training command (we start from the pretrained s0 weights):
DAVIS 2017 val at 107K iterations: 85.5
DAVIS 2017 val at 110K iterations: 85.9
TL;DR: Training on two GPUs seems to give more consistent performance (I used two GPUs most of the time during the development of this method).
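For anyone who wants to probe this further, the standard PyTorch seeding/determinism switches are sketched below. To be clear, these mainly control run-to-run variance on a fixed setup; they are not expected to remove differences between hardware or GPU counts, and the runs above do not necessarily use them.

```python
# Standard PyTorch seeding/determinism controls: a starting point for reproducibility
# experiments, not a claim that they close the gap reported above.
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


seed_everything(42)
torch.backends.cudnn.benchmark = False      # disable autotuning of conv algorithms
torch.backends.cudnn.deterministic = True   # pick deterministic cuDNN kernels
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some ops on CUDA >= 10.2
torch.use_deterministic_algorithms(True)    # error out on known-nondeterministic ops
```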
Feel free to discuss/share below.