This is a centralized thread for discussing training-related reproducibility. I noticed variations in the resultant accuracy while developing the model (mean and std are given in TRAINING.md), but there are reports of consistently worse performance when the model is re-trained on a different setup (#68 #60 #50). Granted, those who successfully train the model are unlikely to open an issue.
I tried to investigate, and I can confirm that the reproducibility problem exists, in ways that I do not understand. Here, I share my findings in the hope that they help people who wish to retrain the network. I believe a good network/setup should be stable and not sensitive to small environmental variations, but here we are.
1. Default setting: Two A6000 GPUs, PyTorch 1.11, CUDA 11.3
Environment creation:
Training command (a script-side sketch of this kind of launch is included after this setting's results):
python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-a6000 --stage 03
(I cancelled this run with Ctrl-C when it entered stage 3 and switched to another server with the same GPU model, because someone else needed the GPUs on the first server.)
python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-s0-a6000 --stage 3 --load_network saves/Mar09_12.57.58_retrain-a6000_s0/Mar09_12.57.58_retrain-a6000_s0_150000.pth
DAVIS 2017 val at 107K iterations: 86.8
DAVIS 2017 val at 110K iterations: 86.8
Training log: https://drive.google.com/drive/folders/1qBkgIh5a3PMyrt9FFxKEBTC3kUnBABTX?usp=sharing
pip list
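For readers less familiar with the launcher used in the commands above, here is a minimal sketch of how a script started with torch.distributed.run typically picks up its per-process rank and wraps the model in DDP. This is not the actual train.py of this repository; the model and the print statement are placeholders.

```python
# Minimal DDP entry-point sketch for a `python -m torch.distributed.run --nproc_per_node=N script.py` launch.
# NOT the repository's train.py; the Linear model and the print are placeholders.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torch.distributed.run sets LOCAL_RANK / RANK / WORLD_SIZE (and MASTER_PORT from --master_port)
    # in each spawned process's environment.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(16, 16).cuda()        # placeholder network
    model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced across processes

    # Whether the effective (global) batch size changes with --nproc_per_node depends on
    # whether the per-GPU batch size is kept fixed or the global batch is divided by WORLD_SIZE.
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()} on GPU {local_rank}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```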
2. V100 2-GPU setting: Two V100 GPUs, PyTorch 1.11, CUDA 10.2
Environment creation:
Training command (we start from the pretrained s0 weights):
DAVIS 2017 val at 107K iterations: 86.1
DAVIS 2017 val at 110K iterations: 86.0
Training log: https://drive.google.com/drive/folders/1SDpsbpfnz4rRRNTFrXWr3h3D1-20Vj6s?usp=sharing
pip list
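Since the environment clearly matters here, a small companion snippet to the pip list dumps above: it records the PyTorch, CUDA, and cuDNN versions plus the GPU models, using only standard PyTorch attributes, so that setups can be compared at a glance.

```python
# Record the framework/driver side of the environment alongside `pip list`.
# Uses only standard PyTorch attributes.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```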
3. V100 2-GPU setting: Two V100 GPUs, PyTorch 1.12.1, CUDA 10.2
(No environment creation commands are available because this is my default development environment, which I have used for a long time.)
Training command:
DAVIS 2017 val at 107K iterations: 86.1
DAVIS 2017 val at 110K iterations: 86.1
Training log: https://drive.google.com/drive/folders/1lKnkKywkOqqBJaRMdei06Z_Cs3ynLIgp?usp=sharing
4. V100 4-GPU setting: Four V100 GPUs, PyTorch 1.11, CUDA 10.2
Environment creation same as (2)
Training command (we start from the pretrained s0 weights):
DAVIS 2017 val at 107K iterations: 85.3
DAVIS 2017 val at 110K iterations: 84.2
5. V100 8-GPU setting: Eight V100 GPUs, PyTorch 1.11, CUDA 10.2
Environment creation same as (2)
Training command (we start from the pretrained s0 weights):
DAVIS 2017 val at 107K iterations: 85.5
DAVIS 2017 val at 110K iterations: 85.9
TL;DR: Training on two GPUs seems to give more consistent performance (I used two GPUs most of the time during the development of this method).
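For anyone who wants to probe this further, the standard PyTorch seeding/determinism switches are sketched below. To be clear, these mainly control run-to-run variance on a fixed setup; they are not expected to remove differences between hardware or GPU counts, and the runs above do not necessarily use them.

```python
# Standard PyTorch seeding/determinism controls: a starting point for reproducibility
# experiments, not a claim that they close the gap reported above.
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


seed_everything(42)
torch.backends.cudnn.benchmark = False      # disable autotuning of conv algorithms
torch.backends.cudnn.deterministic = True   # pick deterministic cuDNN kernels
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some ops on CUDA >= 10.2
torch.use_deterministic_algorithms(True)    # error out on known-nondeterministic ops
```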
Feel free to discuss/share below.