Make sure git lfs is installed and initialized. The cifar10 dataset is now included in the distributed-training repo via git lfs.
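A quick way to confirm the prerequisite is to ask git whether the lfs extension responds. The sketch below is only an illustration: the helper name `git_lfs_installed` is hypothetical and not part of the repo. It checks that `git lfs version` runs; it does not check whether `git lfs install` has already been run for your user.

```python
import subprocess


def git_lfs_installed() -> bool:
    """Return True if `git lfs version` exits successfully."""
    try:
        result = subprocess.run(
            ["git", "lfs", "version"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # git itself is not on PATH.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if git_lfs_installed():
        print("git lfs found; run `git lfs install` once, then `git lfs pull` inside the repo.")
    else:
        print("git lfs not found; install it before fetching the dataset.")
```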
Make sure you have the imagenet dataset at ~/data. Here is a possible tree structure of ~/data:
/home/ubuntu/data
├── cifar-10-batches-py
│   ├── batches.meta
│   ├── data_batch_1
│   ├── data_batch_2
│   ├── data_batch_3
│   ├── data_batch_4
│   ├── data_batch_5
│   ├── readme.html
│   └── test_batch
├── cifar-10-python.tar.gz
└── imagenet
    ├── bounding_boxes
    ├── idxar_map.p
    ├── idxar_map_192.p
    ├── idxar_map_64.p
    ├── imagenet_2012_bounding_boxes.csv
    ├── sorted_idxar.p
    ├── train
    ├── trn_file2size.p
    ├── val_file2size.p
    └── validation
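Before launching a run, it can help to confirm that the data directory matches the layout above. This is a minimal sketch, not part of the repo: the helper `missing_entries` and the `EXPECTED` list are hypothetical, and the list covers only a few entries from the tree, so adjust it to what your run actually needs.

```python
from pathlib import Path

# A subset of the entries shown in the tree above (an assumption about
# which ones matter; extend or trim as needed).
EXPECTED = [
    "cifar-10-batches-py/batches.meta",
    "cifar-10-batches-py/data_batch_1",
    "cifar-10-batches-py/test_batch",
    "imagenet/train",
    "imagenet/validation",
]


def missing_entries(root, expected=EXPECTED):
    """Return the relative paths from `expected` that do not exist under `root`."""
    root = Path(root).expanduser()
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("~/data")
    if missing:
        print("Missing under ~/data:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected entries found under ~/data.")
```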
The main change is adding the server IPs. The following config file is at training-configs/cifar10-resnet50-2p3dn/2-p3dn-resnet50-cifar10-40G.json.
You need to change the "nodes" field in the config file (using the EC2 private IPs here).
E.g., if you have two instances, 172.31.31.15 and 172.31.29.187, and 172.31.29.187 is the localhost where the script is placed, then change nodes to ["localhost", "172.31.31.15"]:
{
"comments": "unlimited bandwidth",
"host_user": "ubuntu",
"host_user_dir": "/home/ubuntu",
"host_ssh_key": "~/.ssh/id_rsa",
"docker_user_dir": "/home/cluster",
"docker_user": "cluster",
"docker_ssh_port": 2022,
"docker_ssh_key": "./DockerEnv/ssh-keys/id_rsa",
"script_path": "~/distributed-training/test_scripts/pytorch_resnet50_cifar10.py",
"script_args": "--epochs 20",
"nodes": ["localhost", ""],
"nGPU": 8,
"eth": "ens5",
"bw_limit": "40Gbit",
"default_bw": "100Gbit",
"log_folder": "p3dn-ResNet50-CIFAR10"
}
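If you prefer not to edit the file by hand, the "nodes" field can be filled in with a few lines of Python. This is only a sketch under assumptions: `set_nodes` and `update_config_file` are hypothetical helpers, not scripts shipped with the repo, and they assume the config is plain JSON with the machine running the script listed first as "localhost".

```python
import json
from pathlib import Path


def set_nodes(config, remote_ips):
    """Return a copy of `config` whose "nodes" is localhost plus the remote IPs."""
    updated = dict(config)
    updated["nodes"] = ["localhost", *remote_ips]
    return updated


def update_config_file(path, remote_ips):
    """Rewrite the JSON config at `path` in place with the new "nodes" list."""
    path = Path(path)
    config = json.loads(path.read_text())
    path.write_text(json.dumps(set_nodes(config, remote_ips), indent=2) + "\n")


if __name__ == "__main__":
    example = {"nodes": ["localhost", ""], "nGPU": 8}
    print(set_nodes(example, ["172.31.31.15"])["nodes"])
```

With the two instances from the example above, the remote list would be ["172.31.31.15"], since 172.31.29.187 is the machine running the script.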
python3 batch_run_st.py
python3 docker_dt.py <config-file>
# e.g.
python3 docker_dt.py training-configs/cifar10-resnet50-2p3dn/2-p3dn-resnet50-cifar10-40G.json
python3 docker_mt.py <config-file> <debug-flag>
# e.g.
python3 docker_mt.py training-configs/mimic_config_template.json
- Note: logs will be saved into chaokun_logs/<sub-dir>, so the log folder is required.
located at example-script-output
Program logs will be saved into log_archives