Hi, since there are no details in the README, can you provide some examples of how to run Step 4? I have it working, but I can't get any improvement from distributed training (it takes the same amount of time and does not seem to be utilizing all GPUs). I'm using the `train.py` script from the DeRy repo, but I'm not even sure if I'm supposed to be using that or something else from MMClassification.

I'm running on a Slurm cluster, but just sticking to one node. My node has 8 Tesla V100s and 32 cores.

I'm bringing up a bash terminal on the node and launching the training from there, but I'm still only getting an ETA of 3+ days, the same thing I get when running on 1 GPU non-distributed. When running this way, I notice that the config that gets printed to the console shows `gpu_ids = range(0, 1)`.

I've also tried simply using the `--gpus` flag without the `--launcher` flag, but then I get this error:
```
Traceback (most recent call last):
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 181, in <module>
    main()
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 169, in main
    train_model(
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcls/apis/train.py", line 164, in train_model
    runner.run(data_loaders, cfg.workflow)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/parallel/data_parallel.py", line 62, in train_step
    assert len(self.device_ids) == 1, \
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: MMDataParallel only supports single GPU training, if you need to train with multiple GPUs, please use MMDistributedDataParallel instead.
```
First, looking at mmcls/tools/train.py, I can see that your `train.py` is an older/simplified version of that script, and that `--gpus` has been deprecated there. So the only option for multi-GPU training is to use one of the `--launcher` modes.
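To make that concrete, here is a minimal sketch of a single-node launch using `--launcher pytorch` (the config path is a placeholder, newer PyTorch versions use `torchrun` instead of `python -m torch.distributed.launch`, and `--launcher slurm` under `srun` would be the other option on a cluster like this):

```bash
# Sketch only: run tools/train.py on all 8 GPUs of one node via the PyTorch launcher.
# Replace the config path with the actual Step 4 config.
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --master_port=29500 \
    tools/train.py configs/your_step4_config.py \
    --launcher pytorch
```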
Next, I didn't even realize you also have `tools/train_dist.sh`! I see now that this is what we should be using. I am now running this:
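(A sketch of that invocation, assuming `tools/train_dist.sh` follows the usual OpenMMLab convention of taking the config path and GPU count as positional arguments and wrapping `torch.distributed.launch` internally; the config path is a placeholder.)

```bash
# Assumed usage, following the standard OpenMMLab dist-train wrapper convention:
#   bash tools/train_dist.sh <CONFIG_FILE> <NUM_GPUS>
bash tools/train_dist.sh configs/your_step4_config.py 8
```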
I also had to change the argument name in the `train.py` script from `--local_rank` to `--local-rank`, or else I got an error about an unrecognized argument. This appears to be a change in newer versions of PyTorch, whose launcher now passes `--local-rank` rather than `--local_rank`.
With this approach, I'm getting an ETA of 36 hours on 4 V100s and 26 hours on 8 V100s. With 32 CPU cores I'm a bit bottlenecked by data loading; when running on a node with 48 cores and 8 AMD Vega 20 GPUs I'm able to get the time down to ~16 hours.
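If data loading stays the bottleneck, one knob worth trying is raising the number of dataloader workers per GPU from the command line. This is only a sketch under a few assumptions: that the DeRy `train.py` keeps MMClassification's `--cfg-options` flag (older versions call it `--options`), that the config uses the standard `data.workers_per_gpu` key, and that the wrapper script forwards extra arguments to `train.py`.

```bash
# Sketch: bump dataloader workers per GPU without editing the config file.
# With 8 GPUs and 32 cores, about 4 workers per GPU is the practical ceiling.
bash tools/train_dist.sh configs/your_step4_config.py 8 \
    --cfg-options data.workers_per_gpu=4
```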