
How to run Step 4, Fine-Tuning? #10

ntraft opened this issue Dec 4, 2024 · 1 comment

ntraft commented Dec 4, 2024

Hi, since the README doesn't give any details, could you provide some examples of how to run Step 4? I have it working, but I'm not getting any speedup from distributed training (it takes the same amount of time and does not seem to be utilizing all GPUs). I'm using the train.py script from the DeRy repo, but I'm not sure whether I should be using that or something else from MMClassification.

I'm running on a Slurm cluster, but just sticking to one node. My node has 8 Tesla V100s and 32 cores.

I'm bringing up a bash terminal on the node, then running this way:

LOCAL_RANK=0 RANK=0 WORLD_SIZE=1 MASTER_ADDR='127.0.0.1' MASTER_PORT=29500 PYTHONPATH="$PWD" python tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher pytorch --gpus 8

But I'm still only getting an ETA of 3+ days, the same as when I run non-distributed on a single GPU.

2024-12-04 16:34:19,448 - mmcls - INFO - Epoch [1][7700/10010]  lr: 9.999e-04, eta: 3 days, 6:47:18, time: 0.255, data_time: 0.052, memory: 7473, loss: 5.9310

When running this way, I notice that the config that gets printed to the console shows gpu_ids = range(0, 1).
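
For what it's worth, I'm checking utilization on the node while the job runs with standard tooling (nothing DeRy-specific), e.g.:

# refresh per-GPU utilization once a second during training
watch -n 1 nvidia-smi

which is how I can tell that not all of the GPUs are actually being used.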

I've also tried simply using the --gpus flag without the --launcher flag:

PYTHONPATH="$PWD" python tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --gpus 8

but then I get this error:

Traceback (most recent call last):
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 181, in <module>
    main()
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 169, in main
    train_model(
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcls/apis/train.py", line 164, in train_model
    runner.run(data_loaders, cfg.workflow)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/parallel/data_parallel.py", line 62, in train_step
    assert len(self.device_ids) == 1, \
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: MMDataParallel only supports single GPU training, if you need to train with multiple GPUs, please use MMDistributedDataParallel instead.

ntraft commented Dec 5, 2024

Okay, I believe I've figured it out.

First, by looking at mmcls/tools/train.py, I can see that your train.py is an older, simplified version of that script, and that --gpus has since been deprecated upstream. So the only option for multi-GPU training is to use one of the --launcher modes.
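
To spell that out, --launcher pytorch expects one process per GPU to be started by PyTorch's distributed launcher, so the multi-GPU version of my command above would look roughly like this (just a sketch; the exact launcher flags depend on your PyTorch version):

# one process per GPU, launched by torch.distributed.launch
PYTHONPATH="$PWD" python -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 \
    tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher pytorch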

Next, I hadn't even realized you also provide tools/dist_train.sh! I see now that this is what we should be using. I am now running this:

bash tools/dist_train.sh configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py 8
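
Since I'm on Slurm, this can also be dropped into a batch script instead of an interactive shell, something along these lines (a sketch for my single-node setup; partition/account directives omitted):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=32

# run from the directory the job was submitted from
cd "$SLURM_SUBMIT_DIR"
bash tools/dist_train.sh configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py 8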

I also had to change the --local_rank argument in tools/train.py to --local-rank, or else I got an "unrecognized arguments" error. I believe this is because newer versions of PyTorch's launcher pass --local-rank (with a hyphen) rather than --local_rank.

With this approach, I'm getting an ETA of 36 hours on 4 V100s and 26 hours on 8 V100s. With 32 CPU cores I'm somewhat bottlenecked by data loading; on a node with 48 cores and 8 AMD Vega 20 GPUs I can get the time down to ~16 hours.
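
In case it helps anyone else hitting the data-loading bottleneck: the relevant knob is workers_per_gpu in the config's data dict. If your train.py supports --cfg-options (the upstream mmcls script does; I haven't checked this older copy), it can also be overridden at launch time, assuming dist_train.sh passes extra arguments through:

# sketch: raise the number of dataloader workers per GPU when more CPU cores are available
bash tools/dist_train.sh configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py 8 \
    --cfg-options data.workers_per_gpu=6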
