
How to run Step 4, Fine-Tuning? #10

ntraft opened this issue Dec 4, 2024 · 1 comment

ntraft commented Dec 4, 2024

Hi, since the README doesn't give any details, could you provide some examples of how to run Step 4? I have it working, but I'm not getting any speedup from distributed training (it takes the same amount of time and does not seem to be utilizing all GPUs). I'm using the train.py script from the DeRy repo, but I'm not sure whether I should be using that or something else from MMClassification.

I'm running on a Slurm cluster, but just sticking to one node. My node has 8 Tesla V100s and 32 cores.

I'm bringing up a bash terminal on the node, then running this way:

LOCAL_RANK=0 RANK=0 WORLD_SIZE=1 MASTER_ADDR='127.0.0.1' MASTER_PORT=29500 PYTHONPATH="$PWD" python tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher pytorch --gpus 8

But I'm still only getting an ETA of 3+ days, the same as when I run non-distributed on a single GPU.

2024-12-04 16:34:19,448 - mmcls - INFO - Epoch [1][7700/10010]  lr: 9.999e-04, eta: 3 days, 6:47:18, time: 0.255, data_time: 0.052, memory: 7473, loss: 5.9310

When running this way, I notice that the config that gets printed to the console shows gpu_ids = range(0, 1).
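
For what it's worth, I'm checking utilization on the node while the job runs with standard tooling (nothing DeRy-specific), e.g.:

# refresh per-GPU utilization once a second during training
watch -n 1 nvidia-smi

which is how I can tell that not all of the GPUs are actually being used.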

I've also tried simply using the --gpus flag without the --launcher flag:

PYTHONPATH="$PWD" python tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --gpus 8

but then I get this error:

Traceback (most recent call last):
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 181, in <module>
    main()
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 169, in main
    train_model(
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcls/apis/train.py", line 164, in train_model
    runner.run(data_loaders, cfg.workflow)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/parallel/data_parallel.py", line 62, in train_step
    assert len(self.device_ids) == 1, \
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: MMDataParallel only supports single GPU training, if you need to train with multiple GPUs, please use MMDistributedDataParallel instead.

ntraft commented Dec 5, 2024

Okay, I believe I've figured it out.

First, by looking at mmcls/tools/train.py, I can see that your train.py is an older, simplified version of that script, and that --gpus has since been deprecated upstream. So the only option for multi-GPU training is to use one of the --launcher modes.
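
To spell that out, --launcher pytorch expects one process per GPU to be started by PyTorch's distributed launcher, so the multi-GPU version of my command above would look roughly like this (just a sketch; the exact launcher flags depend on your PyTorch version):

# one process per GPU, launched by torch.distributed.launch
PYTHONPATH="$PWD" python -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 \
    tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher pytorch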

Next, I hadn't even realized you also provide tools/dist_train.sh! I see now that this is what we should be using. I am now running this:

bash tools/dist_train.sh configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py 8
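
Since I'm on Slurm, this can also be dropped into a batch script instead of an interactive shell, something along these lines (a sketch for my single-node setup; partition/account directives omitted):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=32

# run from the directory the job was submitted from
cd "$SLURM_SUBMIT_DIR"
bash tools/dist_train.sh configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py 8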

I also had to change the --local_rank argument in tools/train.py to --local-rank, or else I got an "unrecognized arguments" error. I believe this is because newer versions of PyTorch's launcher pass --local-rank (with a hyphen) rather than --local_rank.

With this approach, I'm getting an ETA of 36 hours on 4 V100s and 26 hours on 8 V100s. With 32 CPU cores I'm somewhat bottlenecked by data loading; on a node with 48 cores and 8 AMD Vega 20 GPUs I can get the time down to ~16 hours.
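
In case it helps anyone else hitting the data-loading bottleneck: the relevant knob is workers_per_gpu in the config's data dict. If your train.py supports --cfg-options (the upstream mmcls script does; I haven't checked this older copy), it can also be overridden at launch time, assuming dist_train.sh passes extra arguments through:

# sketch: raise the number of dataloader workers per GPU when more CPU cores are available
bash tools/dist_train.sh configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py 8 \
    --cfg-options data.workers_per_gpu=6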
