
Whether to support distributed training #4

Open
dqshuai opened this issue Apr 9, 2021 · 17 comments

Comments

@dqshuai

dqshuai commented Apr 9, 2021

Hello, thanks for your project. I would like to know whether distributed training is supported, and what I should do to make it support distributed training.

@daodaofr
Owner

daodaofr commented Apr 9, 2021

Hi, we didn't try training with multiple GPUs, but MMDetection supports distributed training; please refer to https://github.com/daodaofr/AlignPS/blob/master/tools/dist_train.sh

@dqshuai
Author

dqshuai commented Apr 9, 2021

Thanks for your reply. I am now trying distributed training with the command "./tools/dist_train.sh configs/fcos/prw_dcn_base_focal_labelnorm_sub_ldcn_fg15_wd7-4.py 8 --launcher pytorch --no-validate". It trains normally, but I don't know whether it will affect the final performance. Normally, distributed training should not hurt performance; is that correct?

@daodaofr
Owner

daodaofr commented Apr 9, 2021

Normally you can still get fair performance; you may need to adjust the batch size and learning rate to get the best results.
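
As a rough reference, here is a minimal sketch of the common linear-scaling heuristic for the learning rate; the baseline values (1 GPU, 4 images per GPU, lr=0.001) and the optimizer fields are assumptions for illustration, not the repository's released settings.

# Hedged sketch: linear LR scaling for multi-GPU training (assumed baselines).
num_gpus = 8
samples_per_gpu = 4
base_batch, base_lr = 4, 0.001                                    # assumed single-GPU setting
scaled_lr = base_lr * (num_gpus * samples_per_gpu) / base_batch   # -> 0.008
optimizer = dict(type='SGD', lr=scaled_lr, momentum=0.9)          # placeholder fields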

@dqshuai
Author

dqshuai commented Apr 9, 2021

Hi, I just finished training with multiple GPUs on the PRW dataset. Compared to the results in the paper, mAP is 2% lower, but rank-1 is 1% higher. When I checked the config, I found that the bbox_head is 'FCOSReidHeadFocalOimSub', which does not use the triplet loss.

I want to know whether the difference in results is related to this; I did not find an ablation study on this in your paper.
Thanks!

@daodaofr
Owner

daodaofr commented Apr 9, 2021

Thanks for your results; I think they are normal. In my experience, the triplet loss has only a very slight influence on PRW, less than 1%. Different environments (mmcv, pytorch, cuda) can also bring a 1%-2% performance difference. PRW is smaller than CUHK-SYSU, so it is normal to see some fluctuation.

@dqshuai
Author

dqshuai commented Apr 13, 2021

When I train the model on CUHK-SYSU with multiple GPUs, mAP is 89.15 and rank-1 is 89.79 without adjusting any parameters.
After that, I tried the following:
(1) adjusting lr from 0.001 to 0.01;
(2) using 'model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)' to check whether BN not being synced was the problem.
But these measures did not work.
Can you give me some suggestions? Thanks!
My environment:
mmcv-full==1.1.5
pytorch==1.5.1
cuda==10.2
But I don't think the environment can account for a 4% mAP difference. :)
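
Besides the convert_sync_batchnorm call in (2) above, MMDetection configs usually allow requesting synchronized BN through the backbone's norm_cfg. A minimal hedged sketch follows (standard mmdet convention, not taken from this repository's configs, and in this thread SyncBN reportedly reduced results anyway):

# Hedged sketch: enabling SyncBN via the config instead of converting the model.
model = dict(
    backbone=dict(
        norm_cfg=dict(type='SyncBN', requires_grad=True),
    ),
)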

@daodaofr
Owner

I am sorry, but I haven't tried distributed training, so I cannot give practical suggestions on that.
If you want to reproduce the results, please use a single GPU.

@dqshuai
Author

dqshuai commented Apr 18, 2021

Thanks for your reply. I received a notification email in which you suggested using all_gather to update the lookup_table with global features. The example you provided had some problems because the feature size differs across ranks. I made some modifications and then adjusted the learning rate; the mAP now reaches 92.91. Why can't I see that reply in the issue, and are there any other details I should pay attention to in order to get a higher mAP?

@daodaofr
Owner

I also noticed the feature-size inconsistency issue, which makes the network stop training, so I deleted that reply.
It would be nice if you could share an example of your modified code, to help others with distributed training.
Maybe more epochs are needed with multiple GPUs.
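
If training longer, the usual MMDetection knobs look roughly like the sketch below; the values are placeholders, and the exact field name depends on the mmdet version (total_epochs in older releases, runner.max_epochs in newer ones):

# Hedged sketch: extending the schedule for multi-GPU runs (placeholder values).
total_epochs = 36                                        # older mmdet versions
# runner = dict(type='EpochBasedRunner', max_epochs=36)  # newer mmdet versions
lr_config = dict(policy='step', step=[24, 33])           # shift the LR drops to match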

@dqshuai
Author

dqshuai commented Apr 18, 2021

I also noticed the feature-size inconsistency issue, which makes the network stop training, so I deleted that reply.
It would be nice if you could share an example of your modified code, to help others with distributed training.
Maybe more epochs are needed with multiple GPUs.

My current implementation is a bit ugly. :)

import torch
import torch.distributed as dist
from torch.autograd import Function

# NOTE: get_dist_info() here is a small helper (not shown) that returns
# (rank, world_size, is_dist); adapt it to your own distributed setup.

@torch.no_grad()
def all_gather_tensor(x, gpu=None, save_memory=False):
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        return x
    if not save_memory:
        # All-gather the features in parallel:
        # costs more GPU memory but less time.
        x_gather = [torch.empty_like(x) for _ in range(world_size)]
        dist.all_gather(x_gather, x, async_op=False)
    else:
        # Broadcast the features rank by rank:
        # costs more time but less GPU memory.
        container = torch.empty_like(x).cuda(gpu)
        x_gather = []
        for k in range(world_size):
            container.data.copy_(x)
            print("gathering features from rank no.{}".format(k))
            dist.broadcast(container, k)
            x_gather.append(container.cpu())
    # Return a list with one tensor per rank (no concatenation here).
    return x_gather


def undefined_l_gather(features, pid_labels):
    # Pad the per-rank features/labels to a fixed size so that all_gather works
    # even though each rank holds a different number of samples.
    resized_num = 10000
    pos_num = min(features.size(0), resized_num)
    if features.size(0) > resized_num:
        print(f'{features.size(0)} out of {resized_num}')
    resized_features = torch.empty((resized_num, features.size(1))).to(features.device)
    resized_features[:pos_num, :] = features[:pos_num, :]
    resized_pid_labels = torch.empty((resized_num,)).to(pid_labels.device)
    resized_pid_labels[:pos_num] = pid_labels[:pos_num]
    # Gather the valid lengths as well, so the padding can be stripped afterwards.
    pos_num = torch.tensor([pos_num]).to(features.device)
    all_pos_num = all_gather_tensor(pos_num)
    all_features = all_gather_tensor(resized_features)
    all_pid_labels = all_gather_tensor(resized_pid_labels)
    gather_features = []
    gather_pid_labels = []
    for index, p_num in enumerate(all_pos_num):
        gather_features.append(all_features[index][:p_num, :])
        gather_pid_labels.append(all_pid_labels[index][:p_num])
    gather_features = torch.cat(gather_features, dim=0)
    gather_pid_labels = torch.cat(gather_pid_labels, dim=0)
    return gather_features, gather_pid_labels


class LabeledMatching(Function):
    @staticmethod
    def forward(ctx, features, pid_labels, lookup_table, momentum=0.5):
        # The lookup_table can't be saved with ctx.save_for_backward(), as we
        # would modify a variable with the same memory address in backward().
        # Save the gathered (global) features/labels so that every rank updates
        # the lookup table with features from all GPUs in backward().
        gather_features, gather_pid_labels = undefined_l_gather(features, pid_labels)
        ctx.save_for_backward(gather_features, gather_pid_labels)
        ctx.lookup_table = lookup_table
        ctx.momentum = momentum
        # Scores and positive features are still computed from the local batch.
        scores = features.mm(lookup_table.t())
        pos_feats = lookup_table.clone().detach()
        pos_idx = pid_labels > 0
        pos_pids = pid_labels[pos_idx]
        pos_feats = pos_feats[pos_pids]
        return scores, pos_feats, pos_pids

    @staticmethod
    def backward(ctx, grad_output, grad_feat, grad_pids):
        features, pid_labels = ctx.saved_tensors
        pid_labels = pid_labels.long()
        lookup_table = ctx.lookup_table
        momentum = ctx.momentum
        grad_feats = None
        if ctx.needs_input_grad[0]:
            grad_feats = grad_output.mm(lookup_table)
        # Update the lookup table with the gathered features; this is a running
        # average, not standard backpropagation with gradients.
        for indx, label in enumerate(pid_labels):
            if label >= 0:
                lookup_table[label] = (
                    momentum * lookup_table[label] + (1 - momentum) * features[indx]
                )
                # lookup_table[label] /= lookup_table[label].norm()
        return grad_feats, None, None, None
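
For context, a hedged usage sketch of how a custom Function like this is typically invoked; the shapes and names below are dummies for illustration, not code from this repository, and it assumes a distributed run with the helpers above in scope:

import torch

# Hedged usage sketch with dummy shapes (illustrative only).
N, C, num_ids = 16, 256, 1000                   # placeholder sizes
features = torch.nn.functional.normalize(torch.randn(N, C), dim=1)
pid_labels = torch.randint(-1, num_ids, (N,))   # -1 marks unlabeled/background boxes
lookup_table = torch.zeros(num_ids, C)          # feature memory, one row per identity

scores, pos_feats, pos_pids = LabeledMatching.apply(features, pid_labels, lookup_table, 0.5)
# scores feed the OIM-style classification loss; only the lookup-table update in
# backward() uses the gathered (global) features, while gradients still come from
# the local batch.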

@daodaofr
Owner

Great! Thanks :)

@anDoer

anDoer commented Apr 22, 2021

I think all_gather_tensor should return a list when is_dist is False:

@torch.no_grad()
def all_gather_tensor(x, gpu=None, save_memory=False):
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        return [x]
    # remaining code here...

@hh23333

hh23333 commented Apr 26, 2021

When I train the model on CUHK-SYSU with multiple GPUs, mAP is 89.15 and rank-1 is 89.79 without adjusting any parameters.
After that, I tried the following:
(1) adjusting lr from 0.001 to 0.01;
(2) using 'model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)' to check whether BN not being synced was the problem.
But these measures did not work.
Can you give me some suggestions? Thanks!
My environment:
mmcv-full==1.1.5
pytorch==1.5.1
cuda==10.2
But I don't think the environment can account for a 4% mAP difference. :)

@dqshuai Thanks for sharing your modified distributed-training code. I have several questions about the two points you mentioned above. How many GPUs did you use and what was the batch size per GPU when you got 92.91 mAP? What is the empirical ratio between the learning rates for single-GPU and multi-GPU training? Did using sync_batchnorm affect the final results? Thanks!

@dqshuai
Author

dqshuai commented Apr 26, 2021

When I train the model on CUHK-SYSU with multiple GPUs, mAP is 89.15 and rank-1 is 89.79 without adjusting any parameters.
After that, I tried the following:
(1) adjusting lr from 0.001 to 0.01;
(2) using 'model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)' to check whether BN not being synced was the problem.
But these measures did not work.
Can you give me some suggestions? Thanks!
My environment:
mmcv-full==1.1.5
pytorch==1.5.1
cuda==10.2
But I don't think the environment can account for a 4% mAP difference. :)

@dqshuai Thanks for sharing your modified distributed-training code. I have several questions about the two points you mentioned above. How many GPUs did you use and what was the batch size per GPU when you got 92.91 mAP? What is the empirical ratio between the learning rates for single-GPU and multi-GPU training? Did using sync_batchnorm affect the final results? Thanks!

(1) I used 8 GPUs with a batch size of 4 per GPU. When I set lr=0.05, I got 92.91 mAP. At first I thought the empirical learning rate would be roughly single_gpu_lr (0.001) * num_of_gpus, but I did not get a better result with lr=0.008 or 0.01.
(2) Using sync_batchnorm reduced the result, and I don't know why.
If you have any other findings, please share them with me. I haven't fully reproduced the paper's results with multiple GPUs. Thanks!
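
For quick reference, the arithmetic behind the two learning rates discussed here, with all numbers taken from the comments above:

# 8 GPUs x 4 images/GPU, single-GPU baseline lr = 0.001 (numbers from this thread).
single_gpu_lr = 0.001
num_gpus = 8
linear_scaled_lr = single_gpu_lr * num_gpus   # = 0.008, reportedly did not help
empirical_lr = 0.05                           # reportedly gave 92.91 mAP on CUHK-SYSU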

@hh23333

hh23333 commented Apr 27, 2021

Got it, Thanks!

@qixiong-wang

Hi, I tried the distributed implementation from @dqshuai, but the performance got worse.
I noticed that there is a toolkit in mmdet/models/dense_heads/oim_utils.py which contains distributed utilities. Was it implemented by you, @daodaofr? Can I use it to fix the feature-size inconsistency across ranks?

@daodaofr
Owner

Hi, I tried the distributed implementation from @dqshuai, but the performance got worse.
I noticed that there is a toolkit in mmdet/models/dense_heads/oim_utils.py which contains distributed utilities. Was it implemented by you, @daodaofr? Can I use it to fix the feature-size inconsistency across ranks?

This was just an attempt of mine; it didn't work out.
