
Whether to support distributed training #4

Open
dqshuai opened this issue Apr 9, 2021 · 17 comments

Comments

@dqshuai

dqshuai commented Apr 9, 2021

Hello, thanks for your project. I would like to know whether distributed training is supported, and what I should do to make it support distributed training.

@daodaofr
Owner

daodaofr commented Apr 9, 2021

Hi, we didn't try training with multiple GPUs, but MMDetection supports distributed training; please refer to https://github.com/daodaofr/AlignPS/blob/master/tools/dist_train.sh

@dqshuai
Author

dqshuai commented Apr 9, 2021

Thanks for your reply. I am now trying distributed training with the command "./tools/dist_train.sh configs/fcos/prw_dcn_base_focal_labelnorm_sub_ldcn_fg15_wd7-4.py 8 --launcher pytorch --no-validate". It trains normally, but I don't know whether it will affect the final performance. Normally, distributed training should not hurt performance; is that correct?

@daodaofr
Owner

daodaofr commented Apr 9, 2021

Normally you can still get fair performance; you may need to adjust the batch size and learning rate to get the best results.
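
As a rough reference, here is a minimal sketch of the common linear-scaling heuristic for the learning rate; the baseline values (1 GPU, 4 images per GPU, lr=0.001) and the optimizer fields are assumptions for illustration, not the repository's released settings.

# Hedged sketch: linear LR scaling for multi-GPU training (assumed baselines).
num_gpus = 8
samples_per_gpu = 4
base_batch, base_lr = 4, 0.001                                    # assumed single-GPU setting
scaled_lr = base_lr * (num_gpus * samples_per_gpu) / base_batch   # -> 0.008
optimizer = dict(type='SGD', lr=scaled_lr, momentum=0.9)          # placeholder fields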

@dqshuai
Author

dqshuai commented Apr 9, 2021

Hi, I just finished training with multiple GPUs on the PRW dataset. Compared to the results in the paper, mAP is 2% lower, but rank-1 is 1% higher. When I checked the config, I found that the bbox_head is 'FCOSReidHeadFocalOimSub', which does not use the triplet loss.

I want to know whether the difference in results is related to this; I did not find an ablation study on this in your paper.
Thanks!

@daodaofr
Owner

daodaofr commented Apr 9, 2021

Thanks for your results; I think they are normal. In my experience, the triplet loss has only a very slight influence on PRW, less than 1%. Different environments (mmcv, pytorch, cuda) can also bring a 1%-2% performance difference. PRW is smaller than CUHK-SYSU, so it is normal to see some fluctuation.

@dqshuai
Author

dqshuai commented Apr 13, 2021

When I train the model on CUHK-SYSU with multiple GPUs, mAP is 89.15 and rank-1 is 89.79 without adjusting any parameters.
After that, I tried the following:
(1) adjusting lr from 0.001 to 0.01;
(2) using 'model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)' to check whether BN not being synced was the problem.
But these measures did not work.
Can you give me some suggestions? Thanks!
My environment:
mmcv-full==1.1.5
pytorch==1.5.1
cuda==10.2
But I don't think the environment can account for a 4% mAP difference. :)
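
Besides the convert_sync_batchnorm call in (2) above, MMDetection configs usually allow requesting synchronized BN through the backbone's norm_cfg. A minimal hedged sketch follows (standard mmdet convention, not taken from this repository's configs, and in this thread SyncBN reportedly reduced results anyway):

# Hedged sketch: enabling SyncBN via the config instead of converting the model.
model = dict(
    backbone=dict(
        norm_cfg=dict(type='SyncBN', requires_grad=True),
    ),
)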

@daodaofr
Owner

I am sorry, but I haven't tried distributed training, so I cannot give practical suggestions on that.
If you want to reproduce the results, please use a single GPU.

@dqshuai
Author

dqshuai commented Apr 18, 2021

Thanks for your reply. I received a notification email in which you suggested using all_gather to update the lookup_table with global features. The example you provided had some problems because the feature size differs across ranks. I made some modifications and then adjusted the learning rate; the mAP now reaches 92.91. Why can't I see that reply in the issue, and are there any other details I should pay attention to in order to get a higher mAP?

@daodaofr
Owner

I also noticed the feature-size inconsistency issue, which makes the network stop training, so I deleted that reply.
It would be nice if you could share an example of your modified code, to help others with distributed training.
Maybe more epochs are needed with multiple GPUs.
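
If training longer, the usual MMDetection knobs look roughly like the sketch below; the values are placeholders, and the exact field name depends on the mmdet version (total_epochs in older releases, runner.max_epochs in newer ones):

# Hedged sketch: extending the schedule for multi-GPU runs (placeholder values).
total_epochs = 36                                        # older mmdet versions
# runner = dict(type='EpochBasedRunner', max_epochs=36)  # newer mmdet versions
lr_config = dict(policy='step', step=[24, 33])           # shift the LR drops to match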

@dqshuai
Author

dqshuai commented Apr 18, 2021

I also noticed the feature-size inconsistency issue, which makes the network stop training, so I deleted that reply.
It would be nice if you could share an example of your modified code, to help others with distributed training.
Maybe more epochs are needed with multiple GPUs.

My current implementation is a bit ugly. :)

import torch
import torch.distributed as dist
from torch.autograd import Function

# NOTE: get_dist_info() here is a small helper (not shown) that returns
# (rank, world_size, is_dist); adapt it to your own distributed setup.

@torch.no_grad()
def all_gather_tensor(x, gpu=None, save_memory=False):
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        return x
    if not save_memory:
        # All-gather the features in parallel:
        # costs more GPU memory but less time.
        x_gather = [torch.empty_like(x) for _ in range(world_size)]
        dist.all_gather(x_gather, x, async_op=False)
    else:
        # Broadcast the features rank by rank:
        # costs more time but less GPU memory.
        container = torch.empty_like(x).cuda(gpu)
        x_gather = []
        for k in range(world_size):
            container.data.copy_(x)
            print("gathering features from rank no.{}".format(k))
            dist.broadcast(container, k)
            x_gather.append(container.cpu())
    # Return a list with one tensor per rank (no concatenation here).
    return x_gather


def undefined_l_gather(features, pid_labels):
    # Pad the per-rank features/labels to a fixed size so that all_gather works
    # even though each rank holds a different number of samples.
    resized_num = 10000
    pos_num = min(features.size(0), resized_num)
    if features.size(0) > resized_num:
        print(f'{features.size(0)} out of {resized_num}')
    resized_features = torch.empty((resized_num, features.size(1))).to(features.device)
    resized_features[:pos_num, :] = features[:pos_num, :]
    resized_pid_labels = torch.empty((resized_num,)).to(pid_labels.device)
    resized_pid_labels[:pos_num] = pid_labels[:pos_num]
    # Gather the valid lengths as well, so the padding can be stripped afterwards.
    pos_num = torch.tensor([pos_num]).to(features.device)
    all_pos_num = all_gather_tensor(pos_num)
    all_features = all_gather_tensor(resized_features)
    all_pid_labels = all_gather_tensor(resized_pid_labels)
    gather_features = []
    gather_pid_labels = []
    for index, p_num in enumerate(all_pos_num):
        gather_features.append(all_features[index][:p_num, :])
        gather_pid_labels.append(all_pid_labels[index][:p_num])
    gather_features = torch.cat(gather_features, dim=0)
    gather_pid_labels = torch.cat(gather_pid_labels, dim=0)
    return gather_features, gather_pid_labels


class LabeledMatching(Function):
    @staticmethod
    def forward(ctx, features, pid_labels, lookup_table, momentum=0.5):
        # The lookup_table can't be saved with ctx.save_for_backward(), as we
        # would modify a variable with the same memory address in backward().
        # Save the gathered (global) features/labels so that every rank updates
        # the lookup table with features from all GPUs in backward().
        gather_features, gather_pid_labels = undefined_l_gather(features, pid_labels)
        ctx.save_for_backward(gather_features, gather_pid_labels)
        ctx.lookup_table = lookup_table
        ctx.momentum = momentum
        # Scores and positive features are still computed from the local batch.
        scores = features.mm(lookup_table.t())
        pos_feats = lookup_table.clone().detach()
        pos_idx = pid_labels > 0
        pos_pids = pid_labels[pos_idx]
        pos_feats = pos_feats[pos_pids]
        return scores, pos_feats, pos_pids

    @staticmethod
    def backward(ctx, grad_output, grad_feat, grad_pids):
        features, pid_labels = ctx.saved_tensors
        pid_labels = pid_labels.long()
        lookup_table = ctx.lookup_table
        momentum = ctx.momentum
        grad_feats = None
        if ctx.needs_input_grad[0]:
            grad_feats = grad_output.mm(lookup_table)
        # Update the lookup table with the gathered features; this is a running
        # average, not standard backpropagation with gradients.
        for indx, label in enumerate(pid_labels):
            if label >= 0:
                lookup_table[label] = (
                    momentum * lookup_table[label] + (1 - momentum) * features[indx]
                )
                # lookup_table[label] /= lookup_table[label].norm()
        return grad_feats, None, None, None
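
For context, a hedged usage sketch of how a custom Function like this is typically invoked; the shapes and names below are dummies for illustration, not code from this repository, and it assumes a distributed run with the helpers above in scope:

import torch

# Hedged usage sketch with dummy shapes (illustrative only).
N, C, num_ids = 16, 256, 1000                   # placeholder sizes
features = torch.nn.functional.normalize(torch.randn(N, C), dim=1)
pid_labels = torch.randint(-1, num_ids, (N,))   # -1 marks unlabeled/background boxes
lookup_table = torch.zeros(num_ids, C)          # feature memory, one row per identity

scores, pos_feats, pos_pids = LabeledMatching.apply(features, pid_labels, lookup_table, 0.5)
# scores feed the OIM-style classification loss; only the lookup-table update in
# backward() uses the gathered (global) features, while gradients still come from
# the local batch.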

@daodaofr
Owner

Great! Thanks :)

@anDoer

anDoer commented Apr 22, 2021

I think all_gather_tensor should return a list when is_dist is False:

@torch.no_grad()
def all_gather_tensor(x, gpu=None, save_memory=False):
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        return [x]
    # remaining code here...

@hh23333

hh23333 commented Apr 26, 2021

When I train the model on CUHK-SYSU with multiple GPUs, mAP is 89.15 and rank-1 is 89.79 without adjusting any parameters.
After that, I tried the following:
(1) adjusting lr from 0.001 to 0.01;
(2) using 'model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)' to check whether BN not being synced was the problem.
But these measures did not work.
Can you give me some suggestions? Thanks!
My environment:
mmcv-full==1.1.5
pytorch==1.5.1
cuda==10.2
But I don't think the environment can account for a 4% mAP difference. :)

@dqshuai Thanks for sharing your modified distributed-training code. I have several questions about the two points you mentioned above. How many GPUs did you use and what was the batch size per GPU when you got 92.91 mAP? What is the empirical ratio between the learning rates for single-GPU and multi-GPU training? Did using sync_batchnorm affect the final results? Thanks!

@dqshuai
Author

dqshuai commented Apr 26, 2021

When I train the model on CUHK-SYSU with multiple GPUs, mAP is 89.15 and rank-1 is 89.79 without adjusting any parameters.
After that, I tried the following:
(1) adjusting lr from 0.001 to 0.01;
(2) using 'model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)' to check whether BN not being synced was the problem.
But these measures did not work.
Can you give me some suggestions? Thanks!
My environment:
mmcv-full==1.1.5
pytorch==1.5.1
cuda==10.2
But I don't think the environment can account for a 4% mAP difference. :)

@dqshuai Thanks for sharing your modified distributed-training code. I have several questions about the two points you mentioned above. How many GPUs did you use and what was the batch size per GPU when you got 92.91 mAP? What is the empirical ratio between the learning rates for single-GPU and multi-GPU training? Did using sync_batchnorm affect the final results? Thanks!

(1) I used 8 GPUs with a batch size of 4 per GPU. When I set lr=0.05, I got 92.91 mAP. At first I thought the empirical learning rate would be roughly single_gpu_lr (0.001) * num_of_gpus, but I did not get a better result with lr=0.008 or 0.01.
(2) Using sync_batchnorm reduced the result, and I don't know why.
If you have any other findings, please share them with me. I haven't fully reproduced the paper's results with multiple GPUs. Thanks!
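
For quick reference, the arithmetic behind the two learning rates discussed here, with all numbers taken from the comments above:

# 8 GPUs x 4 images/GPU, single-GPU baseline lr = 0.001 (numbers from this thread).
single_gpu_lr = 0.001
num_gpus = 8
linear_scaled_lr = single_gpu_lr * num_gpus   # = 0.008, reportedly did not help
empirical_lr = 0.05                           # reportedly gave 92.91 mAP on CUHK-SYSU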

@hh23333

hh23333 commented Apr 27, 2021

Got it, Thanks!

@qixiong-wang

Hi, I tried the distributed implementation from @dqshuai, but the performance got worse.
I noticed that there is a toolkit in mmdet/models/dense_heads/oim_utils.py which contains distributed utilities. Was it implemented by you, @daodaofr? Can I use it to fix the feature-size inconsistency across ranks?

@daodaofr
Owner

Hi, I tried the distributed implementation from @dqshuai, but the performance got worse.
I noticed that there is a toolkit in mmdet/models/dense_heads/oim_utils.py which contains distributed utilities. Was it implemented by you, @daodaofr? Can I use it to fix the feature-size inconsistency across ranks?

This was just an attempt of mine; it didn't work out.
