ProcessGroupCCL Destructor Not Correctly Called in PT 1.10 #35

Open
Zha0q1 opened this issue Feb 4, 2022 · 9 comments

@Zha0q1

Zha0q1 commented Feb 4, 2022

Hi torch-ccl community,

I was trying to run the following code with PT 1.10 + ccl backend:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch_ccl  # importing torch_ccl registers the "ccl" backend

dist.init_process_group(backend="ccl")


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10, bias=False)
        self.net2 = nn.Linear(10, 10)

    def forward(self, x):
        return self.net2(self.net1(x))


model = ToyModel()
# The destructor problem only shows up with find_unused_parameters=True.
ddp = torch.nn.parallel.DistributedDataParallel(
    model,
    find_unused_parameters=True)

inp = torch.randn(1, 10)
out = ddp(inp)

When find_unused_parameters=True, the destructor of ProcessGroupCCL is not called. When find_unused_parameters=False, there is no issue. This would be fine in most cases because the destructor is empty anyway: https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.cpp#L109-L111. However, I am trying to build an extension that requires me to release resources in ~ProcessGroupCCL(). If ~ProcessGroupCCL() is not called, the process hangs on exit. The issue does not occur in PT 1.9, so it looks like an object lifetime management issue in PyTorch.

Would appreciate any insights and help!

@chengjunlu
Contributor

Hi @Zha0q1,

I cannot reproduce the issue of "the destructor of ProcessGroupCCL was not correctly called".
In my runs, ~ProcessGroupCCL is always called at the end of the Python process's lifetime, for both find_unused_parameters=True and find_unused_parameters=False.

Your code may have some requirements on the order of cleanup at exit.

Please be aware that the destructor of ProcessGroup is called when the reference held by the Python object is cleaned up at the end of the Python process's lifetime.
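
To make the timing point concrete, here is a minimal Python-only sketch. It does not use torch or torch_ccl; `FakeProcessGroup`, `make_group`, and `events` are hypothetical names introduced for illustration. It shows a finalizer firing only once the last reference to an object is dropped. The C++ destructor behind a wrapped object should run on a similar trigger, assuming nothing on the C++ side still holds a reference.

```python
import weakref

events = []  # records when the (hypothetical) finalizer runs


class FakeProcessGroup:
    """Stand-in for a wrapped C++ object; not a real torch class."""


def make_group():
    pg = FakeProcessGroup()
    # weakref.finalize runs the callback when pg is garbage collected
    # (or at interpreter exit if pg is still alive then).
    weakref.finalize(pg, events.append, "destroyed")
    return pg


pg = make_group()
holder = {"pg": pg}  # a second owner, e.g. a module-level default group

del pg  # one reference dropped; the holder still keeps the object alive
still_alive = list(events)

holder.clear()  # last reference dropped; CPython runs the finalizer now
after_release = list(events)

print(still_alive, after_release)
```

If any owner outlives the others, the finalizer is delayed until that owner lets go, which is one way a destructor can appear to be "not called" at the expected point.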

@Zha0q1
Author

Zha0q1 commented Feb 7, 2022

Hi @chengjunlu, thanks for your reply! Could you share the hardware and software stack you used? This issue only occurs with PT 1.10 for me; PT 1.9 works just fine. I was using an AWS P4d instance with 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-cpu-py38-ubuntu20.04-sagemaker as the base image.

@chengjunlu
Contributor

I am using the public PyTorch v1.10.0-rc3 tag for the 1.10 release.

Could you double-check whether the issue can be reproduced without your changes?

@Zha0q1
Author

Zha0q1 commented Feb 7, 2022

Hi, I used the v1.10.0 tag and built PyTorch from source. And yes, the issue is still reproducible even with the https://github.com/intel/torch-ccl/tree/ccl_torch1.10 branch. The only change I made was adding a std::cout in the destructor to show whether it was called.

@chengjunlu
Contributor

Let's try a few more experiments:

  1. Add some debug output to the destructor of ProcessGroup.
  2. Can you show the ABI of PyTorch on your platform, i.e. the value of torch._C._GLIBCXX_USE_CXX11_ABI?
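
For the second check, a small sketch along these lines prints the flag. `torch._C._GLIBCXX_USE_CXX11_ABI` is the attribute named above; the helper name `cxx11_abi_flag` is made up here, and returning None when torch cannot be imported is an assumption added so the snippet runs anywhere.

```python
def cxx11_abi_flag():
    """Return torch's C++11 ABI flag, or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    # True means PyTorch was built with _GLIBCXX_USE_CXX11_ABI=1.
    return torch._C._GLIBCXX_USE_CXX11_ABI


flag = cxx11_abi_flag()
print("CXX11 ABI flag:", flag)
```

An extension compiled with a different ABI setting than PyTorch itself can misbehave in hard-to-diagnose ways, which is presumably why this is worth ruling out first.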

@Zha0q1
Author

Zha0q1 commented Feb 7, 2022

  1. Do you mean the PyTorch ProcessGroup?
  2. It shows True.

One more question: did you try the same script I used?

@chengjunlu
Contributor

> 1. Do you mean the PyTorch ProcessGroup?

Yes.

> 2. It shows True.
> One more question: did you try the same script I used?

Yes.

@Zha0q1
Author

Zha0q1 commented Feb 7, 2022

Sure, I will do more experiments on Monday. Do you have any insights as to what might be the issue?

@chengjunlu
Contributor

chengjunlu commented Feb 7, 2022

> Sure, I will do more experiments on Monday. Do you have any insights as to what might be the issue?

It is a bizarre issue, and I am not confident about the root cause.
The hard part is that I cannot reproduce it on my platform.

Here are just some points we can look into:

The process group in PT 1.10 is managed by an intrusive pointer. A known drawback of reference-counted smart pointers in C++ is that a cross reference (a reference cycle) prevents the objects involved from being destroyed.
The reducer attribute of DistributedDataParallel, i.e. the Reducer, keeps a reference to the process group (in this test, the ProcessGroupCCL object). Another attribute, _default_pg, also keeps a reference to it.
But neither of them keeps a reference to the other, so there is no obvious cycle. We need to investigate this further.
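
The cross-reference concern can be illustrated with a Python-only analogy. The `Node` class is hypothetical and no torch code is involved. Two objects that point at each other are not freed by reference counting alone. Python has a cycle collector that eventually rescues them; reference-counted C++ smart pointers have no such collector, so a cycle there would keep the destructor from ever running.

```python
import gc
import weakref

events = []  # records which (hypothetical) objects were finalized


class Node:
    """Hypothetical object that may hold a reference to a peer."""

    def __init__(self, name):
        self.name = name
        self.peer = None


gc.disable()  # rely on reference counting only, as C++ smart pointers do
try:
    a, b = Node("a"), Node("b")
    a.peer, b.peer = b, a  # cross reference: a -> b -> a
    weakref.finalize(a, events.append, "a destroyed")
    weakref.finalize(b, events.append, "b destroyed")

    del a, b  # no names left, but the cycle keeps both refcounts above zero
    after_del = list(events)

    gc.collect()  # Python's cycle collector finds and frees the cycle
    after_collect = sorted(events)
finally:
    gc.enable()

print(after_del, after_collect)
```

This is only an analogy for the failure mode being suspected; it does not show that such a cycle actually exists between the Reducer and the process group.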

Another aspect we can check is pybind itself. That is less likely, but who knows.
