ProcessGroupCCL Destructor Not Correctly Called in PT 1.10 #35

Zha0q1 · 2022-02-04T21:42:06Z

Hi torch-ccl community,

I was trying to run the following code with PT 1.10 and the ccl backend:

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch_ccl  # importing this registers the "ccl" backend
from torch.nn.parallel import DistributedDataParallel as DDP

# Rank, world size, and master address are read from the environment.
dist.init_process_group(backend="ccl")


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10, bias=False)
        self.net2 = nn.Linear(10, 10)

    def forward(self, x):
        return self.net2(self.net1(x))


model = ToyModel()
# The destructor problem only shows up with find_unused_parameters=True.
ddp = DDP(model, find_unused_parameters=True)

inp = torch.randn(1, 10)
out = ddp(inp)

When find_unused_parameters=True, the destructor of ProcessGroupCCL was not called. With find_unused_parameters=False there was no issue. In most cases this would be harmless, because the destructor is empty anyway: https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.cpp#L109-L111. However, I am trying to build an extension that requires me to release resources in ~ProcessGroupCCL(). If ~ProcessGroupCCL() is not called, the process hangs on exit. The issue does not exist in PT 1.9, so it looks like an object lifetime management issue in PyTorch.

Would appreciate any insights and help!


chengjunlu · 2022-02-07T05:31:28Z

Hi @Zha0q1,

I cannot reproduce the issue of "the destructor of ProcessGroupCCL was not correctly called".
In my environment, ~ProcessGroupCCL is always called at the end of the Python process lifetime, for both find_unused_parameters=True and find_unused_parameters=False.

There may be some requirements on the order of the exit cleanup in your code.

Please be aware that the destructor of ProcessGroup is called when the references held by Python objects are cleaned up at the end of the Python process lifetime.

Zha0q1 · 2022-02-07T05:42:13Z

Hi @chengjunlu, thanks for your reply! Could you share the hardware and software stack you used? This issue only occurred with PT 1.10 for me; PT 1.9 worked just fine. I was using an AWS P4d instance with 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-cpu-py38-ubuntu20.04-sagemaker as the base image.

chengjunlu · 2022-02-07T05:54:01Z

I am using the public pytorch v1.10.0-rc3 tag for the 1.10 release.

Could you double-check whether the issue can be reproduced without your changes?

Zha0q1 · 2022-02-07T05:57:26Z

Hi, I used the v1.10.0 tag and built PyTorch from source. And yes, the issue is still reproducible even with the https://github.com/intel/torch-ccl/tree/ccl_torch1.10 branch. I only added a std::cout in the destructor to show whether it was called.

chengjunlu · 2022-02-07T06:17:59Z

Let's try a few more experiments:

1. Add some debug information to the destructor of ProcessGroup.
2. Can you show the ABI of PyTorch on your platform, i.e. the value of torch._C._GLIBCXX_USE_CXX11_ABI?
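
To compare environments, both sides could run a small script like the one below. This is only a sketch: `describe_torch_build` is a hypothetical helper name, and it reads nothing beyond `torch.__version__` and the `torch._C._GLIBCXX_USE_CXX11_ABI` flag requested above, returning `None` values when PyTorch is not installed.

```python
import importlib.util


def describe_torch_build():
    """Collect the PyTorch version and C++ ABI flag for a bug report."""
    info = {"torch_version": None, "cxx11_abi": None}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment
    import torch

    info["torch_version"] = str(torch.__version__)
    # Same flag as requested above; fall back to None if it is missing.
    info["cxx11_abi"] = getattr(torch._C, "_GLIBCXX_USE_CXX11_ABI", None)
    return info


if __name__ == "__main__":
    print(describe_torch_build())
```

Posting the printed dictionary from both machines makes it easy to spot a version or ABI mismatch between the two builds.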

Zha0q1 · 2022-02-07T06:36:52Z

Do you mean the Pytorch ProcessGroup?
it shows True
One more question: did you try the same script I used?

chengjunlu · 2022-02-07T07:11:16Z

> Do you mean the Pytorch ProcessGroup?

Yes.

> it shows True
> One more question: did you try the same script I used?

Yes.

Zha0q1 · 2022-02-07T07:13:50Z

Sure I will do more experiments on Monday. Do you have any insights as to what might be the issue?

chengjunlu · 2022-02-07T07:28:00Z

> Sure I will do more experiments on Monday. Do you have any insights as to what might be the issue?

It is a bizarre issue. I am not confident about the root cause.
The hard part is that I cannot reproduce your issue on my platform.

Here are some points we can look into:

The process group in PT 1.10 is managed by an intrusive pointer. A known drawback of reference-counted smart pointers in C++ is that a reference cycle prevents the objects involved from being destroyed.
The reducer attribute of DistributedDataParallel and the Reducer keep a reference to the process group (in the test, the ProcessGroupCCL object). Another attribute, _default_pg, also keeps a reference to it.
But neither of them keeps a cross reference to the other, so we need to investigate further.

Another thing we can check is pybind itself; it is less likely, but who knows.
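
The cross-reference concern above can be illustrated without PyTorch. The following is a minimal pure-Python sketch, not the torch-ccl or PyTorch code: `FakeProcessGroup` and `FakeReducer` are hypothetical stand-ins, the cycle is created on purpose, and the behavior shown assumes CPython's reference counting. It shows that an object caught in a reference cycle is not finalized when the last outside name is dropped, only once the cycle collector runs.

```python
import gc
import weakref


class FakeProcessGroup:
    """Hypothetical stand-in for a process group that owns resources."""

    def __init__(self):
        self.reducer = None


class FakeReducer:
    """Hypothetical stand-in for an object that holds the process group."""

    def __init__(self, pg):
        self.pg = pg


def run_demo(make_cycle):
    """Return (finalized_after_del, finalized_after_gc) for one scenario."""
    events = []
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cycle collector from running on its own
    try:
        pg = FakeProcessGroup()
        reducer = FakeReducer(pg)  # reducer -> pg
        if make_cycle:
            pg.reducer = reducer  # pg -> reducer closes the cycle
        # The callback plays the role of the C++ destructor.
        weakref.finalize(pg, events.append, "destroyed")
        del pg, reducer  # drop every outside reference
        after_del = bool(events)
        gc.collect()  # collect any remaining cycle
        after_gc = bool(events)
    finally:
        if was_enabled:
            gc.enable()
    return after_del, after_gc


if __name__ == "__main__":
    print("no cycle:", run_demo(make_cycle=False))
    print("cycle:   ", run_demo(make_cycle=True))
```

Unlike Python, plain C++ reference counting has no cycle collector, so a cycle of that kind would keep the destructor from ever running. Whether such a cycle actually exists in PT 1.10 is exactly what remains to be investigated.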
