ProcessGroupCCL Destructor Not Correctly Called in PT 1.10 #35
Comments
Hi @Zha0q1, I cannot reproduce the issue of "the destructor of ProcessGroupCCL was not correctly called". There may be some requirements on the order of cleanup at exit in your code. Please be aware that the destructor of ProcessGroup is called when the Python object referring to it is cleaned up at the end of the Python interpreter's life.
Hi @chengjunlu, thanks for your reply! Would you share the hardware and software stack you used? This issue only occurred with PT 1.10 for me -- PT 1.9 worked just fine. I was using an AWS P4d instance with
I am using the public PyTorch v1.10.0-rc3 tag for the 1.10 release. Would you help double-check whether this issue can be reproduced without your changes?
Hi, I used the v1.10.0 tag and built PyTorch from source. And yes, even with the https://github.com/intel/torch-ccl/tree/ccl_torch1.10 branch the issue is still reproducible. I only added a std::cout in the destructor to show whether it was called or not.
Let's try some more experiments:
Sure, I will do more experiments on Monday. Do you have any insights as to what might be the issue?
It is a bizarre issue. I don't have strong confidence about the root cause. Here are some points we can look into: The process group in PT 1.10 is managed by an intrusive pointer. A known drawback of reference-counted smart pointers in C++ is that cyclic references between them prevent the objects from being destroyed. Another aspect we can check is pybind itself; less likely, but who knows.
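To illustrate the cyclic-reference point, here is a minimal sketch. It uses std::shared_ptr as a stand-in for the intrusive pointer (an assumption made for illustration; both are reference-counted), and the names Group and Holder are hypothetical. It does not claim that such a cycle actually exists in PyTorch or torch-ccl; it only shows how one would keep a destructor from running.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Counts how many times ~Group has run.
static int g_group_destroyed = 0;

struct Holder;  // forward declaration

// Hypothetical stand-in for a process group object.
struct Group {
    std::shared_ptr<Holder> holder;  // strong reference to the holder
    ~Group() { ++g_group_destroyed; }
};

// Hypothetical object that refers back to the group.
struct Holder {
    std::shared_ptr<Group> strong_group;  // owning back-reference (can form a cycle)
    std::weak_ptr<Group> weak_group;      // non-owning back-reference
};

// Builds a strong cycle. Returns the destructor count observed after the
// local handles are gone, and again after the cycle is broken by hand.
std::pair<int, int> strong_cycle_counts() {
    g_group_destroyed = 0;
    std::weak_ptr<Group> observer;  // lets us reach the object after the scope ends
    {
        auto group = std::make_shared<Group>();
        auto holder = std::make_shared<Holder>();
        group->holder = holder;
        holder->strong_group = group;  // cycle: group -> holder -> group
        assert(group.use_count() == 2);
        observer = group;
    }
    int before = g_group_destroyed;  // destructor has not run: the cycle keeps both alive
    if (auto g = observer.lock()) {
        g->holder.reset();  // break the cycle explicitly
    }
    return {before, g_group_destroyed};
}

// Same layout, but the back-reference is weak, so no cycle is formed.
int weak_backref_count() {
    g_group_destroyed = 0;
    {
        auto group = std::make_shared<Group>();
        auto holder = std::make_shared<Holder>();
        group->holder = holder;
        holder->weak_group = group;  // does not increase the strong count
        assert(group.use_count() == 1);
    }
    return g_group_destroyed;
}
```

With the strong back-reference the destructor runs only after the cycle is broken by hand; with the weak back-reference it runs as soon as the last outside handle goes away.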
Hi torch-ccl community,
I was trying to run the following code with PT 1.10 + ccl backend:
When find_unused_parameters=True, the destructor of ProcessGroupCCL was not correctly called. When find_unused_parameters=False there was no issue. This should have been fine in most cases because the destructor is empty anyway: https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.cpp#L109-L111. However, I am trying to build an extension which requires me to release resources in ~ProcessGroupCCL(). If ~ProcessGroupCCL() is not called, the process will hang on exit. This issue also does not exist in PT 1.9. It seems like an object life cycle management issue with PyTorch.

Would appreciate any insights and help!
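For context on the resource-release concern, one common way to avoid depending solely on a destructor is an explicit, idempotent shutdown method that the destructor also calls. The sketch below is only an illustration under assumptions: ResourceOwner, shutdown(), and the counter are hypothetical names, not part of ProcessGroupCCL or PyTorch, and the counter increment stands in for the real cleanup.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical owner of external resources (not the real ProcessGroupCCL).
class ResourceOwner {
public:
    explicit ResourceOwner(int* release_counter) : counter_(release_counter) {}

    // Releases resources at most once; safe to call explicitly before exit
    // and again from the destructor.
    void shutdown() {
        if (released_.exchange(true)) return;  // already released
        ++(*counter_);                         // placeholder for the real cleanup
    }

    ~ResourceOwner() { shutdown(); }

private:
    std::atomic<bool> released_{false};
    int* counter_;
};

// Exercises both paths: explicit shutdown followed by destruction,
// and destruction alone. Returns the total number of releases.
int demo_release_count() {
    int releases = 0;
    {
        ResourceOwner owner(&releases);
        owner.shutdown();
        assert(releases == 1);
    }  // destructor runs here but does not release a second time
    {
        ResourceOwner owner(&releases);
    }  // destructor performs the release
    return releases;
}
```

If the destructor is never reached, as in the case described above, an explicit call to such a method before interpreter exit would still release the resources.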