Can't pickle `weakref` objects when saving checkpoints #212
Comments
@tzeitim thank you very much for writing in with this! This is the same problem I was running into when trying to move to pytorch 2.0.0, and I have not yet been able to figure it out either. You got farther than I did, so thank you, I appreciate it. Thanks for pointing out the ilia-kats fix for the pyro issue; I had not seen that yet, and it seems promising. I hadn't run into that optimizer saving issue yet, but I did run into another. I wonder what the deal is with this. It seems like this agrees with your fix, and I think I might try this out.
In my own development work, I am currently still using python 3.7 with pytorch < 2.0.0, for the reasons you pointed out. I will be working on these kinds of fixes on the
Here's some tracking for this stuff:
Hi @sjfleming - Thanks for your answer and the references. To be honest, I was just lucky to find that solution from ilia-kats; it was just a couple of days old when I found it. Regarding the source of this issue ... I don't know ... I've spent a lot of time trying to understand the chain of events that lead to a
I am glad I could help you save some time or identify potential solutions for the future. Keep up the great work!
I hope this is not too bleeding-edge but I have no other versioning options due to the combination of the GPU-nodes I have access to and software dependencies.
To do a quick recap
I am pulling `cellbender 0.3.0` from the branch `sf_dev_0.3.0_postreg_posterior_format_h5` since I had the same two issues raised in PR #193, following @sjfleming's suggestion to pull the latest `pytorch 2.0.0` from their dev branch for `pytorch`. Unfortunately, the issue with learning rate schedulers persisted even after pulling cellbender's 736d6.
After some digging, I managed to solve the pytorch-pyro scheduler issue by pulling pyro from this commit instead.
The main issue
`cellbender` was able to finish training, but it raised a new error when trying to write the final checkpoint (the only checkpoint attempted for this data set, as far as I am aware). The code in cellbender's `checkpoint.py` in its current form only reports that the attempt to write the checkpoint failed when it exits. I had to remove the `try` block in order to reveal the real issue. I dissected the individual lines that would trigger the error on their own.
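A swallowed exception like the one described above can usually be surfaced without deleting the `try` block, by logging the full traceback in the handler. The following is only a minimal sketch under stated assumptions: `save_checkpoint` and `Dummy` are hypothetical names, plain `pickle` stands in for the real serialization call, and none of this is cellbender's actual `checkpoint.py` code.

```python
import logging
import pickle
import weakref

logger = logging.getLogger("checkpoint_sketch")  # hypothetical logger name


class Dummy:
    """Hypothetical stand-in for an object that holds a weak reference."""

    def __init__(self):
        self.ref = weakref.ref(self)


def save_checkpoint(obj) -> bool:
    """Try to serialize obj; on failure, log the full traceback and return False."""
    try:
        pickle.dumps(obj)  # stand-in for the real save call
        return True
    except Exception:
        # logger.exception records the message plus the complete traceback,
        # so the underlying error stays visible instead of a generic failure note.
        logger.exception("Could not save checkpoint")
        return False


ok_plain = save_checkpoint({"epoch": 1})  # plain data serializes fine
ok_dummy = save_checkpoint(Dummy())       # fails, and the traceback is logged
```

With this pattern the program still exits gracefully, but the log shows which object and which line caused the failure.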
Interestingly, the model object can be saved by invoking its `.state_dict()` method. No `.state_dict()` exists for the scheduler object, though. To understand the problem a bit better, I omitted the `method` objects within the scheduler, and then `torch.save` could run! This strategy indicated that the `weakref` in `anneal_func` was to blame. I did a little bit of googling with this information, and I think that the `weakref` issue is very similar (maybe identical) to this pytorch issue #42376.
I have decided to open this issue and document it here, as it has gone beyond my ability to resolve for now.
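The attribute-by-attribute dissection described above can be approximated with a small diagnostic loop. This is a sketch only: `find_unpicklable`, `Target`, and `FakeScheduler` are hypothetical names, the fake scheduler is not the real pyro or torch scheduler, and plain `pickle` is used here on the assumption that it fails on weak references in the same way the save call did.

```python
import pickle
import weakref


def find_unpicklable(obj) -> dict:
    """Return {attribute name: error text} for attributes that fail to pickle."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as err:
            failures[name] = f"{type(err).__name__}: {err}"
    return failures


class Target:
    """Hypothetical object that the scheduler refers to weakly."""


class FakeScheduler:
    """Hypothetical stand-in, NOT the real pyro or torch scheduler."""

    def __init__(self, target):
        self.base_lr = 1e-3
        self.total_steps = 100
        self._target_ref = weakref.ref(target)  # weak references cannot be pickled


target = Target()
scheduler = FakeScheduler(target)
failures = find_unpicklable(scheduler)
print(failures)  # only the weak-reference attribute should be reported
```

Running a loop like this on the real scheduler object may point directly at the offending attribute, without commenting out lines one at a time.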
As a footnote, and just for the record, I wrote this non-fancy routine to eliminate the `method`s mentioned above.
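For readers who want to try the same workaround, a routine of this kind might look roughly like the sketch below. It is not the routine from this issue, which is not reproduced here. `strip_unpicklable` and `Holder` are hypothetical names, and the sketch assumes the scheduler state can be treated as a (possibly nested) dictionary.

```python
import pickle
import types
import weakref


def strip_unpicklable(state: dict) -> dict:
    """Return a copy of state without bound methods or weak references."""
    blocked = (types.MethodType, weakref.ReferenceType)
    cleaned = {}
    for key, value in state.items():
        if isinstance(value, blocked):
            continue  # drop entries that cannot be serialized
        if isinstance(value, dict):
            value = strip_unpicklable(value)  # recurse into nested dictionaries
        cleaned[key] = value
    return cleaned


class Holder:
    """Hypothetical object that owns a bound method."""

    def anneal(self, x):
        return x


holder = Holder()
state = {
    "lr": 0.1,
    "anneal_func": holder.anneal,  # bound method, removed by the routine
    "nested": {"ref": weakref.ref(holder), "step": 3},
}
cleaned = strip_unpicklable(state)
restored = pickle.loads(pickle.dumps(cleaned))
```

Anything removed this way is missing after loading, so entries such as `anneal_func` would have to be re-attached when the checkpoint is restored.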