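I have the following code in my training loop:

from threading import Thread
from safetensors.torch import save_file

# Only rank 0 writes the checkpoint, in a background thread.
if rank == 0:
    t = Thread(
        target=save_file,
        args=(model_sd, f"{cfg.model_dir}/model_{step + 1}.safetensors"),
        daemon=True,
    )
    t.start()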
This saves the checkpoint to disk using safetensors. However, I notice that it blocks the training loop, even though the thread should be running in the background.
When I switch the code to use torch.save, there's no issue. What should I do?
This is hard to say without being able to reproduce anything.
This sounds like a GIL issue, although that is hard to prove. The Rust code needs to hold the GIL while it works with Python values.
This should be easy to verify by using multiprocessing instead of threading, though.
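A complementary check that stays within threading is to count how often the main thread gets to run while the background call is in flight. The sketch below is only an illustration: `main_thread_ticks` is a hypothetical helper, not part of safetensors or torch, and it assumes the background call takes long enough (well over the tick interval) for the count to be meaningful.

```python
import threading
import time


def main_thread_ticks(background_fn, *args, tick=0.001):
    """Run background_fn(*args) in a thread and count main-thread wakeups.

    Hypothetical diagnostic helper. The main thread sleeps for `tick`
    seconds per iteration; after each sleep it must reacquire the GIL.
    If the worker holds the GIL for the whole call, the main thread
    stalls and the returned count stays close to zero.
    """
    worker = threading.Thread(target=background_fn, args=args, daemon=True)
    ticks = 0
    worker.start()
    while worker.is_alive():
        ticks += 1
        time.sleep(tick)  # sleeping releases the GIL so the worker can run
    worker.join()
    return ticks
```

For example, `main_thread_ticks(time.sleep, 0.2)` yields many ticks because `time.sleep` releases the GIL. If passing `save_file` with the same arguments as in the issue returns a count near zero for a save that takes seconds, that would be consistent with the GIL being held during the write.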
If it is indeed a GIL issue, I don't have any great ideas for avoiding it while using threads.
I'm going to test a version that releases the GIL on the long-running calls and see what happens. Note, though, that this change shifts responsibility to the user not to mess with the tensors incorrectly while they are being saved.
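If the GIL were released during the save, one way a caller could meet that responsibility is to hand the background thread a private copy of the state, so later in-place updates in the training loop cannot reach the data being written. This is a minimal sketch under that assumption: `save_snapshot_in_background` and its `copy_fn` parameter are hypothetical names, and `save_fn` stands in for whichever save function is used.

```python
import copy
import threading


def save_snapshot_in_background(state, save_fn, path, copy_fn=copy.deepcopy):
    """Copy `state` on the calling thread, then save the copy in a thread.

    Hypothetical helper. The copy is taken before the worker starts, so
    the worker never sees mutations made to `state` afterwards.
    """
    snapshot = {name: copy_fn(value) for name, value in state.items()}
    worker = threading.Thread(target=save_fn, args=(snapshot, path), daemon=True)
    worker.start()
    return worker  # the caller can join() this before exiting
```

The copy itself runs on the training thread, so it still costs some time per checkpoint. For a torch state dict, `copy_fn` could plausibly be something like `lambda t: t.detach().cpu().clone()`, though whether that is cheap enough depends on the model size.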