@hbikki can you edit the markdown so that the stack trace displays in a code block? Could you also include the full stack trace? If it's too long, feel free to put it on Pastebin and link it here.
🐛 Describe the bug
Hello, I am training a pretrained Hugging Face model, "t5-small". Using the torchsnapshot examples provided in the documentation, I am able to save and load a checkpoint for the LOCAL_STATE_DICT type, and I am also able to save the model checkpoint for FULL_STATE_DICT. But when loading the FULL_STATE_DICT checkpoint, I am facing the issue below.
Versions:
pytorch = 2.0.0+cu117
torchx-nightly>=2023.3.15
torchsnapshot=0.1.0
Host Details:
The training below was tested on a single node with NPROC_PER_NODE set to 8.
Code:
Error stack trace:
https://pastebin.com/ih9qSbwR
.snapshot_metadata for the model on local rank:
https://pastebin.com/t6grkKyX
Does anyone know how to resolve this? Thanks!