Issue Loading FSDP wrapped module using FULL_STATE_DICT type. #141

hbikki · 2023-05-03T21:13:37Z

🐛 Describe the bug

Hello, I am training a pretrained Hugging Face model, "t5-small". Using the torchsnapshot examples from the documentation, I am able to save and load checkpoints with the LOCAL_STATE_DICT type, and I am also able to save the model checkpoint with FULL_STATE_DICT. However, when loading the FULL_STATE_DICT checkpoint, I hit the issue below.

Versions:
pytorch = 2.0.0+cu117
torchx-nightly>=2023.3.15
torchsnapshot=0.1.0
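
To pin down the exact installed versions for a report like this, a minimal sketch using only the standard library is shown here. The helper name `installed_versions` is hypothetical, and the distribution names passed to it ("torch", "torchx-nightly", "torchsnapshot") are assumptions based on the list above; they may differ in your environment.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this name.
            versions[name] = None
    return versions


# Distribution names below are assumptions taken from the version list above.
for name, version in installed_versions(
    ["torch", "torchx-nightly", "torchsnapshot"]
).items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running this on each host and pasting the output avoids ambiguity between a requirement specifier (such as `>=`) and the version that is actually installed.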

Host Details:
The training below was tested on a single node with NPROC_PER_NODE set to 8.

Code:

Model training code:

def train() -> None:
    init_process_group(backend="nccl")
    torch.cuda.empty_cache()
    torch.cuda.set_device(local_rank())
    model = load_model("t5-small")

    fsdp_model = FSDP(
        model,
        auto_wrap_policy=functools.partial(
            transformer_auto_wrap_policy, transformer_layer_cls={T5Block}
        ),
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
        device_id=local_rank(),
    )
    # ... training loop ...
    # ... save_checkpoint ...

stateDictType = FULL_STATE_DICT
Related saving/loading code:

    def save_checkpoint() -> None:
        with FSDP.state_dict_type(checkpoint.model, self.stateDictType):
            Snapshot.take(path=str(save_dir), app_state=app_state)

    def load_checkpoint() -> None:
        with FSDP.state_dict_type(checkpoint.model, self.stateDictType):
            Snapshot(path=str(load_dir)).restore(app_state=app_state)

Error stack trace:
https://pastebin.com/ih9qSbwR

.snapshot_metadata for the model on local rank:
https://pastebin.com/t6grkKyX

Does anyone know how to resolve this? Thanks!


kiukchung · 2023-05-04T23:28:08Z

@hbikki, can you edit the markdown so that the stack trace displays in a code block? Could you also include the full stack trace (if it's too long, feel free to use Pastebin and provide a link here)?

yifuwang · 2023-05-04T23:49:00Z

Hey @hbikki, could you please share the snapshot metadata in question? It's the .snapshot_metadata file under the snapshot folder/prefix in question.
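
For a snapshot stored on a local filesystem, a minimal sketch for dumping that file so it can be pasted into the issue might look like the following. The helper name `read_snapshot_metadata` is hypothetical, and the sketch assumes the snapshot path is a plain local directory (not a cloud storage prefix) and that the file is readable as UTF-8 text.

```python
from pathlib import Path


def read_snapshot_metadata(snapshot_dir: str) -> str:
    """Return the raw text of the .snapshot_metadata file in snapshot_dir.

    Assumes snapshot_dir is a local directory; remote prefixes would need
    the corresponding storage client instead.
    """
    metadata_path = Path(snapshot_dir) / ".snapshot_metadata"
    if not metadata_path.is_file():
        raise FileNotFoundError(f"No .snapshot_metadata found under {snapshot_dir}")
    return metadata_path.read_text(encoding="utf-8")
```

Calling `print(read_snapshot_metadata(str(save_dir)))` with the same directory that was passed to `Snapshot.take` should print the manifest text to share.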

hbikki · 2023-05-05T15:29:52Z

Hello, I have updated the issue with the requested data. Thanks.
