Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add example for fsdp #7061

Merged
merged 3 commits into from
May 20, 2024
Merged

add example for fsdp #7061

merged 3 commits into from
May 20, 2024

Conversation

JackCaoG
Copy link
Collaborator

@gkroiz FYI, for some reason I can't add you as a reviewer.

@JackCaoG JackCaoG requested review from alanwaketan and Liyang90 May 14, 2024 20:20
@alanwaketan
Copy link
Collaborator

Given FSDPv1 is more or less deprecated, I don't think we should add it as an example. We should just focus on SPMD instead.

@gkroiz
Copy link
Contributor

gkroiz commented May 14, 2024

Thanks for adding this @JackCaoG, I have a few questions.

  1. Do we want to add a simple demonstration of checkpointing? I know there was some recent questions around this, namely Loading large sharded model weights #6989

  2. How does this file's purpose differ to the FSDP example code in https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist_fsdp_with_ckpt.py

@JackCaoG
Copy link
Collaborator Author

Thanks for adding this @JackCaoG, I have a few questions.

  1. Do we want to add a simple demonstration of checkpointing? I know there was some recent questions around this, namely Loading large sharded model weights #6989
  2. How does this file's purpose differ to the FSDP example code in master/test/test_train_mp_mnist_fsdp_with_ckpt.py

Old example is kind of long, I want to make it very clear what's the minimal codes to run a feature. If user want to do a deeper dive we should directly them to the more completed example.

@JackCaoG
Copy link
Collaborator Author

Given FSDPv1 is more or less deprecated, I don't think we should add it as an example. We should just focus on SPMD instead.

I see your point, but there are users that want to use FSDP v1. It is similar that we don't really want user to use native DDP but I provide the example anyway. I think we can update our README to specify what do we recommened, in that section we can make it clear that we recommend xla_ddp and fsdp_v2.

@JackCaoG
Copy link
Collaborator Author

@gkroiz for checkponting and gradident checkpointing, I think we should provide a similar simple examples. Let me know if you are interested in contributing.

@gkroiz
Copy link
Contributor

gkroiz commented May 14, 2024

@gkroiz for checkponting and gradident checkpointing, I think we should provide a similar simple examples. Let me know if you are interested in contributing.

Happy to review, probably don't have time in the near future.

})
else:
raise Exception(f"Invalid auto-wrap policy: {auto_wrap_policy}")
fsdp_wrap = lambda m: FSDP(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the lambda functions needed? One case I can think of would be using it as a callable argument of auto_wrapper_callable. But since it is not the case, can we use FSDP() directly on the model for simplicity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, I just copied it from the example. Let me clean it up

Copy link
Contributor

@gkroiz gkroiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JackCaoG JackCaoG force-pushed the JackCaoG/fsdp_example branch from da2bb0d to 3684982 Compare May 20, 2024 22:03
@JackCaoG JackCaoG merged commit 961c22a into master May 20, 2024
3 checks passed
@JackCaoG JackCaoG deleted the JackCaoG/fsdp_example branch May 20, 2024 22:06
qihqi pushed a commit that referenced this pull request May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants