
[checkpoint] Open Source #27

Merged
merged 4 commits into main from checkpoint_open_source_2
Apr 10, 2024

Conversation

MingjiHan99
Collaborator

In this PR, we open source our vescale.checkpoint.

vescale.checkpoint is a distributed LLM checkpointing system.

vescale.checkpoint offers simple and straightforward APIs
that enable users to load and save a distributed model (DModule) and optimizer (DistributedOptimizer) seamlessly,
abstracting away underlying details such as process rank and device mesh.
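The call pattern described above can be sketched roughly as follows. This is illustrative pseudocode only: the module path, function names, and the dict-based state layout are assumptions for illustration, not the published API.

```python
# Illustrative pseudocode -- names and argument shapes are assumed.
import vescale.checkpoint

# Save: hand over the DModule and DistributedOptimizer; process-rank and
# device-mesh bookkeeping is handled internally.
checkpoint_state = {"model": dmodule, "optimizer": dist_optimizer}
vescale.checkpoint.save("ckpt/step_1000", checkpoint_state)

# Load: the same state layout; each rank receives its own shards.
vescale.checkpoint.load("ckpt/step_1000", checkpoint_state)
```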

vescale.checkpoint supports load-time checkpoint resharding when varying the degrees of data, tensor, or pipeline (TODO) parallelism
for both veScale distributed model (DModule) and optimizer (DistributedOptimizer).
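To make "load-time resharding" concrete, here is a minimal, self-contained sketch of the idea for a 1-D evenly sharded parameter: shards saved under one parallel degree are reassembled and re-sliced for a different degree at load time. All names here are illustrative; this is not vescale.checkpoint's internal code.

```python
# Minimal sketch of load-time resharding for a 1-D sharded parameter.
# Illustrative only -- not the vescale.checkpoint internals.

def shard_bounds(global_len, num_shards, rank):
    """Even contiguous sharding: return the [start, end) range owned by `rank`."""
    base, rem = divmod(global_len, num_shards)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return start, end

def reshard(saved_shards, new_world_size):
    """Reassemble shards saved under one parallel degree, then re-slice
    them for a different degree, as a load-time resharding pass would."""
    flat = [x for shard in saved_shards for x in shard]
    return [
        flat[slice(*shard_bounds(len(flat), new_world_size, r))]
        for r in range(new_world_size)
    ]

# A parameter of length 10 saved by 2 data-parallel ranks...
saved = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
# ...loaded back onto 4 ranks.
print(reshard(saved, 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

In a real system the "flatten" step never materializes the full tensor on one rank; each rank reads only the saved byte ranges that overlap its new shard, but the index arithmetic is the same.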

vescale.checkpoint incorporates fast checkpointing and various I/O optimization techniques,
enhancing I/O efficiency during large language model training.
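One common technique in this family is asynchronous checkpointing: the trainer takes a cheap in-memory snapshot, then a background thread persists it, overlapping slow device I/O with subsequent training steps. The sketch below illustrates the general pattern, not vescale.checkpoint's implementation; all names are hypothetical.

```python
# Sketch of asynchronous checkpointing: snapshot fast, write in the background.
import copy
import os
import pickle
import tempfile
import threading

def async_save(state, path):
    """Snapshot `state` now, persist it off the critical path."""
    snapshot = copy.deepcopy(state)  # fast in-memory copy
    def _write():
        with open(path, "wb") as f:  # slow disk I/O runs concurrently with training
            pickle.dump(snapshot, f)
    t = threading.Thread(target=_write)
    t.start()
    return t  # caller joins before starting the next save

state = {"step": 1000, "weights": [0.1, 0.2]}
path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
handle = async_save(state, path)
state["step"] = 1001  # training continues while the write is in flight
handle.join()
with open(path, "rb") as f:
    print(pickle.load(f)["step"])  # 1000: the snapshot, not the live state
```

The key property is that the checkpoint captures a consistent snapshot taken at save time, even though the live training state keeps advancing while the bytes are written.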

vescale.checkpoint will be a part of OmniStore project, a new open source project coming soon.

Credit to veScale Checkpoint Team

This endeavor would not have been possible without the contributions of the veScale Checkpoint team, which includes but is not limited to:
@shanesyy-1992 @MingjiHan99 @AHEADer @raywan-110 @michael4RD @lazychao @leochen-ai

Also thanks to the great guidance and leadership of: @pengyanghua @eric-haibin-lin @liwenchangbdbz @Meteorix

Credit to veScale Team

We would like to sincerely acknowledge the assistance of and collaboration with the veScale team, which includes but is not limited to:
@leonardo0lyj @JsBlueCat @MackZackA @Vremold @jc-bytedance @lichen225

Credit to PyTorch Distributed Checkpoint (DCP) Team

We would like to sincerely acknowledge the assistance of and collaboration
with the PyTorch Distributed Checkpoint (DCP) team,
which includes but is not limited to:
@wz337 @kumpera @fegin @LucasLLC

Collaborator

@JsBlueCat JsBlueCat left a comment


I have some questions about protobuf: do we need to push that codegen file?

python/vescale/checkpoint/api/base_checkpointer.py (outdated; resolved)
python/vescale/checkpoint/api/base_checkpointer.py (outdated; resolved)
@MingjiHan99
Collaborator Author

MingjiHan99 commented Apr 10, 2024

I have some questions about protobuf: do we need to push that codegen file?

Based on our discussion, we will keep the protobuf files for now. Otherwise, users would have to generate the code on their own.

@MingjiHan99 MingjiHan99 force-pushed the checkpoint_open_source_2 branch from 8d962d6 to 42d8c22 on April 10, 2024 16:51
@shanesyy-1992

Could you help clean up the fast checkpoint code? There seems to be some code that isn't used.

@MingjiHan99
Collaborator Author

Could you help clean up the fast checkpoint code? There seems to be some code that isn't used.

Sure. I will remove DistributedTorchLoader and RemappingTorchLoader.

@liwenchangbdbz liwenchangbdbz added the enhancement New feature or request label Apr 10, 2024
@MingjiHan99 MingjiHan99 merged commit bbf2860 into main Apr 10, 2024
1 check passed
@MingjiHan99 MingjiHan99 deleted the checkpoint_open_source_2 branch April 10, 2024 19:38