Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CheckpointServer] use streaming transfers #36

Open
d4l3k opened this issue Dec 12, 2024 · 5 comments
Open

[CheckpointServer] use streaming transfers #36

d4l3k opened this issue Dec 12, 2024 · 5 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@d4l3k
Copy link
Member

d4l3k commented Dec 12, 2024

The CheckpointServer currently uses torch.save/torch.load which requires allocating the entire buffer into memory. We want to instead use streaming transfers so we minimize the amount of CPU memory required.

It would also be nice to add checksums to these transfers to avoid any data corruption from the network.

Relevant existing code: https://github.com/pytorch-labs/torchft/blob/main/torchft/checkpointing.py#L72

The algorithm is described at: https://gist.github.com/d4l3k/b68094d649a076384967788c9b0a5f08

Existing tests: https://github.com/pytorch-labs/torchft/blob/main/torchft/checkpointing_test.py#L15

Overview of work:

  1. copy over the write_state_dict and read_state_dict implementations into checkpointing.py
  2. replace existing torch.save/torch.load with those
  3. add unit tests for write_state_dict/read_state_dict for all the different possible types of torch tensors (different data types, strided, offsets, scalars, nested structures, etc)
  4. optionally add in checksum to read/write_state_dict that uses zlib.crc32
@d4l3k
Copy link
Member Author

d4l3k commented Dec 12, 2024

@Krishn1412 would you be interested in working on this?

@Krishn1412
Copy link

Sure @d4l3k, I'll work on this

@Krishn1412
Copy link

Hey @d4l3k , can you take a look at this? #54

@fegin
Copy link
Contributor

fegin commented Dec 20, 2024

@d4l3k It seems that write_state_dict and read_state_dict won't work with DTensor. Please correct me if I'm wrong.

@d4l3k
Copy link
Member Author

d4l3k commented Dec 20, 2024

@fegin yeah, that's a good point -- we should be able to support DTensor without too much trouble though

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

3 participants