The CheckpointServer currently uses torch.save/torch.load, which require allocating the entire checkpoint buffer in CPU memory. We want to use streaming transfers instead so that the amount of CPU memory required is minimized.

It would also be nice to add checksums to these transfers to guard against data corruption introduced by the network.
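On the checksum side, zlib.crc32 can be updated incrementally, so integrity checking does not require buffering the payload either. A minimal sketch of that idea (the helper name and signature are illustrative, not existing torchft code):

```python
import zlib
from typing import IO


def stream_with_crc(src: IO[bytes], dst: IO[bytes], chunk_size: int = 1024 * 1024) -> int:
    """Copy src to dst in fixed-size chunks and return the CRC32 of the copied bytes."""
    crc = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        crc = zlib.crc32(chunk, crc)  # running checksum; no full buffer required
        dst.write(chunk)
    return crc
```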
Relevant existing code: https://github.com/pytorch-labs/torchft/blob/main/torchft/checkpointing.py#L72
The algorithm is described at: https://gist.github.com/d4l3k/b68094d649a076384967788c9b0a5f08
Existing tests: https://github.com/pytorch-labs/torchft/blob/main/torchft/checkpointing_test.py#L15
Overview of work:

- Copy the write_state_dict and read_state_dict implementations into checkpointing.py (a sketch of what these could look like is below).
- Replace the existing torch.save/torch.load calls with them.
- Add unit tests for write_state_dict/read_state_dict covering the different kinds of torch tensors (different dtypes, strided tensors, storage offsets, scalars, nested structures, etc.).
- Optionally, add a checksum to write_state_dict/read_state_dict using zlib.crc32.
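For concreteness, here is a minimal sketch of what a streaming write_state_dict/read_state_dict pair with an optional per-tensor CRC32 could look like. It only assumes the general shape of the approach linked above (pickle the structure and tensor metadata, then stream the raw tensor bytes in chunks); the wire format, constants, and helper names are illustrative rather than the final implementation:

```python
import pickle
import struct
import zlib
from typing import IO

import torch

CHUNK_SIZE = 1024 * 1024  # stream tensor payloads in 1 MiB chunks


def write_state_dict(state_dict: object, f: IO[bytes]) -> None:
    """Serialize state_dict to f, buffering at most one chunk of tensor data at a time."""
    tensors: list[torch.Tensor] = []

    def _pack(obj: object) -> object:
        # Replace tensors with small placeholders so the pickled header stays tiny.
        if isinstance(obj, torch.Tensor):
            tensors.append(obj)
            return ("__tensor__", len(tensors) - 1)
        if isinstance(obj, dict):
            return {k: _pack(v) for k, v in obj.items()}
        if isinstance(obj, (list, tuple)):
            return type(obj)(_pack(v) for v in obj)
        return obj

    skeleton = _pack(state_dict)
    metas = [(t.dtype, tuple(t.shape)) for t in tensors]
    header = pickle.dumps((skeleton, metas))
    f.write(struct.pack("<Q", len(header)))
    f.write(header)

    for tensor in tensors:
        # contiguous() normalizes strides/storage offsets; reshape(-1) handles scalars.
        flat = tensor.detach().cpu().contiguous().reshape(-1)
        raw = flat.view(torch.uint8)  # reinterpret the buffer as raw bytes
        f.write(struct.pack("<Q", raw.numel()))
        crc = 0
        for start in range(0, raw.numel(), CHUNK_SIZE):
            chunk = raw[start : start + CHUNK_SIZE].numpy().tobytes()
            crc = zlib.crc32(chunk, crc)
            f.write(chunk)
        f.write(struct.pack("<I", crc))  # per-tensor checksum trailer


def read_state_dict(f: IO[bytes]) -> object:
    """Inverse of write_state_dict; assumes f.read(n) returns exactly n bytes."""
    (header_len,) = struct.unpack("<Q", f.read(8))
    skeleton, metas = pickle.loads(f.read(header_len))

    tensors = []
    for dtype, shape in metas:
        (nbytes,) = struct.unpack("<Q", f.read(8))
        out = torch.empty(nbytes, dtype=torch.uint8)
        crc = 0
        for start in range(0, nbytes, CHUNK_SIZE):
            chunk = f.read(min(CHUNK_SIZE, nbytes - start))
            crc = zlib.crc32(chunk, crc)
            out[start : start + len(chunk)] = torch.frombuffer(bytearray(chunk), dtype=torch.uint8)
        (expected,) = struct.unpack("<I", f.read(4))
        if crc != expected:
            raise RuntimeError("CRC32 mismatch while reading checkpoint stream")
        tensors.append(out.view(dtype).reshape(shape))

    def _unpack(obj: object) -> object:
        # Swap the placeholders back for the reconstructed tensors.
        if isinstance(obj, tuple) and len(obj) == 2 and obj[0] == "__tensor__":
            return tensors[obj[1]]
        if isinstance(obj, dict):
            return {k: _unpack(v) for k, v in obj.items()}
        if isinstance(obj, (list, tuple)):
            return type(obj)(_unpack(v) for v in obj)
        return obj

    return _unpack(skeleton)
```

A round-trip check along these lines would also cover part of the unit-test item (extended across dtypes, strided/offset tensors, scalars, and nesting):

```python
import io

sd = {"weight": torch.randn(4, 8, dtype=torch.bfloat16), "step": 3,
      "nested": {"bias": torch.zeros(8)}}
buf = io.BytesIO()
write_state_dict(sd, buf)
buf.seek(0)
out = read_state_dict(buf)
torch.testing.assert_close(out["weight"], sd["weight"])
assert out["step"] == 3
```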