
[Example] PyTorch distributed training with minGPT #4464

Open · wants to merge 11 commits into master

Conversation


@Michaelvll commented Dec 12, 2024

This PR adds a more modern distributed training example.
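For context, here is a minimal sketch of the kind of SkyPilot task YAML such an example typically uses. This is an illustration, not the PR's actual `examples/distributed-pytorch/train.yaml`: the task name, repo clone, and setup steps are hypothetical, while the `SKYPILOT_*` variables are SkyPilot's documented per-node environment variables.

```yaml
# Hypothetical sketch of a multi-node DDP task; details may differ from the PR.
name: minGPT-ddp

resources:
  accelerators: L4:2          # 2 GPUs per node

num_nodes: 2                  # 2 nodes -> 4 GPUs total

setup: |
  # Hypothetical setup: fetch the minGPT DDP example and its dependencies.
  git clone --depth 1 https://github.com/pytorch/examples || true
  pip install torch

run: |
  cd examples/mingpt
  # SkyPilot injects the node IPs; the first node acts as the rendezvous master.
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=$SKYPILOT_NODE_RANK \
    main.py
```

Presumably the "more modern" part is launching with `torchrun` driven by SkyPilot's environment variables, rather than hand-wiring per-node ranks and addresses.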

TODOs:

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh


@romilbhardwaj left a comment


Awesome, thanks @Michaelvll! Left some minor nit comments.

examples/distributed-pytorch/README.md: 2 resolved review comments (outdated)

The following command spawns 2 nodes with 2 L4 GPUs each.

`sky launch -c train.yaml`


Missing cluster name? Also might be nice to put in a code block

Suggested change
`sky launch -c train.yaml`
```
sky launch -c train train.yaml
```
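For reference (not stated in the thread, but how the `sky launch` CLI works): `-c`/`--cluster` takes the cluster name, and the task YAML is a positional argument, so the original command would have used `train.yaml` as the cluster name with no task file at all. With the fix applied:

```
# -c names the cluster; the positional argument is the task YAML.
sky launch -c train train.yaml

# Follow-up commands then refer to the cluster by name:
sky logs train    # stream the training job's output
sky down train    # tear the cluster down when finished
```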

examples/distributed-pytorch/README.md: 5 resolved review comments (outdated)

For example, the following command will spawn 4 nodes with 4 L4 GPUs each.

`sky launch -c train.yaml --num-nodes 2 --gpus L4:2 --cpus 8+`


change to num nodes 4 and L4:4

Suggested change
`sky launch -c train.yaml --num-nodes 2 --gpus L4:2 --cpus 8+`
```
sky launch -c train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
```
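One detail worth calling out (my reading of the CLI, not something asserted in this review): `--num-nodes`, `--gpus`, and `--cpus` override the corresponding fields of the task YAML at launch time, so the same `train.yaml` can be scaled up or down without editing it:

```
# Same YAML, different shapes at launch time (hypothetical cluster names):
sky launch -c train-2x2 train.yaml --num-nodes 2 --gpus L4:2
sky launch -c train-4x4 train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
```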

@Michaelvll (Author) commented

Thanks @romilbhardwaj for the review! Updated README


@romilbhardwaj left a comment


Thanks @Michaelvll!
