[Example] add an example of finetuning nanoGPT in 4D using veScale (#16)
This PR adds an example of finetuning a GPT2 model using veScale API in 4D parallelism: Data, Tensor, Sequence, and Optimizer Parallelism. There are near-zero changes in the model code. In addition, this PR also improves factory methods for DTensors and simplifies DModule APIs.
# Finetune nanoGPT on Shakespeare in 4D Parallelism via veScale

In this example, we demonstrate how to finetune a pre-trained GPT2 model using veScale. The example is built upon @karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT/) project. With near-zero changes in the model code and minimal changes in the training code, we can finetune a pre-trained GPT2 on the Shakespeare dataset and utilize multiple GPUs via 4D parallelism: Data, Tensor, Sequence, and Optimizer Parallelism. The correctness of our implementation is verified by comparing both the training loss and the validation loss with the single-GPU result produced by nanoGPT. The difference is negligible when the computation is conducted in fp32, and around 1% in bf16.
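The correctness check described above amounts to a simple relative-tolerance comparison of the two loss curves. A minimal sketch of that check (the loss values below are hypothetical, not taken from the repository):

```python
import numpy as np

def max_relative_diff(loss_a, loss_b):
    """Maximum element-wise relative difference between two loss curves."""
    a = np.asarray(loss_a, dtype=np.float64)
    b = np.asarray(loss_b, dtype=np.float64)
    return float(np.max(np.abs(a - b) / np.abs(b)))

# Hypothetical per-iteration validation losses:
# 4D-parallel run vs. single-GPU nanoGPT baseline.
parallel_loss = [4.21, 3.87, 3.52, 3.30]
baseline_loss = [4.20, 3.88, 3.53, 3.31]

diff = max_relative_diff(parallel_loss, baseline_loss)
assert diff < 0.01  # within the ~1% tolerance observed for bf16
```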
## Prerequisites

```
pip3 install datasets tiktoken
```
## Run

First, prepare the dataset:
```
cd data/shakespeare && python3 prepare.py
```
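For reference, nanoGPT's `prepare.py` tokenizes the raw text and dumps the token ids to `train.bin`/`val.bin` as flat `uint16` arrays that the training loop can memory-map. A simplified sketch of that flow, using raw byte values as a stand-in for tiktoken's GPT-2 BPE encoder and writing to a temporary directory:

```python
import os
import tempfile
import numpy as np

def prepare(text: str, out_dir: str, train_frac: float = 0.9) -> None:
    """Split text, encode it, and dump uint16 token files like nanoGPT's prepare.py."""
    n = int(len(text) * train_frac)
    splits = {"train.bin": text[:n], "val.bin": text[n:]}
    for name, chunk in splits.items():
        # Stand-in encoder: raw byte values. The real script uses tiktoken's GPT-2 BPE.
        ids = np.frombuffer(chunk.encode("utf-8"), dtype=np.uint8).astype(np.uint16)
        ids.tofile(os.path.join(out_dir, name))

out_dir = tempfile.mkdtemp()
prepare("To be, or not to be, that is the question.", out_dir)
train = np.fromfile(os.path.join(out_dir, "train.bin"), dtype=np.uint16)
val = np.fromfile(os.path.join(out_dir, "val.bin"), dtype=np.uint16)
```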
Then, to finetune on the Shakespeare dataset in a multi-GPU environment, run
```
torchrun --standalone --nproc_per_node={Number of GPUs} finetune_4D.py config/finetune_shakespeare.py --compile=False --dp_size={DP Size} --tp_size={TP Size}
```
where `DP Size` and `TP Size` denote the degrees of Data and Tensor Parallelism that suit your environment.
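The data-parallel and tensor-parallel degrees multiply to the total process count, so `{Number of GPUs}` should equal `{DP Size} * {TP Size}` (in Megatron-style setups, sequence parallelism typically shares the tensor-parallel ranks and optimizer parallelism shares the data-parallel ranks). A quick sanity check one could run before launching — a hypothetical helper, not part of the repository:

```python
def check_mesh(nproc: int, dp_size: int, tp_size: int) -> None:
    """Verify the 2D device mesh (DP x TP) covers exactly nproc GPUs."""
    if dp_size * tp_size != nproc:
        raise ValueError(
            f"dp_size ({dp_size}) * tp_size ({tp_size}) = {dp_size * tp_size}, "
            f"but nproc_per_node is {nproc}"
        )

check_mesh(nproc=8, dp_size=2, tp_size=4)  # valid: 2 * 4 == 8
```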
To produce the single-GPU baseline result, run
```
python3 base_train.py config/finetune_shakespeare.py --compile=False
```
## Loss Curves

Here are the training loss and validation loss curves for fp32 runs of 200 iterations:

![figure](./figures/nanoGPT_finetune_4d_val_loss_fp32_200.jpg)

![figure](./figures/nanoGPT_finetune_4d_train_loss_fp32_200.jpg)
## Caveats

1. `torch.compile` support in veScale is still experimental, so we run the single-GPU baseline with the `compile` flag off.

2. veScale does not focus on fp16, as fp16 is rarely used in industry today.

3. Checkpointing is not yet supported.