[Example] add an example of finetuning nanoGPT in 4D using veScale (#16)
This PR adds an example of finetuning a GPT2 model using the veScale API with
4D parallelism: Data, Tensor, Sequence, and Optimizer Parallelism. There
are near-zero changes to the model code. In addition, this PR also
improves the factory methods for DTensors and simplifies the DModule APIs.
lichen225 authored Mar 27, 2024
1 parent 9a937a3 commit ed4a792
Showing 31 changed files with 2,584 additions and 227 deletions.
4 changes: 3 additions & 1 deletion Dockerfile
@@ -40,4 +40,6 @@ RUN pip3 install --no-cache-dir packaging \
  && pip3 install --no-cache-dir optree \
  && pip3 install --no-cache-dir psutil \
  && pip3 install --no-cache-dir transformers==4.37.2 \
- && pip3 install --no-cache-dir accelerate
+ && pip3 install --no-cache-dir accelerate \
+ && pip3 install --no-cache-dir grpcio \
+ && pip3 install --no-cache-dir grpcio-tools
44 changes: 44 additions & 0 deletions python/example/nanogpt_4D_finetune/README.md
@@ -0,0 +1,44 @@
# Finetune nanoGPT on Shakespeare in 4D Parallelism via veScale

In this example, we demonstrate how to finetune a pre-trained GPT2 model using veScale. The example is built upon @karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT/) project. With near-zero changes to the model code and minimal changes to the training code, we can finetune a pre-trained GPT2 on the Shakespeare dataset and utilize multiple GPUs via 4D parallelism: Data, Tensor, Sequence, and Optimizer Parallelism. The correctness of our implementation is verified by comparing both the training and the validation loss with the single-GPU result produced by nanoGPT. The difference is negligible when the computation is conducted in fp32, and about 1% in bf16.

## Prerequisites

```
pip3 install datasets tiktoken
```

## Run

First, prepare the dataset:
```
cd data/shakespeare && python3 prepare.py
```
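The preparation step follows the usual nanoGPT recipe: download the tiny-Shakespeare text, tokenize it with tiktoken's GPT-2 encoding, and write flat token files that the training script can memory-map. The sketch below is only an approximation of that flow, not the actual `prepare.py`; the download URL, the 90/10 split, the use of `requests`, and the output file names are assumptions carried over from upstream nanoGPT:

```
# Rough sketch of the data preparation step (assumed to mirror upstream nanoGPT;
# the actual prepare.py in this example may differ in details).
import numpy as np
import requests  # assumed available; not listed in the prerequisites above
import tiktoken

# Download the tiny-Shakespeare corpus (URL taken from upstream nanoGPT).
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text

# 90/10 train/validation split over the raw characters.
n = len(text)
train_text, val_text = text[: int(n * 0.9)], text[int(n * 0.9):]

# Tokenize with the GPT-2 BPE vocabulary so the tokens match the pretrained model.
enc = tiktoken.get_encoding("gpt2")
train_ids = np.array(enc.encode_ordinary(train_text), dtype=np.uint16)
val_ids = np.array(enc.encode_ordinary(val_text), dtype=np.uint16)

# Write flat token arrays; the training script can memory-map these files
# and sample fixed-length blocks from them.
train_ids.tofile("train.bin")
val_ids.tofile("val.bin")
```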

Then, to finetune the model on the Shakespeare dataset across multiple GPUs, run
```
torchrun --standalone --nproc_per_node={Number of GPUs} finetune_4D.py config/finetune_shakespeare.py --compile=False --dp_size={DP Size} --tp_size={TP Size}
```
where `DP Size` and `TP Size` denote the degrees of Data and Tensor Parallelism that suit your environment.
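For example, on a machine with 4 GPUs you could pass `--dp_size=2 --tp_size=2`, i.e. 2-way Data Parallelism combined with 2-way Tensor (and Sequence) Parallelism; presumably the product of the two sizes should equal the number of GPUs launched by `torchrun`.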

To produce the single GPU result, run
```
python3 base_train.py config/finetune_shakespeare.py --compile=False
```

## Loss Curves

Here are the training and validation loss curves for fp32 runs of 200 iterations:

![figure](./figures/nanoGPT_finetune_4d_val_loss_fp32_200.jpg)


![figure](./figures/nanoGPT_finetune_4d_train_loss_fp32_200.jpg)

## Caveats

1. `torch.compile` support in veScale is still experimental, so we run the single-GPU baseline with the `compile` flag off.

2. veScale does not focus on fp16, as fp16 is largely outdated in industry.

3. Checkpointing is not supported.