[Example] add an example of finetuning nanoGPT in 4D using veScale (#16)
This PR adds an example of finetuning a GPT2 model using the veScale API with
4D parallelism: Data, Tensor, Sequence, and Optimizer Parallelism. There
are near-zero changes to the model code. In addition, this PR also
improves the factory methods for DTensors and simplifies the DModule APIs.
lichen225 authored Mar 27, 2024
1 parent 9a937a3 commit ed4a792
Showing 31 changed files with 2,584 additions and 227 deletions.
4 changes: 3 additions & 1 deletion Dockerfile
@@ -40,4 +40,6 @@ RUN pip3 install --no-cache-dir packaging \
  && pip3 install --no-cache-dir optree \
  && pip3 install --no-cache-dir psutil \
  && pip3 install --no-cache-dir transformers==4.37.2 \
- && pip3 install --no-cache-dir accelerate
+ && pip3 install --no-cache-dir accelerate \
+ && pip3 install --no-cache-dir grpcio \
+ && pip3 install --no-cache-dir grpcio-tools
44 changes: 44 additions & 0 deletions python/example/nanogpt_4D_finetune/README.md
@@ -0,0 +1,44 @@
# Finetune nanoGPT on Shakespeare in 4D Parallelism via veScale

In this example, we demonstrate how to finetune a pre-trained GPT2 model using veScale. The example is built upon @karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT/) project. With near-zero changes to the model code and minimal changes to the training code, we can finetune a pre-trained GPT2 on the Shakespeare dataset and utilize multiple GPUs via 4D parallelism: Data, Tensor, Sequence, and Optimizer Parallelism. The correctness of our implementation is verified by comparing both the training and the validation loss with the single-GPU result produced by nanoGPT. The difference is negligible when the computation is conducted in fp32, and about 1% in bf16.

## Prerequisites

```
pip3 install datasets tiktoken
```

## Run

First, prepare the dataset:
```
cd data/shakespeare && python3 prepare.py
```
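The preparation step follows the usual nanoGPT recipe: download the tiny-Shakespeare text, tokenize it with tiktoken's GPT-2 encoding, and write flat token files that the training script can memory-map. The sketch below is only an approximation of that flow, not the actual `prepare.py`; the download URL, the 90/10 split, the use of `requests`, and the output file names are assumptions carried over from upstream nanoGPT:

```
# Rough sketch of the data preparation step (assumed to mirror upstream nanoGPT;
# the actual prepare.py in this example may differ in details).
import numpy as np
import requests  # assumed available; not listed in the prerequisites above
import tiktoken

# Download the tiny-Shakespeare corpus (URL taken from upstream nanoGPT).
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text

# 90/10 train/validation split over the raw characters.
n = len(text)
train_text, val_text = text[: int(n * 0.9)], text[int(n * 0.9):]

# Tokenize with the GPT-2 BPE vocabulary so the tokens match the pretrained model.
enc = tiktoken.get_encoding("gpt2")
train_ids = np.array(enc.encode_ordinary(train_text), dtype=np.uint16)
val_ids = np.array(enc.encode_ordinary(val_text), dtype=np.uint16)

# Write flat token arrays; the training script can memory-map these files
# and sample fixed-length blocks from them.
train_ids.tofile("train.bin")
val_ids.tofile("val.bin")
```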

Then, to finetune the model on the Shakespeare dataset across multiple GPUs, run
```
torchrun --standalone --nproc_per_node={Number of GPUs} finetune_4D.py config/finetune_shakespeare.py --compile=False --dp_size={DP Size} --tp_size={TP Size}
```
where `DP Size` and `TP Size` denote the degrees of Data and Tensor Parallelism that suit your environment.
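For example, on a machine with 4 GPUs you could pass `--dp_size=2 --tp_size=2`, i.e. 2-way Data Parallelism combined with 2-way Tensor (and Sequence) Parallelism; presumably the product of the two sizes should equal the number of GPUs launched by `torchrun`.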

To produce the single GPU result, run
```
python3 base_train.py config/finetune_shakespeare.py --compile=False
```

## Loss Curves

Here are the training and validation loss curves for fp32 runs of 200 iterations:

![figure](./figures/nanoGPT_finetune_4d_val_loss_fp32_200.jpg)


![figure](./figures/nanoGPT_finetune_4d_train_loss_fp32_200.jpg)

## Caveats

1. `torch.compile` support in veScale is still experimental, so we run the single-GPU baseline with the `compile` flag off.

2. veScale does not focus on fp16, as fp16 is largely outdated in industry.

3. Checkpointing is not supported.