Merge pull request #218 from aruncs2005/main
smp v2 llama2 training example using fp8
aruncs2005 authored Mar 28, 2024
2 parents 3efcac7 + cc9dc66 commit 6aed940
Showing 36 changed files with 4,721 additions and 0 deletions.
5 changes: 5 additions & 0 deletions 3.test_cases/17.SM-modelparallelv2/Dockerfile
@@ -0,0 +1,5 @@
FROM 658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121

COPY ./scripts /workspace

WORKDIR /workspace
158 changes: 158 additions & 0 deletions 3.test_cases/17.SM-modelparallelv2/README.md
@@ -0,0 +1,158 @@
## Using SageMaker Model Parallelism with Llama V2 Training Job

The Amazon SageMaker model parallelism library (SMP) is a capability of SageMaker that enables high-performance, optimized large-scale training on SageMaker accelerated compute instances. Its core features are hybrid sharded data parallelism, tensor parallelism, activation checkpointing, and activation offloading. You can use SMP to accelerate the training and fine-tuning of large language models (LLMs), large vision models (LVMs), and foundation models (FMs) with hundreds of billions of parameters, such as [Llama2](https://huggingface.co/docs/transformers/model_doc/llama2) and [GPT-NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox).

The latest release of Amazon SageMaker model parallelism (SMP v2) aligns the library’s APIs and methods with open source PyTorch Fully Sharded Data Parallelism ([FSDP](https://pytorch.org/docs/stable/fsdp.html)), allowing users to easily enable SMP’s performance optimizations with minimal code change. Now, you can achieve state-of-the-art large model training performance on SageMaker in minutes by migrating your existing FSDP training scripts to SMP. We added support for FP8 training for Llama2 and GPT-NeoX Hugging Face transformer models on P5 instances with Transformer Engine integration.

This directory contains example scripts for training with SMP and PyTorch. We assume you have already set up a HyperPod cluster. Below, we first describe the files in this directory and then go over how to run some jobs.

### Files

All source files are located in the `scripts` directory.

**Training Scripts**
- `train_lib.py`: Main training script.
- `train_utils.py`: Implements key functions used by the main training script, including model initialization and activation checkpointing.

**Launch Scripts**
- `launch_training_enroot.sh`: Slurm sbatch script that launches a job using enroot. Run it on the head node. It uses synthetic data by default, so training can be tested easily. To define your own model configuration, modify this file.

- `launch_training_conda.sh`: Slurm sbatch script that launches a job using a conda environment. Run it on the head node. It uses synthetic data by default, so training can be tested easily. To define your own model configuration, modify this file.

**Dataset and Dataloading Scripts**
- `data/pipelines/data_pipeline.py`: Creates dataloaders for the job. Modify this file to load your own dataset.
- `data/utils.py`: Utility file to facilitate using datasets stored in AWS S3.

**Miscellaneous Utility Scripts**
- `arguments.py`: Parses arguments for the job. Please refer to this file for all the options the script supports.
- `checkpoints.py`: Handles saving and loading of checkpoints.
- `learning_rates.py`: Utility file for implementing learning rate annealing during training.
- `logging_utils.py`: Implements helper functions for logging key information during training, such as loss, training throughput, and environment variables.
- `memory_tracker.py`: Implements functions for monitoring CPU and GPU memory usage.


You can run training using either enroot with pyxis or a conda environment. Choose the option that fits your requirements.

## Option 1 - Run Training using Conda Environment

### Build conda environment

We have provided a setup script which installs the required libraries along with SMP V2 library.

Run the script on one of the worker nodes, because the worker nodes have more vCPUs than the controller node.

```
bash conda_env_setup.sh
```

### Note on paths
These scripts need to be put in a shared file system that can be accessed by all nodes, such as [FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html).
We also recommend setting all paths for input data and checkpoints as shared directories using FSx for Lustre.
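
As a quick pre-flight check, the sketch below tests whether a directory is backed by the expected shared mount. The `/fsx` mount point, the `SHARED_ROOT` variable, and both helper functions are assumptions for illustration; substitute the mount point your cluster actually uses.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check. SHARED_ROOT (/fsx here) is an assumption,
# not a path defined by this repository.

# Print the mount point that backs a given path (column 6 of POSIX `df -P`).
mount_point_of() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { print $6 }'
}

# Succeed only if the directory exists and is backed by the expected mount point.
on_shared_mount() {
    local path="$1" shared_root="$2"
    [ -d "$path" ] && [ "$(mount_point_of "$path")" = "$shared_root" ]
}

SHARED_ROOT="${SHARED_ROOT:-/fsx}"
if on_shared_mount "$PWD" "$SHARED_ROOT"; then
    echo "ok: $PWD is on $SHARED_ROOT"
else
    echo "warning: $PWD is not on $SHARED_ROOT"
fi
```

Running this from the directory that holds the scripts, and again for the data and checkpoint directories, gives an early warning before a multi-node job is submitted. Mount points that contain spaces are not handled by this sketch.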

### cuDNN Download for cuda11.8 and cuda12.1
We recommend that you install cuDNN for your desired CUDA version from the [NVIDIA Developer page](https://developer.nvidia.com/cudnn). Follow the link and:
1. Make a developer account.
2. Click on "Download cuDNN Library".
3. Agree to the terms.
4. Download the Local Installer for Linux x86_64 (Tar) for cuda11 or cuda12. `conda_env_setup.sh` expects the 8.9.5.30 archive for cuda11 and the 8.9.7.29 archive for cuda12.
5. Move the tar file from your local machine to the cluster directory from which you run the setup script, because the script extracts the archive by relative path.
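
Before running the setup script, it can help to confirm that the archive it will extract is present. The sketch below maps a CUDA version to the archive name used in `conda_env_setup.sh`; the `cudnn_archive_for` helper is hypothetical, and unlike the setup script (which treats every version other than 11.8 as cuda12) it rejects versions other than 11.8 and 12.1.

```shell
#!/usr/bin/env bash
# Map a CUDA version to the cuDNN archive name that conda_env_setup.sh extracts.
# The file names come from that script; the helper itself is hypothetical.
cudnn_archive_for() {
    case "$1" in
        11.8) echo "cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz" ;;
        12.1) echo "cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz" ;;
        *)    echo "unsupported CUDA version: $1" >&2; return 1 ;;
    esac
}

SMP_CUDA_VER="${SMP_CUDA_VER:-12.1}"
archive="$(cudnn_archive_for "$SMP_CUDA_VER")"
if [ -n "$archive" ] && [ -f "$archive" ]; then
    echo "found $archive"
else
    echo "cuDNN archive for CUDA $SMP_CUDA_VER is missing from $PWD"
fi
```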



### User Guide
1. **Launching a job with synthetic data on 8 nodes**

The default config in the script launches a 70B Llama model with synthetic data.
```
sbatch launch_training_conda.sh
```

2. **Changing arguments taken by the script**

`launch_training_conda.sh` defines a set of variables and passes them to the training script as arguments. Edit the values in `launch_training_conda.sh` to change them. For example, the script takes the model size and sets the matching `hidden_width`, `num_layers`, and related arguments for the training script. If you are using P4 instances, disable FP8 training by setting the `--fp8` parameter to 0.
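
The P4 versus P5 rule can be expressed as a small helper. This is a sketch only: `fp8_value_for` and the `INSTANCE_TYPE` variable are hypothetical names that do not appear in the launch scripts, and the helper assumes FP8 is enabled only on P5 instance types.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the value passed to --fp8 from the instance type.
# Assumes FP8 is enabled only on P5 instances; everything else gets 0.
fp8_value_for() {
    case "$1" in
        ml.p5.*|p5.*) echo 1 ;;
        *)            echo 0 ;;
    esac
}

instance_type="${INSTANCE_TYPE:-ml.p4d.24xlarge}"   # assumed variable name
echo "--fp8 $(fp8_value_for "$instance_type")"
```

The printed flag is what you would set in the launch script for that instance type.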


3. **To run with your own data**

With the current dataloader, data can be prepared either as json or json.gz files (json.gz needs the arg `--zipped_data 1`), where each file contains JSON lines with `input_ids` and `attention_mask` fields, or in the Hugging Face format. Refer to `data_pipeline.py` for details. You can always replace it with your own dataloader.
```
# 3a. modify the launch_training_conda.sh script with path to data
# 3b. start training
sbatch launch_training_conda.sh
```
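
To make the JSON-lines layout concrete, the sketch below writes a tiny sample file with the two fields named above, plus a gzip-compressed copy. The token IDs, file names, and temporary directory are illustrative assumptions; whether the loader expects additional fields or a specific naming scheme is not covered here, so check `data_pipeline.py`.

```shell
#!/usr/bin/env bash
# Write a tiny JSON-lines sample with the two fields the README names.
# Token IDs, file names, and the directory are illustrative assumptions.
data_dir="$(mktemp -d)"

printf '%s\n' \
    '{"input_ids": [1, 15043, 3186, 2], "attention_mask": [1, 1, 1, 1]}' \
    '{"input_ids": [1, 910, 338, 2], "attention_mask": [1, 1, 1, 1]}' \
    > "$data_dir/sample_0.json"

# Compressed variant, for use with --zipped_data 1.
gzip -c "$data_dir/sample_0.json" > "$data_dir/sample_0.json.gz"

echo "wrote sample files to $data_dir"
gzip -cd "$data_dir/sample_0.json.gz" | head -n 1
```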

4. **Resuming job from a checkpoint**

Modify `launch_training_conda.sh` to add the `--resume_from_checkpoint` arg, with the path of the checkpoint, to the srun command. Then start the job the same way as before.
```
sbatch launch_training_conda.sh
```
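
If you keep several checkpoints, a helper like the one below can pick the most recent one to pass to `--resume_from_checkpoint`. It is a sketch under two assumptions that the source does not confirm: each checkpoint is saved as its own subdirectory, and `/fsx/checkpoints` (the `CHECKPOINT_ROOT` default) is where they live.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most recently modified subdirectory of a
# checkpoint root. Assumes each checkpoint is saved as its own subdirectory.
latest_checkpoint() {
    local root="$1" newest
    newest="$(ls -1dt "$root"/*/ 2>/dev/null | head -n 1)"
    [ -n "$newest" ] && printf '%s\n' "${newest%/}"
}

CHECKPOINT_ROOT="${CHECKPOINT_ROOT:-/fsx/checkpoints}"   # assumed location
if ckpt="$(latest_checkpoint "$CHECKPOINT_ROOT")"; then
    echo "--resume_from_checkpoint $ckpt"
else
    echo "no checkpoint found under $CHECKPOINT_ROOT"
fi
```

Selection is by modification time, so a checkpoint that is still being written could be chosen; verify that the directory is complete before resuming.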


## Option 2 - Run Training using Docker and Enroot


### Prerequisites

1. To download the SMP image from ECR, add the policy below to the role attached to HyperPod.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr-public:*",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetAuthorizationToken",
                "sts:*"
            ],
            "Resource": "*"
        }
    ]
}
```

### Build enroot sqsh file

We build a Docker image that extends the SMP v2 image in ECR. To create the sqsh file, run `docker_build.sh`.

Run the script on one of the worker nodes, because the worker nodes are configured to use NVMe for the docker/enroot cache.

```
bash docker_build.sh
```
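
Because the build pulls a large base image, you may want to skip it when the image file already exists. The guard below is a minimal sketch: `needs_build` and the `SQSH_FILE` variable are hypothetical, while `smpv2.sqsh` is the output name used by `docker_build.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical guard: rebuild only when the squashfs image is absent or empty.
# smpv2.sqsh is the output name used by docker_build.sh.
needs_build() {
    [ ! -s "$1" ]
}

image="${SQSH_FILE:-smpv2.sqsh}"
if needs_build "$image"; then
    echo "$image not found: run 'bash docker_build.sh'"
else
    echo "$image already exists: skipping build"
fi
```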

### User Guide
1. **Launching a job with synthetic data on 8 nodes**

The default config in the script launches a 70B Llama model with synthetic data.
```
sbatch launch_training_enroot.sh
```

2. **Changing arguments taken by the script**

`launch_training_enroot.sh` defines a set of variables and passes them to the training script as arguments. Edit the values in `launch_training_enroot.sh` to change them. For example, the script takes the model size and sets the matching `hidden_width`, `num_layers`, and related arguments for the training script. If you are using P4 instances, disable FP8 training by setting the `--fp8` parameter to 0.


3. **To run with your own data**

With the current dataloader, data can be prepared either as json or json.gz files (json.gz needs the arg `--zipped_data 1`), where each file contains JSON lines with `input_ids` and `attention_mask` fields, or in the Hugging Face format. Refer to `data_pipeline.py` for details. You can always replace it with your own dataloader.
```
# 3a. modify the launch_training_enroot.sh script with path to data
# 3b. start training
sbatch launch_training_enroot.sh
```

4. **Resuming job from a checkpoint**

Modify `launch_training_enroot.sh` to add the `--resume_from_checkpoint` arg, with the path of the checkpoint, to the srun command. Then start the job the same way as before.
```
sbatch launch_training_enroot.sh
```
86 changes: 86 additions & 0 deletions 3.test_cases/17.SM-modelparallelv2/conda_env_setup.sh
@@ -0,0 +1,86 @@
# specify which CUDA version you are using
SMP_CUDA_VER=12.1 #or 11.8

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -f -p ./miniconda3

source ./miniconda3/bin/activate

export ENV_PATH=./miniconda3/envs/smpv2

conda create -p ${ENV_PATH} python=3.10

conda activate ${ENV_PATH}

# Install aws-cli if not already installed
# https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#cliv2-linux-install

#aws s3 sync s3://sagemaker-distributed-model-parallel/smp-2.0.0-pt-2.0.1/2023-12-11/smp-v2/ /tmp/local_smp_install_channel/

conda install "aws-ofi-nccl >=1.7.1,<2.0" packaging --override-channels \
-c https://aws-ml-conda.s3.us-west-2.amazonaws.com \
-c pytorch -c numba/label/dev \
-c nvidia \
-c conda-forge

conda install pytorch="2.2.0=sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.2_cuda12.1_0" packaging --override-channels \
-c https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/ \
-c pytorch -c numba/label/dev \
-c pytorch-nightly -c nvidia -c conda-forge

# Install dependencies of the script as below

python -m pip install --no-cache-dir -U \
"transformers==4.37.1" \
"triton==2.2.0" \
"SentencePiece==0.1.99" \
"datasets==2.16.1" \
"expecttest" \
"parameterized==0.9.0" \
"protobuf==3.20.3" \
"pytest-repeat==0.9.1" \
"pytest==7.4.0" \
"tensorboard==2.13.0" \
"tqdm==4.65.0"

MAX_JOBS=128 pip install flash-attn==2.3.3 --no-build-isolation


# python -m pip install packaging transformers==4.31.0 accelerate ninja tensorboard h5py datasets \
# && python -m pip install expecttest hypothesis \
# && python -m pip install "flash-attn>=2.0.4" --no-build-isolation

# Install SMDDP wheel (only run for cuda11.8)
if [ "$SMP_CUDA_VER" == "11.8" ]; then
SMDDP_WHL="smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl" \
&& wget -q https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/${SMDDP_WHL} \
&& pip install --force ${SMDDP_WHL} \
&& rm ${SMDDP_WHL}
fi

if [ "$SMP_CUDA_VER" == "11.8" ]; then
# cuDNN installation for TransformerEngine installation for cuda11.8
tar xf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
&& rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
&& cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
&& cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
&& rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
&& rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive/
else
# cuDNN installation for TransformerEngine installation for cuda12.1
tar xf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
&& rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
&& cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
&& cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
&& rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
&& rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive/
fi

# TransformerEngine installation
export CUDA_HOME=/usr/local/cuda-$SMP_CUDA_VER
export CUDNN_PATH=/usr/local/cuda-$SMP_CUDA_VER/lib
export CUDNN_LIBRARY=/usr/local/cuda-$SMP_CUDA_VER/lib
export CUDNN_INCLUDE_DIR=/usr/local/cuda-$SMP_CUDA_VER/include
export PATH=/usr/local/cuda-$SMP_CUDA_VER/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-$SMP_CUDA_VER/lib

pip install git+https://github.com/NVIDIA/[email protected]
8 changes: 8 additions & 0 deletions 3.test_cases/17.SM-modelparallelv2/docker_build.sh
@@ -0,0 +1,8 @@
#!/usr/bin/env bash

region=us-west-2
dlc_account_id=658645717510
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com

docker build -t smpv2 .
enroot import -o smpv2.sqsh dockerd://smpv2:latest