Merge pull request #46 from aws-samples/dev
Dev
xiehust authored Jan 1, 2025
2 parents d57b299 + fb6d25b commit 32d6e62
Showing 14 changed files with 287 additions and 104 deletions.
78 changes: 31 additions & 47 deletions README.md
@@ -16,7 +16,7 @@ Model Hub V2 is a one-stop, no-code platform for model fine-tuning, deployment, and debugging
- ⚠️ Note: after the stack shows deployment complete, the launched EC2 instance still needs 8-10 minutes to run some scripts automatically. If the page does not load, wait 8-10 minutes and refresh.
![alt text](./assets/image-cf4.png)

# 2. Manual deployment
# 2. Manual deployment (China regions)
## 1. Environment setup
- Hardware: one EC2 instance, m5.xlarge, 200 GB EBS storage
- OS: Ubuntu 22.04
@@ -38,7 +38,6 @@ Model Hub V2 is a one-stop, no-code platform for model fine-tuning, deployment, and debugging
- Find the role created earlier and add an inline policy to it
- ![alt text](./assets/image-3.png)
- ![alt text](./assets/image-4.png)
- Note: in China regions, change "arn:aws:s3:::*" to "arn:aws-cn:s3:::sagemaker*"
```json
{
"Version": "2012-10-17",
@@ -53,69 +52,54 @@ Model Hub V2 is a one-stop, no-code platform for model fine-tuning, deployment, and debugging
"s3:CreateBucket"
],
"Resource": [
"arn:aws:s3:::*"
"*"
]
},
{
"Effect": "Allow",
"Action": [
"ssmmessages:CreateControlChannel"
],
"Resource": [
"*"
]
}
]
}
```
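With the partition adjusted, the same policy can also be attached from the CLI instead of the console. A minimal sketch, assuming the role is named sagemaker_exection_role as above; the statement list is abbreviated to the actions visible in this hunk, and the aws call is commented out because it needs account credentials (the policy name `modelhub-inline` is an arbitrary choice):

```shell
# Save an (abbreviated) copy of the policy document above
cat > policy.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:CreateBucket"],
            "Resource": ["*"]
        },
        {
            "Effect": "Allow",
            "Action": ["ssmmessages:CreateControlChannel"],
            "Resource": ["*"]
        }
    ]
}
EOF
echo "wrote $(wc -c < policy.json) bytes"
# Attach it as an inline policy (run with credentials for the account):
# aws iam put-role-policy --role-name sagemaker_exection_role \
#     --policy-name modelhub-inline --policy-document file://policy.json
```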
- SSH into the EC2 instance

- For China regions, download the code manually, package it, and upload it to the EC2 instance
- In an environment that can reach GitHub, run the command below to download the code, then package it as a zip file and upload it to the EC2 server
- In an environment that can reach GitHub, run the command below to download the code, then package it as a zip file and upload it to /home/ubuntu/ on the EC2 server
- Clone with --recurse-submodule so that submodules are included
```bash
git clone --recurse-submodule https://github.com/aws-samples/llm_model_hub.git
```

## 2. Deploy the frontend
1. Install Node.js 18
```bash
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
```
2. For installs in China regions, set a China npm registry mirror
```bash
npm config set registry https://registry.npm.taobao.org
```
3. Install yarn
```bash
sudo apt install -y nodejs
sudo npm install --global yarn
```
2. Configure environment variables
- Copy llm_model_hub/env.sample to .env, change the ip to the EC2 instance's IP, and set a random api key; this key must match the apikey configured in backend/.env in the next section
## 2. SSH into the EC2 server and unzip into /home/ubuntu/
```sh
unzip llm_model_hub.zip
```
REACT_APP_API_ENDPOINT=http://{ip}:8000/v1
REACT_APP_API_KEY=<a random key>
```



3. Start the web page
- Install yarn
```bash
yarn install
## 3. Set environment variables
```sh
export SageMakerRoleArn=<full ARN of the sagemaker_exection_role created above, e.g. arn:aws-cn:iam::1234567890:role/sagemaker_exection_role>
```
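For example (a sketch with a placeholder account id; note that IAM role ARNs have an empty region field, so there are two colons after `iam`):

```shell
# The real ARN can be looked up once credentials are configured:
#   aws iam get-role --role-name sagemaker_exection_role --query Role.Arn --output text
# Placeholder account id; use the arn:aws: partition outside China regions
export SageMakerRoleArn="arn:aws-cn:iam::123456789012:role/sagemaker_exection_role"
echo "$SageMakerRoleArn"
```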

```bash
#install pm2
sudo yarn global add pm2
pm2 start pm2run.config.js
- (Optional) To use SwanLab or wandb as a metrics dashboard, set these here, or add them to backend/.env later and run pm2 restart all to restart the services
```sh
export SWANLAB_API_KEY=<SWANLAB_API_KEY>
export WANDB_API_KEY=<WANDB_API_KEY>
export WANDB_BASE_URL=<WANDB_BASE_URL>
```
- Other management commands, for reference (no need to run them):

## 4. Run the deployment script
```bash
pm2 list
pm2 stop modelhub
pm2 restart modelhub
pm2 delete modelhub
cd /home/ubuntu/llm_model_hub
bash cn-region-deploy.sh
```
This takes about 40-60 minutes to complete (depending on Docker registry speed); the install log can be viewed in /home/ubuntu/setup.log.

## 3. Backend configuration
See [backend configuration](./backend/README.md)

## 4. Start the frontend
- Once everything above is deployed and the frontend is running, open http://{ip}:3000 in a browser
- For port forwarding, see the nginx section of the backend configuration
## 5. Access
- Once everything above is deployed and the frontend is running, open http://{ip}:3000 in a browser; the username and a random password can be found in /home/ubuntu/setup.log
- For port forwarding, see the nginx section in [backend configuration](./backend/README.md)


# How to upgrade?
34 changes: 4 additions & 30 deletions backend/0.setup-cn.sh
@@ -1,37 +1,11 @@

# Lines to prepend
MIRROR_LINE="-i https://pypi.tuna.tsinghua.edu.cn/simple"
PIP_INDEX="http://mirrors.aliyun.com/pypi/simple/"

# Handle backend/requirements.txt
BACKEND_REQ="/home/ubuntu/llm_model_hub/backend/requirements.txt"
if [ -f "$BACKEND_REQ" ]; then
sed -i "1i$MIRROR_LINE" "$BACKEND_REQ"
echo "Added mirror line to $BACKEND_REQ"
else
echo "File $BACKEND_REQ not found"
fi
pip config set global.index-url "$PIP_INDEX" && pip config set global.extra-index-url "$PIP_INDEX"

# Handle backend/byoc/requirements.txt
BACKEND2_REQ="/home/ubuntu/llm_model_hub/backend/byoc/requirements.txt"
if [ -f "$BACKEND2_REQ" ]; then
sed -i "1i$MIRROR_LINE" "$BACKEND2_REQ"
echo "Added mirror line to $BACKEND2_REQ"
sed -i 's|https://github.com/|https://gitclone.com/github.com/|' "$BACKEND2_REQ"
else
echo "File $BACKEND2_REQ not found"
fi



# Handle backend/LLaMA-Factory/requirements.txt
LLAMA_REQ="/home/ubuntu/llm_model_hub/backend/LLaMA-Factory/requirements.txt"
if [ -f "$LLAMA_REQ" ]; then
sed -i "1i$MIRROR_LINE" "$LLAMA_REQ"
sed -i 's|https://github.com/|https://gitclone.com/github.com/|' "$LLAMA_REQ"
echo "Modified $LLAMA_REQ"
else
echo "File $LLAMA_REQ not found"
fi
# Remove flash-attn; installing it times out in China regions
sed -i '/^flash_attn==/d' /home/ubuntu/llm_model_hub/backend/docker/requirements_deps.txt

## Set the default aws region
sudo apt install awscli
1 change: 1 addition & 0 deletions backend/02.start_backend.sh
@@ -1,5 +1,6 @@
#!/bin/bash
source ../miniconda3/bin/activate py311
conda activate py311
cd /home/ubuntu/llm_model_hub/backend/
pm2 start server.py --name "modelhub-server" --interpreter ../miniconda3/envs/py311/bin/python3 -- --host 0.0.0.0 --port 8000
pm2 start processing_engine/main.py --name "modelhub-engine" --interpreter ../miniconda3/envs/py311/bin/python3
5 changes: 5 additions & 0 deletions backend/README.md
@@ -132,6 +132,11 @@ http {
}
```

- Edit the domain and port in the llm_modelhub/.env file
```
REACT_APP_API_ENDPOINT=http://xxxx.compute-1.amazonaws.com:443/v1
```

- Apply the configuration:
```bash
sudo ln -s /etc/nginx/sites-available/modelhub /etc/nginx/sites-enabled/
2 changes: 1 addition & 1 deletion backend/byoc/build_and_push.sh
@@ -14,7 +14,7 @@ region=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/la
# region=$(aws configure get region)
suffix="com"

if [[ "$region" == cn* ]]; then
if [[ $region =~ ^cn ]]; then
suffix="com.cn"
fi

6 changes: 3 additions & 3 deletions backend/docker/build_and_push.sh
@@ -14,7 +14,7 @@ region=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/la
# region=$(aws configure get region)
suffix="com"

if [ "$region" == cn* ]; then
if [[ $region =~ ^cn ]]; then
suffix="com.cn"
fi
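The motivation for this change can be checked directly in bash: inside POSIX `[ ]`, the right-hand side of `==` is not treated as a pattern, so `cn*` never matches a region name, while `[[ $region =~ ^cn ]]` does a real prefix match. A small bash-only sketch with a sample region string:

```shell
region="cn-north-1"

# Old-style check: in [ ], "cn*" is compared as a literal string,
# so China regions are missed entirely:
if [ "$region" == "cn*" ]; then
    echo "old check: match"
else
    echo "old check: no match"
fi

# New check: [[ =~ ]] treats ^cn as a regular expression:
if [[ $region =~ ^cn ]]; then
    echo "new check: match"
fi
```

Under bash this prints `old check: no match` followed by `new check: match`.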

@@ -39,7 +39,7 @@ aws ecr get-login-password --region $region | docker login --username AWS --pas
# First, authenticate with AWS ECR
# Run these commands in your terminal before building:

if [[ "$region" == cn* ]]; then
if [[ $region =~ ^cn ]]; then
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin 727897471807.dkr.ecr.$region.amazonaws.${suffix}
else
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin 763104351884.dkr.ecr.$region.amazonaws.${suffix}
@@ -55,7 +55,7 @@ aws ecr set-repository-policy \

# Add variables for build arguments pytorch-training:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker
# https://github.com/aws/deep-learning-containers/blob/master/available_images.md
if [[ "$region" == cn* ]]; then
if [[ $region =~ ^cn ]]; then
BASE_IMAGE="727897471807.dkr.ecr.${region}.amazonaws.${suffix}/pytorch-training:2.4.0-gpu-py311"
PIP_INDEX="https://mirrors.aliyun.com/pypi/simple"
else
4 changes: 3 additions & 1 deletion backend/docker/requirements_deps.txt
@@ -1,6 +1,7 @@
deepspeed<=0.15.4
# intel-extension-for-pytorch==2.4.0
autoawq @git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
# autoawq @git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
autoawq<=0.2.7
metrics
bitsandbytes>=0.39.0
rouge-chinese
@@ -10,4 +11,5 @@ pandas
modelscope
wandb
nltk
swanlab
flash_attn==2.6.3
45 changes: 26 additions & 19 deletions backend/docker/train.py
@@ -11,8 +11,8 @@
from multiprocessing import Process


def load_s3_json(s3_path):
s3_client = boto3.client('s3')
def load_s3_json(s3_path,region_name):
s3_client = boto3.client('s3',region_name)
parsed = urlparse(s3_path)
bucket = parsed.netloc
key = parsed.path.lstrip('/')
@@ -23,6 +23,17 @@ def dict_to_cmd_args(doc: dict) -> str:
cmd_parts = [f"--{key} {value}" for key, value in doc.items()]
return " ".join(cmd_parts)

# delete arg from cmd args
def delete_arg(args_string, arg_name):
parts = args_string.split()
for i,part in enumerate(parts):
if part.startswith(f"--{arg_name}"):
next_part = parts[i+1]
parts.remove(part)
parts.remove(next_part)
break
return " ".join(parts)

def update_arg_value(args_string, arg_name, new_value):
parts = args_string.split()
for i, part in enumerate(parts):
@@ -94,9 +105,10 @@ def stop_monitoring():

if __name__ == "__main__":

train_args_json = load_s3_json(os.environ['train_args_path'])
merge_args_json = load_s3_json(os.environ['merge_args_path'])
datainfo = load_s3_json(os.environ['dataset_info_path'])
region_name = os.environ['REGION']
train_args_json = load_s3_json(os.environ['train_args_path'], region_name)
merge_args_json = load_s3_json(os.environ['merge_args_path'], region_name)
datainfo = load_s3_json(os.environ['dataset_info_path'], region_name)

#save to data folder
with open('/opt/ml/code/data/dataset_info.json', 'w') as f:
@@ -141,14 +153,6 @@ def stop_monitoring():
GPUS_PER_NODE = int(os.environ["SM_NUM_GPUS"])
DEVICES = ','.join([str(i) for i in range(GPUS_PER_NODE)])

# index_path = os.environ.get('PIP_INDEX')
# if index_path:
# os.system(f"pip config set global.index-url {index_path}")
# os.system(f"pip config set global.extra-index-url {index_path}")
# os.system(f"pip install -r requirements_deps.txt")
# else:
# os.system(f"pip install -r requirements_deps.txt")


os.system("chmod +x ./s5cmd")

@@ -166,6 +170,9 @@ def stop_monitoring():
s3_checkpoint = s3_checkpoint[:-1] if s3_checkpoint.endswith('/') else s3_checkpoint
# download to local
run_command(f'./s5cmd sync --exclude "checkpoint-*" {s3_checkpoint}/* /tmp/checkpoint/')
# change the model path to local
train_args = update_arg_value(train_args,"model_name_or_path","/tmp/checkpoint/")
#add resume_from_checkpoint arg
train_args += " --resume_from_checkpoint /tmp/checkpoint/"
print(f"resume_from_checkpoint {s3_checkpoint}")

@@ -180,9 +187,9 @@ def stop_monitoring():
train_args = update_arg_value(train_args,"model_name_or_path","/tmp/model_path/")
print(f"s3 model_name_or_path {s3_model_path}")

if host_rank == 0:
# start the checkpoint monitoring process
start_monitoring()
# if host_rank == 0:
# start the checkpoint monitoring process; checkpoints are saved on each node
start_monitoring()

print(f'------envs------\nnum_machines:{num_machines}\nnum_processes:{num_processes}\nnode_rank:{host_rank}\n')
if num_machines > 1:
Expand All @@ -198,9 +205,9 @@ def stop_monitoring():
stop_monitoring()
sys.exit(1)

if host_rank == 0:
# stop checkpoint monitoring
stop_monitoring()
# if host_rank == 0:
# stop checkpoint monitoring
stop_monitoring()

if os.environ.get("merge_lora") == '1' and host_rank == 0:
## update model path as local folder as s3 provided
1 change: 1 addition & 0 deletions backend/env.sample
@@ -11,6 +11,7 @@ api_keys=123456
HUGGING_FACE_HUB_TOKEN=
WANDB_API_KEY=
WANDB_BASE_URL=
SWANLAB_API_KEY=
vllm_image=.dkr.ecr.us-east-1.amazonaws.com/sagemaker_endpoint/vllm:v0.6.4
model_artifact=s3://sagemaker-us-east-1-/sagemaker_endpoint/vllm/model.tar.gz
training_image=.dkr.ecr.us-east-1.amazonaws.com/llamafactory/llamafactory:0.9.2.dev0
12 changes: 10 additions & 2 deletions backend/training/training_job.py
@@ -17,7 +17,7 @@
import dotenv
import os
from utils.config import boto_sess,role,default_bucket,sagemaker_session, is_efa, \
LORA_BASE_CONFIG,DEEPSPEED_BASE_CONFIG_MAP,FULL_BASE_CONFIG,DEFAULT_REGION,WANDB_API_KEY, WANDB_BASE_URL
LORA_BASE_CONFIG,DEEPSPEED_BASE_CONFIG_MAP,FULL_BASE_CONFIG,DEFAULT_REGION,WANDB_API_KEY, WANDB_BASE_URL, SWANLAB_API_KEY

dotenv.load_dotenv()

@@ -180,6 +180,13 @@ def create_training_args(self,
doc['report_to'] = "wandb"
timestp = to_datetime_string(time.time()).replace(' ', '_')
doc['run_name'] = f"modelhub_run_{timestp}"

if SWANLAB_API_KEY:
doc['use_swanlab'] = True
timestp = to_datetime_string(time.time()).replace(' ', '_')
doc['swanlab_run_name'] = f"sagemaker_modelhub_run_{timestp}"
doc['swanlab_project'] = f"sagemaker_modelhub"
doc['swanlab_api_key'] = SWANLAB_API_KEY

# training precision
if job_payload['training_precision'] == 'bf16':
@@ -271,7 +278,8 @@ def create_training(self,
"merge_args_path":sg_lora_merge_config,
"train_args_path":sg_config,
'OUTPUT_MODEL_S3_PATH': output_s3_path, # destination
"PIP_INDEX":'https://mirrors.aliyun.com/pypi/simple' if DEFAULT_REGION.startswith('cn') else '',
"REGION": DEFAULT_REGION,
# "PIP_INDEX":'https://mirrors.aliyun.com/pypi/simple' if DEFAULT_REGION.startswith('cn') else '',
"USE_MODELSCOPE_HUB": "1" if DEFAULT_REGION.startswith('cn') else '0'

}
1 change: 1 addition & 0 deletions backend/users/add_user.py
@@ -1,5 +1,6 @@
import sys
sys.path.append('./')

from db_management.database import DatabaseWrapper
from logger_config import setup_logger
import argparse
3 changes: 2 additions & 1 deletion backend/utils/config.py
@@ -52,6 +52,7 @@
"stage_3":'examples/deepspeed/ds_z3_config.json'}
WANDB_API_KEY = os.environ.get('WANDB_API_KEY','')
WANDB_BASE_URL = os.environ.get('WANDB_BASE_URL','')
SWANLAB_API_KEY = os.environ.get('SWANLAB_API_KEY','')

# load the persisted model list; it can be modified in endpoingt_management.py
try:
@@ -87,4 +88,4 @@
}

def is_efa(instance_type):
return 'ml.p4' in instance_type or 'ml.p5' in instance_type
return 'ml.p4' in instance_type or 'ml.p5' in instance_type or 'ml.g6e' in instance_type or 'ml.g5' in instance_type