diff --git a/README.md b/README.md
index c135c99..991ba1f 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ Model Hub V2 provides one-stop, no-code model fine-tuning, deployment, and debug
 - ⚠️ Note: after the stack shows deployment complete, the launched EC2 instance still needs 8-10 minutes to run some scripts automatically. If it does not work, wait 8-10 minutes and refresh the page
 ![alt text](./assets/image-cf4.png)
-# 2. Manual Deployment
+# 2. Manual Deployment (China Regions)
 ## 1. Environment Setup
 - Hardware requirement: one EC2 instance, m5.xlarge, 200GB EBS storage
 - OS requirement: Ubuntu 22.04
@@ -38,7 +38,6 @@ Model Hub V2 provides one-stop, no-code model fine-tuning, deployment, and debug
 - Find the role created above and create an inline policy
 - ![alt text](./assets/image-3.png)
 - ![alt text](./assets/image-4.png)
-- Note: in China regions, change "arn:aws:s3:::*" to "arn:aws-cn:s3:::sagemaker*"
 ```json
 {
   "Version": "2012-10-17",
@@ -53,69 +52,54 @@ Model Hub V2 provides one-stop, no-code model fine-tuning, deployment, and debug
         "s3:CreateBucket"
       ],
       "Resource": [
-        "arn:aws:s3:::*"
+        "*"
+      ]
+    },
+    {
+      "Effect": "Allow",
+      "Action": [
+        "ssmmessages:CreateControlChannel"
+      ],
+      "Resource": [
+        "*"
       ]
     }
   ]
 }
 ```
 - SSH into the EC2 instance
 - In China regions, you need to download the code manually, zip it, and transfer it to the EC2 instance
-- First run the following command in an environment that can reach GitHub, then zip the code and upload the zip file to the EC2 server.
+- First run the following command in an environment that can reach GitHub, then zip the code and upload the zip file to /home/ubuntu/ on the EC2 server.
 - Clone with --recurse-submodule
 ```bash
 git clone --recurse-submodule https://github.com/aws-samples/llm_model_hub.git
 ```
-
-## 2. Deploy the Frontend
-1. Install Node.js 18
-```bash
-curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
-```
-2. For China-region installs, set a China npm mirror
-```bash
-npm config set registry https://registry.npm.taobao.org
-```
-3. Install yarn
-```bash
-sudo apt install -y nodejs
-sudo npm install --global yarn
-```
-2. Configure environment variables
-- Copy llm_model_hub/env.sample to .env, change the ip to your EC2's IP, and pick a random api key; this key must match the apikey configured in backend/.env in the backend setup section
+## 2. SSH into the EC2 server and unzip into /home/ubuntu/
+```sh
+unzip llm_model_hub.zip
 ```
-REACT_APP_API_ENDPOINT=http://{ip}:8000/v1
-REACT_APP_API_KEY=<a random key>
-```
-
-
-3. Start the web page
-- Install yarn
-```bash
-yarn install
+## 3. Set Environment Variables
+```sh
+export SageMakerRoleArn=<full ARN of the SageMaker execution role created above, e.g. arn:aws-cn:iam::1234567890:role/sagemaker_execution_role>
 ```
-
-```bash
-#install pm2
-sudo yarn global add pm2
-pm2 start pm2run.config.js
+- (Optional) To use SwanLab or wandb as a metrics dashboard, set the following now, or add them to backend/.env later and run pm2 restart all to restart the services
+```sh
+export SWANLAB_API_KEY=
+export WANDB_API_KEY=
+export WANDB_BASE_URL=
 ```
-- Other management commands (for reference only, no need to run them):
+
+## 4. Run the Deployment Script
 ```bash
-pm2 list
-pm2 stop modelhub
-pm2 restart modelhub
-pm2 delete modelhub
+cd /home/ubuntu/llm_model_hub
+bash cn-region-deploy.sh
 ```
+It finishes in roughly 40-60 minutes (depending on Docker registry speed); the install log is written to /home/ubuntu/setup.log.
 
-## 3. Backend Setup
-See [backend setup](./backend/README.md)
-
-## 4. Start the Frontend
-- Once everything above is deployed and the frontend has started, open http://{ip}:3000 in a browser
-- For port forwarding, see the nginx section of the backend setup
+## 5. Access
+- Once everything above is deployed and the frontend has started, open http://{ip}:3000 in a browser; the username and random password are in /home/ubuntu/setup.log
+- For port forwarding, see the nginx section in [backend setup](./backend/README.md)
 
 # How to Upgrade?

diff --git a/backend/0.setup-cn.sh b/backend/0.setup-cn.sh
index 966ecee..727066f 100644
--- a/backend/0.setup-cn.sh
+++ b/backend/0.setup-cn.sh
@@ -1,37 +1,11 @@
 # Define the content to add
-MIRROR_LINE="-i https://pypi.tuna.tsinghua.edu.cn/simple"
+PIP_INDEX="http://mirrors.aliyun.com/pypi/simple/"
 
-# Handle backend/requirements.txt
-BACKEND_REQ="/home/ubuntu/llm_model_hub/backend/requirements.txt"
-if [ -f "$BACKEND_REQ" ]; then
-    sed -i "1i$MIRROR_LINE" "$BACKEND_REQ"
-    echo "Added mirror line to $BACKEND_REQ"
-else
-    echo "File $BACKEND_REQ not found"
-fi
+pip config set global.index-url "$PIP_INDEX" && pip config set global.extra-index-url "$PIP_INDEX"
 
-# Handle backend/byoc/requirements.txt
-BACKEND2_REQ="/home/ubuntu/llm_model_hub/backend/byoc/requirements.txt"
-if [ -f "$BACKEND2_REQ" ]; then
-    sed -i "1i$MIRROR_LINE" "$BACKEND2_REQ"
-    echo "Added mirror line to $BACKEND2_REQ"
-    sed -i 's|https://github.com/|https://gitclone.com/github.com/|' "$BACKEND2_REQ"
-else
-    echo "File $BACKEND2_REQ not found"
-fi
-
-
-
-# Handle backend/LLaMA-Factory/requirements.txt
-LLAMA_REQ="/home/ubuntu/llm_model_hub/backend/LLaMA-Factory/requirements.txt"
-if [ -f "$LLAMA_REQ" ]; then
-    sed -i "1i$MIRROR_LINE" "$LLAMA_REQ"
-    sed -i 's|https://github.com/|https://gitclone.com/github.com/|' "$LLAMA_REQ"
-    echo "Modified $LLAMA_REQ"
-else
-    echo "File $LLAMA_REQ not found"
-fi
+# Remove flash-attn; its install times out in China regions
+sed -i '/^flash_attn==/d' /home/ubuntu/llm_model_hub/backend/docker/requirements_deps.txt
 
 ## Set the default aws region
 sudo apt install awscli

diff --git a/backend/02.start_backend.sh b/backend/02.start_backend.sh
index 0451465..6101259 100644
--- a/backend/02.start_backend.sh
+++ b/backend/02.start_backend.sh
@@ -1,5 +1,6 @@
 #!/bin/bash
 source ../miniconda3/bin/activate py311
 conda activate py311
+cd /home/ubuntu/llm_model_hub/backend/
 pm2 start server.py --name "modelhub-server" --interpreter ../miniconda3/envs/py311/bin/python3 -- --host 0.0.0.0 --port 8000
 pm2 start processing_engine/main.py --name "modelhub-engine" --interpreter ../miniconda3/envs/py311/bin/python3

diff --git a/backend/README.md b/backend/README.md
index f08082e..1510b98 100644
--- a/backend/README.md
+++ b/backend/README.md
@@ -132,6 +132,11 @@ http {
 }
 ```
 
+- Update the domain and port in the llm_model_hub/.env file
+```
+REACT_APP_API_ENDPOINT=http://xxxx.compute-1.amazonaws.com:443/v1
+```
+
 - Apply the configuration:
 ```bash
 sudo ln -s /etc/nginx/sites-available/modelhub /etc/nginx/sites-enabled/

diff --git a/backend/byoc/build_and_push.sh b/backend/byoc/build_and_push.sh
index f0fc1c7..b048d15 100644
--- a/backend/byoc/build_and_push.sh
+++ b/backend/byoc/build_and_push.sh
@@ -14,7 +14,7 @@ region=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/la
 # region=$(aws configure get region)
 
 suffix="com"
-if [[ "$region" == cn* ]]; then
+if [[ $region =~ ^cn ]]; then
     suffix="com.cn"
 fi

diff --git a/backend/docker/build_and_push.sh b/backend/docker/build_and_push.sh
index bea6261..fe5331a 100644
--- a/backend/docker/build_and_push.sh
+++ b/backend/docker/build_and_push.sh
@@ -14,7 +14,7 @@ region=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/la
 # region=$(aws configure get region)
 
 suffix="com"
-if [ "$region" == cn* ]; then
+if [[ $region =~ ^cn ]]; then
     suffix="com.cn"
 fi
 
@@ -39,7 +39,7 @@ aws ecr get-login-password --region $region | docker login --username AWS --pas
 # First, authenticate with AWS ECR
 # Run these commands in your terminal before building:
-if [[ "$region" == cn* ]]; then
+if [[ $region =~ ^cn ]]; then
     aws ecr get-login-password --region $region | docker login --username AWS --password-stdin 727897471807.dkr.ecr.$region.amazonaws.${suffix}
 else
     aws ecr get-login-password --region $region | docker login --username AWS --password-stdin 763104351884.dkr.ecr.$region.amazonaws.${suffix}
@@ -55,7 +55,7 @@ aws ecr set-repository-policy \
 # Add variables for build arguments pytorch-training:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker
 # https://github.com/aws/deep-learning-containers/blob/master/available_images.md
-if [[ "$region" == cn* ]]; then
+if [[ $region =~ ^cn ]]; then
     BASE_IMAGE="727897471807.dkr.ecr.${region}.amazonaws.${suffix}/pytorch-training:2.4.0-gpu-py311"
     PIP_INDEX="https://mirrors.aliyun.com/pypi/simple"
 else

diff --git a/backend/docker/requirements_deps.txt b/backend/docker/requirements_deps.txt
index a32c469..774cd31 100644
--- a/backend/docker/requirements_deps.txt
+++ b/backend/docker/requirements_deps.txt
@@ -1,6 +1,7 @@
 deepspeed<=0.15.4
 # intel-extension-for-pytorch==2.4.0
-autoawq @git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
+# autoawq @git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
+autoawq<=0.2.7
 metrics
 bitsandbytes>=0.39.0
 rouge-chinese
@@ -10,4 +11,5 @@ pandas
 modelscope
 wandb
 nltk
+swanlab
 flash_attn==2.6.3
\ No newline at end of file

diff --git a/backend/docker/train.py b/backend/docker/train.py
index d310271..2e81732 100644
--- a/backend/docker/train.py
+++ b/backend/docker/train.py
@@ -11,8 +11,8 @@ from multiprocessing import Process
 
-def load_s3_json(s3_path):
-    s3_client = boto3.client('s3')
+def load_s3_json(s3_path, region_name):
+    s3_client = boto3.client('s3', region_name)
     parsed = urlparse(s3_path)
     bucket = parsed.netloc
     key = parsed.path.lstrip('/')
@@ -23,6 +23,17 @@ def dict_to_cmd_args(doc: dict) -> str:
     cmd_parts = [f"--{key} {value}" for key, value in doc.items()]
     return " ".join(cmd_parts)
 
+# delete arg from cmd args
+def delete_arg(args_string, arg_name):
+    parts = args_string.split()
+    for i, part in enumerate(parts):
+        if part.startswith(f"--{arg_name}"):
+            next_part = parts[i+1]
+            parts.remove(part)
+            parts.remove(next_part)
+            break
+    return " ".join(parts)
+
 def update_arg_value(args_string, arg_name, new_value):
     parts = args_string.split()
     for i, part in enumerate(parts):
@@ -94,9 +105,10 @@ def stop_monitoring():
 
 if __name__ == "__main__":
-    train_args_json = load_s3_json(os.environ['train_args_path'])
-    merge_args_json = load_s3_json(os.environ['merge_args_path'])
-    datainfo = load_s3_json(os.environ['dataset_info_path'])
+    region_name = os.environ['REGION']
+    train_args_json = load_s3_json(os.environ['train_args_path'], region_name)
+    merge_args_json = load_s3_json(os.environ['merge_args_path'], region_name)
+    datainfo = load_s3_json(os.environ['dataset_info_path'], region_name)
 
     #save to data folder
     with open('/opt/ml/code/data/dataset_info.json', 'w') as f:
@@ -141,14 +153,6 @@ def stop_monitoring():
     GPUS_PER_NODE = int(os.environ["SM_NUM_GPUS"])
     DEVICES = ','.join([str(i) for i in range(GPUS_PER_NODE)])
 
-    # index_path = os.environ.get('PIP_INDEX')
-    # if index_path:
-    #     os.system(f"pip config set global.index-url {index_path}")
-    #     os.system(f"pip config set global.extra-index-url {index_path}")
-    #     os.system(f"pip install -r requirements_deps.txt")
-    # else:
-    #     os.system(f"pip install -r requirements_deps.txt")
-
     os.system("chmod +x ./s5cmd")
 
@@ -166,6 +170,9 @@ def stop_monitoring():
         s3_checkpoint = s3_checkpoint[:-1] if s3_checkpoint.endswith('/') else s3_checkpoint
         # download to local
         run_command(f'./s5cmd sync --exclude "checkpoint-*" {s3_checkpoint}/* /tmp/checkpoint/')
+        # change the model path to local
+        train_args = update_arg_value(train_args, "model_name_or_path", "/tmp/checkpoint/")
+
         #add resume_from_checkpoint arg
         train_args += " --resume_from_checkpoint /tmp/checkpoint/"
         print(f"resume_from_checkpoint {s3_checkpoint}")
@@ -180,9 +187,9 @@ def stop_monitoring():
         train_args = update_arg_value(train_args, "model_name_or_path", "/tmp/model_path/")
         print(f"s3 model_name_or_path {s3_model_path}")
 
-    if host_rank == 0:
-        # start the checkpoint monitoring process
-        start_monitoring()
+    # if host_rank == 0:
+    # start checkpoint monitoring on every node; checkpoints are saved across the nodes
+    start_monitoring()
 
     print(f'------envs------\nnum_machines:{num_machines}\nnum_processes:{num_processes}\nnode_rank:{host_rank}\n')
     if num_machines > 1:
@@ -198,9 +205,9 @@ def stop_monitoring():
             stop_monitoring()
             sys.exit(1)
 
-    if host_rank == 0:
-        # stop checkpoint monitoring
-        stop_monitoring()
+    # if host_rank == 0:
+    # stop checkpoint monitoring
+    stop_monitoring()
 
     if os.environ.get("merge_lora") == '1' and host_rank == 0:
         ## update model path as local folder as s3 provided

diff --git a/backend/env.sample b/backend/env.sample
index 1dac791..58b6797 100644
--- a/backend/env.sample
+++ b/backend/env.sample
@@ -11,6 +11,7 @@ api_keys=123456
 HUGGING_FACE_HUB_TOKEN=
 WANDB_API_KEY=
 WANDB_BASE_URL=
+SWANLAB_API_KEY=
 vllm_image=.dkr.ecr.us-east-1.amazonaws.com/sagemaker_endpoint/vllm:v0.6.4
 model_artifact=s3://sagemaker-us-east-1-/sagemaker_endpoint/vllm/model.tar.gz
 training_image=.dkr.ecr.us-east-1.amazonaws.com/llamafactory/llamafactory:0.9.2.dev0
\ No newline at end of file

diff --git a/backend/training/training_job.py b/backend/training/training_job.py
index 9323db6..baff637 100644
--- a/backend/training/training_job.py
+++ b/backend/training/training_job.py
@@ -17,7 +17,7 @@ import dotenv
 import os
 from utils.config import boto_sess,role,default_bucket,sagemaker_session, is_efa, \
-LORA_BASE_CONFIG,DEEPSPEED_BASE_CONFIG_MAP,FULL_BASE_CONFIG,DEFAULT_REGION,WANDB_API_KEY, WANDB_BASE_URL
+LORA_BASE_CONFIG,DEEPSPEED_BASE_CONFIG_MAP,FULL_BASE_CONFIG,DEFAULT_REGION,WANDB_API_KEY, WANDB_BASE_URL, SWANLAB_API_KEY
 
 dotenv.load_dotenv()
@@ -180,6 +180,13 @@ def create_training_args(self,
             doc['report_to'] = "wandb"
             timestp = to_datetime_string(time.time()).replace(' ', '_')
             doc['run_name'] = f"modelhub_run_{timestp}"
+
+        if SWANLAB_API_KEY:
+            doc['use_swanlab'] = True
+            timestp = to_datetime_string(time.time()).replace(' ', '_')
+            doc['swanlab_run_name'] = f"sagemaker_modelhub_run_{timestp}"
+            doc['swanlab_project'] = f"sagemaker_modelhub"
+            doc['swanlab_api_key'] = SWANLAB_API_KEY
 
         # training precision
         if job_payload['training_precision'] == 'bf16':
@@ -271,7 +278,8 @@ def create_training(self,
             "merge_args_path":sg_lora_merge_config,
             "train_args_path":sg_config,
             'OUTPUT_MODEL_S3_PATH': output_s3_path, # destination
-            "PIP_INDEX":'https://mirrors.aliyun.com/pypi/simple' if DEFAULT_REGION.startswith('cn') else '',
+            "REGION": DEFAULT_REGION,
+            # "PIP_INDEX":'https://mirrors.aliyun.com/pypi/simple' if DEFAULT_REGION.startswith('cn') else '',
             "USE_MODELSCOPE_HUB": "1" if DEFAULT_REGION.startswith('cn') else '0'
         }

diff --git a/backend/users/add_user.py b/backend/users/add_user.py
index 75866de..525b18d 100644
--- a/backend/users/add_user.py
+++ b/backend/users/add_user.py
@@ -1,5 +1,6 @@
 import sys
 sys.path.append('./')
+
 from db_management.database import DatabaseWrapper
 from logger_config import setup_logger
 import argparse

diff --git a/backend/utils/config.py b/backend/utils/config.py
index c317723..de7270e 100644
--- a/backend/utils/config.py
+++ b/backend/utils/config.py
@@ -52,6 +52,7 @@
                "stage_3":'examples/deepspeed/ds_z3_config.json'}
 WANDB_API_KEY = os.environ.get('WANDB_API_KEY','')
 WANDB_BASE_URL = os.environ.get('WANDB_BASE_URL','')
+SWANLAB_API_KEY = os.environ.get('SWANLAB_API_KEY','')
 
 # Load the persisted model list; it can be modified in endpoingt_management.py
 try:
@@ -87,4 +88,4 @@
 }
 
 def is_efa(instance_type):
-    return 'ml.p4' in instance_type or 'ml.p5' in instance_type
\ No newline at end of file
+    return 'ml.p4' in instance_type or 'ml.p5' in instance_type or 'ml.g6e' in instance_type or 'ml.g5' in instance_type
\ No newline at end of file

diff --git a/cloudformation-template.yaml b/cloudformation-template.yaml
index d28e5aa..46c81f7 100644
--- a/cloudformation-template.yaml
+++ b/cloudformation-template.yaml
@@ -32,6 +32,12 @@ Parameters:
     Description: Optional WANDB Base URL for view W&B own Wandb portal
     Default: ""
 
+  SwanlabApiKey:
+    Type: String
+    Description: Optional SwanLab API key for viewing metrics on https://swanlab.cn/
+    Default: ""
+
+
 Resources:
   EC2Instance:
     Type: AWS::EC2::Instance
@@ -135,6 +141,7 @@ Resources:
               HUGGING_FACE_HUB_TOKEN=${HuggingFaceHubToken}
               WANDB_API_KEY=${WandbApiKey}
               WANDB_BASE_URL=${WandbBaseUrl}
+              SWANLAB_API_KEY=${SwanlabApiKey}
               EOF
 
               # Set permissions
@@ -200,6 +207,7 @@ Resources:
           HuggingFaceHubToken: !Ref HuggingFaceHubToken
           WandbApiKey: !Ref WandbApiKey
           WandbBaseUrl: !Ref WandbBaseUrl
+          SwanlabApiKey: !Ref SwanlabApiKey
 
   EC2SecurityGroup:
     Type: AWS::EC2::SecurityGroup

diff --git a/cn-region-deploy.sh b/cn-region-deploy.sh
new file mode 100644
index 0000000..9a8a6f2
--- /dev/null
+++ b/cn-region-deploy.sh
@@ -0,0 +1,191 @@
+#!/bin/bash
+echo "######################## NOTE #####################"
+echo "Make sure the code was cloned with --recurse-submodule; check that backend/docker/LLaMA-Factory/ is not empty"
+# In China regions, clone manually beforehand:
+# git clone --recurse-submodule https://github.com/aws-samples/llm_model_hub.git
+# Set up the log file
+LOG_FILE="/home/ubuntu/setup.log"
+
+touch "$LOG_FILE"
+# Function: write a log entry
+log() {
+    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
+}
+
+log "Starting UserData script execution"
+sudo apt update
+sudo apt install -y git
+
+if ! command -v aws &> /dev/null; then
+    # install awscli
+    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
+    unzip awscliv2.zip
+    sudo ./aws/install
+fi
+
+# echo "##create sagemaker execution role"
+# # Create trust policy
+# echo '{
+#   "Version": "2012-10-17",
+#   "Statement": [
+#     {
+#       "Effect": "Allow",
+#       "Principal": {
+#         "Service": "sagemaker.amazonaws.com"
+#       },
+#       "Action": "sts:AssumeRole"
+#     }
+#   ]
+# }' > trust-policy.json
+
+# # Create S3 policy
+# echo '{
+#   "Version": "2012-10-17",
+#   "Statement": [
+#     {
+#       "Effect": "Allow",
+#       "Action": [
+#         "s3:GetObject",
+#         "s3:PutObject",
+#         "s3:DeleteObject",
+#         "s3:ListBucket",
+#         "s3:CreateBucket"
+#       ],
+#       "Resource": [
+#         "arn:aws-cn:s3:::*"
+#       ]
+#     }
+#   ]
+# }' > s3-policy.json
+
+# # Generate random suffix
+# RANDOM_SUFFIX=$(date +%s | sha256sum | base64 | head -c 8)
+# ROLE_NAME="sagemaker_execution_role_${RANDOM_SUFFIX}"
+# POLICY_NAME="sagemaker_s3_policy_${RANDOM_SUFFIX}"
+
+# # Create role and capture the ARN
+# SageMakerRoleArn=$(aws iam create-role \
+#     --role-name ${ROLE_NAME} \
+#     --assume-role-policy-document file://trust-policy.json \
+#     --query 'Role.Arn' --output text)
+
+# # Create policy
+# POLICY_ARN=$(aws iam create-policy \
+#     --policy-name ${POLICY_NAME} \
+#     --policy-document file://s3-policy.json \
+#     --query 'Policy.Arn' --output text)
+
+# # Attach policies
+# aws iam attach-role-policy \
+#     --role-name ${ROLE_NAME} \
+#     --policy-arn ${POLICY_ARN}
+
+# aws iam attach-role-policy \
+#     --role-name ${ROLE_NAME} \
+#     --policy-arn arn:aws-cn:iam::aws:policy/AmazonSageMakerFullAccess
+
+# # Clean up temporary files
+# rm trust-policy.json s3-policy.json
+
+# echo "Created role: ${ROLE_NAME}" >> "$LOG_FILE"
+# echo "Role ARN: ${SageMakerRoleArn}" >> "$LOG_FILE"
+
+
+# install nodejs
+log "Installing nodejs"
+curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
+sudo apt install -y nodejs
+sudo npm config set registry http://mirrors.cloud.tencent.com/npm/
+sudo npm install --global yarn
+# download file
+cd /home/ubuntu/
+# In China regions, clone manually beforehand:
+# git clone --recurse-submodule https://github.com/aws-samples/llm_model_hub.git
+cd /home/ubuntu/llm_model_hub
+yarn install
+# install pm2
+sudo yarn global add pm2
+
+# Wait a while to make sure the instance is fully started
+sleep 30
+
+log "Run cn setup script"
+# Run the China-region setup
+cd /home/ubuntu/llm_model_hub/backend/
+bash 0.setup-cn.sh
+
+# Try to get a token via IMDSv2
+TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
+
+# Get the EC2 instance's public IP
+EC2_IP=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/latest/meta-data/public-ipv4)
+# Get the current region and write it to the backend .env file
+REGION=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/latest/meta-data/placement/region)
+
+echo "Get IP:$EC2_IP and Region:$REGION " >> "$LOG_FILE"
+# Generate a random string key
+RANDOM_KEY=$(openssl rand -base64 32 | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1)
+# Write the EC2_IP to frontend .env file
+rm /home/ubuntu/llm_model_hub/.env
+rm /home/ubuntu/llm_model_hub/backend/.env
+echo "REACT_APP_API_ENDPOINT=http://$EC2_IP:8000/v1" > /home/ubuntu/llm_model_hub/.env
+echo "REACT_APP_API_KEY=$RANDOM_KEY" >> /home/ubuntu/llm_model_hub/.env
+echo "REACT_APP_CALCULATOR=https://aws-gpu-memory-caculator.streamlit.app/" >> /home/ubuntu/llm_model_hub/.env
+
+## write sagemaker role
+echo "AK=" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "SK=" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "role=${SageMakerRoleArn}" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "region=$REGION" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "db_host=127.0.0.1" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "db_name=llm" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "db_user=llmdata" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "db_password=llmdata" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "api_keys=$RANDOM_KEY" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "HUGGING_FACE_HUB_TOKEN=${HuggingFaceHubToken}" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "WANDB_API_KEY=${WANDB_API_KEY}" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "WANDB_BASE_URL=${WANDB_BASE_URL}" >> /home/ubuntu/llm_model_hub/backend/.env
+echo "SWANLAB_API_KEY=${SWANLAB_API_KEY}" >> /home/ubuntu/llm_model_hub/backend/.env
+# Set proper permissions
+sudo chown -R ubuntu:ubuntu /home/ubuntu/
+RANDOM_PASSWORD=$(openssl rand -base64 12 | tr -dc 'a-zA-Z0-9' | fold -w 8 | head -n 1)
+aws ssm put-parameter --name "/modelhub/RandomPassword" --value "$RANDOM_PASSWORD" --type "SecureString" --overwrite --region "$REGION"
+cd /home/ubuntu/llm_model_hub/backend
+bash 01.setup.sh
+sleep 30
+# add user in db
+source ../miniconda3/bin/activate py311
+conda activate py311
+python3 users/add_user.py demo_user $RANDOM_PASSWORD default
+
+# build vllm image
+cd /home/ubuntu/llm_model_hub/backend/byoc
+bash build_and_push.sh
+sleep 5
+
+# Build and push the llamafactory image
+log "Building and pushing llamafactory image"
+cd /home/ubuntu/llm_model_hub/backend/docker
+bash build_and_push.sh || { log "Failed to build and push llamafactory image"; exit 1; }
+sleep 5
+
+# upload dummy tar.gz
+cd /home/ubuntu/llm_model_hub/backend/byoc
+../../miniconda3/envs/py311/bin/python startup.py
+
+# start backend
+cd /home/ubuntu/llm_model_hub/backend/
+bash 02.start_backend.sh
+sleep 15
+
+# start frontend
+cd /home/ubuntu/llm_model_hub/
+pm2 start pm2run.config.js
+echo "Webui=http://$EC2_IP:3000"
+echo "username=demo_user"
+echo "RandomPassword=$RANDOM_PASSWORD"
+echo "Run User Data Script Done! "
+echo "Webui=http://$EC2_IP:3000" >> "$LOG_FILE"
+echo "username=demo_user" >> "$LOG_FILE"
+echo "RandomPassword=$RANDOM_PASSWORD" >> "$LOG_FILE"
+echo "Run User Data Script Done! " >> "$LOG_FILE"
\ No newline at end of file
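Several hunks in the build scripts replace `[ "$region" == cn* ]` / `[[ "$region" == cn* ]]` with `[[ $region =~ ^cn ]]`. The single-bracket form is the real bug: inside POSIX `[ ]`, `==` performs no glob matching, so `cn*` only matches the literal string `cn*`. A minimal standalone sketch of the suffix logic (the function name is illustrative, not from the repo):

```shell
# Map an AWS region to the ECR domain suffix.
# Inside POSIX [ ], `==` does no pattern matching, so [ "$region" == cn* ]
# only matches the literal string "cn*". Bash [[ ]] with =~ (or == cn*)
# performs the intended prefix test.
region_suffix() {
  local region="$1" suffix="com"
  if [[ $region =~ ^cn ]]; then
    suffix="com.cn"
  fi
  echo "$suffix"
}

region_suffix cn-north-1   # -> com.cn
region_suffix us-east-1    # -> com
```

The same predicate appears on the Python side as `DEFAULT_REGION.startswith('cn')` in training_job.py.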
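In the train.py hunks, `load_s3_json` splits an `s3://` URI with `urllib.parse.urlparse` before calling `get_object`. The parsing step on its own (the helper name here is illustrative, not the repo's):

```python
from urllib.parse import urlparse

def split_s3_uri(s3_path: str) -> tuple:
    # urlparse puts the bucket in netloc and the key in path (with a leading /)
    parsed = urlparse(s3_path)
    return parsed.netloc, parsed.path.lstrip('/')

bucket, key = split_s3_uri("s3://my-bucket/configs/train_args.json")
print(bucket, key)  # -> my-bucket configs/train_args.json
```

The diff also threads `region_name` into `boto3.client('s3', region_name)`; `region_name` is the second positional parameter of `boto3.client`, so that call pins the client to the training job's region.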
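The new `delete_arg` helper (and the existing `update_arg_value`) in train.py treat the launch command as whitespace-separated `--flag value` pairs. A simplified sketch of the idea; note the diff's version matches flags with `startswith`, so `--lr` would also match `--lr_scheduler` — the exact-match variant below is an assumption, not the repo's code:

```python
def delete_arg(args_string: str, arg_name: str) -> str:
    # Drop "--arg_name <value>" from a flat "--k v --k v" string
    parts = args_string.split()
    for i, part in enumerate(parts):
        if part == f"--{arg_name}":
            del parts[i:i + 2]  # remove the flag and its value together
            break
    return " ".join(parts)

def update_arg_value(args_string: str, arg_name: str, new_value: str) -> str:
    # Replace the value that follows "--arg_name"
    parts = args_string.split()
    for i, part in enumerate(parts):
        if part == f"--{arg_name}" and i + 1 < len(parts):
            parts[i + 1] = str(new_value)
            break
    return " ".join(parts)

args = "--model_name_or_path s3://bucket/model --learning_rate 1e-4"
print(update_arg_value(args, "model_name_or_path", "/tmp/model_path/"))
# -> --model_name_or_path /tmp/model_path/ --learning_rate 1e-4
print(delete_arg(args, "learning_rate"))
# -> --model_name_or_path s3://bucket/model
```

This is exactly what the resume-from-checkpoint hunk relies on when it rewrites `model_name_or_path` to `/tmp/checkpoint/` before appending `--resume_from_checkpoint`.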
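For context, `dict_to_cmd_args` in train.py is what flattens the JSON training config into the CLI string that `update_arg_value`/`delete_arg` later edit. A caveat worth noting (an observation, not a fix in this diff): values containing spaces would break the later `split()`-based helpers.

```python
def dict_to_cmd_args(doc: dict) -> str:
    # {"learning_rate": "1e-4"} -> "--learning_rate 1e-4"
    return " ".join(f"--{key} {value}" for key, value in doc.items())

print(dict_to_cmd_args({"learning_rate": "1e-4", "num_train_epochs": 3}))
# -> --learning_rate 1e-4 --num_train_epochs 3
```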