You can use AWS's LMI (Large Model Inference) container to run vLLM on Amazon SageMaker easily. However, the vLLM version bundled with LMI lags several releases behind the latest community version. If you want to run the latest version, try this repo!
Make sure you have the following tools installed:
- AWS CLI (run `aws configure` to set up credentials)
- Docker
- Python 3
Start by setting up some environment variables. Adjust them as needed:
```bash
export REGION='us-east-1'                          # change as needed
export IMG_NAME='vllm-on-sagemaker'                # change as needed
export IMG_TAG='latest'                            # change as needed
export SAGEMAKER_ENDPOINT_NAME='vllm-on-sagemaker' # change as needed
```
Build the Docker image that will be used to run the SageMaker Endpoint serving container. After building, the image is pushed to Amazon ECR. The container implements the `/ping` and `/invocations` APIs, as required by SageMaker Endpoints.
```bash
sagemaker/build_and_push_image.sh --region "$REGION" --image-name "$IMG_NAME" --tag "$IMG_TAG"
```
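For reference, a SageMaker serving container only needs to answer those two routes. Below is a minimal, purely illustrative sketch using FastAPI; the repo's actual server implementation may differ:

```python
# Illustrative sketch only -- not this repo's actual server code.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/ping")
async def ping():
    # SageMaker's health check: any HTTP 200 means the container is healthy.
    return JSONResponse(content={}, status_code=200)

@app.post("/invocations")
async def invocations(request: Request):
    # SageMaker forwards the client's request body unchanged; a vLLM-backed
    # server would hand this OpenAI-style payload to the engine here.
    payload = await request.json()
    return JSONResponse(content={"echo": payload})  # placeholder response
```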
After the image is built and pushed, retrieve the image URI:
```bash
export IMG_URI=$(sagemaker/get_ecr_image_uri.sh --region "$REGION" --img-name "$IMG_NAME" --tag "$IMG_TAG")
echo "$IMG_URI"
```
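If you prefer not to use the helper script, ECR image URIs follow a fixed naming scheme, so you can reconstruct the URI yourself. A sketch in Python (boto3), with values mirroring the variables above:

```python
# Illustrative only: an ECR image URI always has this shape, which is what
# get_ecr_image_uri.sh resolves for you.
import boto3

region = "us-east-1"            # matches $REGION
img_name = "vllm-on-sagemaker"  # matches $IMG_NAME
img_tag = "latest"              # matches $IMG_TAG

account_id = boto3.client("sts").get_caller_identity()["Account"]
img_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{img_name}:{img_tag}"
print(img_uri)
```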
Create a SageMaker execution role to allow the endpoint to run properly:
```bash
export SM_ROLE=$(sagemaker/create_sagemaker_execute_role.sh)
echo "$SM_ROLE"
```
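For context, an execution role like this needs a trust policy that lets SageMaker assume it, plus sufficient permissions (for example, the AmazonSageMakerFullAccess managed policy). A hedged boto3 sketch with a hypothetical role name; the repo's script may set this up differently:

```python
# Hedged sketch of a SageMaker execution role; role name is hypothetical.
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the SageMaker service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="vllm-on-sagemaker-role",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="vllm-on-sagemaker-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)
print(role["Role"]["Arn"])
```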
Now, create the SageMaker Endpoint. Choose the appropriate Hugging Face model ID and instance type:
```bash
python3 sagemaker/create_sagemaker_endpoint.py \
    --region "$REGION" \
    --model_id "deepseek-ai/deepseek-llm-7b-chat" \
    --instance_type ml.g5.4xlarge \
    --role_arn "$SM_ROLE" \
    --image_uri "$IMG_URI" \
    --endpoint_name "$SAGEMAKER_ENDPOINT_NAME"
```
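Under the hood, creating an endpoint is SageMaker's standard three-step flow: create a model, an endpoint config, and the endpoint itself. The sketch below illustrates that flow with boto3; the environment-variable name `MODEL_ID` and other details are assumptions, not necessarily what the script does:

```python
# Hedged sketch of the three-step endpoint creation flow.
import os
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")
sm_role_arn = os.environ["SM_ROLE"]  # set earlier in this walkthrough
image_uri = os.environ["IMG_URI"]

# 1. Register the container + model configuration as a SageMaker Model.
sm.create_model(
    ModelName="vllm-on-sagemaker",
    ExecutionRoleArn=sm_role_arn,
    PrimaryContainer={
        "Image": image_uri,
        # Hypothetical: how the model ID might be passed to the container.
        "Environment": {"MODEL_ID": "deepseek-ai/deepseek-llm-7b-chat"},
    },
)

# 2. Describe the fleet that will serve it.
sm.create_endpoint_config(
    EndpointConfigName="vllm-on-sagemaker-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "vllm-on-sagemaker",
        "InstanceType": "ml.g5.4xlarge",
        "InitialInstanceCount": 1,
    }],
)

# 3. Launch the endpoint (asynchronous; provisioning takes several minutes).
sm.create_endpoint(
    EndpointName="vllm-on-sagemaker",
    EndpointConfigName="vllm-on-sagemaker-config",
)
```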
Go to the AWS console -> SageMaker -> Inference -> Endpoints. You should see the endpoint being created; wait until the creation process completes. Once the endpoint reaches `InService` status, you can start sending requests to it.
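If you'd rather not watch the console, you can also wait programmatically with boto3's built-in waiter (this is not part of the repo's scripts):

```python
# Block until the endpoint reaches InService (or fail if creation fails).
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName="vllm-on-sagemaker")  # matches $SAGEMAKER_ENDPOINT_NAME
print(sm.describe_endpoint(EndpointName="vllm-on-sagemaker")["EndpointStatus"])
```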
You can call the endpoint through the SageMaker `/invocations` API, which is compatible with the OpenAI chat completions API. See `sagemaker/test_endpoint.py` for example requests.
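For a quick smoke test without the script, a minimal boto3 call might look like this (the payload fields assume OpenAI chat-completions compatibility, as described above):

```python
# Send one OpenAI-style chat request to the endpoint via boto3.
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

payload = {
    "messages": [{"role": "user", "content": "Hello! Who are you?"}],
    "max_tokens": 256,
}
response = runtime.invoke_endpoint(
    EndpointName="vllm-on-sagemaker",  # matches $SAGEMAKER_ENDPOINT_NAME
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```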