You can use AWS's LMI (Large Model Inference) container to run vLLM on Amazon SageMaker easily. However, the vLLM version bundled with LMI lags several releases behind the latest community version. If you want to run the latest version, try this repo!
Make sure you have the following tools installed:
- AWS CLI (run `aws configure` to set up credentials)
- Docker
- Python 3
Start by setting up some environment variables. Adjust them as needed:
```bash
export REGION='us-east-1'                          # change as needed
export IMG_NAME='vllm-on-sagemaker'                # change as needed
export IMG_TAG='latest'                            # change as needed
export SAGEMAKER_ENDPOINT_NAME='vllm-on-sagemaker' # change as needed
```
Build the Docker image that will be used to run the SageMaker Endpoint serving container. After building, the image is pushed to Amazon ECR. The container implements the `/ping` and `/invocations` APIs, as required by SageMaker Endpoints.
```bash
sagemaker/build_and_push_image.sh --region "$REGION" --image-name "$IMG_NAME" --tag "$IMG_TAG"
```
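For reference, a SageMaker serving container only needs to answer those two routes. Below is a minimal, purely illustrative sketch using FastAPI; the repo's actual server implementation may differ:

```python
# Illustrative sketch only -- not this repo's actual server code.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/ping")
async def ping():
    # SageMaker's health check: any HTTP 200 means the container is healthy.
    return JSONResponse(content={}, status_code=200)

@app.post("/invocations")
async def invocations(request: Request):
    # SageMaker forwards the client's request body unchanged; a vLLM-backed
    # server would hand this OpenAI-style payload to the engine here.
    payload = await request.json()
    return JSONResponse(content={"echo": payload})  # placeholder response
```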
After the image is built and pushed, retrieve the image URI:
```bash
export IMG_URI=$(sagemaker/get_ecr_image_uri.sh --region "$REGION" --img-name "$IMG_NAME" --tag "$IMG_TAG")
echo "$IMG_URI"
```
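If you prefer not to use the helper script, ECR image URIs follow a fixed naming scheme, so you can reconstruct the URI yourself. A sketch in Python (boto3), with values mirroring the variables above:

```python
# Illustrative only: an ECR image URI always has this shape, which is what
# get_ecr_image_uri.sh resolves for you.
import boto3

region = "us-east-1"            # matches $REGION
img_name = "vllm-on-sagemaker"  # matches $IMG_NAME
img_tag = "latest"              # matches $IMG_TAG

account_id = boto3.client("sts").get_caller_identity()["Account"]
img_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{img_name}:{img_tag}"
print(img_uri)
```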
Create a SageMaker execution role to allow the endpoint to run properly:
```bash
export SM_ROLE=$(sagemaker/create_sagemaker_execute_role.sh)
echo "$SM_ROLE"
```
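For context, an execution role like this needs a trust policy that lets SageMaker assume it, plus sufficient permissions (for example, the AmazonSageMakerFullAccess managed policy). A hedged boto3 sketch with a hypothetical role name; the repo's script may set this up differently:

```python
# Hedged sketch of a SageMaker execution role; role name is hypothetical.
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the SageMaker service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="vllm-on-sagemaker-role",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="vllm-on-sagemaker-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)
print(role["Role"]["Arn"])
```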
Now, create the SageMaker Endpoint. Choose the appropriate Hugging Face model ID and instance type:
```bash
python3 sagemaker/create_sagemaker_endpoint.py \
    --region "$REGION" \
    --model_id "deepseek-ai/deepseek-llm-7b-chat" \
    --instance_type ml.g5.4xlarge \
    --role_arn "$SM_ROLE" \
    --image_uri "$IMG_URI" \
    --endpoint_name "$SAGEMAKER_ENDPOINT_NAME"
```
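Under the hood, creating an endpoint is SageMaker's standard three-step flow: create a model, an endpoint config, and the endpoint itself. The sketch below illustrates that flow with boto3; the environment-variable name `MODEL_ID` and other details are assumptions, not necessarily what the script does:

```python
# Hedged sketch of the three-step endpoint creation flow.
import os
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")
sm_role_arn = os.environ["SM_ROLE"]  # set earlier in this walkthrough
image_uri = os.environ["IMG_URI"]

# 1. Register the container + model configuration as a SageMaker Model.
sm.create_model(
    ModelName="vllm-on-sagemaker",
    ExecutionRoleArn=sm_role_arn,
    PrimaryContainer={
        "Image": image_uri,
        # Hypothetical: how the model ID might be passed to the container.
        "Environment": {"MODEL_ID": "deepseek-ai/deepseek-llm-7b-chat"},
    },
)

# 2. Describe the fleet that will serve it.
sm.create_endpoint_config(
    EndpointConfigName="vllm-on-sagemaker-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "vllm-on-sagemaker",
        "InstanceType": "ml.g5.4xlarge",
        "InitialInstanceCount": 1,
    }],
)

# 3. Launch the endpoint (asynchronous; provisioning takes several minutes).
sm.create_endpoint(
    EndpointName="vllm-on-sagemaker",
    EndpointConfigName="vllm-on-sagemaker-config",
)
```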
Go to the AWS console -> SageMaker -> Inference -> Endpoints. You should see the endpoint being created; wait until the creation process completes. Once the endpoint reaches `InService` status, you can start sending requests to it.
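If you'd rather not watch the console, you can also wait programmatically with boto3's built-in waiter (this is not part of the repo's scripts):

```python
# Block until the endpoint reaches InService (or fail if creation fails).
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName="vllm-on-sagemaker")  # matches $SAGEMAKER_ENDPOINT_NAME
print(sm.describe_endpoint(EndpointName="vllm-on-sagemaker")["EndpointStatus"])
```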
You can call the endpoint through the SageMaker `/invocations` API, which is compatible with the OpenAI chat completions API. See `sagemaker/test_endpoint.py` for example requests.
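For a quick smoke test without the script, a minimal boto3 call might look like this (the payload fields assume OpenAI chat-completions compatibility, as described above):

```python
# Send one OpenAI-style chat request to the endpoint via boto3.
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

payload = {
    "messages": [{"role": "user", "content": "Hello! Who are you?"}],
    "max_tokens": 256,
}
response = runtime.invoke_endpoint(
    EndpointName="vllm-on-sagemaker",  # matches $SAGEMAKER_ENDPOINT_NAME
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```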