update vllm deployment and add a local testing script #1

Open · wants to merge 4 commits into base: ray-vllm-gpu
Conversation

omrishiv

I'll go through the files as they sit in the PR and explain the changes:

Dockerfile:

  • Move the parts that don't change to the top of the Dockerfile and condense the RUN commands into one. Fewer layers, and if you have to iterate on the file, the cached layers will speed up your builds/pushes. More on that in a minute.

ray-service-vllm.yaml

  • Since the head node isn't doing much, at least for this example, drop the CPU count
  • Added the observability variables to enable the dashboards
  • Put the HF secret on the worker. The head node only deploys the script to the worker, which is what actually runs it, so the worker needs the env var.

serve.py

  • A simple serving test script. Change the port to whatever port you port-forward (a sketch of what this could look like follows below).

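A minimal sketch of such a test script; the endpoint URL, port, and "prompt" key are assumptions here, so match them to your port-forward and to whatever vllm_serve.py actually reads from the request JSON:

```python
# Minimal local test for the Ray Serve vLLM endpoint.
# Assumes you've port-forwarded the Serve HTTP port, e.g.:
#   kubectl port-forward svc/<ray-head-service> 8000:8000
import requests

URL = "http://localhost:8000/"  # change the port to whatever you port-forward

def main():
    resp = requests.post(
        URL,
        json={"prompt": "What is the capital of France?"},
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.text)

if __name__ == "__main__":
    main()
```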
vllm_serve.py

  • Not sure if you need the change to how the env var is read, but more importantly:
  • Drop the model's max_model_len so the KV cache fits in GPU memory. You'll get an error on startup if you don't (see the sketch below).

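As an illustration, a hedged sketch of capping max_model_len where the deployment is bound; the exact kwargs and how VLLMDeployment forwards them to the vLLM engine are assumptions, not the repo's actual signature:

```python
# Sketch only: assumes VLLMDeployment passes these kwargs through to vLLM's
# AsyncEngineArgs. 8192 is an example value; pick whatever fits the KV cache
# on your GPU for this model (its native 32k context may not fit).
deployment = VLLMDeployment.bind(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    max_model_len=8192,
    gpu_memory_utilization=0.9,  # optional; also affects available KV cache
)
```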
Some other general tips I've found along the way:

  • Add karpenter.sh/do-not-disrupt: true annotations to your head/worker nodes so Karpenter doesn't scale them down each time you need to restart your pod after changing the script
  • Create a ConfigMap for your Python model code and mount it on the head node at the same path you copy it to in your Dockerfile. This is much faster than constantly updating the code, building the image, and pushing it.

@omrishiv
Author

Take a look at serve.py and update the Gradio script. I had to make a change there to your requests.get: it should be a POST with a JSON body that uses a different key (see the sketch below).

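For reference, a sketch of what the Gradio-side change could look like, assuming the Serve endpoint reads a "prompt" key from the JSON body; the URL and key are assumptions, so match them to vllm_serve.py:

```python
import gradio as gr
import requests

SERVE_URL = "http://localhost:8000/"  # assumes a port-forwarded Serve HTTP port

def generate(prompt: str) -> str:
    # Previously a requests.get; the deployment expects a POST with a JSON
    # body, and the key it reads ("prompt" here) may differ from the old one.
    resp = requests.post(SERVE_URL, json={"prompt": prompt}, timeout=120)
    resp.raise_for_status()
    return resp.text

gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```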
Comment on lines +87 to +93
- name: RAY_GRAFANA_HOST
  value: FILLIN
- name: RAY_PROMETHEUS_HOST
  value: >-
    FILLIN
- name: RAY_GRAFANA_IFRAME_HOST
  value: FILLIN


This blueprint also deploys the Kube Prometheus stack. Can we update these to the local service addresses with the port?

Author


I don't know what those are for your deployment. I'll let @ratnopamc add them

gen-ai/inference/vllm-rayserve-gpu/ray-service-vllm.yaml (outdated; conversation resolved)
@@ -92,4 +93,4 @@ async def __call__(self, request: Request) -> Response:
deployment = VLLMDeployment.bind(model="mistralai/Mistral-7B-Instruct-v0.2",


We should be able to pass all of the vLLM config through for each model (see the sketch below).

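One way that could look, sketched under the assumption that the deployment's constructor accepts arbitrary engine kwargs and forwards them to vLLM's AsyncEngineArgs (the names and plumbing are illustrative, not the file's current code):

```python
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

@serve.deployment
class VLLMDeployment:
    def __init__(self, model: str, **engine_kwargs):
        # Forward any per-model vLLM setting (max_model_len, dtype,
        # tensor_parallel_size, gpu_memory_utilization, ...) unchanged.
        args = AsyncEngineArgs(model=model, **engine_kwargs)
        self.engine = AsyncLLMEngine.from_engine_args(args)

deployment = VLLMDeployment.bind(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    max_model_len=8192,
    dtype="bfloat16",
)
```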
ratnopamc pushed a commit that referenced this pull request Aug 29, 2024
* feat: run GPU node with BR and EBS snapshot with container image cache

* refactor: remove kubectl_manifest of karpenter custom resources

* feat: locust file for load testing

* feat: End-to-end deployment of Bottlerocket nodes with container image cache