update vllm deployment and add a local testing script #1

Open · wants to merge 4 commits into base: ray-vllm-gpu
Conversation

omrishiv

I'll go through the files as they sit in the PR and explain the changes:

Dockerfile:

  • Move the parts that don't change to the top of the Dockerfile and condense the RUN commands into one. Fewer layers, and if you have to iterate on the file, the cached layers will speed up your builds/pushes. More on that in a minute.

ray-service-vllm.yaml

  • Since the head node isn't doing much, at least for this example, drop the CPU count
  • Added the observability variables to enable the dashboards
  • Put the HF secret on the worker. The head node only deploys the script to the worker, which is what actually runs it, so the worker needs the env var.

serve.py

  • A simple serving test script. Change the port to whatever port you port-forward (a sketch of what this could look like follows below).

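A minimal sketch of such a test script; the endpoint URL, port, and "prompt" key are assumptions here, so match them to your port-forward and to whatever vllm_serve.py actually reads from the request JSON:

```python
# Minimal local test for the Ray Serve vLLM endpoint.
# Assumes you've port-forwarded the Serve HTTP port, e.g.:
#   kubectl port-forward svc/<ray-head-service> 8000:8000
import requests

URL = "http://localhost:8000/"  # change the port to whatever you port-forward

def main():
    resp = requests.post(
        URL,
        json={"prompt": "What is the capital of France?"},
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.text)

if __name__ == "__main__":
    main()
```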
vllm_serve.py

  • Not sure if you need the change to how the env var is read, but more importantly:
  • Drop the model's max_model_len so the KV cache fits in GPU memory. You'll get an error on startup if you don't (see the sketch below).

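As an illustration, a hedged sketch of capping max_model_len where the deployment is bound; the exact kwargs and how VLLMDeployment forwards them to the vLLM engine are assumptions, not the repo's actual signature:

```python
# Sketch only: assumes VLLMDeployment passes these kwargs through to vLLM's
# AsyncEngineArgs. 8192 is an example value; pick whatever fits the KV cache
# on your GPU for this model (its native 32k context may not fit).
deployment = VLLMDeployment.bind(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    max_model_len=8192,
    gpu_memory_utilization=0.9,  # optional; also affects available KV cache
)
```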
Some other general tips I've found along the way:

  • Add karpenter.sh/do-not-disrupt: true annotations to your head/worker nodes so Karpenter doesn't scale them down each time you need to restart your pod after changing the script
  • Create a ConfigMap for your Python model code and mount it on the head node at the same path you copy it to in your Dockerfile. This is much faster than constantly updating the code, building the image, and pushing it.

@omrishiv
Author

Take a look at serve.py and update the Gradio script. I had to make a change there to your requests.get: it should be a POST with a JSON body that uses a different key (see the sketch below).

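For reference, a sketch of what the Gradio-side change could look like, assuming the Serve endpoint reads a "prompt" key from the JSON body; the URL and key are assumptions, so match them to vllm_serve.py:

```python
import gradio as gr
import requests

SERVE_URL = "http://localhost:8000/"  # assumes a port-forwarded Serve HTTP port

def generate(prompt: str) -> str:
    # Previously a requests.get; the deployment expects a POST with a JSON
    # body, and the key it reads ("prompt" here) may differ from the old one.
    resp = requests.post(SERVE_URL, json={"prompt": prompt}, timeout=120)
    resp.raise_for_status()
    return resp.text

gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```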
Comment on lines +87 to +93
- name: RAY_GRAFANA_HOST
  value: FILLIN
- name: RAY_PROMETHEUS_HOST
  value: >-
    FILLIN
- name: RAY_GRAFANA_IFRAME_HOST
  value: FILLIN


This blueprint also deploys the Kube Prometheus stack. Can we update these to the local service addresses with the port?

Author


I don't know what those are for your deployment. I'll let @ratnopamc add them

gen-ai/inference/vllm-rayserve-gpu/ray-service-vllm.yaml (outdated; conversation resolved)
@@ -92,4 +93,4 @@ async def __call__(self, request: Request) -> Response:
deployment = VLLMDeployment.bind(model="mistralai/Mistral-7B-Instruct-v0.2",


We should be able to pass all of the vLLM config through for each model (see the sketch below).

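One way that could look, sketched under the assumption that the deployment's constructor accepts arbitrary engine kwargs and forwards them to vLLM's AsyncEngineArgs (the names and plumbing are illustrative, not the file's current code):

```python
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

@serve.deployment
class VLLMDeployment:
    def __init__(self, model: str, **engine_kwargs):
        # Forward any per-model vLLM setting (max_model_len, dtype,
        # tensor_parallel_size, gpu_memory_utilization, ...) unchanged.
        args = AsyncEngineArgs(model=model, **engine_kwargs)
        self.engine = AsyncLLMEngine.from_engine_args(args)

deployment = VLLMDeployment.bind(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    max_model_len=8192,
    dtype="bfloat16",
)
```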
ratnopamc pushed a commit that referenced this pull request Aug 29, 2024
* feat: run GPU node with BR and EBS snapshot with container image cache

* refactor: remove kubectl_manifest of karpenter custom resources

* feat: locust file for load testing

* feat: End-to-end deployment of Bottlerocket nodes with container image cache