update vllm deployment and add a local testing script #1
base: ray-vllm-gpu
Conversation
Take a look at:

```yaml
- name: RAY_GRAFANA_HOST
  value: FILLIN
- name: RAY_PROMETHEUS_HOST
  value: >-
    FILLIN
- name: RAY_GRAFANA_IFRAME_HOST
  value: FILLIN
```
This blueprint also deploys the Kube Prometheus stack. Can we update these to localhost services with the port?
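One possible shape for those values, assuming common kube-prometheus-stack defaults (Grafana on 3000, Prometheus on 9090); the actual hosts and ports depend on how the stack is exposed in this blueprint:

```yaml
- name: RAY_GRAFANA_HOST
  value: http://localhost:3000   # assumed Grafana port; adjust to your install
- name: RAY_PROMETHEUS_HOST
  value: http://localhost:9090   # assumed Prometheus port
- name: RAY_GRAFANA_IFRAME_HOST
  value: http://localhost:3000   # host the browser uses to embed Grafana panels
```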
I don't know what those are for your deployment. I'll let @ratnopamc add them
@@ -92,4 +93,4 @@ async def __call__(self, request: Request) -> Response:

```python
deployment = VLLMDeployment.bind(model="mistralai/Mistral-7B-Instruct-v0.2",
```
We should be able to send all of the vLLM config options for each model.
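A minimal sketch of what that could look like: a per-model dict of engine options forwarded as keyword arguments into the deployment binding. The `engine_configs` name and the option keys are assumptions for illustration, not the actual PR code, and `bind` here is a stand-in for `VLLMDeployment.bind` so the wiring can be shown without Ray installed:

```python
# Hypothetical per-model vLLM engine options, forwarded verbatim so any
# AsyncEngineArgs field can be tuned per model rather than hard-coded.
engine_configs = {
    "mistralai/Mistral-7B-Instruct-v0.2": {
        "max_model_len": 8192,
        "gpu_memory_utilization": 0.90,
        "tensor_parallel_size": 1,
    },
}

def bind_deployment(model: str, bind=None):
    """Merge the model name with its engine options into one kwargs dict.

    `bind` stands in for VLLMDeployment.bind; by default it just returns
    the kwargs so the merge can be inspected without a Ray cluster.
    """
    bind = bind or (lambda **kwargs: kwargs)
    opts = engine_configs.get(model, {})
    return bind(model=model, **opts)
```

Calling `bind_deployment("mistralai/Mistral-7B-Instruct-v0.2")` with the default stub returns the merged kwargs; with Ray present you would pass `VLLMDeployment.bind` as `bind`.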
* feat: run GPU node with BR and EBS snapshot with container image cache
* refactor: remove kubectl_manifest of karpenter custom resources
* feat: locust file for load testing
* feat: End-to-end deployment of Bottlerocket nodes with container image cache
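For local testing against the deployed endpoint, a small client sketch can stand in for the full load-test file. The URL and the request schema (`prompt`, `max_tokens`, `temperature`) are assumptions about the Serve endpoint, not the PR's actual script:

```python
import json
import urllib.request

SERVE_URL = "http://localhost:8000"  # assumed local Ray Serve endpoint

def build_payload(prompt: str, max_tokens: int = 128, temperature: float = 0.7) -> bytes:
    """Encode one request body; field names are an assumed schema."""
    return json.dumps(
        {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    ).encode()

def query(prompt: str) -> str:
    """POST a single prompt to the local deployment and return the raw response text."""
    req = urllib.request.Request(
        SERVE_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```

With the service port-forwarded locally, `query("Hello")` sends one request; wrapping `query` in a loop or a Locust task gives a simple load test.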
Going to go through the files as they sit in the PR to explain:

- `Dockerfile`
- `ray-service-vllm.yaml`
- `serve.py`
- `vllm_serve.py`: set `max_model_len` to fit the KV cache. You'll get an error if you don't.

Some other general tips I've found along the way:

- Add `karpenter.sh/do-not-disrupt: true` annotations to your head/worker nodes so Karpenter doesn't scale them back each time you need to restart your pod for changing the script.
- Update the serve script in place rather than rebuilding the `Dockerfile`. This is much faster than having to constantly update the code, build the image, and push the image.
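To see why `max_model_len` matters, here is a back-of-the-envelope KV cache estimate. The default dimensions below assume a Mistral-7B-style model (32 layers, 8 grouped-query KV heads, head dim 128) in fp16; they are illustrative assumptions, not measured values:

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Per-sequence KV cache size: a K and a V tensor per layer per token.

    Defaults assume Mistral-7B-style GQA dimensions in fp16 (illustrative).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token

# An 8192-token context needs about 1 GiB of KV cache per sequence under
# these assumptions, which must fit in GPU memory left over after weights.
print(kv_cache_bytes(8192) / 2**30)
```

If `max_model_len` implies more KV cache than the GPU has free, vLLM fails at startup, which is the error the tip above refers to; lowering `max_model_len` or raising `gpu_memory_utilization` are the usual knobs.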