Optimal EC2 configuration and vLLM settings for max concurrency? #9773
hughesadam87 started this conversation in General
Replies: 1 comment
-
Going a bit further, I see the issue is that GPU KV cache usage reaches 99% and pending requests then pile up. I guess my question, then, is which engine flags would let me stretch the GPU cache under concurrent requests?
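For reference, the flags that control how much GPU memory goes to the KV cache and how many requests compete for it can be set in the compose `command:`. A minimal sketch with illustrative values, not tuned for this g4dn.12xlarge setup:

```yaml
# Illustrative values only -- not tuned for this hardware or model.
command:
  - "--gpu-memory-utilization=0.95"  # default is 0.9; a higher value leaves more VRAM for the KV cache
  - "--max-model-len=8192"           # caps context length, so a single request can't claim a huge slice of the cache
  - "--max-num-seqs=64"              # caps how many sequences are batched per step, limiting cache pressure from concurrency
```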
-
Thank you for such a great open source project.
Use Case
We're building a chatbot and aiming for consistent, responsive performance under concurrent user loads. At ~15 requests, processing delays reach up to 30 seconds before streaming begins. Though streaming speed is good, we'd prefer requests to start sooner, even if they stream slower. We're also seeking optimal vLLM settings for our hardware.
Configuration
vLLM running on a g4dn.12xlarge with the AMI "Deep Learning Base Proprietary Nvidia Driver AMI (Amazon Linux 2) Version 61.1".
**Model** is Pixtral70b
docker-compose:
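The compose file itself isn't reproduced here; as a minimal sketch, assuming the official `vllm/vllm-openai` image and tensor parallelism across all four GPUs, it would look something like the following, with the model name left as a placeholder:

```yaml
# Minimal illustrative compose file -- image tag, ports, and model id are placeholders.
services:
  vllm:
    image: vllm/vllm-openai:latest
    ipc: host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    command:
      - "--model=<pixtral-model-id>"   # placeholder; the Pixtral checkpoint in use goes here
      - "--tensor-parallel-size=4"     # shard across the 4 T4 GPUs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Starting it with `docker compose up` exposes the OpenAI-compatible API on port 8000.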
The instance is underutilized, showing only ~6% CPU utilization (of the 48 available vCPUs). When getting slammed with requests, GPU utilization is near 100% on each of the 4 GPUs, but GPU memory utilization sits around 63% per GPU.
All vLLM settings are using the defaults.
Questions/Concerns
The vLLM engine args doc leaves us with some questions:
- Is `--max-num-batched-tokens=512` too low? Would increasing it help with concurrent request handling?
- Would `--cpu-offload-gb` or `--swap-space` improve performance?
- Should we raise `--gpu-memory-utilization` from the 0.9 default?
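As a rough sketch of how the flags in question would sit together in the compose `command:` (values are illustrative, not recommendations, and note that `--cpu-offload-gb` moves model weights rather than KV cache):

```yaml
# Illustrative values only -- each flag trades latency or memory somewhere else.
command:
  - "--max-num-batched-tokens=2048"  # allows larger prefill batches per step; each step takes longer
  - "--swap-space=16"                # GiB of CPU RAM per GPU for preempted requests' KV blocks
  - "--cpu-offload-gb=8"             # keeps part of the model weights in CPU RAM, freeing VRAM at a latency cost
  - "--gpu-memory-utilization=0.95"  # raises the fraction of VRAM given to vLLM above the 0.9 default
```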