Optimal EC2 configuration and vLLM settings for max concurrency? #9773
hughesadam87 started this conversation in General
Replies: 1 comment
-
Going a bit further, I see the issue is that GPU KV cache usage reaches 99% and pending requests then pile up. I guess my question, then, is which engine flags would let me stretch the GPU cache under concurrent requests?
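For reference, the flags that control how much GPU memory goes to the KV cache and how many requests compete for it can be set in the compose `command:`. A minimal sketch with illustrative values, not tuned for this g4dn.12xlarge setup:

```yaml
# Illustrative values only -- not tuned for this hardware or model.
command:
  - "--gpu-memory-utilization=0.95"  # default is 0.9; a higher value leaves more VRAM for the KV cache
  - "--max-model-len=8192"           # caps context length, so a single request can't claim a huge slice of the cache
  - "--max-num-seqs=64"              # caps how many sequences are batched per step, limiting cache pressure from concurrency
```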
-
Thank you for such a great open source project.
Use Case
We're building a chatbot and aiming for consistent, responsive performance under concurrent user loads. At ~15 requests, processing delays reach up to 30 seconds before streaming begins. Though streaming speed is good, we'd prefer requests to start sooner, even if they stream slower. We're also seeking optimal vLLM settings for our hardware.
Configuration
vLLM running on a g4dn.12xlarge with the AMI "Deep Learning Base Proprietary Nvidia Driver AMI (Amazon Linux 2) Version 61.1".
**Model** is Pixtral70b
docker-compose:
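The compose file itself isn't reproduced here; as a minimal sketch, assuming the official `vllm/vllm-openai` image and tensor parallelism across all four GPUs, it would look something like the following, with the model name left as a placeholder:

```yaml
# Minimal illustrative compose file -- image tag, ports, and model id are placeholders.
services:
  vllm:
    image: vllm/vllm-openai:latest
    ipc: host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    command:
      - "--model=<pixtral-model-id>"   # placeholder; the Pixtral checkpoint in use goes here
      - "--tensor-parallel-size=4"     # shard across the 4 T4 GPUs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Starting it with `docker compose up` exposes the OpenAI-compatible API on port 8000.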
The instance is underutilized, showing only ~6% CPU utilization (of the 48 available vCPUs). When getting slammed with requests, GPU utilization is near 100% on each of the 4 GPUs, but GPU memory utilization sits around 63% per GPU.
All vLLM settings are using the defaults.
Questions/Concerns
The vLLM engine args doc leaves us with some questions:
- Is `--max-num-batched-tokens=512` too low? Would increasing it help with concurrent request handling?
- Would `--cpu-offload-gb` or `--swap-space` improve performance?
- Should we raise `--gpu-memory-utilization` from the 0.9 default?
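As a rough sketch of how the flags in question would sit together in the compose `command:` (values are illustrative, not recommendations, and note that `--cpu-offload-gb` moves model weights rather than KV cache):

```yaml
# Illustrative values only -- each flag trades latency or memory somewhere else.
command:
  - "--max-num-batched-tokens=2048"  # allows larger prefill batches per step; each step takes longer
  - "--swap-space=16"                # GiB of CPU RAM per GPU for preempted requests' KV blocks
  - "--cpu-offload-gb=8"             # keeps part of the model weights in CPU RAM, freeing VRAM at a latency cost
  - "--gpu-memory-utilization=0.95"  # raises the fraction of VRAM given to vLLM above the 0.9 default
```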