Hi, I'm a developer looking to deploy a roughly 8B-parameter model via vLLM.

Initially, I planned to serve the model to 200+ users with awq_marlin, but I'm confused: the Marlin kernel is described as optimized for inference at batch sizes under 32, and the AWQ GitHub repo says FP16 gives the best throughput, so I don't understand how awq_marlin being optimized only up to batch 32 fits a high-concurrency deployment.

I'm also wondering how I should proceed with my deployment in this situation.
I'm running on four L40s (4-way), and the system prompt is the same for every request to the model, so a lot of KV cache is being generated.
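For reference, here is a minimal sketch of the kind of launch configuration described above, using vLLM's offline Python API. The model path, prompt, and numeric values are placeholders, not tuned recommendations; `quantization="awq_marlin"`, `tensor_parallel_size`, and `enable_prefix_caching` are standard engine arguments, and prefix caching is what lets the KV cache of the shared system prompt be reused across requests:

```python
# Sketch only: the checkpoint path and numbers below are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/awq-quantized-8b",   # hypothetical AWQ checkpoint
    quantization="awq_marlin",          # older vLLM builds may only accept "awq"
    tensor_parallel_size=4,             # one shard per L40
    enable_prefix_caching=True,         # reuse KV cache for the shared system prompt
    gpu_memory_utilization=0.90,        # leave headroom for activations
    max_model_len=8192,                 # cap context length to bound KV cache growth
)

SYSTEM_PROMPT = "You are a helpful assistant."  # identical prefix across requests

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate([SYSTEM_PROMPT + "\n\nUser question goes here."], params)
print(outputs[0].outputs[0].text)
```

For an online deployment serving 200+ users, the equivalent would be the OpenAI-compatible server, e.g. `vllm serve <model> --quantization awq_marlin --tensor-parallel-size 4 --enable-prefix-caching`, with the same caveat that the exact flags depend on the vLLM version installed.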