Hi, I'm a developer looking to deploy a roughly 8B-parameter model via vLLM.

Initially, I planned to serve the model to 200+ users with awq_marlin, but I'm confused: the Marlin kernel is described as optimized for inference at batch sizes under 32, and the AWQ GitHub repo says FP16 gives the best throughput, so I don't understand how awq_marlin being optimized only up to batch 32 fits a high-concurrency deployment.

I'm also wondering how I should proceed with my deployment in this situation.
I'm running on four L40s (4-way), and the system prompt is the same for every request to the model, so a lot of KV cache is being generated.
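For reference, here is a minimal sketch of the kind of launch configuration described above, using vLLM's offline Python API. The model path, prompt, and numeric values are placeholders, not tuned recommendations; `quantization="awq_marlin"`, `tensor_parallel_size`, and `enable_prefix_caching` are standard engine arguments, and prefix caching is what lets the KV cache of the shared system prompt be reused across requests:

```python
# Sketch only: the checkpoint path and numbers below are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/awq-quantized-8b",   # hypothetical AWQ checkpoint
    quantization="awq_marlin",          # older vLLM builds may only accept "awq"
    tensor_parallel_size=4,             # one shard per L40
    enable_prefix_caching=True,         # reuse KV cache for the shared system prompt
    gpu_memory_utilization=0.90,        # leave headroom for activations
    max_model_len=8192,                 # cap context length to bound KV cache growth
)

SYSTEM_PROMPT = "You are a helpful assistant."  # identical prefix across requests

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate([SYSTEM_PROMPT + "\n\nUser question goes here."], params)
print(outputs[0].outputs[0].text)
```

For an online deployment serving 200+ users, the equivalent would be the OpenAI-compatible server, e.g. `vllm serve <model> --quantization awq_marlin --tensor-parallel-size 4 --enable-prefix-caching`, with the same caveat that the exact flags depend on the vLLM version installed.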