[RFC]: Disaggregated prefilling and KV cache transfer roadmap #10818
Comments
Will the logic for model upgrades and instance service discovery be introduced in XpYd?
Model upgrades --- not in the scope of the disaggregated prefill roadmap for now. But this IS important, and RLHF-style training also needs it, so I can add it if this is a common need. Please ❤️ this message if you need this feature.
When using XpYd in production, model upgrades are frequent. During the upgrade period, two versions of the model coexist, so I think the vLLM gateway needs to pair prefill and decode instances, ensuring they come from the same model version.
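A version-aware pairing check like the one suggested above could look like the following sketch. This is illustrative only: the instance representation, the `model_version` field, and the `pair_by_version` helper are all hypothetical, not part of vLLM.

```python
# Hypothetical sketch: during a rolling upgrade, the gateway only pairs
# prefill and decode instances that report the same model version.

def pair_by_version(prefill_instances, decode_instances):
    """Yield (prefill, decode) pairs whose model versions match.

    Each instance is a dict like {"id": ..., "model_version": ...}.
    Decode instances are consumed at most once.
    """
    # Index decode instances by version for O(1) lookup.
    by_version = {}
    for d in decode_instances:
        by_version.setdefault(d["model_version"], []).append(d)

    pairs = []
    for p in prefill_instances:
        candidates = by_version.get(p["model_version"])
        if candidates:
            pairs.append((p, candidates.pop()))
    return pairs
```

A real gateway would refresh this index from service discovery and fall back to queueing when no same-version decode instance is available.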
Glad to see the progress of supporting the P/D disaggregation feature.
To better support the P/D disaggregated architecture, we are actively developing a dual-tiered scheduler, implemented in Go, to optimize XpYd and request management. This upgrade was built on top of our P/D disaggregation feature within vLLM and is now live in our production environment, showing improved performance with good stability. The core design of our scheduler is outlined below:

● Observability: To reduce reliance on any single inference engine, we have implemented a Go-based reverse proxy that directly collects and computes instance-level performance metrics in real time, such as TTFT, TPOT, instance load, and cache status.

● Hierarchical Scheduling System: Our system features a Cluster-Level Scheduler (CLS) and an Instance-Level Scheduler (ILS), aiming to maximize goodput per GPU while meeting latency SLOs. The CLS leverages a workload-aware performance-cost model to refine request routing, determining whether to use a disaggregated or colocated serving mode and pinpointing the most cost-effective GPU types. The ILS then assigns the most suitable P/D instance pairs to incoming requests, optimizing load balancing and cache reuse.

● Dynamic P/D Adjustment: Leveraging instance-level metrics, we have developed a role-shift module that periodically evaluates instance load statistics and decides when to add, remove, or switch P/D instances as needed.

We look forward to releasing the code for our global scheduler to OSS shortly. Additional features are currently in development. We welcome any discussions and opportunities for collaboration.
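The dynamic P/D role-shift idea described above can be sketched as a periodic policy over per-instance metrics. Everything here is hypothetical (the metric names, thresholds, and `plan_role_shifts` are illustrative, not the production Go scheduler), but it shows the shape of the decision:

```python
# Illustrative sketch of a role-shift policy: if all prefill instances are
# breaching the TTFT SLO while some decode instance sits nearly idle,
# propose switching the least-loaded decode instance to prefill.
from dataclasses import dataclass


@dataclass
class InstanceMetrics:
    role: str        # "prefill" or "decode"
    ttft_ms: float   # time to first token
    tpot_ms: float   # time per output token
    load: float      # utilization in [0.0, 1.0]


def plan_role_shifts(instances, ttft_slo_ms=500.0, idle_load=0.3):
    """Return a list of (instance_index, new_role) switch proposals."""
    prefill = [(i, m) for i, m in enumerate(instances) if m.role == "prefill"]
    decode = [(i, m) for i, m in enumerate(instances) if m.role == "decode"]

    switches = []
    # Only act when every prefill instance is over the TTFT SLO.
    prefill_hot = bool(prefill) and all(m.ttft_ms > ttft_slo_ms for _, m in prefill)
    decode_idle = [(i, m) for i, m in decode if m.load < idle_load]
    if prefill_hot and decode_idle:
        idx, _ = min(decode_idle, key=lambda im: im[1].load)
        switches.append((idx, "prefill"))
    return switches
```

A production version would also consider the TPOT SLO in the opposite direction, hysteresis to avoid flapping, and the cost of draining in-flight requests before a switch.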
@yuleil Hello, I was wondering how to use nsys to profile such a distributed system. I have a lot of experience using nsys to profile vLLM, but for P/D disaggregation I have to run the prefill and decode instances separately, and I want a single nsys profile covering both instances. After checking the help docs I still cannot find a solution.
Let's also add some orchestration support to the roadmap. How to orchestrate such a stateful application doesn't seem to be covered yet; let's create a sub-task to track it.
Hi @KuntaiDu |
Motivation.
Here is the roadmap for disaggregated prefill (and general-purpose KV cache transfer). Feel free to contribute 😁.
Proposed Change.
- Extend the `roi` tensor to include the `num_head` dimension and the `layer` dimension (currently the `roi` tensor only contains the tokens dimension)
- Make `vllm connect` talk to the `Engine` instead of talking to the API server

Feedback Period.
No response
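To illustrate the `roi` extension item in the proposed changes, here is a minimal sketch. It is an assumption about the intent, not vLLM's actual API: the real `roi` is a tensor, and `make_roi` is a hypothetical helper showing how a token-only region of interest could generalize to (layer, head, token).

```python
# Hypothetical sketch: extend a 1-D token-only roi mask to a 3-D mask over
# (layer, num_head, token), so a KV-cache transfer can select a sub-region
# of the cache rather than whole-token slices across all layers and heads.

def make_roi(num_layers, num_heads, num_tokens, token_slice):
    """Build a nested boolean mask; True marks KV entries to transfer.

    Today's roi covers only the token dimension; this adds the layer and
    head dimensions proposed in the roadmap item above.
    """
    start, stop = token_slice
    return [
        [
            [start <= t < stop for t in range(num_tokens)]
            for _ in range(num_heads)
        ]
        for _ in range(num_layers)
    ]


# roi[layer][head][token] is True only for tokens in the slice.
roi = make_roi(num_layers=2, num_heads=4, num_tokens=8, token_slice=(2, 5))
```

In practice this would be a boolean tensor (so per-layer or per-head sub-masks can differ), but the indexing scheme is the point here.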
CC List.
@youkaichao @zeroorhero @comaniac @rkooo567 @WoosukKwon @liweiqing1997 @ShangmingCai @Leaf996 @coolkp @sjnaj @K-Mistele @ApostaC @YaoJiayi @njhill
Any Other Things.
No response