
[RFC]: Disaggregated prefilling and KV cache transfer roadmap #10818

Open · 1 of 30 tasks
KuntaiDu opened this issue Dec 2, 2024 · 10 comments

KuntaiDu commented Dec 2, 2024

Motivation.

Here is the roadmap for disaggregated prefill (and general-purpose KV cache transfer). Feel free to contribute 😁.

Proposed Change.

  • XpYd support (X vLLM prefill instances and Y vLLM decode instances, where TP and PP likely differ between prefill and decode instances; a configuration sketch follows this list)
    • [Feature] Allow specifying the region of interest (roi) on the num_head and layer dimensions (currently the roi tensor only covers the token dimension)
    • [Feature] XpYd support by building multiple connections between Xp and Yd
    • [Feature] XpYd support by letting Xp connect to one KV cache server, and connect this server to Yd
  • Building connection
    • [Usage] Keep distributed connection alive by periodically sending dummy requests.
    • [Usage] Build connection by running vllm connect
    • [Feature] Allow connecting prefiller and decoder across different nodes
    • [Perf] Build connection by directly talking to the Engine instead of talking to the API server
  • Compatibility
    • [Feature] Compatible with chunked prefill
    • [Feature] Compatible with prefix caching
    • [Feature] Compatible with pipeline parallel
    • [Feature] Compatible with multi-modality
  • Asynchronous KV cache transfer
    • [Perf] KV cache prefetching
    • [Perf] layer-by-layer pipelining (by changing model forward context -- please don't change model code)
  • Communication support
    • RCCL pipe
    • RDMA pipe
    • DCN pipe
    • CXL pipe
    • Infra-specific pipe (e.g. AWS / Azure / Google cloud)
  • Better memory control
    • [Perf] Reusing vLLM page table to avoid memory fragmentation
    • [Perf] Reduce number of tensor copy
  • Adaptivity and fault tolerance
    • [Perf] If not all KV caches in the batch are received, only perform prefilling on those tokens without KV cache
    • [Perf] Allow one prefill/decode vllm worker to be repurposed to decode/prefill vllm worker
  • Third-party engine integration
  • Persistent prefix caching support
    • [Feature] Allow fetching the KV cache of some prefix tokens and then prefilling the remaining tokens
    • [Feature] Allow fetching the KV cache of some contiguous tokens in the middle, then prefilling the remaining tokens to blend the fetched KV cache with the remaining context
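
For concreteness, here is the configuration sketch referenced above: a minimal 1p1d wiring using the experimental KV-transfer configuration from the initial disaggregated prefill support. The class and field names (KVTransferConfig, PyNcclConnector, kv_role, kv_rank) follow the current prototype and may change as the items above land.

```python
# Minimal 1p1d sketch using the experimental KV-transfer config.
# Run each part in a separate process (or on separate nodes once
# cross-node connection support lands). Names may change over time.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Prefill instance: produces KV caches (rank 0 of a 2-member group).
prefill_llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig.from_cli(
        '{"kv_connector": "PyNcclConnector", "kv_role": "kv_producer", '
        '"kv_rank": 0, "kv_parallel_size": 2}'
    ),
)
# max_tokens=1 so only the prefill (and the KV cache transfer) runs here.
prefill_llm.generate(["San Francisco is a"], SamplingParams(max_tokens=1))

# Decode instance (separate process): consumes the transferred KV caches.
# kv_transfer_config=KVTransferConfig.from_cli(
#     '{"kv_connector": "PyNcclConnector", "kv_role": "kv_consumer", '
#     '"kv_rank": 1, "kv_parallel_size": 2}'
# )
```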

Feedback Period.

No response

CC List.

@youkaichao @zeroorhero @comaniac @rkooo567 @WoosukKwon @liweiqing1997 @ShangmingCai @Leaf996 @coolkp @sjnaj @K-Mistele @ApostaC @YaoJiayi @njhill

Any Other Things.

No response

KuntaiDu added the RFC label Dec 2, 2024
@WangErXiao

Will the logic for model upgrades and instance service discovery be introduced in XpYd?

@KuntaiDu
Collaborator Author

KuntaiDu commented Dec 2, 2024

> Will the logic for model upgrades and instance service discovery be introduced in XpYd?

Model upgrades --- not in the scope of the disaggregated prefill roadmap for now. But this IS important; RLHF-style training also needs it, so I can add it if this is a common need. Please ❤️ this message if you need this feature.
Instance service discovery --- this will be done by a request gateway, which is also an ongoing effort, but I need to discuss with others more to figure out where we should put it (inside vLLM, as a k8s/kserve plugin, or another option).

@WangErXiao

> Will the logic for model upgrades and instance service discovery be introduced in XpYd?
>
> Model upgrades --- not in the scope of the disaggregated prefill roadmap for now. But this IS important; RLHF-style training also needs it, so I can add it if this is a common need. Please ❤️ this message if you need this feature. Instance service discovery --- this will be done by a request gateway, which is also an ongoing effort, but I need to discuss with others more to figure out where we should put it (inside vLLM, as a k8s/kserve plugin, or another option).

When using XpYd in production, model upgrades are frequent. During the upgrade period, two versions of the model coexist, so I think the vLLM gateway needs to pair prefill and decode instances, ensuring they come from the same model version.
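
A hedged sketch of what such version-aware pairing could look like on the gateway side (all names are hypothetical; this is not an existing vLLM component):

```python
# Hypothetical gateway-side pairing: only match prefill and decode
# instances that report the same model version (e.g., a weights hash
# or deployment tag). Illustrative only; not part of vLLM today.
from dataclasses import dataclass
import itertools

@dataclass
class Instance:
    url: str
    role: str            # "prefill" or "decode"
    model_version: str   # e.g., weights hash or deployment tag

def pair_instances(instances: list[Instance]):
    """Yield (prefill, decode) pairs with matching model versions."""
    prefills = [i for i in instances if i.role == "prefill"]
    decodes = [i for i in instances if i.role == "decode"]
    for p, d in itertools.product(prefills, decodes):
        if p.model_version == d.model_version:
            yield p, d
```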

@xiuqiaoli

Glad to see the progress on supporting the P/D disaggregation feature.

  1. Will this RFC support a central scheduler that determines the best prefill and decode instances to serve the current request? Many P/D disaggregation papers (Mooncake, DistServe, etc.) introduce similar components that consider several metrics (instance load, KV cache locality, etc.) before choosing P/D instances.
  2. Since there are multiple choices of memory stores or storage systems for sharing KV cache, will your design introduce interfaces to plug in third-party KV cache memory/storage systems?

@KuntaiDu
Collaborator Author

KuntaiDu commented Dec 3, 2024

> Glad to see the progress on supporting the P/D disaggregation feature.
>
> 1. Will this RFC support a central scheduler that determines the best prefill and decode instances to serve the current request? Many P/D disaggregation papers (Mooncake, DistServe, etc.) introduce similar components that consider several metrics (instance load, KV cache locality, etc.) before choosing P/D instances.
> 2. Since there are multiple choices of memory stores or storage systems for sharing KV cache, will your design introduce interfaces to plug in third-party KV cache memory/storage systems?
  1. Yes. I am discussing with people where to put the scheduler (it likely should not be implemented in Python, since scalability matters for this scheduler, so I am not sure whether it should live in vllm or in another repo under vllm-project).
  2. Yes, and I will leave APIs for third-party systems to integrate against. The key APIs are insert and drop_select.
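
For concreteness, a minimal in-memory sketch of a buffer exposing these two APIs. The signatures are modeled on vLLM's KV lookup-buffer abstraction at the time of writing and should be treated as illustrative; a real backend would add eviction, pinning, and RPC.

```python
# Minimal in-memory KV lookup buffer sketch exposing the two key APIs.
# Signatures are illustrative, modeled on vLLM's lookup-buffer
# abstraction; field layouts and names may differ across versions.
from typing import List, Optional
import torch

class InMemoryKVBuffer:
    def __init__(self):
        # Each entry: (input_tokens, roi, key, value, hidden).
        self._store = []

    def insert(self, input_tokens: torch.Tensor, roi: torch.Tensor,
               key: torch.Tensor, value: torch.Tensor,
               hidden: torch.Tensor) -> None:
        """Producer side: publish KV caches for a token sequence."""
        self._store.append((input_tokens.clone(), roi.clone(),
                            key.clone(), value.clone(), hidden.clone()))

    def drop_select(self, input_tokens: torch.Tensor,
                    roi: torch.Tensor) -> List[Optional[torch.Tensor]]:
        """Consumer side: pop and return the entry matching (tokens, roi)."""
        for i, (tok, r, k, v, h) in enumerate(self._store):
            if torch.equal(tok, input_tokens) and torch.equal(r, roi):
                self._store.pop(i)
                return [tok, r, k, v, h]
        return [None, None, None, None, None]
```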

@wyzzbond

wyzzbond commented Dec 4, 2024

To better support the P/D disaggregated architecture, we are actively developing a dual-tiered scheduler, implemented in Go, to optimize XpYd and request management. It is built on top of our P/D disaggregation feature in vLLM and is now live in our production environment, showing improved performance and good stability.

The core design of our scheduler is outlined below:

● Observability: To reduce reliance on any single inference engine, we have implemented a Go-based reverse proxy that directly collects and computes instance-level performance metrics in real time, such as TTFT, TPOT, instance load, and cache status.

● Hierarchical Scheduling System: Our system features a Cluster Level Scheduler (CLS) and an Instance Level Scheduler (ILS), aiming to maximize goodput per GPU while meeting latency SLOs. The CLS leverages a workload-aware performance-cost model to refine request routing, determining whether to use a disaggregated or colocated serving mode and pinpointing the most cost-effective GPU types. Subsequently, the ILS assigns the most suitable P/D instance pairs for incoming requests, optimizing load balancing and cache reuse.

● Dynamic P/D Adjustment: By leveraging instance-level metrics, we've developed a role shift module that periodically evaluates instance load stats and decides when to add, remove, or switch P/D instances as needed.

We are looking forward to releasing the code for our global scheduler to OSS shortly. Additional features are currently in development. We welcome any discussions and opportunities for collaboration.
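
For illustration, a rough Python sketch of the two-level decision flow described above (the actual implementation is in Go and not yet released; all fields and thresholds here are hypothetical):

```python
# Hypothetical two-level routing sketch mirroring the CLS/ILS split:
# the cluster level picks a serving mode, the instance level picks
# a concrete P/D pair. All fields and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class InstanceMetrics:
    url: str
    role: str         # "prefill" or "decode"
    load: float       # e.g., normalized queue depth
    cache_hit: float  # estimated prefix-cache hit rate for this request

def cluster_level(prompt_len: int, expected_output_len: int) -> str:
    # Long prompts with short outputs benefit most from disaggregation.
    return "disaggregated" if prompt_len > 4 * expected_output_len else "colocated"

def instance_level(metrics: list[InstanceMetrics]):
    # Prefer prefill instances with cache reuse, decode instances with low load.
    prefill = max((m for m in metrics if m.role == "prefill"),
                  key=lambda m: m.cache_hit - m.load)
    decode = min((m for m in metrics if m.role == "decode"),
                 key=lambda m: m.load)
    return prefill, decode
```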

@yuleil
Contributor

yuleil commented Dec 4, 2024

We (Alibaba Cloud) are actively developing a disaggregated prefilling feature for vLLM to tackle latency issues and minimize interference between prefilling and decoding. Leveraging fully asynchronous I/O, it ensures minimal overhead for P/D disaggregation. This implementation has been widely validated in our production system, demonstrating robust stability and strong performance.

Design Highlights

  • Fully asynchronous: KV cache transfer does not block computation. We observed with nsys that NCCL communication and computation are fully overlapped on different CUDA streams. [image: nsys timeline]

  • Control the behavior of disaggregated prefill at the request level: Our engine is designed to handle each request with the flexibility to switch between different serving strategies, ranging from single-instance serving to P/D disaggregation. This architecture markedly enhances the scheduler's potential for optimization. By merely including {"prefill_endpoint": "http://192.168.1.124:8001/v1/completions"} in the request, we can conduct P/D disaggregation with the prefill instance selected by our global scheduler based on workload attributes and instance-level performance metrics. This capability allows for on-the-fly adjustment of P/D instance ratios to optimize performance and facilitates instantaneous role transitions as required.
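
For example, a client-side request might look like the following sketch, assuming the fork described above accepts an extra "prefill_endpoint" field on the OpenAI-compatible completions API (this field is specific to that fork, not upstream vLLM; the URLs are examples):

```python
# Request-level P/D disaggregation sketch: send the completion request
# to the decode instance and point "prefill_endpoint" at the prefill
# instance chosen by the global scheduler. The extra field and the URLs
# are illustrative, taken from the comment above; not upstream vLLM API.
import requests

resp = requests.post(
    "http://192.168.1.125:8002/v1/completions",  # decode instance (example)
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "Explain disaggregated prefilling in one sentence.",
        "max_tokens": 64,
        # Route the prefill phase to the scheduler-selected instance.
        "prefill_endpoint": "http://192.168.1.124:8001/v1/completions",
    },
)
print(resp.json())
```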

Workflow

PD disaggregation inference can be enabled by using the "prefill_endpoint" parameter. However, to achieve optimal global load balancing, enhance prefix-caching affinity, and minimize the mismatch between P/D instances, a global scheduler has been incorporated into the system. The whole process is structured as follows:

[image: workflow diagram]

Performance Evaluation

Microbenchmark

A10, single instance, TP1:

python benchmark_serving.py --model=meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-name=random --random-input-len=1000 --random-output-len=200 --request-rate=1.3

[image: benchmark results]

A10, 1P1D, TP1:

python benchmark_serving.py --model=meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-name=random --random-input-len=1000 --random-output-len=200 --request-rate=2.6

[image: benchmark results]

TPOT is decreased by 27%, while TTFT is increased due to prefill request queuing. That can be optimized by adjusting to a more suitable P/D ratio.

Conclusion

We have developed a flexible implementation of P/D disaggregation, with a special focus on XpYd support and asynchronous KV cache transfer. We hope to contribute to the community in these areas, further boosting the performance of disaggregated prefilling.

@tanzelin430

> (quoting @yuleil's comment above in full)

@yuleil Hello, I was wondering how to use nsys to profile such a distributed system. I have a lot of experience using nsys to profile vLLM, but for P/D disaggregation the prefill and decode instances run as separate processes, and I want to profile both instances with a single nsys session. After checking the help docs I still cannot find a solution.

@Jeffwan
Contributor

Jeffwan commented Dec 5, 2024

Let's also add some orchestration support to the roadmap. It seems that how to orchestrate such a stateful application is not covered yet. Let's create a sub-task to track it.

@a32543254

Hi @KuntaiDu,
How can disaggregated prefill be compatible with chunked prefill?
As I understand it, chunked prefill splits large prefills into smaller chunks and batches them together with decode requests, which combines prefill and decode rather than separating them.
Do you mean we would chunk the large input into smaller chunks only on the prefill instance, to avoid frequent dynamic adjustment of kernel dispatching and memory allocation?
