Yet another Prefill-Decode separation in vllm #9079
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@zeroorhero implemented a similar kv store-based solution, and I myself also believe that a kv store is the way to go for disaggregated prefilling. So yes!
Thanks for your reply. I would appreciate any comments on how to improve it. So far, only an RDMA-based KV cache transporter has been created. Do you think adding a transporter that uses CPU memory would make this PR more likely to be accepted?
I have a question. How can we guarantee the model updates correctly when using Prefill-Decode separation in production, @chenqianfzh? And how do I install InfiniteStore using pip?
I am working on 'pip install InfiniteStore'; it should be available this week.
Hello, could you explain how to start the program?
The script examples/infinitestore_pd_separate.sh in this PR demonstrates how to start the two vLLM instances for prefill and decode, respectively, on a single host with multiple GPUs. Before that, make sure InfiniteStore is installed and started by running start.sh. Please let me know if you run into any problems.
Sorry, I missed your first question. I guess you are asking how I verified that the KV cache is updated correctly across the different vLLM instances for prefill and decode. I verified it by comparing the generated results against vanilla vLLM; the results are exactly the same.
You could use 'pip install infinistore'.
@chenqianfzh @thesues I have three questions. First: after I installed it with the above command, a circular import occurred when importing the package. Can you help me find the reason? Second: in this PR, do the places where infinity is used need to be replaced with infinistore? Third: is multi-machine deployment currently supported? Or a single machine with multiple cards, for example, prefill with 2 GPUs vs. decode with 2 GPUs?
Maybe you could provide more information about this error? Or check whether the package is installed twice. I updated infinistore's API recently, so this PR should be updated as well; I will ping you as soon as it is ready.
@thesues Do you have any plans to do this? Or is it already being done?
My machine does not have a Mellanox ConnectX-3 VPI or Connect-IB InfiniBand adapter. Can I still use this PR on a single machine with 4 A100 GPUs?
What's your infinistore version? I think the newer version 0.1.82 solved this. Before that version, infinistore had a limited RDMA queue length; the new version has a thread that drains the CQ asynchronously.
Thank you, I made this update in infinistore ;-)
Hi @chenqianfzh @thesues, can you share some test results for this PR?
Hi, do you plan to support other transporters like #8498 (torch distributed pipe)? Your layer-wise idea seems very efficient. Could you share your contact details? (WeChat / Slack / email are all fine.)
We could discuss this on https://vllm-dev.slack.com/archives/C07VCUQLE1F or https://github.com/bd-iaas-us/infiniStore
This PoC demonstrates an implementation of Prefill-Decode separation within vLLM, leveraging a memory pool to store KV caches accessed via RDMA connections. The primary goal is to enhance data transfer efficiency by optimizing how KV caches are managed and transmitted.
@KuntaiDu @youkaichao @thesues, we wonder if you could take a moment to review our implementation and share your feedback. Your insights and opinions would be greatly appreciated.
Key Changes to vllm
Compared to #8498 (thanks to your seminal work, @KuntaiDu, which inspired us a lot), this implementation optimizes data transfer efficiency. Instead of using a single large tensor per layer for KV caches, the caches are divided into smaller blocks. This granular approach enables parallel transmission of multiple layers as well as multiple blocks within one layer, as sketched below.
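To make the block granularity concrete, here is a minimal sketch assuming vLLM's paged KV cache layout; the function name, key scheme, and the transporter's store_kvcache method are illustrative assumptions, not this PR's actual code.

```python
import torch

def store_layer_blockwise(transporter, request_id: str, layer_idx: int,
                          kv_layer: torch.Tensor, block_ids: list[int]) -> None:
    """Send one layer's KV cache as many small blocks instead of one big tensor.

    kv_layer is assumed to use vLLM's paged layout:
    (2, num_blocks, block_size, num_kv_heads, head_size).
    `transporter.store_kvcache` is a hypothetical async, batched store.
    """
    entries = []
    for block_id in block_ids:
        key = f"{request_id}/layer_{layer_idx}/block_{block_id}"
        entries.append((key, kv_layer[:, block_id]))  # one (K, V) block pair
    # Returns immediately, so blocks of this layer and of other layers
    # can be in flight over RDMA concurrently with ongoing computation.
    transporter.store_kvcache(entries)
```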
To coordinate different layers effectively, changes are made at the model level rather than in the model_runner as a whole. This implementation showcases the modifications for the OPT and LLaMA models, establishing a pattern that can easily be extended to other models.
Introducing InfiniteStore
Main contributor: @thesues
The core functionality for sending and receiving KV caches via RDMA is encapsulated in our separate package, InfiniteStore. We are more than happy to donate this project to the community.
InfiniteStore Features:
API design
The load/store API is meticulously crafted to align with vLLM's KV cache characteristics, aiming to parallelize computation with KV cache saving and loading operations. The APIs are asynchronous and support batch processing.
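As a hedged sketch of the general shape of such an interface (the exact definitions live in the InfiniteStore package; the class and method names below are illustrative assumptions):

```python
from typing import List, Tuple
import torch

class KVCacheTransporter:
    """Illustrative interface only; names and signatures are assumptions."""

    def store_kvcache(self, entries: List[Tuple[str, torch.Tensor]]) -> None:
        """Asynchronously write a batch of (key, block) pairs over RDMA.

        Returns immediately; completion is observed via sync().
        """
        raise NotImplementedError

    def load_kvcache(self, entries: List[Tuple[str, torch.Tensor]]) -> None:
        """Asynchronously read a batch of keys into preallocated tensors."""
        raise NotImplementedError

    def sync(self) -> None:
        """Block until all previously issued loads and stores complete."""
        raise NotImplementedError
```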
Once all load/store requests have been sent, the sync function from the block API is called to ensure that processing completes before moving on.
During the prefill computation, only the final call to the sync API ensures that the KV cache has been fully written to the remote storage, as in the usage sketch below.
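A minimal usage sketch under the same assumed interface, reusing the hypothetical store_layer_blockwise helper from the earlier sketch: per-layer stores are issued asynchronously while prefill keeps computing, and a single final sync() acts as the barrier.

```python
def prefill_with_offload(layers, hidden_states, transporter, request_id, block_ids):
    """Overlap prefill computation with asynchronous KV cache stores.

    `layers` is a simplified stand-in for a model's decoder layers, each
    assumed to return the updated hidden states plus that layer's KV cache.
    """
    for layer_idx, layer in enumerate(layers):
        hidden_states, kv_layer = layer(hidden_states)
        # Kick off async, block-granular stores for this layer, then
        # immediately continue computing the next layer.
        store_layer_blockwise(transporter, request_id, layer_idx,
                              kv_layer, block_ids)
    transporter.sync()  # single barrier: all KV blocks are durable in the store
    return hidden_states
```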