[Core] Efficient transmission for CPU prefix caching, based on PR#10874 #11099

lixiaobai09 · 2024-12-11T13:08:47Z

This PR implements efficient CPU prefix caching. It consists of:

An efficient CPU KV block cache manager (from PR#10874), which this PR builds on.
A data transmission optimization that overlaps layer-wise block swapping with forward computation.
A data transmission optimization that overlaps request-level block swapping with computation by delaying transmission requests by one step.
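
To illustrate the layer-wise overlap described above, here is a minimal sketch in plain Python. It is a toy model, not the code in this PR: the function names (`swap_in_layer`, `compute_layer`, `forward_with_layerwise_overlap`) are hypothetical, the "copy" is a list copy on a background thread, and the "computation" is a simple sum. The only point it shows is the scheduling pattern: while layer i is being computed, the blocks for layer i+1 are already being transferred, and each layer waits only for its own blocks.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def swap_in_layer(layer: int, cpu_cache: list[list[float]]) -> list[float]:
    """Hypothetical stand-in for copying one layer's KV blocks from CPU to GPU."""
    return list(cpu_cache[layer])


def compute_layer(hidden: float, kv_blocks: list[float]) -> float:
    """Hypothetical stand-in for one layer's forward computation."""
    return hidden + sum(kv_blocks)


def forward_with_layerwise_overlap(
    cpu_cache: list[list[float]], hidden: float = 0.0
) -> tuple[float, list[str]]:
    """Run all layers, prefetching layer i+1 while layer i is computed."""
    num_layers = len(cpu_cache)
    events: list[str] = []  # recorded on the main thread only, so the order is deterministic
    with ThreadPoolExecutor(max_workers=1) as copier:
        # Start the transfer for the first layer before any computation.
        pending = copier.submit(swap_in_layer, 0, cpu_cache)
        for layer in range(num_layers):
            # Wait only for the blocks this layer needs.
            kv_blocks = pending.result()
            if layer + 1 < num_layers:
                # Kick off the next layer's transfer before computing this layer.
                pending = copier.submit(swap_in_layer, layer + 1, cpu_cache)
                events.append(f"prefetch {layer + 1}")
            hidden = compute_layer(hidden, kv_blocks)
            events.append(f"compute {layer}")
    return hidden, events


if __name__ == "__main__":
    result, trace = forward_with_layerwise_overlap([[1.0, 2.0], [3.0], [4.0, 5.0]])
    print(result)
    print(trace)
```

The alternative, transferring every layer's blocks before the first layer runs, would leave the compute side idle for the whole transfer. The request-level optimization in the list above applies a similar idea one level up, by issuing transmission requests a step before they are needed.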

github-actions · 2024-12-11T13:09:00Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add the ready label to the PR.
Enable auto-merge.

🚀

mergify · 2024-12-11T13:09:27Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lixiaobai09.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ApostaC and others added 13 commits December 3, 2024 20:59

Move to a new branch to fix the DCO issues.

f60a8fa

Signed-off-by: ApostaC <[email protected]> Co-authored-by: KuntaiDu <[email protected]>

[Fix] the failed unit tests

e6654f2

Signed-off-by: ApostaC <[email protected]>

[Fix] CPU offloading not working bug and [fix] unit test and format issues

ba6c9e3

Signed-off-by: ApostaC <[email protected]>

[fix] broken tests for cpu offloading allocator

1c94985

Signed-off-by: ApostaC <[email protected]>

[Fix] add the call to get_physical_block_ids

daab0d6

Signed-off-by: ApostaC <[email protected]>

[Add] faster unsafe implementation for get_physical_block_id

919e5e3

Signed-off-by: ApostaC <[email protected]>

Feat: support CSR format to construct the swapped blocks

0638211

sequence IDs with each swapped blocks Signed-off-by: Dahai Tang <[email protected]>

Merge branch 'main' into yihua-cpu-offloading2

52185bf

Updating the benchmark script with correct usage instructions

505e60c

Signed-off-by: ApostaC <[email protected]>

make yapf happy

a517a29

Signed-off-by: ApostaC <[email protected]>

fix format checker issues

789b00e

Signed-off-by: ApostaC <[email protected]>

Feat: layer-wise transmission

6d5841f

Signed-off-by: Dahai Tang <[email protected]>

Merge branch 'yihua-cpu-offloading2' of https://github.com/KuntaiDu/vllm into cpu-offloading2

5927066

Signed-off-by: Dahai Tang <[email protected]>

lixiaobai09 requested review from tlrmchlsmth, WoosukKwon, robertgshaw2-neuralmagic, njhill, ywang96, comaniac, alexm-neuralmagic, zhuohan123 and youkaichao as code owners December 11, 2024 13:08

mergify bot added the frontend label Dec 11, 2024

mergify bot added the needs-rebase label Dec 11, 2024

Merge branch 'main' of https://github.com/vllm-project/vllm into cpu-offloading2

74e99c1

Signed-off-by: Dahai Tang <[email protected]>

mergify bot removed the needs-rebase label Dec 11, 2024

Dahai Tang added 2 commits December 11, 2024 14:53

Fix: set_forward_contex for TPU test

2abdf62

Signed-off-by: Dahai Tang <[email protected]>

Fix: set forward context while context is None

6f97634

Signed-off-by: Dahai Tang <[email protected]>

Dahai Tang added 4 commits December 12, 2024 04:21

Fix: change model runner arguments to support kwargs

7a6435d

Signed-off-by: Dahai Tang <[email protected]>

Fix: lint checker

47c3557

Signed-off-by: Dahai Tang <[email protected]>

Fix: cpu offloading block allocator tester

894ab90

Signed-off-by: Dahai Tang <[email protected]>

Fix: get_cache_engine while self.cache_engine is None

80c8c4e

Signed-off-by: Dahai Tang <[email protected]>
