Optimizing prefix caching (V2) #9690

brotherchen · 2024-10-25T08:24:14Z


The current prefix reuse operates at the block level, meaning that even blocks which haven't been fully computed (i.e., blocks of requests being computed for the first time) can share a prefix. However, I’ve noticed that the current implementation is incomplete.

Theoretically, only one request in a batch should need to compute the shared prefix blocks, but currently every request repeats this computation, which wastes compute. This is essentially a problem with the order in which block states are updated: at present, blocks are marked as computed only after the context lengths of all requests in the batch have been calculated. Updating the block state immediately after calculating each request's context length resolves the issue.
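To make the ordering issue concrete, here is a minimal toy sketch. It is not the actual scheduler code: `ToyBlockTable`, `block_hashes`, `cached_prefix_len`, `schedule`, and the `eager_mark` flag are all hypothetical names invented for illustration, and `BLOCK_SIZE = 4` is an arbitrary value. It only counts how many tokens each request would have to compute under the two update orders; it does not model how the KV cache is actually produced.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per block (arbitrary value for this illustration)


@dataclass
class ToyBlockTable:
    """Hypothetical stand-in for a prefix cache: the set of computed block hashes."""
    computed: set = field(default_factory=set)


def block_hashes(tokens):
    """Hash each full block together with all tokens before it (prefix-aware)."""
    n_full = len(tokens) // BLOCK_SIZE
    return [hash(tuple(tokens[: (i + 1) * BLOCK_SIZE])) for i in range(n_full)]


def cached_prefix_len(table, tokens):
    """Number of leading tokens whose blocks are already marked as computed."""
    n = 0
    for h in block_hashes(tokens):
        if h not in table.computed:
            break
        n += BLOCK_SIZE
    return n


def schedule(batch, eager_mark):
    """Return the number of tokens each request in the batch must compute.

    eager_mark=False: blocks are marked only after every request's context
                      length is known (the order described in the post).
    eager_mark=True:  blocks are marked right after each request's context
                      length is known (the proposed order).
    """
    table = ToyBlockTable()
    to_compute = []
    for tokens in batch:
        to_compute.append(len(tokens) - cached_prefix_len(table, tokens))
        if eager_mark:
            table.computed.update(block_hashes(tokens))
    if not eager_mark:
        for tokens in batch:
            table.computed.update(block_hashes(tokens))
    return to_compute


# Three requests sharing an 8-token (2-block) prefix, each with a 2-token suffix.
prefix = list(range(8))
batch = [prefix + [100, 101], prefix + [200, 201], prefix + [300, 301]]
late = schedule(batch, eager_mark=False)
eager = schedule(batch, eager_mark=True)
print(late)   # [10, 10, 10]
print(eager)  # [10, 2, 2]
```

In this toy model, marking late makes all three requests compute the full 10 tokens, while marking eagerly lets the second and third requests skip the 8 shared prefix tokens.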

This optimization will yield benefits when several requests in the same batch share a prefix that has not been cached yet, without affecting the correctness of inference.

Original code:

New code:
