Optimizing prefix caching (V2) #9690
brotherchen started this conversation in Ideas
The current prefix reuse operates at the block level, meaning that even blocks which haven't been fully computed (i.e., blocks of requests being computed for the first time) can share a prefix. However, I’ve noticed that the current implementation is incomplete.
In theory, only one request in a batch should need to compute the shared prefix blocks, but currently every request repeats that computation, which wastes compute. The root cause is the order in which block states are updated: at present, the blocks of all requests are marked as computed only after the context length of every request in the batch has been calculated. Updating the block state immediately after calculating each request's context length resolves the issue, because later requests in the same batch then see the prefix blocks as already computed.
This optimization helps when several requests in the same batch share a prefix that has not been computed yet, and it does not affect the correctness of inference.
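To make the ordering issue concrete, here is a minimal toy sketch. It is not the project's actual scheduler code: `ToyBlockCache`, `cached_context_len`, `schedule`, the `eager_mark` flag, and the block size of 4 are all hypothetical names and values chosen for illustration. It only counts how many tokens each request would have to compute when blocks are marked as computed after the whole batch versus right after each request.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4  # hypothetical block size, for illustration only


@dataclass
class Block:
    key: tuple
    computed: bool = False


class ToyBlockCache:
    """Hypothetical cache: one shared Block per full block of a token prefix."""

    def __init__(self):
        self.blocks = {}

    def get_blocks(self, token_ids):
        # Only full blocks are shareable; a trailing partial block is ignored.
        full = len(token_ids) // BLOCK_SIZE * BLOCK_SIZE
        out = []
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            key = tuple(token_ids[:end])  # block identity includes its prefix
            out.append(self.blocks.setdefault(key, Block(key)))
        return out


def cached_context_len(blocks):
    """Number of leading tokens covered by blocks already marked computed."""
    n = 0
    for block in blocks:
        if not block.computed:
            break
        n += BLOCK_SIZE
    return n


def schedule(batch, eager_mark):
    """Return how many tokens each request in the batch has to compute.

    eager_mark=False: mark blocks only after all context lengths are known.
    eager_mark=True:  mark a request's blocks right after its context length
                      is calculated, so later requests can reuse them.
    """
    cache = ToyBlockCache()
    per_request = [cache.get_blocks(tokens) for tokens in batch]
    to_compute = []
    for tokens, blocks in zip(batch, per_request):
        to_compute.append(len(tokens) - cached_context_len(blocks))
        if eager_mark:
            for block in blocks:
                block.computed = True
    if not eager_mark:
        for blocks in per_request:
            for block in blocks:
                block.computed = True
    return to_compute


if __name__ == "__main__":
    prefix = list(range(8))  # two full blocks shared by every request
    batch = [prefix + [100, 101], prefix + [200, 201], prefix + [300, 301]]
    print(schedule(batch, eager_mark=False))  # [10, 10, 10]
    print(schedule(batch, eager_mark=True))   # [10, 2, 2]
```

In this toy model, only the first request pays for the shared 8-token prefix once blocks are marked eagerly; the other two compute just their 2 unique tokens. Whether marking early is safe in a real engine depends on the prefix actually being computed before the later requests read it, which this sketch does not model.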
Original code:
New code: