KV cache refactor to decouple cache blocks and metadata about them #168
Right now we attach both the cache blocks and metadata about them, such as `block_table`, to a single class that is passed back and forth between the engine and the model. While working on multi-GPU support for PT models, I learned that I need to use an RPC framework to manage multiple processes. The inputs from the engine need to cross the RPC boundary on each inference, and it is awkward that the actual cache blocks also have to be passed from the engine every time for no reason. For disco, only a handle to the cache blocks (`DRef`) is communicated between the engine and the model, but even this is unnecessary.

With this PR, the cache blocks are completely owned by the model, and only `block_tables` etc. need to be passed from the engine. I renamed the class to `KVCacheInfo` to reflect its new role. The interface in `model_module.py` doesn't need to change, since this change is an implementation detail of the paged cache model. @yelite
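For illustration, here is a minimal sketch of the ownership split this PR describes. Apart from `KVCacheInfo` itself, the class and field names below are hypothetical stand-ins for the real paged-cache code, not the actual implementation:

```python
# Hypothetical sketch of the split; names other than KVCacheInfo are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KVCacheInfo:
    """Metadata-only object that crosses the engine/model (RPC) boundary.

    It no longer holds the cache blocks themselves, only bookkeeping
    such as the per-sequence block tables.
    """

    block_size: int
    # Maps a sequence id to the list of block ids assigned to it.
    block_tables: Dict[int, List[int]] = field(default_factory=dict)


class PagedCacheModel:
    """The model side fully owns the actual cache blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        # The real blocks would be GPU tensors (or disco DRefs held by the
        # model); a plain list stands in for them in this sketch.
        self.cache_blocks = [None] * num_blocks

    def generate(self, requests, cache_info: KVCacheInfo):
        # Only the lightweight KVCacheInfo arrives from the engine on each
        # inference; self.cache_blocks never crosses the RPC boundary.
        ...
```

The engine then constructs and sends only a `KVCacheInfo` per step, which keeps the per-inference RPC payload small regardless of cache size.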