I agree. #190 tracks the discussion around KV caching. We have bigger plans for this, but we've been focusing on constrained generation almost exclusively so far 🙂
We can certainly implement the naive version quickly; it's just a small interface change. I'll bump it up on the priority list.
Looking at the AutoGPTQ code, I noticed that it already implements KV caching:
https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/nn_modules/fused_llama_attn.py
It would be really nice to let users take advantage of this.
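For anyone following along, here is a rough sketch of what KV caching saves during decoding. It is a toy in plain Python, not AutoGPTQ's implementation and not this library's API: `project`, `attend`, `decode_without_cache`, and `decode_with_cache` are hypothetical names, the "projection" is a made-up stand-in for the real key/value weight matrices, and each token doubles as its own query.

```python
import math


def project(x, counter):
    """Toy stand-in for the key/value projections; counts how often it runs."""
    counter[0] += 1
    key = [2.0 * v for v in x]
    value = [v + 1.0 for v in x]
    return key, value


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def decode_without_cache(tokens):
    """Re-project the whole prefix at every step (quadratic projection work)."""
    counter = [0]
    outputs = []
    for t in range(len(tokens)):
        keys, values = [], []
        for x in tokens[: t + 1]:
            k, v = project(x, counter)
            keys.append(k)
            values.append(v)
        outputs.append(attend(tokens[t], keys, values))
    return outputs, counter[0]


def decode_with_cache(tokens):
    """Project each token once and keep its key/value around for later steps."""
    counter = [0]
    keys, values, outputs = [], [], []
    for x in tokens:
        k, v = project(x, counter)
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs, counter[0]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    _, slow_calls = decode_without_cache(tokens)
    _, fast_calls = decode_with_cache(tokens)
    print(slow_calls, fast_calls)
```

Both functions return the same attention outputs; the cached version just avoids re-projecting tokens it has already seen, so the number of projection calls grows linearly with sequence length instead of quadratically.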