Replies: 3 comments 3 replies
-
It's not usable with context shifting though.
-
I looked into your explanations to refresh my memory. "NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext." So it really depends on the use case. When memory or world info are used, it's all benefit and no loss to use the Quantum K cache. When I first tested it months ago by compiling a KCPP build (with it enabled by default on the LlamaCPP Quantum K cache PR branch), it massively slowed down token generation, so I left it aside. It might not anymore, unless there are incompatibilities with some specifics of KoboldCPP (the MMQ implementation? I'm fishing around).
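For reference, here is a minimal sketch of what that KV cache shifting amounts to in terms of the llama.cpp C API, assuming a single sequence; the function names are the ones from the llama.cpp API of that era (they have been renamed in later versions), and the helper itself is purely illustrative:

```cpp
#include "llama.h"

// Illustrative helper: drop n_discard old tokens while keeping the first
// n_keep (e.g. the system prompt), so the next generation continues without
// reprocessing the tokens that survive.
static void context_shift(llama_context * ctx, int n_keep, int n_past, int n_discard) {
    // 1. Evict the oldest tokens after the protected prefix.
    llama_kv_cache_seq_rm   (ctx, 0, n_keep,             n_keep + n_discard);
    // 2. Mark the remaining cached tokens for a position shift of -n_discard;
    //    llama.cpp then applies the corresponding RoPE delta to the cached K
    //    entries instead of recomputing them from the tokens.
    llama_kv_cache_seq_shift(ctx, 0, n_keep + n_discard, n_past, -n_discard);
    // The caller then decodes new tokens starting at position n_past - n_discard.
}
```

Presumably the clash with context shifting mentioned above comes from step 2: the shift has to re-apply RoPE to the cached K data, which is harder to do in place on a quantized cache.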
-
This is not a make-or-break feature for everyone now, but it's going to be quite essential going forward. A 16-bit KV cache is not so bad with llama/mistral, but once you start getting to 32K context (and higher), it starts to eat a large fraction of VRAM. And that 32K will actually be quite usable once flash attention is integrated. Not a lot of people are running 32K+ models atm, but they are getting more prominent. I have been using Yi at 40K+ context almost exclusively.
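To put rough numbers on that, here is a small back-of-envelope calculation; the model dimensions are assumptions for a Yi-34B-style GQA model (60 layers, 8 KV heads, head size 128), so substitute your own model's values:

```cpp
#include <cstdio>

int main() {
    // Assumed dimensions for a Yi-34B-style GQA model; adjust for your model.
    const long n_layer   = 60;        // transformer layers
    const long n_head_kv = 8;         // grouped-query KV heads
    const long head_dim  = 128;       // size of each head
    const long n_ctx     = 40 * 1024; // ~40K context

    // K and V each store n_layer * n_head_kv * head_dim values per token.
    const long elems_per_token = 2 * n_layer * n_head_kv * head_dim;

    const double bytes_f16  = 2.0;       // 16-bit cache
    const double bytes_q8_0 = 34.0 / 32; // q8_0: 32 int8 values + one fp16 scale per block

    printf("f16  KV cache: %.1f GiB\n", elems_per_token * n_ctx * bytes_f16  / (1024.0 * 1024 * 1024));
    printf("q8_0 KV cache: %.1f GiB\n", elems_per_token * n_ctx * bytes_q8_0 / (1024.0 * 1024 * 1024));
    return 0;
}
```

Under those assumptions that is roughly 9.4 GiB of cache at f16 versus about 5 GiB fully quantized to q8_0; with -ctk q8_0 alone only the K half shrinks, so the real saving sits in between.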
-
Would it be possible to get a parameter to use the Quantum K cache feature of LlamaCPP in KoboldCPP?
ggerganov#4312
I tested it, and it bumps the perplexity by 0.1 at Q8_0 without affecting generation speed in LlamaCPP.
Actually, it increases generation speed by 1.5% in full offload, at least in this scenario: X:\text-generation-webui\models\Yarn-Llama-2-70b-32k-IQ2_XS.gguf -f wiki.test.raw -ngl 100 -b 512 -mg 0 -ts 24,0 -c 512 -ctk q8_0.
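For context, on the llama.cpp side the PR exposes this through the type_k / type_v fields of llama_context_params (which is what -ctk q8_0 sets), so a KoboldCPP parameter would mostly be plumbing down to that field. A rough sketch, with the wrapper function itself being purely illustrative:

```cpp
#include "llama.h"

// Illustrative wrapper: create a context with an optionally quantized K cache.
llama_context * make_ctx(llama_model * model, int n_ctx, bool quant_k_cache) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx  = n_ctx;
    // type_k / type_v were added by the linked PR; -ctk q8_0 simply sets type_k.
    cparams.type_k = quant_k_cache ? GGML_TYPE_Q8_0 : GGML_TYPE_F16;
    return llama_new_context_with_model(model, cparams);
}
```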