Replies: 3 comments 3 replies
-
It's not usable with context shifting though.
-
I looked into your explanations to refresh my memory. "NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext." So it really depends on the use case. When memory or world info are used, it's all benefit and no loss to use the Quantum K cache. When I first tested it months ago by compiling a KCPP build (with it enabled by default on the LlamaCPP Quantum K cache PR branch), it massively slowed down token generation, so I left it aside. It might not anymore, unless there are incompatibilities with some specifics of KoboldCPP (the MMQ implementation? I'm fishing around).
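For reference, here is a minimal sketch of what that KV cache shifting amounts to in terms of the llama.cpp C API, assuming a single sequence; the function names are the ones from the llama.cpp API of that era (they have been renamed in later versions), and the helper itself is purely illustrative:

```cpp
#include "llama.h"

// Illustrative helper: drop n_discard old tokens while keeping the first
// n_keep (e.g. the system prompt), so the next generation continues without
// reprocessing the tokens that survive.
static void context_shift(llama_context * ctx, int n_keep, int n_past, int n_discard) {
    // 1. Evict the oldest tokens after the protected prefix.
    llama_kv_cache_seq_rm   (ctx, 0, n_keep,             n_keep + n_discard);
    // 2. Mark the remaining cached tokens for a position shift of -n_discard;
    //    llama.cpp then applies the corresponding RoPE delta to the cached K
    //    entries instead of recomputing them from the tokens.
    llama_kv_cache_seq_shift(ctx, 0, n_keep + n_discard, n_past, -n_discard);
    // The caller then decodes new tokens starting at position n_past - n_discard.
}
```

Presumably the clash with context shifting mentioned above comes from step 2: the shift has to re-apply RoPE to the cached K data, which is harder to do in place on a quantized cache.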
-
This is not a make-or-break feature for everyone now, but it's going to be quite essential going forward. A 16-bit KV cache is not so bad with llama/mistral, but once you start getting to 32K context (and higher), it starts to eat a large fraction of VRAM. And that 32K will actually be quite usable once flash attention is integrated. Not a lot of people are running 32K+ models atm, but they are getting more prominent. I have been using Yi at 40K+ context almost exclusively.
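To put rough numbers on that, here is a small back-of-envelope calculation; the model dimensions are assumptions for a Yi-34B-style GQA model (60 layers, 8 KV heads, head size 128), so substitute your own model's values:

```cpp
#include <cstdio>

int main() {
    // Assumed dimensions for a Yi-34B-style GQA model; adjust for your model.
    const long n_layer   = 60;        // transformer layers
    const long n_head_kv = 8;         // grouped-query KV heads
    const long head_dim  = 128;       // size of each head
    const long n_ctx     = 40 * 1024; // ~40K context

    // K and V each store n_layer * n_head_kv * head_dim values per token.
    const long elems_per_token = 2 * n_layer * n_head_kv * head_dim;

    const double bytes_f16  = 2.0;       // 16-bit cache
    const double bytes_q8_0 = 34.0 / 32; // q8_0: 32 int8 values + one fp16 scale per block

    printf("f16  KV cache: %.1f GiB\n", elems_per_token * n_ctx * bytes_f16  / (1024.0 * 1024 * 1024));
    printf("q8_0 KV cache: %.1f GiB\n", elems_per_token * n_ctx * bytes_q8_0 / (1024.0 * 1024 * 1024));
    return 0;
}
```

Under those assumptions that is roughly 9.4 GiB of cache at f16 versus about 5 GiB fully quantized to q8_0; with -ctk q8_0 alone only the K half shrinks, so the real saving sits in between.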
-
Would it be possible to get a parameter to use the Quantum K cache feature of LlamaCPP in KoboldCPP?
ggerganov#4312
I tested it, and it bumps the perplexity by 0.1 at Q8_0 without affecting generation speed in LlamaCPP.
Actually, it increases generation speed by 1.5% in full offload, at least in this scenario: X:\text-generation-webui\models\Yarn-Llama-2-70b-32k-IQ2_XS.gguf -f wiki.test.raw -ngl 100 -b 512 -mg 0 -ts 24,0 -c 512 -ctk q8_0.
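For context, on the llama.cpp side the PR exposes this through the type_k / type_v fields of llama_context_params (which is what -ctk q8_0 sets), so a KoboldCPP parameter would mostly be plumbing down to that field. A rough sketch, with the wrapper function itself being purely illustrative:

```cpp
#include "llama.h"

// Illustrative wrapper: create a context with an optionally quantized K cache.
llama_context * make_ctx(llama_model * model, int n_ctx, bool quant_k_cache) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx  = n_ctx;
    // type_k / type_v were added by the linked PR; -ctk q8_0 simply sets type_k.
    cparams.type_k = quant_k_cache ? GGML_TYPE_Q8_0 : GGML_TYPE_F16;
    return llama_new_context_with_model(model, cparams);
}
```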