[High Priority Feature] Please add Support for 8-bit and 4-Bit Caching! #56
Comments
Noted @Iory1998, this will be addressed.
This is now available in beta. Check out the #beta-releases-chat channel on Discord.
Thank you very much. I have been testing the 0.3 beta 1 for a few days now, and it does not have the caching feature.
It is a parallel beta for the current release train. Available as of an hour ago.
Thank you for your prompt response. Can I get a link here or an email, since I don't use Discord?
Never mind, I joined Discord just to test the 0.3.1 beta 1.
K and V quants for the context are still not available. Rolling back to pre-0.3 to get them back. The difference is usable vs. unusable for me on a 16 GB GPU with Llama 3.1 8B and Phi-medium: with the Q4 cache quants, the models fit and could look through the full context. The new release takes roughly 4 times the memory for the cache (and even with a smaller context it still runs slower). My request is to bring back the ability for the user to adjust the K and V context quants for flash attention.
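For rough context on why the cache quant matters so much on a 16 GB card, the back-of-the-envelope sketch below estimates the KV-cache footprint. It assumes Llama 3.1 8B's published dimensions (32 layers, 8 KV heads, head dimension 128) and roughly 4.5 bits per element for a q4_0 cache, so treat the numbers as approximations rather than measurements.

```python
# Rough KV-cache size estimate: 2 tensors (K and V) per layer,
# each holding n_ctx * n_kv_heads * head_dim elements.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Assumed dimensions for Llama 3.1 8B: 32 layers, 8 KV heads, head dim 128.
ctx = 32_768
f16  = kv_cache_bytes(32, 8, 128, ctx, 2.0)     # f16 cache: 2 bytes/element
q4_0 = kv_cache_bytes(32, 8, 128, ctx, 0.5625)  # q4_0: ~4.5 bits/element

print(f"f16  KV cache at 32k ctx: {f16 / 2**30:.2f} GiB")   # ~4.0 GiB
print(f"q4_0 KV cache at 32k ctx: {q4_0 / 2**30:.2f} GiB")  # ~1.1 GiB
```

At f16, the cache alone can crowd model layers out of a 16 GB card at long contexts, which is consistent with the "roughly 4 times the memory" difference reported above (16 bits vs. ~4.5 bits per element).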
Just saw this was closed. This should not have been closed, as the feature is not available in the latest release (as far as I can see).
No, it was closed because the feature is being added. In version 0.3.2, the KV cache is set to FP8. I tested the beta, where you could set the KV cache to Q4 and Q8, but that has not been added to the official LM Studio release yet.
Any update on this? This shouldn't have been closed, as the feature is still not in any available version. I do not see any beta with this feature (after the 0.2.x series). Where is this beta? I can't find it on Discord. Assuming it does not actually exist, this issue should be re-opened.
Hello team,
LM Studio is built on recent versions of llama.cpp, which already supports 4-bit and 8-bit KV cache, so I don't understand why LM Studio does not incorporate it yet.
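For reference, llama.cpp exposes these cache types through its `--cache-type-k` / `--cache-type-v` options alongside flash attention. The sketch below shows the same setting through the llama-cpp-python binding; the `type_k` / `type_v` / `flash_attn` parameters and the numeric ggml type id are assumptions based on that binding's API, not LM Studio code.

```python
# Hypothetical sketch: requesting a quantized K/V cache via llama-cpp-python.
# type_k / type_v take ggml type ids; a quantized V cache needs flash attention.
from llama_cpp import Llama

GGML_TYPE_Q4_0 = 2  # ggml enum value for q4_0 (F16 is 1, Q8_0 is 8)

llm = Llama(
    model_path="aya-23-35B-Q4_K_M.gguf",  # any GGUF model
    n_gpu_layers=-1,        # offload as many layers as fit
    n_ctx=8192,
    flash_attn=True,        # required for a quantized V cache
    type_k=GGML_TYPE_Q4_0,  # 4-bit K cache
    type_v=GGML_TYPE_Q4_0,  # 4-bit V cache
)

out = llm("Hello", max_tokens=8)
print(out["choices"][0]["text"])
```

The equivalent in LM Studio would just be a setting for the K and V cache types that is passed through to the underlying llama.cpp engine.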
The benefits are tremendous, since it improves generation speed. It also frees VRAM, leaving room to use a higher model quantization.
To give you an example, I run aya-23-35B-Q4_K_M.gguf in LM Studio at 4.5 t/s, because the maximum number of layers I can load on my GPU with 24 GB of VRAM is 30 (Aya has 41 layers). In the Oobabooga WebUI, with the 4-bit cache enabled, I can load all layers into VRAM, and the speed jumps to 20.5 t/s. That's a significant increase in performance (nearly 5-fold).
This should be your main priority, since you are effectively pushing your customers to move to a different platform. Right now, I don't use LM Studio when I want to run a larger model, which is unfortunate since I am your biggest fan.
Please solve this issue ASAP.