
EPIC: Run Self-hosted version on CPU #191

Open
klink opened this issue Oct 17, 2023 · 6 comments

@klink
Contributor

klink commented Oct 17, 2023

No description provided.

klink converted this from a draft issue Oct 17, 2023
klink added the Epic label Oct 17, 2023
@comalice

comalice commented Oct 19, 2023

I'd like to mention that if you handle Ollama or llama.cpp interop, you'll get models that run on a CPU for free. Ollama comes with a web API out of the box, and I think llama.cpp does as well. A lot of projects are allowing users to target OpenAI-compatible API endpoints.
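For context, a minimal sketch of what targeting such an endpoint could look like, assuming a llama.cpp server (or any other OpenAI-compatible backend) is already listening locally; the host, port, and model name below are assumptions, not part of this project:

```python
import requests

# Assumed local OpenAI-compatible endpoint, e.g. llama.cpp's built-in server
# started separately; adjust host/port/model to your own setup.
BASE_URL = "http://127.0.0.1:8080/v1"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "llama-2-7b",  # placeholder model name
        "messages": [
            {"role": "user", "content": "Explain what a GGUF file is."}
        ],
        "temperature": 0.2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```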

@octopusx

octopusx commented Nov 7, 2023

I would be interested in having an option to run on a CPU too, in addition to the GPU, just to maximise the benefit I get from the GPUs I have available. For example, running StarCoder 7B on my GPU for code completion and Llama 7B on the CPU for chat functionality in the VSCode plugin. Right now, if I want both functionalities, I have to resort to using the smallest models to make sure they fit in my graphics card's VRAM.

@olegklimov
Contributor

Hi @octopusx
We tested various models on CPU: it's about 4-8 seconds for a single code completion, even for a 1.6b model or StarCoder 1b, on Apple M1 hardware. Maybe we'll train an even smaller model (0.3b?) to make it work with a smaller context.
A 7b model on CPU will probably be good enough for chat, because the context prefill is so small, but not for code completion.

@octopusx

octopusx commented Nov 7, 2023

@olegklimov for sure, I don't want to run anything on the CPU if I can avoid it, especially not the code completion part. I was only thinking of moving the chat function to the CPU to free up my GPU for higher-quality code completion. Currently I run llama.cpp on CPU for OpenAI-API-compatible chat integrations with a Llama 2 7b chat model (https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_K_M.gguf), and on a Ryzen 3000 series CPU I get close to instant chat responses. The key issue with this setup is that, for example, I cannot point my Refact plugin to the llama.cpp endpoint for chat, and I cannot point the other chat integrations to the self-hosted Refact, so I basically have to host two solutions at the same time...
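As an illustration of that kind of CPU-only chat setup, here is a minimal sketch using the llama-cpp-python bindings rather than the standalone llama.cpp server; the file path, thread count, and context size are assumptions, not the commenter's actual configuration:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# CPU-only inference: n_gpu_layers=0 keeps all layers on the CPU.
llm = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",  # quantized GGUF downloaded separately
    n_ctx=2048,      # context window
    n_threads=8,     # tune to the physical cores of the CPU
    n_gpu_layers=0,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the benefits of Q4_K_M quantization."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```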

@olegklimov
Contributor

Ah I see, that makes total sense.

I think the best way to solve this is to add providers to the Rust layer, for the new plugins. We'll release the plugins "as is" this week, because we need to release them and start getting feedback. Then ~next week we'll add the concept of providers to the Rust layer. Hopefully you'll then be able to direct requests to your llama.cpp server.
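To make the "providers" idea concrete, here is a purely hypothetical Python sketch of routing chat and completion traffic to different backends. It is not Refact's actual Rust implementation; every name, URL, and capability label in it is made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    """A hypothetical backend description: where to send a given capability."""
    name: str
    base_url: str
    capabilities: set

# Made-up example configuration: completions stay on a GPU-backed server,
# chat goes to a CPU-backed llama.cpp server.
PROVIDERS = [
    Provider("refact-gpu", "http://127.0.0.1:8008/v1", {"completion"}),
    Provider("llama-cpp-cpu", "http://127.0.0.1:8080/v1", {"chat"}),
]

def pick_provider(capability: str) -> Provider:
    """Return the first provider that advertises the requested capability."""
    for p in PROVIDERS:
        if capability in p.capabilities:
            return p
    raise LookupError(f"no provider configured for {capability!r}")

print(pick_provider("chat").base_url)        # -> http://127.0.0.1:8080/v1
print(pick_provider("completion").base_url)  # -> http://127.0.0.1:8008/v1
```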

@octopusx

octopusx commented Nov 7, 2023

This is amazing, I will be on the lookout for the new releases and test this as soon as it's available.
