EPIC: Run Self-hosted version on CPU #191
Comments
I'd like to mention that if you handle …
I would be interested in having an option to run on the CPU too, as an addition to the GPU, just to maximise the benefit I get from the GPUs I have available. For example: running StarCoder 7B on my GPU for code completion and Llama 7B on the CPU for chat functionality in the VSCode plugin. Right now, if I want both functionalities, I have to resort to the smallest models to make sure they fit in my graphics card's VRAM.
Hi @octopusx |
@olegklimov for sure, I don't want to run anything on the CPU if I can avoid it, and especially not the code completion part. I was only thinking of moving the chat function to the CPU to free up my GPU for higher-quality code completion. Currently I run llama.cpp on the CPU for OpenAI-API-based chat integrations with a Llama 2 7B chat model (https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_K_M.gguf), and on a Ryzen 3000 series CPU I get close to instant chat responses. The key issue with this setup is that I cannot point my Refact plugin at the llama.cpp endpoint for chat, and I cannot point the other chat integrations at the self-hosted Refact, so I basically have to host two solutions at the same time.
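For context, a minimal sketch of the kind of CPU chat setup described above, assuming llama.cpp's built-in server is running locally with its OpenAI-compatible API on the default port 8080 (the host, port, and prompt are illustrative, not taken from the thread):

```python
# Query a local llama.cpp server through its OpenAI-compatible chat endpoint.
# Assumes something like `llama-server -m llama-2-7b.Q4_K_M.gguf` is already
# running on the CPU and listening on 127.0.0.1:8080; adjust to your setup.
import requests

response = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Explain what a mutex is in two sentences."}
        ],
        "temperature": 0.2,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```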
Ah I see, that makes total sense. I think the best way to solve this is to add providers to the Rust layer for the new plugins. We'll release the plugins "as is" this week, because we need to get them out and start collecting feedback. Then, ~next week, we'll add the concept of providers to the Rust layer. Hopefully you'll then be able to direct requests to your llama.cpp server.
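A purely hypothetical sketch of the "providers" idea mentioned above: chat traffic goes to a CPU-backed llama.cpp server while completions stay on the GPU-backed self-hosted server. The PROVIDERS mapping, URLs, and helper function are illustrative only and not the actual Refact Rust-layer API:

```python
# Illustrative routing: each request type is sent to the backend its
# "provider" entry points at. URLs below are placeholders.
import requests

PROVIDERS = {
    "completion": "http://127.0.0.1:8008/v1",  # hypothetical GPU-backed server
    "chat": "http://127.0.0.1:8080/v1",        # hypothetical CPU llama.cpp server
}

def chat(messages):
    # Chat requests are dispatched to whatever the "chat" provider targets.
    r = requests.post(
        f"{PROVIDERS['chat']}/chat/completions",
        json={"messages": messages, "temperature": 0.2},
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

print(chat([{"role": "user", "content": "Summarise this function for me."}]))
```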
This is amazing, I will be on the lookout for the new releases and test this as soon as it's available. |