Made for the https://github.com/skycover/aicom-obsidian/ project.
This is a simple server, built in one evening, that runs llama.cpp via the Python bindings.
In fact, both llama.cpp and llama-cpp-python already ship good servers of their own. The main goal of making another one is to have a minimalistic sandbox for experimenting with various unusual things in simple Python code, without any infrastructure complications.
The interact_llamacpp.py from https://github.com/IlyaGusev/rulm was taken as the skeleton.
POST: /query
{
key: string,
params: {'top_p', 'top_k', 'temperature', 'repeat_penalty'},
messages: [[role, content]...]
}
return 200, int: total number of tokens.
Call /query to start inference.
'key' is optional, checked only if --key was given.
'params' may be {}. In this case the server defaults are used.
Role is one of: system, user, bot.
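For illustration, starting inference from Python could look roughly like the sketch below. It uses only the standard library; the host, port and the 'secret' key value are assumptions, not documented server defaults, so adjust them to how you actually run the server.

```python
# Sketch: start inference via POST /query (assumed JSON request body).
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # assumed host/port, adjust to your setup

payload = {
    "key": "secret",                 # only needed if the server was started with --key
    "params": {                      # may be {} to fall back to the server defaults
        "temperature": 0.2,
        "top_p": 0.9,
        "top_k": 30,
        "repeat_penalty": 1.1,
    },
    "messages": [
        ["system", "You are a helpful assistant."],
        ["user", "Hello!"],
    ],
}

req = urllib.request.Request(
    BASE_URL + "/query",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    total_tokens = int(resp.read().decode("utf-8"))  # the int described above
print("inference started, total tokens:", total_tokens)
```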
GET: /receive?key=string
return 200, str: one token OR '' if no tokens ready OR '[[END OF AICOM SENTENCE]]' if inference completed.
'key' is optional, checked only if --key was given.
You should poll /receive repeatedly, ignoring empty results, until '[[END OF AICOM SENTENCE]]' is received.
If you stop polling, inference pauses until the next request arrives. Simply stop requesting if you don't want the rest of the output.
If you send a new /query, inference is restarted.
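A corresponding polling loop could look like this sketch (same assumptions about the base URL and key as above; the stop marker is the one described in the /receive spec):

```python
# Sketch: stream the answer token by token via GET /receive.
import time
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"            # assumed host/port, adjust to your setup
END_MARKER = "[[END OF AICOM SENTENCE]]"

query = urllib.parse.urlencode({"key": "secret"})   # drop this if --key is not used
answer = []
while True:
    with urllib.request.urlopen(f"{BASE_URL}/receive?{query}") as resp:
        token = resp.read().decode("utf-8")
    if token == END_MARKER:
        break                                 # inference completed
    if token == "":
        time.sleep(0.1)                       # nothing ready yet, keep polling
        continue
    answer.append(token)
    print(token, end="", flush=True)
print()
full_text = "".join(answer)
```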
Unless the --key option is given, clients are restricted to the same host (this check may be meaningless behind a proxy).
It is recommended to use the --key option for key-based access, and to put an HTTPS proxy in front of the server if you serve it over the network.
python3 -m venv venv
. venv/bin/activate
pip install llama-cpp-python
mkdir models
Download a llama.cpp-compatible model into models/, for example any GGUF from https://huggingface.co/TheBloke
For the Russian language, check https://huggingface.co/IlyaGusev/saiga2_13b_gguf
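If you prefer scripting the download, a sketch with the huggingface_hub package could look like this. It is an extra pip install, and the file name below is only an example; check the model page for the actual GGUF file names.

```python
# Sketch: download a GGUF file into models/ with huggingface_hub
# (pip install huggingface_hub).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="IlyaGusev/saiga2_13b_gguf",
    filename="model-q4_K.gguf",   # example file name, verify it on the model page
    local_dir="models",
)
print("downloaded to", path)
```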
For a good experience I suggest using at least 13B models, quantized to at least 3 bits (4 bits will be a bit better :).
When using a Llama2-chat model, specify "-t chat" on the command line; otherwise the Saiga format is used. Both prompt formatters are custom (not the library ones).
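For orientation only, the two conventions roughly look like the sketch below. These are the commonly known prompt formats, not necessarily the exact strings this server's custom formatters produce, so check the source for the precise formatting.

```python
# Illustrative sketch only: the rough shape of the two prompt conventions.
# The server's own formatters are custom, so the exact strings may differ.

def saiga_prompt(messages):
    # Saiga-style: each turn wrapped as <s>{role}\n{content}</s>, then a bot turn is opened
    prompt = "".join(f"<s>{role}\n{content}</s>" for role, content in messages)
    return prompt + "<s>bot\n"

def llama2_chat_prompt(system, user):
    # Llama2-chat-style: system block inside <<SYS>> tags, user turn inside [INST] ... [/INST]
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
```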
Activate venv
. venv/bin/activate
Then
python aicom_llamacpp.py -m models/YOUR__MODEL.gguf
For Metal (Apple M1, M2) GPU run
python aicom_llamacpp.py -m models/YOUR__MODEL.gguf --n_gpu_layers=1
For help on parameters
python aicom_llamacpp.py -h
Llama2-chat on Mac with access key 'secret'
python aicom_llamacpp.py -k secret -m models/13B/llama-2-13b-chat.Q4_K_M.gguf --n_gpu_layers=1 -t chat
Saiga on Mac with access key 'secret'
python aicom_llamacpp.py -k secret -m models/13B/ggml-model-q4_K.gguf --n_gpu_layers=1