Simple Python service that runs inference on LLM models and exposes a REST API interface.
- Setup and run locally
- Docker images
- Run with Docker
- Docker build
- Configuration values and parameters
Want to know how this service works?
Want to use the service in your project:
Setup differs from system to system, and each setup requires a different version of PyTorch. Before we head out, here is a list of system memory and GPU requirements.
NOTE: Python 3.12 breaks the torch installation. Please use Python 3.10.
Setup Windows locally
- CUDA 12.1 or later (GTX 20xx, 30xx, 40xx)
- CUDA 11.8 (for GTX 10xx series)
- CPU - slow but reliable
NOTE: if you get the error:
... exit code -1073741819 (0xC0000005)
you need to add PYTHONUNBUFFERED=1;PYDEVD_USE_FRAME_EVAL=NO to your Run/Debug environment variables.
Setup Linux locally
- CUDA 12.1 or later (GTX 20xx, 30xx, 40xx)
- CUDA 11.8 (for GTX 10xx series)
- CPU - slow but reliable
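Whichever option you pick, once PyTorch is installed you can quickly confirm whether it actually sees your GPU. This is a generic PyTorch check, not part of this project:

```python
# Quick check of which mode your PyTorch install will end up in.
import torch

print("PyTorch version:", torch.__version__)
if torch.cuda.is_available():
    print("CUDA is available:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("CUDA not available - the model will run on the CPU (slow but reliable).")
```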
Use the llama-cpp-python package.
Note: You will need to acquire a GGUF model.
- For example: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
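If you want to sanity-check a GGUF file outside of this service, a minimal llama-cpp-python sketch looks roughly like this (the model path is a placeholder, and this is not the project's own loader code):

```python
# Minimal llama-cpp-python sketch: load a local GGUF model and run one prompt.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/your-model.gguf",  # placeholder - point this at your downloaded GGUF file
    n_ctx=2048,                            # context window size
    verbose=False,
)

output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"])
```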
- CUDA Mode - using CUDA requires CUDA drivers and Docker CUDA support.
- requirements:
- nvidia-drivers (instructions in the sections above according to your OS)
- cuda
- nvidia-docker2
- Verify Docker GPU access:
docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Images:
niftylius/llm-inference:auto / niftylius/llm-inference:latest
- Will use the auto flag when loading the model; it will prioritize the GPU but spread the model onto the CPU if there is not enough VRAM available.
niftylius/llm-inference:cpu
- Will use the cpu flag when loading the model; this will load the model into RAM and use the CPU for inference.
niftylius/llm-inference:cuda-11
- For older GPUs. Will use the cuda flag and CUDA 11.8 drivers when loading the model. Primarily for 10xx NVIDIA GPUs.
niftylius/llm-inference:cuda-12
- Will use the cuda flag and CUDA 12.1 drivers when loading the model.
niftylius/llm-inference:llama-cpp
- Will use llama-cpp as the inference service instead of Huggingface transformers.
NOTE: CPU mode only. If you want to run a quantized version, use the LOAD_IN_4BIT flag with the "auto" or "cuda" docker images instead.
Load TinyLlama/TinyLlama-1.1B-Chat-v1.0 using default settings
# auto
docker run --gpus all -e MODEL_PATH="TinyLlama/TinyLlama-1.1B-Chat-v1.0" -p 6060:6060 niftylius/llm-inference:auto
# cuda 12.1
docker run --gpus all -e MODEL_PATH="TinyLlama/TinyLlama-1.1B-Chat-v1.0" -p 6060:6060 niftylius/llm-inference:cuda-12
# cuda 11.8
docker run --gpus all -e MODEL_PATH="TinyLlama/TinyLlama-1.1B-Chat-v1.0" -p 6060:6060 niftylius/llm-inference:cuda-11
# cpu
docker run -e MODEL_PATH="TinyLlama/TinyLlama-1.1B-Chat-v1.0" -p 6060:6060 niftylius/llm-inference:cpu
# llama-cpp on cpu
docker run -v [path/to/some_model_folder]:/usr/model -e MODEL_PATH="/usr/model/[your model].gguf" -p 6060:6060 niftylius/llm-inference:llama-cpp
# auto
docker run --gpus all -e MODEL_PATH="[huggingface model url]" -p 6060:6060 niftylius/llm-inference
# example1: docker run --gpus all -e MODEL_PATH="TinyLlama/TinyLlama-1.1B-Chat-v1.0" -p 6060:6060 niftylius/llm-inference
# example2: docker run --gpus all -e MODEL_PATH="mistralai/Mistral-7B-Instruct-v0.2" -p 6060:6060 niftylius/llm-inference
Make sure the path to your model is absolute
# this will load the model from your local machine
docker run --gpus all -p 6060:6060 -v path/to/some_model:/usr/model niftylius/llm-inference:latest
Download .env.example file locally from:
https://raw.githubusercontent.com/alexandriaproject-io/llm-inference/main/.env.example
# edit .env.example and run
docker run --gpus all --env-file .env.example -e MODEL_PATH="[huggingface model url]" -p 6060:6060 niftylius/llm-inference:latest
Clone git project with:
git clone [email protected]:alexandriaproject-io/llm-inference.git
Navigate into the project folder and run:
# build auto docker image
docker build -f .\docker\Dockerfile.auto -t [Your Image tag name] .
# build cuda 12.1 docker image
docker build -f .\docker\Dockerfile.cuda12 -t [Your Image tag name] .
# build cuda 11.8 docker image
docker build -f .\docker\Dockerfile.cuda11 -t [Your Image tag name] .
# build cpu docker image
docker build -f .\docker\Dockerfile.cpu -t [Your Image tag name] .
# build llama-cpp
docker build -f .\docker\Dockerfile.llama-cpp.cuda -t [Your Image tag name] .
You can find the example file here: .env.example
Variable Name | Default Value | Values | Description |
---|---|---|---|
SERVER_HOST | 127.0.0.1 | 0.0.0.0 - 255.255.255.255 | IP address the port will listen to (0.0.0.0 is any ip). |
SERVER_PORT | 6060 | 1-65535 | Port the rest api service will listen to. |
LOG_LEVEL | info | critical, fatal, error, warning, warn, info, debug | Logging verbosity level. |
MAX_CACHE_SIZE | 16384 | 1-2147483647 | Max cached prompt executions |
MAX_CACHE_TTL | 3600 | 1-2147483647 | Cache TTL - the cache for a request_id/request_ids is cleared this long after its last execution. The cache is also cleared when EOS is reached. |
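For example, to expose the service on all interfaces with debug logging, you can override these variables the same way as in the Docker examples above:

```
docker run --gpus all -e MODEL_PATH="TinyLlama/TinyLlama-1.1B-Chat-v1.0" -e SERVER_HOST="0.0.0.0" -e SERVER_PORT="6060" -e LOG_LEVEL="debug" -p 6060:6060 niftylius/llm-inference:auto
```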
Variable Name | Default Value | Values | Description |
---|---|---|---|
MODEL_PATH | - | relative/path/to/model/folder | Path to the model folder relative to the project, or a Hugging Face model id (see the Docker examples above). |
Variable Name | Default Value | Values | Description |
---|---|---|---|
USE_LLAMA_CPP | false | true, false | Whether to use the llama-cpp-python package to run GGUF models on the CPU. |
LLAMA_CPP_MAX_CONTEXT | 2048 | Any Int | Maximum combined prompt and response size; set manually because it is not defined in GGUF models. |
LLAMA_CPP_BATCH_TOKENS | 2048 | Any Int | Number of tokens processed per batch during prompt evaluation. |
LLAMA_RAM_CACHE_MB | 512 | Any Int | Size of the RAM cache in MB. Caching greatly speeds up repeated prompt generations (e.g. 10 tokens at a time). |
NOTE: when using llama-cpp, all of the **Transformers config** values will be ignored.
NOTE: this integration is intended primarily for debugging; it is not a production-grade implementation of llama.cpp.
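For orientation only, here is a hedged sketch of how these settings plausibly map onto llama-cpp-python options; the project's actual wiring may differ:

```python
# Hedged sketch of how the LLAMA_CPP_* settings above plausibly map onto llama-cpp-python options.
from llama_cpp import Llama, LlamaRAMCache

llm = Llama(
    model_path="path/to/your-model.gguf",  # placeholder path
    n_ctx=2048,                            # LLAMA_CPP_MAX_CONTEXT
    n_batch=2048,                          # LLAMA_CPP_BATCH_TOKENS
)

# LLAMA_RAM_CACHE_MB: an in-RAM cache that speeds up repeated / incremental prompts.
llm.set_cache(LlamaRAMCache(capacity_bytes=512 * 1024 * 1024))
```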
Variable Name | Default Value | Values | Description |
---|---|---|---|
ENABLE_CUDA | true | true, false | Will try to run the model on the GPU if supported, but will fall back to the CPU if CUDA is not available. |
DEVICE_MAP_AUTO | false | true, false | Will automatically spread the model across all GPUs and the CPU. TARGET_GPU_INDEX is ignored when set to true. |
TARGET_GPU_INDEX | 0 | Any Int | Specify which GPU to run the model on. Only applicable when a GPU is in use. |
LOAD_IN_8BIT | false | true, false | When true, loads the model in 8-bit precision instead of the standard 16-bit (bfloat16/float16), reducing RAM/GPU memory use but sacrificing some precision. |
LOAD_IN_4BIT | false | true, false | LINUX ONLY. When true, loads the model in 4-bit precision instead of the standard 16-bit (bfloat16/float16), reducing RAM/GPU memory use but sacrificing some precision. |
SPACE_TOKEN_CHAR | ▁ | String | Some models like LLama 2 remove leading space when decoding token by token, define this token to try and fix that behaviour on PROGRESS events |
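As a rough illustration of what these flags correspond to in plain Hugging Face transformers code (not the project's actual loader):

```python
# Hedged sketch: how the loading flags above roughly map onto Hugging Face transformers options.
import torch
from transformers import AutoModelForCausalLM

model_path = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # or a local model folder

# ENABLE_CUDA / TARGET_GPU_INDEX: pick a single device to run on.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# DEVICE_MAP_AUTO=true would instead pass device_map="auto" to spread the model across
# all GPUs and the CPU; LOAD_IN_8BIT / LOAD_IN_4BIT correspond to bitsandbytes quantized
# loading, e.g. quantization_config=BitsAndBytesConfig(load_in_4bit=True).
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model = model.to(device)
```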
Variable Name | Default Value | Values | Description |
---|---|---|---|
MODEL_SEED | 42 | 0-2147483647 | Seed used to initialize the pseudorandom number generator, ensuring reproducible generation. |
MODEL_LOADER | AutoModelForCausalLM | String | Manually set Model class ( for debugging and experimentation ). Models mapping can be found here |
MODEL_DEFAULT_NUM_BEAMS | 1 | Any Int | Number of different paths the model considers in parallel during beam search, influencing the diversity and quality of the generated text. |
MODEL_DEFAULT_DO_SAMPLE | true | true, false | Controls whether the model generates text by sampling from the probability distribution of each next word (when True), as opposed to just picking the most probable next word (when False). |
MODEL_DEFAULT_TEMPERATURE | 1.0 | Any Float | Hyperparameter that controls the randomness of predictions, the higher the more random. |
MODEL_DEFAULT_TOP_P | 1.0 | Any Float | Parameter used for nucleus sampling, controlling the randomness of output by only considering the smallest set of words whose cumulative probability exceeds the value p, thereby filtering out less likely words. |
MODEL_DEFAULT_TOP_K | 50 | Any Int | Parameter that limits the selection to the top k most probable next words, balancing the randomness and predictability of the generated text. |
Variable Name | Default Value | Values | Description |
---|---|---|---|
MODEL_DEFAULT_MAX_NEW_TOKENS | 2048 | Any Int | Max new tokens to generate per prompt request, depending on the model's capabilities. |
MODEL_DEFAULT_REPETITION_PENALTY | 1 | Any Float | Parameter used to discourage the model from repeating the same words or phrases, increasing the diversity of the generated text. |
MODEL_DEFAULT_LENGTH_PENALTY | 1 | Any Float | Parameter that adjusts the model's preference for longer or shorter sequences, with values greater than 1 favoring longer sequences and values less than 1 favoring shorter ones. |
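To make the defaults above concrete, here is an illustrative sketch of how they translate into a standard transformers generate() call; it is not the project's internal code:

```python
# Hedged sketch: how the MODEL_DEFAULT_* values above map onto transformers generation arguments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)

torch.manual_seed(42)  # MODEL_SEED

inputs = tokenizer("Tell me a short story about a robot.", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    num_beams=1,             # MODEL_DEFAULT_NUM_BEAMS
    do_sample=True,          # MODEL_DEFAULT_DO_SAMPLE
    temperature=1.0,         # MODEL_DEFAULT_TEMPERATURE
    top_p=1.0,               # MODEL_DEFAULT_TOP_P
    top_k=50,                # MODEL_DEFAULT_TOP_K
    max_new_tokens=2048,     # MODEL_DEFAULT_MAX_NEW_TOKENS
    repetition_penalty=1.0,  # MODEL_DEFAULT_REPETITION_PENALTY
    length_penalty=1.0,      # MODEL_DEFAULT_LENGTH_PENALTY
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```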
- https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
- https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
- https://github.com/coreui/coreui-free-react-admin-template