Install Chatbots Locally
To run chatbot models locally, make sure you have a GPU.
For example, Pygmalion comes in several model sizes:
- 350m and 1.3b fit on small GPUs (< 4GB of VRAM)
- 2.7b fits in around 6-8GB of VRAM
- 6b needs at least 10GB of VRAM
For RWKV models:
- 169m and 430m fit on very small GPUs (< 2GB of VRAM)
- 1b5 and 3b fit in less than 4GB of VRAM if you change the strategy
- 7b and 14b need at least 10GB of VRAM
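The size lists above can be turned into a small lookup to see which models are worth trying on a given GPU. This is only a sketch: the table names and the `models_that_fit` helper are hypothetical, and the numbers are rough budgets read from the lists ("6-8GB" is taken as 8, "at least 10GB" as 10), not measured requirements.

```python
# Rough VRAM budgets in GB, read from the lists above (assumed, not measured).
PYGMALION_VRAM_BUDGET_GB = {"350m": 4, "1.3b": 4, "2.7b": 8, "6b": 10}

# For RWKV, 1b5 and 3b only fit in 4GB when the strategy is changed.
RWKV_VRAM_BUDGET_GB = {
    "169m": 2, "430m": 2,
    "1b5": 4, "3b": 4,
    "7b": 10, "14b": 10,
}


def models_that_fit(vram_gb, budget_table):
    """Return the model sizes whose budget is within the available VRAM."""
    return [name for name, budget in budget_table.items() if budget <= vram_gb]


if __name__ == "__main__":
    # Example: a GPU with 8GB of VRAM.
    print(models_that_fit(8, PYGMALION_VRAM_BUDGET_GB))
    print(models_that_fit(8, RWKV_VRAM_BUDGET_GB))
```

Int8 quantization and offloading can lower the real usage, so treat the result as a conservative starting point.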
For Pygmalion, you can use int8 quantization and offloading to reduce GPU memory usage.
Make sure you have installed the NVIDIA drivers, CUDA 11.7 (download version 11.7 specifically, not another one), and the corresponding cuDNN (tutorial here), and that you have enough free storage space (6GB for the 2.7B model and 15GB for the 6B model).
Int8 quantization also requires specific NVIDIA GPUs: Turing (RTX 20xx; T4) or Ampere (RTX 30xx; A4-A100), that is, a GPU from 2018 or newer.
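If you are unsure whether your GPU qualifies, one way to check is through its CUDA compute capability. The sketch below assumes that "Turing or newer" corresponds to compute capability 7.5 or higher (Turing is 7.5, Ampere is 8.x). Both helper names are hypothetical, and the second one only works if PyTorch is installed.

```python
def supports_int8(capability):
    """Return True if a (major, minor) compute capability is Turing (7.5) or newer.

    The 7.5 threshold is an assumption based on the GPU families listed above.
    """
    return tuple(capability) >= (7, 5)


def local_gpu_supports_int8():
    """Check the first CUDA device, returning False if PyTorch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # get_device_capability returns a (major, minor) tuple for the device.
    return supports_int8(torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    print("int8 quantization looks usable:", local_gpu_supports_int8())
```

If this reports `False`, setting use_int_8 to false in the configuration is the safer choice.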
- On the Hugging Face website, go to the model page, for example https://huggingface.co/PygmalionAI/pygmalion-2.7b
- Follow the instructions there to download the model, for example by running these commands in a Command Prompt or PowerShell window opened in any folder:
git lfs install
git clone https://huggingface.co/PygmalionAI/pygmalion-2.7b
- Once the download is finished, move the folder that was created into chatbot_models inside the Monik.A.I folder. You will be able to choose it at your next login.
- The recommended model for each size:
- Download the .pth file from the link and put it in chatbot_models
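To confirm that a download ended up in the right place, you can list what is inside chatbot_models. This is a minimal sketch, not the game's own code: `list_local_models` is a hypothetical helper, and it assumes that Hugging Face models are stored as subfolders while RWKV models are single .pth files, as described above.

```python
from pathlib import Path


def list_local_models(models_dir="chatbot_models"):
    """Return the names of model folders and .pth files found in models_dir."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    names = []
    for entry in sorted(root.iterdir()):
        # Hugging Face models are folders; RWKV models are .pth files.
        if entry.is_dir() or entry.suffix == ".pth":
            names.append(entry.name)
    return names


if __name__ == "__main__":
    # Run this from inside the Monik.A.I folder.
    for name in list_local_models():
        print(name)
```

An empty result usually means the script was run from the wrong folder or the model was placed somewhere else.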
Because these models are open source, you can customize many aspects of generation. You can change several inference parameters in pygmalion/pygmalion_config.yml:
- char_json: the JSON file containing information about Monika (persona, context, dialogue examples)
- context_size: the number of sentences the chatbot remembers from the history
- max_new_tokens: the number of tokens generated, not counting the user input (think of a token as roughly a word)
- model_name: the name of the model to use from the chatbot_models folder; if it is an RWKV model, it is not a subfolder but a file ending with .pth
- temperature: controls the randomness of the model; lower values give more deterministic but more in-character answers, higher values give more "creativity" (between 0.5 and 1 is good)
- repetition_penalty: penalizes the model for repeating the same sentences or words (1.0-1.3 for chatbots from Hugging Face, 0.2-0.8 for RWKV)
- strategy: for RWKV only. Controls how the model is loaded on your PC (CPU or GPU); the default is cuda fp16 (full GPU). If that is too much for your GPU, you can use cuda fp16 *n -> cpu fp32, where n is a number between 10 and 30; the lower n is, the more of the model goes to the CPU, but the slower it runs. If you don't have a GPU, use cpu fp32 (very slow).
- use_int_8: enables int8 quantization to reduce the model's GPU memory usage. Set it to false if your GPU is too old for that (see the requirements above)
You can try modifying the other parameters, but the ones above are the easiest to understand.
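The three forms of the strategy value described above can be generated with a small helper. This is an illustrative sketch only: `build_strategy` and its arguments are hypothetical names, and the 10-30 range is the one suggested above rather than a hard limit.

```python
def build_strategy(gpu_layers=None, use_gpu=True):
    """Build a value for the strategy parameter.

    gpu_layers is the n in "cuda fp16 *n -> cpu fp32"; None means full GPU.
    """
    if not use_gpu:
        return "cpu fp32"  # no GPU: everything runs on the CPU (very slow)
    if gpu_layers is None:
        return "cuda fp16"  # default: the whole model on the GPU
    if gpu_layers < 1:
        raise ValueError("gpu_layers must be a positive integer")
    # Lower values send more of the model to the CPU.
    return f"cuda fp16 *{gpu_layers} -> cpu fp32"


if __name__ == "__main__":
    print(build_strategy())               # full GPU
    print(build_strategy(gpu_layers=20))  # split between GPU and CPU
    print(build_strategy(use_gpu=False))  # CPU only
```

A reasonable approach is to start with full GPU and lower n step by step only if you run out of memory.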
You can see the chat history in chat_history.txt, which is loaded each time you launch the game so that the model remembers your previous conversations.
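To get a feel for how chat_history.txt and context_size interact, the sketch below keeps only the most recent entries from the file. It is not the game's actual loading code: `load_recent_history` is a hypothetical helper, and it assumes the file is UTF-8 text with one message per line, which may not match the real format.

```python
from collections import deque
from pathlib import Path


def load_recent_history(path="chat_history.txt", context_size=10):
    """Return the last context_size non-empty lines of the history file."""
    history_file = Path(path)
    if not history_file.is_file():
        return []
    with history_file.open(encoding="utf-8") as handle:
        stripped = (line.strip() for line in handle)
        # A bounded deque keeps only the newest context_size entries.
        return list(deque((line for line in stripped if line), maxlen=context_size))


if __name__ == "__main__":
    for message in load_recent_history(context_size=5):
        print(message)
```

A larger context_size lets the model see more of the conversation, at the cost of a longer prompt.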
More information on their official rentry here.