Install Local Chatbots locally

Samuel Sithakoul edited this page Mar 26, 2023 · 13 revisions

Hardware specifications and setup

To run chatbot models locally, you need a GPU.

For example, Pygmalion comes in several model sizes:

  • 350m and 1.3b fit on small GPUs (< 4GB of VRAM)
  • 2.7b fits in around 6-8GB of VRAM
  • 6b needs at least 10GB of VRAM

For RWKV models:

  • 169m and 430m fit on very small GPUs (< 2GB of VRAM)
  • 1b5 and 3b fit in under 4GB of VRAM if you change the loading strategy (see the strategy parameter below)
  • 7b and 14b need at least 10GB of VRAM

For Pygmalion, you can use int8 quantization and offloading to reduce GPU memory usage.
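As a rough illustration of why quantization helps, the sketch below estimates the size of the model weights alone from the parameter count and the bytes used per parameter (2 for fp16, 1 for int8). It is a back-of-the-envelope calculation, not a measurement: real VRAM usage is higher because of activations and the CUDA context, and the function name is hypothetical.

```python
# Rough estimate of weight memory only; actual VRAM usage will be higher.
FP16_BYTES = 2  # bytes per parameter in half precision
INT8_BYTES = 1  # bytes per parameter with int8 quantization


def estimate_weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Return the size of the weights in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


for name, n_params in [("2.7b", 2.7e9), ("6b", 6e9)]:
    fp16 = estimate_weight_memory_gb(n_params, FP16_BYTES)
    int8 = estimate_weight_memory_gb(n_params, INT8_BYTES)
    print(f"{name}: ~{fp16:.1f} GB in fp16, ~{int8:.1f} GB in int8")
```

Halving the bytes per parameter halves the weight footprint, which is why int8 can make a model fit on a smaller card.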

Make sure you have installed the NVIDIA drivers, CUDA 11.7 (download version 11.7 specifically, not another one) and the corresponding cuDNN (tutorial here), and that you have enough free storage (6GB for the 2.7B model and 15GB for the 6B model).

Int8 quantization also requires specific NVIDIA GPUs: Turing (RTX 20xx; T4) or Ampere (RTX 30xx; A4-A100), i.e. a GPU from 2018 or newer.
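One way to check this requirement is through the CUDA compute capability of the card: Turing cards report 7.5 and Ampere cards report 8.0 or 8.6. The sketch below assumes that "Turing or newer" translates to a capability of at least 7.5; the function name and threshold constant are hypothetical. If PyTorch is installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for your GPU.

```python
from typing import Tuple

# Assumption: int8 needs Turing or newer, i.e. compute capability >= 7.5.
MIN_INT8_CAPABILITY: Tuple[int, int] = (7, 5)


def supports_int8(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets the threshold."""
    return tuple(capability) >= MIN_INT8_CAPABILITY


# Compute capabilities of a few well-known cards.
examples = {
    "GTX 1080 (Pascal)": (6, 1),
    "RTX 2080 / T4 (Turing)": (7, 5),
    "A100 (Ampere)": (8, 0),
    "RTX 3080 (Ampere)": (8, 6),
}

for name, capability in examples.items():
    verdict = "int8 should work" if supports_int8(capability) else "set use_int_8 to false"
    print(f"{name}: {verdict}")
```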

How to add these chatbot models

For most models (not RWKV)

  • On the Hugging Face website, go to the model page, for example https://huggingface.co/PygmalionAI/pygmalion-2.7b
  • Follow the instructions there to download the model, for example by running these commands in a Command Prompt or PowerShell window opened in the directory of your choice:
git lfs install
git clone https://huggingface.co/PygmalionAI/pygmalion-2.7b
  • Once the download has finished, move the newly created folder into chatbot_models inside the Monik.A.I folder. You will be able to select it at your next login!
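A common pitfall with this kind of download is cloning without Git LFS active, which leaves small text "pointer" files in place of the real weights. The sketch below scans a model folder for such pointer files by looking for the standard Git LFS pointer header. The function name is hypothetical and the check is only a sanity test, not part of Monik.A.I itself.

```python
from pathlib import Path
from typing import List

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir) -> List[Path]:
    """Return files under model_dir that are still Git LFS pointer stubs."""
    root = Path(model_dir)
    stubs = []
    for path in sorted(root.rglob("*")):
        # Skip directories and Git's own internal files.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            stubs.append(path)
    return stubs
```

For example, `find_lfs_pointers("chatbot_models/pygmalion-2.7b")` should return an empty list after a complete download. If it lists any files, running `git lfs pull` inside the model folder normally fetches the real weights.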

For RWKV

Parameters for inference

Because these models are open source, you can customize many aspects of text generation. The inference parameters are set in pygmalion/pygmalion_config.yml:

  • char_json: the JSON file containing information about Monika (persona, context, dialogue examples)
  • context_size: the number of sentences from the history that the chatbot remembers
  • max_new_tokens: the maximum number of tokens generated, not counting the user input (a token is roughly a word)
  • model_name: the name of the model to use from the chatbot_models folder. For an RWKV model, this is not a subfolder but a file ending in .pth
  • temperature: controls the randomness of the model. Lower values give more deterministic, more in-character answers; higher values give more "creativity" (0.5 to 1 works well)
  • repetition_penalty: penalizes the model for repeating the same sentences or words (1.0-1.3 for chatbots from Hugging Face, 0.2-0.8 for RWKV)
  • strategy: for RWKV only. Controls how the model is loaded on your PC (CPU or GPU). The default is cuda fp16 (everything on the GPU). If that is too much for your GPU, use cuda fp16 *n -> cpu fp32, where n is a number between 10 and 30: the lower n is, the more of the model goes to the CPU, which makes generation slower. If you don't have a GPU, use cpu fp32 (very slow).
  • use_int_8: enables int8 quantization so the model takes less GPU memory. Set it to false if your GPU is too old for that (see the hardware requirements above)
  • You can try modifying the other parameters, but the ones above are the easiest to understand
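To make the strategy string and the suggested ranges concrete, here is a small sketch that builds a strategy value and flags sampling settings outside the ranges listed above. Both function names are hypothetical helpers, not part of Monik.A.I, and the sketch assumes n counts what stays on the GPU, consistent with "the lower the n, the more goes to the CPU".

```python
from typing import List, Optional


def rwkv_strategy(gpu_layers: Optional[int] = None, has_gpu: bool = True) -> str:
    """Build a value for the `strategy` setting.

    gpu_layers=None -> whole model on the GPU (the default).
    gpu_layers=n    -> split form, with n written after the `*`.
    has_gpu=False   -> CPU only (very slow).
    """
    if not has_gpu:
        return "cpu fp32"
    if gpu_layers is None:
        return "cuda fp16"
    if gpu_layers < 1:
        raise ValueError("gpu_layers must be a positive integer")
    return f"cuda fp16 *{gpu_layers} -> cpu fp32"


def check_sampling(temperature: float, repetition_penalty: float, is_rwkv: bool) -> List[str]:
    """Return warnings for values outside the ranges suggested in the list above."""
    warnings = []
    if not 0.5 <= temperature <= 1.0:
        warnings.append("temperature outside 0.5-1")
    low, high = (0.2, 0.8) if is_rwkv else (1.0, 1.3)
    if not low <= repetition_penalty <= high:
        warnings.append(f"repetition_penalty outside {low}-{high}")
    return warnings


print(rwkv_strategy())               # full GPU
print(rwkv_strategy(gpu_layers=20))  # split between GPU and CPU
print(check_sampling(0.7, 1.1, is_rwkv=True))
```

The last call shows why the model type matters: a repetition_penalty of 1.1 is within the suggested range for a Hugging Face model but outside it for RWKV.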

You can view the chat history in chat_history.txt. This file is loaded each time you launch the game so that the model remembers your previous conversations.
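The interaction between the history and context_size can be pictured as keeping only the most recent entries. The sketch below is purely illustrative: it assumes one sentence per line, which may not match the actual layout of chat_history.txt, and the function name is hypothetical.

```python
from typing import List


def recent_history(lines: List[str], context_size: int) -> List[str]:
    """Keep the last `context_size` non-empty lines of the history."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if context_size <= 0:
        return []
    return cleaned[-context_size:]


# Made-up sample history, one sentence per line (an assumption).
sample = [
    "Hello Monika.",
    "",
    "Hi! How was your day?",
    "Pretty good.",
    "I'm glad to hear that.",
]
print(recent_history(sample, context_size=2))
```

A larger context_size lets the model see more of the conversation, at the cost of a longer prompt and more memory per generation.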

More information on their official rentry here.
