Exllama tutorials? #192
I'm new to exllama, are there any tutorials on how to use this? I'm trying this with the Llama 2 70B model.
There is no specific tutorial, but here is how to set it up and get it running!

To begin with, you want to install conda (via its install script) and then create a new conda environment (so that pip packages don't mix with other Python projects):

```
conda create -n exllama python=3.10
# after that
conda activate exllama
```

Then, clone the repo and install its dependencies:

```
git clone https://github.com/turboderp/exllama
cd exllama
# while conda is activated
pip install -r requirements.txt
```

Next, you want to download a GPTQ-quantized model. TheBloke provides lots of them that all work:

```
# if you don't have git lfs installed:
sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
```

You're all set. Now, the only thing left is running a test benchmark and finally running the chatbot example:

```
python test_benchmark_inference.py -p -ppl -d ../path/to/Llama-2-70B-chat-GPTQ/ -gs 16.2,24
# add -gs 16.2,24 when running models that require more VRAM than one GPU can supply
```

If that is successful, run this and enjoy a chatbot:

```
python example_chatbot.py -d ../path/to/Llama-2-70B-chat-GPTQ/ -un NickDatLe -bn ChadGPT -p prompt_chatbort.txt -nnl
# -nnl makes it so that the bot can output more than one line
```

Et voilà! Edit the prompt_chatbort.txt inside the exllama repo as you like. Keep in mind that the Llama 2 chat format is different from the one the example provides; I am working on implementing the real prompt format into the example.
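For reference, since the comment above notes that the stock example does not use it, this is what the raw Llama 2 chat format looks like. This is a sketch only: the file name prompt_llama2.txt is made up here, and whether example_chatbot.py's prompt loader accepts this format unmodified is not confirmed in this thread.

```
# hypothetical: write the raw Llama 2 chat template into a prompt file
cat > prompt_llama2.txt << 'EOF'
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Hello! [/INST]
EOF
```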
Thank you for your help, Sinan! I followed your instructions.
I have 2x 4090 GPUs, but it's only using one of them as far as I can tell, so I'm getting a CUDA out-of-memory error.
How do I split the model between the two GPUs? Does it matter that I'm running Python 3.11?
You need to define how weights are to be split across the GPUs. There's a bit of trial and error in that, currently, since you're only supplying the maximum allocation for weights, not activations. The space needed for activations is a difficult function of exactly which layers end up on each device, so the best you can do for now is just try some values and adjust based on which GPU runs out of memory first. The syntax is just `-gs` followed by a comma-separated list of how many gigabytes of weights to allocate to each GPU, e.g. `-gs 16.2,24`.
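For example, building on the benchmark command from the walkthrough above (the split values here are just starting guesses to iterate on):

```
# give GPU 0 less than its full 24 GB so activations have headroom
python test_benchmark_inference.py -p -d ../path/to/Llama-2-70B-chat-GPTQ/ -gs 16.2,24

# if GPU 0 is still the one running out of memory, move more weights to GPU 1
python test_benchmark_inference.py -p -d ../path/to/Llama-2-70B-chat-GPTQ/ -gs 14,24
```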
I didn't see the -gs flag; after setting it to 16.2,24 like you mentioned, it worked. Thank you! -- Perplexity:
Sorry, I forgot about that! I edited it now.
All good, it's working now; I'm going to learn exllama more. Fascinating stuff!
If it's ok with the mods, I'm going to leave this thread open in case someone posts a tutorial or has some great links to exllama.
@SinanAkkoyun do you know what folks in the LLM community are using to communicate? Discord? Slack?
@NickDatLe Most people I know use Discord, though it's very decentralized across many servers.
Ahh ok, I will join some Discord servers. It seems "TheBloke" has a server, and that person is very popular on the LLM leaderboard.
Where is this? Invite me! Edit: Never mind, I found it.
@SinanAkkoyun Once the test_benchmark_inference.py script has finished successfully, is there an easy way to get the 70B chatbot running in a Jupyter notebook? Edit: For posterity, it was relatively straightforward to work in a notebook environment by adapting code from the example_basic.py file.
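For anyone wanting to reproduce that, here is a minimal sketch of the setup, assuming the conda environment from the walkthrough above. Launching from the repo root matters because code adapted from example_basic.py imports modules that live there:

```
# inside the activated exllama env
pip install notebook
# start the notebook server from the exllama repo root
cd exllama
jupyter notebook
```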
Can you share "example_chatbot_llama2chat.py", if it is possible?
@pourfard a PR is incoming today, I will implement it
First of all, thank you! exllama is working for me while others are not... I did some testing of a number of models with the Sally riddle... I'd like commands to run exllama from shell scripts (such as a bash shell).
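The commands from the walkthrough above drop straight into a script. A minimal sketch, assuming the conda env and model path used earlier in the thread (adjust MODEL_DIR and the -gs split for your hardware):

```
#!/usr/bin/env bash
set -euo pipefail

MODEL_DIR="../path/to/Llama-2-70B-chat-GPTQ/"

# make `conda activate` usable in a non-interactive shell
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate exllama

# same invocation as the chatbot example above
python example_chatbot.py -d "$MODEL_DIR" \
  -un NickDatLe -bn ChadGPT -p prompt_chatbort.txt -nnl -gs 16.2,24
```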
send invite/link please!
Really awesome
Add me! nickdle
You need to join by clicking the link :)
I invited you as a friend :)
@NickDatLe
nickdle, I sent a friend request.