Exllama tutorials? #192

Open
NickDatLe opened this issue Jul 25, 2023 · 23 comments
@NickDatLe

I'm new to exllama; are there any tutorials on how to use it? I'm trying it with the Llama-2 70B model.

@SinanAkkoyun
Contributor

SinanAkkoyun commented Jul 25, 2023

There is no specific tutorial, but here is how to set it up and get it running!
(Note: for the 70B model you need at least 42 GB of VRAM, so only a single A6000 / 6000 Ada or a pair of 3090s/4090s can run the model; see the README for speed stats on a mixture of GPUs.)

To begin with, install conda via its install script and then create a new conda environment (so that the pip packages don't mix with other Python projects):

conda create -n exllama python=3.10
# after that
conda activate exllama

Then, clone the repo

git clone https://github.com/turboderp/exllama
cd exllama

# while conda is activated
pip install -r requirements.txt

Next, download a GPTQ-quantized model. TheBloke provides lots of them, and they all work.

# if you don't have git lfs installed: sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ

You're all set. Now, the only thing left is running a test benchmark and finally running the chatbot example.

python test_benchmark_inference.py -p -ppl -d ../path/to/Llama-2-70B-chat-GPTQ/ -gs 16.2,24
# add -gs 16.2,24 when running models that require more VRAM than one GPU can supply

If that is successful, run this and enjoy a chatbot:

python example_chatbot.py -d ../path/to/Llama-2-70B-chat-GPTQ/ -un NickDatLe -bn ChadGPT -p prompt_chatbort.txt -nnl
# -nnl makes it so that the bot can output more than one line

Et voilà. Edit prompt_chatbort.txt inside the exllama repo as you like. Keep in mind that the Llama 2 chat format is different from the one the example provides; I am working on implementing the real prompt format in example_chatbot_llama2chat.py and will open a PR soon.
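
For reference, the official Llama 2 chat prompt format looks roughly like this (the system prompt and messages here are just placeholders):

<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

First user message [/INST] First model reply </s><s>[INST] Second user message [/INST]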

@NickDatLe
Author

NickDatLe commented Jul 25, 2023

Thank you for your help, Sinan! I followed your instructions and ran:

python test_benchmark_inference.py -p -ppl -d ../path/to/Llama-2-70B-chat-GPTQ/

I have 2x 4090 GPUs. It's only using one of them as far as I can tell, so I'm getting a CUDA out-of-memory error:

(py311) nick@easyai:~/dev/exllama$ python test_benchmark_inference.py -p -ppl -d Llama-2-70B-chat-GPTQ
-- Perplexity:
-- - Dataset: datasets/wikitext2_val_sample.jsonl
-- - Chunks: 100
-- - Chunk size: 2048 -> 2048
-- - Chunk overlap: 0
-- - Min. chunk size: 50
-- - Key: text
-- Tokenizer: Llama-2-70B-chat-GPTQ/tokenizer.model
-- Model config: Llama-2-70B-chat-GPTQ/config.json
-- Model: Llama-2-70B-chat-GPTQ/gptq_model-4bit--1g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --sdp_thd: 8
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- Options: ['perf', 'perplexity']
Traceback (most recent call last):
  File "/home/nick/dev/exllama/test_benchmark_inference.py", line 125, in <module>
    model = timer("Load model", lambda: ExLlama(config))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nick/dev/exllama/test_benchmark_inference.py", line 56, in timer
    ret = func()
          ^^^^^^
  File "/home/nick/dev/exllama/test_benchmark_inference.py", line 125, in <lambda>
    model = timer("Load model", lambda: ExLlama(config))
                                        ^^^^^^^^^^^^^^^
  File "/home/nick/dev/exllama/model.py", line 831, in __init__
    tensor = tensor.to(device, non_blocking = True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 23.64 GiB total capacity; 23.23 GiB already allocated; 23.88 MiB free; 23.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How do I split the model between the two GPUs? Does it matter that I'm running Python 3.11?

@turboderp
Owner

turboderp commented Jul 25, 2023

You need to define how the weights are split across the GPUs. There's a bit of trial and error in that, currently, since you're only supplying the maximum allocation for weights, not activations. The space needed for activations is a difficult function of exactly which layers end up on each device, so the best you can do for now is try some values and adjust based on which GPU runs out of memory first. The syntax is just -gs 16.2,24 to use up to 16.2 GB on the first device, then up to 24 GB on the second. I find that works pretty well on 70B, but YMMV, especially with lower group sizes.
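
For example (the numbers here are purely illustrative), if cuda:0 is the device that runs out of memory, you would lower the first value and rerun:

python test_benchmark_inference.py -p -ppl -d ../path/to/Llama-2-70B-chat-GPTQ/ -gs 15,24
# if cuda:1 runs out instead, shift weight the other way, e.g. -gs 17.5,22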

@NickDatLe
Author

I didn't see the -gs flag; after setting it to 16.2,24 like you mentioned, it worked. Thank you!

-- Perplexity:
-- - Dataset: datasets/wikitext2_val_sample.jsonl
-- - Chunks: 100
-- - Chunk size: 2048 -> 2048
-- - Chunk overlap: 0
-- - Min. chunk size: 50
-- - Key: text
-- Tokenizer: Llama-2-70B-chat-GPTQ/tokenizer.model
-- Model config: Llama-2-70B-chat-GPTQ/config.json
-- Model: Llama-2-70B-chat-GPTQ/gptq_model-4bit--1g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --sdp_thd: 8
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- Options: ['gpu_split: 16.2,24', 'perf', 'perplexity']
** Time, Load model: 4.34 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
!! Model has empty group index (discarded)
** VRAM, Model: [cuda:0] 16,902.60 MB - [cuda:1] 17,390.74 MB
** VRAM, Cache: [cuda:0] 308.12 MB - [cuda:1] 320.00 MB
-- Warmup pass 1...
** Time, Warmup: 1.33 seconds
-- Warmup pass 2...
** Time, Warmup: 0.81 seconds
-- Inference, first pass.
** Time, Inference: 1.24 seconds
** Speed: 1545.28 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 17.07 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 21.84 tokens/second
** VRAM, Inference: [cuda:0] 317.16 MB - [cuda:1] 317.29 MB
** VRAM, Total: [cuda:0] 17,527.88 MB - [cuda:1] 18,028.04 MB
-- Loading dataset...
-- Testing 100 chunks..........
** Perplexity: 5.8741

@SinanAkkoyun
Contributor

How do I split the model between the two GPUs? Does it matter that I'm running Python 3.11?

Sorry, I forgot about that! I edited it now.

@NickDatLe
Author

How do I split the model between the two GPUs? Does it matter that I'm running Python 3.11?

Sorry, I forgot about that! I edited it now.

All good, working now; I'm going to learn exllama more. Fascinating stuff!

@NickDatLe
Author

If it's OK with the mods, I'm going to leave this thread open in case someone posts a tutorial or has some great links for exllama.

NickDatLe reopened this Jul 25, 2023
@NickDatLe
Author

@SinanAkkoyun do you know what folks in the LLM community are using to communicate? Discord? Slack?

@SinanAkkoyun
Contributor

@NickDatLe Most people I know use Discord, though it's very decentralized across many servers.

@NickDatLe
Author

@NickDatLe Most people I know use Discord, though it's very decentralized across many servers.

Ahh OK, I will join some Discord servers. It seems "TheBloke" has a server, and that person is very popular on the LLM leaderboard.

@turboderp
Owner

turboderp commented Jul 28, 2023

Where is this? Invite me!

Edit: Never mind I found it.

@cmunna0052

cmunna0052 commented Jul 31, 2023

@SinanAkkoyun Once the test_benchmark_inference.py script has finished successfully, is there an easy way to get the 70B chatbot running in a Jupyter notebook?

Edit: For posterity, it was relatively straightforward to get this working in a notebook environment by adapting code from the example_basic.py file.
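
For anyone else trying the same thing, here is a minimal sketch of what that notebook cell can look like, adapted from example_basic.py (the model path and sampling settings are placeholders, and set_auto_map is my best guess at how the repo exposes the -gs split programmatically; adjust if the API differs):

import os, glob
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "../path/to/Llama-2-70B-chat-GPTQ/"   # placeholder path

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(config_path)            # read config.json
config.model_path = model_path                 # point at the .safetensors weights
config.set_auto_map("16.2,24")                 # same split as -gs 16.2,24 (assumed helper)

model = ExLlama(config)                        # load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)                    # KV cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)

generator.settings.temperature = 0.7           # illustrative sampling settings
generator.settings.top_p = 0.9

print(generator.generate_simple("Hello, my name is", max_new_tokens = 64))

Note that the notebook has to run from inside the exllama repo (or have it on sys.path) so the model/tokenizer/generator modules import.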

@pourfard

pourfard commented Aug 3, 2023

Et voilà. Edit prompt_chatbort.txt inside the exllama repo as you like. Keep in mind that the Llama 2 chat format is different from the one the example provides; I am working on implementing the real prompt format in example_chatbot_llama2chat.py and will open a PR soon.

Can you share "example_chatbot_llama2chat.py" if it is possible?

@SinanAkkoyun
Contributor

@pourfard A PR is incoming today; I will implement it.

@SinanAkkoyun
Contributor

@pourfard
#221

:) Either wait for the PR to be merged or copy the new file example_llama2chat.py directly into your Exllama directory. (Keep in mind, you need the latest version of Exllama)

@nktice

nktice commented Aug 12, 2023

First of all, thank you! exllama is working for me while others do not...

I did some testing of a number of models with the Sally riddle...
https://github.com/nktice/AMD-AI/blob/main/SallyAIRiddle.md
[ And here's my setup in case it's of benefit to other people:
https://github.com/nktice/AMD-AI ]
I did this by hand, through the Oobabooga UI, and it took me a while.

I'd like commands to run exllama from shell scripts (such as bash), along the lines of the sketch at the end of this comment.
So I went looking and was disappointed by the Python files there...
I had hoped to find that they respond with help info from the command line;
"--help" or "-h" could return the parameters and the program's purpose.
[ As this thread is about documentation issues, it seems like this would help. ]
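
Something like this is all I'm after, for reference (paths are placeholders, and the flags are just the ones used earlier in this thread):

#!/bin/bash
# benchmark, then chat, against a local GPTQ model split across two GPUs
python test_benchmark_inference.py -p -ppl -d /path/to/Llama-2-70B-chat-GPTQ/ -gs 16.2,24
python example_chatbot.py -d /path/to/Llama-2-70B-chat-GPTQ/ -un User -bn Bot -p prompt_chatbort.txt -nnl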

@NickDatLe
Author

send invite/link please!

@SinanAkkoyun
Contributor

https://discord.gg/theblokeai

Really awesome

@NickDatLe
Author

https://discord.gg/theblokeai

Really awesome

Add me! nickdle

@SinanAkkoyun
Contributor

Add me! nickdle

You need to join by clicking the link :)

@NickDatLe
Author

I invited you as a friend :)

@SinanAkkoyun
Contributor

@NickDatLe
Oh you mean that, sure! I can't find you in my friend requests, what is your tag? :)

@NickDatLe
Author

@NickDatLe Oh you mean that, sure! I can't find you in my friend requests, what is your tag? :)

nickdle, I sent a friend request.
