Exllama tutorials? #192

Open
NickDatLe opened this issue Jul 25, 2023 · 23 comments
@NickDatLe

I'm new to exllama; are there any tutorials on how to use it? I'm trying it with the Llama-2 70B model.

@SinanAkkoyun
Contributor

SinanAkkoyun commented Jul 25, 2023

There is no specific tutorial, but here is how to set it up and get it running!
(Note: for the 70B model you need at least 42 GB of VRAM, so only a single A6000 / 6000 Ada or a pair of 3090s/4090s can run the model; see the README for speed stats on a mixture of GPUs.)

To begin with, install conda via its install script and then create a new conda environment (so that the pip packages don't mix with other Python projects):

conda create -n exllama python=3.10
# after that
conda activate exllama

Then, clone the repo

git clone https://github.com/turboderp/exllama
cd exllama

# while conda is activated
pip install -r requirements.txt

Next, download a GPTQ-quantized model. TheBloke provides lots of them, and they all work.

# if you don't have git lfs installed: sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ

You're all set. Now, the only thing left is running a test benchmark and finally running the chatbot example.

python test_benchmark_inference.py -p -ppl -d ../path/to/Llama-2-70B-chat-GPTQ/ -gs 16.2,24
# add -gs 16.2,24 when running models that require more VRAM than one GPU can supply

If that is successful, run this and enjoy a chatbot:

python example_chatbot.py -d ../path/to/Llama-2-70B-chat-GPTQ/ -un NickDatLe -bn ChadGPT -p prompt_chatbort.txt -nnl
# -nnl makes it so that the bot can output more than one line

Et voilà. Edit prompt_chatbort.txt inside the exllama repo as you like. Keep in mind that the Llama 2 chat format is different from the one the example provides; I am working on implementing the real prompt format in example_chatbot_llama2chat.py and will open a PR soon.
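
For reference, the official Llama 2 chat prompt format looks roughly like this (the system prompt and messages here are just placeholders):

<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

First user message [/INST] First model reply </s><s>[INST] Second user message [/INST]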

@NickDatLe
Author

NickDatLe commented Jul 25, 2023

Thank you for your help, Sinan! I followed your instructions and ran:

python test_benchmark_inference.py -p -ppl -d ../path/to/Llama-2-70B-chat-GPTQ/

I have 2x 4090 GPUs. It's only using one of them as far as I can tell, so I'm getting a CUDA out-of-memory error:

(py311) nick@easyai:~/dev/exllama$ python test_benchmark_inference.py -p -ppl -d Llama-2-70B-chat-GPTQ
-- Perplexity:
-- - Dataset: datasets/wikitext2_val_sample.jsonl
-- - Chunks: 100
-- - Chunk size: 2048 -> 2048
-- - Chunk overlap: 0
-- - Min. chunk size: 50
-- - Key: text
-- Tokenizer: Llama-2-70B-chat-GPTQ/tokenizer.model
-- Model config: Llama-2-70B-chat-GPTQ/config.json
-- Model: Llama-2-70B-chat-GPTQ/gptq_model-4bit--1g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --sdp_thd: 8
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- Options: ['perf', 'perplexity']
Traceback (most recent call last):
  File "/home/nick/dev/exllama/test_benchmark_inference.py", line 125, in <module>
    model = timer("Load model", lambda: ExLlama(config))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nick/dev/exllama/test_benchmark_inference.py", line 56, in timer
    ret = func()
          ^^^^^^
  File "/home/nick/dev/exllama/test_benchmark_inference.py", line 125, in <lambda>
    model = timer("Load model", lambda: ExLlama(config))
                                        ^^^^^^^^^^^^^^^
  File "/home/nick/dev/exllama/model.py", line 831, in __init__
    tensor = tensor.to(device, non_blocking = True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 23.64 GiB total capacity; 23.23 GiB already allocated; 23.88 MiB free; 23.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How do I split the model between the two GPUs? Does it matter that I'm running Python 3.11?

@turboderp
Owner

turboderp commented Jul 25, 2023

You need to define how the weights are split across the GPUs. There's a bit of trial and error in that, currently, since you're only supplying the maximum allocation for weights, not activations. The space needed for activations is a difficult function of exactly which layers end up on each device, so the best you can do for now is try some values and adjust based on which GPU runs out of memory first. The syntax is just -gs 16.2,24 to use up to 16.2 GB on the first device, then up to 24 GB on the second. I find that works pretty well on 70B, but YMMV, especially with lower group sizes.
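
For example (the numbers here are purely illustrative), if cuda:0 is the device that runs out of memory, you would lower the first value and rerun:

python test_benchmark_inference.py -p -ppl -d ../path/to/Llama-2-70B-chat-GPTQ/ -gs 15,24
# if cuda:1 runs out instead, shift weight the other way, e.g. -gs 17.5,22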

@NickDatLe
Author

I didn't see the -gs flag; after setting it to 16.2,24 like you mentioned, it worked. Thank you!

-- Perplexity:
-- - Dataset: datasets/wikitext2_val_sample.jsonl
-- - Chunks: 100
-- - Chunk size: 2048 -> 2048
-- - Chunk overlap: 0
-- - Min. chunk size: 50
-- - Key: text
-- Tokenizer: Llama-2-70B-chat-GPTQ/tokenizer.model
-- Model config: Llama-2-70B-chat-GPTQ/config.json
-- Model: Llama-2-70B-chat-GPTQ/gptq_model-4bit--1g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --sdp_thd: 8
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- Options: ['gpu_split: 16.2,24', 'perf', 'perplexity']
** Time, Load model: 4.34 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
!! Model has empty group index (discarded)
** VRAM, Model: [cuda:0] 16,902.60 MB - [cuda:1] 17,390.74 MB
** VRAM, Cache: [cuda:0] 308.12 MB - [cuda:1] 320.00 MB
-- Warmup pass 1...
** Time, Warmup: 1.33 seconds
-- Warmup pass 2...
** Time, Warmup: 0.81 seconds
-- Inference, first pass.
** Time, Inference: 1.24 seconds
** Speed: 1545.28 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 17.07 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 21.84 tokens/second
** VRAM, Inference: [cuda:0] 317.16 MB - [cuda:1] 317.29 MB
** VRAM, Total: [cuda:0] 17,527.88 MB - [cuda:1] 18,028.04 MB
-- Loading dataset...
-- Testing 100 chunks..........
** Perplexity: 5.8741

@SinanAkkoyun
Contributor

How do I split the model between the two GPUs? Does it matter that I'm running Python 3.11?

Sorry, I forgot about that! I edited it now.

@NickDatLe
Author

How do I split the model between the two GPUs? Does it matter that I'm running Python 3.11?

Sorry, I forgot about that! I edited it now.

All good, working now; I'm going to learn exllama more. Fascinating stuff!

@NickDatLe
Author

If it's OK with the mods, I'm going to leave this thread open in case someone posts a tutorial or has some great links for exllama.

NickDatLe reopened this Jul 25, 2023
@NickDatLe
Author

@SinanAkkoyun do you know what folks in the LLM community are using to communicate? Discord? Slack?

@SinanAkkoyun
Contributor

@NickDatLe Most people I know use Discord, though it's very decentralized across many servers.

@NickDatLe
Author

@NickDatLe Most people I know use Discord, though it's very decentralized across many servers.

Ahh OK, I will join some Discord servers. It seems "TheBloke" has a server, and that person is very popular on the LLM leaderboard.

@turboderp
Owner

turboderp commented Jul 28, 2023

Where is this? Invite me!

Edit: Never mind I found it.

@cmunna0052

cmunna0052 commented Jul 31, 2023

@SinanAkkoyun Once the test_benchmark_inference.py script has finished successfully, is there an easy way to get the 70B chatbot running in a Jupyter notebook?

Edit: For posterity, it was relatively straightforward to get this working in a notebook environment by adapting code from the example_basic.py file.
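
For anyone else trying the same thing, here is a minimal sketch of what that notebook cell can look like, adapted from example_basic.py (the model path and sampling settings are placeholders, and set_auto_map is my best guess at how the repo exposes the -gs split programmatically; adjust if the API differs):

import os, glob
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "../path/to/Llama-2-70B-chat-GPTQ/"   # placeholder path

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(config_path)            # read config.json
config.model_path = model_path                 # point at the .safetensors weights
config.set_auto_map("16.2,24")                 # same split as -gs 16.2,24 (assumed helper)

model = ExLlama(config)                        # load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)                    # KV cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)

generator.settings.temperature = 0.7           # illustrative sampling settings
generator.settings.top_p = 0.9

print(generator.generate_simple("Hello, my name is", max_new_tokens = 64))

Note that the notebook has to run from inside the exllama repo (or have it on sys.path) so the model/tokenizer/generator modules import.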

@pourfard

pourfard commented Aug 3, 2023

Et voilà. Edit prompt_chatbort.txt inside the exllama repo as you like. Keep in mind that the Llama 2 chat format is different from the one the example provides; I am working on implementing the real prompt format in example_chatbot_llama2chat.py and will open a PR soon.

Can you share "example_chatbot_llama2chat.py" if it is possible?

@SinanAkkoyun
Contributor

@pourfard A PR is incoming today; I will implement it.

@SinanAkkoyun
Contributor

@pourfard
#221

:) Either wait for the PR to be merged or copy the new file example_llama2chat.py directly into your Exllama directory. (Keep in mind, you need the latest version of Exllama)

@nktice

nktice commented Aug 12, 2023

First of all, thank you! exllama is working for me while others do not...

I did some testing of a number of models with the Sally riddle...
https://github.com/nktice/AMD-AI/blob/main/SallyAIRiddle.md
[ And here's my setup in case it's of benefit to other people:
https://github.com/nktice/AMD-AI ]
I did this by hand, through the Oobabooga UI, and it took me a while.

I'd like commands to run exllama from shell scripts (such as bash), along the lines of the sketch at the end of this comment.
So I went looking and was disappointed by the Python files there...
I had hoped to find that they respond with help info from the command line;
"--help" or "-h" could return the parameters and the program's purpose.
[ As this thread is about documentation issues, it seems like this would help. ]
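
Something like this is all I'm after, for reference (paths are placeholders, and the flags are just the ones used earlier in this thread):

#!/bin/bash
# benchmark, then chat, against a local GPTQ model split across two GPUs
python test_benchmark_inference.py -p -ppl -d /path/to/Llama-2-70B-chat-GPTQ/ -gs 16.2,24
python example_chatbot.py -d /path/to/Llama-2-70B-chat-GPTQ/ -un User -bn Bot -p prompt_chatbort.txt -nnl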

@NickDatLe
Author

send invite/link please!

@SinanAkkoyun
Contributor

https://discord.gg/theblokeai

Really awesome

@NickDatLe
Author

https://discord.gg/theblokeai

Really awesome

Add me! nickdle

@SinanAkkoyun
Contributor

Add me! nickdle

You need to join by clicking the link :)

@NickDatLe
Author

I invited you as a friend :)

@SinanAkkoyun
Contributor

@NickDatLe
Oh you mean that, sure! I can't find you in my friend requests, what is your tag? :)

@NickDatLe
Author

@NickDatLe Oh you mean that, sure! I can't find you in my friend requests, what is your tag? :)

nickdle, I sent a friend request.
