Update readme etc.
turboderp committed Jul 19, 2023
1 parent 806a12c commit 39b3541
Showing 3 changed files with 54 additions and 99 deletions.
70 changes: 32 additions & 38 deletions README.md
@@ -7,8 +7,11 @@ Disclaimer: The project is coming along, but it's still a work in progress!

## Hardware requirements

I am developing on an RTX 4090 and an RTX 3090-Ti. Both cards support the CUDA kernels, but there might be
incompatibilities with older cards.
I am developing on an RTX 4090 and an RTX 3090-Ti. 30-series and later NVIDIA GPUs should be well supported, but
anything Pascal or older with poor FP16 support isn't going to perform well.
[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) or [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa)
are better options at the moment for older GPUs. ROCm is also theoretically supported (via HIP) though I currently
have no AMD devices to test or optimize on.

## Dependencies

@@ -60,11 +63,15 @@ Chatbot example:

python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt

## Python module

jllllll currently maintains an installable Python module [here](https://github.com/jllllll/exllama), which may be more
suitable for integrating ExLlama with other projects.
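
Whether you run from a clone of this repo or from that package, the shape of the API is roughly what `example_basic.py`
in this repo does. The sketch below is just that, a sketch: the model path is a placeholder, the import paths assume
you're running from the repo root (the packaged module may expose them under an `exllama.` prefix), and the exact
settings attributes are worth checking against the version you actually have.

    import os, glob
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
    from tokenizer import ExLlamaTokenizer
    from generator import ExLlamaGenerator

    # Directory holding config.json, tokenizer.model and the quantized
    # .safetensors weights (placeholder path)
    model_directory = "/path/to/model_files/"

    tokenizer_path = os.path.join(model_directory, "tokenizer.model")
    model_config_path = os.path.join(model_directory, "config.json")
    model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

    config = ExLlamaConfig(model_config_path)    # read hyperparameters from config.json
    config.model_path = model_path               # point the config at the quantized weights

    model = ExLlama(config)                      # load the weights onto the GPU(s)
    tokenizer = ExLlamaTokenizer(tokenizer_path)
    cache = ExLlamaCache(model)                  # key/value cache for inference
    generator = ExLlamaGenerator(model, tokenizer, cache)

    generator.settings.temperature = 0.95        # basic sampling settings
    generator.settings.top_k = 40
    generator.settings.top_p = 0.65

    prompt = "Once upon a time,"
    print(generator.generate_simple(prompt, max_new_tokens = 200))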

## Web UI

I made a simple web UI for it. Like the rest of the project, it's a work in progress. Don't look at the JavaScript,
it was mostly written by ChatGPT and it will haunt your dreams. But it sort of works, and it's kinda fun, especially
multibot mode:
I also made a simple web UI for it. Don't look at the JavaScript; it was mostly written by ChatGPT and it will haunt
your dreams. But it sort of works, and it's kinda fun, especially multibot mode:

![_screenshot.jpg](doc/_screenshot.jpg)

@@ -74,13 +81,14 @@ To run it:

python webui/app.py -d <path_to_model_files>

Note that sessions are stored in `~/exllama_sessions/`. You can change the location of the sessions storage with `-sd`
if you want.
Note that sessions are stored in `~/exllama_sessions/` by default. You can change that location with `-sd` if you want.
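
For example, to keep sessions somewhere else (the sessions directory here is just a placeholder):

python webui/app.py -d <path_to_model_files> -sd <path_to_sessions_dir>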

## Docker

For security benefits and easier deployment, it is also possible to run the web UI in an isolated Docker container. Note that the Docker image currently only supports NVIDIA GPUs.

### Requirements

- [Docker](https://docs.docker.com/engine/install/)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

@@ -128,19 +136,18 @@ docker run --gpus all -p 5000:5000 -v <path_to_model_dir>:/data/model/ -v <path_
## Results so far

### New implementation
| Model | Size | grpsz | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
|------------|-------|-------|-----------------|----------------------|-----------|------------|---------|---------|------|
| Llama | 7B | 128 | no | 2,048 t | 5,194 MB | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
| Llama | 13B | 128 | no | 2,048 t | 9,127 MB | 7,507 t/s | 102 t/s | 86 t/s | 5.60 |
| Llama | 33B | 128 | no | 2,048 t | 20,795 MB | 2,959 t/s | 47 t/s | 40 t/s | 4.60 |
| Llama | 33B | 128 | yes | 2,048 t | 20,795 MB | 2,784 t/s | 45 t/s | 37 t/s | 4.55 |
| Llama | 33B | 32 | yes | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s | 41 t/s | 37 t/s | 4.52 |
| Koala | 13B | 128 | yes | 2,048 t | 9,127 MB | 5,529 t/s | 93 t/s | 79 t/s | 6.73 |
| WizardLM | 33B | - | no <sup>2</sup> | 2,048 t | 20,199 MB | 2,313 t/s | 47 t/s | 40 t/s | 5.75 |
| OpenLlama | 3B | 128 | yes | 2,048 t | 3,128 MB | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |
| Model | Size | grpsz | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
|------------|-------|-------|-----|----------------------|-----------|------------|---------|---------|------|
| Llama | 7B | 128 | no | 2,048 t | 5,194 MB | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
| Llama | 13B | 128 | no | 2,048 t | 9,127 MB | 7,507 t/s | 102 t/s | 86 t/s | 5.60 |
| Llama | 33B | 128 | no | 2,048 t | 20,795 MB | 2,959 t/s | 47 t/s | 40 t/s | 4.60 |
| Llama | 33B | 128 | yes | 2,048 t | 20,795 MB | 2,784 t/s | 45 t/s | 37 t/s | 4.55 |
| Llama | 33B | 32 | yes | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s | 41 t/s | 37 t/s | 4.52 |
| Koala | 13B | 128 | yes | 2,048 t | 9,127 MB | 5,529 t/s | 93 t/s | 79 t/s | 6.73 |
| WizardLM | 33B | - | yes | 2,048 t | 20,199 MB | 2,313 t/s | 47 t/s | 40 t/s | 5.75 |
| OpenLlama | 3B | 128 | yes | 2,048 t | 3,128 MB | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |

<sup>1</sup> Can not achieve full sequence length without OoM (yet)
<sup>2</sup> Not quite sure if this is act-order or not. Weights have no group index, at least
<sup>1</sup> Cannot achieve full sequence length without OoM

All tests done on stock RTX 4090 / 12900K, running with a desktop environment, with a few other apps also using VRAM.

@@ -154,12 +161,12 @@ probably aiming for 20 GB on a 24 GB GPU to ensure there is room for a desktop e
internals.

Perplexity is measured only to verify that the models are working. The dataset used is a particular, small sample from
WikiText, so scores are not necessarily comparable to other Llama benchmarks.
WikiText, so scores are not comparable to other Llama benchmarks and are only useful for comparing the different Llama
models to one another.
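
For what it's worth, the number reported is plain perplexity, i.e. the exponential of the average negative
log-likelihood the model assigns to the tokens in the sample. A minimal sketch of that calculation (not the benchmark
code in this repo, which also handles tokenization and chunking of the dataset):

    import math

    def perplexity(token_logprobs):
        # token_logprobs: natural-log probabilities the model assigned to each
        # actual next token in the evaluation sample
        avg_nll = -sum(token_logprobs) / len(token_logprobs)
        return math.exp(avg_nll)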

### Dual GPU results

Since many seem to be interested in running 65B models, I can confirm that this works with two 24 GB GPUs. The
following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:
The following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:

| Model | Size | groupsize | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
|---------|------|-----------|-----|----------------|-----------|-----------|--------|---------|-------|
@@ -168,29 +175,16 @@ following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:
| Llama-2 | 70B | 128 | yes | 2,048 t | 40,680 MB | 914 t/s | 17 t/s | 14 t/s | 4.15 |
| Llama-2 | 70B | 32 | yes | 2,048 t | 36,815 MB | 874 t/s | 15 t/s | 12 t/s | 4.10 |
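
The `-gs` argument is a comma-separated list of how much VRAM (in GB) to allocate on each GPU, in device order, so
`17.2,24` presumably leaves some headroom on the card that also drives the display. The same flag works with the
example scripts, e.g.:

python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt -gs 17.2,24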


### Testing long sequences

The following tests were all done on **33B/65B, 4bit 128g** with various settings, just to test the max sequence length
and get a sense of what can be achieved with different or multiple GPUs right now. Llama goes incoherent generating
past 2048 tokens anyway, but with some fine-tuning, who knows? Note that these tests were run a while ago and the
speeds are no longer current.

| | Size | Seq. len. | VRAM | Long seq. | Ind. |
|------------------------|------|-----------|----------------------|-----------|--------|
| 4090/24GB | 33B | 2,516 t | 22,145 MB | 1140 t/s | 28 t/s |
| 4090/24GB + 3070Ti/8GB | 33B | 3,932 t | 22,055 MB + 7,377 MB | 840 t/s | 22 t/s |
| A6000/48GB (headless) | 33B | 9,032 t | 46,863 MB | 645 t/s | 12 t/s |
| A100/80GB (headless) | 65B | 9,520 t | 79,009 MB | 650 t/s | 9 t/s |
Note that perplexity scores may not be strictly apples-to-apples between Llama and Llama 2 due to their different
pretraining datasets.

## Todo

Moved the todo list [here](doc/TODO.md).

## Compatibility

I downloaded a whole bunch of GPTQ models to test compatibility. [Here](doc/model_compatibility.md) is the list of models
confirmed to be working right now.
[Here](doc/model_compatibility.md) is a list of models confirmed to be working right now.

## Recent updates

68 changes: 15 additions & 53 deletions doc/TODO.md
@@ -1,84 +1,46 @@
## Model compatibility

- [x] Support for act-order models ~~(a bit slow for now)~~
- [x] ~~Support for v1 models without groupsize~~ Nah.
- [x] Test more models
- [x] Consider support for loading GGML models (not feasible)
- [x] Figure out if there are quantized models with irregular groupsize (there are some at least with no groupsize)
- [ ] Verify compatibility with Llama-2 34B once released

## GPU compatibility (etc.)

- [x] Support for ROCm/AMD GPUs
- [ ] Optimize more for ROCm
- [ ] Test that CUDA code works on GTX 10-series and RTX 20-series at some point
- [x] Test performance on P40 (would be a good GPU to support)
- [ ] Improve performance on P40
- [x] Tunable kernel parameters
- [ ] More tunable kernel parameters
- [x] Test on Windows
- [x] Easier extension loading on Windows
- [x] Setup instructions for Windows
- [ ] Optimizations for ROCm
- [ ] Optimizations for RTX 20-series maybe
- [ ] Look into improving P40 performance

## Testing

- [x] Figure out an apples-to-apples way of comparing perplexity with other implementations
- [ ] Compile charts of inference speed vs context length for variety of models, compare to other implementations
- [ ] Test a bunch of LoRAs to make sure all combinations of rank and target layers work
- [ ] More testing on Llama 2 models

## VRAM optimization
## Optimization

- [x] ~~Fix layer streaming so it isn't unusably slow~~ (removed)
- [x] ~~Allow layer streaming to integrate with other features like device splitting~~ Nope
- [x] ~~Provide alternative backend to allow layers on CPU~~ Nah

## Speed optimization

- [x] Support for de-quantizing select matrices at load time
- [x] ~~Better vector-matrix multiplication for de-quantized matrices~~ (dequant was a dead end)
- [x] Fused QKV projection
- [x] Fused MLP
- [x] Fused RoPE
- [x] ~~Build attention mask in CUDA rather than PyTorch~~
- [x] ~~Disable attention mask when it isn't needed~~ (not possible with SDP)
- [x] Figure out why inference appears to be CPU-bound (kernel launch overhead)
- [x] Reduce no. kernel launches to minimum (tail launch, fusion etc.)
- [x] Measure PyTorch module overhead (negligible in eval mode)
- [x] Examine if scaled_dot_product_attention is actually the best attention method for single tokens (it's not)
- [ ] Implement attention in CUDA
- [x] Rewrite at least the quantized matmul kernel. Should be a bunch of special cases to consider
- [x] Experiment with concurrent streams where possible (fused MLP and QKV proj.)
- [x] Faster low-rank matmul to speed up LoRAs
- [ ] Flash Attention 2.0 (?)
- [ ] Find a way to eliminate `ExLlamaAttention.repeat_kv` (custom attention kernel?)
- [ ] C++ implementations of sampler functions

## Generation

- [x] Memory-efficient beam search implementation
- [ ] Optimized beam search
- [ ] Multi-token censoring/de-censoring
- [ ] Multi-token repetition penalties
- [x] (Multi) LoRA support
- [ ] Optimized/batched beam search
- [ ] Allow stackable LoRAs
- [x] Guided generation (chat with multiple bots at once, etc.)
- [ ] Multiple chat modes with prompt templates (instruct, etc.)
- [ ] Batched generation
- [ ] Guidance or equivalent

## Interface

- [x] Simple web interface?
- [ ] API server
- [ ] Comprehensive API server (more than `example_flask.py`)

## Web UI

- [ ] Controls to enable beam search
- [ ] Rewrite/refactor all the JavaScript and CSS
- [ ] Support for prompt formats/instruct mode
- [ ] Make it a little prettier
- [ ] Test various edge cases
- [ ] Better error handling
- [ ] LoRA controls
- [ ] Multiple chat modes with prompt templates (instruct, etc.)

## ??

- [ ] FP8/FP16 overlays
- [ ] Support for other quantization methods
- [ ] Support for other LLM architectures
- [ ] Allow for backpropagation
- [ ] LoRA training features
- [ ] Soft prompt training
15 changes: 7 additions & 8 deletions doc/model_compatibility.md
@@ -1,6 +1,6 @@
## Working models

As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be working:
As of **2023-07-19**, the following GPTQ models on HuggingFace all appear to be working:

- iambestfeed/open_llama_3b_4bit_128g
- Neko-Institute-of-Science/LLaMA-7B-4bit-128g
@@ -9,6 +9,7 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
- Neko-Institute-of-Science/LLaMA-30B-4bit-128g
- Neko-Institute-of-Science/LLaMA-65B-4bit-32g
- Neko-Institute-of-Science/LLaMA-65B-4bit-128g
- Panchovix/LLaMA-2-70B-GPTQ-transformers4.32.0.dev0
- reeducator/bluemoonrp-13b
- reeducator/bluemoonrp-30b
- TehVenom/Metharme-13b-4bit-GPTQ
@@ -17,8 +18,11 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
- TheBloke/GPT4All-13B-snoozy-GPTQ
- TheBloke/guanaco-33B-GPTQ
- TheBloke/guanaco-65B-GPTQ
- TheBloke/h2ogpt-oasst1-512-30B-GPTQ <sup>1</sup>
- TheBloke/h2ogpt-oasst1-512-30B-GPTQ
- TheBloke/koala-13B-GPTQ-4bit-128g
- TheBloke/Llama-2-13B-chat-GPTQ (128g)
- TheBloke/Llama-2-13B-GPTQ (32g, 64g, 128g)
- TheBloke/Llama-2-70B-GPTQ (32g, 128g)
- TheBloke/Manticore-13B-GPTQ
- TheBloke/medalpaca-13B-GPTQ-4bit
- TheBloke/medalpaca-13B-GPTQ-4bit (compat version)
@@ -39,11 +43,6 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
- Yhyu13/chimera-inst-chat-13b-gptq-4bit
- Yhyu13/oasst-rlhf-2-llama-30b-7k-steps-gptq-4bit

<sup>1</sup> This particular model, uniquely, shows somewhat worse perplexity when matmul is done by the custom CUDA
kernel rather than cuBLAS. Maybe it's extra sensitive to rounding errors for some reason? Either way, it does work.

## Non-working models

As of **2023-07-02**, I have found no models that don't work.

v1 models are still unsupported, as are pickle files.
None as of **2023-07-19**.
