Merge remote-tracking branch 'origin/master' into transformers
0cc4m committed Jul 22, 2023
2 parents 44d48a3 + 39b3541 commit 0db7eef
Showing 13 changed files with 286 additions and 206 deletions.
1 change: 1 addition & 0 deletions .github/FUNDING.yml
@@ -0,0 +1 @@
ko_fi: turboderp
114 changes: 44 additions & 70 deletions README.md
@@ -7,8 +7,11 @@ Disclaimer: The project is coming along, but it's still a work in progress!

## Hardware requirements

I am developing on an RTX 4090 and an RTX 3090-Ti. 30-series and later NVIDIA GPUs should be well supported, but
anything Pascal or older with poor FP16 support isn't going to perform well.
[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) or [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa)
are better options at the moment for older GPUs. ROCm is also theoretically supported (via HIP), though I currently
have no AMD devices to test or optimize on.

## Dependencies

@@ -43,13 +46,13 @@ Compute Platform version).

## How to

Clone repo, install dependencies, and run benchmark:

    git clone https://github.com/turboderp/exllama
    cd exllama

    pip install -r requirements.txt

    python test_benchmark_inference.py -d <path_to_model_files> -p -ppl

The CUDA extension is loaded at runtime so there's no need to install it separately. It will be compiled on the first
@@ -60,11 +63,15 @@ Chatbot example:

    python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt

## Python module

jllllll currently maintains an installable Python module [here](https://github.com/jllllll/exllama), which may be more
suitable for integrating ExLlama into other projects.
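
As a rough idea of what programmatic use looks like, here is a minimal generation sketch modeled on the example
scripts in this repo. The import paths, class names and sampler settings are assumptions taken from the current
examples and may differ in the packaged module, so treat this as a starting point rather than a reference:

    import os, glob
    from model import ExLlama, ExLlamaCache, ExLlamaConfig   # package layout may differ, e.g. exllama.model
    from tokenizer import ExLlamaTokenizer
    from generator import ExLlamaGenerator

    model_dir = "/path/to/model_files"                        # placeholder path

    config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
    config.model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

    model = ExLlama(config)                                   # loads the quantized weights onto the GPU
    tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
    cache = ExLlamaCache(model)                               # attention key/value cache
    generator = ExLlamaGenerator(model, tokenizer, cache)

    generator.settings.temperature = 0.95
    generator.settings.top_p = 0.65

    print(generator.generate_simple("Once upon a time,", max_new_tokens = 100))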

## Web UI

I also made a simple web UI for it. Don't look at the JavaScript, it was mostly written by ChatGPT and it will haunt
your dreams. But it sort of works, and it's kinda fun, especially multibot mode:

![_screenshot.jpg](doc/_screenshot.jpg)

@@ -74,13 +81,14 @@ To run it:

    python webui/app.py -d <path_to_model_files>

Note that sessions are stored in `~/exllama_sessions/` by default. You can change that location with `-sd` if you want.
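
For example, to keep sessions in a different directory (the path here is just a placeholder):

    python webui/app.py -d <path_to_model_files> -sd <path_to_session_dir>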

## Docker

For security benefits and easier deployment, it is also possible to run the web UI in an isolated Docker container. Note: the Docker image currently only supports NVIDIA GPUs.

### Requirements

- [Docker](https://docs.docker.com/engine/install/)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

@@ -128,19 +136,18 @@ docker run --gpus all -p 5000:5000 -v <path_to_model_dir>:/data/model/ -v <path_
## Results so far

### New implementation
| Model      | Size  | grpsz | act | Seq. len.            | VRAM      | Prompt     | Best    | Worst   | Ppl  |
|------------|-------|-------|-----|----------------------|-----------|------------|---------|---------|------|
| Llama      | 7B    | 128   | no  | 2,048 t              | 5,194 MB  | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
| Llama      | 13B   | 128   | no  | 2,048 t              | 9,127 MB  | 7,507 t/s  | 102 t/s | 86 t/s  | 5.60 |
| Llama      | 33B   | 128   | no  | 2,048 t              | 20,795 MB | 2,959 t/s  | 47 t/s  | 40 t/s  | 4.60 |
| Llama      | 33B   | 128   | yes | 2,048 t              | 20,795 MB | 2,784 t/s  | 45 t/s  | 37 t/s  | 4.55 |
| Llama      | 33B   | 32    | yes | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s  | 41 t/s  | 37 t/s  | 4.52 |
| Koala      | 13B   | 128   | yes | 2,048 t              | 9,127 MB  | 5,529 t/s  | 93 t/s  | 79 t/s  | 6.73 |
| WizardLM   | 33B   | -     | yes | 2,048 t              | 20,199 MB | 2,313 t/s  | 47 t/s  | 40 t/s  | 5.75 |
| OpenLlama  | 3B    | 128   | yes | 2,048 t              | 3,128 MB  | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |

<sup>1</sup> Can not achieve full sequence length without OoM

All tests done on stock RTX 4090 / 12900K, running with a desktop environment, with a few other apps also using VRAM.

@@ -154,66 +161,33 @@ probably aiming for 20 GB on a 24 GB GPU to ensure there is room for a desktop e
internals.

Perplexity is measured only to verify that the models are working. The dataset used is a particular, small sample from
WikiText, so scores are not comparable to other Llama benchmarks and are only useful for comparing the different Llama
models to one another.
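
For reference, the reported figure is ordinary perplexity: the exponential of the mean negative log-likelihood per
token over the evaluated sample. A minimal sketch with made-up values:

    import math

    # Hypothetical per-token negative log-likelihoods (in nats) from an evaluation pass.
    nll_per_token = [2.1, 1.8, 2.4, 1.9]

    # Perplexity = exp(mean NLL per token).
    ppl = math.exp(sum(nll_per_token) / len(nll_per_token))
    print(f"{ppl:.2f}")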

### Dual GPU results

Since many seem to be interested in running 65B models, I can confirm that this works with two 24 GB GPUs. The
following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:

| Model   | Size | groupsize | act | Seq. len. | VRAM      | Prompt    | Best   | Worst  | Ppl  |
|---------|------|-----------|-----|-----------|-----------|-----------|--------|--------|------|
| Llama   | 65B  | 128       | yes | 2,048 t   | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s | 4.20 |
| Llama   | 65B  | 32        | yes | 2,048 t   | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s | 4.11 |
| Llama-2 | 70B  | 128       | yes | 2,048 t   | 40,680 MB | 914 t/s   | 17 t/s | 14 t/s | 4.15 |
| Llama-2 | 70B  | 32        | yes | 2,048 t   | 36,815 MB | 874 t/s   | 15 t/s | 12 t/s | 4.10 |
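
For reference, the dual-GPU runs just add the `-gs` split to the same scripts used above, along these lines (the model
path is a placeholder, and the split values depend on your particular cards):

    python test_benchmark_inference.py -d <path_to_65b_model_files> -p -ppl -gs 17.2,24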

### Testing long sequences

The following tests were all done on **33B/65B, 4bit 128g** with various settings, just to test the max sequence length
and get a sense of what can be achieved with different or multiple GPUs right now. Llama goes incoherent generating
past 2048 tokens anyway, but with some fine-tuning, who knows? Note that these tests were run a while ago and the
speeds are no longer current.

| | Size | Seq. len. | VRAM | Long seq. | Ind. |
|------------------------|------|-----------|----------------------|-----------|--------|
| 4090/24GB | 33B | 2,516 t | 22,145 MB | 1140 t/s | 28 t/s |
| 4090/24GB + 3070Ti/8GB | 33B | 3,932 t | 22,055 MB + 7,377 MB | 840 t/s | 22 t/s |
| A6000/48GB (headless) | 33B | 9,032 t | 46,863 MB | 645 t/s | 12 t/s |
| A100/80GB (headless) | 65B | 9,520 t | 79,009 MB | 650 t/s | 9 t/s |
Note that perplexity scores may not be strictly apples-to-apples between Llama and Llama 2 due to their different
pretraining datasets.

## Todo

Moved the todo list [here](doc/TODO.md).

## Compatibility

I downloaded a whole bunch of GPTQ models to test compatibility. [Here](doc/model_compatibility.md) is a list of
models confirmed to be working right now.

## Recent updates

**2023-06-02**: Web UI is now in a fairly working state. Expect it to be a little scuffed in places. There will be a
rewrite at some point to make the client-side code less seizure-inducing. It has multibot mode, chat rewind and editing
features, sessions, and more. I'm going to build it out with support for instruct prompting and such, in time.

**2023-06-04**: Refactored a whole bunch to move more of the work into the extension, setting up for more tuning
options to come soon and eventually auto tuning. Also optimized a little, for about a 5% speedup.

**2023-06-06**: Some minor optimizations. Also it should now compile the extension more easily and run more seamlessly
on Windows.

**2023-06-09**: Fused most of the self-attention step. More to come. Slight speedup already, but more importantly went
from 69% actual CPU utilization to 37%. This should do a lot to address the bottleneck on CPUs with lower
single-threaded performance.

**2023-06-10**: Docker support now! And some minor optimizations. Cleaned up the project a bit.

**2023-06-11**: Added some concurrency in a couple of places. It's only beneficial on the 4090, on small models where the
cores are somewhat underutilized and the L2 cache can keep up. For the 3090 it's detrimental to performance, so it's
disabled by default. YMMV. Use `-cs` to try it out.

**2023-06-17**: Fixed a nasty bug in the fused attention that was causing slightly incorrect cache states on 13B and
33B models. You definitely want to update.

**2023-06-18**: LoRA support now. Still needs a lot of testing and some optimization, and currently you can't stack
multiple LoRAs during the same inference. There's also no support in the web UI yet.
**2023-07-19**: Added support for grouped-query attention and Llama-2 70b. There's still a bit of optimization to do,
since it slows down considerably on very long sequences despite GQA having the potential to be faster. Also could use
some more thorough testing.
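
A quick note on what grouped-query attention changes: the model has fewer key/value heads than query heads, so each
K/V head is shared by a group of query heads. One straightforward way to run that through a standard attention kernel
is to repeat each K/V head up to the query-head count, which is what a `repeat_kv`-style helper does (see the TODO
item below about eliminating `ExLlamaAttention.repeat_kv`). The following is a generic PyTorch sketch of that idea,
not this project's actual kernel code:

    import torch

    def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
        # x: (batch, num_kv_heads, seq_len, head_dim)
        # Returns (batch, num_kv_heads * n_rep, seq_len, head_dim) so K/V line up
        # head-for-head with the query tensor.
        if n_rep == 1:
            return x
        bsz, n_kv, seq, dim = x.shape
        x = x[:, :, None, :, :].expand(bsz, n_kv, n_rep, seq, dim)
        return x.reshape(bsz, n_kv * n_rep, seq, dim)

    # Example: 64 query heads sharing 8 K/V heads -> repeat each K/V head 8 times.
    k = torch.randn(1, 8, 16, 128)
    print(repeat_kv(k, 64 // 8).shape)  # torch.Size([1, 64, 16, 128])

A kernel that indexes the shared K/V heads directly would avoid materializing these expanded tensors, which is why it
is listed as a potential optimization.
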
68 changes: 15 additions & 53 deletions doc/TODO.md
@@ -1,84 +1,46 @@
## Model compatibility

- [x] Support for act-order models ~~(a bit slow for now)~~
- [x] ~~Support for v1 models without groupsize~~ Nah.
- [x] Test more models
- [x] Consider support for loading GGML models (not feasible)
- [x] Figure out if there are quantized models with irregular groupsize (there are some at least with no groupsize)
- [ ] Verify compatibility with Llama-2 34B once released

## GPU compatibility (etc.)

- [x] Support for ROCm/AMD GPUs
- [ ] Optimizations for ROCm
- [ ] Test that CUDA code works on GTX 10-series and RTX 20-series at some point
- [ ] Optimizations for RTX 20-series maybe
- [x] Test performance on P40 (would be a good GPU to support)
- [ ] Look into improving P40 performance
- [x] Tunable kernel parameters
- [ ] More tunable kernel parameters
- [x] Test on Windows
- [x] Easier extension loading on Windows
- [x] Setup instructions for Windows

## Testing

- [x] Figure out an apples-to-apples way of comparing perplexity with other implementations
- [ ] Compile charts of inference speed vs. context length for a variety of models, and compare to other implementations
- [ ] Test a bunch of LoRAs to make sure all combinations of rank and target layers work
- [ ] More testing on Llama 2 models

## Optimization

- [x] ~~Fix layer streaming so it isn't unusably slow~~ (removed)
- [x] ~~Allow layer streaming to integrate with other features like device splitting~~ Nope
- [x] ~~Provide alternative backend to allow layers on CPU~~ Nah

- [x] Support for de-quantizing select matrices at load time
- [x] ~~Better vector-matrix multiplication for de-quantized matrices~~ (dequant was a dead end)
- [x] Fused QKV projection
- [x] Fused MLP
- [x] Fused RoPE
- [x] ~~Build attention mask in CUDA rather than PyTorch~~
- [x] ~~Disable attention mask when it isn't needed~~ (not possible with SDP)
- [x] Figure out why inference appears to be CPU-bound (kernel launch overhead)
- [x] Reduce no. kernel launches to minimum (tail launch, fusion etc.)
- [x] Measure PyTorch module overhead (negligible in eval mode)
- [x] Examine if scaled_dot_product_attention is actually the best attention method for single tokens (it's not)
- [ ] Implement attention in CUDA
- [x] Rewrite at least the quantized matmul kernel. Should be a bunch of special cases to consider
- [x] Experiment with concurrent streams where possible (fused MLP and QKV proj.)
- [x] Faster low-rank matmul to speed up LoRAs
- [ ] Flash Attention 2.0 (?)
- [ ] Find a way to eliminate `ExLlamaAttention.repeat_kv` (custom attention kernel?)
- [ ] C++ implementations of sampler functions

## Generation

- [x] Memory-efficient beam search implementation
- [ ] Optimized/batched beam search
- [ ] Batched generation
- [ ] Multi-token censoring/de-censoring
- [ ] Multi-token repetition penalties
- [x] (Multi) LoRA support
- [ ] Allow stackable LoRAs
- [x] Guided generation (chat with multiple bots at once, etc.)
- [ ] Guidance or equivalent

## Interface

- [x] Simple web interface?
- [ ] Comprehensive API server (more than `example_flask.py`)

## Web UI

- [ ] Controls to enable beam search
- [ ] Rewrite/refactor all the JavaScript and CSS
- [ ] Make it a little prettier
- [ ] Test various edge cases
- [ ] Better error handling
- [ ] LoRA controls
- [ ] Multiple chat modes with prompt templates (instruct, etc.)

## ??

- [ ] FP8/FP16 overlays
- [ ] Support for other quantization methods
- [ ] Support for other LLM architectures
- [ ] Allow for backpropagation
- [ ] LoRA training features
- [ ] Soft prompt training
15 changes: 7 additions & 8 deletions doc/model_compatibility.md
@@ -1,6 +1,6 @@
## Working models

As of **2023-07-19**, the following GPTQ models on HuggingFace all appear to be working:

- iambestfeed/open_llama_3b_4bit_128g
- Neko-Institute-of-Science/LLaMA-7B-4bit-128g
@@ -9,6 +9,7 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
- Neko-Institute-of-Science/LLaMA-30B-4bit-128g
- Neko-Institute-of-Science/LLaMA-65B-4bit-32g
- Neko-Institute-of-Science/LLaMA-65B-4bit-128g
- Panchovix/LLaMA-2-70B-GPTQ-transformers4.32.0.dev0
- reeducator/bluemoonrp-13b
- reeducator/bluemoonrp-30b
- TehVenom/Metharme-13b-4bit-GPTQ
@@ -17,8 +18,11 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
- TheBloke/GPT4All-13B-snoozy-GPTQ
- TheBloke/guanaco-33B-GPTQ
- TheBloke/guanaco-65B-GPTQ
- TheBloke/h2ogpt-oasst1-512-30B-GPTQ <sup>1</sup>
- TheBloke/koala-13B-GPTQ-4bit-128g
- TheBloke/Llama-2-13B-chat-GPTQ (128g)
- TheBloke/Llama-2-13B-GPTQ (32g, 64g, 128g)
- TheBloke/Llama-2-70B-GPTQ (32g, 128g)
- TheBloke/Manticore-13B-GPTQ
- TheBloke/medalpaca-13B-GPTQ-4bit
- TheBloke/medalpaca-13B-GPTQ-4bit (compat version)
@@ -39,11 +43,6 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
- Yhyu13/chimera-inst-chat-13b-gptq-4bit
- Yhyu13/oasst-rlhf-2-llama-30b-7k-steps-gptq-4bit

<sup>1</sup> This particular model, uniquely, shows somewhat worse perplexity when matmul is done by the custom CUDA
kernel rather than cuBLAS. Maybe it's extra sensitive to rounding errors for some reason? Either way, it does work.

## Non-working models

None as of **2023-07-19**.

v1 models are still unsupported, as are pickle files.
2 changes: 1 addition & 1 deletion exllama/lora.py
@@ -116,7 +116,7 @@ def __init__(self, model, lora_config_path, lora_path):

# Move to target device

device = self.config.device_map.map(target_key)
tensor = tensor.to(device, non_blocking = True)

# Store adapter tensor
