Update readme etc.
turboderp committed Jul 19, 2023
1 parent 806a12c commit 39b3541
Showing 3 changed files with 54 additions and 99 deletions.
70 changes: 32 additions & 38 deletions README.md
@@ -7,8 +7,11 @@ Disclaimer: The project is coming along, but it's still a work in progress!

## Hardware requirements

I am developing on an RTX 4090 and an RTX 3090-Ti. Both cards support the CUDA kernels, but there might be
incompatibilities with older cards.
I am developing on an RTX 4090 and an RTX 3090-Ti. 30-series and later NVIDIA GPUs should be well supported, but
anything Pascal or older with poor FP16 support isn't going to perform well.
[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) or [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa)
are better options at the moment for older GPUs. ROCm is also theoretically supported (via HIP) though I currently
have no AMD devices to test or optimize on.

## Dependencies

@@ -60,11 +63,15 @@ Chatbot example:

python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt

## Python module

jllllll currently maintains an installable Python module [here](https://github.com/jllllll/exllama), which may be more
suitable for integrating ExLlama with other projects.
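
Whether you run from a clone of this repo or from that package, the shape of the API is roughly what `example_basic.py`
in this repo does. The sketch below is just that, a sketch: the model path is a placeholder, the import paths assume
you're running from the repo root (the packaged module may expose them under an `exllama.` prefix), and the exact
settings attributes are worth checking against the version you actually have.

    import os, glob
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
    from tokenizer import ExLlamaTokenizer
    from generator import ExLlamaGenerator

    # Directory holding config.json, tokenizer.model and the quantized
    # .safetensors weights (placeholder path)
    model_directory = "/path/to/model_files/"

    tokenizer_path = os.path.join(model_directory, "tokenizer.model")
    model_config_path = os.path.join(model_directory, "config.json")
    model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

    config = ExLlamaConfig(model_config_path)    # read hyperparameters from config.json
    config.model_path = model_path               # point the config at the quantized weights

    model = ExLlama(config)                      # load the weights onto the GPU(s)
    tokenizer = ExLlamaTokenizer(tokenizer_path)
    cache = ExLlamaCache(model)                  # key/value cache for inference
    generator = ExLlamaGenerator(model, tokenizer, cache)

    generator.settings.temperature = 0.95        # basic sampling settings
    generator.settings.top_k = 40
    generator.settings.top_p = 0.65

    prompt = "Once upon a time,"
    print(generator.generate_simple(prompt, max_new_tokens = 200))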

## Web UI

I made a simple web UI for it. Like the rest of the project, it's a work in progress. Don't look at the JavaScript,
it was mostly written by ChatGPT and it will haunt your dreams. But it sort of works, and it's kinda fun, especially
multibot mode:
I also made a simple web UI for it. Don't look at the JavaScript; it was mostly written by ChatGPT and it will haunt
your dreams. But it sort of works, and it's kinda fun, especially multibot mode:

![_screenshot.jpg](doc/_screenshot.jpg)

@@ -74,13 +81,14 @@ To run it:

python webui/app.py -d <path_to_model_files>

Note that sessions are stored in `~/exllama_sessions/`. You can change the location of the sessions storage with `-sd`
if you want.
Note that sessions are stored in `~/exllama_sessions/` by default. You can change that location with `-sd` if you want.
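
For example, to keep sessions somewhere else (the sessions directory here is just a placeholder):

python webui/app.py -d <path_to_model_files> -sd <path_to_sessions_dir>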

## Docker

For security benefits and easier deployment, it is also possible to run the web UI in an isolated Docker container. Note that the Docker image currently only supports NVIDIA GPUs.

### Requirements

- [Docker](https://docs.docker.com/engine/install/)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

@@ -128,19 +136,18 @@ docker run --gpus all -p 5000:5000 -v <path_to_model_dir>:/data/model/ -v <path_
## Results so far

### New implementation
| Model | Size | grpsz | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
|------------|-------|-------|-----------------|----------------------|-----------|------------|---------|---------|------|
| Llama | 7B | 128 | no | 2,048 t | 5,194 MB | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
| Llama | 13B | 128 | no | 2,048 t | 9,127 MB | 7,507 t/s | 102 t/s | 86 t/s | 5.60 |
| Llama | 33B | 128 | no | 2,048 t | 20,795 MB | 2,959 t/s | 47 t/s | 40 t/s | 4.60 |
| Llama | 33B | 128 | yes | 2,048 t | 20,795 MB | 2,784 t/s | 45 t/s | 37 t/s | 4.55 |
| Llama | 33B | 32 | yes | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s | 41 t/s | 37 t/s | 4.52 |
| Koala | 13B | 128 | yes | 2,048 t | 9,127 MB | 5,529 t/s | 93 t/s | 79 t/s | 6.73 |
| WizardLM | 33B | - | no <sup>2</sup> | 2,048 t | 20,199 MB | 2,313 t/s | 47 t/s | 40 t/s | 5.75 |
| OpenLlama | 3B | 128 | yes | 2,048 t | 3,128 MB | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |
| Model | Size | grpsz | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
|------------|-------|-------|-----|----------------------|-----------|------------|---------|---------|------|
| Llama | 7B | 128 | no | 2,048 t | 5,194 MB | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
| Llama | 13B | 128 | no | 2,048 t | 9,127 MB | 7,507 t/s | 102 t/s | 86 t/s | 5.60 |
| Llama | 33B | 128 | no | 2,048 t | 20,795 MB | 2,959 t/s | 47 t/s | 40 t/s | 4.60 |
| Llama | 33B | 128 | yes | 2,048 t | 20,795 MB | 2,784 t/s | 45 t/s | 37 t/s | 4.55 |
| Llama | 33B | 32 | yes | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s | 41 t/s | 37 t/s | 4.52 |
| Koala | 13B | 128 | yes | 2,048 t | 9,127 MB | 5,529 t/s | 93 t/s | 79 t/s | 6.73 |
| WizardLM | 33B | - | yes | 2,048 t | 20,199 MB | 2,313 t/s | 47 t/s | 40 t/s | 5.75 |
| OpenLlama | 3B | 128 | yes | 2,048 t | 3,128 MB | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |

<sup>1</sup> Can not achieve full sequence length without OoM (yet)
<sup>2</sup> Not quite sure if this is act-order or not. Weights have no group index, at least
<sup>1</sup> Cannot achieve full sequence length without OoM

All tests done on stock RTX 4090 / 12900K, running with a desktop environment, with a few other apps also using VRAM.

@@ -154,12 +161,12 @@ probably aiming for 20 GB on a 24 GB GPU to ensure there is room for a desktop e
internals.

Perplexity is measured only to verify that the models are working. The dataset used is a particular, small sample from
WikiText, so scores are not necessarily comparable to other Llama benchmarks.
WikiText, so scores are not comparable to other Llama benchmarks and are only useful for comparing the different Llama
models to one another.
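
For what it's worth, the number reported is plain perplexity, i.e. the exponential of the average negative
log-likelihood the model assigns to the tokens in the sample. A minimal sketch of that calculation (not the benchmark
code in this repo, which also handles tokenization and chunking of the dataset):

    import math

    def perplexity(token_logprobs):
        # token_logprobs: natural-log probabilities the model assigned to each
        # actual next token in the evaluation sample
        avg_nll = -sum(token_logprobs) / len(token_logprobs)
        return math.exp(avg_nll)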

### Dual GPU results

Since many seem to be interested in running 65B models, I can confirm that this works with two 24 GB GPUs. The
following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:
The following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:

| Model | Size | groupsize | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
|---------|------|-----------|-----|----------------|-----------|-----------|--------|---------|-------|
@@ -168,29 +175,16 @@ following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:
| Llama-2 | 70B | 128 | yes | 2,048 t | 40,680 MB | 914 t/s | 17 t/s | 14 t/s | 4.15 |
| Llama-2 | 70B | 32 | yes | 2,048 t | 36,815 MB | 874 t/s | 15 t/s | 12 t/s | 4.10 |
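
The `-gs` argument is a comma-separated list of how much VRAM (in GB) to allocate on each GPU, in device order, so
`17.2,24` presumably leaves some headroom on the card that also drives the display. The same flag works with the
example scripts, e.g.:

python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt -gs 17.2,24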


### Testing long sequences

The following tests were all done on **33B/65B, 4bit 128g** with various settings, just to test the max sequence length
and get a sense of what can be achieved with different or multiple GPUs right now. Llama goes incoherent generating
past 2048 tokens anyway, but with some fine-tuning, who knows? Note that these tests were run a while ago and the
speeds are no longer current.

| | Size | Seq. len. | VRAM | Long seq. | Ind. |
|------------------------|------|-----------|----------------------|-----------|--------|
| 4090/24GB | 33B | 2,516 t | 22,145 MB | 1140 t/s | 28 t/s |
| 4090/24GB + 3070Ti/8GB | 33B | 3,932 t | 22,055 MB + 7,377 MB | 840 t/s | 22 t/s |
| A6000/48GB (headless) | 33B | 9,032 t | 46,863 MB | 645 t/s | 12 t/s |
| A100/80GB (headless) | 65B | 9,520 t | 79,009 MB | 650 t/s | 9 t/s |
Note that perplexity scores may not be strictly apples-to-apples between Llama and Llama 2 due to their different
pretraining datasets.

## Todo

Moved the todo list [here](doc/TODO.md).

## Compatibility

I downloaded a whole bunch of GPTQ models to test compatibility. [Here](doc/model_compatibility.md) is the list of models
confirmed to be working right now.
[Here](doc/model_compatibility.md) is a list of models confirmed to be working right now.

## Recent updates

68 changes: 15 additions & 53 deletions doc/TODO.md
@@ -1,84 +1,46 @@
## Model compatibility

- [x] Support for act-order models ~~(a bit slow for now)~~
- [x] ~~Support for v1 models without groupsize~~ Nah.
- [x] Test more models
- [x] Consider support for loading GGML models (not feasible)
- [x] Figure out if there are quantized models with irregular groupsize (there are some at least with no groupsize)
- [ ] Verify compatibility with Llama-2 34B once released

## GPU compatibility (etc.)

- [x] Support for ROCm/AMD GPUs
- [ ] Optimize more for ROCm
- [ ] Test that CUDA code works on GTX 10-series and RTX 20-series at some point
- [x] Test performance on P40 (would be a good GPU to support)
- [ ] Improve performance on P40
- [x] Tunable kernel parameters
- [ ] More tunable kernel parameters
- [x] Test on Windows
- [x] Easier extension loading on Windows
- [x] Setup instructions for Windows
- [ ] Optimizations for ROCm
- [ ] Optimizations for RTX 20-series maybe
- [ ] Look into improving P40 performance

## Testing

- [x] Figure out an apples-to-apples way of comparing perplexity with other implementations
- [ ] Compile charts of inference speed vs context length for variety of models, compare to other implementations
- [ ] Test a bunch of LoRAs to make sure all combinations of rank and target layers work
- [ ] More testing on Llama 2 models

## VRAM optimization
## Optimization

- [x] ~~Fix layer streaming so it isn't unusably slow~~ (removed)
- [x] ~~Allow layer streaming to integrate with other features like device splitting~~ Nope
- [x] ~~Provide alternative backend to allow layers on CPU~~ Nah

## Speed optimization

- [x] Support for de-quantizing select matrices at load time
- [x] ~~Better vector-matrix multiplication for de-quantized matrices~~ (dequant was a dead end)
- [x] Fused QKV projection
- [x] Fused MLP
- [x] Fused RoPE
- [x] ~~Build attention mask in CUDA rather than PyTorch~~
- [x] ~~Disable attention mask when it isn't needed~~ (not possible with SDP)
- [x] Figure out why inference appears to be CPU-bound (kernel launch overhead)
- [x] Reduce no. kernel launches to minimum (tail launch, fusion etc.)
- [x] Measure PyTorch module overhead (negligible in eval mode)
- [x] Examine if scaled_dot_product_attention is actually the best attention method for single tokens (it's not)
- [ ] Implement attention in CUDA
- [x] Rewrite at least the quantized matmul kernel. Should be a bunch of special cases to consider
- [x] Experiment with concurrent streams where possible (fused MLP and QKV proj.)
- [x] Faster low-rank matmul to speed up LoRAs
- [ ] Flash Attention 2.0 (?)
- [ ] Find a way to eliminate `ExLlamaAttention.repeat_kv` (custom attention kernel?)
- [ ] C++ implementations of sampler functions

## Generation

- [x] Memory-efficient beam search implementation
- [ ] Optimized beam search
- [ ] Multi-token censoring/de-censoring
- [ ] Multi-token repetition penalties
- [x] (Multi) LoRA support
- [ ] Optimized/batched beam search
- [ ] Allow stackable LoRAs
- [x] Guided generation (chat with multiple bots at once, etc.)
- [ ] Multiple chat modes with prompt templates (instruct, etc.)
- [ ] Batched generation
- [ ] Guidance or equivalent

## Interface

- [x] Simple web interface?
- [ ] API server
- [ ] Comprehensive API server (more than `example_flask.py`)

## Web UI

- [ ] Controls to enable beam search
- [ ] Rewrite/refactor all the JavaScript and CSS
- [ ] Support for prompt formats/instruct mode
- [ ] Make it a little prettier
- [ ] Test various edge cases
- [ ] Better error handling
- [ ] LoRA controls
- [ ] Multiple chat modes with prompt templates (instruct, etc.)

## ??

- [ ] FP8/FP16 overlays
- [ ] Support for other quantization methods
- [ ] Support for other LLM architectures
- [ ] Allow for backpropagation
- [ ] LoRA training features
- [ ] Soft prompt training
15 changes: 7 additions & 8 deletions doc/model_compatibility.md
@@ -1,6 +1,6 @@
## Working models

As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be working:
As of **2023-07-19**, the following GPTQ models on HuggingFace all appear to be working:

- iambestfeed/open_llama_3b_4bit_128g
- Neko-Institute-of-Science/LLaMA-7B-4bit-128g
@@ -9,6 +9,7 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
- Neko-Institute-of-Science/LLaMA-30B-4bit-128g
- Neko-Institute-of-Science/LLaMA-65B-4bit-32g
- Neko-Institute-of-Science/LLaMA-65B-4bit-128g
- Panchovix/LLaMA-2-70B-GPTQ-transformers4.32.0.dev0
- reeducator/bluemoonrp-13b
- reeducator/bluemoonrp-30b
- TehVenom/Metharme-13b-4bit-GPTQ
@@ -17,8 +18,11 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
- TheBloke/GPT4All-13B-snoozy-GPTQ
- TheBloke/guanaco-33B-GPTQ
- TheBloke/guanaco-65B-GPTQ
- TheBloke/h2ogpt-oasst1-512-30B-GPTQ <sup>1</sup>
- TheBloke/h2ogpt-oasst1-512-30B-GPTQ
- TheBloke/koala-13B-GPTQ-4bit-128g
- TheBloke/Llama-2-13B-chat-GPTQ (128g)
- TheBloke/Llama-2-13B-GPTQ (32g, 64g, 128g)
- TheBloke/Llama-2-70B-GPTQ (32g, 128g)
- TheBloke/Manticore-13B-GPTQ
- TheBloke/medalpaca-13B-GPTQ-4bit
- TheBloke/medalpaca-13B-GPTQ-4bit (compat version)
@@ -39,11 +43,6 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
- Yhyu13/chimera-inst-chat-13b-gptq-4bit
- Yhyu13/oasst-rlhf-2-llama-30b-7k-steps-gptq-4bit

<sup>1</sup> This particular model, uniquely, shows somewhat worse perplexity when matmul is done by the custom CUDA
kernel rather than cuBLAS. Maybe it's extra sensitive to rounding errors for some reason? Either way, it does work.

## Non-working models

As of **2023-07-02**, I have found no models that don't work.

v1 models are still unsupported, as are pickle files.
None as of **2023-07-19**.
