Commit 3b06cb4

Merge pull request #6421 from oobabooga/dev
Merge dev branch
oobabooga authored Oct 1, 2024
2 parents f98431c + cca9d6e commit 3b06cb4
Showing 31 changed files with 467 additions and 307 deletions.
48 changes: 22 additions & 26 deletions README.md
@@ -10,33 +10,29 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.

## Features

* Multiple backends for text generation in a single UI and API, including [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [HQQ](https://github.com/mobiusml/hqq), and [AQLM](https://github.com/Vahe1994/AQLM) are also supported through the Transformers loader.
* OpenAI-compatible API server with Chat and Completions endpoints – see the [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples).
* Automatic prompt formatting for each model using the Jinja2 template in its metadata.
* Three chat modes: `instruct`, `chat-instruct`, and `chat`, allowing for both instruction-following and casual conversations with characters. `chat-instruct` mode automatically applies the model's template to the chat prompt, ensuring high-quality outputs without manual setup.
* "Past chats" menu to quickly switch between conversations and start new ones.
* Free-form generation in the Default/Notebook tabs without being limited to chat turns. Send formatted chat conversations from the Chat tab to these tabs.
* Multiple sampling parameters and generation options for sophisticated text generation control.
* Easy switching between different models through the UI without restarting, using the "Model" tab.
* Simple LoRA fine-tuning tool to customize models with your data.
* All in one folder. The requirements are installed in a self-contained `installer_files` folder that doesn't interfere with the system's environment.
* Extensions support, including numerous built-in and user-contributed extensions. See [the wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [the extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.
- Supports multiple text generation backends in one UI/API, including [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp), and [ExLlamaV2](https://github.com/turboderp/exllamav2). [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [HQQ](https://github.com/mobiusml/hqq), and [AQLM](https://github.com/Vahe1994/AQLM) are also supported but you need to install them manually.
- OpenAI-compatible API with Chat and Completions endpoints – see [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples).
- Automatic prompt formatting using Jinja2 templates.
- Three chat modes: `instruct`, `chat-instruct`, and `chat`, with automatic prompt templates in `chat-instruct`.
- "Past chats" menu to quickly switch between conversations.
- Free-form text generation in the Default/Notebook tabs without being limited to chat turns. You can send formatted conversations from the Chat tab to these.
- Multiple sampling parameters and generation options for sophisticated text generation control.
- Switch between different models easily in the UI without restarting.
- Simple LoRA fine-tuning tool.
- Requirements installed in a self-contained `installer_files` directory that doesn't interfere with the system environment.
- Extension support, with numerous built-in and user-contributed extensions available. See the [wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.

## How to install

1) Clone or [download](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip) the repository.
2) Run the `start_linux.sh`, `start_windows.bat`, `start_macos.sh`, or `start_wsl.bat` script depending on your OS.
1) Clone or [download the repository](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip).
2) Run the script that matches your OS: `start_linux.sh`, `start_windows.bat`, `start_macos.sh`, or `start_wsl.bat`.
3) Select your GPU vendor when asked.
4) Once the installation ends, browse to `http://localhost:7860`.
5) Have fun!

To restart the web UI in the future, run the `start_` script again.
To restart the web UI later, just run the same `start_` script. If you need to reinstall, delete the `installer_files` folder created during setup and run the script again.

This script creates an `installer_files` folder where it sets up the project's requirements. If you need to reinstall the requirements, just delete that folder and start the web UI again.

The script accepts command-line flags, such as `./start_linux.sh --help`. Alternatively, you can edit the `CMD_FLAGS.txt` file with a text editor and add your flags there, such as `--api` in case you need to use the API.

To get updates in the future, run `update_wizard_linux.sh`, `update_wizard_windows.bat`, `update_wizard_macos.sh`, or `update_wizard_wsl.bat`.
You can use command-line flags, like `./start_linux.sh --help`, or add them to `CMD_FLAGS.txt` (such as `--api` to enable API use). To update the project, run `update_wizard_linux.sh`, `update_wizard_windows.bat`, `update_wizard_macos.sh`, or `update_wizard_wsl.bat`.

<details>
<summary>
@@ -80,12 +76,12 @@ conda activate textgen

| System | GPU | Command |
|--------|---------|---------|
| Linux/WSL | NVIDIA | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121` |
| Linux/WSL | CPU only | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cpu` |
| Linux | AMD | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/rocm5.6` |
| MacOS + MPS | Any | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2` |
| Windows | NVIDIA | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121` |
| Windows | CPU only | `pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2` |
| Linux/WSL | NVIDIA | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121` |
| Linux/WSL | CPU only | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cpu` |
| Linux | AMD | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/rocm6.1` |
| MacOS + MPS | Any | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1` |
| Windows | NVIDIA | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121` |
| Windows | CPU only | `pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1` |

The up-to-date commands can be found here: https://pytorch.org/get-started/locally/.
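
After installing, a quick check confirms which build was picked up; a small sketch (the expected version matches the commands above):

```python
import torch

# Sanity check after running one of the install commands above
print(torch.__version__)                   # e.g. 2.4.1 after this update
print(torch.cuda.is_available())           # True for working NVIDIA/ROCm builds, False for CPU-only
print(torch.backends.mps.is_available())   # True on Apple Silicon with the MPS build
```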

@@ -150,7 +146,7 @@ Then browse to
1) For Kepler GPUs and older, you will need to install CUDA 11.8 instead of 12:

```
pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
conda install -y -c "nvidia/label/cuda-11.8.0" cuda-runtime
```

26 changes: 25 additions & 1 deletion docs/12 - OpenAI API.md
@@ -19,7 +19,7 @@ Add `--api` to your command-line flags.

### Examples

For the documentation with all the parameters and their types, consult `http://127.0.0.1:5000/docs` or the [typing.py](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/typing.py) file.
For the documentation with all the endpoints, parameters and their types, consult `http://127.0.0.1:5000/docs` or the [typing.py](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/typing.py) file.

The official examples in the [OpenAI documentation](https://platform.openai.com/docs/api-reference) should also work, and the same parameters apply (although the API here has more optional parameters).
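
For example, the official `openai` Python client (v1 or later) can simply be pointed at the local server. A minimal sketch, assuming the server was started with `--api` on the default port 5000 and that no API key is enforced (the key below is a placeholder):

```python
from openai import OpenAI

# Placeholder key: the local server does not require one by default
client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder; the currently loaded model is used
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
)
print(response.choices[0].message.content)
```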

@@ -114,6 +114,30 @@ curl -k http://127.0.0.1:5000/v1/internal/logits \
}'
```

#### List models

```shell
curl -k http://127.0.0.1:5000/v1/internal/model/list \
-H "Content-Type: application/json"
```

#### Load model

```shell
curl -k http://127.0.0.1:5000/v1/internal/model/load \
-H "Content-Type: application/json" \
-d '{
"model_name": "model_name",
"args": {
"load_in_4bit": true,
"n_gpu_layers": 12
},
"settings": {
"instruction_template": "Alpaca"
}
}'
```
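
The same request can be sent from Python; a minimal sketch using `requests`, where the model name and loader options are placeholders; adapt them to a model present in your `models` folder:

```python
import requests

# Assumes the web UI is running locally with --api enabled (default port 5000)
url = "http://127.0.0.1:5000/v1/internal/model/load"

payload = {
    "model_name": "model_name",        # placeholder: a model present in your models directory
    "args": {                          # loader options (these mirror the command-line flags)
        "load_in_4bit": True,
        "n_gpu_layers": 12,
    },
    "settings": {                      # settings applied after the model is loaded
        "instruction_template": "Alpaca",
    },
}

response = requests.post(url, json=payload)
response.raise_for_status()
print("Model loaded.")
```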

#### Python chat example

```python
2 changes: 1 addition & 1 deletion extensions/Training_PRO/script.py
@@ -241,7 +241,7 @@ def ui():
stride_length = gr.Slider(label='Stride', minimum=1, maximum=2048, value=512, step=1, info='Used to make the evaluation faster at the cost of accuracy. 1 = slowest but most accurate. 512 is a common value.')

with gr.Column():
max_length = gr.Slider(label='max_length', minimum=0, maximum=shared.settings['truncation_length_max'], value=0, step=1, info='The context for each evaluation. If set to 0, the maximum context length for the model will be used.')
max_length = gr.Number(label='max_length', precision=0, step=256, value=0, info='The context for each evaluation. If set to 0, the maximum context length for the model will be used.')

with gr.Row():
start_current_evaluation = gr.Button("Evaluate loaded model")
5 changes: 3 additions & 2 deletions extensions/openai/completions.py
@@ -154,8 +154,9 @@ def convert_history(history):
elif item['type'] == 'text' and isinstance(item['text'], str):
content = item['text']

if image_url and content:
if image_url:
new_history.append({"image_url": image_url, "role": "user"})
if content:
new_history.append({"content": content, "role": "user"})
else:
new_history.append(entry)
@@ -234,7 +235,7 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False, p
raise InvalidRequestError(message="messages: missing content", param='messages')

# Chat Completions
object_type = 'chat.completions' if not stream else 'chat.completions.chunk'
object_type = 'chat.completion' if not stream else 'chat.completion.chunk'
created_time = int(time.time())
cmpl_id = "chatcmpl-%d" % (int(time.time() * 1000000000))
resp_list = 'data' if is_legacy else 'choices'
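
With these changes, the image part and the text part of a user message are appended to the history with separate checks, so a message that carries both yields two consecutive entries. The `object` field in chat responses is also corrected to the OpenAI naming (`chat.completion` and `chat.completion.chunk`). A sketch of the resulting history entries for a mixed message, with placeholder values:

```python
# A multimodal user message now yields two entries, one per part:
new_history = [
    {"image_url": "data:image/png;base64,...", "role": "user"},   # placeholder image payload
    {"content": "What is shown in this picture?", "role": "user"},
]
```
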
2 changes: 2 additions & 0 deletions extensions/openai/typing.py
@@ -37,6 +37,8 @@ class GenerationOptions(BaseModel):
dry_base: float = 1.75
dry_allowed_length: int = 2
dry_sequence_breakers: str = '"\\n", ":", "\\"", "*"'
xtc_threshold: float = 0.1
xtc_probability: float = 0
truncation_length: int = 0
max_tokens_second: int = 0
prompt_lookup_num_tokens: int = 0
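
The two new fields extend the custom generation options that the API accepts alongside the standard OpenAI parameters. A hedged sketch of a chat completion request that sets them (values are illustrative; the sampling behaviour itself is implemented in the samplers code, not shown in this diff):

```python
import requests

# Assumes a local server started with --api (default port 5000)
url = "http://127.0.0.1:5000/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "Write a short story opening."}],
    "max_tokens": 200,
    # New XTC sampler options introduced in this commit; illustrative values
    "xtc_threshold": 0.1,
    "xtc_probability": 0.5,
}

print(requests.post(url, json=payload).json()["choices"][0]["message"]["content"])
```
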
11 changes: 11 additions & 0 deletions js/main.js
@@ -600,4 +600,15 @@ headerBar.addEventListener("click", (e) => {
}
});

//------------------------------------------------
// Add a confirmation dialog when leaving the page
// Useful to avoid data loss
//------------------------------------------------
window.addEventListener('beforeunload', function (event) {
// Cancel the event
event.preventDefault();
// Chrome requires returnValue to be set
event.returnValue = '';
});

moveToChatTab();
4 changes: 3 additions & 1 deletion modules/LoRA.py
@@ -1,7 +1,6 @@
from pathlib import Path

import torch
from peft import PeftModel
from transformers import is_torch_xpu_available

import modules.shared as shared
@@ -85,6 +84,9 @@ def add_lora_autogptq(lora_names):


def add_lora_transformers(lora_names):

from peft import PeftModel

prior_set = set(shared.lora_names)
added_set = set(lora_names) - prior_set
removed_set = prior_set - set(lora_names)
28 changes: 24 additions & 4 deletions modules/chat.py
@@ -1059,7 +1059,12 @@ def handle_start_new_chat_click(state):

convert_to_markdown.cache_clear()

return [history, html, gr.update(choices=histories, value=histories[0][1])]
if len(histories) > 0:
past_chats_update = gr.update(choices=histories, value=histories[0][1])
else:
past_chats_update = gr.update(choices=histories)

return [history, html, past_chats_update]


def handle_delete_chat_confirm_click(state):
@@ -1110,10 +1115,15 @@ def handle_upload_chat_history(load_chat_history, state):

convert_to_markdown.cache_clear()

if len(histories) > 0:
past_chats_update = gr.update(choices=histories, value=histories[0][1])
else:
past_chats_update = gr.update(choices=histories)

return [
history,
html,
gr.update(choices=histories, value=histories[0][1])
past_chats_update
]


@@ -1132,6 +1142,11 @@ def handle_character_menu_change(state):

convert_to_markdown.cache_clear()

if len(histories) > 0:
past_chats_update = gr.update(choices=histories, value=histories[0][1])
else:
past_chats_update = gr.update(choices=histories)

return [
history,
html,
@@ -1140,7 +1155,7 @@ def handle_character_menu_change(state):
picture,
greeting,
context,
gr.update(choices=histories, value=histories[0][1]),
past_chats_update,
]


@@ -1151,12 +1166,17 @@ def handle_mode_change(state):

convert_to_markdown.cache_clear()

if len(histories) > 0:
past_chats_update = gr.update(choices=histories, value=histories[0][1])
else:
past_chats_update = gr.update(choices=histories)

return [
history,
html,
gr.update(visible=state['mode'] != 'instruct'),
gr.update(visible=state['mode'] == 'chat-instruct'),
gr.update(choices=histories, value=histories[0][1])
past_chats_update
]


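The new guard, repeated in four handlers, prevents indexing into an empty `histories` list (e.g. when no saved chats exist): `histories[0][1]` is only used as the dropdown value when at least one entry is present. Written as a standalone function for clarity (a sketch, not part of the commit):

```python
def past_chats_update_for(histories):
    """Return a Past chats dropdown update, selecting the most recent chat if any exist."""
    if len(histories) > 0:
        return gr.update(choices=histories, value=histories[0][1])

    return gr.update(choices=histories)
```
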
34 changes: 18 additions & 16 deletions modules/exllamav2.py
@@ -7,6 +7,7 @@
ExLlamaV2Cache,
ExLlamaV2Cache_8bit,
ExLlamaV2Cache_Q4,
ExLlamaV2Cache_TP,
ExLlamaV2Config,
ExLlamaV2Tokenizer
)
@@ -18,14 +19,6 @@

try:
import flash_attn
except ModuleNotFoundError:
logger.warning(
'You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage '
'to be a lot higher than it could be.\n'
'Try installing flash-attention following the instructions here: '
'https://github.com/Dao-AILab/flash-attention#installation-and-features'
)
pass
except Exception:
logger.warning('Failed to load flash-attention due to the following error:\n')
traceback.print_exc()
@@ -54,21 +47,30 @@ def from_pretrained(self, path_to_model):

model = ExLlamaV2(config)

if not shared.args.autosplit:
split = None
if shared.args.gpu_split:
split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
split = None
if shared.args.gpu_split:
split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]

if shared.args.enable_tp:
model.load_tp(split)
elif not shared.args.autosplit:
model.load(split)

# Determine the correct cache type
if shared.args.cache_8bit:
cache = ExLlamaV2Cache_8bit(model, lazy=shared.args.autosplit)
cache_type = ExLlamaV2Cache_8bit
elif shared.args.cache_4bit:
cache = ExLlamaV2Cache_Q4(model, lazy=shared.args.autosplit)
cache_type = ExLlamaV2Cache_Q4
else:
cache = ExLlamaV2Cache(model, lazy=shared.args.autosplit)
cache_type = ExLlamaV2Cache

if shared.args.autosplit:
# Use TP if specified
if shared.args.enable_tp:
cache = ExLlamaV2Cache_TP(model, base=cache_type)
else:
cache = cache_type(model, lazy=shared.args.autosplit)

if shared.args.autosplit and not shared.args.enable_tp:
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
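
Because added and removed lines are interleaved above, the resulting control flow is easier to read in one piece. A condensed sketch of the new loading logic, mirroring the diff (imports as shown at the top of the file):

```python
# Parse an explicit per-GPU split if one was given
split = None
if shared.args.gpu_split:
    split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]

# Load weights: tensor parallelism takes precedence; otherwise do a regular load
# unless autosplit defers loading until the cache exists
if shared.args.enable_tp:
    model.load_tp(split)
elif not shared.args.autosplit:
    model.load(split)

# Pick the cache class based on the quantization flags
if shared.args.cache_8bit:
    cache_type = ExLlamaV2Cache_8bit
elif shared.args.cache_4bit:
    cache_type = ExLlamaV2Cache_Q4
else:
    cache_type = ExLlamaV2Cache

# Wrap the cache for tensor parallelism, or create it directly
if shared.args.enable_tp:
    cache = ExLlamaV2Cache_TP(model, base=cache_type)
else:
    cache = cache_type(model, lazy=shared.args.autosplit)

# Autosplit loads the model against the cache (not applicable with TP)
if shared.args.autosplit and not shared.args.enable_tp:
    model.load_autosplit(cache)
```

The same change is mirrored in `modules/exllamav2_hf.py` below.
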
34 changes: 18 additions & 16 deletions modules/exllamav2_hf.py
@@ -9,6 +9,7 @@
ExLlamaV2Cache,
ExLlamaV2Cache_8bit,
ExLlamaV2Cache_Q4,
ExLlamaV2Cache_TP,
ExLlamaV2Config
)
from torch.nn import CrossEntropyLoss
@@ -20,14 +21,6 @@

try:
import flash_attn
except ModuleNotFoundError:
logger.warning(
'You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage '
'to be a lot higher than it could be.\n'
'Try installing flash-attention following the instructions here: '
'https://github.com/Dao-AILab/flash-attention#installation-and-features'
)
pass
except Exception:
logger.warning('Failed to load flash-attention due to the following error:\n')
traceback.print_exc()
@@ -42,21 +35,30 @@ def __init__(self, config: ExLlamaV2Config):

self.ex_model = ExLlamaV2(config)

if not shared.args.autosplit:
split = None
if shared.args.gpu_split:
split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
split = None
if shared.args.gpu_split:
split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]

if shared.args.enable_tp:
self.ex_model.load_tp(split)
elif not shared.args.autosplit:
self.ex_model.load(split)

# Determine the correct cache type
if shared.args.cache_8bit:
self.ex_cache = ExLlamaV2Cache_8bit(self.ex_model, lazy=shared.args.autosplit)
cache_type = ExLlamaV2Cache_8bit
elif shared.args.cache_4bit:
self.ex_cache = ExLlamaV2Cache_Q4(self.ex_model, lazy=shared.args.autosplit)
cache_type = ExLlamaV2Cache_Q4
else:
self.ex_cache = ExLlamaV2Cache(self.ex_model, lazy=shared.args.autosplit)
cache_type = ExLlamaV2Cache

if shared.args.autosplit:
# Use TP if specified
if shared.args.enable_tp:
self.ex_cache = ExLlamaV2Cache_TP(self.ex_model, base=cache_type)
else:
self.ex_cache = cache_type(self.ex_model, lazy=shared.args.autosplit)

if shared.args.autosplit and not shared.args.enable_tp:
self.ex_model.load_autosplit(self.ex_cache)

self.past_seq = None
(Diffs for the remaining changed files are not shown.)