Add Exllamav2 to base image
Atinoda committed Sep 18, 2023
1 parent e1d999f commit faab710
Showing 2 changed files with 3 additions and 1 deletion.
Dockerfile (2 additions, 0 deletions)
@@ -40,6 +40,8 @@ RUN git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda /app/repos
 # Build and install default GPTQ ('quant_cuda')
 ARG TORCH_CUDA_ARCH_LIST="6.1;7.0;7.5;8.0;8.6+PTX"
 RUN cd /app/repositories/GPTQ-for-LLaMa/ && python3 setup_cuda.py install
+# Install exllamav2 and flash attention
+RUN pip install -U ninja exllamav2 && pip install flash-attn --no-build-isolation
 
 FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS base
 # Runtime pre-reqs
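As a quick sanity check on the new layer, the built image can be asked to import both packages. A minimal sketch, assuming the Dockerfile is built from the repository root; the tag `textgen-test` is a placeholder, not part of this commit:

docker build -t textgen-test .
docker run --rm --gpus all textgen-test python3 -c "import exllamav2, flash_attn; print('ok')"

Note that `flash_attn` loads a compiled CUDA extension on import, so the check may only pass on a host with a working NVIDIA container runtime.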
README.md (1 addition, 1 deletion)
@@ -19,7 +19,7 @@ Each variant has the 'extras' included in `default` but has some changes made as

 | Variant | Description |
 |---|---|
-| `default` | Implementation of the vanilla deployment from source. Plus pre-installed `ExLlama` library from `turboderp/exllama`, and CUDA GPU offloading enabled for `llama-cpp`. *This version is recommended for most users.* |
+| `default` | Implementation of the vanilla deployment from source. Plus pre-installed `ExLlamaV2` library from `turboderp/exllamav2`, and CUDA GPU offloading enabled for `llama-cpp`. *This version is recommended for most users.* |
 | `triton` | Updated `GPTQ-for-llama` using the latest `triton` branch from `qwopqwop200/GPTQ-for-LLaMa`. Suitable for Linux only. *This version is accurate but a little slow.* |
 | `cuda` | Updated `GPTQ-for-llama` using the latest `cuda` branch from `qwopqwop200/GPTQ-for-LLaMa`. *This version is very slow!* |
 | `llama-cpu` | GPU support is REMOVED from `llama-cpp`. Suitable for systems without a CUDA-capable GPU. *This is only for when GPU acceleration is not available and is a slower way to run models!* |
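For context, the `default` variant described in the table is typically run with GPU access and the web UI port exposed. A rough sketch; the image tag `atinoda/text-generation-webui:default` and port `7860` are assumptions here, so verify them against the project README:

docker run -it --rm --gpus all -p 7860:7860 atinoda/text-generation-webui:default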
