This project containerises oobabooga/text-generation-webui for use with Intel Arc GPUs, using `python:3.11-slim-bookworm` as the base image and oneAPI 2024.1 as the runtime.
I've tested it on an Ubuntu 24.04 host with Linux kernel 6.8.0-31, using the in-tree (builtin) drivers, on an Intel Arc A770M (the laptop version of the A770) with Podman 4.9.3.

I haven't done any proper measurements, but quick testing with `mistral-7b-openorca.Q5_K_M.gguf` gives me about 12.5 tokens/s on the A770M. Note that the first generation may be 5-6 times slower (while in the "warmup" phase).
`server.py` is specified as the `ENTRYPOINT`, not as a `CMD`, so you can simply append your arguments right after the image name. The default is `--listen`.
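For example, arguments appended after the image name are passed straight to `server.py` (the image name `text-generation-webui-arc` and the `/app/models` mount path below are placeholders; substitute your own tag and paths):

```sh
# Extra flags after the image name go to server.py via the ENTRYPOINT.
# --device /dev/dri exposes the Arc GPU to the container.
podman run --rm -p 7860:7860 \
  --device /dev/dri \
  -v ./models:/app/models \
  text-generation-webui-arc \
  --listen --api
```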
The container is expected to run rootless (e.g., via Podman), where `root` inside the container maps to a non-root user on the host. Rootful Docker users may want to drop privileges within the container by extending the image and assigning a user.
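A minimal sketch of dropping privileges by extending the image. The base image tag, UID, and `/app` install path are assumptions; adjust them to match your build:

```dockerfile
# Hypothetical tag — substitute the image you built from this project.
FROM localhost/text-generation-webui-arc:latest

# Create an unprivileged user and hand the app directory over to it.
# /app is an assumed install path — check where the webui lives in your image.
RUN useradd --create-home --uid 1000 webui \
    && chown -R webui:webui /app

# The ENTRYPOINT is inherited and now runs as the non-root user.
USER webui
```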
Depending on whether `localhost` is available in your container runtime, Gradio may attempt to share the UI even without the `--share` flag. You can suppress this behaviour by deny-listing `api.gradio.app` in your firewall or via `/etc/hosts`.
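For example, a sinkhole entry in `/etc/hosts` (on the host for rootless setups, or baked into the container image) prevents the share endpoint from resolving:

```
# Resolve Gradio's share endpoint to an unroutable address
0.0.0.0 api.gradio.app
```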
This project was heavily inspired by @Atinoda's dockerisation of Text-Generation-WebUI.