Fix rule and instructions for TGI #18

Merged 4 commits on Apr 8, 2024
4 changes: 2 additions & 2 deletions Makefile
```diff
@@ -46,8 +46,8 @@ tpu-tgi:
 	docker build --rm -f text-generation-inference/Dockerfile \
 		--build-arg VERSION=$(VERSION) \
 		--build-arg TGI_VERSION=$(TGI_VERSION) \
-		-t tpu-tgi:$(VERSION) .
-	docker tag tpu-tgi:$(VERSION) tpu-tgi:latest
+		-t huggingface/optimum-tpu:$(VERSION)-tgi .
+	docker tag huggingface/optimum-tpu:$(VERSION)-tgi tpu-tgi:latest

 # Run code quality checks
 style_check:
```
51 changes: 50 additions & 1 deletion docs/source/howto/serving.mdx
@@ -1,3 +1,52 @@
# Deploying a Text-Generation Inference server on a Google Cloud TPU instance

## Context

Text-Generation-Inference (TGI) is a highly optimized serving engine that serves Large Language Models (LLMs) in a way
that better leverages the underlying hardware, in this case Cloud TPU.

## Deploy TGI on a Cloud TPU instance

We assume the reader already has a Cloud TPU instance up and running.
If this is not the case, please see our guide to deploy one [here](./deploy.mdx).

### Docker Container Build

Optimum-TPU provides a `make tpu-tgi` target at the root of the repository to help you build a local Docker image.
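For instance, a build might look like the sketch below. The variable names `VERSION` and `TGI_VERSION` come from the Makefile shown above; the version values are illustrative placeholders, not required releases:

```bash
# Build the TPU-enabled TGI image from the repository root.
# VERSION and TGI_VERSION match the build arguments in the Makefile above;
# the values shown here are examples only.
make tpu-tgi VERSION=0.1.0b1 TGI_VERSION=1.4.2
```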

### Docker Container Run

```bash
OPTIMUM_TPU_VERSION=0.1.0b1
docker run -p 8080:80 \
--net=host --privileged \
-v $(pwd)/data:/data \
-e HF_TOKEN=${HF_TOKEN} \
-e HF_BATCH_SIZE=1 \
-e HF_SEQUENCE_LENGTH=1024 \
huggingface/optimum-tpu:${OPTIMUM_TPU_VERSION}-tgi \
--model-id google/gemma-2b \
--max-concurrent-requests 4 \
--max-input-length 512 \
--max-total-tokens 1024 \
--max-batch-prefill-tokens 512 \
--max-batch-total-tokens 1024
```
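
Once the container is up, you can sanity-check the server before sending generation requests. This assumes TGI's standard `/health` route and the port used in the command above:

```bash
# Returns HTTP 200 once the model is loaded and the server is ready.
curl -s -o /dev/null -w "%{http_code}\n" 127.0.0.1:8080/health
```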

### Executing requests against the service

You can query the model using either the `/generate` or `/generate_stream` routes:

```bash
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
```

```bash
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
```
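
`/generate` returns a single JSON object, while `/generate_stream` emits Server-Sent Events, one `data:` JSON payload per generated token. When streaming from the command line, a variant of the request above with curl's buffering disabled prints tokens as they arrive:

```bash
# -N disables output buffering so each server-sent event is shown immediately.
curl -N 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```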