feat(tpu): add release of optimum tpu 0.2.2
baptistecolle committed Dec 11, 2024
1 parent 7221a2e commit 1f7aabf
Showing 2 changed files with 37 additions and 24 deletions.
11 changes: 3 additions & 8 deletions README.md
@@ -31,12 +31,8 @@ The [Google-Cloud-Containers](https://github.com/huggingface/Google-Cloud-Contai
| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310 | [huggingface-pytorch-training-gpu.2.3.0.transformers.4.42.3.py310](./containers/pytorch/training/gpu/2.3.0/transformers/4.42.3/py310/Dockerfile) | PyTorch | Training | GPU |
| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311 | [huggingface-pytorch-inference-gpu.2.2.2.transformers.4.44.0.py311](./containers/pytorch/inference/gpu/2.2.2/transformers/4.44.0/py311/Dockerfile) | PyTorch | Inference | GPU |
| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cpu.2-2.transformers.4-44.ubuntu2204.py311 | [huggingface-pytorch-inference-cpu.2.2.2.transformers.4.44.0.py311](./containers/pytorch/inference/cpu/2.2.2/transformers/4.44.0/py311/Dockerfile) | PyTorch | Inference | CPU |
| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-tpu.0.2.2.py310 | [huggingface-text-generation-inference-tpu.0.2.2.py310](./containers/tgi/tpu/0.2.2/Dockerfile) | TGI | Inference | TPU |
| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310 | [huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310](./containers/tgi/tpu/0.2.2/Dockerfile) | PyTorch | Training | TPU |

> [!NOTE]
> The listing above only contains the latest version of each of the Hugging Face DLCs; the full listing of the published containers available in Google Cloud can be found either in the [Deep Learning Containers Documentation](https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face), in the [Google Cloud Artifact Registry](https://console.cloud.google.com/artifacts/docker/deeplearning-platform-release/us/gcr.io), or via the `gcloud container images list --repository="us-docker.pkg.dev/deeplearning-platform-release/gcr.io" | grep "huggingface-"` command.
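
For example, a quick sketch of how you could narrow that listing down to only the TGI DLCs; the `grep` pattern used here is an assumption based on the image names shown in the table above:

```bash
# List the Hugging Face DLCs published in Google Cloud's Artifact Registry and
# keep only the Text Generation Inference (TGI) images; the grep pattern is an
# assumption based on the image naming shown in the table above.
gcloud container images list \
  --repository="us-docker.pkg.dev/deeplearning-platform-release/gcr.io" \
  | grep "huggingface-text-generation-inference"
```
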
@@ -53,8 +49,7 @@ The [`examples`](./examples) directory contains examples for using the container
| Vertex AI | [examples/vertex-ai/notebooks/trl-full-sft-fine-tuning-on-vertex-ai](./examples/vertex-ai/notebooks/trl-full-sft-fine-tuning-on-vertex-ai) | Fine-tune Mistral 7B v0.3 with PyTorch Training DLC using SFT on Vertex AI |
| GKE | [examples/gke/trl-full-fine-tuning](./examples/gke/trl-full-fine-tuning) | Fine-tune Gemma 2B with PyTorch Training DLC using SFT on GKE |
| GKE | [examples/gke/trl-lora-fine-tuning](./examples/gke/trl-lora-fine-tuning) | Fine-tune Mistral 7B v0.3 with PyTorch Training DLC using SFT + LoRA on GKE |
| TPU | [gemma-tuning](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/gemma_tuning.ipynb) | Fine-tune Gemma 2B with PyTorch Training DLC using LoRA |

### Inference Examples

50 changes: 34 additions & 16 deletions containers/tgi/README.md
@@ -14,15 +14,19 @@ gcloud container images list --repository="us-docker.pkg.dev/deeplearning-platfo

Below you will find the instructions on how to run and test the TGI containers available within this repository. Note that before proceeding you first need to ensure that Docker is installed, either on your local or remote instance; if not, please follow the instructions on how to install Docker [here](https://docs.docker.com/get-docker/).



### Run

The TGI containers support two different accelerator types: GPU and TPU. Depending on your infrastructure, you'll use different approaches to run the containers.

- **GPU**: To run the Docker container on GPUs, you need to ensure that your hardware is supported (the NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher) and install the NVIDIA Container Toolkit.

To find the supported models and hardware before running the TGI DLC, feel free to check [TGI Documentation](https://huggingface.co/docs/text-generation-inference/supported_models).

To run this DLC, you need to have GPU accelerators available within the instance where you want to run TGI, not only because they are required, but also to get the best performance thanks to the optimized inference CUDA kernels.

Besides that, you also need to define the model to deploy, as well as the generation configuration. For the model selection, you can pick any model from the Hugging Face Hub tagged with `text-generation-inference`, which means it is supported by TGI; to explore all the available models on the Hub, please check [here](https://huggingface.co/models?other=text-generation-inference&sort=trending). Then, to select the best configuration for that model, you can either keep the default values defined by TGI or pick the recommended ones based on your instance specification via the Hugging Face Recommender API for TGI, as follows:

```bash
curl -G https://huggingface.co/api/integrations/tgi/v1/provider/gcp/recommend \
@@ -31,7 +35,24 @@ The TGI containers support two different accelerator types: GPU and TPU. Dependi
    -d "num_gpus=1"
```
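
Most of that request is collapsed in the diff above; purely for illustration, a complete call might look like the sketch below, where the `model_id` and `gpu_memory` values are assumptions chosen to match the sample output that follows rather than the hidden lines of this commit:

```bash
# Hypothetical example: ask the recommender API for a configuration to serve
# google/gemma-7b-it on a single GPU with 24 GB of memory. The extra -d
# parameters are illustrative assumptions, not the collapsed lines above.
curl -G https://huggingface.co/api/integrations/tgi/v1/provider/gcp/recommend \
  -d "model_id=google/gemma-7b-it" \
  -d "gpu_memory=24" \
  -d "num_gpus=1"
```
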

This returns the following output, containing the optimal configuration for deploying and serving that model via TGI:

```json
{
  "model_id": "google/gemma-7b-it",
  "instance": "g2-standard-4",
  "configuration": {
    "model_id": "google/gemma-7b-it",
    "max_batch_prefill_tokens": 4096,
    "max_input_length": 4000,
    "max_total_tokens": 4096,
    "num_shard": 1,
    "quantize": null,
    "estimated_memory_in_gigabytes": 22.77
  }
}
```

Then you are ready to run the container as follows:

```bash
docker run --gpus all -ti --shm-size 1g -p 8080:8080 \
@@ -56,32 +77,29 @@
```
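
Since most of the `docker run` invocation is collapsed in this diff, here is a rough, non-authoritative sketch of how the recommended configuration above could be wired into the container; the image URI placeholder and the environment variable names are assumptions to double-check against the full README:

```bash
# Sketch only: replace <tgi-gpu-dlc-uri> with the actual TGI GPU DLC image from
# the container listing. The environment variables mirror the recommender output
# above and are assumptions, since the original command is collapsed in this diff.
docker run --gpus all -ti --shm-size 1g -p 8080:8080 \
  -e MODEL_ID=google/gemma-7b-it \
  -e NUM_SHARD=1 \
  -e MAX_INPUT_LENGTH=4000 \
  -e MAX_TOTAL_TOKENS=4096 \
  -e MAX_BATCH_PREFILL_TOKENS=4096 \
  -e HF_TOKEN=$HF_TOKEN \
  <tgi-gpu-dlc-uri>
```
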

> [!NOTE]
> Check the [Hugging Face Optimum TPU documentation](https://huggingface.co/docs/optimum-tpu/) for more information on TPU model serving.
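
Below is a minimal sketch of what serving on TPU could look like with the TGI TPU DLC listed in the main README; the `--privileged`, `--net host`, and `--shm-size` flags, as well as the model choice, are assumptions rather than values taken from this commit:

```bash
# Sketch only: serve a model with the TGI TPU DLC on a Cloud TPU VM. The
# --privileged / --net host / --shm-size flags and the model choice are
# assumptions; check the Optimum TPU documentation above for the supported setup.
docker run -ti --rm --privileged --net host --shm-size 16G \
  -e HF_TOKEN=$HF_TOKEN \
  -e MODEL_ID=google/gemma-2b-it \
  us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-tpu.0.2.2.py310
```
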

### Test

Once the Docker container is running, since it has been deployed with `text-generation-launcher`, the API exposes the endpoints listed in the [TGI OpenAPI Specification](https://huggingface.github.io/text-generation-inference/).
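
For example, before sending any generation request you can check that the server is up and the model is loaded by hitting the `/health` endpoint defined in that specification (a minimal sketch):

```bash
# Returns HTTP 200 once the model has been loaded and the server is ready.
curl -i 0.0.0.0:8080/health
```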

In this case, you can test the container by sending a request to the `/v1/chat/completions` endpoint, which matches the OpenAI specification and so is fully compatible with OpenAI clients, as follows:

```bash
# Chat Completions Endpoint
curl 0.0.0.0:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "tgi",
        "messages": [
            {
                "role": "user",
                "content": "What is Deep Learning?"
            }
        ],
@@ -81,13 +80,8 @@ curl 0.0.0.0:8080/v1/chat/completions \
        "stream": true,
        "max_tokens": 128
    }'
```

This will start streaming the completion tokens for the given messages until a stop sequence is generated.

Alternatively, you can use the `/generate` endpoint instead, which expects the input to already be formatted according to the tokenizer requirements. This is more convenient when working with base models that have no pre-defined chat template, or whenever you want to apply a custom chat template yourself, and it can be used as follows:

```bash
curl 0.0.0.0:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
```
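
The request body is collapsed in this commit; for illustration only, a minimal `/generate` payload typically follows the `inputs` / `parameters` schema from the TGI OpenAPI Specification, so a complete call could look like the following sketch (the prompt and parameter values are assumptions, not the hidden lines of this diff):

```bash
# Illustrative payload only; the actual request body in this commit is collapsed above.
curl 0.0.0.0:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "What is Deep Learning?",
        "parameters": {
            "max_new_tokens": 128
        }
    }'
```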
