From 1f7aabf4398d7825bd28476a0a373f542046771b Mon Sep 17 00:00:00 2001
From: Baptiste
Date: Wed, 11 Dec 2024 09:37:53 +0000
Subject: [PATCH] feat(tpu): add release of optimum tpu 0.2.2

---
 README.md                | 11 +++------
 containers/tgi/README.md | 50 +++++++++++++++++++++++++++-------------
 2 files changed, 37 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index b2aa3a43..cee0d5fb 100644
--- a/README.md
+++ b/README.md
@@ -31,12 +31,8 @@ The [Google-Cloud-Containers](https://github.com/huggingface/Google-Cloud-Contai
 | us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310 | [huggingface-pytorch-training-gpu.2.3.0.transformers.4.42.3.py310](./containers/pytorch/training/gpu/2.3.0/transformers/4.42.3/py310/Dockerfile) | PyTorch | Training | GPU |
 | us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311 | [huggingface-pytorch-inference-gpu.2.2.2.transformers.4.44.0.py311](./containers/pytorch/inference/gpu/2.2.2/transformers/4.44.0/py311/Dockerfile) | PyTorch | Inference | GPU |
 | us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cpu.2-2.transformers.4-44.ubuntu2204.py311 | [huggingface-pytorch-inference-cpu.2.2.2.transformers.4.44.0.py311](./containers/pytorch/inference/cpu/2.2.2/transformers/4.44.0/py311/Dockerfile) | PyTorch | Inference | CPU |
-| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-tpu.0.2.2.py310 |
-[huggingface-text-generation-inference-tpu.0.2.2.py310](./containers/tgi/tpu/0.2.2/Dockerfile) |
-| TGI | Inference | TPU |
-| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310 |
-[huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310](./containers/tgi/tpu/0.2.2/Dockerfile) |
-| PyTorch | Training | TPU |
+| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-tpu.0.2.2.py310 | [huggingface-text-generation-inference-tpu.0.2.2.py310](./containers/tgi/tpu/0.2.2/Dockerfile) | TGI | Inference | TPU |
+| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310 | [huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310](./containers/tgi/tpu/0.2.2/Dockerfile) | PyTorch | Training | TPU |
 
 > [!NOTE]
 > The listing above only contains the latest version of each of the Hugging Face DLCs, the full listing of the available published containers in Google Cloud can be found either in the [Deep Learning Containers Documentation](https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face), in the [Google Cloud Artifact Registry](https://console.cloud.google.com/artifacts/docker/deeplearning-platform-release/us/gcr.io) or via the `gcloud container images list --repository="us-docker.pkg.dev/deeplearning-platform-release/gcr.io" | grep "huggingface-"` command.
@@ -53,8 +49,7 @@ The [`examples`](./examples) directory contains examples for using the container
 | Vertex AI | [examples/vertex-ai/notebooks/trl-full-sft-fine-tuning-on-vertex-ai](./examples/vertex-ai/notebooks/trl-full-sft-fine-tuning-on-vertex-ai) | Fine-tune Mistral 7B v0.3 with PyTorch Training DLC using SFT on Vertex AI |
 | GKE | [examples/gke/trl-full-fine-tuning](./examples/gke/trl-full-fine-tuning) | Fine-tune Gemma 2B with PyTorch Training DLC using SFT on GKE |
 | GKE | [examples/gke/trl-lora-fine-tuning](./examples/gke/trl-lora-fine-tuning) | Fine-tune Mistral 7B v0.3 with PyTorch Training DLC using SFT + LoRA on GKE |
-| TPU | [gemma-fine-tuning](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/gemma_tuning.ipynb)
-| Fine-tune Gemma 2B with PyTorch Training DLC using LoRA |
+| TPU | [gemma-tuning](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/gemma_tuning.ipynb) | Fine-tune Gemma 2B with PyTorch Training DLC using LoRA |
 
 ### Inference Examples
 
diff --git a/containers/tgi/README.md b/containers/tgi/README.md
index c266d5c4..e2b048c6 100644
--- a/containers/tgi/README.md
+++ b/containers/tgi/README.md
@@ -14,15 +14,19 @@ gcloud container images list --repository="us-docker.pkg.dev/deeplearning-platfo
 Below you will find the instructions on how to run and test the TGI containers available within this repository. Note that before proceeding you need to first ensure that you have Docker installed either on your local or remote instance, if not, please follow the instructions on how to install Docker [here](https://docs.docker.com/get-docker/).
+
+
 ### Run
 
 The TGI containers support two different accelerator types: GPU and TPU. Depending on your infrastructure, you'll use different approaches to run the containers.
 
-- **GPU**: To run this DLC, you need to have GPU accelerators available within the instance that you want to run TGI, not only because those are required, but also to enable the best performance due to the optimized inference CUDA kernels.
+- **GPU**: To run the Docker container on GPUs, you need to ensure that your hardware is supported (the NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher) and that the NVIDIA Container Toolkit is installed. To find the supported models and hardware before running the TGI DLC, feel free to check the [TGI Documentation](https://huggingface.co/docs/text-generation-inference/supported_models).
 
-  First, you can use the Hugging Face Recommender API to get the optimal configuration:
+  To run this DLC, you need to have GPU accelerators available within the instance where you want to run TGI, not only because they are required, but also to get the best performance thanks to the optimized inference CUDA kernels.
+
+  Besides that, you also need to define the model to deploy, as well as the generation configuration. For the model selection, you can pick any model from the Hugging Face Hub that contains the tag `text-generation-inference`, which means that it is supported by TGI; to explore all the available models within the Hub, please check [here](https://huggingface.co/models?other=text-generation-inference&sort=trending).
+
+  Then, to select the best configuration for that model, you can either keep the default values defined within TGI, or select the recommended ones for your instance specification via the Hugging Face Recommender API for TGI, as follows:
 
   ```bash
   curl -G https://huggingface.co/api/integrations/tgi/v1/provider/gcp/recommend \
@@ -31,7 +35,24 @@ The TGI containers support two different accelerator types: GPU and TPU. Dependi
     -d "num_gpus=1"
   ```
 
-  Then run the container:
+  Which returns the following output containing the optimal configuration for deploying/serving that model via TGI:
+
+  ```json
+  {
+    "model_id": "google/gemma-7b-it",
+    "instance": "g2-standard-4",
+    "configuration": {
+      "model_id": "google/gemma-7b-it",
+      "max_batch_prefill_tokens": 4096,
+      "max_input_length": 4000,
+      "max_total_tokens": 4096,
+      "num_shard": 1,
+      "quantize": null,
+      "estimated_memory_in_gigabytes": 22.77
+    }
+  }
+  ```
+
+  Then you are ready to run the container as follows:
 
   ```bash
   docker run --gpus all -ti --shm-size 1g -p 8080:8080 \
@@ -56,32 +77,29 @@ The TGI containers support two different accelerator types: GPU and TPU. Dependi
   ```
 
 > [!NOTE]
-> TPU support for Text Generation Inference is still evolving. Check the [Hugging Face TPU documentation](https://huggingface.co/docs/optimum-tpu/) for the most up-to-date information on TPU model serving.
+> Check the [Hugging Face Optimum TPU documentation](https://huggingface.co/docs/optimum-tpu/) for more information on TPU model serving.
 
 ### Test
 
-Once the Docker container is running, you can test it by sending requests to the available endpoints.
+Once the Docker container is running, as it has been deployed with `text-generation-launcher`, the API exposes the endpoints listed within the [TGI OpenAPI Specification](https://huggingface.github.io/text-generation-inference/).
 
-For the GPU/TPU container running on localhost, you can use the following curl commands:
+In this case, you can test the container by sending a request to the `/v1/chat/completions` endpoint (which matches the OpenAI specification and so is fully compatible with OpenAI clients) as follows:
 
 ```bash
-# Chat Completions Endpoint
 curl 0.0.0.0:8080/v1/chat/completions \
     -X POST \
     -H 'Content-Type: application/json' \
     -d '{
         "model": "tgi",
         "messages": [
             {
                 "role": "user",
                 "content": "What is Deep Learning?"
            }
         ],
@@ -81,13 +80,8 @@ curl 0.0.0.0:8080/v1/chat/completions \
         "stream": true,
         "max_tokens": 128
     }'
+```
+
+Which will start streaming the completion tokens for the given messages until the stop sequences are generated.
 
-# Generate Endpoint
+Alternatively, you can also use the `/generate` endpoint, which expects the inputs to already be formatted according to the tokenizer requirements; this is more convenient when working with base models without a pre-defined chat template, or whenever you want to apply a custom chat template yourself, and can be used as follows:
+
+```bash
 curl 0.0.0.0:8080/generate \
     -X POST \
     -H 'Content-Type: application/json' \
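
For reference, a minimal sketch of what a complete `/generate` request typically looks like against a TGI container listening on port 8080, as in the examples above; the prompt and the `parameters` values below are illustrative and not taken from the patch:

```bash
# Hypothetical /generate request; the prompt and parameter values are
# illustrative examples, not values from the patch above.
curl 0.0.0.0:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "What is Deep Learning?",
        "parameters": {
            "max_new_tokens": 128,
            "do_sample": true,
            "temperature": 0.7
        }
    }'
```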