Add examples/cloud-run on preview (#82)
* Update `examples/README.md` and add `examples/cloud-run/README.md`

* Add `examples/cloud-run/tgi-deployment`

* Update `examples/cloud-run/README.md`

* Update `README.md`

* Update `README.md`

Co-authored-by: Philipp Schmid <[email protected]>

* Apply suggestions from code review

- Increase max instances from 5 to 7 (noting that this value is subject to
change)
- Explain the default auth for Cloud Run, and mention that only
developer use-cases are covered within this example
- Add a note with alternatives for handling auth when exposing the
services
- Add the references used at the end

Co-authored-by: Frank He <[email protected]>

* Update `examples/cloud-run/tgi-deployment/README.md`

* Set `max-instances=3` to prevent downtime during infra migrations

Co-authored-by: Steren <[email protected]>

* Set `--concurrency` and `--max-concurrent-requests` to 64

Value determined after running `text-generation-benchmark` with
different batch sizes under the default settings, on the same
instance/node type on Google Kubernetes Engine (GKE), since GKE allows
the SSH tunneling required to run `text-generation-benchmark` within
the host instance
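Put together, the values settled on in this commit could translate into a deployment command along the following lines. This is only a sketch: the service name, image URI, model ID, region, and resource sizes are illustrative assumptions, not values taken from this commit.

```shell
# Hypothetical sketch of deploying the TGI DLC on Cloud Run with the values
# from this commit (--max-instances=3, --concurrency=64, and TGI's own
# --max-concurrent-requests=64). Image URI, model ID, region, and resource
# sizes below are illustrative assumptions.
gcloud beta run deploy tgi-deployment \
    --image=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121 \
    --args="--model-id=meta-llama/Meta-Llama-3.1-8B-Instruct,--max-concurrent-requests=64" \
    --port=8080 \
    --cpu=8 \
    --memory=32Gi \
    --gpu=1 \
    --gpu-type=nvidia-l4 \
    --region=us-central1 \
    --no-allow-unauthenticated \
    --max-instances=3 \
    --concurrency=64
```

Note that `--concurrency` caps how many requests Cloud Run routes to a single instance, while TGI's `--max-concurrent-requests` caps what the server itself accepts; keeping the two aligned avoids requests queueing at one layer while the other sits idle.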

---------

Co-authored-by: Philipp Schmid <[email protected]>
Co-authored-by: Frank He <[email protected]>
Co-authored-by: Steren <[email protected]>
4 people authored Sep 16, 2024
1 parent e6b57bf commit 6e8682c
Showing 6 changed files with 382 additions and 2 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -2,7 +2,7 @@

<img alt="Hugging Face x Google Cloud" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/Google-Cloud-Containers/thumbnail.png" />

[Hugging Face Deep Learning Containers for Google Cloud](https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face) are a set of Docker images for training and deploying Transformers, Sentence Transformers, and Diffusers models on Google Cloud Vertex AI and Google Kubernetes Engine (GKE).
[Hugging Face Deep Learning Containers for Google Cloud](https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face) are a set of Docker images for training and deploying Transformers, Sentence Transformers, and Diffusers models on Google Cloud Vertex AI, Google Kubernetes Engine (GKE), and Google Cloud Run.

The [Google-Cloud-Containers](https://github.com/huggingface/Google-Cloud-Containers/tree/main) repository contains the container files for building Hugging Face-specific Deep Learning Containers (DLCs), as well as examples on how to train and deploy models on Google Cloud. The containers are publicly maintained, updated, and released periodically by Hugging Face and the Google Cloud team, and are available to all Google Cloud customers within [Google Cloud's Artifact Registry](https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face). Containers are created for each supported combination of use-case (training, inference), accelerator type (CPU, GPU, TPU), and framework (PyTorch, TGI, TEI). Those include:

@@ -63,3 +63,4 @@ The [`examples`](./examples) directory contains examples for using the container
| Vertex AI | [deploy-gemma-from-gcs-on-vertex-ai](./examples/vertex-ai/notebooks/deploy-gemma-from-gcs-on-vertex-ai) | Deploying Gemma 7B Instruct with Text Generation Inference (TGI) from a GCS Bucket on Vertex AI. |
| Vertex AI | [deploy-flux-on-vertex-ai](./examples/vertex-ai/notebooks/deploy-flux-on-vertex-ai) | Deploying FLUX with Hugging Face PyTorch DLCs for Inference on Vertex AI. |
| Vertex AI | [deploy-llama-3-1-405b-on-vertex-ai](./examples/vertex-ai/notebooks/deploy-llama-405b-on-vertex-ai/vertex-notebook.ipynb) | Deploying Meta Llama 3.1 405B in FP8 with Hugging Face DLC for TGI on Vertex AI. |
| Cloud Run | [tgi-deployment](./examples/cloud-run/tgi-deployment/README.md) | Deploying Meta Llama 3.1 8B with Text Generation Inference on Cloud Run. |
9 changes: 8 additions & 1 deletion examples/README.md
@@ -2,4 +2,11 @@

This directory contains some usage examples for the Hugging Face Deep Learning Containers (DLCs) available in Google Cloud, as published from the [containers directory](../containers).

The examples' structure is organized based on the Google Cloud service we can use to deploy the containers, being Google Kubernetes Engine (GKE) on one end, and Vertex AI on the other.
The examples are organized by the Google Cloud service used to deploy the containers:

- [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- [Vertex AI](https://cloud.google.com/vertex-ai)
- (Preview) [Cloud Run](https://cloud.google.com/run)

> [!WARNING]
> Cloud Run now offers on-demand access to NVIDIA L4 GPUs for running AI inference workloads, but it is still in preview, so the Cloud Run examples in this repository should be used solely for testing and experimentation; please avoid using them for production workloads. We are actively working towards general availability and appreciate your understanding.
12 changes: 12 additions & 0 deletions examples/cloud-run/README.md
@@ -0,0 +1,12 @@
# (Preview) Cloud Run Examples

This directory contains usage examples of the Hugging Face Deep Learning Containers (DLCs) on Cloud Run, currently covering inference only, with a focus on Large Language Models (LLMs).

> [!WARNING]
> Cloud Run now offers on-demand access to NVIDIA L4 GPUs for running AI inference workloads, but it is still in preview, so the Cloud Run examples in this repository should be used solely for testing and experimentation; please avoid using them for production workloads. We are actively working towards general availability and appreciate your understanding.

## Inference Examples

| Example | Description |
| ---------------------------------- | ------------------------------------------------------------------------ |
| [tgi-deployment](./tgi-deployment) | Deploying Meta Llama 3.1 8B with Text Generation Inference on Cloud Run. |
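Since the example above is deployed without public access (Cloud Run's default auth requires an authenticated caller), one way to try the service as a developer is through the Cloud Run proxy. The sketch below assumes a service named `tgi-deployment` in `us-central1`; both names are illustrative assumptions.

```shell
# Hypothetical sketch: reach a private TGI service on Cloud Run through the
# Cloud Run proxy, which forwards localhost traffic with your own credentials.
# Service name and region are illustrative assumptions.
gcloud run services proxy tgi-deployment --region=us-central1 --port=8080 &

# TGI exposes an OpenAI-compatible Messages API at /v1/chat/completions.
curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is Cloud Run?"}],
        "max_tokens": 128
    }'
```

The proxy approach keeps the service private while still allowing local testing; for service-to-service calls, an ID token in the `Authorization` header is the usual alternative.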
