Add Gemma2 9B on Cloud Run example #113

Merged
merged 14 commits on Oct 29, 2024
README.md (2 additions, 1 deletion)
@@ -67,7 +67,8 @@ The [`examples`](./examples) directory contains examples for using the container
| GKE | [examples/gke/tgi-deployment](./examples/gke/tgi-deployment) | Deploy Meta Llama 3 8B with TGI DLC on GKE |
| GKE | [examples/gke/tgi-from-gcs-deployment](./examples/gke/tgi-from-gcs-deployment) | Deploy Qwen2 7B with TGI DLC from GCS on GKE |
| GKE | [examples/gke/tei-deployment](./examples/gke/tei-deployment) | Deploy Snowflake's Arctic Embed with TEI DLC on GKE |
| Cloud Run | [examples/cloud-run/tgi-deployment](./examples/cloud-run/tgi-deployment) | Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run |
| Cloud Run | [examples/cloud-run/deploy-gemma-2-on-cloud-run](./examples/cloud-run/deploy-gemma-2-on-cloud-run) | Deploy Gemma2 9B with TGI DLC on Cloud Run |
| Cloud Run | [examples/cloud-run/deploy-llama-3-1-on-cloud-run](./examples/cloud-run/deploy-llama-3-1-on-cloud-run) | Deploy Llama 3.1 8B with TGI DLC on Cloud Run |

### Evaluation

docs/source/resources.mdx (2 additions, 1 deletion)
@@ -66,4 +66,5 @@ Learn how to use Hugging Face in Google Cloud by reading our blog posts, presentations

- Inference

- [Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/cloud-run/tgi-deployment)
- [Deploy Gemma2 9B with TGI DLC on Cloud Run](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/cloud-run/deploy-gemma-2-on-cloud-run)
- [Deploy Llama 3.1 8B with TGI DLC on Cloud Run](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/cloud-run/deploy-llama-3-1-on-cloud-run)
examples/cloud-run/README.md (9 additions, 3 deletions)
@@ -7,6 +7,12 @@ This directory contains usage examples of the Hugging Face Deep Learning Containers

## Inference Examples

| Example | Title |
| ---------------------------------- | ----------------------------------------------- |
| [tgi-deployment](./tgi-deployment) | Deploy Meta Llama 3.1 with TGI DLC on Cloud Run |
| Example | Title |
| ---------------------------------------------------------------- | --------------------------------------------- |
| [deploy-gemma-2-on-cloud-run](./deploy-gemma-2-on-cloud-run) | Deploy Gemma2 9B with TGI DLC on Cloud Run |
| [deploy-llama-3-1-on-cloud-run](./deploy-llama-3-1-on-cloud-run) | Deploy Llama 3.1 8B with TGI DLC on Cloud Run |

## Training Examples

Coming soon!

examples/cloud-run/deploy-gemma-2-on-cloud-run/README.md (382 additions, 0 deletions)

Large diffs are not rendered by default.

examples/cloud-run/deploy-llama-3-1-on-cloud-run/README.md (renamed from examples/cloud-run/tgi-deployment/README.md)
@@ -1,13 +1,13 @@
---
title: Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run
title: Deploy Llama 3.1 8B with TGI DLC on Cloud Run
type: inference
---

# Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run
# Deploy Llama 3.1 8B with TGI DLC on Cloud Run

Meta Llama 3.1 is the latest open LLM from Meta, released in July 2024. Meta Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPU, 70B for large-scale AI native applications, and 405B for synthetic data, LLM as a Judge or distillation; among other use cases. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. Google Cloud Run is a serverless container platform that allows developers to deploy and manage containerized applications without managing infrastructure, enabling automatic scaling and billing only for usage.
Llama 3.1 is the latest open LLM from Meta, released in July 2024. Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPU, 70B for large-scale AI native applications, and 405B for synthetic data, LLM as a Judge or distillation; among other use cases. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. Google Cloud Run is a serverless container platform that allows developers to deploy and manage containerized applications without managing infrastructure, enabling automatic scaling and billing only for usage.

This example showcases how to deploy an LLM from the Hugging Face Hub, in this case Meta Llama 3.1 8B Instruct model quantized to INT4 using AWQ, with the Hugging Face DLC for TGI on Google Cloud Run with GPU support ([in preview](https://cloud.google.com/products#product-launch-stages)).
This example showcases how to deploy an LLM from the Hugging Face Hub, in this case Llama 3.1 8B Instruct model quantized to INT4 using AWQ, with the Hugging Face DLC for TGI on Google Cloud Run with GPU support ([in preview](https://cloud.google.com/products#product-launch-stages)).

> [!NOTE]
> GPU support on Cloud Run is only available as a waitlisted public preview. If you're interested in trying out the feature, [request a quota increase](https://cloud.google.com/run/quotas#increase) for `Total Nvidia L4 GPU allocation, per project per region`. At the time of writing this example, NVIDIA L4 GPUs (24GiB VRAM) are the only available GPUs on Cloud Run; enabling automatic scaling up to 7 instances by default (more available via quota), as well as scaling down to zero instances when there are no requests.
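
For orientation, the whole deployment described here reduces to a single `gcloud beta run deploy` invocation. The following is a minimal sketch, not the exact command from this example: the service name, region, resource sizes, image URI placeholder, and TGI arguments (including the model ID, which follows the INT4 AWQ quantization mentioned above) are all illustrative assumptions, and the GPU flags require the `beta` track of `gcloud`.

```bash
# Hedged sketch only: <TGI_DLC_IMAGE_URI> stands in for the actual Hugging Face
# DLC image, and the service name, region, and resource sizes are illustrative.
gcloud beta run deploy llama-3-1-tgi \
  --image=<TGI_DLC_IMAGE_URI> \
  --args="--model-id=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4,--quantize=awq" \
  --port=8080 \
  --cpu=8 \
  --memory=32Gi \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --region=us-central1 \
  --no-allow-unauthenticated
```

Keeping the service private with `--no-allow-unauthenticated` is what makes the Service Account setup below necessary.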
@@ -216,7 +216,7 @@ The recommended approach is to use a Service Account (SA), as the access can be
- Set the `SERVICE_ACCOUNT_NAME` environment variable for convenience:

```bash
export SERVICE_ACCOUNT_NAME=text-generation-inference-invoker
export SERVICE_ACCOUNT_NAME=tgi-invoker
```

- Create the Service Account:
@@ -241,7 +241,7 @@ The recommended approach is to use a Service Account (SA), as the access can be
```

> [!WARNING]
> The access token is short-lived and will expire, by default after 1 hour. If you want to extend the token lifetime beyond the default, you must create and organization policy and use the `--lifetime` argument when createing the token. Refer to (Access token lifetime)[[https://cloud.google.com/resource-manager/docs/organization-policy/restricting-service-accounts#extend_oauth_ttl]] to learn more. Otherwise, you can also generate a new token by running the same command again.
> The access token is short-lived and will expire after 1 hour by default. If you want to extend the token lifetime beyond the default, you must create an organization policy and use the `--lifetime` argument when creating the token. Refer to [Access token lifetime](https://cloud.google.com/resource-manager/docs/organization-policy/restricting-service-accounts#extend_oauth_ttl) to learn more. Otherwise, you can also generate a new token by running the same command again.
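
For reference, one common way to mint such a token is Service Account impersonation, as in this minimal sketch (assuming `PROJECT_ID` is already exported; the actual example may generate the token differently):

```bash
# Hedged sketch: impersonates the invoker SA created above and stores a
# short-lived (default 1 hour) access token in ACCESS_TOKEN.
export ACCESS_TOKEN=$(gcloud auth print-access-token \
  --impersonate-service-account=$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com)
```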

Now you can dive into the different alternatives for sending requests to the deployed Cloud Run Service using the `SERVICE_URL` and `ACCESS_TOKEN` as described above.
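
As an illustration, a chat request via TGI's OpenAI-compatible Messages API could look like the following sketch (assuming the deployed TGI version exposes the `/v1/chat/completions` route; `model` is conventionally set to `tgi`):

```bash
# Hedged sketch: sends an authenticated chat completion request to the
# Cloud Run service URL obtained earlier.
curl $SERVICE_URL/v1/chat/completions \
  -X POST \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is Cloud Run?"}],
    "max_tokens": 128
  }'
```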
