Merge branch 'main' into update-readme-files
alvarobartt committed Sep 16, 2024
2 parents a2da60c + 6a1cb59 commit f12f7be
Showing 17 changed files with 69 additions and 63 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -1,6 +1,6 @@
# 🤗 Hugging Face Deep Learning Containers for Google Cloud

<img alt="Hugging Face x Google Cloud" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/Google-Cloud-Containers/thumbnail.png" />
<img alt="Hugging Face x Google Cloud" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/thumbnail.png" />

[Hugging Face Deep Learning Containers for Google Cloud](https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face) are a set of Docker images for training and deploying Transformers, Sentence Transformers, and Diffusers models on Google Cloud Vertex AI, Google Kubernetes Engine (GKE), and Google Cloud Run.

@@ -63,4 +63,4 @@ The [`examples`](./examples) directory contains examples for using the container
| Vertex AI | [deploy-gemma-from-gcs-on-vertex-ai](./examples/vertex-ai/notebooks/deploy-gemma-from-gcs-on-vertex-ai) | Deploying Gemma 7B Instruct with Text Generation Inference (TGI) from a GCS Bucket on Vertex AI. |
| Vertex AI | [deploy-flux-on-vertex-ai](./examples/vertex-ai/notebooks/deploy-flux-on-vertex-ai) | Deploying FLUX with Hugging Face PyTorch DLCs for Inference on Vertex AI. |
| Vertex AI | [deploy-llama-3-1-405b-on-vertex-ai](./examples/vertex-ai/notebooks/deploy-llama-405b-on-vertex-ai/vertex-notebook.ipynb) | Deploying Meta Llama 3.1 405B in FP8 with Hugging Face DLC for TGI on Vertex AI. |
| Cloud Run | [tgi-deployment](./examples/cloud-run/tgi-deployment/README.md) | Deploying Meta Llama 3.1 8B with Text Generation Inference on Cloud Run. |
| Cloud Run | [tgi-deployment](./examples/cloud-run/tgi-deployment/README.md) | Deploying Meta Llama 3.1 8B with Text Generation Inference on Cloud Run. |
2 changes: 2 additions & 0 deletions docs/source/index.mdx
@@ -6,6 +6,8 @@ Hugging Face collaborates with Google across open science, open source, cloud, a

Hugging Face enables new experiences for Google Cloud customers. They can easily train and deploy Hugging Face models on Google Kubernetes Engine (GKE) and Vertex AI, on any hardware available in Google Cloud using Hugging Face Deep Learning Containers (DLCs).

If you have any issues using Hugging Face on Google Cloud, you can get community support by creating a new topic in the [Forum](https://discuss.huggingface.co/c/google-cloud/69/l/latest) dedicated to Google Cloud usage.

## Train and Deploy Models on Google Cloud with Hugging Face Deep Learning Containers

Hugging Face built Deep Learning Containers (DLCs) for Google Cloud customers to run any of their machine learning workloads in an optimized environment, with no configuration or maintenance on their part. These are Docker images pre-installed with deep learning frameworks and libraries such as 🤗 Transformers, 🤗 Datasets, and 🤗 Tokenizers. The DLCs allow you to directly serve and train any models, skipping the complicated process of building and optimizing your serving and training environments from scratch.
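
As an illustration, a DLC can be pulled locally with Docker to verify access and inspect its configuration; this is an optional sanity check, and the TEI CPU image URI below is taken from the GKE examples in this repository (any other published DLC URI can be substituted):

```bash
# Optional check: pull a Hugging Face DLC locally and inspect its image metadata.
# The URI below points at the TEI CPU DLC; swap it for any other Hugging Face DLC.
docker pull us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-4:latest
docker inspect us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-4:latest
```
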
10 changes: 6 additions & 4 deletions examples/gke/tei-deployment/README.md
@@ -1,6 +1,8 @@
# Deploy Snowflake's Arctic Embed (M) with Text Embeddings Inference (TEI) on GKE

Snowflake's Arctic Embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance, achieving state-of-the-art (SOTA) performance on the MTEB/BEIR leaderboard for each of their size variants. Text Embeddings Inference (TEI) is a toolkit developed by Hugging Face for deploying and serving open source text embeddings and sequence classification models; enabling high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure. This post explains how to deploy a text embedding model from the Hugging Face Hub on a GKE Cluster running a purpose-built container to deploy text embedding models in a secure and managed environment with the Hugging Face DLC for TEI.
Snowflake's Arctic Embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance, achieving state-of-the-art (SOTA) performance on the MTEB/BEIR leaderboard for each of their size variants. Text Embeddings Inference (TEI) is a toolkit developed by Hugging Face for deploying and serving open source text embeddings and sequence classification models; enabling high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure.

This example showcases how to deploy a text embedding model from the Hugging Face Hub on a GKE Cluster running a purpose-built container to deploy text embedding models in a secure and managed environment with the Hugging Face DLC for TEI.

## Setup / Configuration

@@ -47,7 +49,7 @@ gcloud components install gke-gcloud-auth-plugin
Once everything is set up, you can proceed with the creation of the GKE Cluster and the node pool, which in this case will be a single CPU node, since CPU inference is enough to serve most text embedding models, even though they could benefit a lot from GPU serving.

> [!NOTE]
> CPU is being used to run the inference on top of the text embeddings models to showcase the current capabilities of TEI, but switching to GPU is as easy as replacing `spec.containers[0].image` with `us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204`, and then updating the requested resources, as well as the `nodeSelector` requirements in the `deployment.yaml` file. For more information, please refer to the [`gpu-config`](./gpu-config/) directory that contains a pre-defined configuration for GPU serving in TEI with an NVIDIA Tesla T4 GPU (with a compute capability of 7.5 i.e. natively supported in TEI).
> CPU is being used to run the inference on top of the text embeddings models to showcase the current capabilities of TEI, but switching to GPU is as easy as replacing `spec.containers[0].image` with `us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204`, and then updating the requested resources, as well as the `nodeSelector` requirements in the `deployment.yaml` file. For more information, please refer to the [`gpu-config`](./gpu-config/) directory that contains a pre-defined configuration for GPU serving in TEI with an NVIDIA Tesla T4 GPU (with a compute capability of 7.5 i.e. natively supported in TEI).
To deploy the GKE Cluster, the "Autopilot" mode will be used as it is the recommended one for most workloads, since the underlying infrastructure is managed by Google. Alternatively, you can also use the "Standard" mode.
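
A minimal sketch of the cluster creation, assuming `gcloud` is already authenticated; the cluster name and region below are placeholders and the exact flags used in the full example may differ:

```bash
# Create a GKE Autopilot cluster (name and region are illustrative placeholders)
gcloud container clusters create-auto tei-cluster \
    --project=$PROJECT_ID \
    --region=us-central1
```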

@@ -93,7 +95,7 @@ Now you can proceed to the Kubernetes deployment of the Hugging Face DLC for TEI
The Hugging Face DLC for TEI will be deployed via `kubectl`, from the configuration files in either the `cpu-config/` or the `gpu-config/` directories depending on whether you want to use the CPU or GPU accelerators, respectively:

- `deployment.yaml`: contains the deployment details of the pod including the reference to the Hugging Face DLC for TEI setting the `MODEL_ID` to [`Snowflake/snowflake-arctic-embed-m`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m).
- `service.yaml`: contains the service details of the pod, exposing the port 80 for the TEI service.
- `service.yaml`: contains the service details of the pod, exposing the port 8080 for the TEI service.
- (optional) `ingress.yaml`: contains the ingress details of the pod, exposing the service to the external world so that it can be accessed via the ingress IP.
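
A minimal sketch of how these manifests could be applied, assuming `kubectl` is already pointing at the GKE Cluster; the exact commands in the full example may differ:

```bash
# Apply the CPU configuration (swap cpu-config/ for gpu-config/ to serve on GPU)
kubectl apply -f cpu-config/
# Wait until the pod reports Running before sending requests
kubectl get pods
```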

@@ -156,7 +158,7 @@ curl http://localhost:8080/embed \
Or send a POST request to the ingress IP instead:

```bash
curl http://$(kubectl get ingress tei-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')/embed \
curl http://$(kubectl get ingress tei-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):8080/embed \
-X POST \
-d '{"inputs":"What is Deep Learning?"}' \
-H 'Content-Type: application/json'
```
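
Note that the `localhost` call above assumes the TEI service has already been forwarded to the local machine; a minimal sketch (the service name matches the one defined in `service.yaml`, and a `--namespace` flag may be needed depending on how the resources were applied):

```bash
# Forward the TEI service port to localhost so the /embed endpoint can be reached locally
kubectl port-forward service/tei-service 8080:8080
```
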
6 changes: 3 additions & 3 deletions examples/gke/tei-deployment/cpu-config/deployment.yaml
@@ -16,7 +16,7 @@ spec:
spec:
containers:
- name: tei-container
image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-2:latest
image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-4:latest
resources:
requests:
cpu: "8"
@@ -38,5 +38,5 @@ spec:
- name: data
emptyDir: {}
nodeSelector:
cloud.google.com/compute-class: "Performance"
cloud.google.com/machine-family: "c2"
cloud.google.com/compute-class: "Performance"
cloud.google.com/machine-family: "c2"
16 changes: 8 additions & 8 deletions examples/gke/tei-deployment/cpu-config/ingress.yaml
@@ -7,11 +7,11 @@ metadata:
spec:
rules:
- http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: tei-service
port:
number: 8080
paths:
- path: /
pathType: Prefix
backend:
service:
name: tei-service
port:
number: 8080
6 changes: 3 additions & 3 deletions examples/gke/tei-deployment/cpu-config/service.yaml
@@ -7,6 +7,6 @@ spec:
app: tei-server
type: ClusterIP
ports:
- protocol: TCP
port: 8080
targetPort: 8080
- protocol: TCP
port: 8080
targetPort: 8080
2 changes: 1 addition & 1 deletion examples/gke/tei-deployment/gpu-config/deployment.yaml
@@ -16,7 +16,7 @@ spec:
spec:
containers:
- name: tei-container
image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204:latest
image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204:latest
resources:
requests:
nvidia.com/gpu: 1
16 changes: 8 additions & 8 deletions examples/gke/tei-deployment/gpu-config/ingress.yaml
@@ -7,11 +7,11 @@ metadata:
spec:
rules:
- http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: tei-service
port:
number: 8080
paths:
- path: /
pathType: Prefix
backend:
service:
name: tei-service
port:
number: 8080
6 changes: 3 additions & 3 deletions examples/gke/tei-deployment/gpu-config/service.yaml
@@ -7,6 +7,6 @@ spec:
app: tei-server
type: ClusterIP
ports:
- protocol: TCP
port: 8080
targetPort: 8080
- protocol: TCP
port: 8080
targetPort: 8080
6 changes: 4 additions & 2 deletions examples/gke/tei-from-gcs-deployment/README.md
@@ -1,6 +1,8 @@
# Deploy BGE Base v1.5 (English) with Text Embeddings Inference (TEI) from a GCS Bucket on GKE

BGE, standing for BAAI General Embedding, is a collection of embedding models released by BAAI, which is an English base model for general embedding tasks ranked in the MTEB Leaderboard. Text Embeddings Inference (TEI) is a toolkit developed by Hugging Face for deploying and serving open source text embeddings and sequence classification models; enabling high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure. This post explains how to deploy a text embedding model from a Google Cloud Storage (GCS) Bucket on a GKE Cluster running a purpose-built container to deploy text embedding models in a secure and managed environment with the Hugging Face DLC for TEI.
BGE, standing for BAAI General Embedding, is a collection of embedding models released by BAAI, which is an English base model for general embedding tasks ranked in the MTEB Leaderboard. Text Embeddings Inference (TEI) is a toolkit developed by Hugging Face for deploying and serving open source text embeddings and sequence classification models; enabling high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure.

This example showcases how to deploy a text embedding model from a Google Cloud Storage (GCS) Bucket on a GKE Cluster running a purpose-built container to deploy text embedding models in a secure and managed environment with the Hugging Face DLC for TEI.
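
Before the deployment, the model weights need to be available in a GCS Bucket; a minimal sketch, assuming the `huggingface-cli` and `gcloud` CLIs are installed and using a placeholder bucket name:

```bash
# Download BGE Base v1.5 (English) from the Hugging Face Hub
huggingface-cli download BAAI/bge-base-en-v1.5 --local-dir bge-base-en-v1.5
# Create a bucket (placeholder name) and upload the model weights into it
gcloud storage buckets create gs://your-models-bucket --location=us-central1
gcloud storage cp -r bge-base-en-v1.5 gs://your-models-bucket/
```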

## Setup / Configuration

@@ -48,7 +50,7 @@ gcloud components install gke-gcloud-auth-plugin
Once everything is set up, you can proceed with the creation of the GKE Cluster and the node pool, which in this case will be a single CPU node, since CPU inference is enough to serve most text embedding models, even though they could benefit a lot from GPU serving.

> [!NOTE]
> CPU is being used to run the inference on top of the text embeddings models to showcase the current capabilities of TEI, but switching to GPU is as easy as replacing `spec.containers[0].image` with `us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204`, and then updating the requested resources, as well as the `nodeSelector` requirements in the `deployment.yaml` file. For more information, please refer to the [`gpu-config`](./gpu-config/) directory that contains a pre-defined configuration for GPU serving in TEI with an NVIDIA Tesla T4 GPU (with a compute capability of 7.5 i.e. natively supported in TEI).
> CPU is being used to run the inference on top of the text embeddings models to showcase the current capabilities of TEI, but switching to GPU is as easy as replacing `spec.containers[0].image` with `us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204`, and then updating the requested resources, as well as the `nodeSelector` requirements in the `deployment.yaml` file. For more information, please refer to the [`gpu-config`](./gpu-config/) directory that contains a pre-defined configuration for GPU serving in TEI with an NVIDIA Tesla T4 GPU (with a compute capability of 7.5 i.e. natively supported in TEI).
To deploy the GKE Cluster, the "Autopilot" mode will be used as it is the recommended one for most workloads, since the underlying infrastructure is managed by Google. Alternatively, you can also use the "Standard" mode.
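
Once the cluster is up, its credentials need to be fetched so that `kubectl` (via the `gke-gcloud-auth-plugin` installed above) can reach it; a minimal sketch, where the cluster name and region are placeholders:

```bash
# Configure kubectl credentials for the cluster (name and region are illustrative)
gcloud container clusters get-credentials tei-cluster --region=us-central1
kubectl get nodes
```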

Original file line number Diff line number Diff line change
@@ -33,7 +33,7 @@ spec:
cpu: 8.0
containers:
- name: tei-container
image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-2:latest
image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-4:latest
resources:
requests:
cpu: "8"
@@ -58,11 +58,11 @@ spec:
ephemeral:
volumeClaimTemplate:
spec:
accessModes: [ "ReadWriteOnce" ]
accessModes: ["ReadWriteOnce"]
storageClassName: ssd
resources:
requests:
storage: 48Gi
nodeSelector:
cloud.google.com/compute-class: "Performance"
cloud.google.com/machine-family: "c2"
cloud.google.com/compute-class: "Performance"
cloud.google.com/machine-family: "c2"
16 changes: 8 additions & 8 deletions examples/gke/tei-from-gcs-deployment/cpu-config/ingress.yaml
@@ -8,11 +8,11 @@ metadata:
spec:
rules:
- http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: tei-service
port:
number: 8080
paths:
- path: /
pathType: Prefix
backend:
service:
name: tei-service
port:
number: 8080
6 changes: 3 additions & 3 deletions examples/gke/tei-from-gcs-deployment/cpu-config/service.yaml
@@ -8,6 +8,6 @@ spec:
app: tei-server
type: ClusterIP
ports:
- protocol: TCP
port: 8080
targetPort: 8080
- protocol: TCP
port: 8080
targetPort: 8080
Original file line number Diff line number Diff line change
@@ -33,7 +33,7 @@ spec:
cpu: 8.0
containers:
- name: tei-container
image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204:latest
image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204:latest
resources:
requests:
nvidia.com/gpu: 1
@@ -60,7 +60,7 @@ spec:
ephemeral:
volumeClaimTemplate:
spec:
accessModes: [ "ReadWriteOnce" ]
accessModes: ["ReadWriteOnce"]
storageClassName: ssd
resources:
requests:
16 changes: 8 additions & 8 deletions examples/gke/tei-from-gcs-deployment/gpu-config/ingress.yaml
@@ -8,11 +8,11 @@ metadata:
spec:
rules:
- http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: tei-service
port:
number: 8080
paths:
- path: /
pathType: Prefix
backend:
service:
name: tei-service
port:
number: 8080
6 changes: 3 additions & 3 deletions examples/gke/tei-from-gcs-deployment/gpu-config/service.yaml
@@ -8,6 +8,6 @@ spec:
app: tei-server
type: ClusterIP
ports:
- protocol: TCP
port: 8080
targetPort: 8080
- protocol: TCP
port: 8080
targetPort: 8080
Original file line number Diff line number Diff line change
@@ -66,7 +66,7 @@
"source": [
"%env PROJECT_ID=your-project-id\n",
"%env LOCATION=your-location\n",
"%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204"
"%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204"
]
},
{
