diff --git a/README.md b/README.md
index 9d86fd62..08079d8d 100644
--- a/README.md
+++ b/README.md
@@ -27,8 +27,8 @@ The [Google-Cloud-Containers](https://github.com/huggingface/Google-Cloud-Contai
 | Container URI | Path | Framework | Type | Accelerator |
 | --------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | --------- | --------- | ----------- |
 | us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310 | [text-generation-inference-gpu.2.2.0](./containers/tgi/gpu/2.2.0/Dockerfile) | TGI | Inference | GPU |
-| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204 | [text-embeddings-inference-gpu.1.2.0](./containers/tei/gpu/1.2.0/Dockerfile) | TEI | Inference | GPU |
-| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-2 | [text-embeddings-inference-cpu.1.2.0](./containers/tei/cpu/1.2.0/Dockerfile) | TEI | Inference | CPU |
+| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204 | [text-embeddings-inference-gpu.1.4.0](./containers/tei/gpu/1.4.0/Dockerfile) | TEI | Inference | GPU |
+| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-4 | [text-embeddings-inference-cpu.1.4.0](./containers/tei/cpu/1.4.0/Dockerfile) | TEI | Inference | CPU |
 | us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310 | [huggingface-pytorch-training-gpu.2.3.0.transformers.4.42.3.py310](./containers/pytorch/training/gpu/2.3.0/transformers/4.42.3/py310/Dockerfile) | PyTorch | Training | GPU |
 | us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311 | [huggingface-pytorch-inference-gpu.2.2.2.transformers.4.44.0.py311](./containers/pytorch/inference/gpu/2.2.2/transformers/4.44.0/py311/Dockerfile) | PyTorch | Inference | GPU |
 | us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cpu.2-2.transformers.4-44.ubuntu2204.py311 | [huggingface-pytorch-inference-cpu.2.2.2.transformers.4.44.0.py311](./containers/pytorch/inference/cpu/2.2.2/transformers/4.44.0/py311/Dockerfile) | PyTorch | Inference | CPU |
diff --git a/examples/gke/tei-deployment/README.md b/examples/gke/tei-deployment/README.md
index 3ba926e6..162e4a37 100644
--- a/examples/gke/tei-deployment/README.md
+++ b/examples/gke/tei-deployment/README.md
@@ -1,6 +1,8 @@
 # Deploy Snowflake's Arctic Embed (M) with Text Embeddings Inference (TEI) on GKE
 
-Snowflake's Arctic Embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance, achieving state-of-the-art (SOTA) performance on the MTEB/BEIR leaderboard for each of their size variants. Text Embeddings Inference (TEI) is a toolkit developed by Hugging Face for deploying and serving open source text embeddings and sequence classification models; enabling high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure. This post explains how to deploy a text embedding model from the Hugging Face Hub on a GKE Cluster running a purpose-built container to deploy text embedding models in a secure and managed environment with the Hugging Face DLC for TEI.
+Snowflake's Arctic Embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance, achieving state-of-the-art (SOTA) performance on the MTEB/BEIR leaderboard for each of their size variants. Text Embeddings Inference (TEI) is a toolkit developed by Hugging Face for deploying and serving open source text embeddings and sequence classification models; enabling high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure.
+
+This example showcases how to deploy a text embedding model from the Hugging Face Hub on a GKE Cluster running a purpose-built container to deploy text embedding models in a secure and managed environment with the Hugging Face DLC for TEI.
 
 ## Setup / Configuration
 
@@ -47,7 +49,7 @@ gcloud components install gke-gcloud-auth-plugin
 Once everything is set up, you can proceed with the creation of the GKE Cluster and the node pool, which in this case will be a single CPU node as for most of the workloads CPU inference is enough to serve most of the text embeddings models, while it could benefit a lot from GPU serving.
 
 > [!NOTE]
-> CPU is being used to run the inference on top of the text embeddings models to showcase the current capabilities of TEI, but switching to GPU is as easy as replacing `spec.containers[0].image` with `us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204`, and then updating the requested resources, as well as the `nodeSelector` requirements in the `deployment.yaml` file. For more information, please refer to the [`gpu-config`](./gpu-config/) directory that contains a pre-defined configuration for GPU serving in TEI with an NVIDIA Tesla T4 GPU (with a compute capability of 7.5 i.e. natively supported in TEI).
+> CPU is being used to run the inference on top of the text embeddings models to showcase the current capabilities of TEI, but switching to GPU is as easy as replacing `spec.containers[0].image` with `us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204`, and then updating the requested resources, as well as the `nodeSelector` requirements in the `deployment.yaml` file. For more information, please refer to the [`gpu-config`](./gpu-config/) directory that contains a pre-defined configuration for GPU serving in TEI with an NVIDIA Tesla T4 GPU (with a compute capability of 7.5 i.e. natively supported in TEI).
 
 To deploy the GKE Cluster, the "Autopilot" mode will be used as it is the recommended one for most of the workloads, since the underlying infrastructure is managed by Google. Alternatively, you can also use the "Standard" mode.
 
@@ -93,7 +95,7 @@ Now you can proceed to the Kubernetes deployment of the Hugging Face DLC for TEI
 The Hugging Face DLC for TEI will be deployed via `kubectl`, from the configuration files in either the `cpu-config/` or the `gpu-config/` directories depending on whether you want to use the CPU or GPU accelerators, respectively:
 
 - `deployment.yaml`: contains the deployment details of the pod including the reference to the Hugging Face DLC for TEI setting the `MODEL_ID` to [`Snowflake/snowflake-arctic-embed-m`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m).
-- `service.yaml`: contains the service details of the pod, exposing the port 80 for the TEI service.
+- `service.yaml`: contains the service details of the pod, exposing the port 8080 for the TEI service.
 - (optional) `ingress.yaml`: contains the ingress details of the pod, exposing the service to the external world so that it can be accessed via the ingress IP.
 
 ```bash
@@ -156,7 +158,7 @@ curl http://localhost:8080/embed \
 Or send a POST request to the ingress IP instead:
 
 ```bash
-curl http://$(kubectl get ingress tei-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')/embed \
+curl http://$(kubectl get ingress tei-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):8080/embed \
     -X POST \
     -d '{"inputs":"What is Deep Learning?"}' \
     -H 'Content-Type: application/json'
diff --git a/examples/gke/tei-deployment/cpu-config/deployment.yaml b/examples/gke/tei-deployment/cpu-config/deployment.yaml
index 7f0fc3c8..2838af19 100644
--- a/examples/gke/tei-deployment/cpu-config/deployment.yaml
+++ b/examples/gke/tei-deployment/cpu-config/deployment.yaml
@@ -16,7 +16,7 @@ spec:
     spec:
       containers:
         - name: tei-container
-          image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-2:latest
+          image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-4:latest
           resources:
             requests:
               cpu: "8"
@@ -38,5 +38,5 @@ spec:
         - name: data
           emptyDir: {}
       nodeSelector:
-          cloud.google.com/compute-class: "Performance"
-          cloud.google.com/machine-family: "c2"
+        cloud.google.com/compute-class: "Performance"
+        cloud.google.com/machine-family: "c2"
diff --git a/examples/gke/tei-deployment/cpu-config/ingress.yaml b/examples/gke/tei-deployment/cpu-config/ingress.yaml
index 2250699a..cd5e3c4e 100644
--- a/examples/gke/tei-deployment/cpu-config/ingress.yaml
+++ b/examples/gke/tei-deployment/cpu-config/ingress.yaml
@@ -7,11 +7,11 @@ metadata:
 spec:
   rules:
     - http:
-      paths:
-      - path: /
-        pathType: Prefix
-        backend:
-          service:
-            name: tei-service
-            port:
-              number: 8080
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: tei-service
+                port:
+                  number: 8080
diff --git a/examples/gke/tei-deployment/cpu-config/service.yaml b/examples/gke/tei-deployment/cpu-config/service.yaml
index f502c3f4..baaf3cba 100644
--- a/examples/gke/tei-deployment/cpu-config/service.yaml
+++ b/examples/gke/tei-deployment/cpu-config/service.yaml
@@ -7,6 +7,6 @@ spec:
     app: tei-server
   type: ClusterIP
   ports:
-  - protocol: TCP
-    port: 8080
-    targetPort: 8080
+    - protocol: TCP
+      port: 8080
+      targetPort: 8080
diff --git a/examples/gke/tei-deployment/gpu-config/deployment.yaml b/examples/gke/tei-deployment/gpu-config/deployment.yaml
index 105eee45..5ec84ab0 100644
--- a/examples/gke/tei-deployment/gpu-config/deployment.yaml
+++ b/examples/gke/tei-deployment/gpu-config/deployment.yaml
@@ -16,7 +16,7 @@ spec:
     spec:
      containers:
         - name: tei-container
-          image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204:latest
+          image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204:latest
           resources:
             requests:
               nvidia.com/gpu: 1
diff --git a/examples/gke/tei-deployment/gpu-config/ingress.yaml b/examples/gke/tei-deployment/gpu-config/ingress.yaml
index 2250699a..cd5e3c4e 100644
--- a/examples/gke/tei-deployment/gpu-config/ingress.yaml
+++ b/examples/gke/tei-deployment/gpu-config/ingress.yaml
@@ -7,11 +7,11 @@ metadata:
 spec:
   rules:
     - http:
-      paths:
-      - path: /
-        pathType: Prefix
-        backend:
-          service:
-            name: tei-service
-            port:
-              number: 8080
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: tei-service
+                port:
+                  number: 8080
diff --git a/examples/gke/tei-deployment/gpu-config/service.yaml b/examples/gke/tei-deployment/gpu-config/service.yaml
index f502c3f4..baaf3cba 100644
--- a/examples/gke/tei-deployment/gpu-config/service.yaml
+++ b/examples/gke/tei-deployment/gpu-config/service.yaml
@@ -7,6 +7,6 @@ spec:
     app: tei-server
   type: ClusterIP
   ports:
-  - protocol: TCP
-    port: 8080
-    targetPort: 8080
+    - protocol: TCP
+      port: 8080
+      targetPort: 8080
diff --git a/examples/gke/tei-from-gcs-deployment/README.md b/examples/gke/tei-from-gcs-deployment/README.md
index bdf193d6..f5c287c7 100644
--- a/examples/gke/tei-from-gcs-deployment/README.md
+++ b/examples/gke/tei-from-gcs-deployment/README.md
@@ -1,6 +1,8 @@
 # Deploy BGE Base v1.5 (English) with Text Embeddings Inference (TEI) from a GCS Bucket on GKE
 
-BGE, standing for BAAI General Embedding, is a collection of embedding models released by BAAI, which is an English base model for general embedding tasks ranked in the MTEB Leaderboard. Text Embeddings Inference (TEI) is a toolkit developed by Hugging Face for deploying and serving open source text embeddings and sequence classification models; enabling high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure. This post explains how to deploy a text embedding model from a Google Cloud Storage (GCS) Bucket on a GKE Cluster running a purpose-built container to deploy text embedding models in a secure and managed environment with the Hugging Face DLC for TEI.
+BGE, standing for BAAI General Embedding, is a collection of embedding models released by BAAI, which is an English base model for general embedding tasks ranked in the MTEB Leaderboard. Text Embeddings Inference (TEI) is a toolkit developed by Hugging Face for deploying and serving open source text embeddings and sequence classification models; enabling high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure.
+
+This example showcases how to deploy a text embedding model from a Google Cloud Storage (GCS) Bucket on a GKE Cluster running a purpose-built container to deploy text embedding models in a secure and managed environment with the Hugging Face DLC for TEI.
 
 ## Setup / Configuration
 
@@ -48,7 +50,7 @@ gcloud components install gke-gcloud-auth-plugin
 Once everything is set up, you can proceed with the creation of the GKE Cluster and the node pool, which in this case will be a single CPU node as for most of the workloads CPU inference is enough to serve most of the text embeddings models, while it could benefit a lot from GPU serving.
 
 > [!NOTE]
-> CPU is being used to run the inference on top of the text embeddings models to showcase the current capabilities of TEI, but switching to GPU is as easy as replacing `spec.containers[0].image` with `us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204`, and then updating the requested resources, as well as the `nodeSelector` requirements in the `deployment.yaml` file. For more information, please refer to the [`gpu-config`](./gpu-config/) directory that contains a pre-defined configuration for GPU serving in TEI with an NVIDIA Tesla T4 GPU (with a compute capability of 7.5 i.e. natively supported in TEI).
+> CPU is being used to run the inference on top of the text embeddings models to showcase the current capabilities of TEI, but switching to GPU is as easy as replacing `spec.containers[0].image` with `us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204`, and then updating the requested resources, as well as the `nodeSelector` requirements in the `deployment.yaml` file. For more information, please refer to the [`gpu-config`](./gpu-config/) directory that contains a pre-defined configuration for GPU serving in TEI with an NVIDIA Tesla T4 GPU (with a compute capability of 7.5 i.e. natively supported in TEI).
 
 To deploy the GKE Cluster, the "Autopilot" mode will be used as it is the recommended one for most of the workloads, since the underlying infrastructure is managed by Google. Alternatively, you can also use the "Standard" mode.
 
diff --git a/examples/gke/tei-from-gcs-deployment/cpu-config/deployment.yaml b/examples/gke/tei-from-gcs-deployment/cpu-config/deployment.yaml
index 4b27f7df..a11772d9 100644
--- a/examples/gke/tei-from-gcs-deployment/cpu-config/deployment.yaml
+++ b/examples/gke/tei-from-gcs-deployment/cpu-config/deployment.yaml
@@ -33,7 +33,7 @@ spec:
              cpu: 8.0
       containers:
         - name: tei-container
-          image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-2:latest
+          image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-4:latest
           resources:
             requests:
               cpu: "8"
@@ -58,11 +58,11 @@ spec:
           ephemeral:
             volumeClaimTemplate:
               spec:
-                accessModes: [ "ReadWriteOnce" ]
+                accessModes: ["ReadWriteOnce"]
                 storageClassName: ssd
                 resources:
                   requests:
                     storage: 48Gi
       nodeSelector:
-          cloud.google.com/compute-class: "Performance"
-          cloud.google.com/machine-family: "c2"
+        cloud.google.com/compute-class: "Performance"
+        cloud.google.com/machine-family: "c2"
diff --git a/examples/gke/tei-from-gcs-deployment/cpu-config/ingress.yaml b/examples/gke/tei-from-gcs-deployment/cpu-config/ingress.yaml
index fb40ac59..d89a14a1 100644
--- a/examples/gke/tei-from-gcs-deployment/cpu-config/ingress.yaml
+++ b/examples/gke/tei-from-gcs-deployment/cpu-config/ingress.yaml
@@ -8,11 +8,11 @@ metadata:
 spec:
   rules:
     - http:
-      paths:
-      - path: /
-        pathType: Prefix
-        backend:
-          service:
-            name: tei-service
-            port:
-              number: 8080
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: tei-service
+                port:
+                  number: 8080
diff --git a/examples/gke/tei-from-gcs-deployment/cpu-config/service.yaml b/examples/gke/tei-from-gcs-deployment/cpu-config/service.yaml
index 4b15290d..6fd5a78b 100644
--- a/examples/gke/tei-from-gcs-deployment/cpu-config/service.yaml
+++ b/examples/gke/tei-from-gcs-deployment/cpu-config/service.yaml
@@ -8,6 +8,6 @@ spec:
     app: tei-server
   type: ClusterIP
   ports:
-  - protocol: TCP
-    port: 8080
-    targetPort: 8080
+    - protocol: TCP
+      port: 8080
+      targetPort: 8080
diff --git a/examples/gke/tei-from-gcs-deployment/gpu-config/deployment.yaml b/examples/gke/tei-from-gcs-deployment/gpu-config/deployment.yaml
index 0f840274..20c70927 100644
--- a/examples/gke/tei-from-gcs-deployment/gpu-config/deployment.yaml
+++ b/examples/gke/tei-from-gcs-deployment/gpu-config/deployment.yaml
@@ -33,7 +33,7 @@ spec:
              cpu: 8.0
       containers:
         - name: tei-container
-          image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204:latest
+          image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204:latest
           resources:
             requests:
               nvidia.com/gpu: 1
@@ -60,7 +60,7 @@ spec:
           ephemeral:
             volumeClaimTemplate:
               spec:
-                accessModes: [ "ReadWriteOnce" ]
+                accessModes: ["ReadWriteOnce"]
                 storageClassName: ssd
                 resources:
                   requests:
diff --git a/examples/gke/tei-from-gcs-deployment/gpu-config/ingress.yaml b/examples/gke/tei-from-gcs-deployment/gpu-config/ingress.yaml
index fb40ac59..d89a14a1 100644
--- a/examples/gke/tei-from-gcs-deployment/gpu-config/ingress.yaml
+++ b/examples/gke/tei-from-gcs-deployment/gpu-config/ingress.yaml
@@ -8,11 +8,11 @@ metadata:
 spec:
   rules:
     - http:
-      paths:
-      - path: /
-        pathType: Prefix
-        backend:
-          service:
-            name: tei-service
-            port:
-              number: 8080
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: tei-service
+                port:
+                  number: 8080
diff --git a/examples/gke/tei-from-gcs-deployment/gpu-config/service.yaml b/examples/gke/tei-from-gcs-deployment/gpu-config/service.yaml
index 4b15290d..6fd5a78b 100644
--- a/examples/gke/tei-from-gcs-deployment/gpu-config/service.yaml
+++ b/examples/gke/tei-from-gcs-deployment/gpu-config/service.yaml
@@ -8,6 +8,6 @@ spec:
     app: tei-server
   type: ClusterIP
   ports:
-  - protocol: TCP
-    port: 8080
-    targetPort: 8080
+    - protocol: TCP
+      port: 8080
+      targetPort: 8080
diff --git a/examples/vertex-ai/notebooks/deploy-embedding-on-vertex-ai/vertex-notebook.ipynb b/examples/vertex-ai/notebooks/deploy-embedding-on-vertex-ai/vertex-notebook.ipynb
index eb6b11b5..8c829009 100644
--- a/examples/vertex-ai/notebooks/deploy-embedding-on-vertex-ai/vertex-notebook.ipynb
+++ b/examples/vertex-ai/notebooks/deploy-embedding-on-vertex-ai/vertex-notebook.ipynb
@@ -66,7 +66,7 @@
    "source": [
     "%env PROJECT_ID=your-project-id\n",
     "%env LOCATION=your-location\n",
-    "%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-2.ubuntu2204"
+    "%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204"
    ]
   },
   {
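As a quick sanity check of the TEI 1.2 to 1.4 bump above, the snippet below is a minimal smoke-test sketch, assuming `kubectl` already points at the GKE cluster, the repository root is the working directory, and the `tei-service` name and port 8080 from the manifests in this change are in use:

```bash
# Confirm that the bumped 1.4 DLC tags resolve in Artifact Registry
docker pull us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-4:latest
docker pull us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204:latest

# Apply the CPU configuration and probe the /embed endpoint through a local port-forward
kubectl apply -f examples/gke/tei-deployment/cpu-config/
kubectl port-forward service/tei-service 8080:8080 &
curl http://localhost:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```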