Split example introduction from TL;DR
alvarobartt committed Sep 18, 2024
1 parent 1522c41 commit 89d17c3
Showing 12 changed files with 40 additions and 13 deletions.
4 changes: 3 additions & 1 deletion examples/cloud-run/tgi-deployment/README.md
@@ -5,7 +5,9 @@ type: inference

# Deploy Meta Llama 3.1 8B with Text Generation Inference on Cloud Run

-Meta Llama 3.1 is the latest open LLM from Meta, released in July 2024. Meta Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPU, 70B for large-scale AI native applications, and 405B for synthetic data, LLM as a Judge or distillation; among other use cases. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. Google Cloud Run is a serverless container platform that allows developers to deploy and manage containerized applications without managing infrastructure, enabling automatic scaling and billing only for usage. This example showcases how to deploy an LLM from the Hugging Face Hub, in this case Meta Llama 3.1 8B Instruct model quantized to INT4 using AWQ, with the Hugging Face DLC for TGI on Google Cloud Run with GPU support (on preview).
+Meta Llama 3.1 is the latest open LLM from Meta, released in July 2024. Meta Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPUs, 70B for large-scale AI-native applications, and 405B for synthetic data generation, LLM-as-a-Judge, or distillation, among other use cases. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs with high-performance text generation. Google Cloud Run is a serverless container platform that allows developers to deploy and manage containerized applications without managing infrastructure, enabling automatic scaling and billing only for usage.
+
+This example showcases how to deploy an LLM from the Hugging Face Hub, in this case the Meta Llama 3.1 8B Instruct model quantized to INT4 with AWQ, using the Hugging Face DLC for TGI on Google Cloud Run with GPU support (in preview).

> [!NOTE]
> GPU support on Cloud Run is only available as a waitlisted public preview. If you're interested in trying out the feature, [request a quota increase](https://cloud.google.com/run/quotas#increase) for `Total Nvidia L4 GPU allocation, per project per region`. At the time of writing, NVIDIA L4 GPUs (24GiB VRAM) are the only GPUs available on Cloud Run, which allows automatic scaling up to 7 instances by default (more available via quota), as well as scaling down to zero instances when there are no requests.
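Once the service is deployed, querying it from Python could look like the following minimal sketch; the service URL is a placeholder, and the identity token is assumed to come from the `gcloud` CLI, since Cloud Run services require authenticated invocations by default:

```python
import subprocess

import requests

# Placeholder URL of the Cloud Run service running the TGI container
SERVICE_URL = "https://tgi-service-<hash>-<region>.a.run.app"

# Authenticated invocations expect a Google-issued identity token
token = subprocess.run(
    ["gcloud", "auth", "print-identity-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

# TGI exposes a /generate route that takes `inputs` plus generation `parameters`
response = requests.post(
    f"{SERVICE_URL}/generate",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "inputs": "What is serverless computing?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
)
print(response.json()["generated_text"])
```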
4 changes: 3 additions & 1 deletion examples/gke/tgi-deployment/README.md
@@ -5,7 +5,9 @@ type: inference

# Deploy Meta Llama 3 8B with Text Generation Inference (TGI) on GKE

-Meta Llama 3 is the latest LLM from the Llama family, released by Meta; coming in two sizes 8B and 70B, including both the base model and the instruction-tuned model. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure. This post explains how to deploy an LLM from the Hugging Face Hub, as Llama3 8B Instruct, on a GKE Cluster running a purpose-built container to deploy LLMs in a secure and managed environment with the Hugging Face DLC for TGI.
+Meta Llama 3 is the latest LLM from the Llama family, released by Meta; it comes in two sizes, 8B and 70B, each with both a base and an instruction-tuned variant. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs with high-performance text generation. Google Kubernetes Engine (GKE) is a fully managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure.
+
+This example showcases how to deploy an LLM from the Hugging Face Hub, in this case Llama 3 8B Instruct, on a GKE Cluster running a purpose-built container that serves LLMs in a secure and managed environment, using the Hugging Face DLC for TGI.

## Setup / Configuration

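Once the pod is running and the service is reachable, for instance after a hypothetical `kubectl port-forward service/tgi-service 8080:8080`, a quick smoke test could look like this sketch:

```python
from huggingface_hub import InferenceClient

# Assumes the TGI Kubernetes Service has been port-forwarded to localhost:8080
client = InferenceClient("http://localhost:8080")

# TGI implements the text-generation API that InferenceClient speaks natively
output = client.text_generation(
    "What is Kubernetes?",
    max_new_tokens=128,
    temperature=0.7,
)
print(output)
```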
4 changes: 3 additions & 1 deletion examples/gke/tgi-from-gcs-deployment/README.md
@@ -5,7 +5,9 @@ type: inference

# Deploy Qwen2 7B Instruct with Text Generation Inference (TGI) from a GCS Bucket on GKE

-Qwen2 is the new series of Qwen Large Language Models (LLMs) built by Alibaba Cloud, with both base and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model; the 7B variant sitting in the second place in the 7B size range in the Open LLM Leaderboard by Hugging Face and the 72B one in the first place amongst any size. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure. This post explains how to deploy an LLM from a Google Cloud Storage (GCS) Bucket on a GKE Cluster running a purpose-built container to deploy LLMs in a secure and managed environment with the Hugging Face DLC for TGI.
+Qwen2 is the new series of Qwen Large Language Models (LLMs) built by Alibaba Cloud, with both base and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model; the 7B variant ranks second in its size range on the Hugging Face Open LLM Leaderboard, and the 72B variant ranks first across all sizes. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs with high-performance text generation. Google Kubernetes Engine (GKE) is a fully managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure.
+
+This example showcases how to deploy an LLM from a Google Cloud Storage (GCS) Bucket on a GKE Cluster running a purpose-built container that serves LLMs in a secure and managed environment, using the Hugging Face DLC for TGI.

## Setup / Configuration

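The step that distinguishes this example, staging the weights in GCS before the GKE deployment, could be sketched as follows; the bucket name and object prefix are placeholders, and `huggingface_hub` plus `google-cloud-storage` are assumed to be installed with credentials configured:

```python
from pathlib import Path

from google.cloud import storage
from huggingface_hub import snapshot_download

# Download the full model snapshot from the Hugging Face Hub
local_dir = Path(snapshot_download("Qwen/Qwen2-7B-Instruct"))

# Mirror the snapshot into a GCS Bucket (placeholder name), keeping the
# relative layout so TGI can later load it from the downloaded/mounted path
bucket = storage.Client().bucket("my-model-bucket")
for path in local_dir.rglob("*"):
    if path.is_file():
        blob = bucket.blob(f"qwen2-7b-instruct/{path.relative_to(local_dir)}")
        blob.upload_from_filename(str(path))
```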
4 changes: 3 additions & 1 deletion examples/gke/trl-full-fine-tuning/README.md
@@ -5,7 +5,9 @@ type: training

# Fine-tune Gemma 2B with TRL on GKE

-Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models, developed by Google DeepMind and other teams across Google. TRL is a full stack library to fine-tune and align Large Language Models (LLMs) developed by Hugging Face. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure. This post explains how to full fine-tune Gemma 2B with TRL via Supervised Fine-Tuning (SFT) in a multi-GPU setting on a GKE Cluster.
+Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models, developed by Google DeepMind and other teams across Google. TRL is a full-stack library developed by Hugging Face to fine-tune and align Large Language Models (LLMs). Google Kubernetes Engine (GKE) is a fully managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure.
+
+This example showcases how to run full fine-tuning of Gemma 2B with TRL via Supervised Fine-Tuning (SFT) in a multi-GPU setting on a GKE Cluster.

## Setup / Configuration

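At its core, the job the GKE cluster runs is a TRL SFT script; below is a minimal single-process sketch, with an illustrative dataset and hyperparameters rather than the exact ones from the example. In the multi-GPU setting, the same script would be launched through `accelerate launch` or `torchrun`.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative instruction-tuning dataset; the example may use a different one
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model="google/gemma-2b",  # gated model: a Hugging Face token is required
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="gemma-2b-sft",
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
    ),
)
trainer.train()
```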
4 changes: 3 additions & 1 deletion examples/gke/trl-lora-fine-tuning/README.md
@@ -5,7 +5,9 @@ type: training

# Fine-tune Mistral 7B v0.3 with TRL on GKE

-Mistral is a family of models with varying sizes, created by the Mistral AI team; the Mistral 7B v0.3 LLM is a Mistral 7B v0.2 with extended vocabulary. TRL is a full stack library to fine-tune and align Large Language Models (LLMs) developed by Hugging Face. And, Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure. This post explains how to fine-tune Mistral 7B v0.3 with TRL via Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) in a single GPU on a GKE Cluster.
+Mistral is a family of models with varying sizes, created by the Mistral AI team; Mistral 7B v0.3 is Mistral 7B v0.2 with an extended vocabulary. TRL is a full-stack library developed by Hugging Face to fine-tune and align Large Language Models (LLMs). Google Kubernetes Engine (GKE) is a fully managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using GCP's infrastructure.
+
+This example showcases how to fine-tune Mistral 7B v0.3 with TRL via Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) on a single GPU on a GKE Cluster.

## Setup / Configuration

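Compared to the full fine-tuning example, the LoRA variant mainly adds a `peft_config`, which is what makes a single GPU sufficient; a minimal sketch with illustrative hyperparameters:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.3",  # gated model: requires a Hugging Face token
    train_dataset=dataset,
    # Only the low-rank adapter weights are trained; the base model stays frozen
    peft_config=LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    ),
    args=SFTConfig(output_dir="mistral-7b-v0.3-lora-sft", dataset_text_field="text"),
)
trainer.train()
```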
@@ -4,7 +4,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<!-- ---\ntitle: Deploy BERT Models with PyTorch Inference on Vertex AI\ntype: inference\n--- -->"
"<!-- ---\n",
"title: Deploy BERT Models with PyTorch Inference on Vertex AI\n",
"type: inference\n",
"--- -->"
]
},
{
@@ -18,7 +21,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT, which is a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus. And, Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications. This example showcases how to deploy any supported PyTorch model from the Hugging Face Hub, in this case [`distilbert/distilbert-base-uncased-finetuned-sst-2-english`](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english), on Vertex AI using the PyTorch Inference DLC available in Google Cloud Platform (GCP) in both CPU and GPU instances."
"DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT, which is a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus. And, Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications.\n",
"\n",
"This example showcases how to deploy any supported PyTorch model from the Hugging Face Hub, in this case [`distilbert/distilbert-base-uncased-finetuned-sst-2-english`](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english), on Vertex AI using the PyTorch Inference DLC available in Google Cloud Platform (GCP) in both CPU and GPU instances."
]
},
{
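Condensed, the flow the notebook walks through with the `google-cloud-aiplatform` SDK could look like the sketch below; the project, region, container URI, environment variable names, and request payload shape are placeholders/assumptions rather than the notebook's exact values:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

# Register the model backed by the (placeholder) PyTorch Inference DLC image
model = aiplatform.Model.upload(
    display_name="distilbert-sst2",
    serving_container_image_uri="us-docker.pkg.dev/<repo>/huggingface-pytorch-inference:latest",
    serving_container_environment_variables={
        "HF_MODEL_ID": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        "HF_TASK": "text-classification",
    },
)

# CPU deployment; a GPU variant would also pass accelerator_type/accelerator_count
endpoint = model.deploy(machine_type="n1-standard-4")

print(endpoint.predict(instances=[{"text": "I love this movie!"}]).predictions)
```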
@@ -21,7 +21,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"BGE, standing for BAAI General Embedding, is a collection of embedding models released by BAAI, which is an English base model for general embedding tasks ranked in the MTEB Leaderboard. Text Embeddings Inference (TEI) is a toolkit developed by Hugging Face for deploying and serving open source text embeddings and sequence classification models; enabling high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. And, Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications. This example showcases how to deploy any supported embedding model, in this case [`BAAI/bge-large-en-v1.5`](https://huggingface.co/BAAI/bge-large-en-v1.5), from the Hugging Face Hub on Vertex AI using the TEI DLC available in Google Cloud Platform (GCP) in both CPU and GPU instances."
"BGE, standing for BAAI General Embedding, is a collection of embedding models released by BAAI, which is an English base model for general embedding tasks ranked in the MTEB Leaderboard. Text Embeddings Inference (TEI) is a toolkit developed by Hugging Face for deploying and serving open source text embeddings and sequence classification models; enabling high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. And, Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications.\n",
"\n",
"This example showcases how to deploy any supported embedding model, in this case [`BAAI/bge-large-en-v1.5`](https://huggingface.co/BAAI/bge-large-en-v1.5), from the Hugging Face Hub on Vertex AI using the TEI DLC available in Google Cloud Platform (GCP) in both CPU and GPU instances."
]
},
{
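After the `Model.upload` / `model.deploy` flow, requesting embeddings from the endpoint could look like this sketch; the endpoint ID and the exact payload shape are assumptions:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

# Placeholder ID of the endpoint serving the TEI container
endpoint = aiplatform.Endpoint("1234567890")

# TEI takes an `inputs` field and returns one embedding vector per input
response = endpoint.predict(instances=[{"inputs": "What is deep learning?"}])
embedding = response.predictions[0]
print(len(embedding))  # BAAI/bge-large-en-v1.5 produces 1024-dimensional vectors
```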
@@ -21,7 +21,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"FLUX is an open-weights 12B parameter rectified flow transformer that generates images from text descriptions, pushing the boundaries of text-to-image generation created by Black Forest Labs, with a non-commercial license making it widely accessible for exploration and experimentation. And, Google Cloud Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications. This example showcases how to deploy any supported [`diffusers`](https://github.com/huggingface/diffusers) text-to-image model from the Hugging Face Hub, in this case [`black-forest-labs/FLUX.1-dev`](https://huggingface.co/black-forest-labs/FLUX.1-dev), on Vertex AI using the Hugging Face PyTorch DLC for Inference available in Google Cloud Platform (GCP) in both CPU and GPU instances."
"FLUX is an open-weights 12B parameter rectified flow transformer that generates images from text descriptions, pushing the boundaries of text-to-image generation created by Black Forest Labs, with a non-commercial license making it widely accessible for exploration and experimentation. And, Google Cloud Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications.\n",
"\n",
"This example showcases how to deploy any supported [`diffusers`](https://github.com/huggingface/diffusers) text-to-image model from the Hugging Face Hub, in this case [`black-forest-labs/FLUX.1-dev`](https://huggingface.co/black-forest-labs/FLUX.1-dev), on Vertex AI using the Hugging Face PyTorch DLC for Inference available in Google Cloud Platform (GCP) in both CPU and GPU instances."
]
},
{
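A text-to-image request against the deployed endpoint could look roughly like the sketch below; the endpoint ID is a placeholder, and both the payload and the base64-encoded response shape are assumptions:

```python
import base64
import io

from google.cloud import aiplatform
from PIL import Image

aiplatform.init(project="my-gcp-project", location="us-central1")

# Placeholder ID of the endpoint serving FLUX.1-dev
endpoint = aiplatform.Endpoint("1234567890")

# Assumed payload: a prompt plus optional generation parameters
response = endpoint.predict(
    instances=[{"prompt": "An astronaut riding a horse on Mars", "num_inference_steps": 25}]
)

# Assumed response: each prediction is a base64-encoded image
image = Image.open(io.BytesIO(base64.b64decode(response.predictions[0])))
image.save("astronaut.png")
```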
@@ -21,7 +21,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models, developed by Google DeepMind and other teams across Google. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. And, Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications. This example showcases how to deploy any supported text-generation model, in this case [`google/gemma-7b-it`](https://huggingface.co/google/gemma-7b-it), downloaded from the Hugging Face Hub and uploaded to a Google Cloud Storage (GCS) Bucket, on Vertex AI using the Hugging Face DLC for TGI available in Google Cloud Platform (GCP)."
"Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models, developed by Google DeepMind and other teams across Google. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. And, Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications.\n",
"\n",
"This example showcases how to deploy any supported text-generation model, in this case [`google/gemma-7b-it`](https://huggingface.co/google/gemma-7b-it), downloaded from the Hugging Face Hub and uploaded to a Google Cloud Storage (GCS) Bucket, on Vertex AI using the Hugging Face DLC for TGI available in Google Cloud Platform (GCP)."
]
},
{
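The GCS-backed variant differs from serving straight off the Hub mainly in that `Model.upload` points at the bucket through `artifact_uri`; a sketch with placeholder URIs:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="gemma-7b-it-tgi",
    # Weights previously downloaded from the Hub and uploaded to GCS
    artifact_uri="gs://my-model-bucket/gemma-7b-it",
    # Placeholder URI of the Hugging Face DLC for TGI
    serving_container_image_uri="us-docker.pkg.dev/<repo>/huggingface-text-generation-inference:latest",
)

endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
```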
@@ -21,7 +21,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models, developed by Google DeepMind and other teams across Google. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. And, Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications. This example showcases how to deploy any supported text-generation model, in this case [`google/gemma-7b-it`](https://huggingface.co/google/gemma-7b-it), from the Hugging Face Hub on Vertex AI using the TGI DLC available in Google Cloud Platform (GCP)."
"Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models, developed by Google DeepMind and other teams across Google. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. And, Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications.\n",
"\n",
"This example showcases how to deploy any supported text-generation model, in this case [`google/gemma-7b-it`](https://huggingface.co/google/gemma-7b-it), from the Hugging Face Hub on Vertex AI using the TGI DLC available in Google Cloud Platform (GCP)."
]
},
{
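Serving directly from the Hub instead hands the model ID to the container via environment variables; the variable names below follow common TGI conventions but are assumptions, as are the container URI and machine shape:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="gemma-7b-it-tgi",
    serving_container_image_uri="us-docker.pkg.dev/<repo>/huggingface-text-generation-inference:latest",
    serving_container_environment_variables={
        "MODEL_ID": "google/gemma-7b-it",        # pulled from the Hub at startup
        "HUGGING_FACE_HUB_TOKEN": "<hf-token>",  # gemma is a gated model
        "NUM_SHARD": "1",
    },
)

endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

output = endpoint.predict(
    instances=[{"inputs": "What is Vertex AI?", "parameters": {"max_new_tokens": 128}}]
)
print(output.predictions[0])
```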
@@ -21,7 +21,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"[Transformer Reinforcement Learning (TRL)](https://github.com/huggingface/trl) is a framework developed by Hugging Face to fine-tune and align both transformer language and diffusion models using methods such as Supervised Fine-Tuning (SFT), Reward Modeling (RM), Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and others. On the other hand, Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications. This example showcases how to create a custom training job on Vertex AI running the Hugging Face PyTorch DLC for training, using the TRL CLI to full fine-tune a 7B LLM with SFT in a multi-GPU setting."
"[Transformer Reinforcement Learning (TRL)](https://github.com/huggingface/trl) is a framework developed by Hugging Face to fine-tune and align both transformer language and diffusion models using methods such as Supervised Fine-Tuning (SFT), Reward Modeling (RM), Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and others. On the other hand, Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications.\n",
"\n",
"This example showcases how to create a custom training job on Vertex AI running the Hugging Face PyTorch DLC for training, using the TRL CLI to full fine-tune a 7B LLM with SFT in a multi-GPU setting."
]
},
{
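Launching that custom job through the `google-cloud-aiplatform` SDK could look like the sketch below; the container URI, CLI flags, model, dataset, and machine shape are placeholders/assumptions (a gated model would additionally need a Hugging Face token passed to the job):

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-gcp-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="trl-full-sft",
    # Placeholder URI of the Hugging Face PyTorch DLC for training
    container_uri="us-docker.pkg.dev/<repo>/huggingface-pytorch-training:latest",
    command=["trl", "sft"],  # the TRL CLI entrypoint
)

job.run(
    args=[
        "--model_name_or_path=mistralai/Mistral-7B-v0.1",  # illustrative 7B model
        "--dataset_name=timdettmers/openassistant-guanaco",
        "--output_dir=/gcs/my-staging-bucket/mistral-7b-sft",
        "--per_device_train_batch_size=2",
        "--gradient_accumulation_steps=4",
    ],
    replica_count=1,
    machine_type="a2-highgpu-4g",  # 4 x NVIDIA A100 40GB
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=4,
)
```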