diff --git a/README.md b/README.md
index b48f14ec..02c6962d 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ The [`examples`](./examples) directory contains examples for using the container
_Note: we added the latest TGI version as an example to the repository, which can be built with:_
```bash
-docker build -t us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-gpu.1.3.4 -f containers/tgi/gpu/1.3.4/Dockerfile .
+docker build -t us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-gpu.1.4.2 -f containers/tgi/gpu/1.4.2/Dockerfile .
```
### Mistral 7B test
@@ -29,7 +29,7 @@ docker run --gpus all -ti -p 8080:80 \
-e NUM_SHARD=$num_shard \
-e MAX_INPUT_LENGTH=$max_input_length \
-e MAX_TOTAL_TOKENS=$max_total_tokens \
- us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-gpu.1.3.4
+ us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-gpu.1.4.2
```
Send request:
@@ -41,10 +41,10 @@ curl 127.0.0.1:8080/generate \
-H 'Content-Type: application/json'
```
-### Golden Gate Test
+### Gemma Test
```bash
-model=gg-hf/golden-gate-7b
+model=google/gemma-7b
num_shard=1
max_input_length=512
max_total_tokens=1024
@@ -58,7 +58,7 @@ docker run --gpus all -ti -p 8080:80 \
-e MAX_TOTAL_TOKENS=$max_total_tokens \
-e MAX_BATCH_PREFILL_TOKENS=$max_batch_prefill_tokens \
-e HUGGING_FACE_HUB_TOKEN=$token \
- us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-gpu.1.3.4
+ us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-gpu.1.4.2
```
Send request:
@@ -70,7 +70,7 @@ curl 127.0.0.1:8080/generate \
-H 'Content-Type: application/json'
```
-For a Vertex AI example checkout [Deploy Golden Gate on Vertex AI](./examples//vertex-ai/deploy-golden-gate-on-vertex-ai.ipynb)
+For a Vertex AI example, check out [Deploy Gemma on Vertex AI](./examples/vertex-ai/notebooks/deploy-gemma-on-vertex-ai.ipynb).
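+
+To make the locally built TGI container available to Vertex AI, push it to an Artifact Registry repository first. This is a minimal sketch assuming an existing repository; the region, project, and repository names below are placeholders:
+
+```bash
+# Tag the locally built image for your Artifact Registry repository (illustrative names).
+docker tag us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-gpu.1.4.2 \
+  us-central1-docker.pkg.dev/<your-project-id>/<your-repo>/huggingface-text-generation-inference-gpu.1.4.2
+# Push it so Vertex AI can pull it as the serving container.
+docker push us-central1-docker.pkg.dev/<your-project-id>/<your-repo>/huggingface-text-generation-inference-gpu.1.4.2
+```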
## Configurations
@@ -87,13 +87,13 @@ After the containers are built, you can run the tests in the `tests` directory t
| Container Tag | Framework | Type | Accelerator |
| ----------------------------------------------------------------------------- | --------- | --------- | ----------- |
-| [pytorch-training-gpu.2.1.transformers.4.37.2.py310](./containers/pytorch/training/gpu/2.1/transformers/4.37.2/py310/Dockerfile) | Pytorch | training | GPU |
-| [text-generation-inference-gpu.1.3.4](https://github.com/huggingface/Google-Cloud-Containers/blob/main/containers/tgi/gpu/1.3.4/Dockerfile) | - | inference | GPU |
+| [pytorch-training-gpu.2.1.transformers.4.38.1.py310](./containers/pytorch/training/gpu/2.1/transformers/4.38.1/py310/Dockerfile) | Pytorch | training | GPU |
+| [text-generation-inference-gpu.1.4.2](./containers/tgi/gpu/1.4.2/Dockerfile) | - | inference | GPU |
## Directory Structure
-The container files are organized in a nested folder structure based on the container tag. For example, the Dockerfile for the container with the tag `pytorch-training-gpu.2.0.transformers.4.35.0.py310` is located at `pytorch/training/gpu/2.0/transformers/4.35.0/py310/Dockerfile`.
+The container files are organized in a nested folder structure based on the container tag. For example, the Dockerfile for the container with the tag `pytorch-training-gpu.2.1.transformers.4.38.1.py310` is located at `pytorch/training/gpu/2.1/transformers/4.38.1/py310/Dockerfile`.
## Updates
-When we update the transformers version, we add a new folder in the `transformers` directory. For example, if we update the transformers version to 4.36.0, we would add a new folder at `pytorch/training/gpu/2.0/transformers/4.36.0`.
+When we update the transformers version, we add a new folder in the `transformers` directory. For example, if we update the transformers version to 4.39.0, we would add a new folder at `pytorch/training/gpu/2.1/transformers/4.39.0`.
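+
+Concretely, for a hypothetical 4.39.0 release the new folder and the matching build command would look like this (a sketch only; 4.39.0 is not part of this release):
+
+```bash
+# Create the new version folder following the tag layout described above.
+mkdir -p containers/pytorch/training/gpu/2.1/transformers/4.39.0/py310
+# Build the image; the tag mirrors the folder structure.
+docker build -t pytorch-training-gpu.2.1.transformers.4.39.0.py310 \
+  -f containers/pytorch/training/gpu/2.1/transformers/4.39.0/py310/Dockerfile .
+```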
diff --git a/containers/jax/training/gpu/0.4/transformers/4.37.2/py310/Dockerfile b/containers/jax/training/gpu/0.4/transformers/4.38.1/py310/Dockerfile
similarity index 98%
rename from containers/jax/training/gpu/0.4/transformers/4.37.2/py310/Dockerfile
rename to containers/jax/training/gpu/0.4/transformers/4.38.1/py310/Dockerfile
index 58ed7682..70084a38 100644
--- a/containers/jax/training/gpu/0.4/transformers/4.37.2/py310/Dockerfile
+++ b/containers/jax/training/gpu/0.4/transformers/4.38.1/py310/Dockerfile
@@ -7,7 +7,7 @@ ARG DEBIAN_FRONTEND=noninteractive
# Versions
ARG CUDA="12"
ARG CUDNN="89"
-ARG TRANSFORMERS='4.37.2'
+ARG TRANSFORMERS='4.38.1'
ARG DIFFUSERS='0.26.1'
ARG DATASETS='2.16.1'
# jax and flax compatible with transformers ["jax>=0.4.1,<=0.4.13", "jaxlib>=0.4.1,<=0.4.13", "flax>=0.4.1,<=0.7.0"] as mentioned in setup.py
diff --git a/containers/pytorch/training/gpu/2.0/transformers/4.35.0/py310/Dockerfile b/containers/pytorch/training/gpu/2.0/transformers/4.35.0/py310/Dockerfile
deleted file mode 100644
index e69de29b..00000000
diff --git a/containers/pytorch/training/gpu/2.0/transformers/4.36.0/py310/Dockerfile b/containers/pytorch/training/gpu/2.0/transformers/4.36.0/py310/Dockerfile
deleted file mode 100644
index e69de29b..00000000
diff --git a/containers/pytorch/training/gpu/2.1/transformers/4.38.0.dev0/py310/Dockerfile b/containers/pytorch/training/gpu/2.1/transformers/4.38.0.dev0/py310/Dockerfile
deleted file mode 100644
index 0a7c2f16..00000000
--- a/containers/pytorch/training/gpu/2.1/transformers/4.38.0.dev0/py310/Dockerfile
+++ /dev/null
@@ -1,89 +0,0 @@
-FROM nvcr.io/nvidia/pytorch:23.10-py3
-# The image with PyTorch = 2.1.0a0+32f93b1, CUDA = 12.2.2, Python = 3.10.12"
-# You can read more about it here: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-10.html
-
-LABEL maintainer="Hugging Face"
-ARG DEBIAN_FRONTEND=noninteractive
-
-# Versions
-ARG FLASH_ATTN='2.5.2'
-ARG TRANSFORMERS='4.37.2'
-ARG DIFFUSERS='0.26.1'
-ARG PEFT='0.8.2'
-ARG TRL='0.7.10'
-ARG BITSANDBYTES='0.42.0'
-ARG DATASETS='2.16.1'
-ARG ACCELERATE='0.27.0'
-ARG EVALUATE='0.4.1'
-ARG SENTENCE_TRANSFORMERS='2.3.1'
-ARG DEEPSPEED='0.13.1'
-ARG MAX_JOBS=4
-
-RUN apt-get update \
- && apt-get install -y \
- bzip2 \
- curl \
- git \
- git-lfs \
- tar \
- gcc \
- g++ \
- libaio-dev \
- # audio
- libsndfile1-dev \
- ffmpeg \
- apt-transport-https \
- gnupg \
- ca-certificates \
- && apt-get clean autoremove --yes
-
-# Update pip
-RUN pip install --upgrade pip
-
-# Uninstall Transformer-Engine
-RUN pip uninstall -y transformer-engine
-
-# Upgrade FlashAttnV2
-RUN pip install --no-cache-dir \
- packaging \
- ninja
-RUN MAX_JOBS=${MAX_JOBS} pip install flash-attn==${FLASH_ATTN} --no-build-isolation
-
-# Install Hugging Face Libraries
-RUN pip install --upgrade --no-cache-dir \
- transformers[sklearn,sentencepiece,vision]==${TRANSFORMERS} \
- diffusers==${DIFFUSERS} \
- datasets==${DATASETS} \
- accelerate==${ACCELERATE} \
- evaluate==${EVALUATE} \
- peft==${PEFT} \
- trl==${TRL} \
- sentence-transformers==${SENTENCE_TRANSFORMERS} \
- deepspeed==${DEEPSPEED} \
- bitsandbytes==${BITSANDBYTES}
-
-# To install version that has golden-gate integrated into transformers
-#Mandatory GitHub token to access the private repository, read more about it here: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
-RUN --mount=type=secret,id=GITHUB_TOKEN pip install git+https://$(cat /run/secrets/GITHUB_TOKEN)@github.com/huggingface/new-model-addition-golden-gate.git@add-golden-gate
-
-# Install Google Cloud Dependencies
-RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" \
- | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
- curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \
- | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
- apt-get update -y && \
- apt-get install google-cloud-sdk -y
-
-RUN pip install --upgrade --no-cache-dir \
- google-cloud-storage \
- google-cloud-bigquery \
- google-cloud-aiplatform \
- google-cloud-pubsub \
- google-cloud-logging
-
-# # Check if correct versions are installed
-RUN python -c "import transformers, diffusers, datasets, accelerate, evaluate, peft, trl, sentence_transformers, deepspeed, bitsandbytes, flash_attn, torch; \
- assert all([mod.__version__ == version for mod, version in [(transformers, '4.38.0.dev0'), (diffusers, '${DIFFUSERS}'), \
- (datasets, '${DATASETS}'), (accelerate, '${ACCELERATE}'), (evaluate, '${EVALUATE}'), (peft, '${PEFT}'), (trl, '${TRL}'), \
- (sentence_transformers, '${SENTENCE_TRANSFORMERS}'), (deepspeed, '${DEEPSPEED}'), (bitsandbytes, '${BITSANDBYTES}'), \
- (torch, '2.1.0a0+32f93b1'), (flash_attn, '${FLASH_ATTN}')]])"
diff --git a/containers/pytorch/training/gpu/2.1/transformers/4.37.2/py310/Dockerfile b/containers/pytorch/training/gpu/2.1/transformers/4.38.1/py310/Dockerfile
similarity index 99%
rename from containers/pytorch/training/gpu/2.1/transformers/4.37.2/py310/Dockerfile
rename to containers/pytorch/training/gpu/2.1/transformers/4.38.1/py310/Dockerfile
index 08ed50a0..6c553c89 100644
--- a/containers/pytorch/training/gpu/2.1/transformers/4.37.2/py310/Dockerfile
+++ b/containers/pytorch/training/gpu/2.1/transformers/4.38.1/py310/Dockerfile
@@ -7,7 +7,7 @@ ARG DEBIAN_FRONTEND=noninteractive
# Versions
ARG FLASH_ATTN='2.5.2'
-ARG TRANSFORMERS='4.37.2'
+ARG TRANSFORMERS='4.38.1'
ARG DIFFUSERS='0.26.1'
ARG PEFT='0.8.2'
ARG TRL='0.7.10'
diff --git a/containers/pytorch/training/gpu/2.1/transformers/4.38.0.dev0/README.md b/containers/pytorch/training/gpu/2.1/transformers/4.38.1/py310/README.md
similarity index 82%
rename from containers/pytorch/training/gpu/2.1/transformers/4.38.0.dev0/README.md
rename to containers/pytorch/training/gpu/2.1/transformers/4.38.1/py310/README.md
index 85282f8f..cbe6eb07 100644
--- a/containers/pytorch/training/gpu/2.1/transformers/4.38.0.dev0/README.md
+++ b/containers/pytorch/training/gpu/2.1/transformers/4.38.1/py310/README.md
@@ -1,12 +1,11 @@
-# Fine-tune Gemma-2B on Vertex AI WorkBench
+# Fine-tune Gemma-7B on Vertex AI WorkBench
-This file contains step by step instructions on how to build a docker image and then run it to test the Gemma-2B model using the
+This file contains step-by-step instructions on how to build a Docker image and then run it to test the Gemma-7B model using the
[gemma-finetuning-clm-lora-sft.ipynb](https://github.com/huggingface/Google-Cloud-Containers/blob/main/examples/vertex-ai/gemma-finetuning-clm-lora-sft.ipynb) Notebook on a Vertex AI WorkBench instance and on your local machine.
## Pre-requisites:
-1. Access to [gg-hf](https://huggingface.co/gg-hf) on Hugging Face Hub in order to download the model and the tokenizer.
+1. Accept the Terms and Conditions on the [Hugging Face Hub](https://huggingface.co/google/gemma-7b) in order to download the model and the tokenizer.
2. Access to [Google-Cloud-Containers](https://github.com/huggingface/Google-Cloud-Containers) GitHub repository in order to access the docker file.
-3. Access to [new-model-addition-golden-gate](https://github.com/huggingface/new-model-addition-golden-gate/) GitHub repository in order to use transformer library with the gg-hf model integrated into it.
We use the [gemma-finetuning-clm-lora-sft.ipynb](https://github.com/huggingface/Google-Cloud-Containers/blob/main/examples/vertex-ai/gemma-finetuning-clm-lora-sft.ipynb) Notebook to test the model.
@@ -18,17 +17,10 @@ Use the following command to build the docker image. Make sure to replace the va
```bash
git clone https://github.com/huggingface/Google-Cloud-Containers
cd Google-Cloud-Containers
-export GITHUB_TOKEN=your-github-token
-docker build --secret id=GITHUB_TOKEN,env=GITHUB_TOKEN -t pytorch-training-gpu.2.1.transformers.4.38.0.dev0.py310 -f containers/pytorch/training/gpu/2.1/transformers/4.38.0.dev0/py310/Dockerfile .
+docker build -t pytorch-training-gpu.2.1.transformers.4.38.1.py310 -f containers/pytorch/training/gpu/2.1/transformers/4.38.1/py310/Dockerfile .
```
-For setting the value of `GITHUB_TOKEN` please follow the detailed instructions mentioned in the following links:
-- [Creating a fine-grained personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-fine-grained-personal-access-token)
-
-- [Creating a personal access token (classic)](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-personal-access-token-classic)
-
-
-## Using Vertex AI WorkBench Instance to fine-tune the Gemma-2B model
+## Using Vertex AI WorkBench Instance to fine-tune the Gemma-7B model
It consists of the following steps:
1. Push the docker image to the Google Cloud Artifact registry.
@@ -56,12 +48,12 @@ Now, you can push the image to the Google Cloud Artifact registry using the foll
REGION="us-central1"
DOCKER_ARTIFACT_REPO="deep-learning-images"
PROJECT_ID="gcp-project-id"
-BASE_IMAGE="pytorch-training-gpu.2.1.transformers.4.38.0.dev0.py310"
+BASE_IMAGE="pytorch-training-gpu.2.1.transformers.4.38.1.py310"
FRAMEWORK="pytorch"
TYPE="training"
ACCELERATOR="gpu"
FRAMEWORK_VERSION="2.1"
-TRANSFORMERS_VERISON="4.38.0.dev0"
+TRANSFORMERS_VERISON="4.38.1"
PYTHON_VERSION="py310"
SERVING_CONTAINER_IMAGE_URI="${REGION}-docker.pkg.dev/${PROJECT_ID}/${DOCKER_ARTIFACT_REPO}/huggingface-${FRAMEWORK}-${TYPE}-${ACCELERATOR}.${FRAMEWORK_VERSION}.transformers.${TRANSFORMERS_VERISON}.${PYTHON_VERSION}:latest"
@@ -109,7 +101,7 @@ We will use the Google Cloud CLI to create a Vertex AI WorkBench instance from a
```bash
gcloud notebooks instances create example-instance-1 \
- --container-repository=us-central1-docker.pkg.dev/gcp-project-id/deep-learning-images/huggingface-pytorch-training-gpu.2.1.transformers.4.38.0.dev0.py310 \
+ --container-repository=us-central1-docker.pkg.dev/gcp-project-id/deep-learning-images/huggingface-pytorch-training-gpu.2.1.transformers.4.38.1.py310 \
--container-tag=latest \
--machine-type=n1-standard-4 \
--location=us-central1-c \
@@ -137,7 +129,7 @@ Then, you can access the [gemma-finetuning-clm-lora-sft.ipynb](https://github.co
Make sure you have the [gemma-finetuning-clm-lora-sft.ipynb](https://github.com/huggingface/Google-Cloud-Containers/blob/main/examples/vertex-ai/gemma-finetuning-clm-lora-sft.ipynb) Notebook on your local machine, as we are mounting the current directory to the Docker container.
```bash
-docker run -it --gpus all -p 8080:8080 -v $(pwd):/workspace pytorch-training-gpu.2.1.transformers.4.38.0.dev0.py310
+docker run -it --gpus all -p 8080:8080 -v $(pwd):/workspace pytorch-training-gpu.2.1.transformers.4.38.1.py310
```
Inside the docker container, you can run the following command to start the jupyter notebook:
diff --git a/containers/tgi/gpu/1.3.4/support-golden-gate.patch b/containers/tgi/gpu/1.3.4/support-golden-gate.patch
deleted file mode 100644
index f664671b..00000000
--- a/containers/tgi/gpu/1.3.4/support-golden-gate.patch
+++ /dev/null
@@ -1,942 +0,0 @@
-diff --git a/integration-tests/models/test_flash_phi.py b/integration-tests/models/test_flash_phi.py
-index 6391f2a..0987b3a 100644
---- a/integration-tests/models/test_flash_phi.py
-+++ b/integration-tests/models/test_flash_phi.py
-@@ -21,7 +21,7 @@ async def test_flash_phi(flash_phi, response_snapshot):
- )
-
- assert response.details.generated_tokens == 10
-- assert response.generated_text == ": {request}\")\n response = self"
-+ assert response.generated_text == ': {request}")\n response = self'
- assert response == response_snapshot
-
-
-@@ -52,14 +52,12 @@ async def test_flash_phi_all_params(flash_phi, response_snapshot):
- @pytest.mark.asyncio
- @pytest.mark.private
- async def test_flash_phi_load(flash_phi, generate_load, response_snapshot):
-- responses = await generate_load(
-- flash_phi, "Test request", max_new_tokens=10, n=4
-- )
-+ responses = await generate_load(flash_phi, "Test request", max_new_tokens=10, n=4)
-
- assert len(responses) == 4
- assert all(
- [r.generated_text == responses[0].generated_text for r in responses]
- ), f"{[r.generated_text for r in responses]}"
-- assert responses[0].generated_text == ": {request}\")\n response = self"
-+ assert responses[0].generated_text == ': {request}")\n response = self'
-
- assert responses == response_snapshot
-diff --git a/server/tests/utils/test_layers.py b/server/tests/utils/test_layers.py
-index 0a9fecd..93a0e98 100644
---- a/server/tests/utils/test_layers.py
-+++ b/server/tests/utils/test_layers.py
-@@ -3,24 +3,27 @@ from text_generation_server.utils.layers import (
- TensorParallelEmbedding,
- )
-
-+
- class ProcessGroup:
- def __init__(self, rank: int, world_size: int):
- self._rank = rank
- self.world_size = world_size
-
-- def size(self)->int:
-+ def size(self) -> int:
- return self.world_size
-
-- def rank(self)->int:
-+ def rank(self) -> int:
- return self._rank
-
-+
- class Weights:
- def __init__(self, rank: int, world_size: int, vocab_size: int, hidden_dim: int):
-- self.weight = torch.arange(vocab_size*hidden_dim).float().view(vocab_size, hidden_dim)
-+ self.weight = (
-+ torch.arange(vocab_size * hidden_dim).float().view(vocab_size, hidden_dim)
-+ )
- self.process_group = ProcessGroup(rank, world_size)
-
--
-- def get_partial_sharded(self, name:str, dim: int):
-+ def get_partial_sharded(self, name: str, dim: int):
- assert dim == 0
-
- rank = self.process_group.rank()
-@@ -35,10 +38,11 @@ class Weights:
- def get_shape(self, name: str):
- return self.weight.shape
-
-+
- def test_weight_hub_files_offline_error():
-
-- vocab_size= 17
-- weights = Weights(rank=0, world_size=1, vocab_size = vocab_size,hidden_dim = 256)
-+ vocab_size = 17
-+ weights = Weights(rank=0, world_size=1, vocab_size=vocab_size, hidden_dim=256)
- embeddings = TensorParallelEmbedding("", weights)
-
- input_ids = torch.arange(vocab_size)
-@@ -47,18 +51,27 @@ def test_weight_hub_files_offline_error():
- assert embeddings.max_id == 17
- torch.testing.assert_close(output, torch.arange(256 * 17).float().view(17, 256))
-
-- weights_0_2 = Weights(rank=0, world_size=2, vocab_size = vocab_size,hidden_dim = 256)
-- weights_1_2 = Weights(rank=1, world_size=2, vocab_size = vocab_size,hidden_dim = 256)
-+ weights_0_2 = Weights(rank=0, world_size=2, vocab_size=vocab_size, hidden_dim=256)
-+ weights_1_2 = Weights(rank=1, world_size=2, vocab_size=vocab_size, hidden_dim=256)
- embeddings_0_2 = TensorParallelEmbedding("", weights_0_2, reduce=False)
- assert embeddings_0_2.min_id == 0
- assert embeddings_0_2.max_id == 9
-- torch.testing.assert_close(embeddings_0_2.weight , torch.cat([torch.arange(9 * 256), torch.zeros(256)], dim=0).view(10, 256).float())
-+ torch.testing.assert_close(
-+ embeddings_0_2.weight,
-+ torch.cat([torch.arange(9 * 256), torch.zeros(256)], dim=0)
-+ .view(10, 256)
-+ .float(),
-+ )
- embeddings_1_2 = TensorParallelEmbedding("", weights_1_2, reduce=False)
- assert embeddings_1_2.min_id == 9
- assert embeddings_1_2.max_id == 17
-- torch.testing.assert_close(embeddings_1_2.weight , torch.cat([torch.arange(8 * 256) + 9 * 256, torch.zeros(256)], dim=0).view(9, 256).float())
-+ torch.testing.assert_close(
-+ embeddings_1_2.weight,
-+ torch.cat([torch.arange(8 * 256) + 9 * 256, torch.zeros(256)], dim=0)
-+ .view(9, 256)
-+ .float(),
-+ )
- output_tp_0 = embeddings_0_2.forward(input_ids)
- output_tp_1 = embeddings_1_2.forward(input_ids)
-
- torch.testing.assert_close(output, output_tp_0 + output_tp_1)
--
-diff --git a/server/text_generation_server/models/__init__.py b/server/text_generation_server/models/__init__.py
-index 679e1e2..1461cd0 100644
---- a/server/text_generation_server/models/__init__.py
-+++ b/server/text_generation_server/models/__init__.py
-@@ -52,6 +52,9 @@ try:
- from text_generation_server.models.flash_llama import (
- FlashLlama,
- )
-+ from text_generation_server.models.flash_golden_gate import (
-+ FlashGoldenGate,
-+ )
- from text_generation_server.models.flash_santacoder import (
- FlashSantacoderSharded,
- )
-@@ -282,6 +285,26 @@ def get_model(
- dtype=dtype,
- trust_remote_code=trust_remote_code,
- )
-+ if model_type == "golden_gate":
-+ if FLASH_ATTENTION:
-+ return FlashGoldenGate(
-+ model_id,
-+ revision,
-+ quantize=quantize,
-+ dtype=dtype,
-+ trust_remote_code=trust_remote_code,
-+ use_medusa=use_medusa,
-+ )
-+ elif sharded:
-+ raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Golden Gate"))
-+ else:
-+ return CausalLM(
-+ model_id,
-+ revision,
-+ quantize=quantize,
-+ dtype=dtype,
-+ trust_remote_code=trust_remote_code,
-+ )
-
- if model_type in ["RefinedWeb", "RefinedWebModel", "falcon"]:
- if sharded:
-diff --git a/server/text_generation_server/models/custom_modeling/flash_golden_gate_modeling.py b/server/text_generation_server/models/custom_modeling/flash_golden_gate_modeling.py
-new file mode 100644
-index 0000000..ca5b595
---- /dev/null
-+++ b/server/text_generation_server/models/custom_modeling/flash_golden_gate_modeling.py
-@@ -0,0 +1,447 @@
-+# coding=utf-8
-+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
-+#
-+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
-+# and OPT implementations in this library. It has been modified from its
-+# original forms to accommodate minor architectural differences compared
-+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
-+#
-+# Licensed under the Apache License, Version 2.0 (the "License");
-+# you may not use this file except in compliance with the License.
-+# You may obtain a copy of the License at
-+#
-+# http://www.apache.org/licenses/LICENSE-2.0
-+#
-+# Unless required by applicable law or agreed to in writing, software
-+# distributed under the License is distributed on an "AS IS" BASIS,
-+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-+# See the License for the specific language governing permissions and
-+# limitations under the License.
-+
-+import torch
-+import torch.distributed
-+
-+from torch import nn
-+from transformers.activations import ACT2FN
-+from transformers.configuration_utils import PretrainedConfig
-+from typing import Optional, List, Tuple
-+
-+from text_generation_server.utils import paged_attention, flash_attn
-+from text_generation_server.utils.layers import (
-+ TensorParallelRowLinear,
-+ TensorParallelColumnLinear,
-+ TensorParallelEmbedding,
-+ PositionRotaryEmbedding,
-+ TensorParallelHead,
-+ get_linear,
-+ FastRMSNorm,
-+)
-+
-+
-+class GoldenGateConfig(PretrainedConfig):
-+ def __init__(
-+ self,
-+ vocab_size=256128,
-+ hidden_size=3072,
-+ intermediate_size=24576,
-+ num_hidden_layers=28,
-+ num_attention_heads=16,
-+ num_key_value_heads=16,
-+ head_dim=256,
-+ hidden_act="gelu",
-+ max_position_embeddings=8192,
-+ initializer_range=0.02,
-+ rms_norm_eps=1e-6,
-+ use_cache=True,
-+ pad_token_id=None,
-+ bos_token_id=1,
-+ eos_token_id=2,
-+ tie_word_embeddings=True,
-+ rope_theta=10000.0,
-+ rope_scaling=None,
-+ attention_bias=False,
-+ attention_dropout=0.0,
-+ **kwargs,
-+ ):
-+ self.vocab_size = vocab_size
-+ self.max_position_embeddings = max_position_embeddings
-+ self.hidden_size = hidden_size
-+ self.head_dim = head_dim
-+ self.intermediate_size = intermediate_size
-+ self.num_hidden_layers = num_hidden_layers
-+ self.num_attention_heads = num_attention_heads
-+
-+ # for backward compatibility
-+ if num_key_value_heads is None:
-+ num_key_value_heads = num_attention_heads
-+
-+ self.num_key_value_heads = num_key_value_heads
-+ self.hidden_act = hidden_act
-+ self.initializer_range = initializer_range
-+ self.rms_norm_eps = rms_norm_eps
-+ self.use_cache = use_cache
-+ self.rope_theta = rope_theta
-+ self.rope_scaling = rope_scaling
-+ self.attention_bias = attention_bias
-+ self.attention_dropout = attention_dropout
-+
-+ super().__init__(
-+ pad_token_id=pad_token_id,
-+ bos_token_id=bos_token_id,
-+ eos_token_id=eos_token_id,
-+ tie_word_embeddings=tie_word_embeddings,
-+ **kwargs,
-+ )
-+
-+class GoldenGateFastRMSNorm(FastRMSNorm):
-+ @classmethod
-+ def load(cls, prefix, weights, eps=1e-6):
-+ weight = weights.get_tensor(f"{prefix}.weight") + 1
-+ return cls(weight, eps)
-+
-+
-+def load_attention(config, prefix, weights):
-+ if config.num_attention_heads != config.num_key_value_heads:
-+ return _load_gqa(config, prefix, weights)
-+ else:
-+ return TensorParallelColumnLinear.load_multi(
-+ config,
-+ prefixes=[f"{prefix}.q_proj", f"{prefix}.k_proj", f"{prefix}.v_proj"],
-+ dim=0,
-+ weights=weights,
-+ bias=False,
-+ )
-+
-+
-+def _load_gqa(config, prefix: str, weights):
-+ assert config.num_attention_heads % weights.process_group.size() == 0
-+
-+ weight = weights.get_multi_weights_col(
-+ prefixes=[f"{prefix}.q_proj", f"{prefix}.k_proj", f"{prefix}.v_proj"],
-+ quantize=config.quantize,
-+ dim=0,
-+ )
-+
-+ if config.quantize not in ["gptq", "awq"]:
-+ weight = weight.to(dtype=weights.dtype).to(device=weights.device)
-+
-+ head_size = config.head_dim
-+ num_heads = config.num_attention_heads // weights.process_group.size()
-+ num_key_value_heads = config.num_key_value_heads // weights.process_group.size()
-+ assert list(weight.shape) == [
-+ (num_heads + 2 * num_key_value_heads) * head_size,
-+ config.hidden_size,
-+ ], f"{list(weight.shape)} != {[(num_heads + 2 * config.num_key_value_heads) * head_size, config.hidden_size]}"
-+
-+ return TensorParallelColumnLinear(
-+ get_linear(weight, bias=None, quantize=config.quantize)
-+ )
-+
-+
-+class FlashGoldenGateAttention(torch.nn.Module):
-+ def __init__(
-+ self,
-+ prefix: str,
-+ config,
-+ weights,
-+ ):
-+ super().__init__()
-+ self.num_heads = config.num_attention_heads
-+ self.head_size = config.head_dim
-+
-+ self.rotary_emb = PositionRotaryEmbedding.static(
-+ config=config,
-+ dim=self.head_size,
-+ base=config.rope_theta,
-+ device=weights.device,
-+ )
-+
-+ self.softmax_scale = self.head_size**-0.5
-+
-+ if self.num_heads % weights.process_group.size() != 0:
-+ raise ValueError(
-+ f"`num_heads` must be divisible by `num_shards` (got `num_heads`: {self.num_heads} "
-+ f"and `num_shards`: {weights.process_group.size()}"
-+ )
-+ self.num_heads = self.num_heads // weights.process_group.size()
-+ self.num_key_value_heads = (
-+ config.num_key_value_heads // weights.process_group.size()
-+ )
-+
-+ self.query_key_value = load_attention(config, prefix, weights)
-+
-+ self.o_proj = TensorParallelRowLinear.load(
-+ config,
-+ prefix=f"{prefix}.o_proj",
-+ weights=weights,
-+ bias=False,
-+ )
-+ self.num_groups = self.num_heads // self.num_key_value_heads
-+ self.kv_head_mapping = torch.arange(
-+ 0, self.num_key_value_heads, dtype=torch.int32, device=weights.device
-+ ).repeat_interleave(self.num_groups)
-+
-+ def forward(
-+ self,
-+ hidden_states,
-+ cos,
-+ sin,
-+ cu_seqlen_prefill,
-+ kv_cache,
-+ block_tables,
-+ slots,
-+ input_lengths,
-+ max_s,
-+ ):
-+ qkv = self.query_key_value(hidden_states)
-+ query, kv = qkv.split(
-+ [
-+ self.head_size * self.num_heads,
-+ 2 * self.head_size * self.num_key_value_heads,
-+ ],
-+ dim=1,
-+ )
-+ query = query.view(-1, self.num_heads, self.head_size)
-+ kv = kv.view(-1, 2, self.num_key_value_heads, self.head_size)
-+
-+ self.rotary_emb(query, torch.select(kv, dim=1, index=0), cos, sin)
-+
-+ paged_attention.reshape_and_cache(
-+ kv[:, 0], kv[:, 1], kv_cache[0], kv_cache[1], slots
-+ )
-+
-+ # output tensor
-+ attn_output = torch.empty_like(query)
-+
-+ # Prefill
-+ if cu_seqlen_prefill is not None:
-+ # flash attention
-+ flash_attn.attention(
-+ query,
-+ torch.select(kv, dim=1, index=0),
-+ torch.select(kv, dim=1, index=1),
-+ attn_output,
-+ cu_seqlen_prefill,
-+ max_s,
-+ self.softmax_scale,
-+ )
-+ # Decode
-+ else:
-+ paged_attention.attention(
-+ attn_output,
-+ query,
-+ kv_cache[0],
-+ kv_cache[1],
-+ self.kv_head_mapping,
-+ self.softmax_scale,
-+ block_tables,
-+ input_lengths,
-+ max_s,
-+ )
-+
-+ return self.o_proj(attn_output.view(-1, self.num_heads * self.head_size))
-+
-+
-+class GoldenGateMLP(nn.Module):
-+ def __init__(self, prefix, config, weights):
-+ super().__init__()
-+ act = config.hidden_act
-+ self.act = (
-+ ACT2FN[act]
-+ if "gelu" not in act
-+ else lambda x: torch.nn.functional.gelu(
-+ x,
-+ approximate="tanh"
-+ if act in ["gelu_fast", "gelu_pytorch_tanh"]
-+ else "none",
-+ )
-+ )
-+ # Fuse gate and up proj
-+ self.gate_up_proj = TensorParallelColumnLinear.load_multi(
-+ config,
-+ prefixes=[f"{prefix}.gate_proj", f"{prefix}.up_proj"],
-+ weights=weights,
-+ dim=0,
-+ bias=False,
-+ )
-+ self.down_proj = TensorParallelRowLinear.load(
-+ config,
-+ prefix=f"{prefix}.down_proj",
-+ weights=weights,
-+ bias=False,
-+ )
-+ self.intermediate_size = (
-+ config.intermediate_size // weights.process_group.size()
-+ )
-+
-+ def forward(self, hidden_states):
-+ gate_up_states = self.gate_up_proj(hidden_states)
-+ gate_up_states = gate_up_states.view(-1, 2, self.intermediate_size)
-+ return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
-+
-+
-+class FlashGoldenGateLayer(nn.Module):
-+ def __init__(self, layer_id, config, weights):
-+ super().__init__()
-+ prefix = f"model.layers.{layer_id}"
-+ self.self_attn = FlashGoldenGateAttention(
-+ prefix=f"{prefix}.self_attn", config=config, weights=weights
-+ )
-+ self.mlp = GoldenGateMLP(prefix=f"{prefix}.mlp", config=config, weights=weights)
-+
-+ self.input_layernorm = GoldenGateFastRMSNorm.load(
-+ prefix=f"{prefix}.input_layernorm", weights=weights, eps=config.rms_norm_eps
-+ )
-+ self.post_attention_layernorm = GoldenGateFastRMSNorm.load(
-+ prefix=f"{prefix}.post_attention_layernorm",
-+ weights=weights,
-+ eps=config.rms_norm_eps,
-+ )
-+
-+ def forward(
-+ self,
-+ hidden_states,
-+ residual,
-+ cos,
-+ sin,
-+ cu_seqlen_prefill,
-+ kv_cache,
-+ block_tables,
-+ slots,
-+ input_lengths,
-+ max_s,
-+ ):
-+ normed_hidden_states, res = self.input_layernorm(hidden_states, residual)
-+
-+ # Self Attention
-+ attn_output = self.self_attn(
-+ normed_hidden_states,
-+ cos,
-+ sin,
-+ cu_seqlen_prefill,
-+ kv_cache,
-+ block_tables,
-+ slots,
-+ input_lengths,
-+ max_s,
-+ )
-+
-+ # faster post attention rms norm
-+ normed_attn_res_output, attn_res = self.post_attention_layernorm(
-+ attn_output, res
-+ )
-+
-+ mlp_output = self.mlp(normed_attn_res_output)
-+
-+ return mlp_output, attn_res
-+
-+
-+class FlashGoldenGateModel(torch.nn.Module):
-+ def __init__(self, config, weights):
-+ super().__init__()
-+
-+ process_group = weights.process_group
-+ self.tp_rank = process_group.rank()
-+ self.tp_world_size = process_group.size()
-+ embed_norm = config.hidden_size ** 0.5
-+ self.embed_tokens = TensorParallelEmbedding(
-+ prefix="model.embed_tokens", weights=weights
-+ )
-+ self.embed_tokens.weight *= embed_norm
-+
-+ self.layers = nn.ModuleList(
-+ [
-+ FlashGoldenGateLayer(
-+ layer_id,
-+ config,
-+ weights,
-+ )
-+ for layer_id in range(config.num_hidden_layers)
-+ ]
-+ )
-+ self.norm = GoldenGateFastRMSNorm.load(
-+ prefix="model.norm", weights=weights, eps=config.rms_norm_eps
-+ )
-+
-+ self.gradient_checkpointing = False
-+
-+ self.head_size = self.layers[0].self_attn.head_size
-+ self.num_heads = self.layers[0].self_attn.num_heads
-+ self.num_key_value_heads = self.layers[0].self_attn.num_key_value_heads
-+
-+ def forward(
-+ self,
-+ input_ids: torch.Tensor,
-+ position_ids: torch.Tensor,
-+ cu_seqlen_prefill: Optional[torch.Tensor],
-+ kv_cache: List[Tuple[torch.Tensor, torch.Tensor]],
-+ block_tables: torch.Tensor,
-+ slots: torch.Tensor,
-+ input_lengths: torch.Tensor,
-+ max_s: int,
-+ ) -> torch.Tensor:
-+ hidden_states = self.embed_tokens(input_ids)
-+
-+ # Get rotary cos and sin for this forward
-+ # Avoid to index in each layer
-+ cos, sin = self.layers[0].self_attn.rotary_emb.get_cos_sin(
-+ position_ids, max_s, hidden_states.dtype
-+ )
-+
-+ residual = None
-+ for i, layer in enumerate(self.layers):
-+ hidden_states, residual = layer(
-+ hidden_states,
-+ residual,
-+ cos,
-+ sin,
-+ cu_seqlen_prefill,
-+ kv_cache[i],
-+ block_tables,
-+ slots,
-+ input_lengths,
-+ max_s,
-+ )
-+
-+ hidden_states, _ = self.norm(hidden_states, residual)
-+
-+ return hidden_states
-+
-+
-+class FlashGoldenGateForCausalLM(torch.nn.Module):
-+ def __init__(self, config, weights):
-+ super().__init__()
-+
-+ self.model = FlashGoldenGateModel(config, weights)
-+ self.lm_head = TensorParallelHead.load(
-+ config,
-+ prefix="model.embed_tokens" if config.tie_word_embeddings else "lm_head",
-+ weights=weights,
-+ )
-+
-+ def forward(
-+ self,
-+ input_ids: torch.Tensor,
-+ position_ids: torch.Tensor,
-+ cu_seqlen_prefill: Optional[torch.Tensor],
-+ kv_cache: List[Tuple[torch.Tensor, torch.Tensor]],
-+ block_tables: torch.Tensor,
-+ slots: torch.Tensor,
-+ input_lengths: torch.Tensor,
-+ max_s: int,
-+ lm_head_indices: Optional[torch.Tensor] = None,
-+ ) -> torch.Tensor:
-+ hidden_states = self.model(
-+ input_ids,
-+ position_ids,
-+ cu_seqlen_prefill,
-+ kv_cache,
-+ block_tables,
-+ slots,
-+ input_lengths,
-+ max_s,
-+ )
-+ if lm_head_indices is not None:
-+ hidden_states = hidden_states[lm_head_indices]
-+ logits = self.lm_head(hidden_states)
-+ return logits
-diff --git a/server/text_generation_server/models/custom_modeling/temp_tok.py b/server/text_generation_server/models/custom_modeling/temp_tok.py
-new file mode 100644
-index 0000000..06516cb
---- /dev/null
-+++ b/server/text_generation_server/models/custom_modeling/temp_tok.py
-@@ -0,0 +1,216 @@
-+# coding=utf-8
-+# Copyright 2020 The HuggingFace Inc. team.
-+#
-+# Licensed under the Apache License, Version 2.0 (the "License");
-+# you may not use this file except in compliance with the License.
-+# You may obtain a copy of the License at
-+#
-+# http://www.apache.org/licenses/LICENSE-2.0
-+#
-+# Unless required by applicable law or agreed to in writing, software
-+# distributed under the License is distributed on an "AS IS" BASIS,
-+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-+# See the License for the specific language governing permissions and
-+# limitations under the License.
-+import os
-+from shutil import copyfile
-+from typing import Optional, Tuple
-+
-+from tokenizers import processors
-+
-+from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
-+from transformers.utils import logging
-+from transformers.utils.versions import require_version
-+
-+
-+require_version("tokenizers>=0.13.3")
-+
-+GoldenGateTokenizer = None
-+
-+logger = logging.get_logger(__name__)
-+VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model", "tokenizer_file": "tokenizer.json"}
-+
-+PRETRAINED_VOCAB_FILES_MAP = {
-+ "vocab_file": {
-+ "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model",
-+ },
-+ "tokenizer_file": {
-+ "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer_config.json",
-+ },
-+}
-+B_INST, E_INST = "[INST]", "[/INST]"
-+B_SYS, E_SYS = "<>\n", "\n<>\n\n"
-+
-+# fmt: off
-+DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your \
-+answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure\
-+ that your responses are socially unbiased and positive in nature.
-+If a question does not make any sense, or is not factually coherent, explain why instead of answering something not \
-+correct. If you don't know the answer to a question, please don't share false information."""
-+# fmt: on
-+
-+
-+class GoldenGateTokenizerFast(PreTrainedTokenizerFast):
-+ """
-+ Construct a GoldenGate tokenizer. Based on byte-level Byte-Pair-Encoding.
-+ This uses notably ByteFallback and no normalization.
-+ ```python
-+ >>> from transformers import GoldenGateTokenizerFast
-+ >>> tokenizer = GoldenGateTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
-+ >>> tokenizer.encode("Hello this is a test")
-+ [1, 15043, 445, 338, 263, 1243]
-+ ```
-+ If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the model, or
-+ call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the
-+ values of the first token and final token of an encoded sequence will not be correct). For more details, checkout
-+ [post-processors] (https://huggingface.co/docs/tokenizers/api/post-processors) documentation.
-+ This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
-+ refer to this superclass for more information regarding those methods.
-+ Args:
-+ vocab_file (`str`, *optional*):
-+ [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that
-+ contains the vocabulary necessary to instantiate a tokenizer.
-+ tokenizer_file (`str`, *optional*):
-+ [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
-+ contains everything needed to load the tokenizer.
-+ clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
-+ Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like
-+ extra spaces.
-+ unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `""`):
-+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
-+ token instead.
-+ bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `""`):
-+ The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
-+ eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `""`):
-+ The end of sequence token.
-+ add_bos_token (`bool`, *optional*, defaults to `True`):
-+ Whether or not to add an `bos_token` at the start of sequences.
-+ add_eos_token (`bool`, *optional*, defaults to `False`):
-+ Whether or not to add an `eos_token` at the end of sequences.
-+ use_default_system_prompt (`bool`, *optional*, defaults to `False`):
-+ Whether or not the default system prompt for GoldenGate should be used.
-+ """
-+
-+ vocab_files_names = VOCAB_FILES_NAMES
-+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
-+ slow_tokenizer_class = GoldenGateTokenizer
-+ padding_side = "left"
-+ model_input_names = ["input_ids", "attention_mask"]
-+
-+ def __init__(
-+ self,
-+ vocab_file=None,
-+ tokenizer_file=None,
-+ clean_up_tokenization_spaces=False,
-+ unk_token="",
-+ bos_token="",
-+ eos_token="",
-+ pad_token="",
-+ add_bos_token=True,
-+ add_eos_token=False,
-+ use_default_system_prompt=False,
-+ **kwargs,
-+ ):
-+ super().__init__(
-+ vocab_file=vocab_file,
-+ tokenizer_file=tokenizer_file,
-+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
-+ unk_token=unk_token,
-+ bos_token=bos_token,
-+ eos_token=eos_token,
-+ pad_token=pad_token,
-+ add_bos_token=add_bos_token,
-+ add_eos_token=add_eos_token,
-+ use_default_system_prompt=use_default_system_prompt,
-+ **kwargs,
-+ )
-+ self._add_bos_token = add_bos_token
-+ self._add_eos_token = add_eos_token
-+ self.update_post_processor()
-+ self.use_default_system_prompt = use_default_system_prompt
-+ self.vocab_file = vocab_file
-+
-+ @property
-+ def can_save_slow_tokenizer(self) -> bool:
-+ return os.path.isfile(self.vocab_file) if self.vocab_file else False
-+
-+ def update_post_processor(self):
-+ """
-+ Updates the underlying post processor with the current `bos_token` and `eos_token`.
-+ """
-+ bos = self.bos_token
-+ bos_token_id = self.bos_token_id
-+ if bos is None and self.add_bos_token:
-+ raise ValueError("add_bos_token = True but bos_token = None")
-+
-+ eos = self.eos_token
-+ eos_token_id = self.eos_token_id
-+ if eos is None and self.add_eos_token:
-+ raise ValueError("add_eos_token = True but eos_token = None")
-+
-+ single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
-+ pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}"
-+
-+ special_tokens = []
-+ if self.add_bos_token:
-+ special_tokens.append((bos, bos_token_id))
-+ if self.add_eos_token:
-+ special_tokens.append((eos, eos_token_id))
-+ self._tokenizer.post_processor = processors.TemplateProcessing(
-+ single=single, pair=pair, special_tokens=special_tokens
-+ )
-+
-+ @property
-+ def add_eos_token(self):
-+ return self._add_eos_token
-+
-+ @property
-+ def add_bos_token(self):
-+ return self._add_bos_token
-+
-+ @add_eos_token.setter
-+ def add_eos_token(self, value):
-+ self._add_eos_token = value
-+ self.update_post_processor()
-+
-+ @add_bos_token.setter
-+ def add_bos_token(self, value):
-+ self._add_bos_token = value
-+ self.update_post_processor()
-+
-+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
-+ if not self.can_save_slow_tokenizer:
-+ raise ValueError(
-+ "Your fast tokenizer does not have the necessary information to save the vocabulary for a slow "
-+ "tokenizer."
-+ )
-+
-+ if not os.path.isdir(save_directory):
-+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
-+ return
-+ out_vocab_file = os.path.join(
-+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
-+ )
-+
-+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
-+ copyfile(self.vocab_file, out_vocab_file)
-+
-+ return (out_vocab_file,)
-+
-+ @property
-+ # Copied from transformers.models.llama.tokenization_llama.GoldenGateTokenizer.default_chat_template
-+ def default_chat_template(self):
-+ raise NotImplementedError
-+
-+ # TODO ArthurZ let's rely on the template processor instead, refactor all fast tokenizers
-+ # Copied from transformers.models.llama.tokenization_llama.GoldenGateTokenizer.build_inputs_with_special_tokens
-+ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
-+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
-+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
-+
-+ output = bos_token_id + token_ids_0 + eos_token_id
-+
-+ if token_ids_1 is not None:
-+ output = output + bos_token_id + token_ids_1 + eos_token_id
-+
-+ return output
-\ No newline at end of file
-diff --git a/server/text_generation_server/models/flash_golden_gate.py b/server/text_generation_server/models/flash_golden_gate.py
-new file mode 100644
-index 0000000..ae5940d
---- /dev/null
-+++ b/server/text_generation_server/models/flash_golden_gate.py
-@@ -0,0 +1,105 @@
-+import torch
-+import torch.distributed
-+
-+from opentelemetry import trace
-+from typing import Optional
-+from transformers import AutoTokenizer
-+
-+from text_generation_server.models import FlashCausalLM
-+from text_generation_server.models.custom_modeling.flash_golden_gate_modeling import (
-+ FlashGoldenGateForCausalLM,
-+ GoldenGateConfig,
-+)
-+from text_generation_server.utils import (
-+ initialize_torch_distributed,
-+ weight_files,
-+ Weights,
-+)
-+
-+tracer = trace.get_tracer(__name__)
-+
-+
-+class FlashGoldenGate(FlashCausalLM):
-+ def __init__(
-+ self,
-+ model_id: str,
-+ revision: Optional[str] = None,
-+ quantize: Optional[str] = None,
-+ dtype: Optional[torch.dtype] = None,
-+ trust_remote_code: bool = False,
-+ use_medusa: Optional[str] = None,
-+ ):
-+ self.process_group, rank, world_size = initialize_torch_distributed()
-+ if torch.cuda.is_available():
-+ device = torch.device(f"cuda:{rank}")
-+ dtype = torch.float16 if dtype is None else dtype
-+ else:
-+ raise NotImplementedError("FlashGoldenGate is only available on GPU")
-+
-+ from text_generation_server.models.custom_modeling.temp_tok import GoldenGateTokenizerFast
-+ tokenizer = GoldenGateTokenizerFast.from_pretrained(
-+ model_id,
-+ revision=revision,
-+ padding_side="left",
-+ truncation_side="left",
-+ trust_remote_code=trust_remote_code,
-+ use_fast=True,
-+ from_slow=False,
-+ )
-+
-+ config = GoldenGateConfig.from_pretrained(
-+ model_id, revision=revision, trust_remote_code=trust_remote_code
-+ )
-+ config.quantize = quantize
-+
-+ torch.distributed.barrier(group=self.process_group)
-+
-+ filenames = weight_files(model_id, revision=revision, extension=".safetensors")
-+ weights = Weights(filenames, device, dtype, process_group=self.process_group)
-+ if config.quantize in ["gptq", "awq"]:
-+ weights._set_gptq_params(model_id, revision)
-+
-+ model = FlashGoldenGateForCausalLM(config, weights)
-+ if use_medusa:
-+ from text_generation_server.utils.medusa import MedusaModel
-+ from huggingface_hub import hf_hub_download
-+ import json
-+ import os
-+ from pathlib import Path
-+
-+ is_local_model = (Path(use_medusa).exists() and Path(use_medusa).is_dir()) or os.getenv(
-+ "WEIGHTS_CACHE_OVERRIDE", None
-+ ) is not None
-+
-+ if not is_local_model:
-+ medusa_config = hf_hub_download(
-+ use_medusa, revision=revision, filename="config.json"
-+ )
-+ medusa_head = hf_hub_download(
-+ use_medusa, revision=revision, filename="medusa_lm_head.pt"
-+ )
-+ else:
-+ medusa_config = str(Path(use_medusa) / "config.json")
-+ medusa_head = str(Path(use_medusa) / "medusa_lm_head.pt")
-+
-+ with open(medusa_config, "r") as f:
-+ config = json.load(f)
-+ medusa_sf = medusa_head[: -len(".pt")] + ".safetensors"
-+ weights = Weights(
-+ [medusa_sf], device, dtype, process_group=self.process_group
-+ )
-+ lm_head = model.lm_head
-+ model.lm_head = MedusaModel(config, weights, lm_head)
-+
-+ torch.distributed.barrier(group=self.process_group)
-+ super(FlashGoldenGate, self).__init__(
-+ model=model,
-+ tokenizer=tokenizer,
-+ num_layers=len(model.model.layers),
-+ num_kv_heads=model.model.num_key_value_heads,
-+ head_size=model.model.head_size,
-+ dtype=dtype,
-+ device=device,
-+ rank=rank,
-+ world_size=world_size,
-+ )
diff --git a/containers/tgi/gpu/1.3.4/.gitkeep b/containers/tgi/gpu/1.4.2/.gitkeep
similarity index 100%
rename from containers/tgi/gpu/1.3.4/.gitkeep
rename to containers/tgi/gpu/1.4.2/.gitkeep
diff --git a/containers/tgi/gpu/1.3.4/Dockerfile b/containers/tgi/gpu/1.4.2/Dockerfile
similarity index 90%
rename from containers/tgi/gpu/1.3.4/Dockerfile
rename to containers/tgi/gpu/1.4.2/Dockerfile
index ce7f118f..04ae34c9 100644
--- a/containers/tgi/gpu/1.3.4/Dockerfile
+++ b/containers/tgi/gpu/1.4.2/Dockerfile
@@ -1,21 +1,12 @@
# Fetch and extract the TGI sources
FROM alpine AS tgi
RUN mkdir -p /tgi
-# TODO: add when support released
-# ADD https://github.com/huggingface/text-generation-inference/archive/refs/tags/v1.3.4.tar.gz /tgi/sources.tar.gz
-# RUN tar -C /tgi -xf /tgi/sources.tar.gz --strip-components=1
-
-# TODO: remove below
-RUN apk update && apk upgrade && apk add git patch
-RUN git clone -b support-vertex-endpoint https://github.com/huggingface/text-generation-inference.git /tgi
-# Apply Golden Gate patch
-COPY containers/tgi/gpu/1.3.4/support-golden-gate.patch /tgi
-RUN patch -p1 -d /tgi < /tgi/support-golden-gate.patch
-############# END TODO ##############
+ADD https://github.com/huggingface/text-generation-inference/archive/refs/tags/v1.4.2.tar.gz /tgi/sources.tar.gz
+RUN tar -C /tgi -xf /tgi/sources.tar.gz --strip-components=1
# Build cargo components (adapted from TGI original Dockerfile)
# Note that the build image is aligned on the same Linux version as the base image (Debian bookworm/ Ubuntu 22.04)
-FROM lukemathwalker/cargo-chef:latest-rust-1.71-bookworm AS chef
+FROM lukemathwalker/cargo-chef:latest-rust-1.75-bookworm AS chef
WORKDIR /usr/src
ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse
@@ -54,7 +45,7 @@ RUN cargo build --release --features google
# Python builder
# Adapted from: https://github.com/pytorch/pytorch/blob/master/Dockerfile
-FROM nvidia/cuda:12.1.0-devel-ubuntu20.04 as pytorch-install
+FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 as pytorch-install
ARG PYTORCH_VERSION=2.1.1
ARG PYTHON_VERSION=3.10
@@ -99,7 +90,7 @@ RUN case ${TARGETPLATFORM} in \
# CUDA kernels builder image
FROM pytorch-install as kernel-builder
-ARG MAX_JOBS=8
+ARG MAX_JOBS=4
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
ninja-build \
@@ -171,6 +162,12 @@ COPY --from=tgi /tgi/server/Makefile-vllm Makefile
# Build specific version of vllm
RUN make build-vllm-cuda
+# Build mamba kernels
+FROM kernel-builder as mamba-builder
+WORKDIR /usr/src
+COPY --from=tgi /tgi/server/Makefile-selective-scan Makefile
+RUN make build-all
+
# Build megablocks
FROM kernel-builder as megablocks-builder
@@ -222,6 +219,10 @@ COPY --from=eetq-kernels-builder /usr/src/eetq/build/lib.linux-x86_64-cpython-31
# Copy builds artifacts from vllm builder
COPY --from=vllm-builder /usr/src/vllm/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+# Copy build artifacts from mamba builder
+COPY --from=mamba-builder /usr/src/mamba/build/lib.linux-x86_64-cpython-310/ /opt/conda/lib/python3.10/site-packages
+COPY --from=mamba-builder /usr/src/causal-conv1d/build/lib.linux-x86_64-cpython-310/ /opt/conda/lib/python3.10/site-packages
+
# Install flash-attention dependencies
RUN pip install einops --no-cache-dir
@@ -232,7 +233,7 @@ COPY --from=tgi /tgi/server/Makefile server/Makefile
RUN cd server && \
make gen-server && \
pip install -r requirements_cuda.txt && \
- pip install ".[bnb, accelerate, quantize, peft]" --no-cache-dir
+ pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir
# Install benchmarker
COPY --from=builder /usr/src/target/release/text-generation-benchmark /usr/local/bin/text-generation-benchmark
diff --git a/examples/gke/configs/deployment.yaml b/examples/gke/configs/deployment.yaml
index 84f2ffd6..a668a39f 100644
--- a/examples/gke/configs/deployment.yaml
+++ b/examples/gke/configs/deployment.yaml
@@ -14,7 +14,7 @@ spec:
spec:
containers:
- name: llm
- image: ghcr.io/huggingface/text-generation-inference:1.3.4
+ image: ghcr.io/huggingface/text-generation-inference:1.4.2
resources:
limits:
nvidia.com/gpu: "1"
diff --git a/examples/vertex-ai/notebooks/deploy-golden-gate-on-vertex-ai.ipynb b/examples/vertex-ai/notebooks/deploy-gemma-on-vertex-ai.ipynb
similarity index 58%
rename from examples/vertex-ai/notebooks/deploy-golden-gate-on-vertex-ai.ipynb
rename to examples/vertex-ai/notebooks/deploy-gemma-on-vertex-ai.ipynb
index b202f230..bf72c884 100644
--- a/examples/vertex-ai/notebooks/deploy-golden-gate-on-vertex-ai.ipynb
+++ b/examples/vertex-ai/notebooks/deploy-gemma-on-vertex-ai.ipynb
@@ -4,11 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Deploy Golden Gate 7B on Vertex AI \n",
+ "# Deploy Gemma 7B on Vertex AI \n",
"\n",
- "This tutorial demonstrates how to deploy Golden Gate to Vertex AI using Hugging Face Text Generation Inference.\n",
- "\n",
- "_Note: Make sure you build the container with the `patch` for the Golden Gate models._"
+ "This tutorial demonstrates how to deploy Gemma to Vertex AI using Hugging Face Text Generation Inference."
]
},
{
@@ -24,9 +22,19 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[33mWARNING: Ignoring invalid distribution -oogle-auth (/opt/conda/lib/python3.7/site-packages)\u001b[0m\u001b[33m\n",
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -oogle-auth (/opt/conda/lib/python3.7/site-packages)\u001b[0m\u001b[33m\n",
+ "\u001b[0m"
+ ]
+ }
+ ],
"source": [
"! pip install --upgrade --quiet google-cloud-aiplatform google-cloud-storage \"google-auth>=2.23.3\""
]
@@ -59,11 +67,10 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
- "# PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n",
"PROJECT_ID = \"huggingface-ml\" # @param {type:\"string\"}\n",
"REGION = \"us-central1\" # @param {type: \"string\"}\n",
"BUCKET_URI = f\"gs://vertexai-{PROJECT_ID}-tgi\" # @param {type:\"string\"}\n",
@@ -85,12 +92,12 @@
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# set model names and version\n",
- "MODEL_NAME = \"Golden-Gate-7b\" # @param {type:\"string\"}\n",
+ "MODEL_NAME = \"Gemma-7b\" # @param {type:\"string\"}\n",
"MODEL_VERSION = \"v01\" # @param {type: \"string\"}\n",
"MODEL_DISPLAY_NAME = f\"TGI-{MODEL_NAME}-{MODEL_VERSION}\" # @param {type:\"string\"}\n",
"ENDPOINT_DISPLAY_NAME = f\"endpoint-{MODEL_NAME}-{MODEL_VERSION}\" # @param {type:\"string\"}\n",
@@ -110,7 +117,7 @@
},
{
"cell_type": "code",
- "execution_count": 4,
+ "execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -135,7 +142,7 @@
},
{
"cell_type": "code",
- "execution_count": 15,
+ "execution_count": 4,
"metadata": {},
"outputs": [
{
@@ -143,12 +150,12 @@
"output_type": "stream",
"text": [
"Creating Model\n",
- "Create Model backing LRO: projects/1049843053967/locations/us-central1/models/4312862947253682176/operations/393893374861508608\n",
- "Model created. Resource name: projects/1049843053967/locations/us-central1/models/4312862947253682176@1\n",
+ "Create Model backing LRO: projects/755607090520/locations/us-central1/models/3427166748661514240/operations/8775985187918970880\n",
+ "Model created. Resource name: projects/755607090520/locations/us-central1/models/3427166748661514240@1\n",
"To use this Model in another session:\n",
- "model = aiplatform.Model('projects/1049843053967/locations/us-central1/models/4312862947253682176@1')\n",
- "TGI-Golden-Gate-7b-v01\n",
- "projects/1049843053967/locations/us-central1/models/4312862947253682176\n"
+ "model = aiplatform.Model('projects/755607090520/locations/us-central1/models/3427166748661514240@1')\n",
+ "TGI-Gemma-7b-v01\n",
+ "projects/755607090520/locations/us-central1/models/3427166748661514240\n"
]
}
],
@@ -157,12 +164,12 @@
" display_name=MODEL_DISPLAY_NAME,\n",
" serving_container_image_uri=SERVING_CONTAINER_IMAGE_URI,\n",
" serving_container_environment_variables={\n",
- " \"MODEL_ID\": \"gg-hf/golden-gate-7b\",\n",
+ " \"MODEL_ID\": \"google/gemma-7b\",\n",
" \"NUM_SHARD\": \"1\",\n",
" \"MAX_INPUT_LENGTH\": \"512\",\n",
" \"MAX_TOTAL_TOKENS\": \"1024\",\n",
" \"MAX_BATCH_PREFILL_TOKENS\": \"1512\",\n",
- " \"HUGGING_FACE_HUB_TOKEN\": \"TOKEN WITH ACCESS TO THE PRIVATE REPO\",\n",
+ " \"HUGGING_FACE_HUB_TOKEN\": \"TOKEN WITH ACCESS TO Gemma\",\n",
" },\n",
" serving_container_ports=[80],\n",
")\n",
@@ -183,7 +190,7 @@
},
{
"cell_type": "code",
- "execution_count": 16,
+ "execution_count": 5,
"metadata": {},
"outputs": [
{
@@ -191,13 +198,13 @@
"output_type": "stream",
"text": [
"Creating Endpoint\n",
- "Create Endpoint backing LRO: projects/1049843053967/locations/us-central1/endpoints/60211455760269312/operations/1744973263072657408\n",
- "Endpoint created. Resource name: projects/1049843053967/locations/us-central1/endpoints/60211455760269312\n",
+ "Create Endpoint backing LRO: projects/755607090520/locations/us-central1/endpoints/3306444769978220544/operations/4289274059151114240\n",
+ "Endpoint created. Resource name: projects/755607090520/locations/us-central1/endpoints/3306444769978220544\n",
"To use this Endpoint in another session:\n",
- "endpoint = aiplatform.Endpoint('projects/1049843053967/locations/us-central1/endpoints/60211455760269312')\n",
- "Deploying model to Endpoint : projects/1049843053967/locations/us-central1/endpoints/60211455760269312\n",
- "Deploy Endpoint model backing LRO: projects/1049843053967/locations/us-central1/endpoints/60211455760269312/operations/7087368321040908288\n",
- "Endpoint model deployed. Resource name: projects/1049843053967/locations/us-central1/endpoints/60211455760269312\n"
+ "endpoint = aiplatform.Endpoint('projects/755607090520/locations/us-central1/endpoints/3306444769978220544')\n",
+ "Deploying model to Endpoint : projects/755607090520/locations/us-central1/endpoints/3306444769978220544\n",
+ "Deploy Endpoint model backing LRO: projects/755607090520/locations/us-central1/endpoints/3306444769978220544/operations/4928785206237724672\n",
+ "Endpoint model deployed. Resource name: projects/755607090520/locations/us-central1/endpoints/3306444769978220544\n"
]
}
],
@@ -219,18 +226,14 @@
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Deep Learning is the upcoming technology trend that is widening its stakes in the mobile app industry and which might transform the way things are done. With significant advancements in the deep learning techniques, mobile apps have also started using deep learning techniques in order to make themselves better and more efficient than ever. The use of deep learning is a hot topic now and after witnessing the way people are passionate enough to make deep learning techniques for the smartphones, these techniques have become the most anticipated technological breakthrough today.\n",
- "\n",
- "The novel research have marked the excellence of deep learning in the field of mobile analytics where the insights provided by the trained networks has boosted the productivity of the apps and also helped the app developers in providing a better insight about the app usage . It helps the app developers and marketers to understand the various application enhancements, issues, and users needs better. Nowadays, deep learning algorithms are being used in answering real time complex questions based on the specific machine settings and running the application in an efficient manner.\n",
- "\n",
- "Deep learning algorithms help the developers to understand the distance from the cameras camera, the ambient temperature during the biometric authentication, reduce the error rate in text messaging by locking down mobile messaging. The deep learning doesn’t just predicts a simple, granular data for the clients, it also predicts the actionable advancement. Let us\n"
+ "Deep Learning is attracting much attention recently. It is a chance that it will be incorporated in a navigation system. We are concerned that already deep learning is included in things like smart phones. Ask about various views to give an explanation in natural language comprehensible to the user. It is envisioned that the map creation system is able to collect information on the site using cameras from vehicles.\n"
]
}
],
@@ -254,22 +257,22 @@
},
{
"cell_type": "code",
- "execution_count": 19,
+ "execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Undeploying Endpoint model: projects/1049843053967/locations/us-central1/endpoints/60211455760269312\n",
- "Undeploy Endpoint model backing LRO: projects/1049843053967/locations/us-central1/endpoints/60211455760269312/operations/3646618205729849344\n",
- "Endpoint model undeployed. Resource name: projects/1049843053967/locations/us-central1/endpoints/60211455760269312\n",
- "Deleting Endpoint : projects/1049843053967/locations/us-central1/endpoints/60211455760269312\n",
- "Delete Endpoint backing LRO: projects/1049843053967/locations/us-central1/operations/5055118989189971968\n",
- "Endpoint deleted. . Resource name: projects/1049843053967/locations/us-central1/endpoints/60211455760269312\n",
- "Deleting Model : projects/1049843053967/locations/us-central1/models/4312862947253682176\n",
- "Delete Model backing LRO: projects/1049843053967/locations/us-central1/models/4312862947253682176/operations/8450833108227325952\n",
- "Model deleted. . Resource name: projects/1049843053967/locations/us-central1/models/4312862947253682176\n"
+ "Undeploying Endpoint model: projects/755607090520/locations/us-central1/endpoints/3306444769978220544\n",
+ "Undeploy Endpoint model backing LRO: projects/755607090520/locations/us-central1/endpoints/3306444769978220544/operations/2260402427020705792\n",
+ "Endpoint model undeployed. Resource name: projects/755607090520/locations/us-central1/endpoints/3306444769978220544\n",
+ "Deleting Endpoint : projects/755607090520/locations/us-central1/endpoints/3306444769978220544\n",
+ "Delete Endpoint backing LRO: projects/755607090520/locations/us-central1/operations/4050583278900477952\n",
+ "Endpoint deleted. . Resource name: projects/755607090520/locations/us-central1/endpoints/3306444769978220544\n",
+ "Deleting Model : projects/755607090520/locations/us-central1/models/3427166748661514240\n",
+ "Delete Model backing LRO: projects/755607090520/locations/us-central1/models/3427166748661514240/operations/6872088445448093696\n",
+ "Model deleted. . Resource name: projects/755607090520/locations/us-central1/models/3427166748661514240\n"
]
}
],
@@ -278,18 +281,11 @@
"deployed_model.delete()\n",
"model.delete()"
]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
}
],
"metadata": {
"kernelspec": {
- "display_name": "hf",
+ "display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -303,7 +299,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.10.13"
+ "version": "3.7.12"
}
},
"nbformat": 4,
diff --git a/examples/vertex-ai/notebooks/gemma-finetuning-clm-lora-sft.ipynb b/examples/vertex-ai/notebooks/gemma-finetuning-clm-lora-sft.ipynb
index 17232207..aef09d3b 100644
--- a/examples/vertex-ai/notebooks/gemma-finetuning-clm-lora-sft.ipynb
+++ b/examples/vertex-ai/notebooks/gemma-finetuning-clm-lora-sft.ipynb
@@ -5,10 +5,10 @@
"id": "553c8548-0a0a-4dc9-ae22-1f9adefedb50",
"metadata": {},
"source": [
- "## Finetune Gemma-2B To A General Purpose Chatbot using 🤗 peft, trl, bitsandbytes & transformers\n",
+ "## Finetune Gemma-7B To A General Purpose Chatbot using 🤗 peft, trl, bitsandbytes, Flash Attention 2 & transformers\n",
"\n",
"This notebook runs on top of the image built using this Dockerfile:\n",
- "[GitHub Link](https://github.com/huggingface/Google-Cloud-Containers/blob/main/containers/pytorch/training/gpu/2.1/transformers/4.38.0.dev0/py310/Dockerfile)\n",
+ "[GitHub Link](https://github.com/huggingface/Google-Cloud-Containers/blob/main/containers/pytorch/training/gpu/2.1/transformers/4.38.1/py310/Dockerfile)\n",
"\n",
"Using this image you don't need to install any packages, as all needed packages are already there."
]
@@ -20,7 +20,7 @@
"source": [
"### Prerequisites\n",
"\n",
- "1. As the model weights are still in a private organization on HuggingFace Hub, you need to authenticate yourself in order to download model weights. You can use this from CLI:\n",
+ "1. In order to access the model weights, you have to accept the conditions to access its files and content on HuggingFace Hub. Once accepted, you need to authenticate yourself in order to download model weights. You can use this from CLI:\n",
" ```bash\n",
" huggingface-cli login\n",
" ```\n",
@@ -52,7 +52,7 @@
],
"source": [
"from transformers import AutoModelForCausalLM\n",
- "from transformers import AutoTokenizer, GemmaTokenizer, DataCollatorForLanguageModeling\n",
+ "from transformers import AutoTokenizer, DataCollatorForLanguageModeling\n",
"from transformers import TrainingArguments, Trainer\n",
"import torch"
]
@@ -66,8 +66,8 @@
},
"outputs": [],
"source": [
- "# We use the 2b model for demonstration\n",
- "model_id = \"gg-hf/gemma-2b\""
+ "# We use the 7b model for demonstration\n",
+ "model_id = \"google/gemma-7b\""
]
},
{
@@ -155,7 +155,9 @@
}
],
"source": [
- "model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=config)"
+ "model = AutoModelForCausalLM.from_pretrained(model_id, \n",
+ " quantization_config=config,\n",
+ " attn_implementation=\"flash_attention_2\")"
]
},
{
@@ -178,8 +180,7 @@
"outputs": [],
"source": [
"## Load the tokenizer\n",
- "# tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
- "tokenizer = GemmaTokenizer.from_pretrained(model_id)"
+ "tokenizer = AutoTokenizer.from_pretrained(model_id)"
]
},
{
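The hunks above switch the finetuning notebook from the 2B to the 7B checkpoint, drop the explicit `GemmaTokenizer` import in favour of `AutoTokenizer`, and enable Flash Attention 2 when loading the quantized model. The snippet below is a minimal sketch of that load path; the `BitsAndBytesConfig` values are illustrative assumptions, so defer to the notebook's own configuration cell where they differ.

```python
# Sketch of loading google/gemma-7b in 4-bit with Flash Attention 2, mirroring the
# changes above. The quantization settings are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# AutoTokenizer resolves to the correct Gemma tokenizer class, so the explicit
# GemmaTokenizer import removed in this diff is no longer needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
```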
diff --git a/push-to-gcr.sh b/push-to-gcr.sh
index 46f479e8..45212eef 100755
--- a/push-to-gcr.sh
+++ b/push-to-gcr.sh
@@ -6,7 +6,7 @@
REGION="us-central1"
DOCKER_ARTIFACT_REPO="custom-tgi-example"
PROJECT_ID="huggingface-ml"
-BASE_TGI_IMAGE="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-gpu.1.3.4"
+BASE_TGI_IMAGE="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-gpu.1.4.2"
SERVING_CONTAINER_IMAGE_URI="${REGION}-docker.pkg.dev/${PROJECT_ID}/${DOCKER_ARTIFACT_REPO}/base-tgi-image:latest"
gcloud auth login