[GSProcessing] BERT Tokenizer (#700)
*Issue #, if available:*

*Description of changes:*


By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

------------------------------------------------------
Besides implementing the BERT Tokenizer feature, this PR:
- Adds an etype_featsize dict in the graph loader to fix a bug
- Adds a dependency on setuptools, which is necessary for EMR Serverless runs
- Reformats the _process_node_features and _process_edge_features code paths to
process the tokenize feature while preserving backward compatibility.

Result:
- Constructing the full MAG dataset with the tokenize feature takes 111 minutes.
Refer to the example here:
https://github.com/awslabs/graphstorm/blob/main/examples/mag/mag_v0.2.json

---------

Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Theodore Vasiloudis <[email protected]>
3 people authored Feb 1, 2024
1 parent b01ff7b commit c29bf54
Showing 21 changed files with 597 additions and 39 deletions.
14 changes: 14 additions & 0 deletions docs/source/gs-processing/developer/input-configuration.rst
@@ -447,6 +447,20 @@ arguments.
will be considered as an array. For Parquet files, if the input type is ArrayType(StringType()), then the
separator is ignored; if it is StringType(), the same logic as in CSV applies.

- ``huggingface``

  - Transforms a text feature column into tokens or embeddings using Hugging Face models, enabling nuanced understanding and processing of natural language data (see the example below).
  - ``kwargs``:

    - ``action`` (String, required): The action to perform on the text data. Currently we only support text tokenization through Hugging Face models, so the only accepted value here is ``tokenize_hf``.
    - ``tokenize_hf``: Tokenizes text strings with a pre-trained tokenizer hosted on huggingface.co. Any Hugging Face language model available in the Hugging Face repository can be used;
      for more information see the `Hugging Face AutoTokenizer <https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer>`_ documentation.
      The input can be text strings of any length. The output includes ``input_ids`` (the token IDs of the input text),
      ``attention_mask`` (a mask that avoids performing attention on padding token indices), and ``token_type_ids`` (segment IDs that distinguish two sentences in models that expect sentence pairs).
      The output is compatible with the GraphStorm language model training and inference pipelines.
    - ``bert_model`` (String, required): The identifier of a pre-trained model available in the Hugging Face Model Hub.
    - ``max_seq_length`` (Integer, required): The maximum number of tokens in the input.
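
  For illustration, a feature entry using this transformation could look like the following sketch (the column name and model are placeholders; the kwargs mirror the fields above):

  .. code-block:: json

     {
         "column": "paper_title",
         "transformation": {
             "name": "huggingface",
             "kwargs": {
                 "action": "tokenize_hf",
                 "bert_model": "bert-base-uncased",
                 "max_seq_length": 128
             }
         }
     }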

--------------

Creating a graph for inference
15 changes: 14 additions & 1 deletion docs/source/gs-processing/usage/distributed-processing-setup.rst
@@ -111,6 +111,19 @@ The script also supports other arguments to customize the image name,
tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
for more information.

If you plan to use text transformations that rely on Hugging Face models, you can opt to include the Hugging Face model cache directly in your Docker image.
The ``build_gsprocessing_image.sh`` script provides the ``--hf-model`` argument to embed a Hugging Face BERT model cache within the Docker image.
You can do this for both the SageMaker and the EMR Serverless Docker images. Baking the cache into the image is a good way to save cost, as it avoids downloading the model after the job launches.
If you'd rather download Hugging Face models at runtime, note that for EMR Serverless images setting up a VPC and NAT route is necessary.
You can find detailed instructions on creating a VPC for EMR Serverless in the AWS documentation: `Create a VPC for EMR Serverless
<https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html>`_.


.. code-block:: bash

   bash docker/build_gsprocessing_image.sh --environment sagemaker --hf-model bert-base-uncased
   bash docker/build_gsprocessing_image.sh --environment emr-serverless --hf-model bert-base-uncased

Support for arm64 architecture
------------------------------

@@ -157,7 +170,7 @@ To build an EMR Serverless GSProcessing image for the ``arm64`` architecture you

.. code-block:: bash

   bash docker/build_gsprocessing_image.sh --environment sagemaker --architecture arm64
   bash docker/build_gsprocessing_image.sh --environment emr-serverless --architecture arm64

.. note::

6 changes: 6 additions & 0 deletions docs/source/gs-processing/usage/emr-serverless.rst
@@ -98,6 +98,12 @@ Here you will need to replace ``<aws-account-id>``, ``<arch>`` (``x86_64`` or ``
from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.13.0`` as its
base image, so we need to ensure our application uses the same release.

Additionally, if you need text feature transformations that use Hugging Face models, we suggest downloading the model cache into the EMR Serverless
Docker image (see :doc:`distributed-processing-setup`) to save cost and time. Please note that the maximum size for Docker images in EMR Serverless is limited to 5GB:
`EMR Serverless Considerations and Limitations
<https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html#considerations>`_.

Allow EMR Serverless to access the custom image repository
----------------------------------------------------------
68 changes: 68 additions & 0 deletions graphstorm-processing/docker/0.2.2/emr-serverless/Dockerfile.cpu
@@ -0,0 +1,68 @@
ARG ARCH=x86_64
FROM public.ecr.aws/emr-serverless/spark/emr-6.13.0:20230906-${ARCH} as base
FROM base as runtime

USER root
ENV PYTHON_VERSION=3.9.18

# Python won’t try to write .pyc or .pyo files on the import of source modules
# Force stdin, stdout and stderr to be totally unbuffered. Good for logging
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8

# Set up pyenv
ENV PYENV_ROOT="${HOME}/.pyenv"
ENV PATH="${PYENV_ROOT}/shims:${PYENV_ROOT}/bin:${PATH}"
ENV PYSPARK_DRIVER_PYTHON=${PYENV_ROOT}/shims/python
ENV PYSPARK_PYTHON=${PYENV_ROOT}/shims/python

# TODO: These can probably all go to another builder stage?
RUN yum erase -y openssl-devel && \
    yum install -y \
        bzip2-devel \
        gcc \
        git \
        libffi-devel \
        ncurses-devel \
        openssl11-devel \
        readline-devel \
        sqlite-devel \
        sudo \
        xz-devel && \
    rm -rf /var/cache/yum
RUN git clone https://github.com/pyenv/pyenv.git ${PYENV_ROOT} && \
    pyenv install ${PYTHON_VERSION} && \
    pyenv global ${PYTHON_VERSION}

WORKDIR /usr/lib/spark/code/

# Install GSProcessing requirements to pyenv Python
COPY requirements.txt requirements.txt
# Use --mount=type=cache,target=/root/.cache when Buildkit CI issue is fixed:
# https://github.com/moby/buildkit/issues/1512
RUN pip install -r /usr/lib/spark/code/requirements.txt \
    && rm -rf /root/.cache

# GSProcessing codebase
COPY code/ /usr/lib/spark/code/

# Install the Hugging Face model cache if necessary
ARG MODEL=""
ENV TRANSFORMERS_CACHE=/home/hadoop/.cache/huggingface/hub
RUN if [ -z "${MODEL}" ]; then \
        echo "Skip installing model cache"; \
    else \
        echo "Installing model cache for ${MODEL}" && \
        python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('${MODEL}')"; \
    fi

FROM runtime AS prod
RUN python -m pip install --no-deps /usr/lib/spark/code/graphstorm_processing-*.whl && \
    rm /usr/lib/spark/code/graphstorm_processing-*.whl && rm -rf /root/.cache

FROM runtime AS test
RUN python -m pip install --no-deps /usr/lib/spark/code/graphstorm-processing/ && rm -rf /root/.cache

USER hadoop:hadoop
WORKDIR /home/hadoop
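
# A quick way to verify that a model cache was baked into the image
# (using a placeholder image tag) is to list the Transformers cache directory:
#   docker run --rm <gsprocessing-image:tag> ls /home/hadoop/.cache/huggingface/hub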
67 changes: 67 additions & 0 deletions graphstorm-processing/docker/0.2.2/sagemaker/Dockerfile.cpu
@@ -0,0 +1,67 @@
# syntax=docker/dockerfile:experimental
FROM 153931337802.dkr.ecr.us-west-2.amazonaws.com/sagemaker-spark-processing:3.4-cpu-py39-v1.0 AS base

# Python won’t try to write .pyc or .pyo files on the import of source modules
# Force stdin, stdout and stderr to be totally unbuffered. Good for logging
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/conda/lib"
ENV PATH=/opt/conda/bin:$PATH

# Install GSProcessing requirements to pipenv Python
RUN pipenv install \
        boto3==1.28.38 \
        joblib==1.3.1 \
        mock==5.1.0 \
        pandas==1.3.5 \
        pip==23.1.2 \
        protobuf==3.20.3 \
        psutil==5.9.5 \
        pyarrow==13.0.0 \
        pyspark==3.4.1 \
        scipy==1.11.3 \
        setuptools \
        transformers==4.37.1 \
        spacy==3.6.0 \
        wheel \
    && rm -rf /root/.cache
# Do a pipenv sync so our base libs are independent from our editable code, making them cacheable
RUN pipenv sync --system && python3 -m spacy download en_core_web_lg \
    && rm -rf /root/.cache

# Graphloader codebase
COPY code/ /usr/lib/spark/code/
WORKDIR /usr/lib/spark/code/

# Base container assumes this is the workdir
ENV SPARK_HOME /usr/lib/spark
WORKDIR $SPARK_HOME

# Ensure our python3 installation is the one used
RUN echo 'alias python3=python3.9' >> ~/.bashrc

# Install the Hugging Face model cache if necessary
ARG MODEL=""
ENV TRANSFORMERS_CACHE=/root/.cache/huggingface/hub
RUN if [ -z "${MODEL}" ]; then \
        echo "Skip installing model cache"; \
    else \
        echo "Installing model cache for ${MODEL}" && \
        python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('${MODEL}')"; \
    fi

# Starts framework
ENTRYPOINT ["bash", "/usr/lib/spark/code/docker-entry.sh"]

FROM base AS prod
RUN python3 -m pip install /usr/lib/spark/code/graphstorm_processing-*.whl && \
rm /usr/lib/spark/code/graphstorm_processing-*.whl
CMD ["gs-processing"]

FROM base AS test
RUN python3 -m pip install /usr/lib/spark/code/graphstorm-processing/
CMD ["sh", "-c", "pytest ./code/tests/"]
12 changes: 10 additions & 2 deletions graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -23,7 +23,9 @@ Available options:
-i, --image Docker image name, default is 'graphstorm-processing'.
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
-s, --suffix Suffix for the image tag, can be used to push custom image tags. Default is "".
-b, --build Docker build directory, default is '/tmp/'.
-m, --hf-model Hugging Face model name to be packed into the Docker image. Default is "".
EOF
exit
}
@@ -48,6 +50,7 @@ parse_params() {
  TARGET='test'
  ARCH='x86_64'
  SUFFIX=""
  MODEL=""

  while :; do
    case "${1-}" in
@@ -86,6 +89,10 @@ parse_params() {
      SUFFIX="${2-}"
      shift
      ;;
    -m | --hf-model)
      MODEL="${2-}"
      shift
      ;;
    -?*) die "Unknown option: $1" ;;
    *) break ;;
    esac
@@ -135,6 +142,7 @@ msg "- GSP_HOME: ${GSP_HOME}"
msg "- IMAGE_NAME: ${IMAGE_NAME}"
msg "- VERSION: ${VERSION}"
msg "- SUFFIX: ${SUFFIX}"
msg "- MODEL: ${MODEL}"

# Prepare Docker build directory
rm -rf "${BUILD_DIR}/docker/code"
@@ -170,4 +178,4 @@ fi

echo "Build a Docker image ${DOCKER_FULLNAME}"
DOCKER_BUILDKIT=1 docker build --platform "linux/${ARCH}" -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
    "${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET} --build-arg ARCH=${ARCH} --build-arg MODEL=${MODEL}
@@ -13,6 +13,7 @@
See the License for the specific language governing permissions and
limitations under the License.
"""

from typing import Any

from .converter_base import ConfigConverter
@@ -134,6 +135,13 @@ def _convert_feature(feats: list[dict]) -> list[dict]:
                    else:
                        gsp_transformation_dict["name"] = "categorical"
                        gsp_transformation_dict["kwargs"] = {}
                elif gconstruct_transform_dict["name"] == "tokenize_hf":
                    gsp_transformation_dict["name"] = "huggingface"
                    gsp_transformation_dict["kwargs"] = {
                        "action": "tokenize_hf",
                        "bert_model": gconstruct_transform_dict["bert_model"],
                        "max_seq_length": gconstruct_transform_dict["max_seq_length"],
                    }
                # TODO: Add support for other common transformations here
                else:
                    raise ValueError(
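
To illustrate the conversion above, here is a sketch of the input/output pair for this branch (values are hypothetical):

# Hypothetical GConstruct transform entry
gconstruct_transform_dict = {
    "name": "tokenize_hf",
    "bert_model": "bert-base-uncased",
    "max_seq_length": 128,
}

# Resulting GSProcessing transformation produced by the branch above
gsp_transformation_dict = {
    "name": "huggingface",
    "kwargs": {
        "action": "tokenize_hf",
        "bert_model": "bert-base-uncased",
        "max_seq_length": 128,
    },
}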
@@ -15,6 +15,7 @@
Configuration parsing for edges and nodes
"""

from abc import ABC
from typing import Any, Dict, List, Optional, Sequence

@@ -27,6 +28,7 @@
    NumericalFeatureConfig,
)
from .categorical_configs import MultiCategoricalFeatureConfig
from .hf_configs import HFConfig
from .data_config_base import DataStorageConfig


@@ -67,6 +69,8 @@ def parse_feat_config(feature_dict: Dict) -> FeatureConfig:
        return FeatureConfig(feature_dict)
    elif transformation_name == "multi-categorical":
        return MultiCategoricalFeatureConfig(feature_dict)
    elif transformation_name == "huggingface":
        return HFConfig(feature_dict)
    else:
        raise RuntimeError(f"Unknown transformation name: '{transformation_name}'")
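
As a sketch of how this dispatch is used (a hypothetical feature entry, following the documented input configuration format):

hf_feature = {
    "column": "paper_title",
    "transformation": {
        "name": "huggingface",
        "kwargs": {"action": "tokenize_hf", "bert_model": "bert-base-uncased", "max_seq_length": 128},
    },
}
feat_config = parse_feat_config(hf_feature)  # returns an HFConfig instance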

54 changes: 54 additions & 0 deletions graphstorm-processing/graphstorm_processing/config/hf_configs.py
@@ -0,0 +1,54 @@
"""
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License").
You may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

from typing import Mapping

from graphstorm_processing.constants import HUGGINGFACE_TOKENIZE
from .feature_config_base import FeatureConfig


class HFConfig(FeatureConfig):
    """Feature configuration for Hugging Face text features.

    Supported kwargs
    ----------------
    action: str, required
        The type of Hugging Face action to use. The only valid value is "tokenize_hf".
    bert_model: str, required
        The name of the language model.
    max_seq_length: int, required
        The maximum length of the tokenization results.
    """

    def __init__(self, config: Mapping):
        super().__init__(config)
        self.action = self._transformation_kwargs.get("action")
        self.bert_model = self._transformation_kwargs.get("bert_model")
        self.max_seq_length = self._transformation_kwargs.get("max_seq_length")

        self._sanity_check()

    def _sanity_check(self) -> None:
        super()._sanity_check()
        assert self.action in [
            HUGGINGFACE_TOKENIZE
        ], f"huggingface action needs to be {HUGGINGFACE_TOKENIZE}"
        assert isinstance(
            self.bert_model, str
        ), f"Expect bert_model to be a string, but got {self.bert_model}"
        assert (
            isinstance(self.max_seq_length, int) and self.max_seq_length > 0
        ), f"Expect max_seq_length {self.max_seq_length} to be an integer larger than zero."
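
As a usage note: construction runs _sanity_check, so an invalid action fails fast. A hypothetical sketch:

bad_feature = {
    "column": "paper_title",
    "transformation": {
        "name": "huggingface",
        "kwargs": {"action": "embed_hf", "bert_model": "bert-base-uncased", "max_seq_length": 128},
    },
}
# Raises AssertionError: huggingface action needs to be tokenize_hf
HFConfig(bad_feature)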
5 changes: 5 additions & 0 deletions graphstorm-processing/graphstorm_processing/constants.py
@@ -13,6 +13,7 @@
See the License for the specific language governing permissions and
limitations under the License.
"""

################### Categorical Limits #######################
MAX_CATEGORIES_PER_FEATURE = 100
RARE_CATEGORY = "GSP_CONSTANT_OTHER"
@@ -43,3 +44,7 @@
################# Numerical transformations ################
VALID_IMPUTERS = ["none", "mean", "median", "most_frequent"]
VALID_NORMALIZERS = ["none", "min-max", "standard", "rank-gauss"]

################# Bert transformations ################
HUGGINGFACE_TRANFORM = "huggingface"
HUGGINGFACE_TOKENIZE = "tokenize_hf"
@@ -13,6 +13,7 @@
See the License for the specific language governing permissions and
limitations under the License.
"""

import logging

from pyspark.sql import DataFrame
@@ -26,6 +27,7 @@
    DistBucketNumericalTransformation,
    DistCategoryTransformation,
    DistMultiCategoryTransformation,
    DistHFTransformation,
)


@@ -57,6 +59,8 @@ def __init__(self, feature_config: FeatureConfig):
            self.transformation = DistCategoryTransformation(**default_kwargs, **args_dict)
        elif feat_type == "multi-categorical":
            self.transformation = DistMultiCategoryTransformation(**default_kwargs, **args_dict)
        elif feat_type == "huggingface":
            self.transformation = DistHFTransformation(**default_kwargs, **args_dict)
        else:
            raise NotImplementedError(
                f"Feature {feat_name} has type: {feat_type} that is not supported"
@@ -1,6 +1,7 @@
"""
Implementations for the various distributed transformations.
"""

from .base_dist_transformation import DistributedTransformation
from .dist_category_transformation import (
    DistCategoryTransformation,
@@ -13,3 +14,4 @@
    DistNumericalTransformation,
)
from .dist_bucket_numerical_transformation import DistBucketNumericalTransformation
from .dist_hf_transformation import DistHFTransformation