[GSProcessing] BERT Tokenizer (#700)
*Issue #, if available:*

*Description of changes:*


By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

------------------------------------------------------
Besides implementing the BERT Tokenizer feature, this PR:
- Adds an etype_featsize dict in the graph loader to fix a bug
- Adds a dependency on setuptools, which is necessary for EMR Serverless runs
- Reformats the _process_node_features and _process_edge_features code paths to
process the tokenize feature while preserving backward compatibility.

Result:
- Constructing the full MAG dataset with the tokenize feature takes 111 minutes.
Refer to the example here:
https://github.com/awslabs/graphstorm/blob/main/examples/mag/mag_v0.2.json

---------

Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Theodore Vasiloudis <[email protected]>
3 people authored Feb 1, 2024
1 parent b01ff7b commit c29bf54
Showing 21 changed files with 597 additions and 39 deletions.
14 changes: 14 additions & 0 deletions docs/source/gs-processing/developer/input-configuration.rst
@@ -447,6 +447,20 @@ arguments.
will be considered as an array. For Parquet files, if the input type is ArrayType(StringType()), then the
separator is ignored; if it is StringType(), the same logic as in CSV applies.

- ``huggingface``

  - Transforms a text feature column into tokens or embeddings using Hugging Face models, enabling nuanced understanding and processing of natural language data (see the example below).
  - ``kwargs``:

    - ``action`` (String, required): The action to perform on the text data. Currently we only support text tokenization through Hugging Face models, so the only accepted value here is ``tokenize_hf``.
    - ``tokenize_hf``: Tokenizes text strings with a pre-trained tokenizer hosted on huggingface.co. Any Hugging Face language model available in the Hugging Face repository can be used;
      for more information see the `Hugging Face AutoTokenizer <https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer>`_ documentation.
      The input can be text strings of any length. The output includes ``input_ids`` (the token IDs of the input text),
      ``attention_mask`` (a mask that avoids performing attention on padding token indices), and ``token_type_ids`` (segment IDs that distinguish two sentences in models that expect sentence pairs).
      The output is compatible with the GraphStorm language model training and inference pipelines.
    - ``bert_model`` (String, required): The identifier of a pre-trained model available in the Hugging Face Model Hub.
    - ``max_seq_length`` (Integer, required): The maximum number of tokens in the input.
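
  For illustration, a feature entry using this transformation could look like the following sketch (the column name and model are placeholders; the kwargs mirror the fields above):

  .. code-block:: json

     {
         "column": "paper_title",
         "transformation": {
             "name": "huggingface",
             "kwargs": {
                 "action": "tokenize_hf",
                 "bert_model": "bert-base-uncased",
                 "max_seq_length": 128
             }
         }
     }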

--------------

Creating a graph for inference
15 changes: 14 additions & 1 deletion docs/source/gs-processing/usage/distributed-processing-setup.rst
@@ -111,6 +111,19 @@ The script also supports other arguments to customize the image name,
tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
for more information.

If you plan to use text transformations that rely on Hugging Face models, you can opt to include the Hugging Face model cache directly in your Docker image.
The ``build_gsprocessing_image.sh`` script provides the ``--hf-model`` argument to embed a Hugging Face BERT model cache within the Docker image.
You can do this for both the SageMaker and the EMR Serverless Docker images. Baking the cache into the image is a good way to save cost, as it avoids downloading the model after the job launches.
If you'd rather download Hugging Face models at runtime, note that for EMR Serverless images setting up a VPC and NAT route is necessary.
You can find detailed instructions on creating a VPC for EMR Serverless in the AWS documentation: `Create a VPC for EMR Serverless
<https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html>`_.


.. code-block:: bash

   bash docker/build_gsprocessing_image.sh --environment sagemaker --hf-model bert-base-uncased
   bash docker/build_gsprocessing_image.sh --environment emr-serverless --hf-model bert-base-uncased

Support for arm64 architecture
------------------------------

@@ -157,7 +170,7 @@ To build an EMR Serverless GSProcessing image for the ``arm64`` architecture you

.. code-block:: bash

   bash docker/build_gsprocessing_image.sh --environment sagemaker --architecture arm64
   bash docker/build_gsprocessing_image.sh --environment emr-serverless --architecture arm64

.. note::

6 changes: 6 additions & 0 deletions docs/source/gs-processing/usage/emr-serverless.rst
@@ -98,6 +98,12 @@ Here you will need to replace ``<aws-account-id>``, ``<arch>`` (``x86_64`` or ``
from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.13.0`` as its
base image, so we need to ensure our application uses the same release.

Additionally, if you need text feature transformations that use Hugging Face models, we suggest downloading the model cache into the EMR Serverless
Docker image (see :doc:`distributed-processing-setup`) to save cost and time. Please note that the maximum size for Docker images in EMR Serverless is limited to 5GB:
`EMR Serverless Considerations and Limitations
<https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html#considerations>`_.

Allow EMR Serverless to access the custom image repository
----------------------------------------------------------
68 changes: 68 additions & 0 deletions graphstorm-processing/docker/0.2.2/emr-serverless/Dockerfile.cpu
@@ -0,0 +1,68 @@
ARG ARCH=x86_64
FROM public.ecr.aws/emr-serverless/spark/emr-6.13.0:20230906-${ARCH} as base
FROM base as runtime

USER root
ENV PYTHON_VERSION=3.9.18

# Python won’t try to write .pyc or .pyo files on the import of source modules
# Force stdin, stdout and stderr to be totally unbuffered. Good for logging
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8

# Set up pyenv
ENV PYENV_ROOT="${HOME}/.pyenv"
ENV PATH="${PYENV_ROOT}/shims:${PYENV_ROOT}/bin:${PATH}"
ENV PYSPARK_DRIVER_PYTHON=${PYENV_ROOT}/shims/python
ENV PYSPARK_PYTHON=${PYENV_ROOT}/shims/python

# TODO: These can probably all go to another builder stage?
RUN yum erase -y openssl-devel && \
    yum install -y \
        bzip2-devel \
        gcc \
        git \
        libffi-devel \
        ncurses-devel \
        openssl11-devel \
        readline-devel \
        sqlite-devel \
        sudo \
        xz-devel && \
    rm -rf /var/cache/yum
RUN git clone https://github.com/pyenv/pyenv.git ${PYENV_ROOT} && \
    pyenv install ${PYTHON_VERSION} && \
    pyenv global ${PYTHON_VERSION}

WORKDIR /usr/lib/spark/code/

# Install GSProcessing requirements to pyenv Python
COPY requirements.txt requirements.txt
# Use --mount=type=cache,target=/root/.cache when Buildkit CI issue is fixed:
# https://github.com/moby/buildkit/issues/1512
RUN pip install -r /usr/lib/spark/code/requirements.txt \
    && rm -rf /root/.cache

# GSProcessing codebase
COPY code/ /usr/lib/spark/code/

# Install the Hugging Face model cache if necessary
ARG MODEL=""
ENV TRANSFORMERS_CACHE=/home/hadoop/.cache/huggingface/hub
RUN if [ -z "${MODEL}" ]; then \
        echo "Skip installing model cache"; \
    else \
        echo "Installing model cache for ${MODEL}" && \
        python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('${MODEL}')"; \
    fi

FROM runtime AS prod
RUN python -m pip install --no-deps /usr/lib/spark/code/graphstorm_processing-*.whl && \
    rm /usr/lib/spark/code/graphstorm_processing-*.whl && rm -rf /root/.cache

FROM runtime AS test
RUN python -m pip install --no-deps /usr/lib/spark/code/graphstorm-processing/ && rm -rf /root/.cache

USER hadoop:hadoop
WORKDIR /home/hadoop
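
# A quick way to verify that a model cache was baked into the image
# (using a placeholder image tag) is to list the Transformers cache directory:
#   docker run --rm <gsprocessing-image:tag> ls /home/hadoop/.cache/huggingface/hub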
67 changes: 67 additions & 0 deletions graphstorm-processing/docker/0.2.2/sagemaker/Dockerfile.cpu
@@ -0,0 +1,67 @@
# syntax=docker/dockerfile:experimental
FROM 153931337802.dkr.ecr.us-west-2.amazonaws.com/sagemaker-spark-processing:3.4-cpu-py39-v1.0 AS base

# Python won’t try to write .pyc or .pyo files on the import of source modules
# Force stdin, stdout and stderr to be totally unbuffered. Good for logging
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/conda/lib"
ENV PATH=/opt/conda/bin:$PATH

# Install GSProcessing requirements to pipenv Python
RUN pipenv install \
        boto3==1.28.38 \
        joblib==1.3.1 \
        mock==5.1.0 \
        pandas==1.3.5 \
        pip==23.1.2 \
        protobuf==3.20.3 \
        psutil==5.9.5 \
        pyarrow==13.0.0 \
        pyspark==3.4.1 \
        scipy==1.11.3 \
        setuptools \
        transformers==4.37.1 \
        spacy==3.6.0 \
        wheel \
    && rm -rf /root/.cache
# Do a pipenv sync so our base libs are independent from our editable code, making them cacheable
RUN pipenv sync --system && python3 -m spacy download en_core_web_lg \
    && rm -rf /root/.cache

# Graphloader codebase
COPY code/ /usr/lib/spark/code/
WORKDIR /usr/lib/spark/code/

# Base container assumes this is the workdir
ENV SPARK_HOME /usr/lib/spark
WORKDIR $SPARK_HOME

# Ensure our python3 installation is the one used
RUN echo 'alias python3=python3.9' >> ~/.bashrc

# Install the Hugging Face model cache if necessary
ARG MODEL=""
ENV TRANSFORMERS_CACHE=/root/.cache/huggingface/hub
RUN if [ -z "${MODEL}" ]; then \
        echo "Skip installing model cache"; \
    else \
        echo "Installing model cache for ${MODEL}" && \
        python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('${MODEL}')"; \
    fi

# Starts framework
ENTRYPOINT ["bash", "/usr/lib/spark/code/docker-entry.sh"]

FROM base AS prod
RUN python3 -m pip install /usr/lib/spark/code/graphstorm_processing-*.whl && \
rm /usr/lib/spark/code/graphstorm_processing-*.whl
CMD ["gs-processing"]

FROM base AS test
RUN python3 -m pip install /usr/lib/spark/code/graphstorm-processing/
CMD ["sh", "-c", "pytest ./code/tests/"]
12 changes: 10 additions & 2 deletions graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -23,7 +23,9 @@ Available options:
-i, --image Docker image name, default is 'graphstorm-processing'.
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
-s, --suffix Suffix for the image tag, can be used to push custom image tags. Default is "".
-b, --build Docker build directory, default is '/tmp/'.
-m, --hf-model Hugging Face model name to be packed into the Docker image. Default is "".
EOF
exit
}
@@ -48,6 +50,7 @@ parse_params() {
  TARGET='test'
  ARCH='x86_64'
  SUFFIX=""
  MODEL=""

  while :; do
    case "${1-}" in
@@ -86,6 +89,10 @@ parse_params() {
      SUFFIX="${2-}"
      shift
      ;;
    -m | --hf-model)
      MODEL="${2-}"
      shift
      ;;
    -?*) die "Unknown option: $1" ;;
    *) break ;;
    esac
@@ -135,6 +142,7 @@ msg "- GSP_HOME: ${GSP_HOME}"
msg "- IMAGE_NAME: ${IMAGE_NAME}"
msg "- VERSION: ${VERSION}"
msg "- SUFFIX: ${SUFFIX}"
msg "- MODEL: ${MODEL}"

# Prepare Docker build directory
rm -rf "${BUILD_DIR}/docker/code"
@@ -170,4 +178,4 @@ fi

echo "Build a Docker image ${DOCKER_FULLNAME}"
DOCKER_BUILDKIT=1 docker build --platform "linux/${ARCH}" -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
    "${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET} --build-arg ARCH=${ARCH} --build-arg MODEL=${MODEL}
@@ -13,6 +13,7 @@
See the License for the specific language governing permissions and
limitations under the License.
"""

from typing import Any

from .converter_base import ConfigConverter
@@ -134,6 +135,13 @@ def _convert_feature(feats: list[dict]) -> list[dict]:
                    else:
                        gsp_transformation_dict["name"] = "categorical"
                        gsp_transformation_dict["kwargs"] = {}
                elif gconstruct_transform_dict["name"] == "tokenize_hf":
                    gsp_transformation_dict["name"] = "huggingface"
                    gsp_transformation_dict["kwargs"] = {
                        "action": "tokenize_hf",
                        "bert_model": gconstruct_transform_dict["bert_model"],
                        "max_seq_length": gconstruct_transform_dict["max_seq_length"],
                    }
                # TODO: Add support for other common transformations here
                else:
                    raise ValueError(
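
To illustrate the conversion above, here is a sketch of the input/output pair for this branch (values are hypothetical):

# Hypothetical GConstruct transform entry
gconstruct_transform_dict = {
    "name": "tokenize_hf",
    "bert_model": "bert-base-uncased",
    "max_seq_length": 128,
}

# Resulting GSProcessing transformation produced by the branch above
gsp_transformation_dict = {
    "name": "huggingface",
    "kwargs": {
        "action": "tokenize_hf",
        "bert_model": "bert-base-uncased",
        "max_seq_length": 128,
    },
}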
@@ -15,6 +15,7 @@
Configuration parsing for edges and nodes
"""

from abc import ABC
from typing import Any, Dict, List, Optional, Sequence

@@ -27,6 +28,7 @@
    NumericalFeatureConfig,
)
from .categorical_configs import MultiCategoricalFeatureConfig
from .hf_configs import HFConfig
from .data_config_base import DataStorageConfig


@@ -67,6 +69,8 @@ def parse_feat_config(feature_dict: Dict) -> FeatureConfig:
        return FeatureConfig(feature_dict)
    elif transformation_name == "multi-categorical":
        return MultiCategoricalFeatureConfig(feature_dict)
    elif transformation_name == "huggingface":
        return HFConfig(feature_dict)
    else:
        raise RuntimeError(f"Unknown transformation name: '{transformation_name}'")
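
As a sketch of how this dispatch is used (a hypothetical feature entry, following the documented input configuration format):

hf_feature = {
    "column": "paper_title",
    "transformation": {
        "name": "huggingface",
        "kwargs": {"action": "tokenize_hf", "bert_model": "bert-base-uncased", "max_seq_length": 128},
    },
}
feat_config = parse_feat_config(hf_feature)  # returns an HFConfig instance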

54 changes: 54 additions & 0 deletions graphstorm-processing/graphstorm_processing/config/hf_configs.py
@@ -0,0 +1,54 @@
"""
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License").
You may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

from typing import Mapping

from graphstorm_processing.constants import HUGGINGFACE_TOKENIZE
from .feature_config_base import FeatureConfig


class HFConfig(FeatureConfig):
    """Feature configuration for Hugging Face text features.

    Supported kwargs
    ----------------
    action: str, required
        The type of Hugging Face action to use. The only valid value is "tokenize_hf".
    bert_model: str, required
        The name of the language model.
    max_seq_length: int, required
        The maximum length of the tokenization results.
    """

    def __init__(self, config: Mapping):
        super().__init__(config)
        self.action = self._transformation_kwargs.get("action")
        self.bert_model = self._transformation_kwargs.get("bert_model")
        self.max_seq_length = self._transformation_kwargs.get("max_seq_length")

        self._sanity_check()

    def _sanity_check(self) -> None:
        super()._sanity_check()
        assert self.action in [
            HUGGINGFACE_TOKENIZE
        ], f"huggingface action needs to be {HUGGINGFACE_TOKENIZE}"
        assert isinstance(
            self.bert_model, str
        ), f"Expect bert_model to be a string, but got {self.bert_model}"
        assert (
            isinstance(self.max_seq_length, int) and self.max_seq_length > 0
        ), f"Expect max_seq_length {self.max_seq_length} to be an integer larger than zero."
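
As a usage note: construction runs _sanity_check, so an invalid action fails fast. A hypothetical sketch:

bad_feature = {
    "column": "paper_title",
    "transformation": {
        "name": "huggingface",
        "kwargs": {"action": "embed_hf", "bert_model": "bert-base-uncased", "max_seq_length": 128},
    },
}
# Raises AssertionError: huggingface action needs to be tokenize_hf
HFConfig(bad_feature)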
5 changes: 5 additions & 0 deletions graphstorm-processing/graphstorm_processing/constants.py
@@ -13,6 +13,7 @@
See the License for the specific language governing permissions and
limitations under the License.
"""

################### Categorical Limits #######################
MAX_CATEGORIES_PER_FEATURE = 100
RARE_CATEGORY = "GSP_CONSTANT_OTHER"
@@ -43,3 +44,7 @@
################# Numerical transformations ################
VALID_IMPUTERS = ["none", "mean", "median", "most_frequent"]
VALID_NORMALIZERS = ["none", "min-max", "standard", "rank-gauss"]

################# Bert transformations ################
HUGGINGFACE_TRANFORM = "huggingface"
HUGGINGFACE_TOKENIZE = "tokenize_hf"
@@ -13,6 +13,7 @@
See the License for the specific language governing permissions and
limitations under the License.
"""

import logging

from pyspark.sql import DataFrame
@@ -26,6 +27,7 @@
    DistBucketNumericalTransformation,
    DistCategoryTransformation,
    DistMultiCategoryTransformation,
    DistHFTransformation,
)


@@ -57,6 +59,8 @@ def __init__(self, feature_config: FeatureConfig):
            self.transformation = DistCategoryTransformation(**default_kwargs, **args_dict)
        elif feat_type == "multi-categorical":
            self.transformation = DistMultiCategoryTransformation(**default_kwargs, **args_dict)
        elif feat_type == "huggingface":
            self.transformation = DistHFTransformation(**default_kwargs, **args_dict)
        else:
            raise NotImplementedError(
                f"Feature {feat_name} has type: {feat_type} that is not supported"
@@ -1,6 +1,7 @@
"""
Implementations for the various distributed transformations.
"""

from .base_dist_transformation import DistributedTransformation
from .dist_category_transformation import (
    DistCategoryTransformation,
@@ -13,3 +14,4 @@
    DistNumericalTransformation,
)
from .dist_bucket_numerical_transformation import DistBucketNumericalTransformation
from .dist_hf_transformation import DistHFTransformation