
[GSProcessing] BERT Tokenizer #700

Merged
merged 40 commits into main from bert_tokenzier on Feb 1, 2024
Changes from 34 commits
Commits (40)
0964886 add gconstruct converter (jalencato, Jan 10, 2024)
0a67217 first commit about code structure on tokenize feature transformation (jalencato, Jan 10, 2024)
7c20436 add first version with udf implementation (jalencato, Jan 11, 2024)
626565d remove torch related (jalencato, Jan 17, 2024)
cfebe1d remove torch (jalencato, Jan 17, 2024)
32da0c8 add doc (jalencato, Jan 18, 2024)
18d61e4 add (Jan 23, 2024)
178da9f add (jalencato, Jan 23, 2024)
288604e Merge branch 'main' into bert_tokenzier (jalencato, Jan 25, 2024)
8a0f872 add fix (jalencato, Jan 25, 2024)
69edfd9 rename (jalencato, Jan 26, 2024)
c26aa68 rename (jalencato, Jan 26, 2024)
ad74f46 add test for huggingface (jalencato, Jan 26, 2024)
d7bccff black reformat (jalencato, Jan 26, 2024)
bd8c075 apply lint (jalencato, Jan 26, 2024)
ae967da add dependency (jalencato, Jan 26, 2024)
a81f4b5 add (jalencato, Jan 26, 2024)
9181826 test fix (jalencato, Jan 27, 2024)
93a9685 add fix (jalencato, Jan 29, 2024)
e091bf9 add test (jalencato, Jan 29, 2024)
69bee37 change config (jalencato, Jan 29, 2024)
dfd63e4 apply comments (jalencato, Jan 31, 2024)
250938a apply comment (jalencato, Jan 31, 2024)
efb1bfa apply lint (jalencato, Jan 31, 2024)
6d13571 add final line (jalencato, Jan 31, 2024)
3d85407 Update docs/source/gs-processing/developer/input-configuration.rst (jalencato, Jan 31, 2024)
89924c6 name change (jalencato, Jan 31, 2024)
4ec8563 add build docker' (jalencato, Jan 31, 2024)
966e6fd add doc (jalencato, Jan 31, 2024)
aa43a83 add doc (jalencato, Jan 31, 2024)
a2d36d5 add doc (jalencato, Jan 31, 2024)
9cbf74e change dockerfile (jalencato, Feb 1, 2024)
b702b30 add docker packing (jalencato, Feb 1, 2024)
a67c66c doc (jalencato, Feb 1, 2024)
fe49292 Apply suggestions from code review (jalencato, Feb 1, 2024)
5c6fd8c final version (jalencato, Feb 1, 2024)
b09d058 apply black (jalencato, Feb 1, 2024)
9d1a234 doc (jalencato, Feb 1, 2024)
6247297 convert (jalencato, Feb 1, 2024)
3c7bef1 Merge branch 'main' into bert_tokenzier (jalencato, Feb 1, 2024)
9 changes: 9 additions & 0 deletions docs/source/gs-processing/developer/input-configuration.rst
@@ -447,6 +447,15 @@ arguments.
will be considered as an array. For Parquet files, if the input type is ArrayType(StringType()), then the
separator is ignored; if it is StringType(), the same logic as for CSV applies.

- ``huggingface``

  - Transforms a text feature column into tokens or embeddings using Hugging Face models, enabling downstream processing of natural language data. See the example configuration below.
  - ``kwargs``:

    - ``action`` (String, required): The action to perform on the text data. Currently only text tokenization through Hugging Face models is supported, so the only accepted value is ``tokenize_hf``.
    - ``bert_model`` (String, required): The identifier of a pre-trained model available on the Hugging Face Model Hub, e.g. ``bert-base-uncased``.
    - ``max_seq_length`` (Integer, required): The maximum number of tokens of the input sequence.
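
For example, a feature entry using this transformation could look like the following sketch (the column name and model identifier are illustrative):

.. code-block:: json

    {
        "column": "text",
        "transformation": {
            "name": "huggingface",
            "kwargs": {
                "action": "tokenize_hf",
                "bert_model": "bert-base-uncased",
                "max_seq_length": 64
            }
        }
    }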

--------------

Creating a graph for inference
14 changes: 13 additions & 1 deletion docs/source/gs-processing/usage/distributed-processing-setup.rst
@@ -111,6 +111,18 @@ The script also supports other arguments to customize the image name,
tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
for more information.

For EMR Serverless images, setting up a VPC and NAT route is a necessary step when using text feature transformations.
You can find detailed instructions on creating a VPC for EMR Serverless in the AWS documentation: `Create a VPC on emr-serverless
<https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html>`_.
Alternatively, you can avoid the network setup by including the Hugging Face model cache directly in your Docker image.
This option is available for both the SageMaker and EMR Serverless images, and saves cost by avoiding model downloads every time a cluster launches.
The ``build_gsprocessing_image.sh`` script provides the ``--model`` option to embed a Hugging Face model cache within the Docker image:

.. code-block:: bash

bash docker/build_gsprocessing_image.sh --environment sagemaker --model bert-base-uncased
bash docker/build_gsprocessing_image.sh --environment emr-serverless --model bert-base-uncased

Support for arm64 architecture
------------------------------

@@ -157,7 +169,7 @@ To build an EMR Serverless GSProcessing image for the ``arm64`` architecture you

.. code-block:: bash

bash docker/build_gsprocessing_image.sh --environment sagemaker --architecture arm64
bash docker/build_gsprocessing_image.sh --environment emr-serverless --architecture arm64

.. note::

4 changes: 4 additions & 0 deletions docs/source/gs-processing/usage/emr-serverless.rst
@@ -98,6 +98,10 @@ Here you will need to replace ``<aws-account-id>``, ``<arch>`` (``x86_64`` or ``
from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.13.0`` as its
base image, so we need to ensure our application uses the same release.

Additionally, if you need to use text feature transformations, it is suggested to set up a VPC and NAT route for the EMR Serverless application:
`Create a VPC on emr-serverless
<https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html>`_. If you prefer to store the model cache
inside the Docker image instead, see :doc:`distributed-processing-setup` for how to do that.

Allow EMR Serverless to access the custom image repository
----------------------------------------------------------
68 changes: 68 additions & 0 deletions graphstorm-processing/docker/0.2.2/emr-serverless/Dockerfile.cpu
@@ -0,0 +1,68 @@
ARG ARCH=x86_64
FROM public.ecr.aws/emr-serverless/spark/emr-6.13.0:20230906-${ARCH} as base
FROM base as runtime

USER root
ENV PYTHON_VERSION=3.9.18

# Python won’t try to write .pyc or .pyo files on the import of source modules
# Force stdin, stdout and stderr to be totally unbuffered. Good for logging
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8

# Set up pyenv
ENV PYENV_ROOT="${HOME}/.pyenv"
ENV PATH="${PYENV_ROOT}/shims:${PYENV_ROOT}/bin:${PATH}"
ENV PYSPARK_DRIVER_PYTHON=${PYENV_ROOT}/shims/python
ENV PYSPARK_PYTHON=${PYENV_ROOT}/shims/python

# TODO: These can probably all go to another builder stage?
RUN yum erase -y openssl-devel && \
    yum install -y \
        bzip2-devel \
        gcc \
        git \
        libffi-devel \
        ncurses-devel \
        openssl11-devel \
        readline-devel \
        sqlite-devel \
        sudo \
        xz-devel && \
    rm -rf /var/cache/yum
RUN git clone https://github.com/pyenv/pyenv.git ${PYENV_ROOT} && \
    pyenv install ${PYTHON_VERSION} && \
    pyenv global ${PYTHON_VERSION}

WORKDIR /usr/lib/spark/code/

# Install GSProcessing requirements to pyenv Python
COPY requirements.txt requirements.txt
# Use --mount=type=cache,target=/root/.cache when Buildkit CI issue is fixed:
# https://github.com/moby/buildkit/issues/1512
RUN pip install -r /usr/lib/spark/code/requirements.txt \
    && rm -rf /root/.cache

# GSProcessing codebase
COPY code/ /usr/lib/spark/code/

# Install Huggingface model cache if it is necessary
ARG MODEL=""
ENV TRANSFORMERS_CACHE=/home/hadoop/.cache/huggingface/hub
RUN if [ -z "${MODEL}" ]; then \
        echo "Skip installing model cache"; \
    else \
        echo "Installing model cache for ${MODEL}" && \
        python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('${MODEL}')"; \
    fi

FROM runtime AS prod
RUN python -m pip install --no-deps /usr/lib/spark/code/graphstorm_processing-*.whl && \
    rm /usr/lib/spark/code/graphstorm_processing-*.whl && rm -rf /root/.cache

FROM runtime AS test
RUN python -m pip install --no-deps /usr/lib/spark/code/graphstorm-processing/ && rm -rf /root/.cache

USER hadoop:hadoop
WORKDIR /home/hadoop
66 changes: 66 additions & 0 deletions graphstorm-processing/docker/0.2.2/sagemaker/Dockerfile.cpu
@@ -0,0 +1,66 @@
# syntax=docker/dockerfile:experimental
FROM 153931337802.dkr.ecr.us-west-2.amazonaws.com/sagemaker-spark-processing:3.4-cpu-py39-v1.0 AS base

# Python won’t try to write .pyc or .pyo files on the import of source modules
# Force stdin, stdout and stderr to be totally unbuffered. Good for logging
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/conda/lib"
ENV PATH=/opt/conda/bin:$PATH

# Install GSProcessing requirements to pipenv Python
RUN pipenv install \
        boto3==1.28.38 \
        joblib==1.3.1 \
        mock==5.1.0 \
        pandas==1.3.5 \
        pip==23.1.2 \
        protobuf==3.20.3 \
        psutil==5.9.5 \
        pyarrow==13.0.0 \
        pyspark==3.4.1 \
        scipy==1.11.3 \
        setuptools \
        spacy==3.6.0 \
        wheel \
    && rm -rf /root/.cache
# Do a pipenv sync so our base libs are independent from our editable code, making them cacheable
RUN pipenv sync --system && python3 -m spacy download en_core_web_lg \
    && rm -rf /root/.cache

# Graphloader codebase
COPY code/ /usr/lib/spark/code/
WORKDIR /usr/lib/spark/code/

# Base container assumes this is the workdir
ENV SPARK_HOME /usr/lib/spark
WORKDIR $SPARK_HOME

# Ensure our python3 installation is the one used
RUN echo 'alias python3=python3.9' >> ~/.bashrc

# Install Huggingface model cache if it is necessary
ARG MODEL=""
ENV TRANSFORMERS_CACHE=/root/.cache/huggingface/hub
RUN if [ -z "${MODEL}" ]; then \
        echo "Skip installing model cache"; \
    else \
        echo "Installing model cache for ${MODEL}" && \
        python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('${MODEL}')"; \
    fi

# Starts framework
ENTRYPOINT ["bash", "/usr/lib/spark/code/docker-entry.sh"]

FROM base AS prod
RUN python3 -m pip install /usr/lib/spark/code/graphstorm_processing-*.whl && \
    rm /usr/lib/spark/code/graphstorm_processing-*.whl
CMD ["gs-processing"]

FROM base AS test
RUN python3 -m pip install /usr/lib/spark/code/graphstorm-processing/
CMD ["sh", "-c", "pytest ./code/tests/"]
12 changes: 10 additions & 2 deletions graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -23,7 +23,9 @@ Available options:
-i, --image Docker image name, default is 'graphstorm-processing'.
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
-s, --suffix Suffix for the image tag, can be used to push custom image tags. Default is "".
-b, --build Docker build directory, default is '/tmp/`
-b, --build Docker build directory, default is '/tmp/'.
-m, --model Hugging Face model name to pack into the docker image. Default is "".

EOF
exit
}
@@ -48,6 +50,7 @@ parse_params() {
TARGET='test'
ARCH='x86_64'
SUFFIX=""
MODEL=""

while :; do
case "${1-}" in
@@ -86,6 +89,10 @@ parse_params() {
SUFFIX="${2-}"
shift
;;
-m | --model)
MODEL="${2-}"
shift
;;
-?*) die "Unknown option: $1" ;;
*) break ;;
esac
@@ -135,6 +142,7 @@ msg "- GSP_HOME: ${GSP_HOME}"
msg "- IMAGE_NAME: ${IMAGE_NAME}"
msg "- VERSION: ${VERSION}"
msg "- SUFFIX: ${SUFFIX}"
msg "- MODEL: ${MODEL}"

# Prepare Docker build directory
rm -rf "${BUILD_DIR}/docker/code"
@@ -170,4 +178,4 @@ fi

echo "Build a Docker image ${DOCKER_FULLNAME}"
DOCKER_BUILDKIT=1 docker build --platform "linux/${ARCH}" -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
"${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET} --build-arg ARCH=${ARCH}
"${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET} --build-arg ARCH=${ARCH} --build-arg MODEL=${MODEL}
@@ -13,6 +13,7 @@
See the License for the specific language governing permissions and
limitations under the License.
"""

from typing import Any

from .converter_base import ConfigConverter
@@ -134,6 +135,13 @@ def _convert_feature(feats: list[dict]) -> list[dict]:
                else:
                    gsp_transformation_dict["name"] = "categorical"
                    gsp_transformation_dict["kwargs"] = {}
            elif gconstruct_transform_dict["name"] == "tokenize_hf":
                gsp_transformation_dict["name"] = "huggingface"
                gsp_transformation_dict["kwargs"] = {
                    "action": "tokenize_hf",
                    "bert_model": gconstruct_transform_dict["bert_model"],
                    "max_seq_length": gconstruct_transform_dict["max_seq_length"],
                }
            # TODO: Add support for other common transformations here
            else:
                raise ValueError(
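
To make the mapping concrete, here is a hedged sketch of the conversion performed by the branch above (the GConstruct key names "feature_col" and "transform" are assumptions based on GConstruct's config format; values are illustrative):

# GConstruct-style input feature (key names assumed, values illustrative)
gconstruct_feat = {
    "feature_col": "text",
    "transform": {
        "name": "tokenize_hf",
        "bert_model": "bert-base-uncased",
        "max_seq_length": 16,
    },
}

# Resulting GSProcessing transformation dict, following the elif branch above
gsp_transformation_dict = {
    "name": "huggingface",
    "kwargs": {
        "action": "tokenize_hf",
        "bert_model": "bert-base-uncased",
        "max_seq_length": 16,
    },
}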
@@ -27,6 +27,7 @@
    NumericalFeatureConfig,
)
from .categorical_configs import MultiCategoricalFeatureConfig
from .hf_configs import HFConfig
from .data_config_base import DataStorageConfig


@@ -67,6 +68,8 @@ def parse_feat_config(feature_dict: Dict) -> FeatureConfig:
        return FeatureConfig(feature_dict)
    elif transformation_name == "multi-categorical":
        return MultiCategoricalFeatureConfig(feature_dict)
    elif transformation_name == "huggingface":
        return HFConfig(feature_dict)
    else:
        raise RuntimeError(f"Unknown transformation name: '{transformation_name}'")

53 changes: 53 additions & 0 deletions graphstorm-processing/graphstorm_processing/config/hf_configs.py
@@ -0,0 +1,53 @@
"""
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License").
You may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

from typing import Mapping

from graphstorm_processing.constants import HUGGINGFACE_TOKENIZE
from .feature_config_base import FeatureConfig


class HFConfig(FeatureConfig):
    """Feature configuration for huggingface text features.

    Supported kwargs
    ----------------
    action: str, required
        The type of huggingface action to use. The only valid value is "tokenize_hf".
    bert_model: str, required
        The name of the language model, i.e., a model identifier from the Hugging Face Model Hub.
    max_seq_length: int, required
        The maximum length of the tokenization results.
    """

    def __init__(self, config: Mapping):
        super().__init__(config)
        self.action = self._transformation_kwargs.get("action")
        self.bert_model = self._transformation_kwargs.get("bert_model")
        self.max_seq_length = self._transformation_kwargs.get("max_seq_length")

        self._sanity_check()

    def _sanity_check(self) -> None:
        super()._sanity_check()
        assert self.action in [
            HUGGINGFACE_TOKENIZE
        ], f"huggingface action needs to be {HUGGINGFACE_TOKENIZE}"
        assert isinstance(
            self.bert_model, str
        ), f"Expect bert_model to be a string, but got {self.bert_model}"
        assert (
            isinstance(self.max_seq_length, int) and self.max_seq_length > 0
        ), f"Expect max_seq_length {self.max_seq_length} to be an integer larger than zero."
5 changes: 5 additions & 0 deletions graphstorm-processing/graphstorm_processing/constants.py
@@ -13,6 +13,7 @@
See the License for the specific language governing permissions and
limitations under the License.
"""

################### Categorical Limits #######################
MAX_CATEGORIES_PER_FEATURE = 100
RARE_CATEGORY = "GSP_CONSTANT_OTHER"
@@ -43,3 +44,7 @@
################# Numerical transformations ################
VALID_IMPUTERS = ["none", "mean", "median", "most_frequent"]
VALID_NORMALIZERS = ["none", "min-max", "standard", "rank-gauss"]

################# Bert transformations ################
HUGGINGFACE_TRANFORM = "huggingface"
HUGGINGFACE_TOKENIZE = "tokenize_hf"
@@ -13,6 +13,7 @@
See the License for the specific language governing permissions and
limitations under the License.
"""

import logging

from pyspark.sql import DataFrame
@@ -26,6 +27,7 @@
    DistBucketNumericalTransformation,
    DistCategoryTransformation,
    DistMultiCategoryTransformation,
    DistHFTransformation,
)


@@ -57,6 +59,8 @@ def __init__(self, feature_config: FeatureConfig):
            self.transformation = DistCategoryTransformation(**default_kwargs, **args_dict)
        elif feat_type == "multi-categorical":
            self.transformation = DistMultiCategoryTransformation(**default_kwargs, **args_dict)
        elif feat_type == "huggingface":
            self.transformation = DistHFTransformation(**default_kwargs, **args_dict)
        else:
            raise NotImplementedError(
                f"Feature {feat_name} has type: {feat_type} that is not supported"
@@ -13,3 +13,4 @@
    DistNumericalTransformation,
)
from .dist_bucket_numerical_transformation import DistBucketNumericalTransformation
from .dist_hf_transformation import DistHFTransformation
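
The DistHFTransformation implementation itself (dist_hf_transformation.py) is registered above, but its diff did not load in this view. As an illustration only, and not the PR's actual code, a minimal sketch of applying Hugging Face tokenization to a Spark DataFrame column could look like this:

from pyspark.sql import DataFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType


def tokenize_column(
    df: DataFrame, column: str, bert_model: str, max_seq_length: int
) -> DataFrame:
    """Add an input_ids column with the token ids of `column` (sketch only)."""
    from transformers import AutoTokenizer

    # Loading the tokenizer on the driver and shipping it in the UDF closure
    # keeps the sketch short; a production version would load it per executor.
    tokenizer = AutoTokenizer.from_pretrained(bert_model)

    @udf(returnType=ArrayType(IntegerType()))
    def tokenize(text):
        # Truncate or pad each document to exactly max_seq_length tokens
        encoded = tokenizer(
            text,
            max_length=max_seq_length,
            truncation=True,
            padding="max_length",
        )
        return encoded["input_ids"]

    return df.withColumn("input_ids", tokenize(df[column]))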