From 8d694f4709c7ab93979719a5b7c6037221c5dfe8 Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Fri, 22 Sep 2023 22:54:10 +0000 Subject: [PATCH 1/3] [GSProcessing] Add documentation --- graphstorm-processing/docs/Makefile | 20 + graphstorm-processing/docs/make.bat | 35 ++ graphstorm-processing/docs/source/conf.py | 53 +++ .../docs/source/developer/developer-guide.rst | 227 ++++++++++ .../source/developer/input-configuration.rst | 390 ++++++++++++++++++ graphstorm-processing/docs/source/index.rst | 154 +++++++ .../docs/source/usage/amazon-sagemaker.rst | 141 +++++++ .../usage/distributed-processing-setup.rst | 136 ++++++ .../docs/source/usage/example.rst | 249 +++++++++++ 9 files changed, 1405 insertions(+) create mode 100644 graphstorm-processing/docs/Makefile create mode 100644 graphstorm-processing/docs/make.bat create mode 100644 graphstorm-processing/docs/source/conf.py create mode 100644 graphstorm-processing/docs/source/developer/developer-guide.rst create mode 100644 graphstorm-processing/docs/source/developer/input-configuration.rst create mode 100644 graphstorm-processing/docs/source/index.rst create mode 100644 graphstorm-processing/docs/source/usage/amazon-sagemaker.rst create mode 100644 graphstorm-processing/docs/source/usage/distributed-processing-setup.rst create mode 100644 graphstorm-processing/docs/source/usage/example.rst diff --git a/graphstorm-processing/docs/Makefile b/graphstorm-processing/docs/Makefile new file mode 100644 index 0000000000..d0c3cbf102 --- /dev/null +++ b/graphstorm-processing/docs/Makefile @@ -0,0 +1,20 @@ +# Minimal makefile for Sphinx documentation +# + +# You can set these variables from the command line, and also +# from the environment for the first two. +SPHINXOPTS ?= +SPHINXBUILD ?= sphinx-build +SOURCEDIR = source +BUILDDIR = build + +# Put it first so that "make" without argument is like "make help". +help: + @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + +.PHONY: help Makefile + +# Catch-all target: route all unknown targets to Sphinx using the new +# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/graphstorm-processing/docs/make.bat b/graphstorm-processing/docs/make.bat new file mode 100644 index 0000000000..6247f7e231 --- /dev/null +++ b/graphstorm-processing/docs/make.bat @@ -0,0 +1,35 @@ +@ECHO OFF + +pushd %~dp0 + +REM Command file for Sphinx documentation + +if "%SPHINXBUILD%" == "" ( + set SPHINXBUILD=sphinx-build +) +set SOURCEDIR=source +set BUILDDIR=build + +if "%1" == "" goto help + +%SPHINXBUILD% >NUL 2>NUL +if errorlevel 9009 ( + echo. + echo.The 'sphinx-build' command was not found. Make sure you have Sphinx + echo.installed, then set the SPHINXBUILD environment variable to point + echo.to the full path of the 'sphinx-build' executable. Alternatively you + echo.may add the Sphinx directory to PATH. + echo. + echo.If you don't have Sphinx installed, grab it from + echo.http://sphinx-doc.org/ + exit /b 1 +) + +%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% +goto end + +:help +%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% + +:end +popd diff --git a/graphstorm-processing/docs/source/conf.py b/graphstorm-processing/docs/source/conf.py new file mode 100644 index 0000000000..c50e1cf877 --- /dev/null +++ b/graphstorm-processing/docs/source/conf.py @@ -0,0 +1,53 @@ +# pylint: skip-file +# Configuration file for the Sphinx documentation builder. 
+# +# This file only contains a selection of the most common options. For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. +# +# import os +# import sys +# sys.path.insert(0, os.path.abspath('.')) + + +# -- Project information ----------------------------------------------------- + +project = 'graphstorm-processing' +copyright = '2023, AGML Team' +author = 'AGML Team' + + +# -- General configuration --------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ +] + +# Add any paths that contain templates here, relative to this directory. +templates_path = ['_templates'] + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. +exclude_patterns = [] + + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = 'alabaster' + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". +html_static_path = ['_static'] diff --git a/graphstorm-processing/docs/source/developer/developer-guide.rst b/graphstorm-processing/docs/source/developer/developer-guide.rst new file mode 100644 index 0000000000..45d2e3ecf0 --- /dev/null +++ b/graphstorm-processing/docs/source/developer/developer-guide.rst @@ -0,0 +1,227 @@ +Developer Guide +--------------- + +The project is set up using ``poetry`` to make easier for developers to +jump into the project. + +The steps we recommend are: + +Install JDK 8, 11 or 17 +~~~~~~~~~~~~~~~~~~~~~~~ + +PySpark requires a compatible Java installation to run, so +you will need to ensure your active JDK is using either +Java 8 or 11. + +On MacOS you can do this using ``brew``: + +.. code-block:: bash + + brew install openjdk@11 + +On Linux it will depend on your distribution's package +manager. For Ubuntu you can use: + +.. code-block:: bash + + sudo apt install openjdk-11-jdk + +On Amazon Linux 2 you can use: + +.. code-block:: bash + + sudo yum install java-11-amazon-corretto-headless + sudo yum install java-11-amazon-corretto-devel + +Install pyenv +~~~~~~~~~~~~~ + +``pyenv`` is a tool to manage multiple Python version installations. It +can be installed through the installer below on a Linux machine: + +.. code-block:: bash + + curl -L https://github.com/pyenv/pyenv-installer/raw/master/bin/pyenv-installer | bash + +or use ``brew`` on a Mac: + +.. code-block:: bash + + brew update + brew install pyenv + +For more info on ``pyenv`` see https://github.com/pyenv/pyenv + +Create a Python 3.9 env and activate it. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We use Python 3.9 in our images so this most closely resembles the +execution environment on SageMaker. + +.. 
code-block:: bash + + pyenv install 3.9 + pyenv global 3.9 + +.. + + Note: We recommend not mixing up conda and pyenv. When developing for + this project, simply ``conda deactivate`` until there's no ``conda`` + env active (even ``base``) and just rely on pyenv+poetry to handle + dependencies. + +Install poetry +~~~~~~~~~~~~~~ + +``poetry`` is a dependency and build management system for Python. To install it +use: + +.. code-block:: bash + + curl -sSL https://install.python-poetry.org | python3 - + +Install dependencies through poetry +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Now we are ready to install our dependencies through ``poetry``. + +We have split the project dependencies into the “main” dependencies that +``poetry`` installs by default, and the ``dev`` dependency group that +installs that dependencies that are only needed to develop the library. + +**On a POSIX system** (tested on Ubuntu, CentOS, MacOS) run: + +.. code-block:: bash + + # Install all dependencies into local .venv + poetry install --with dev + +Once all dependencies are installed you should be able to run the unit +tests for the project and continue with development using: + +.. code-block:: bash + + poetry run pytest ./graphstorm-processing/tests + +You can also activate and use the virtual environment using: + +.. code-block:: bash + + poetry shell + # We're now using the graphstorm-processing-py3.9 env so we can just run + pytest ./graphstorm-processing/tests + +To learn more about poetry see: +https://python-poetry.org/docs/basic-usage/ + +Use ``black`` to format code +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We use `black `__ to +format code in this project. ``black`` is an opinionated formatter that +helps speed up development and code reviews. It is included in our +``dev`` dependencies so it will be installed along with the other dev +dependencies. + +To use ``black`` in the project you can run (from the project's root, +same level as ``pyproject.toml``) + +.. code-block:: bash + + # From the project's root directory, graphstorm-processing run: + black . + +To get a preview of the changes ``black`` would make you can use: + +.. code-block:: bash + + black . --diff --color + +You can auto-formatting with ``black`` to VSCode using the `Black +Formatter `__ + + +Use mypy and pylint to lint code +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We include the ``mypy`` and ``pylint`` linters as a dependency under the ``dev`` group +of dependencies. These linters perform static checks on your code and +can be used in a complimentary manner. + +We recommend using VSCode and enabling the mypy linter to get in-editor +annotations: + +https://code.visualstudio.com/docs/python/linting#_general-settings + +You can also lint the project code through: + +.. code-block:: bash + + poetry run mypy ./graphstorm_processing + +To learn more about ``mypy`` and how it can help development see: +https://mypy.readthedocs.io/en/stable/ + +Our goal is to minimize ``mypy`` errors as much as possible for the +project. New code should be linted and not introduce additional mypy +errors. When necessary it's OK to use ``type: ignore`` to silence +``mypy`` errors inline, but this should be used sparingly. + +As a project, GraphStorm requires a 10/10 pylint score, so +ensure your code conforms to the expectation by running +`pylint --rcfile=/path/to/graphstorm/tests/lint/pylintrc` . +on your code before commits. To make this easier we include +a pre-commit hook below. 
+ +Use a pre-commit hook to ensure black and pylint runs before commits +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To make code formatting and pylint checks easier for graphstorm-processing +developers we recommend using a pre-commit hook. + +We include ``pre-commit`` in the project’s ``dev`` dependencies, so once +you have activated the project's venv (``poetry shell``) you can just +create a file named ``.pre-commit-config.yaml`` with the following contents: + +.. code-block:: yaml + + # .pre-commit-config.yaml + repos: + - repo: https://github.com/psf/black + rev: 23.7.0 + hooks: + - id: black + language_version: python3.9 + files: 'graphstorm_processing\/.*\.pyi?$|tests\/.*\.pyi?$|scripts\/.*\.pyi?$' + exclude: 'python\/.*\.pyi' + - repo: local + hooks: + - id: pylint + name: pylint + entry: pylint + language: system + types: [python] + args: + [ + "--rcfile=./tests/lint/pylintrc" + ] + + +And then run: + +.. code-block:: bash + + pre-commit install + +which will install the ``black`` hook into your local repository and +ensure it runs before every commit. + +.. note:: text + + The pre-commit hook will also apply to all commits you make to the root + GraphStorm repository. Since that one doesn't use ``black``, you might + want to remove the hooks. You can do so from the root repo + using ``rm -rf .git/hooks``. + + Both project use ``pylint`` to check Python files so we'd still recommend using + that hook even if you're doing development for both GSProcessing and GraphStorm. diff --git a/graphstorm-processing/docs/source/developer/input-configuration.rst b/graphstorm-processing/docs/source/developer/input-configuration.rst new file mode 100644 index 0000000000..0cdfa5da01 --- /dev/null +++ b/graphstorm-processing/docs/source/developer/input-configuration.rst @@ -0,0 +1,390 @@ +.. _input-configuration: + +GraphStorm Processing Input Configuration +========================================= + +GraphStorm Processing uses a JSON configuration file to +parse and process the data into the format needed +by GraphStorm partitioning and training downstream. + +We provide scripts that can convert a ``GConstruct`` +input configuration file into one compatible with +GraphStorm Processing so users with existing +``GConstruct`` files can make use of the distributed +processing capabilities of GraphStorm Processing +to scale up their graph processing. + +The input data configuration has two top-level nodes: + +.. code-block:: json + + { + "version": "gsprocessing-v1.0", + "graph": {} + } + +- ``version`` (String, required): The version of configuration file being used. We include + the package name to allow self-contained identification of the file format. +- ``graph`` (JSON object, required): one configuration object that defines each + of the nodes and edges that describe the graph. + +We describe the ``graph`` object next. + +``graph`` configuration object +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``graph`` configuration object can have two top-level nodes: + +.. code-block:: json + + { + "edges": [...], + "nodes": [...] + } + +- ``edges``: (array of JSON objects, required). Each JSON object + in this array describes one edge type and determines how the edge + structure will be parsed. +- ``nodes``: (array of JSON objects, optional). Each JSON object + in this array describes one node type. This key is optional, in case + it is missing, node IDs are derived from the ``edges`` objects. 
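
To make the role of the two arrays concrete, here is a minimal sketch of a
``graph`` object that lists a single edge type and omits ``nodes`` entirely
(the edge object fields are described in the next section, and the node types,
column names, and file path are illustrative assumptions); in this case the
``user`` and ``movie`` node IDs would be derived from the edge files:

.. code-block:: json

    {
        "edges": [
            {
                "data": {
                    "format": "parquet",
                    "files": ["edges/user-rated-movie/"]
                },
                "source": {"column": "user_id", "type": "user"},
                "relation": {"type": "rated"},
                "destination": {"column": "movie_id", "type": "movie"}
            }
        ]
    }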
+ +-------------- + +Contents of an ``edges`` configuration object +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +An ``edges`` configuration object can contain the following top-level +nodes: + +.. code-block:: json + + { + "data": { + "format": "parquet" or "csv", + "files": [>], + "separator": String + }, + "source" : {"column": String, "type": String}, + "relation" : {"column": String, "type": String}, + "destination" : ["column": String, "type": String], + "labels" : [ + { + "column": String, "type": String, + "split_rate": { + "train": Float, + "val": Float, + "test": Float + } + }, + ...] + "features": [{feature_object}, ...] + } + +- ``data`` (JSON Object, required): Describes the physical files + that store the data described in this object. The JSON object has two + top level nodes: + + - ``format`` (String, required): indicates the format the data is + stored in. We accept either ``"csv"`` or ``"parquet"`` as valid + file formats. + + - We will add support for JSON input as P1 feature at a later + point. + + - ``files`` (array of String, required): the physical location of + files. The format accepts two options: + + - a single-element list a with directory-like (ending in ``/``) + **relative** path under which all the files that correspond to + the current edge type are stored. + + - e.g. ``"files": ['path/to/edge/type/']`` + - This option allows for concise listing of entire types and + would be preferred. + + - a multi-element list of **relative** file paths. + + - ``"files": ['path/to/edge/type/file_1.csv', 'path/to/edge/type/file_2.csv']`` + - This option allows for multiple types to be stored under the + same input prefix, but will result in more verbose spec + files. + + - Since the spec expects **relative paths**, the caller is + responsible for providing a path prefix to the execution + engine. The prefix will determine if the source is a local + filesystem or S3, allowing the spec to be portable, i.e. a user + can move the physical files and the spec will still be valid, + as long as the relative structure is kept. + + - ``separator`` (String, optional): Only relevant for CSV files, + determines the separator used between each column in the files. + +- ``source``: (JSON object, required): Describes the source nodes + for the edge type. The top-level keys for the object are: + + - ``column``: (String, required) The name of the column in the + physical data files. + - ``type``: (String, optional) The type name of the nodes. If not + provided, we assume that the column name is the type name. + +- ``destination``: (JSON object, required): Describes the + destination nodes for the edge type. Its format is the same as the + ``source`` key, with a JSON object that contains + ``{“column: String, and ”type“: String}``. +- ``relation``: (JSON object, required): Describes the relation + modeled by the edges. A relation can be common among all edges, or it + can have sub-types. The top-level keys for the object are: + + - ``type`` (String, required): The type of the relation described by + the edges. For example, for a source type ``user``, destination + ``movie`` we can have a relation type ``interacted_with`` for an + edge type ``user:interacted_with:movie``. + - ``column`` (String, optional): If present this column determines + the type of sub-relation described by the edge, breaking up the + edge type into further sub-types. 
+ + - For + ``"type": "interacted_with", "column": "interaction_kind"``, we + might have the values ``watched``, ``rated``, ``shared`` in the + ``interaction_kind`` column, leading to fully qualified edge + types: ``user:interacted_with-watched:movie``, + ``user:interacted_with-rated:movie, user:interacted_with-shared:movie`` + . + +- ``labels`` (List of JSON objects, optional): Describes the label + for the current edge type. The label object is has the following + top-level keys: + + - ``column`` (String, required): The column that contains the values + for the label. Should be the empty string, ``""`` if the ``type`` + key has the value ``"link_prediction"``. + - ``type`` (String, required): The type of the learning task. Can + take the following String values: + + - ``“classification”``: An edge classification task. The values + in the specified ``column`` as treated as categorical + variables. + - ``"regression"``: An edge regression task. The values in the + specified ``column`` are treated as numerical values. + - ``"link_prediction"``: A link prediction tasks. The ``column`` + should be ``""`` in this case. + + - ``separator``: (String, optional): For multi-label classification + tasks, this separator is used within the column to list multiple + classification labels in one entry. + - ``split_rate`` (JSON object, optional): Defines a split rate + for the label items + + - ``train``: The percentage of the data with available labels to + assign to the train set (0.0, 1.0]. + - ``val``: The percentage of the data with available labels to + assign to the train set [0.0, 1.0). + - ``test``: The percentage of the data with available labels to + assign to the train set [0.0, 1.0). + +- ``features`` (List of JSON objects, optional)\ **:** Describes + the set of features for the current edge type. See the **Contents of + a ``features`` configuration object** section for details. + +-------------- + +Contents of a ``nodes`` configuration object +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A node configuration object in a ``nodes`` field can contain the +following top-level keys: + +.. code-block:: json + + { + "data": { + "format": "parquet" or "csv", + "files": [String], + "separator": String + }, + "column" : String, + "type" : String, + "labels" : [{ + "column": String, + "type": String, + "separator": String, + "split_rate": { + "train": Float, + "val": Float, + "test": Float + },...] + }, + "features": [{feature_object}, ...] + } + +- ``data``: (JSON object, required): Has the same definition as for + the edges object, with one top-level key for the ``format`` that + takes a String value, and one for the ``files`` that takes an array + of String values. +- ``column``: (String, required): The column in the data that + corresponds to the column that stores the node ids. +- ``type:`` (String, optional): A type name for the nodes described + in this object. If not provided the ``column`` value is used as the + node type. +- ``labels``: (List of JSON objects, optional): Similar to the + labels object defined for edges, but the values that the ``type`` can + take are different. + + - ``column`` (String, required): The name of the column that + contains the label values. + - ``type`` (String, required): Specifies that target task type which + can be: + + - ``"classification"``: A node classification task. + - ``"regression"``: A node regression task. 
+ + - ``separator`` (String, optional): For multi-label + classification tasks, this separator is used within the column to + list multiple classification labels in one entry. + + - e.g. with separator ``|`` we can have ``action|comedy`` as a + label value. + + - ``split_rate`` (JSON object, optional): Defines a split rate + for the label items + + - ``train``: The percentage of the data with available labels to + assign to the train set (0.0, 1.0]. + - ``val``: The percentage of the data with available labels to + assign to the train set [0.0, 1.0). + - ``test``: The percentage of the data with available labels to + assign to the train set [0.0, 1.0). + +- ``features`` (List of JSON objects, optional): Describes + the set of features for the current edge type. See the **Contents of + a ``features`` configuration object** section for details. + +-------------- + +Contents of a ``features`` configuration object +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +An element of a ``features`` configuration object (for edges or nodes) +can contain the following top-level keys: + +.. code-block:: json + + { + "column": String, + "name": String, + "transformation": { + "name": String, + "kwargs": { + "arg_name": "", + [...] + } + }, + "data": { + "format": "parquet" or "csv", + "files": [>], + "separator": String + } + } + +- ``column`` (String, required): The column that contains the raw + feature values in the dataset +- ``transformation`` (JSON object, optional): The type of + transformation that will be applied to the feature. For details on + the individual transformations supported see the Section **Supported + transformations.** If this key is missing, the feature is treated as + a **no-op** feature without ``kwargs``. + + - ``name`` (String, required): The name of the transformation to be + applied. + - ``kwargs`` (JSON object, optional): A dictionary of parameter + names and values. Each individual transformation will have its own + supported parameters, described in **Supported transformations.** + +- ``name`` (String, optional): The name that will be given to the + encoded feature. If not given, **column** is used as the output name. +- ``data`` (JSON object, optional): If the data for the feature + exist in a file source that's different from the rest of the data of + the node/edge type, they are provided here. **The file source needs + to contain the column names of the parent node/edge type to allow a + 1-1 mapping between the structure and feature files.** For nodes the + node_id column suffices, for edges we need both the source and + destination columns to use as a composite key. + +Supported transformations +~~~~~~~~~~~~~~~~~~~~~~~~~ + +In this section we'll describe the transformations we support. +The name of the transformation is the value that would appear +in the ``transform['name']`` element of the feature configuration, +with the attached ``kwargs`` for the transformations that support +arguments. + +- ``no-op`` + + - Passes along the data as-is to be written to storage and + used in the partitioning pipeline. The data are assumed to be single + values or vectors of floats. + - ``kwargs``: + + - ``separator`` (String, optional): Only relevant for CSV file + sources, when a separator is used to encode vector feature + values into one column. If given, the separator will be used to + split the values in the column and create a vector column + output. Example: for a separator ``'|'`` the CSV value + ``1|2|3`` would be transformed to a vector, ``[1, 2, 3]``. 
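
As an illustration of the above, a minimal ``features`` entry that passes a
vector column through unchanged with the ``no-op`` transformation could look
like the sketch below; the column and feature names are made up for this
example, and the ``separator`` is only needed when the vectors are encoded as
delimited strings in a CSV source:

.. code-block:: json

    {
        "column": "embedding",
        "name": "paper_embedding",
        "transformation": {
            "name": "no-op",
            "kwargs": {
                "separator": ";"
            }
        }
    }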
+ +-------------- + +Examples +~~~~~~~~ + +OAG-Paper dataset +----------------- + +.. code-block:: json + + { + "version" : "gsprocessing-v1.0", + "graph" : { + "edges" : [ + { + "data": { + "format": "csv", + "files": [ + "edges.csv" + ] + }, + "separator": ",", + "source": {"column": "~from", "type": "paper"}, + "dest": {"column": "~to", "type": "paper"}, + "relation": {"type": "cites"} + } + ], + "nodes" : [ + { + "data": { + "format": "csv", + "files": [ + "node_feat.csv" + ] + }, + "separator": ",", + "type": "paper", + "column": "ID", + "labels": [ + { + "column": "field", + "type": "classification", + "separator": ";", + "split_rate": { + "train": 0.7, + "val": 0.1, + "test": 0.2 + } + } + ] + } + ] + } + } diff --git a/graphstorm-processing/docs/source/index.rst b/graphstorm-processing/docs/source/index.rst new file mode 100644 index 0000000000..3702465074 --- /dev/null +++ b/graphstorm-processing/docs/source/index.rst @@ -0,0 +1,154 @@ +.. graphstorm-processing documentation master file, created by + sphinx-quickstart on Tue Aug 1 02:04:45 2023. + You can adapt this file completely to your liking, but it should at least + contain the root `toctree` directive. + +Welcome to GraphStorm Processing documentation! +================================================= + +.. toctree:: + :maxdepth: 1 + :caption: Contents: + + Example + Distributed processing setup + Running on Amazon Sagemaker + Developer Guide + Input configuration + + +GraphStorm Processing allows you to process and prepare massive graph data +for training with GraphStorm. GraphStorm Processing takes care of generating +unique ids for nodes, using them to encode edge structure files, process +individual features and prepare the data to be passed into the +distributed partitioning and training pipeline of GraphStorm. + +We use PySpark to achieve +horizontal parallelism, allowing us to scale to graphs with billions of nodes +and edges. + +.. _installation-ref: + +Installation +------------ + +The project uses Python 3.9. We recommend using `PyEnv `_ +to have isolated Python installations. + +With PyEnv installed you can create and activate a Python 3.9 environment using + +.. code-block:: bash + + pyenv install 3.9 + pyenv local 3.9 + + +With a recent version of ``pip`` installed (we recommend ``pip>=21.3``), you can simply run ``pip install .`` +from the root directory of the project (``graphstorm/graphstorm-processing``), +which should install the library into your environment and pull in all dependencies. + +Install Java 8, 11, or 17 +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Spark has a runtime dependency on the JVM to run, so you'll need to ensure +Java is installed and available on your system. + +On MacOS you can install Java using ``brew``: + +.. code-block:: bash + + brew install openjdk@11 + +On Linux it will depend on your distribution's package +manager. For Ubuntu you can use: + +.. code-block:: bash + + sudo apt install openjdk-11-jdk + +On Amazon Linux 2 you can use: + +.. code-block:: bash + + sudo yum install java-11-amazon-corretto-headless + sudo yum install java-11-amazon-corretto-devel + +To check if Java is installed you can use. + +.. code-block:: bash + + java -version + + +Example +------- + +See the provided :doc:`usage/example` for an example of how to start with tabular +data and convert them into a graph representation before partitioning and +training with GraphStorm. 
+ +Usage +----- + +To use the library to process your data, you will need to have your data +in a tabular format, and a corresponding JSON configuration file that describes the +data. The input data can be in CSV (with header) or Parquet format. + +The configuration file can be in GraphStorm's GConstruct format, +with the caveat that the file paths need to be relative to the +location of the config file. See :doc:`/usage/example` for more details. + +After installing the library, executing a processing job locally can be done using: + +.. code-block:: bash + + gs-processing \ + --config-filename gconstruct-config.json \ + --input-prefix /path/to/input/data \ + --output-prefix /path/to/output/data + +Once the processing engine has processed the data, we want to ensure +they match the requirements of the DGL distributed partitioning +pipeline, so we need to run an additional script that will +make sure the produced data matches the assumptions of DGL [#f1]_. + +.. note:: + + Ensure you pass the output path of the previous step as the input path here. + +.. code-block:: bash + + gs-repartition \ + --input-prefix /path/to/output/data + +Once this script completes, the data are ready to be fed into DGL's distributed +partitioning pipeline. +See `this guide `_ +for more details on how to use GraphStorm distributed partitioning on SageMaker. + +See :doc:`/usage/example` for a detailed walkthrough of using GSProcessing to +wrangle data into a format that's ready to be consumed by the GraphStorm/DGL +partitioning pipeline. + + +Using with Amazon SageMaker +--------------------------- + +To run distributed jobs on Amazon SageMaker we will have to build a Docker image +and push it to the Amazon Elastic Container Registry, which we cover in +:doc:`usage/distributed-processing-setup` and run a SageMaker Processing +job which we describe in :doc:`/usage/amazon-sagemaker`. + + +Developer guide +--------------- + +To get started with developing the package refer to :doc:`/developer/developer-guide`. + + +.. rubric:: Footnotes + +.. [#f1] DGL expects that every file produced for a single node/edge type + has matching row counts, which is something that Spark cannot guarantee. + We use the re-partitioning script to fix this where needed in the produced + output. \ No newline at end of file diff --git a/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst b/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst new file mode 100644 index 0000000000..1a442daf3c --- /dev/null +++ b/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst @@ -0,0 +1,141 @@ +Running distributed jobs on Amazon SageMaker +============================================ + +Once the :doc:`distributed processing setup ` is complete, we can +use the Amazon SageMaker launch scripts to launch distributed processing +jobs that use AWS resources. + +To demonstrate the usage of GSProcessing on Amazon SageMaker, we will execute the same job we used in our local +execution example, but this time use Amazon SageMaker to provide the compute instead of our +local machine. + +Upload data to S3 +----------------- + +Amazon SageMaker uses S3 as its storage target, so before starting +we'll need to upload our test data to S3. To do so you will need +to have read/write access to an S3 bucket, and the requisite AWS credentials +and permissions. + +We will use the AWS CLI to upload data so make sure it is +`installed `_ +and `configured `_ +in you local environment. 
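
If you want to confirm the CLI is ready before uploading, a quick sanity check
like the following (assuming a default profile with valid credentials) should
print the CLI version and the account identity you will be using:

.. code-block:: bash

    # Verify the AWS CLI is installed
    aws --version
    # Verify your credentials resolve to the expected account
    aws sts get-caller-identity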

Assuming ``graphstorm/graphstorm-processing`` is our current working
directory we can upload the test data to S3 using:

.. code-block:: bash

    MY_BUCKET="enter-your-bucket-name-here"
    REGION="bucket-region" # e.g. us-west-2
    aws --region ${REGION} s3 sync ./tests/resources/small_heterogeneous_graph/ \
        "s3://${MY_BUCKET}/gsprocessing-input"

.. note::

    Make sure you are uploading your data to a bucket
    that was created in the same region as the ECR image
    you pushed in :doc:`/usage/distributed-processing-setup`.


Launch the GSProcessing job on Amazon SageMaker
-----------------------------------------------

Once the data are uploaded to S3, we can use the Python script
``graphstorm-processing/scripts/run_distributed_processing.py``
to run a GSProcessing job on Amazon SageMaker.

For this example we'll use a Spark cluster with 2 ``ml.t3.xlarge`` instances
since this is a tiny dataset. Using SageMaker you'll be able to create clusters
of up to 20 instances, allowing you to scale your processing to massive graphs,
using larger instances like `ml.r5.24xlarge`.

Since we're now executing on AWS, we'll need access to an execution role
for SageMaker and the ECR image URI we created in :doc:`/usage/distributed-processing-setup`.
For instructions on how to create an execution role for SageMaker
see the AWS SageMaker documentation on execution roles.

Let's set up a small bash script that will run the parametrized processing
job, followed by the re-partitioning job, both on SageMaker:

.. code-block:: bash

    ACCOUNT="enter-your-account-id-here" # e.g. 1234567890
    MY_BUCKET="enter-your-bucket-name-here"
    SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
    REGION="bucket-region" # e.g. us-west-2
    DATASET_S3_PATH="s3://${MY_BUCKET}/gsprocessing-input"
    OUTPUT_BUCKET=${MY_BUCKET}
    DATASET_NAME="small-graph"
    CONFIG_FILE="gconstruct-config.json"
    INSTANCE_COUNT="2"
    INSTANCE_TYPE="ml.t3.xlarge"
    NUM_FILES="4"

    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing:0.1.0"
    ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"

    OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/${DATASET_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"

    # Conditionally delete data at output
    echo "Delete all data under output path? ${OUTPUT_PREFIX}"
    select yn in "Yes" "No"; do
        case $yn in
            Yes ) aws s3 rm --recursive ${OUTPUT_PREFIX} --quiet; break;;
            No ) break;;
        esac
    done

    # This will run and block until the GSProcessing job is done
    python scripts/run_distributed_processing.py \
        --s3-input-prefix ${DATASET_S3_PATH} \
        --s3-output-prefix ${OUTPUT_PREFIX} \
        --role ${ROLE} \
        --image ${IMAGE_URI} \
        --region ${REGION} \
        --config-filename ${CONFIG_FILE} \
        --instance-count ${INSTANCE_COUNT} \
        --instance-type ${INSTANCE_TYPE} \
        --job-name "${DATASET_NAME}-${INSTANCE_COUNT}x-${INSTANCE_TYPE//./-}-${NUM_FILES}files" \
        --num-output-files ${NUM_FILES} \
        --wait-for-job

    # This will run the follow-up re-partitioning job
    python scripts/run_repartitioning.py --s3-input-prefix ${OUTPUT_PREFIX} \
        --role ${ROLE} --image ${IMAGE_URI} --config-filename "metadata.json" \
        --instance-type ${INSTANCE_TYPE} --wait-for-job


.. note::

    The re-partitioning job runs on a single instance, so for large graphs you will
    want to scale up to an instance with more memory to avoid memory errors. `ml.r5` instances
    should allow you to re-partition graph data with billions of nodes and edges.
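
While the jobs are running you can also monitor them from the command line.
The snippet below is an optional convenience sketch that uses the AWS CLI to
list recent GSProcessing jobs and inspect one of them; the job name to pass is
the one constructed by the launch script above:

.. code-block:: bash

    # List the most recent SageMaker Processing jobs matching our dataset name
    aws sagemaker list-processing-jobs \
        --name-contains "${DATASET_NAME}" \
        --sort-by CreationTime --sort-order Descending --max-results 5

    # Inspect the status and details of a specific job
    aws sagemaker describe-processing-job \
        --processing-job-name "enter-your-processing-job-name-here"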
+ +Examine the output +------------------ + +Once both jobs are finished we can examine the output created, which +should match the output we saw when running the same jobs locally +in :doc:`/usage/example`: + + +.. code-block:: bash + + $ aws s3 ls ${OUTPUT_PREFIX} + + PRE edges/ + PRE node_data/ + PRE node_id_mappings/ + 2023-08-05 00:47:36 804 launch_arguments.json + 2023-08-05 00:47:36 11914 metadata.json + 2023-08-05 00:47:37 545 perf_counters.json + 2023-08-05 00:47:37 12082 updated_row_counts_metadata.json + +Run distributed partitioning and training on Amazon SageMaker +------------------------------------------------------------- + +With the data now processed you can follow the +`GraphStorm Amazon SageMaker guide `_ +to partition your data and run training on AWS. diff --git a/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst b/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst new file mode 100644 index 0000000000..caf4606965 --- /dev/null +++ b/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst @@ -0,0 +1,136 @@ +Distributed Processing setup for Amazon SageMaker +================================================= + +In this guide we'll demonstrate how to prepare your environment to run +GraphStorm Processing (GSP) jobs on Amazon SageMaker. + +We're assuming a Linux host environment used throughout +this tutorial, but other OS should work fine as well. + +The steps required are: + +- Clone the GraphStorm repository. +- Install Docker. +- Install Poetry. +- Set up AWS access. +- Build the GraphStorm Processing image using Docker. +- Push the image to the Amazon Elastic Container Registry (ECR). +- Launch a SageMaker Processing job using the example scripts. + +Clone the GraphStorm repository +------------------------------- + +You can clone the GraphStorm repository using + +.. code-block:: bash + + git clone https://github.com/awslabs/graphstorm.git + +You can then navigate to the ``graphstorm-processing/docker`` directory +that contains the relevant code: + +.. code-block:: bash + + cd ./graphstorm/graphstorm-processing/docker + +Install Docker +-------------- + +To get started with building the GraphStorm Processing image +you'll need to have the Docker engine installed. + + +To install Docker follow the instructions at: +https://docs.docker.com/engine/install/ + +Install Poetry +-------------- + +We use `Poetry `_ as our build +tool and for dependency management, +so we need to install it to facilitate building the library. + +You can install Poetry using: + +.. code-block:: bash + + curl -sSL https://install.python-poetry.org | python3 - + +For detailed installation instructions see: +https://python-poetry.org/docs/ + + +Set up AWS access +----------------- + +To build and push the image to ECR we'll make use of the +``aws-cli`` and we'll need valid AWS credentials as well. + +To install the AWS CLI you can use: + +.. code-block:: bash + + curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" + unzip awscliv2.zip + sudo ./aws/install + +To set up credentials for use with ``aws-cli`` see: +https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-examples + +Your role should have full ECR access to be able to pull from ECR to build the image, +create an ECR repository if it doesn't exist, and push the GSProcessing image to the repository. 
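
Before moving on to the build it can save time to verify that Docker is running
and that your credentials can reach ECR. A quick check (a sketch, assuming the
default ``us-west-2`` region used later in this guide) could be:

.. code-block:: bash

    # Confirm the Docker daemon is up and reachable
    docker info > /dev/null && echo "Docker is ready"

    # Confirm your AWS credentials are allowed to query ECR in the target region
    aws ecr describe-repositories --region us-west-2 --max-results 5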

Building the GraphStorm Processing image using Docker
------------------------------------------------------

Once Docker and Poetry are installed, and your AWS credentials are set up,
we can use the provided scripts
in the ``graphstorm-processing/docker`` directory to build the image.

The ``build_gsprocessing_image.sh`` script can build the image
locally and tag it. For example, assuming our current working directory
is ``graphstorm/graphstorm-processing``:

.. code-block:: bash

    bash docker/build_gsprocessing_image.sh

The above will use the Dockerfile of the latest available GSProcessing version,
build an image and tag it as ``graphstorm-processing:${VERSION}``, where
``${VERSION}`` is the latest available GSProcessing version (e.g. ``0.1.0``).

The script also supports other arguments to customize the image name,
tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
for more information.

Push the image to the Amazon Elastic Container Registry (ECR)
---------------------------------------------------------------

Once the image is built we can use the ``push_gsprocessing_image.sh`` script
that will create an ECR repository if needed and push the image we just built.

The script does not require any arguments and by default will
create a repository named ``graphstorm-processing`` in the ``us-west-2`` region,
on the default AWS account ``aws-cli`` is configured for,
and push the image tagged with the latest version of GSProcessing.

The script supports 4 optional arguments:

1. Image name/repository. (``-i/--image``) Default: ``graphstorm-processing``
2. Image tag. (``-v/--version``) Default: the latest available GSProcessing version, e.g. ``0.1.0``.
3. ECR region. (``-r/--region``) Default: ``us-west-2``.
4. AWS Account ID. (``-a/--account``) Default: Uses the account ID detected by the ``aws-cli``.

Example:

.. code-block:: bash

    bash docker/push_gsprocessing_image.sh -i "graphstorm-processing" -v "0.1.0" -r "us-west-2" -a "1234567890"


Launch a SageMaker Processing job using the example scripts.
------------------------------------------------------------

Once the setup is complete, you can follow the
:doc:`SageMaker Processing job guide </usage/amazon-sagemaker>`
to launch your distributed processing job using AWS resources.
diff --git a/graphstorm-processing/docs/source/usage/example.rst b/graphstorm-processing/docs/source/usage/example.rst
new file mode 100644
index 0000000000..d7305fc6aa
--- /dev/null
+++ b/graphstorm-processing/docs/source/usage/example.rst
@@ -0,0 +1,249 @@
GraphStorm Processing example
=============================

To demonstrate how to use the library locally we will
use the same example data as we use in our
unit tests, which you can find in the project's repository,
under ``graphstorm/graphstorm-processing/tests/resources/small_heterogeneous_graph``.

Install example dependencies
----------------------------

To run the local example you will need to install the GSProcessing
library to your Python environment, and you'll need to clone the
GraphStorm repository to get access to the data.

Follow the :ref:`installation-ref` guide to install the GSProcessing library.

You can clone the repository using:

.. code-block:: bash

    git clone https://github.com/awslabs/graphstorm.git

You can then navigate to the ``graphstorm-processing/`` directory
that contains the relevant data:

..
code-block:: bash

    cd ./graphstorm/graphstorm-processing/


Expected file inputs and configuration
--------------------------------------

GSProcessing expects the input files to be in a specific format that will allow
us to perform the processing and prepare the data for partitioning and training.

The data files are expected to be:

* Tabular data files. We support CSV (with headers) or Parquet format.
  The files can be partitioned (multiple parts), or a single file.
* Available on a local filesystem or on S3.
* One tabular file source per edge and node type. For example, for a particular edge
  type, all node identifiers (source, destination), features, and labels should
  exist as columns in a single file source.

Apart from the data, GSProcessing also requires a configuration file that describes the
data and the transformations we will need to apply to the features and labels. We support
both the GConstruct configuration format, and the library's own GSProcessing format.

.. note::
    We expect end users to only provide a GConstruct configuration file,
    and only use the configuration format of GSProcessing as an intermediate
    layer to decouple the two projects.

    Developers who are looking to use GSProcessing
    as their backend processing engine can either use the GSProcessing configuration
    format directly, or translate their own configuration format to GSProcessing,
    as we do with GConstruct.

For a detailed description of all the entries of the GSProcessing configuration file see
:doc:`/developer/input-configuration`.

Relative file paths required
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The one difference from single-instance GConstruct files is that we require
the file paths listed in the configuration file to be
*relative to the location of the configuration file*. Specifically:

* All file paths listed **must not** start with ``/``.
* Assuming the configuration file is under ``$PATH``, and a filepath is listed as ``${FILEPATH}``
  in the configuration file, the corresponding file is expected to exist at ``${PATH}/${FILEPATH}``.

For example:

.. code-block:: bash

    > pwd
    /home/path/to/data/ # This is the current working directory (cwd)
    > ls
    gconstruct-config.json edge_data # These are the files under the cwd
    > ls edge_data/ # These are the files under the edge_data directory
    movie-included_in-genre.csv

The contents of the ``gconstruct-config.json`` can be:

.. code-block:: json

    {
        "edges" : [
            {
                "files": ["edge_data/movie-included_in-genre.csv"],
                "format": {
                    "name": "csv",
                    "separator" : ","
                }
            }
        ]
    }

Given the above we can run a job with local input data as:

.. code-block:: bash

    > gs-processing --input-data /home/path/to/data \
        --config-filename gsprocessing-config.json

The benefit of using relative paths is that we can move the same files
to any location, including S3, and run the same job without making changes to the config
file:

..
code-block:: bash + + # Move all files to new directory + > mv /home/path/to/data /home/new-path/to/data + # After moving all the files we can still use the same config + > gs-processing --input-data /home/new-path/to/data \ + --config-filename gsprocessing-config.json + + # Upload data to S3 + > aws s3 sync /home/new-path/to/data s3://my-bucket/data/ + # We can still use the same config, just change the prefix to an S3 path + > python run_distributed_processing.py --input-data s3://my-bucket/data \ + --config-filename gsprocessing-config.json + +Node files are optional +^^^^^^^^^^^^^^^^^^^^^^^ + +GSProcessing does not require node files to be provided for +every node type. If a node type appears in one of the edges, +its unique node identifiers will be determined by the edge files. + +In the example GConstruct above, the node ids for the node types +``movie`` and ``genre`` will be extracted from the edge list provided. + +Example data and configuration +------------------------------ + +For this example we use a small heterogeneous graph inspired by the Movielens dataset. +You can see the configuration file under +``graphstorm-processing/tests/resources/small_heterogeneous_graph/gconstruct-config.json`` + +We have 4 node types, ``movie``, ``genre``, ``director``, and ``user``. The graph has 3 +edge types, ``movie:included_in:genre``, ``user:rated:movie``, and ``director:directed:movie``. + +We include one ``no-op`` feature, ``age``, that we directly pass to the output without any transformation, +and one label, ``gender``, that we transform to prepare the data for a node classification task. + + +Run a GSProcessing job locally +------------------------------ + +While GSProcessing is designed to run on distributed clusters on Amazon SageMaker, +we can also run small jobs in a local environment, using a local Spark instance. + +To do so, we will be using the ``gs-processing`` entry point, +to process the data and create the output on our local storage. + +We will provide an input and output prefix for our data, passing +local paths to the script. + +We also provide the argument ``--num-output-files`` that instructs PySpark +to try and create output with 4 partitions [#f1]_. + +Assuming our working directory is ``graphstorm/graphstorm-processing/`` +we can use the following command to run the processing job locally: + +.. code-block:: bash + + gs-processing --config-filename gconstruct-config.json \ + --input-prefix ./tests/resources/small_heterogeneous_graph \ + --output-prefix /tmp/gsprocessing-example/ \ + --num-output-files 4 + + +To finalize processing and to wrangle the data into the structure that +DGL distributed partitioning expects, we need an additional step that +guarantees the data conform to the expectations of DGL: + +.. code-block:: bash + + gs-repartition --input-prefix /tmp/gsprocessing-example/ + + +Examining the job output +------------------------ + +Once the processing and re-partitioning jobs are done, +we can examine the outputs they created. + +.. code-block:: bash + + $ cd /tmp/gsprocessing-example + $ ls + + edges/ launch_arguments.json metadata.json node_data/ + node_id_mappings/ perf_counters.json updated_row_counts_metadata.json + +We have a few JSON files and the data directories containing +the graph structure, features, and labels. In more detail: + +* ``launch_arguments.json``: Contains the arguments that were used + to launch the processing job, allowing you to check the parameters after the + job is finished. 
+* ``updated_row_counts_metadata.json``: + This file is meant to be used as the input configuration for the + distributed partitioning pipeline. ``repartition_files.py`` produces + this file using the original ``metadata.json`` file as input. +* ``metadata.json``: Created by ``gs-processing`` and used as input + for ``repartition_files.py``, can be removed once that script has run. +* ``perf_counters.json``: A JSON file that contains runtime measurements + for the various components of GSProcessing. Can be used to profile the + application and discover bottlenecks. + +The directories created contain: + +* ``edges``: Contains the edge structures, one sub-directory per edge + type. +* ``node_data``: Contains the features for the nodes, one sub-directory + per node type. +* ``node_id_mappings``: Contains mappings from the original node ids to the + ones created by the processing job. This mapping would allow you to trace + back predictions to the original nodes/edges. + +If the graph had included edge features they would appear +in an ``edge_data`` directory. + +At this point you can use the DGL distributed partitioning pipeline +to partition your data, as described in the +`DGL documentation `_ + +To simplify the process of partitioning and training, without the need +to manage your own infrastructure, we recommend using GraphStorm's +`SageMaker wrappers `_ +that do all the hard work for you and allow +you to focus on model development. + +To run GSProcessing jobs on Amazon SageMaker we'll need to follow +:doc:`/usage/distributed-processing-setup` to set up our environment +and :doc:`/usage/amazon-sagemaker` to execute the job. + + +.. rubric:: Footnotes + + +.. [#f1] Note that this is just a hint to the Spark engine, and it's + not guaranteed that the number of output partitions will always match + the requested value. \ No newline at end of file From 3a0fcaeb7bb461602ab0cf3545255d1e17ebe60c Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Mon, 25 Sep 2023 19:58:21 -0700 Subject: [PATCH 2/3] Apply suggestions from code review Co-authored-by: xiang song(charlie.song) --- graphstorm-processing/docs/source/conf.py | 2 +- .../docs/source/developer/input-configuration.rst | 4 ++-- graphstorm-processing/docs/source/index.rst | 6 +++--- .../docs/source/usage/amazon-sagemaker.rst | 2 +- graphstorm-processing/docs/source/usage/example.rst | 2 +- 5 files changed, 8 insertions(+), 8 deletions(-) diff --git a/graphstorm-processing/docs/source/conf.py b/graphstorm-processing/docs/source/conf.py index c50e1cf877..7334ba97ae 100644 --- a/graphstorm-processing/docs/source/conf.py +++ b/graphstorm-processing/docs/source/conf.py @@ -20,7 +20,7 @@ project = 'graphstorm-processing' copyright = '2023, AGML Team' -author = 'AGML Team' +author = 'AGML Team, Amazon' # -- General configuration --------------------------------------------------- diff --git a/graphstorm-processing/docs/source/developer/input-configuration.rst b/graphstorm-processing/docs/source/developer/input-configuration.rst index 0cdfa5da01..3f42c1a694 100644 --- a/graphstorm-processing/docs/source/developer/input-configuration.rst +++ b/graphstorm-processing/docs/source/developer/input-configuration.rst @@ -101,7 +101,7 @@ nodes: - e.g. ``"files": ['path/to/edge/type/']`` - This option allows for concise listing of entire types and - would be preferred. + would be preferred. All the files under the path will be loaded. - a multi-element list of **relative** file paths. 
@@ -236,7 +236,7 @@ following top-level keys: - ``type`` (String, required): Specifies that target task type which can be: - - ``"classification"``: A node classification task. + - ``"classification"``: A node classification task. The values in the specified ``column`` as treated as categorical variables. - ``"regression"``: A node regression task. - ``separator`` (String, optional): For multi-label diff --git a/graphstorm-processing/docs/source/index.rst b/graphstorm-processing/docs/source/index.rst index 3702465074..cc027cbb08 100644 --- a/graphstorm-processing/docs/source/index.rst +++ b/graphstorm-processing/docs/source/index.rst @@ -3,7 +3,7 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -Welcome to GraphStorm Processing documentation! +Welcome to GraphStorm Distributed Data Processing documentation! ================================================= .. toctree:: @@ -17,7 +17,7 @@ Welcome to GraphStorm Processing documentation! Input configuration -GraphStorm Processing allows you to process and prepare massive graph data +GraphStorm Distributed Data Processing allows you to process and prepare massive graph data for training with GraphStorm. GraphStorm Processing takes care of generating unique ids for nodes, using them to encode edge structure files, process individual features and prepare the data to be passed into the @@ -92,7 +92,7 @@ Usage To use the library to process your data, you will need to have your data in a tabular format, and a corresponding JSON configuration file that describes the -data. The input data can be in CSV (with header) or Parquet format. +data. The input data can be in CSV (with header(s)) or Parquet format. The configuration file can be in GraphStorm's GConstruct format, with the caveat that the file paths need to be relative to the diff --git a/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst b/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst index 1a442daf3c..56253fd842 100644 --- a/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst +++ b/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst @@ -46,7 +46,7 @@ Once the data are uploaded to S3, we can use the Python script ``graphstorm-processing/scripts/run_distributed_processing.py`` to run a GSProcessing job on Amazon SageMaker. -For this example we'll use a Spark cluster with 2 ``ml.t3.xlarge`` instances +For this example we'll use a SageMaker Spark cluster with 2 ``ml.t3.xlarge`` instances since this is a tiny dataset. Using SageMaker you'll be able to create clusters of up to 20 instances, allowing you to scale your processing to massive graphs, using larger instances like `ml.r5.24xlarge`. diff --git a/graphstorm-processing/docs/source/usage/example.rst b/graphstorm-processing/docs/source/usage/example.rst index d7305fc6aa..89135d7b1a 100644 --- a/graphstorm-processing/docs/source/usage/example.rst +++ b/graphstorm-processing/docs/source/usage/example.rst @@ -202,7 +202,7 @@ the graph structure, features, and labels. In more detail: * ``launch_arguments.json``: Contains the arguments that were used to launch the processing job, allowing you to check the parameters after the - job is finished. + job finishes. * ``updated_row_counts_metadata.json``: This file is meant to be used as the input configuration for the distributed partitioning pipeline. 
``repartition_files.py`` produces From dfe18139143502a3a274f561b8d86191c22acc07 Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Wed, 27 Sep 2023 20:30:53 +0000 Subject: [PATCH 3/3] Review comments --- .../docs/source/developer/developer-guide.rst | 61 ++-- .../source/developer/input-configuration.rst | 326 ++++++++++-------- .../docs/source/usage/amazon-sagemaker.rst | 17 +- .../usage/distributed-processing-setup.rst | 12 +- .../docs/source/usage/example.rst | 69 ++-- 5 files changed, 280 insertions(+), 205 deletions(-) diff --git a/graphstorm-processing/docs/source/developer/developer-guide.rst b/graphstorm-processing/docs/source/developer/developer-guide.rst index 45d2e3ecf0..1a7faf85db 100644 --- a/graphstorm-processing/docs/source/developer/developer-guide.rst +++ b/graphstorm-processing/docs/source/developer/developer-guide.rst @@ -6,8 +6,8 @@ jump into the project. The steps we recommend are: -Install JDK 8, 11 or 17 -~~~~~~~~~~~~~~~~~~~~~~~ +Install JDK 8, 11 +~~~~~~~~~~~~~~~~~ PySpark requires a compatible Java installation to run, so you will need to ensure your active JDK is using either @@ -33,7 +33,7 @@ On Amazon Linux 2 you can use: sudo yum install java-11-amazon-corretto-headless sudo yum install java-11-amazon-corretto-devel -Install pyenv +Install ``pyenv`` ~~~~~~~~~~~~~ ``pyenv`` is a tool to manage multiple Python version installations. It @@ -50,13 +50,14 @@ or use ``brew`` on a Mac: brew update brew install pyenv -For more info on ``pyenv`` see https://github.com/pyenv/pyenv +For more info on ``pyenv`` see `its documentation. ` Create a Python 3.9 env and activate it. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We use Python 3.9 in our images so this most closely resembles the -execution environment on SageMaker. +execution environment on our Docker images that will be used for distributed +training. .. code-block:: bash @@ -65,12 +66,12 @@ execution environment on SageMaker. .. - Note: We recommend not mixing up conda and pyenv. When developing for + Note: We recommend not mixing up ``conda`` and ``pyenv``. When developing for this project, simply ``conda deactivate`` until there's no ``conda`` - env active (even ``base``) and just rely on pyenv+poetry to handle + env active (even ``base``) and just rely on ``pyenv`` and ``poetry`` to handle dependencies. -Install poetry +Install ``poetry`` ~~~~~~~~~~~~~~ ``poetry`` is a dependency and build management system for Python. To install it @@ -80,7 +81,7 @@ use: curl -sSL https://install.python-poetry.org | python3 - -Install dependencies through poetry +Install dependencies through ``poetry`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now we are ready to install our dependencies through ``poetry``. @@ -111,13 +112,12 @@ You can also activate and use the virtual environment using: # We're now using the graphstorm-processing-py3.9 env so we can just run pytest ./graphstorm-processing/tests -To learn more about poetry see: -https://python-poetry.org/docs/basic-usage/ +To learn more about ``poetry`` see its `documentation `_ -Use ``black`` to format code -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Use ``black`` to format code [optional] +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -We use `black `__ to +We use `black `_ to format code in this project. ``black`` is an opinionated formatter that helps speed up development and code reviews. 
It is included in our ``dev`` dependencies so it will be installed along with the other dev
@@ -148,10 +148,8 @@ We include the ``mypy`` and ``pylint`` linters as a dependency under the ``dev``
 of dependencies. These linters perform static checks on your code and
 can be used in a complimentary manner.
 
-We recommend using VSCode and enabling the mypy linter to get in-editor
-annotations:
-
-https://code.visualstudio.com/docs/python/linting#_general-settings
+We recommend `using VSCode and enabling the mypy linter <https://code.visualstudio.com/docs/python/linting#_general-settings>`_
+to get in-editor annotations.
 
 You can also lint the project code through:
 
@@ -159,8 +157,9 @@ You can also lint the project code through:
 
    poetry run mypy ./graphstorm_processing
 
-To learn more about ``mypy`` and how it can help development see:
-https://mypy.readthedocs.io/en/stable/
+To learn more about ``mypy`` and how it can help development
+`see its documentation <https://mypy.readthedocs.io/en/stable/>`_.
+
 
 Our goal is to minimize ``mypy`` errors as much as possible for the
 project. New code should be linted and not introduce additional mypy
@@ -169,17 +168,21 @@ errors. When necessary it's OK to use ``type: ignore`` to silence
 
 As a project, GraphStorm requires a 10/10 pylint score, so
 ensure your code conforms to the expectation by running
-`pylint --rcfile=/path/to/graphstorm/tests/lint/pylintrc` .
+
+.. code-block:: bash
+
+    pylint --rcfile=/path/to/graphstorm/tests/lint/pylintrc
+
 on your code before commits. To make this easier we include
 a pre-commit hook below.
 
-Use a pre-commit hook to ensure black and pylint runs before commits
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Use a pre-commit hook to ensure ``black`` and ``pylint`` run before commits
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-To make code formatting and pylint checks easier for graphstorm-processing
-developers we recommend using a pre-commit hook.
+To make code formatting and ``pylint`` checks easier for graphstorm-processing
+developers, we recommend using a pre-commit hook.
 
-We include ``pre-commit`` in the project’s ``dev`` dependencies, so once
+We include ``pre-commit`` in the project's ``dev`` dependencies, so once
 you have activated the project's venv (``poetry shell``) you can just
 create a file named ``.pre-commit-config.yaml`` with the following contents:
 
@@ -213,15 +216,15 @@ And then run:
 
    pre-commit install
 
-which will install the ``black`` hook into your local repository and
-ensure it runs before every commit.
+which will install the ``black`` and ``pylint`` hooks into your local repository and
+ensure they run before every commit.
 
-.. note:: text
+.. note::
 
     The pre-commit hook will also apply to all commits you make to the root
-    GraphStorm repository. Since that one doesn't use ``black``, you might
+    GraphStorm repository. Since GraphStorm itself doesn't use ``black``, you might
     want to remove the hooks. You can do so from the root repo using
     ``rm -rf .git/hooks``.
 
-    Both project use ``pylint`` to check Python files so we'd still recommend using
+    Both projects use ``pylint`` to check Python files so we'd still recommend using
     that hook even if you're doing development for both GSProcessing and
     GraphStorm.
diff --git a/graphstorm-processing/docs/source/developer/input-configuration.rst b/graphstorm-processing/docs/source/developer/input-configuration.rst
index 3f42c1a694..e6e2d7ae98 100644
--- a/graphstorm-processing/docs/source/developer/input-configuration.rst
+++ b/graphstorm-processing/docs/source/developer/input-configuration.rst
@@ -7,14 +7,18 @@ GraphStorm Processing uses a JSON configuration file to parse and process
 the data into the format needed by GraphStorm partitioning and training
 downstream.
 
-We provide scripts that can convert a ``GConstruct``
-input configuration file into one compatible with
-GraphStorm Processing so users with existing
-``GConstruct`` files can make use of the distributed
-processing capabilities of GraphStorm Processing
-to scale up their graph processing.
+We use this configuration format as an intermediate
+between other config formats, such as the one used
+by the single-machine GConstruct module.
 
-The input data configuration has two top-level nodes:
+GSProcessing can take a GConstruct-formatted file
+directly, and we also provide `a script `
+that can convert a `GConstruct `
+input configuration file into the ``GSProcessing`` format,
+although this is mostly aimed at developers, as users
+can rely on the automatic conversion.
+
+The GSProcessing input data configuration has two top-level objects:
 
 .. code-block:: json
 
@@ -26,20 +30,20 @@ The input data configuration has two top-level nodes:
 
 - ``version`` (String, required): The version of configuration file being
   used. We include the package name to allow self-contained identification
   of the file format.
 - ``graph`` (JSON object, required): one configuration object that defines each
-  of the nodes and edges that describe the graph.
+  of the node types and edge types that describe the graph.
 
 We describe the ``graph`` object next.
 
 ``graph`` configuration object
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The ``graph`` configuration object can have two top-level nodes:
+The ``graph`` configuration object can have two top-level objects:
 
 .. code-block:: json
 
     {
-        "edges": [...],
-        "nodes": [...]
+        "edges": [{}],
+        "nodes": [{}]
     }
 
 - ``edges``: (array of JSON objects, required). Each JSON object
@@ -55,43 +59,41 @@ Contents of an ``edges`` configuration object
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 An ``edges`` configuration object can contain the following top-level
-nodes:
+objects:
 
 .. code-block:: json
 
     {
         "data": {
-            "format": "parquet" or "csv",
-            "files": [>],
-            "separator": String
+            "format": "String",
+            "files": ["String"],
+            "separator": "String"
         },
-        "source" : {"column": String, "type": String},
-        "relation" : {"column": String, "type": String},
-        "destination" : ["column": String, "type": String],
+        "source": {"column": "String", "type": "String"},
+        "relation": {"type": "String"},
+        "destination": {"column": "String", "type": "String"},
         "labels" : [
-            {
-                "column": String, "type": String,
-                "split_rate": {
-                    "train": Float,
-                    "val": Float,
-                    "test": Float
-                }
-            },
-            ...]
-        "features": [{feature_object}, ...]
+            {
+                "column": "String",
+                "type": "String",
+                "split_rate": {
+                    "train": "Float",
+                    "val": "Float",
+                    "test": "Float"
+                }
+            }
+        ],
+        "features": [{}]
     }
 
 - ``data`` (JSON Object, required): Describes the physical files
   that store the data described in this object. The JSON object has two
-  top level nodes:
+  top level objects:
 
   - ``format`` (String, required): indicates the format the data is
     stored in. We accept either ``"csv"`` or ``"parquet"`` as valid
    file formats.
- - We will add support for JSON input as P1 feature at a later - point. - - ``files`` (array of String, required): the physical location of files. The format accepts two options: @@ -101,7 +103,7 @@ nodes: - e.g. ``"files": ['path/to/edge/type/']`` - This option allows for concise listing of entire types and - would be preferred. All the files under the path will be loaded. + would be preferred. All the files under the path will be loaded. - a multi-element list of **relative** file paths. @@ -134,27 +136,16 @@ nodes: ``{“column: String, and ”type“: String}``. - ``relation``: (JSON object, required): Describes the relation modeled by the edges. A relation can be common among all edges, or it - can have sub-types. The top-level keys for the object are: + can have sub-types. The top-level objects for the object are: - ``type`` (String, required): The type of the relation described by the edges. For example, for a source type ``user``, destination ``movie`` we can have a relation type ``interacted_with`` for an edge type ``user:interacted_with:movie``. - - ``column`` (String, optional): If present this column determines - the type of sub-relation described by the edge, breaking up the - edge type into further sub-types. - - - For - ``"type": "interacted_with", "column": "interaction_kind"``, we - might have the values ``watched``, ``rated``, ``shared`` in the - ``interaction_kind`` column, leading to fully qualified edge - types: ``user:interacted_with-watched:movie``, - ``user:interacted_with-rated:movie, user:interacted_with-shared:movie`` - . - ``labels`` (List of JSON objects, optional): Describes the label - for the current edge type. The label object is has the following - top-level keys: + for the current edge type. The label object has the following + top-level objects: - ``column`` (String, required): The column that contains the values for the label. Should be the empty string, ``""`` if the ``type`` @@ -174,7 +165,8 @@ nodes: tasks, this separator is used within the column to list multiple classification labels in one entry. - ``split_rate`` (JSON object, optional): Defines a split rate - for the label items + for the label items. The sum of the values for ``train``, ``val`` and + ``test`` needs to be 1.0. - ``train``: The percentage of the data with available labels to assign to the train set (0.0, 1.0]. @@ -184,8 +176,7 @@ nodes: assign to the train set [0.0, 1.0). - ``features`` (List of JSON objects, optional)\ **:** Describes - the set of features for the current edge type. See the **Contents of - a ``features`` configuration object** section for details. + the set of features for the current edge type. See the :ref:`features-object` section for details. -------------- @@ -197,26 +188,28 @@ following top-level keys: .. code-block:: json - { - "data": { - "format": "parquet" or "csv", - "files": [String], - "separator": String - }, - "column" : String, - "type" : String, - "labels" : [{ - "column": String, - "type": String, - "separator": String, - "split_rate": { - "train": Float, - "val": Float, - "test": Float - },...] - }, - "features": [{feature_object}, ...] 
- } + { + "data": { + "format": "String", + "files": ["String"], + "separator": "String" + }, + "column" : "String", + "type" : "String", + "labels" : [ + { + "column": "String", + "type": "String", + "separator": "String", + "split_rate": { + "train": "Float", + "val": "Float", + "test": "Float" + } + } + ], + "features": [{}] + } - ``data``: (JSON object, required): Has the same definition as for the edges object, with one top-level key for the ``format`` that @@ -236,18 +229,21 @@ following top-level keys: - ``type`` (String, required): Specifies that target task type which can be: - - ``"classification"``: A node classification task. The values in the specified ``column`` as treated as categorical variables. - - ``"regression"``: A node regression task. + - ``"classification"``: A node classification task. The values in the specified + ``column`` are treated as categorical variables. + - ``"regression"``: A node regression task. The values in the specified + ``column`` are treated as float values. - ``separator`` (String, optional): For multi-label classification tasks, this separator is used within the column to list multiple classification labels in one entry. - - e.g. with separator ``|`` we can have ``action|comedy`` as a + - e.g. with separator ``|`` we can have ``action|comedy`` as a label value. - ``split_rate`` (JSON object, optional): Defines a split rate - for the label items + for the label items. The sum of the values for ``train``, ``val`` and + ``test`` needs to be 1.0. - ``train``: The percentage of the data with available labels to assign to the train set (0.0, 1.0]. @@ -257,11 +253,13 @@ following top-level keys: assign to the train set [0.0, 1.0). - ``features`` (List of JSON objects, optional): Describes - the set of features for the current edge type. See the **Contents of - a ``features`` configuration object** section for details. + the set of features for the current edge type. See the next section, :ref:`features-object` + for details. -------------- +.. _features-object: + Contents of a ``features`` configuration object ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -270,46 +268,88 @@ can contain the following top-level keys: .. code-block:: json - { - "column": String, - "name": String, - "transformation": { - "name": String, - "kwargs": { - "arg_name": "", - [...] - } - }, - "data": { - "format": "parquet" or "csv", - "files": [>], - "separator": String - } - } + { + "column": "String", + "name": "String", + "transformation": { + "name": "String", + "kwargs": { + "arg_name": "" + } + }, + "data": { + "format": "String", + "files": ["String"], + "separator": "String" + } + } - ``column`` (String, required): The column that contains the raw feature values in the dataset - ``transformation`` (JSON object, optional): The type of transformation that will be applied to the feature. For details on - the individual transformations supported see the Section **Supported - transformations.** If this key is missing, the feature is treated as + the individual transformations supported see :ref:`supported-transformations`. + If this key is missing, the feature is treated as a **no-op** feature without ``kwargs``. - ``name`` (String, required): The name of the transformation to be applied. - ``kwargs`` (JSON object, optional): A dictionary of parameter names and values. Each individual transformation will have its own - supported parameters, described in **Supported transformations.** + supported parameters, described in :ref:`supported-transformations`. 
- ``name`` (String, optional): The name that will be given to the encoded feature. If not given, **column** is used as the output name. - ``data`` (JSON object, optional): If the data for the feature exist in a file source that's different from the rest of the data of - the node/edge type, they are provided here. **The file source needs + the node/edge type, they are provided here. For example, you could + have each feature in one file source each: + + .. code-block:: python + + # Example node config with multiple features + { + # This is where the node structure data exist just need an id col + "data": { + "format": "parquet", + "files": ["path/to/node_ids"] + }, + "column" : "node_id", + "type" : "my_node_type", + "features": [ + # Feature 1 + { + "column": "feature_one", + # The files contain one "node_id" col and one "feature_one" col + "data": { + "format": "parquet", + "files": ["path/to/feature_one/"] + } + }, + # Feature 2 + { + "column": "feature_two", + # The files contain one "node_id" col and one "feature_two" col + "data": { + "format": "parquet", + "files": ["path/to/feature_two/"] + } + } + ] + } + + + **The file source needs to contain the column names of the parent node/edge type to allow a - 1-1 mapping between the structure and feature files.** For nodes the - node_id column suffices, for edges we need both the source and - destination columns to use as a composite key. + 1-1 mapping between the structure and feature files.** + + For nodes the + the feature files need to have one column named with the node id column + name, (the value of ``"column"`` for the parent node type), + for edges we need both the ``source`` and + ``destination`` columns to use as a composite key. + +.. _supported-transformations: Supported transformations ~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -344,47 +384,47 @@ OAG-Paper dataset .. code-block:: json - { - "version" : "gsprocessing-v1.0", - "graph" : { - "edges" : [ - { - "data": { - "format": "csv", - "files": [ - "edges.csv" - ] - }, - "separator": ",", - "source": {"column": "~from", "type": "paper"}, - "dest": {"column": "~to", "type": "paper"}, - "relation": {"type": "cites"} - } - ], - "nodes" : [ - { - "data": { - "format": "csv", - "files": [ - "node_feat.csv" - ] - }, - "separator": ",", - "type": "paper", - "column": "ID", - "labels": [ - { - "column": "field", - "type": "classification", - "separator": ";", - "split_rate": { - "train": 0.7, - "val": 0.1, - "test": 0.2 - } - } - ] - } - ] - } - } + { + "version" : "gsprocessing-v1.0", + "graph" : { + "edges" : [ + { + "data": { + "format": "csv", + "files": [ + "edges.csv" + ], + "separator": "," + }, + "source": {"column": "~from", "type": "paper"}, + "dest": {"column": "~to", "type": "paper"}, + "relation": {"type": "cites"} + } + ], + "nodes" : [ + { + "data": { + "format": "csv", + "separator": ",", + "files": [ + "node_feat.csv" + ] + }, + "type": "paper", + "column": "ID", + "labels": [ + { + "column": "field", + "type": "classification", + "separator": ";", + "split_rate": { + "train": 0.7, + "val": 0.1, + "test": 0.2 + } + } + ] + } + ] + } + } diff --git a/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst b/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst index 56253fd842..53fe61c922 100644 --- a/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst +++ b/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst @@ -6,7 +6,7 @@ use the Amazon SageMaker launch scripts to launch distributed processing jobs that use AWS resources. 
To demonstrate the usage of GSProcessing on Amazon SageMaker, we will execute the same job we used in our local
-execution example, but this time use Amazon SageMaker to provide the compute instead of our
+execution example, but this time use Amazon SageMaker to provide the compute resources instead of our
 local machine.
 
 Upload data to S3
@@ -54,7 +54,7 @@ using larger instances like `ml.r5.24xlarge`.
 
 Since we're now executing on AWS, we'll need access to an execution role
 for SageMaker and the ECR image URI we created in :doc:`/usage/distributed-processing-setup`.
 For instructions on how to create an execution role for SageMaker
-see the `AWS SageMaker documentation `.
+see the `AWS SageMaker documentation `_.
 
 Let's set up a small bash script that will run the parametrized processing
 job, followed by the re-partitioning job, both on SageMaker
@@ -113,6 +113,19 @@ job, followed by the re-partitioning job, both on SageMaker
    want to scale up to an instance with more memory to avoid memory errors.
    `ml.r5` instances should allow you to re-partition graph data with billions of nodes
    and edges.
 
+The ``--num-output-files`` parameter
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can see that we provided a parameter named
+``--num-output-files`` to ``run_distributed_processing.py``. This is an
+important parameter, as it provides a hint to set the parallelism for Spark.
+
+It can safely be skipped, letting Spark decide the proper value based on the cluster's
+instance type and count. If you do set it yourself, a good value to use is
+``num_instances * num_cores_per_instance * 2``, which will ensure good
+utilization of the cluster resources.
+
+
 Examine the output
 ------------------
 
diff --git a/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst b/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst
index caf4606965..785dd5a514 100644
--- a/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst
+++ b/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst
@@ -40,8 +40,8 @@ To get started with building the GraphStorm Processing image you'll need to hav
 the Docker engine installed.
 
-To install Docker follow the instructions at:
-https://docs.docker.com/engine/install/
+To install Docker follow the instructions at the
+`official site <https://docs.docker.com/engine/install/>`_.
 
 Install Poetry
 --------------
@@ -56,8 +56,8 @@ You can install Poetry using:
 
    curl -sSL https://install.python-poetry.org | python3 -
 
-For detailed installation instructions see:
-https://python-poetry.org/docs/
+For detailed installation instructions see the
+`Poetry docs <https://python-poetry.org/docs/>`_.
 
 
 Set up AWS access
@@ -74,8 +74,8 @@ To install the AWS CLI you can use:
 
    unzip awscliv2.zip
   sudo ./aws/install
 
-To set up credentials for use with ``aws-cli`` see:
-https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-examples
+To set up credentials for use with ``aws-cli`` see the
+`AWS docs <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-examples>`_.
 
 Your role should have full ECR access to be able to pull from ECR to build the image,
 create an ECR repository if it doesn't exist, and push the GSProcessing image to the repository.
diff --git a/graphstorm-processing/docs/source/usage/example.rst b/graphstorm-processing/docs/source/usage/example.rst
index 89135d7b1a..ab25b5a1f1 100644
--- a/graphstorm-processing/docs/source/usage/example.rst
+++ b/graphstorm-processing/docs/source/usage/example.rst
@@ -37,16 +37,18 @@ us to perform the processing and prepare the data for partitioning and training.
The data files are expected to be: -* Tabular data files. We support CSV with header format, or in Parquet format. - The files can be partitioned (multiple parts), or a single file. -* Available on a local filesystem or on S3. +* Tabular data files. We support CSV-with-header format, or in Parquet format. + The files can be split (multiple parts), or a single file. +* Available on a local file system or on S3. * One tabular file source per edge and node type. For example, for a particular edge type, all node identifiers (source, destination), features, and labels should exist as columns in a single file source. Apart from the data, GSProcessing also requires a configuration file that describes the -data and the transformations we will need to apply to the features and labels. We support -both the GConstruct configuration format, and the library's own GSProcessing format. +data and the transformations we will need to apply to the features and any encoding needed for +labels. +We support both the `GConstruct configuration format `_ +, and the library's own GSProcessing format, described in :doc:`/developer/input-configuration`. .. note:: We expect end users to only provide a GConstruct configuration file, @@ -80,23 +82,25 @@ For example: /home/path/to/data/ # This is the current working directory (cwd) > ls gconstruct-config.json edge_data # These are the files under the cwd - > ls edge_data/ # These are the files under the edge_data director + > ls edge_data/ # These are the files under the edge_data directory movie-included_in-genre.csv The contents of the ``gconstruct-config.json`` can be: -.. code-block:: json +.. code-block:: python { - "edges" : [ - { - "files": ["edges/movie-included_in-genre.csv"], - "format": { - "name": "csv", - "separator" : "," + "edges" : [ + { + # Note that the file is a relative path + "files": ["edges/movie-included_in-genre.csv"], + "format": { + "name": "csv", + "separator" : "," + } + # [...] Other edge config values } - } - ] + ] } Given the above we can run a job with local input data as: @@ -104,7 +108,7 @@ Given the above we can run a job with local input data as: .. code-block:: bash > gs-processing --input-data /home/path/to/data \ - --config-filename gsprocessing-config.json + --config-filename gconstruct-config.json The benefit with using relative paths is that we can move the same files to any location, including S3, and run the same job without making changes to the config @@ -116,13 +120,13 @@ file: > mv /home/path/to/data /home/new-path/to/data # After moving all the files we can still use the same config > gs-processing --input-data /home/new-path/to/data \ - --config-filename gsprocessing-config.json + --config-filename gconstruct-config.json # Upload data to S3 > aws s3 sync /home/new-path/to/data s3://my-bucket/data/ # We can still use the same config, just change the prefix to an S3 path > python run_distributed_processing.py --input-data s3://my-bucket/data \ - --config-filename gsprocessing-config.json + --config-filename gconstruct-config.json Node files are optional ^^^^^^^^^^^^^^^^^^^^^^^ @@ -131,7 +135,7 @@ GSProcessing does not require node files to be provided for every node type. If a node type appears in one of the edges, its unique node identifiers will be determined by the edge files. -In the example GConstruct above, the node ids for the node types +In the example GConstruct file above (`gconstruct-config.json`), the node ids for the node types ``movie`` and ``genre`` will be extracted from the edge list provided. 
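[Editor's illustration, not part of the patch above.] For comparison with the GConstruct file shown earlier, the sketch below shows roughly what an equivalent entry could look like in GSProcessing's own configuration format, following the structure described in :doc:`/developer/input-configuration`. The ``~from`` and ``~to`` column names are placeholders for the real CSV headers, and in practice the automatic GConstruct conversion generates this format for you, so you do not need to write it by hand.

.. code-block:: python

    # Hypothetical GSProcessing-format counterpart of the GConstruct entry above.
    # The "~from" and "~to" column names are placeholders, not the actual headers.
    {
        "version": "gsprocessing-v1.0",
        "graph": {
            "edges": [
                {
                    "data": {
                        "format": "csv",
                        "files": ["edges/movie-included_in-genre.csv"],
                        "separator": ","
                    },
                    "source": {"column": "~from", "type": "movie"},
                    "relation": {"type": "included_in"},
                    "destination": {"column": "~to", "type": "genre"}
                }
            ],
            # Node entries would go here; node ids for "movie" and "genre"
            # can be derived from the edge files, as noted above.
            "nodes": []
        }
    }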
Example data and configuration @@ -151,7 +155,7 @@ and one label, ``gender``, that we transform to prepare the data for a node clas Run a GSProcessing job locally ------------------------------ -While GSProcessing is designed to run on distributed clusters on Amazon SageMaker, +While GSProcessing is designed to run on distributed clusters, we can also run small jobs in a local environment, using a local Spark instance. To do so, we will be using the ``gs-processing`` entry point, @@ -187,7 +191,9 @@ Examining the job output ------------------------ Once the processing and re-partitioning jobs are done, -we can examine the outputs they created. +we can examine the outputs they created. The output will be +compatible with the `Chunked Graph Format of DistDGL `_ +and can be used downstream to create a partitioned graph. .. code-block:: bash @@ -216,23 +222,36 @@ the graph structure, features, and labels. In more detail: The directories created contain: * ``edges``: Contains the edge structures, one sub-directory per edge - type. + type. Each edge file will contain two columns, the source and destination + `numerical` node id, named ``src_int_id`` and ``dist_int_id`` respectively. * ``node_data``: Contains the features for the nodes, one sub-directory - per node type. + per node type. Each file will contain one column named after the original + feature name that contains the value of the feature (could be a scalar or a vector). * ``node_id_mappings``: Contains mappings from the original node ids to the ones created by the processing job. This mapping would allow you to trace - back predictions to the original nodes/edges. + back predictions to the original nodes/edges. The files will have two columns, + ``node_str_id`` that contains the original string ID of the node, and ``node_int_id`` + that contains the numerical id that the string id was mapped to. If the graph had included edge features they would appear in an ``edge_data`` directory. +.. note:: + + It's important to note that files for edges and edge data will have the + same order and row counts per file, as expected by DistDGL. Similarly, + all node feature files will have the same order and row counts, where + the first row corresponds to the feature value for node id 0, the second + for node id 1 etc. + + At this point you can use the DGL distributed partitioning pipeline to partition your data, as described in the `DGL documentation `_ To simplify the process of partitioning and training, without the need to manage your own infrastructure, we recommend using GraphStorm's -`SageMaker wrappers `_ +`SageMaker wrappers `_ that do all the hard work for you and allow you to focus on model development.
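[Editor's illustration, not part of the patch above.] As a closing sketch, here is one way the ``node_id_mappings`` output described above could be used to map predictions, keyed by the numerical node ids, back to the original string ids. The output path, the per-type sub-directory layout, and the ``movie`` node type are assumptions for this example; we only rely on the mapping files being Parquet with the ``node_str_id`` and ``node_int_id`` columns mentioned above.

.. code-block:: python

    # Minimal sketch: map numerical node ids in downstream predictions back to
    # the original string ids, using the node_id_mappings output of the job.
    # The paths below are hypothetical placeholders.
    import pandas as pd

    # Mapping files contain the columns node_str_id and node_int_id
    mapping = pd.read_parquet("output/node_id_mappings/movie/")

    # Stand-in predictions from a trained model, keyed by the numerical node id
    predictions = pd.DataFrame({"node_int_id": [0, 1, 2], "score": [0.91, 0.13, 0.47]})

    # Join back to the original ids so results can be reported in terms of the input data
    results = predictions.merge(mapping, on="node_int_id", how="left")
    print(results[["node_str_id", "score"]])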