Move GSProcessing docs to main repo documentation. (#502)
*Issue #, if available:*

*Description of changes:*

Move the GSProcessing docs under the main repo to allow publishing under a
common readthedocs project.

Add a new "Distributed Processing" section at the index root, rename
"Scale to Giant Graphs" to "Distributed Training" to differentiate
between processing and training.


By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: xiang song(charlie.song) <[email protected]>
thvasilo and classicsong authored Sep 28, 2023
1 parent 97288af commit 601701a
Showing 11 changed files with 94 additions and 167 deletions.
@@ -34,7 +34,7 @@ On Amazon Linux 2 you can use:
sudo yum install java-11-amazon-corretto-devel
Install ``pyenv``
-~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~

``pyenv`` is a tool to manage multiple Python version installations. It
can be installed through the installer below on a Linux machine:
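The installer invocation itself is elided from this hunk; as a sketch, pyenv's upstream installer can typically be run as follows (an assumption — verify against the pyenv documentation):

.. code-block:: bash

   # Upstream pyenv installer (assumed invocation; check the pyenv docs)
   curl https://pyenv.run | bash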
@@ -50,7 +50,7 @@ or use ``brew`` on a Mac:
brew update
brew install pyenv
-For more info on ``pyenv`` see `its documentation. <https://github.com/pyenv/pyenv>`
+For more info on ``pyenv`` see `its documentation. <https://github.com/pyenv/pyenv>`_

Create a Python 3.9 env and activate it.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -72,7 +72,7 @@ training.
dependencies.

Install ``poetry``
-~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~

``poetry`` is a dependency and build management system for Python. To install it
use:
@@ -82,7 +82,7 @@ use:
curl -sSL https://install.python-poetry.org | python3 -
Install dependencies through ``poetry``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now we are ready to install our dependencies through ``poetry``.
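A minimal sketch of that step, run from the package directory:

.. code-block:: bash

   cd graphstorm/graphstorm-processing
   poetry install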

@@ -176,8 +176,8 @@ ensure your code conforms to the expectation by running
on your code before commits. To make this easier we include
a pre-commit hook below.

-Use a pre-commit hook to ensure ``black`` and ``pylint`` runs before commits
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Use a pre-commit hook to ensure ``black`` and ``pylint`` run before commits
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To make code formatting and ``pylint`` checks easier for graphstorm-processing
developers, we recommend using a pre-commit hook.
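The hook configuration itself lies outside this hunk; a hypothetical ``.pre-commit-config.yaml`` along these lines (the repository URL, ``rev`` pin, and local-hook setup are assumptions, not the project's actual file) wires up both tools:

.. code-block:: yaml

   # Hypothetical hook configuration; the repository ships its own version
   repos:
     - repo: https://github.com/psf/black
       rev: 23.7.0
       hooks:
         - id: black
     - repo: local
       hooks:
         - id: pylint
           name: pylint
           entry: pylint
           language: system
           types: [python]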
@@ -216,14 +216,14 @@ And then run:
pre-commit install
-which will install the ``black`` and ``pylin`` hooks into your local repository and
+which will install the ``black`` and ``pylint`` hooks into your local repository and
ensure it runs before every commit.

.. note::

The pre-commit hook will also apply to all commits you make to the root
GraphStorm repository. Since GraphStorm doesn't use ``black``, you might
-want to remove the hooks. You can do so from the root repo
+want to remove the ``black`` hook. You can do so from the root repo
using ``rm -rf .git/hooks``.

Both projects use ``pylint`` to check Python files so we'd still recommend using
@@ -1,24 +1,9 @@
-.. graphstorm-processing documentation master file, created by
-   sphinx-quickstart on Tue Aug 1 02:04:45 2023.
-   You can adapt this file completely to your liking, but it should at least
-   contain the root `toctree` directive.
+GraphStorm Processing Getting Started
+=====================================

-Welcome to GraphStorm Distributed Data Processing documentation!
-=================================================

-.. toctree::
-   :maxdepth: 1
-   :caption: Contents:
-
-   Example <usage/example>
-   Distributed processing setup <usage/distributed-processing-setup>
-   Running on Amazon Sagemaker <usage/amazon-sagemaker>
-   Developer Guide <developer/developer-guide>
-   Input configuration <developer/input-configuration>


-GraphStorm Distributed Data Processing allows you to process and prepare massive graph data
-for training with GraphStorm. GraphStorm Processing takes care of generating
+GraphStorm Distributed Data Processing (GSProcessing) allows you to process and prepare massive graph data
+for training with GraphStorm. GSProcessing takes care of generating
unique ids for nodes, using them to encode edge structure files, process
individual features and prepare the data to be passed into the
distributed partitioning and training pipeline of GraphStorm.
@@ -27,11 +12,17 @@ We use PySpark to achieve
horizontal parallelism, allowing us to scale to graphs with billions of nodes
and edges.

-.. _installation-ref:
+.. _gsp-installation-ref:

Installation
------------

+The project needs Python 3.9 and Java 8 or 11 installed. Below we provide brief
+guides for each requirement.
+
+Install Python 3.9
+^^^^^^^^^^^^^^^^^^

The project uses Python 3.9. We recommend using `PyEnv <https://github.com/pyenv/pyenv>`_
to have isolated Python installations.

@@ -42,13 +33,37 @@ With PyEnv installed you can create and activate a Python 3.9 environment using
pyenv install 3.9
pyenv local 3.9
+Install GSProcessing from source
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With a recent version of ``pip`` installed (we recommend ``pip>=21.3``), you can simply run ``pip install .``
from the root directory of the project (``graphstorm/graphstorm-processing``),
-which should install the library into your environment and pull in all dependencies.
+which should install the library into your environment and pull in all dependencies:

+.. code-block:: bash
+
+   # Ensure Python is at least 3.9
+   python -V
+   cd graphstorm/graphstorm-processing
+   pip install .
-Install Java 8, 11, or 17
-~~~~~~~~~~~~~~~~~~~~~~~~~
+Install GSProcessing using poetry
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+You can also create a local virtual environment using `poetry <https://python-poetry.org/docs/>`_.
+With Python 3.9 and ``poetry`` installed you can run:
+
+.. code-block:: bash
+
+   cd graphstorm/graphstorm-processing
+   # This will create a virtual env under graphstorm-processing/.venv
+   poetry install
+   # This will activate the .venv
+   poetry shell
+Install Java 8 or 11
+^^^^^^^^^^^^^^^^^^^^

Spark has a runtime dependency on the JVM to run, so you'll need to ensure
Java is installed and available on your system.
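You can confirm that a compatible JVM is visible on your ``PATH`` with:

.. code-block:: bash

   java -version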
@@ -87,16 +102,19 @@ See the provided :doc:`usage/example` for an example of how to start with tabular
data and convert them into a graph representation before partitioning and
training with GraphStorm.

-Usage
------
+Running locally
+---------------

+For data that fit into the memory of one machine, you can run jobs locally instead of a
+cluster.

To use the library to process your data, you will need to have your data
in a tabular format, and a corresponding JSON configuration file that describes the
data. The input data can be in CSV (with header(s)) or Parquet format.

The configuration file can be in GraphStorm's GConstruct format,
-with the caveat that the file paths need to be relative to the
-location of the config file. See :doc:`/usage/example` for more details.
+**with the caveat that the file paths need to be relative to the
+location of the config file.** See :ref:`gsp-relative-paths` for more details.
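To make the expected input concrete, here is a minimal sketch of a GConstruct-style configuration (node and edge types, file names, and column names are hypothetical; consult the GConstruct documentation for the authoritative schema):

.. code-block:: json

   {
       "nodes": [
           {
               "node_type": "user",
               "format": {"name": "parquet"},
               "files": ["nodes/user.parquet"],
               "node_id_col": "user_id"
           }
       ],
       "edges": [
           {
               "relation": ["user", "follows", "user"],
               "format": {"name": "parquet"},
               "files": ["edges/follows.parquet"],
               "source_id_col": "src",
               "dest_id_col": "dst"
           }
       ]
   }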

After installing the library, executing a processing job locally can be done using:
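The invocation itself sits outside this hunk; as a sketch (the ``gs-processing`` entry point and flag names are assumptions about the library's CLI, and the file names are hypothetical), a local run might look like:

.. code-block:: bash

   gs-processing \
       --config-filename gconstruct-config.json \
       --input-prefix ./input-data \
       --output-prefix ./output-data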

@@ -126,7 +144,7 @@ partitioning pipeline.
See `this guide <https://github.com/awslabs/graphstorm/blob/main/sagemaker/README.md#launch-graph-partitioning-task>`_
for more details on how to use GraphStorm distributed partitioning on SageMaker.

-See :doc:`/usage/example` for a detailed walkthrough of using GSProcessing to
+See :doc:`usage/example` for a detailed walkthrough of using GSProcessing to
wrangle data into a format that's ready to be consumed by the GraphStorm/DGL
partitioning pipeline.

@@ -137,13 +155,15 @@ Using with Amazon SageMaker
To run distributed jobs on Amazon SageMaker we will have to build a Docker image
and push it to the Amazon Elastic Container Registry, which we cover in
:doc:`usage/distributed-processing-setup` and run a SageMaker Processing
-job which we describe in :doc:`/usage/amazon-sagemaker`.
+job which we describe in :doc:`usage/amazon-sagemaker`.
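As a quick preview, the image build relies on the repository's build script; the ``--target`` option is taken from the script's own help text, while the invocation below is illustrative:

.. code-block:: bash

   cd graphstorm/graphstorm-processing
   bash docker/build_gsprocessing_image.sh --target prod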


Developer guide
---------------

-To get started with developing the package refer to :doc:`/developer/developer-guide`.
+To get started with developing the package refer to :doc:`developer/developer-guide`.
+To see the input configuration format that GSProcessing uses internally see
+:doc:`developer/input-configuration`.


.. rubric:: Footnotes
@@ -36,7 +36,7 @@ directory we can upload the test data to S3 using:

Make sure you are uploading your data to a bucket
that was created in the same region as the ECR image
-you pushed in :doc:`/usage/distributed-processing-setup`.
+you pushed in :doc:`distributed-processing-setup`.
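For example (bucket and prefix names here are hypothetical), an upload with the AWS CLI could look like:

.. code-block:: bash

   aws s3 sync ./gsprocessing-input s3://my-gsprocessing-bucket/input/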


Launch the GSProcessing job on Amazon SageMaker
@@ -52,12 +52,12 @@ of up to 20 instances, allowing you to scale your processing to massive graphs,
using larger instances like `ml.r5.24xlarge`.

Since we're now executing on AWS, we'll need access to an execution role
-for SageMaker and the ECR image URI we created in :doc:`/usage/distributed-processing-setup`.
+for SageMaker and the ECR image URI we created in :doc:`distributed-processing-setup`.
For instructions on how to create an execution role for SageMaker
see the `AWS SageMaker documentation <https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-create-execution-role>`_.

-Let's set up a small bash script that will run the parametrized processing
-job, followed by the re-partitioning job, both on SageMaker
+Let's set up a small ``bash`` script that will run the parametrized processing
+job, followed by the re-partitioning job, both on SageMaker:

.. code-block:: bash
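   # The script body is elided in this hunk. The sketch below is hypothetical:
   # the launcher script names and flags are assumptions, not verified CLI.
   MY_BUCKET="my-gsprocessing-bucket"                  # hypothetical bucket
   ROLE="arn:aws:iam::123456789012:role/SageMakerRole" # hypothetical role
   IMAGE_URI="123456789012.dkr.ecr.us-east-1.amazonaws.com/graphstorm-processing:0.1.0"

   python scripts/run_distributed_processing.py \
       --s3-input-prefix "s3://${MY_BUCKET}/input" \
       --s3-output-prefix "s3://${MY_BUCKET}/output" \
       --role "${ROLE}" \
       --image "${IMAGE_URI}"

   python scripts/run_repartitioning.py \
       --s3-input-prefix "s3://${MY_BUCKET}/output" \
       --role "${ROLE}" \
       --image "${IMAGE_URI}"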
@@ -131,7 +131,7 @@ Examine the output

Once both jobs are finished we can examine the output created, which
should match the output we saw when running the same jobs locally
-in :doc:`/usage/example`:
+in :ref:`gsp-examining-output`.


.. code-block:: bash
@@ -1,8 +1,8 @@
-Distributed Processing setup for Amazon SageMaker
-=================================================
+GraphStorm Processing setup for Amazon SageMaker
+================================================

In this guide we'll demonstrate how to prepare your environment to run
-GraphStorm Processing (GSP) jobs on Amazon SageMaker.
+GraphStorm Processing (GSProcessing) jobs on Amazon SageMaker.

We're assuming a Linux host environment used throughout
this tutorial, but other OS should work fine as well.
@@ -1,4 +1,4 @@
-GraphStorm Processing example
+GraphStorm Processing Example
=============================

To demonstrate how to use the library locally we will
@@ -13,7 +13,7 @@ To run the local example you will need to install the GSProcessing
library to your Python environment, and you'll need to clone the
GraphStorm repository to get access to the data.

-Follow the :ref:`installation-ref` guide to install the GSProcessing library.
+Follow the :ref:`gsp-installation-ref` guide to install the GSProcessing library.

You can clone the repository using
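for example, over HTTPS (the repository URL appears elsewhere in these docs):

.. code-block:: bash

   git clone https://github.com/awslabs/graphstorm.git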

@@ -48,7 +48,7 @@ Apart from the data, GSProcessing also requires a configuration file that describes the
data and the transformations we will need to apply to the features and any encoding needed for
labels.
We support both the `GConstruct configuration format <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
-, and the library's own GSProcessing format, described in :doc:`/developer/input-configuration`.
+, and the library's own GSProcessing format, described in :doc:`/gs-processing/developer/input-configuration`.

.. note::
   We expect end users to only provide a GConstruct configuration file,
@@ -61,7 +61,9 @@ We support both the `GConstruct configuration format <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
   as we do with GConstruct.

For a detailed description of all the entries of the GSProcessing configuration file see
-:doc:`/developer/input-configuration`.
+:doc:`/gs-processing/developer/input-configuration`.

+.. _gsp-relative-paths:

Relative file paths required
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -186,6 +188,7 @@ guarantees the data conform to the expectations of DGL:
gs-repartition --input-prefix /tmp/gsprocessing-example/
+.. _gsp-examining-output:

Examining the job output
------------------------
@@ -248,16 +251,19 @@ in an ``edge_data`` directory.
At this point you can use the DGL distributed partitioning pipeline
to partition your data, as described in the
`DGL documentation <https://docs.dgl.ai/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_
.

To simplify the process of partitioning and training, without the need
to manage your own infrastructure, we recommend using GraphStorm's
`SageMaker wrappers <https://graphstorm.readthedocs.io/en/latest/scale/sagemaker.html>`_
that do all the hard work for you and allow
-you to focus on model development.
+you to focus on model development. In particular you can follow the GraphStorm documentation to run
+`distributed partitioning on SageMaker <https://github.com/awslabs/graphstorm/tree/main/sagemaker#launch-graph-partitioning-task>`_.


To run GSProcessing jobs on Amazon SageMaker we'll need to follow
-:doc:`/usage/distributed-processing-setup` to set up our environment
-and :doc:`/usage/amazon-sagemaker` to execute the job.
+:doc:`/gs-processing/usage/distributed-processing-setup` to set up our environment
+and :doc:`/gs-processing/usage/amazon-sagemaker` to execute the job.


.. rubric:: Footnotes
15 changes: 13 additions & 2 deletions docs/source/index.rst
@@ -15,7 +15,18 @@ Welcome to the GraphStorm Documentation and Tutorials

.. toctree::
   :maxdepth: 1
-   :caption: Scale to Giant Graphs
+   :caption: Distributed Processing
+   :hidden:
+   :glob:
+
+   gs-processing/gs-processing-getting-started
+   gs-processing/usage/example
+   gs-processing/usage/distributed-processing-setup
+   gs-processing/usage/amazon-sagemaker
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Distributed Training
   :hidden:
   :glob:

@@ -52,7 +63,7 @@ Getting Started

For beginners, please first start with the :ref:`GraphStorm Docker environment setup<setup>`. This tutorial covers how to set up a Docker environment and build a GraphStorm Docker image, which serves as the Standalone running environment for GraphStorm. We are working on supporting more running environments for GraphStorm.

Once successfully set up the GraphStorm Docker running environment,

- follow the :ref:`GraphStorm Standalone Mode Quick-Start Tutorial<quick-start-standalone>` to run examples using GraphStorm built-in data and models, hence getting familiar with GraphStorm's usage of training and inference.
- follow the :ref:`Use Your Own Graph Data Tutorial<use-own-data>` to prepare your own graph data for using GraphStorm.
6 changes: 2 additions & 4 deletions graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -15,7 +15,7 @@ Available options:
-h, --help Print this help and exit
-x, --verbose Print script debug info (set -x)
--t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'prod'.
+-t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'test'.
-p, --path Path to graphstorm-processing directory, default is one level above this script.
-i, --image Docker image name, default is 'graphstorm-processing'.
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
@@ -41,6 +41,7 @@ parse_params() {
IMAGE_NAME='graphstorm-processing'
VERSION=`poetry version --short`
BUILD_DIR='/tmp'
+TARGET='test'
while :; do
case "${1-}" in
@@ -75,9 +76,6 @@ parse_params() {
args=("$@")
# check required params and arguments
-[[ -z "${TARGET-}" ]] && die "Missing required parameter: --target [prod|test]"
return 0
}
20 changes: 0 additions & 20 deletions graphstorm-processing/docs/Makefile

This file was deleted.

