diff --git a/graphstorm-processing/docs/source/developer/developer-guide.rst b/docs/source/gs-processing/developer/developer-guide.rst similarity index 94% rename from graphstorm-processing/docs/source/developer/developer-guide.rst rename to docs/source/gs-processing/developer/developer-guide.rst index 1a7faf85db..385da9ec7d 100644 --- a/graphstorm-processing/docs/source/developer/developer-guide.rst +++ b/docs/source/gs-processing/developer/developer-guide.rst @@ -34,7 +34,7 @@ On Amazon Linux 2 you can use: sudo yum install java-11-amazon-corretto-devel Install ``pyenv`` -~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~ ``pyenv`` is a tool to manage multiple Python version installations. It can be installed through the installer below on a Linux machine: @@ -50,7 +50,7 @@ or use ``brew`` on a Mac: brew update brew install pyenv -For more info on ``pyenv`` see `its documentation. ` +For more info on ``pyenv`` see `its documentation. `_ Create a Python 3.9 env and activate it. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -72,7 +72,7 @@ training. dependencies. Install ``poetry`` -~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~ ``poetry`` is a dependency and build management system for Python. To install it use: @@ -82,7 +82,7 @@ use: curl -sSL https://install.python-poetry.org | python3 - Install dependencies through ``poetry`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now we are ready to install our dependencies through ``poetry``. @@ -176,8 +176,8 @@ ensure your code conforms to the expectation by running on your code before commits. To make this easier we include a pre-commit hook below. 
-Use a pre-commit hook to ensure ``black`` and ``pylint`` runs before commits -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Use a pre-commit hook to ensure ``black`` and ``pylint`` run before commits +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To make code formatting and ``pylint`` checks easier for graphstorm-processing developers, we recommend using a pre-commit hook. @@ -216,14 +216,14 @@ And then run: pre-commit install -which will install the ``black`` and ``pylin`` hooks into your local repository and +which will install the ``black`` and ``pylint`` hooks into your local repository and ensure it runs before every commit. .. note:: The pre-commit hook will also apply to all commits you make to the root GraphStorm repository. Since that Graphstorm doesn't use ``black``, you might - want to remove the hooks. You can do so from the root repo + want to remove the ``black`` hook. You can do so from the root repo using ``rm -rf .git/hooks``. Both projects use ``pylint`` to check Python files so we'd still recommend using diff --git a/graphstorm-processing/docs/source/developer/input-configuration.rst b/docs/source/gs-processing/developer/input-configuration.rst similarity index 100% rename from graphstorm-processing/docs/source/developer/input-configuration.rst rename to docs/source/gs-processing/developer/input-configuration.rst diff --git a/graphstorm-processing/docs/source/index.rst b/docs/source/gs-processing/gs-processing-getting-started.rst similarity index 68% rename from graphstorm-processing/docs/source/index.rst rename to docs/source/gs-processing/gs-processing-getting-started.rst index cc027cbb08..648d1b7de6 100644 --- a/graphstorm-processing/docs/source/index.rst +++ b/docs/source/gs-processing/gs-processing-getting-started.rst @@ -1,24 +1,9 @@ -.. graphstorm-processing documentation master file, created by - sphinx-quickstart on Tue Aug 1 02:04:45 2023. 
- You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. +GraphStorm Processing Getting Started +===================================== -Welcome to GraphStorm Distributed Data Processing documentation! -================================================= -.. toctree:: - :maxdepth: 1 - :caption: Contents: - - Example - Distributed processing setup - Running on Amazon Sagemaker - Developer Guide - Input configuration - - -GraphStorm Distributed Data Processing allows you to process and prepare massive graph data -for training with GraphStorm. GraphStorm Processing takes care of generating +GraphStorm Distributed Data Processing (GSProcessing) allows you to process and prepare massive graph data +for training with GraphStorm. GSProcessing takes care of generating unique ids for nodes, using them to encode edge structure files, process individual features and prepare the data to be passed into the distributed partitioning and training pipeline of GraphStorm. @@ -27,11 +12,17 @@ We use PySpark to achieve horizontal parallelism, allowing us to scale to graphs with billions of nodes and edges. -.. _installation-ref: +.. _gsp-installation-ref: Installation ------------ +The project needs Python 3.9 and Java 8 or 11 installed. Below we provide brief +guides for each requirement. + +Install Python 3.9 +^^^^^^^^^^^^^^^^^^ + The project uses Python 3.9. We recommend using `PyEnv `_ to have isolated Python installations. @@ -42,13 +33,37 @@ With PyEnv installed you can create and activate a Python 3.9 environment using pyenv install 3.9 pyenv local 3.9 +Install GSProcessing from source +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ With a recent version of ``pip`` installed (we recommend ``pip>=21.3``), you can simply run ``pip install .`` from the root directory of the project (``graphstorm/graphstorm-processing``), -which should install the library into your environment and pull in all dependencies. 
+which should install the library into your environment and pull in all dependencies: + +.. code-block:: bash + + # Ensure Python is at least 3.9 + python -V + cd graphstorm/graphstorm-processing + pip install . -Install Java 8, 11, or 17 -~~~~~~~~~~~~~~~~~~~~~~~~~ +Install GSProcessing using poetry +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +You can also create a local virtual environment using `poetry `_. +With Python 3.9 and ``poetry`` installed you can run: + +.. code-block:: bash + + cd graphstorm/graphstorm-processing + # This will create a virtual env under graphstorm-processing/.venv + poetry install + # This will activate the .venv + poetry shell + + +Install Java 8 or 11 +^^^^^^^^^^^^^^^^^^^^ Spark has a runtime dependency on the JVM to run, so you'll need to ensure Java is installed and available on your system. @@ -87,16 +102,19 @@ See the provided :doc:`usage/example` for an example of how to start with tabula data and convert them into a graph representation before partitioning and training with GraphStorm. -Usage ------ +Running locally +--------------- + +For data that fits into the memory of a single machine, you can run jobs locally instead of +on a cluster. To use the library to process your data, you will need to have your data in a tabular format, and a corresponding JSON configuration file that describes the data. The input data can be in CSV (with header(s)) or Parquet format. The configuration file can be in GraphStorm's GConstruct format, -with the caveat that the file paths need to be relative to the -location of the config file. See :doc:`/usage/example` for more details. +**with the caveat that the file paths need to be relative to the +location of the config file.** See :ref:`gsp-relative-paths` for more details. After installing the library, executing a processing job locally can be done using: @@ -126,7 +144,7 @@ partitioning pipeline. See `this guide `_ for more details on how to use GraphStorm distributed partitioning on SageMaker.
-See :doc:`/usage/example` for a detailed walkthrough of using GSProcessing to +See :doc:`usage/example` for a detailed walkthrough of using GSProcessing to wrangle data into a format that's ready to be consumed by the GraphStorm/DGL partitioning pipeline. @@ -137,13 +155,15 @@ Using with Amazon SageMaker To run distributed jobs on Amazon SageMaker we will have to build a Docker image and push it to the Amazon Elastic Container Registry, which we cover in :doc:`usage/distributed-processing-setup` and run a SageMaker Processing -job which we describe in :doc:`/usage/amazon-sagemaker`. +job which we describe in :doc:`usage/amazon-sagemaker`. Developer guide --------------- -To get started with developing the package refer to :doc:`/developer/developer-guide`. +To get started with developing the package, refer to :doc:`developer/developer-guide`. +To see the input configuration format that GSProcessing uses internally, see +:doc:`developer/input-configuration`. .. rubric:: Footnotes diff --git a/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst b/docs/source/gs-processing/usage/amazon-sagemaker.rst similarity index 95% rename from graphstorm-processing/docs/source/usage/amazon-sagemaker.rst rename to docs/source/gs-processing/usage/amazon-sagemaker.rst index 53fe61c922..8ab8f65bec 100644 --- a/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst +++ b/docs/source/gs-processing/usage/amazon-sagemaker.rst @@ -36,7 +36,7 @@ directory we can upload the test data to S3 using: Make sure you are uploading your data to a bucket that was created in the same region as the ECR image - you pushed in :doc:`/usage/distributed-processing-setup`. + you pushed in :doc:`distributed-processing-setup`. Launch the GSProcessing job on Amazon SageMaker @@ -52,12 +52,12 @@ of up to 20 instances, allowing you to scale your processing to massive graphs, using larger instances like `ml.r5.24xlarge`.
Since we're now executing on AWS, we'll need access to an execution role -for SageMaker and the ECR image URI we created in :doc:`/usage/distributed-processing-setup`. +for SageMaker and the ECR image URI we created in :doc:`distributed-processing-setup`. For instructions on how to create an execution role for SageMaker see the `AWS SageMaker documentation `_. -Let's set up a small bash script that will run the parametrized processing -job, followed by the re-partitioning job, both on SageMaker +Let's set up a small ``bash`` script that will run the parametrized processing +job, followed by the re-partitioning job, both on SageMaker: .. code-block:: bash @@ -131,7 +131,7 @@ Examine the output Once both jobs are finished we can examine the output created, which should match the output we saw when running the same jobs locally -in :doc:`/usage/example`: +in :ref:`gsp-examining-output`: .. code-block:: bash diff --git a/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst b/docs/source/gs-processing/usage/distributed-processing-setup.rst similarity index 96% rename from graphstorm-processing/docs/source/usage/distributed-processing-setup.rst rename to docs/source/gs-processing/usage/distributed-processing-setup.rst index 785dd5a514..e6ca745bba 100644 --- a/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst +++ b/docs/source/gs-processing/usage/distributed-processing-setup.rst @@ -1,8 +1,8 @@ -Distributed Processing setup for Amazon SageMaker -================================================= +GraphStorm Processing setup for Amazon SageMaker +================================================ In this guide we'll demonstrate how to prepare your environment to run -GraphStorm Processing (GSP) jobs on Amazon SageMaker. +GraphStorm Processing (GSProcessing) jobs on Amazon SageMaker. We're assuming a Linux host environment used throughout this tutorial, but other OS should work fine as well.
diff --git a/graphstorm-processing/docs/source/usage/example.rst b/docs/source/gs-processing/usage/example.rst similarity index 94% rename from graphstorm-processing/docs/source/usage/example.rst rename to docs/source/gs-processing/usage/example.rst index ab25b5a1f1..98c2327cbb 100644 --- a/graphstorm-processing/docs/source/usage/example.rst +++ b/docs/source/gs-processing/usage/example.rst @@ -1,4 +1,4 @@ -GraphStorm Processing example +GraphStorm Processing Example ============================= To demonstrate how to use the library locally we will @@ -13,7 +13,7 @@ To run the local example you will need to install the GSProcessing library to your Python environment, and you'll need to clone the GraphStorm repository to get access to the data. -Follow the :ref:`installation-ref` guide to install the GSProcessing library. +Follow the :ref:`gsp-installation-ref` guide to install the GSProcessing library. You can clone the repository using @@ -48,7 +48,7 @@ Apart from the data, GSProcessing also requires a configuration file that descri data and the transformations we will need to apply to the features and any encoding needed for labels. We support both the `GConstruct configuration format `_ -, and the library's own GSProcessing format, described in :doc:`/developer/input-configuration`. +, and the library's own GSProcessing format, described in :doc:`/gs-processing/developer/input-configuration`. .. note:: We expect end users to only provide a GConstruct configuration file, @@ -61,7 +61,9 @@ We support both the `GConstruct configuration format `_ +. To simplify the process of partitioning and training, without the need to manage your own infrastructure, we recommend using GraphStorm's `SageMaker wrappers `_ that do all the hard work for you and allow -you to focus on model development. +you to focus on model development. In particular, you can follow the GraphStorm documentation to run +`distributed partitioning on SageMaker `_.
+ To run GSProcessing jobs on Amazon SageMaker we'll need to follow -:doc:`/usage/distributed-processing-setup` to set up our environment -and :doc:`/usage/amazon-sagemaker` to execute the job. +:doc:`/gs-processing/usage/distributed-processing-setup` to set up our environment +and :doc:`/gs-processing/usage/amazon-sagemaker` to execute the job. .. rubric:: Footnotes diff --git a/docs/source/index.rst b/docs/source/index.rst index f745ad9913..2175c67a67 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -15,7 +15,18 @@ Welcome to the GraphStorm Documentation and Tutorials .. toctree:: :maxdepth: 1 - :caption: Scale to Giant Graphs + :caption: Distributed Processing + :hidden: + :glob: + + gs-processing/gs-processing-getting-started + gs-processing/usage/example + gs-processing/usage/distributed-processing-setup + gs-processing/usage/amazon-sagemaker + +.. toctree:: + :maxdepth: 1 + :caption: Distributed Training :hidden: :glob: @@ -52,7 +63,7 @@ Getting Started For beginners, please first start with the :ref:`GraphStorm Docker environment setup`. This tutorial covers how to set up a Docker environment and build a GraphStorm Docker image, which serves as the Standalone running environment for GraphStorm. We are working on supporting more running environments for GraphStorm. -Once successfully set up the GraphStorm Docker running environment, +Once you have successfully set up the GraphStorm Docker running environment, - follow the :ref:`GraphStorm Standalone Mode Quick-Start Tutorial` to run examples using GraphStorm built-in data and models, hence getting familiar with GraphStorm's usage of training and inference. - follow the :ref:`Use Your Own Graph Data Tutorial` to prepare your own graph data for using GraphStorm.
diff --git a/graphstorm-processing/docker/build_gsprocessing_image.sh b/graphstorm-processing/docker/build_gsprocessing_image.sh index 37c701da9f..d9ffe30316 100644 --- a/graphstorm-processing/docker/build_gsprocessing_image.sh +++ b/graphstorm-processing/docker/build_gsprocessing_image.sh @@ -15,7 +15,7 @@ Available options: -h, --help Print this help and exit -x, --verbose Print script debug info (set -x) --t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'prod'. +-t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'test'. -p, --path Path to graphstorm-processing directory, default is one level above this script. -i, --image Docker image name, default is 'graphstorm-processing'. -v, --version Docker version tag, default is the library's current version (`poetry version --short`) @@ -41,6 +41,7 @@ parse_params() { IMAGE_NAME='graphstorm-processing' VERSION=`poetry version --short` BUILD_DIR='/tmp' + TARGET='test' while :; do case "${1-}" in @@ -75,9 +76,6 @@ parse_params() { args=("$@") - # check required params and arguments - [[ -z "${TARGET-}" ]] && die "Missing required parameter: --target [prod|test]" - return 0 } diff --git a/graphstorm-processing/docs/Makefile b/graphstorm-processing/docs/Makefile deleted file mode 100644 index d0c3cbf102..0000000000 --- a/graphstorm-processing/docs/Makefile +++ /dev/null @@ -1,20 +0,0 @@ -# Minimal makefile for Sphinx documentation -# - -# You can set these variables from the command line, and also -# from the environment for the first two. -SPHINXOPTS ?= -SPHINXBUILD ?= sphinx-build -SOURCEDIR = source -BUILDDIR = build - -# Put it first so that "make" without argument is like "make help". -help: - @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) - -.PHONY: help Makefile - -# Catch-all target: route all unknown targets to Sphinx using the new -# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 
-%: Makefile - @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/graphstorm-processing/docs/make.bat b/graphstorm-processing/docs/make.bat deleted file mode 100644 index 6247f7e231..0000000000 --- a/graphstorm-processing/docs/make.bat +++ /dev/null @@ -1,35 +0,0 @@ -@ECHO OFF - -pushd %~dp0 - -REM Command file for Sphinx documentation - -if "%SPHINXBUILD%" == "" ( - set SPHINXBUILD=sphinx-build -) -set SOURCEDIR=source -set BUILDDIR=build - -if "%1" == "" goto help - -%SPHINXBUILD% >NUL 2>NUL -if errorlevel 9009 ( - echo. - echo.The 'sphinx-build' command was not found. Make sure you have Sphinx - echo.installed, then set the SPHINXBUILD environment variable to point - echo.to the full path of the 'sphinx-build' executable. Alternatively you - echo.may add the Sphinx directory to PATH. - echo. - echo.If you don't have Sphinx installed, grab it from - echo.http://sphinx-doc.org/ - exit /b 1 -) - -%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% -goto end - -:help -%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% - -:end -popd diff --git a/graphstorm-processing/docs/source/conf.py b/graphstorm-processing/docs/source/conf.py deleted file mode 100644 index 7334ba97ae..0000000000 --- a/graphstorm-processing/docs/source/conf.py +++ /dev/null @@ -1,53 +0,0 @@ -# pylint: skip-file -# Configuration file for the Sphinx documentation builder. -# -# This file only contains a selection of the most common options. For a full -# list see the documentation: -# https://www.sphinx-doc.org/en/master/usage/configuration.html - -# -- Path setup -------------------------------------------------------------- - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. 
-# -# import os -# import sys -# sys.path.insert(0, os.path.abspath('.')) - - -# -- Project information ----------------------------------------------------- - -project = 'graphstorm-processing' -copyright = '2023, AGML Team' -author = 'AGML Team, Amazon' - - -# -- General configuration --------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ -] - -# Add any paths that contain templates here, relative to this directory. -templates_path = ['_templates'] - -# List of patterns, relative to source directory, that match files and -# directories to ignore when looking for source files. -# This pattern also affects html_static_path and html_extra_path. -exclude_patterns = [] - - -# -- Options for HTML output ------------------------------------------------- - -# The theme to use for HTML and HTML Help pages. See the documentation for -# a list of builtin themes. -# -html_theme = 'alabaster' - -# Add any paths that contain custom static files (such as style sheets) here, -# relative to this directory. They are copied after the builtin static files, -# so a file named "default.css" will overwrite the builtin "default.css". -html_static_path = ['_static']