[SageMaker] Add README for pipelines and other small fixes
thvasilo committed Dec 10, 2024
1 parent d044907 commit aebc24a
Showing 3 changed files with 251 additions and 33 deletions.
195 changes: 195 additions & 0 deletions sagemaker/pipeline/README.md
@@ -0,0 +1,195 @@
# GraphStorm SageMaker Pipeline

This project provides a set of tools to create and execute SageMaker pipelines for GraphStorm, a library for large-scale graph neural networks. The pipeline automates the process of graph construction, partitioning, training, and inference using Amazon SageMaker.

## Table of Contents

1. [Overview](#overview)
2. [Prerequisites](#prerequisites)
3. [Project Structure](#project-structure)
4. [Installation](#installation)
5. [Usage](#usage)
- [Creating a Pipeline](#creating-a-pipeline)
- [Executing a Pipeline](#executing-a-pipeline)
6. [Pipeline Components](#pipeline-components)
7. [Configuration](#configuration)
8. [Advanced Usage](#advanced-usage)
9. [Troubleshooting](#troubleshooting)

## Overview

This project simplifies the process of running GraphStorm workflows on Amazon SageMaker. It provides scripts to:

1. Define and create SageMaker pipelines for GraphStorm tasks
2. Execute these pipelines with customizable parameters
3. Manage different stages of graph processing, including construction, partitioning, training, and inference

## Prerequisites

- Python 3.8+
- AWS account with appropriate permissions. See the official
[SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-access.html) docs
for detailed permissions needed to create and run SageMaker Pipelines.
- Familiarity with SageMaker AI and
[SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html).
- Basic understanding of graph neural networks and GraphStorm

## Project Structure

The project consists of three main Python scripts:

1. `create_sm_pipeline.py`: Defines the structure of the SageMaker pipeline
2. `pipeline_parameters.py`: Manages the configuration and parameters for the pipeline
3. `execute_pipeline.py`: Executes created pipelines

## Installation

1. Clone the GraphStorm repository:
   ```bash
   git clone https://github.com/awslabs/graphstorm.git
   cd graphstorm/sagemaker/pipeline
   ```

2. Install the required dependencies:
   ```bash
   pip install sagemaker boto3
   ```

## Usage

### Creating a Pipeline

To create a new SageMaker pipeline for GraphStorm:

```bash
python create_sm_pipeline.py \
--role arn:aws:iam::123456789012:role/SageMakerRole \
--region us-west-2 \
--graphstorm-pytorch-image-url 123456789012.dkr.ecr.us-west-2.amazonaws.com/graphstorm:sm-cpu \
--instance-count 2 \
--jobs-to-run gconstruct train inference \
--graph-name my-graph \
--graph-construction-config-filename my_gconstruct_config.json \
--input-data-s3 s3://input-bucket/data \
--output-prefix-s3 s3://output-bucket/results \
--train-inference-task node_classification \
--train-yaml-s3 s3://config-bucket/train.yaml
```

This command creates a new pipeline with the specified configuration. The pipeline will
include one GConstruct job, one training job, and one inference job.
It will use the configuration defined in `s3://input-bucket/data/my_gconstruct_config.json`
to construct the graph, and the training configuration file at `s3://config-bucket/train.yaml`
to run training and inference.

The `--instance-count` parameter determines the number of workers and the number of graph partitions
that will be created and used during partitioning and training. It is also aliased to `--num-parts`.
For example, `--instance-count 4` partitions the graph into four parts and runs training on four instances.

You can customize various aspects of the pipeline using additional command-line arguments. Refer to the script's help message for a full list of options:

```bash
python create_sm_pipeline.py --help
```

### Executing a Pipeline

To execute a created pipeline:

```bash
python execute_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2
```

You can override various pipeline parameters during execution:

```bash
python execute_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--instance-count 4 \
--gpu-instance-type ml.g4dn.12xlarge
```

For a full list of execution options:

```bash
python execute_pipeline.py --help
```

For more fine-grained execution options, such as selectively executing a subset of pipeline steps, see the
[SageMaker AI documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-selective-ex.html).

## Pipeline Components

The GraphStorm SageMaker pipeline typically includes the following steps:

1. **Graph Construction**: Builds the graph from input data
2. **Graph Partitioning**: Partitions the graph for distributed processing
3. **Training**: Trains the graph neural network model
4. **Inference**: Runs inference on the trained model

Each step can be customized to fit your specific requirements.
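
To see which of these steps ran in a pipeline execution, and their status, you can query the most recent execution with the boto3 SageMaker client. The following is a minimal sketch; the pipeline name and region are placeholders, and it assumes the pipeline has already been created and executed at least once.

```python
# Minimal sketch: list the steps of a pipeline's most recent execution.
# "my-graphstorm-pipeline" and "us-west-2" are placeholders.
import boto3

sm_client = boto3.client("sagemaker", region_name="us-west-2")

# Find the most recent execution of the pipeline
executions = sm_client.list_pipeline_executions(
    PipelineName="my-graphstorm-pipeline",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)["PipelineExecutionSummaries"]

if executions:
    # Print each step's name and status, e.g. Succeeded, Executing, Failed
    steps = sm_client.list_pipeline_execution_steps(
        PipelineExecutionArn=executions[0]["PipelineExecutionArn"]
    )["PipelineExecutionSteps"]
    for step in steps:
        print(step["StepName"], step["StepStatus"])
```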

## Configuration

The pipeline's behavior is controlled by various configuration parameters, including:

- AWS configuration (region, roles, image URLs)
- Instance configuration (instance types, counts)
- Task configuration (graph name, input/output locations)
- Training and inference configurations

Refer to the `PipelineArgs` class in `pipeline_parameters.py` for a complete list of configurable options.
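
For orientation, the sketch below illustrates how those option groups map onto `PipelineArgs`. It is not the actual class definition; the field names are assumptions based on the command-line options shown above, and `pipeline_parameters.py` remains the source of truth.

```python
# Illustrative sketch only -- field names are assumptions based on the CLI
# options above; see the real PipelineArgs in pipeline_parameters.py.
from dataclasses import dataclass


@dataclass
class TaskConfig:
    graph_name: str        # --graph-name
    input_data_s3: str     # --input-data-s3
    output_prefix_s3: str  # --output-prefix-s3


@dataclass
class TrainingConfig:
    train_yaml_file: str   # --train-yaml-s3
    num_trainers: int      # exposed as the NumTrainers pipeline parameter


@dataclass
class PipelineArgs:
    task_config: TaskConfig
    training_config: TrainingConfig
    # ...AWS, instance, partition, and inference settings are grouped similarly
```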

## Advanced Usage

### Using GraphBolt

To use GraphBolt, DGL's optimized graph storage and sampling backend, for improved performance:

```bash
python create_sm_pipeline.py \
... \
--use-graphbolt true
```

### Custom Job Sequences

You can customize the sequence of jobs using the `--jobs-to-run` argument when creating the pipeline. For example:

```bash
python create_sm_pipeline.py \
... \
--jobs-to-run gsprocessing dist_part gb_convert train inference \
--use-graphbolt true
```

will create a pipeline that uses GSProcessing to process and prepare the data for partitioning,
uses GSPartition to partition the data, converts the partitioned data to the GraphBolt format,
and then runs a training job and an inference job in sequence.


### Asynchronous Execution

To start a pipeline execution without waiting for it to complete:

```bash
python execute_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--async-execution
```
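
When an execution is started asynchronously, you can check its status later with the boto3 SageMaker client. A minimal sketch; replace the ARN with the ARN of your execution (visible in the SageMaker console):

```python
# Minimal sketch: check the status of an asynchronously started execution.
# The execution ARN below is a placeholder.
import boto3

sm_client = boto3.client("sagemaker", region_name="us-west-2")

execution_arn = (
    "arn:aws:sagemaker:us-west-2:123456789012:pipeline/"
    "my-graphstorm-pipeline/execution/<execution-id>"
)
status = sm_client.describe_pipeline_execution(
    PipelineExecutionArn=execution_arn
)["PipelineExecutionStatus"]
print(status)  # Executing | Stopping | Stopped | Failed | Succeeded
```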

## Troubleshooting

- Ensure all required AWS permissions are correctly set up
- Check the SageMaker execution logs for detailed error messages (see the sketch below for retrieving step failure reasons programmatically)
- Verify that all S3 paths are correct and accessible. Note that trailing `/` characters in S3 paths can cause issues.
- Ensure that the specified EC2 instance types are available in your region
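
To retrieve the failure reason of a failed step programmatically, a minimal sketch (the execution ARN is a placeholder):

```python
# Minimal sketch: print the failure reason of any failed steps in an execution.
# Replace the ARN with the ARN of the failed execution.
import boto3

sm_client = boto3.client("sagemaker", region_name="us-west-2")

execution_arn = (
    "arn:aws:sagemaker:us-west-2:123456789012:pipeline/"
    "my-graphstorm-pipeline/execution/<execution-id>"
)
steps = sm_client.list_pipeline_execution_steps(
    PipelineExecutionArn=execution_arn
)["PipelineExecutionSteps"]

for step in steps:
    if step["StepStatus"] == "Failed":
        print(step["StepName"], step.get("FailureReason", "<no failure reason reported>"))
```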

See also [Troubleshooting Amazon SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-troubleshooting.html).

For more detailed information about GraphStorm, refer to the [GraphStorm documentation](https://graphstorm.readthedocs.io/).

If you encounter any issues or have questions, please open an issue in the project's [GitHub repository](https://github.com/awslabs/graphstorm/issues).
15 changes: 6 additions & 9 deletions sagemaker/pipeline/create_sm_pipeline.py
@@ -50,6 +50,7 @@ class GraphStormPipelineGenerator:
args : PipelineArgs
Complete set of arguments for the pipeline this will create.
"""

def __init__(self, args: PipelineArgs):
self.args = args
self.pipeline_session = self._create_pipeline_session()
@@ -182,8 +183,9 @@ def _create_pipeline_parameters(self, args: PipelineArgs):
self.num_trainers_param = self._create_int_parameter(
"NumTrainers", args.training_config.num_trainers
)
# TODO: Maybe should not be configurable because it requires changing job sequence?
self.use_graphbolt_param = self._create_string_parameter(
"UseGraphBolt", args.training_config.use_graphbolt
"UseGraphBolt", args.training_config.use_graphbolt_str
)
self.input_data_param = self._create_string_parameter(
"InputData", args.task_config.input_data_s3
@@ -195,6 +197,7 @@ def _create_pipeline_parameters(self, args: PipelineArgs):
"TrainConfigFile", args.training_config.train_yaml_file
)

# If inference yaml is not provided, re-use the training one
inference_yaml_default = (
args.inference_config.inference_yaml_file
or args.training_config.train_yaml_file
@@ -305,12 +308,6 @@ def _create_gconstruct_step(self, args: PipelineArgs) -> ProcessingStep:
"--add-reverse-edges",
]

# TODO: This doesn't seem to work, can try to debug or try to enforce
# extended args.graph_construction_config.graph_construction_args during argparsing?
# Would that make use-graphbolt not execution-configurable?
# if self.use_graphbolt_param.to_string() == "true":
# gconstruct_arguments.extend(["--use-graphbolt", "true"])

# TODO: Make this a pipeline parameter?
if args.graph_construction_config.graph_construction_args:
gconstruct_arguments.extend(
@@ -328,6 +325,7 @@ def _create_gconstruct_step(self, args: PipelineArgs) -> ProcessingStep:
],
job_arguments=gconstruct_arguments,
code=args.script_paths.gconstruct_script,
cache_config=self.cache_config,
)

self.next_step_data_input = gconstruct_s3_output
@@ -406,6 +404,7 @@ def _create_gsprocessing_step(self, args: PipelineArgs) -> ProcessingStep:
],
outputs=[gsprocessing_meta_output],
job_arguments=gsprocessing_arguments,
cache_config=self.cache_config,
)

self.next_step_data_input = gsprocessing_output
@@ -483,8 +482,6 @@ def _create_gb_convert_step(self, args: PipelineArgs) -> ProcessingStep:
gb_convert_arguments = [
"--metadata-filename",
args.partition_config.output_json_filename,
"--log-level",
args.task_config.log_level,
]

gb_convert_step = ProcessingStep(