From ac1cfbd0720872ea11bed9a4466d04bc00e1989d Mon Sep 17 00:00:00 2001 From: Alexandria Barghi Date: Fri, 29 Dec 2023 10:38:07 -0800 Subject: [PATCH] add readme --- .../standalone/bulk_sampling/README.md | 67 +++++++++++++++---- .../standalone/bulk_sampling/run_train_job.sh | 33 ++++----- 2 files changed, 72 insertions(+), 28 deletions(-) diff --git a/benchmarks/cugraph/standalone/bulk_sampling/README.md b/benchmarks/cugraph/standalone/bulk_sampling/README.md index f48eea5c556..24c8abf7bf0 100644 --- a/benchmarks/cugraph/standalone/bulk_sampling/README.md +++ b/benchmarks/cugraph/standalone/bulk_sampling/README.md @@ -1,11 +1,13 @@ -# cuGraph Bulk Sampling +# cuGraph Sampling Benchmarks -## Overview +## cuGraph Bulk Sampling + +### Overview The `cugraph_bulk_sampling.py` script runs the bulk sampler for a variety of datasets, including both generated (rmat) datasets and disk (ogbn_papers100M, etc.) datasets. It can also load replicas of these datasets to create a larger benchmark (i.e. ogbn_papers100M x2). -## Arguments +### Arguments The script takes a variety of arguments to control sampling behavior. Required: --output_root @@ -51,14 +53,8 @@ Optional: Seed for random number generation. Defaults to '62' - --persist - Whether to aggressively use persist() in dask to make the ETL steps (NOT PART OF SAMPLING) faster. - Will probably make this script finish sooner at the expense of memory usage, but won't affect - sampling time. - Changing this is not recommended unless you know what you are doing. - Defaults to False. -## Input Format +### Input Format The script expects its input data in the following format: ``` @@ -103,7 +99,7 @@ the parquet files. It must have the following format: } ``` -## Output Meta +### Output Meta The script, in addition to the samples, will also output a file named `output_meta.json`. This file contains various statistics about the sampling run, including the runtime, as well as information about the dataset and system that the samples were produced from. @@ -111,6 +107,53 @@ as well as information about the dataset and system that the samples were produc This metadata file can be used to gather the results from the sampling and training stages together. -## Other Notes +### Other Notes For rmat datasets, you will need to generate your own bogus features in the training stage. Since that is trivial, that is not done in this sampling script. + +## cuGraph MNMG Training + +### Overview +The script `run_train_job.sh` runs with the `sbatch` command to launch a series of slurm jobs. +First, for a given number of epochs, the script will produce samples for a given graph. +Then, the training process starts where samples are loaded and training iterations are +processed. + +### Important Notes +Downloading the dataset files before running the slurm jobs is highly recommended. Even though +the script will attempt to download the files if they are not available, this can often +lead to a timeout which will crash the scripts. This applies regardless of whether you are training +with native PyG or cuGraph-PyG. You can download data as follows: + +``` +from ogb.nodeproppred import NodePropPredDataset +dataset = NodePropPredDataset('ogbn-papers100M', root='/home/username/datasets') +``` + +For datasets other than ogbn-papers100M, you follow the same process but only change the dataset name. +The dataset will be correctly preprocessed when you run training. In case you have a slow system, you +can also run preprocessing by running the training script on a single worker, which will avoid a timeout +which crashes the script. + +### Arguments +You will need to modify the bash scripts to run appopriately for your environment and +desired training workflow. The standard sbatch arguments are at the top of the script, such as +job name, queue, etc. These will need to be modified for your SLURM cluster. + +Next are arguments for the container image (which is currently set to the current DLFW image), +and directories where the data and outputs are stored. The directories default to subdirectories +of the current working directory. But if there is a high-throughput storage system available, +using that storage for the samples and datasets is highly recommended. + +Next are standard GNN training arguments such as `FANOUT`, `BATCH_SIZE`, etc. You can also set +the number of training epochs here. These are followed by the `REPLICATION_FACTOR` argument, which +can be used to create replications of the dataset for scale testing purposes. + +The final two arguments are `FRAMEWORK` which can be either "cuGraphPyG" or "PyG", and `GPUS_PER_NODE` +which must be set to the correct value, even if this is provided by a SLURM argument. If `GPUS_PER_NODE` +is not set to the correct number of GPUs, the script will hang indefinitely until it times out. Mismatched +GPUs per node is currently unsupported by this script but should be possible in practice. + +### Output +The results of training will be outputted to the logs directory with an `output.txt` file for each worker. +These will be overwritten upon each run. Accuracy is only reported on rank 0. \ No newline at end of file diff --git a/benchmarks/cugraph/standalone/bulk_sampling/run_train_job.sh b/benchmarks/cugraph/standalone/bulk_sampling/run_train_job.sh index c47b59a5380..a3efc75b8d9 100755 --- a/benchmarks/cugraph/standalone/bulk_sampling/run_train_job.sh +++ b/benchmarks/cugraph/standalone/bulk_sampling/run_train_job.sh @@ -6,24 +6,20 @@ #SBATCH -N 1 #SBATCH -t 00:22:00 -export CONTAINER_IMAGE="/lustre/fsw/rapids/abarghi/dlfw_patched.squash" -export SCRIPTS_DIR=$(pwd) -export LOGS_DIR="/lustre/fsw/rapids/abarghi/logs" -export SAMPLES_DIR="/lustre/fsw/rapids/abarghi/samples" -export DATASETS_DIR="/lustre/fsw/rapids/gnn_datasets" +CONTAINER_IMAGE="/lustre/fsw/rapids/abarghi/dlfw_patched.squash" +SCRIPTS_DIR=$(pwd) +LOGS_DIR=${LOGS_DIR:=$(pwd)"/logs"} +SAMPLES_DIR=${SAMPLES_DIR:=$(pwd)/samples} +DATASETS_DIR=${DATASETS_DIR:=$(pwd)/datasets} -export BATCH_SIZE=512 -export FANOUT="10_10_10" -export REPLICATION_FACTOR=1 -export NUM_EPOCHS=1 -# options: PyG or cuGraphPyG -export FRAMEWORK="cuGraphPyG" +BATCH_SIZE=512 +FANOUT="10_10_10" +NUM_EPOCHS=1 +REPLICATION_FACTOR=1 -export RAPIDS_NO_INITIALIZE=1 -export CUDF_SPILL=1 -export LIBCUDF_CUFILE_POLICY="KVIKIO" -export KVIKIO_NTHREADS=64 -export GPUS_PER_NODE=8 +# options: PyG or cuGraphPyG +FRAMEWORK="cuGraphPyG" +GPUS_PER_NODE=8 nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) ) nodes_array=($nodes) @@ -54,6 +50,11 @@ fi srun \ --container-image $CONTAINER_IMAGE \ --container-mounts=${LOGS_DIR}":/logs",${SAMPLES_DIR}":/samples",${SCRIPTS_DIR}":/scripts",${DATASETS_DIR}":/datasets" \ + RAPIDS_NO_INITIALIZE=1 \ + CUDF_SPILL=1 \ + LIBCUDF_CUFILE_POLICY="KVIKIO" \ + KVIKIO_NTHREADS=64 \ + GPUS_PER_NODE=$gpus_per_node \ torchrun \ --nnodes $nnodes \ --nproc-per-node $gpus_per_node \