diff --git a/docs/source/notebooks/Notebook_2_LP_Pipeline.ipynb b/docs/source/notebooks/Notebook_2_LP_Pipeline.ipynb new file mode 100644 index 0000000000..1e21a4e82e --- /dev/null +++ b/docs/source/notebooks/Notebook_2_LP_Pipeline.ipynb @@ -0,0 +1,362 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Notebook 2: Use GraphStorm APIs for Building a Link Prediction Pipeline\n", + "\n", + "This notebook demonstrates how to use GraphStorm's APIs to create a graph machine learning pipeline for a link prediction task.\n", + "\n", + "In this notebook, we modify the RGCN model used in Notebook 1 to adapt it to link prediction tasks, and use it to conduct link prediction on the ACM dataset created in **Notebook_0_Data_Prepare**.\n", + "\n", + "### Prerequisites\n", + "\n", + "- GraphStorm installed using pip. Please find [more details on installation of GraphStorm](https://graphstorm.readthedocs.io/en/latest/install/env-setup.html#setup-graphstorm-with-pip-packages).\n", + "- ACM data created in [Notebook 0: Data Prepare](https://graphstorm.readthedocs.io/en/latest/notebooks/Notebook_0_Data_Prepare.html) and stored in the `./acm_gs_1p/` folder.\n", + "- Installation of supporting libraries, e.g., matplotlib." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Setup log level in Jupyter Notebook\n", + "import logging\n", + "logging.basicConfig(level=20)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "The major steps of creating a link prediction pipeline are the same as those of the node classification pipeline in Notebook 1. In this notebook, we only highlight the components that differ, for clarity.\n", + "\n", + "### 0. Initialize the GraphStorm Standalone Environment" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import graphstorm as gs\n", + "gs.initialize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1. Setup GraphStorm Dataset and DataLoaders\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "nfeats_4_modeling = {'author':['feat'], 'paper':['feat'],'subject':['feat']}\n", + "\n", + "# create a GraphStorm Dataset for the ACM graph data generated in Notebook 0\n", + "acm_data = gs.dataloading.GSgnnData(part_config='./acm_gs_1p/acm.json', node_feat_field=nfeats_4_modeling)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because link prediction needs both positive and negative edges for training, we use GraphStorm's `GSgnnLinkPredictionDataLoader`, which is dedicated to link prediction dataloading. This class takes some arguments in common with the node prediction dataloaders, such as `dataset`, `target_idx`, `node_feats`, and `batch_size`. It also takes some link prediction-specific arguments, e.g., `num_negative_edges` and `exclude_training_targets`.\n"
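, + "\n", + "For each positive edge in a mini-batch, the dataloader samples `num_negative_edges` negative edges, typically by replacing one endpoint of the positive edge with a randomly sampled node; the model is then trained to score positive edges higher than these negatives, as the next cell shows."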
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# define dataloaders for training, validation, and testing\n", + "train_dataloader = gs.dataloading.GSgnnLinkPredictionDataLoader(\n", + " dataset=acm_data,\n", + " target_idx=acm_data.get_edge_train_set(etypes=[('paper', 'citing', 'paper')]),\n", + " fanout=[20, 20],\n", + " num_negative_edges=10,\n", + " node_feats=nfeats_4_modeling,\n", + " batch_size=64,\n", + " exclude_training_targets=False,\n", + " reverse_edge_types_map=[\"paper,citing,cited,paper\"],\n", + " train_task=True)\n", + "val_dataloader = gs.dataloading.GSgnnLinkPredictionTestDataLoader(\n", + " dataset=acm_data,\n", + " target_idx=acm_data.get_edge_val_set(etypes=[('paper', 'citing', 'paper')]),\n", + " fanout=[100, 100],\n", + " num_negative_edges=100,\n", + " node_feats=nfeats_4_modeling,\n", + " batch_size=256)\n", + "test_dataloader = gs.dataloading.GSgnnLinkPredictionTestDataLoader(\n", + " dataset=acm_data,\n", + " target_idx=acm_data.get_edge_test_set(etypes=[('paper', 'citing', 'paper')]),\n", + " fanout=[100, 100],\n", + " num_negative_edges=100,\n", + " node_feats=nfeats_4_modeling,\n", + " batch_size=256)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2. Create a GraphStorm-compatible RGCN Model for Link Prediction\n", + "\n", + "For the link prediction task, we modify the RGCN model used for node classification to adapt it to link prediction. Users can find the details in the `demo_models.py` file." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# import a simplified RGCN model for link prediction\n", + "from demo_models import RgcnLPModel\n", + "\n", + "model = RgcnLPModel(g=acm_data.g,\n", + " num_hid_layers=2,\n", + " node_feat_field=nfeats_4_modeling,\n", + " hid_size=128)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3. Setup a GraphStorm Evaluator\n", + "\n", + "Here we change the evaluator to a `GSgnnMrrLPEvaluator`, which uses \"mrr\" (mean reciprocal rank), a metric dedicated to evaluating link prediction performance." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# setup a link prediction evaluator for the trainer\n", + "evaluator = gs.eval.GSgnnMrrLPEvaluator(eval_frequency=1000)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4. Setup a Trainer and Training\n", + "\n", + "GraphStorm provides the `GSgnnLinkPredictionTrainer` for the link prediction training loop. Constructing this trainer and calling its `fit()` method are the same as for the `GSgnnNodePredictionTrainer` used in Notebook 1.\n"
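, + "\n", + "Note that `topk_model_to_save=1` in the next cell keeps only the single best checkpoint, ranked by the validation metric; that checkpoint is what `get_best_model_path()` returns later in this notebook."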
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "# create a GraphStorm link prediction task trainer for the RGCN model\n", + "trainer = gs.trainer.GSgnnLinkPredictionTrainer(model, topk_model_to_save=1)\n", + "trainer.setup_evaluator(evaluator)\n", + "trainer.setup_device(gs.utils.get_device())" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Train the model with the trainer using the fit() function\n", + "trainer.fit(train_loader=train_dataloader,\n", + " val_loader=val_dataloader,\n", + " test_loader=test_dataloader,\n", + " num_epochs=5,\n", + " save_model_path='a_save_path/',\n", + " save_model_frequency=1000,\n", + " use_mini_batch_infer=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### (Optional) 5. Visualize Model Performance History\n", + "\n", + "As in the node classification pipeline, we can use the evaluation history stored in the trainer's evaluator." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# extract evaluation history of metrics from the trainer's evaluator:\n", + "val_metrics, test_metrics = [], []\n", + "for val_metric, test_metric in trainer.evaluator.history:\n", + " val_metrics.append(val_metric['mrr'])\n", + " test_metrics.append(test_metric['mrr'])\n", + "\n", + "# plot the performance curves\n", + "fig, ax = plt.subplots()\n", + "ax.plot(val_metrics, label='val')\n", + "ax.plot(test_metrics, label='test')\n", + "ax.set(xlabel='Epoch', ylabel='MRR')\n", + "ax.legend(loc='best')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6. Inference with the Trained Model\n", + "\n", + "The operations for restoring a model are the same as those used in Notebook 1. Users can find the best model path first, and then use the model's `restore_model()` method to load the trained model file." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# after training, the best model is saved to disk:\n", + "best_model_path = trainer.get_best_model_path()\n", + "print('Best model path:', best_model_path)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "# we can restore the model from the saved path using the model's restore_model() function.\n", + "model.restore_model(best_model_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To do inference, users can either create a new dataloader as the following code does, or reuse one of the dataloaders defined during training." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# Setup dataloader for inference\n", + "infer_dataloader = gs.dataloading.GSgnnLinkPredictionTestDataLoader(\n", + " dataset=acm_data,\n", + " target_idx=acm_data.get_edge_infer_set(etypes=[('paper', 'citing', 'paper')]),\n", + " fanout=[100, 100],\n", + " num_negative_edges=100,\n", + " node_feats=nfeats_4_modeling,\n", + " batch_size=256)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can create a `GSgnnLinkPredictionInferrer` with the restored model, and run inference by calling its `infer()` method.\n"
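, + "\n", + "The `save_embed_path` argument of `infer()` specifies the folder where the inferred node embeddings will be saved."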
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "# Create an Inferrer object\n", + "infer = gs.inference.GSgnnLinkPredictionInferrer(model)\n", + "\n", + "# Run inference on the inference dataset\n", + "infer.infer(acm_data,\n", + " infer_dataloader,\n", + " save_embed_path='infer/embeddings',\n", + " use_mini_batch_infer=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the link prediction task, the inference outputs are the embeddings of all nodes in the inference graph." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "# The GNN embeddings of nodes in the inference graph are saved to folders named after each node type\n", + "!ls -lh infer/embeddings/paper" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "gsf", + "language": "python", + "name": "gsf" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/source/notebooks/demo_models.py b/docs/source/notebooks/demo_models.py index 93f53124da..03c642ea92 100644 --- a/docs/source/notebooks/demo_models.py +++ b/docs/source/notebooks/demo_models.py @@ -22,7 +22,10 @@ GSNodeEncoderInputLayer, RelationalGCNEncoder, EntityClassifier, - ClassifyLossFunc) + ClassifyLossFunc, + GSgnnLinkPredictionModel, + LinkPredictDotDecoder, + LinkPredictBCELossFunc) class RgcnNCModel(GSgnnNodeModel): @@ -89,3 +92,60 @@ def __init__(self, self.init_optimizer(lr=0.001, sparse_optimizer_lr=0.01, weight_decay=0) + + +class RgcnLPModel(GSgnnLinkPredictionModel): + """ A simple RGCN model for link prediction using GraphStorm APIs + + This RGCN model extends GraphStorm's GSgnnLinkPredictionModel. It has a similar + model architecture to the node classification model, but a different decoder layer and loss function: + 1. an input layer that converts input node features to embeddings with hidden dimensions, + 2. a GNN encoder layer that performs the message passing, + 3. a decoder layer that transforms edge representations into logits for link prediction, and + 4. a loss function that matches link prediction tasks. + + The model also initializes its own optimizer object. + + Arguments + ---------- + g: DistGraph + a DGL DistGraph + num_hid_layers: int + the number of GNN layers + node_feat_field: dict of list of strings + the list of feature names for each node type to be used in the model + hid_size: int + the dimension of hidden layers. + """ + def __init__(self, + g, + num_hid_layers, + node_feat_field, + hid_size): + super(RgcnLPModel, self).__init__(alpha_l2norm=0.)
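+ # alpha_l2norm is the coefficient of the L2 regularization term that GraphStorm + # adds to the training loss; setting it to 0. disables regularization.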
+ + # extract feature size + feat_size = gs.get_node_feat_size(g, node_feat_field) + + # set an input layer encoder + encoder = GSNodeEncoderInputLayer(g=g, feat_size=feat_size, embed_size=hid_size) + self.set_node_input_encoder(encoder) + + # set a GNN encoder + gnn_encoder = RelationalGCNEncoder(g=g, + h_dim=hid_size, + out_dim=hid_size, + num_hidden_layers=num_hid_layers-1) + self.set_gnn_encoder(gnn_encoder) + + # set a decoder specific to link prediction task + decoder = LinkPredictDotDecoder(hid_size) + self.set_decoder(decoder) + + # link prediction loss function + self.set_loss_func(LinkPredictBCELossFunc()) + + # initialize model's optimizer + self.init_optimizer(lr=0.001, + sparse_optimizer_lr=0.01, + weight_decay=0)
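+ + + # A minimal usage sketch of RgcnLPModel, mirroring how Notebook 2 above constructs it + # (assumes the ACM data prepared in Notebook 0 exists under ./acm_gs_1p/): + # + # import graphstorm as gs + # gs.initialize() + # nfeats = {'author': ['feat'], 'paper': ['feat'], 'subject': ['feat']} + # acm_data = gs.dataloading.GSgnnData(part_config='./acm_gs_1p/acm.json', + # node_feat_field=nfeats) + # model = RgcnLPModel(g=acm_data.g, num_hid_layers=2, + # node_feat_field=nfeats, hid_size=128)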