[Refactor] Refactor Evaluator (#822)
*Issue #757, if available:*

*Description of changes:*
This PR refactors the GraphStorm "**Evaluators**". All related code,
examples, and documentation are updated accordingly.

The major changes include:

1. Unified the `GSgnnInstanceEvaluator` and `GSgnnLPEvaluator` classes
into `GSgnnBaseEvaluator` plus interface classes, i.e.,
`GSgnnPredictionEvalInterface` and `GSgnnLPRankingEvalInterface`. This
way, all sub-evaluators share the same properties while providing
different evaluation methods according to the interfaces they
implement.
2. Replaced `GSgnnAccEvaluator` with `GSgnnClassificationEvaluator`,
which is implemented by extending `GSgnnBaseEvaluator` and
`GSgnnPredictionEvalInterface`. Its behavior is nearly the same as that
of `GSgnnAccEvaluator`.
3. Reimplemented `GSgnnRegressionEvaluator` following the new pattern,
extending `GSgnnBaseEvaluator` and `GSgnnPredictionEvalInterface`, and
unified its behavior with that of `GSgnnClassificationEvaluator`.
4. Reimplemented `GSgnnMrrLPEvaluator` and `GSgnnPerEtypeMrrLPEvaluator`
following the new pattern, extending `GSgnnBaseEvaluator` and
`GSgnnLPRankingEvalInterface`.

The detailed changes are (a usage sketch follows this list):

1. `GSgnnBaseEvaluator` holds the shared evaluator properties and
separates them from the abstract methods, i.e., `evaluate()` and
`compute_score()`, which are defined in the interface classes.
2. `GSgnnPredictionEvalInterface` defines the `evaluate()` and
`compute_score()` methods for classification and regression tasks, which
require both predictions and labels as inputs.
3. `GSgnnLPRankingEvalInterface` defines the `evaluate()` and
`compute_score()` methods for ranking-based link prediction tasks, which
require ranking values as inputs.
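
As a usage sketch: a customized prediction evaluator under the new
design extends `GSgnnBaseEvaluator` for the shared properties and
implements the interface's abstract methods. The sketch below assumes
both classes are importable from `graphstorm.eval` (as listed in the
updated API docs); the method signatures and the metric logic are
illustrative, not the exact GraphStorm API.

```python
import torch as th

# Assumed importable from graphstorm.eval per the updated API docs.
from graphstorm.eval import GSgnnBaseEvaluator, GSgnnPredictionEvalInterface

class MyAccuracyEvaluator(GSgnnBaseEvaluator, GSgnnPredictionEvalInterface):
    """ A hypothetical custom evaluator for a prediction task. """

    def evaluate(self, val_pred, test_pred, val_labels, test_labels, total_iters):
        # Score validation and test predictions against their labels.
        # Shared properties (eval frequency, early stopping, best-score
        # bookkeeping) are inherited from GSgnnBaseEvaluator.
        val_score = self.compute_score(val_pred, val_labels)
        test_score = self.compute_score(test_pred, test_labels)
        return val_score, test_score

    def compute_score(self, pred, labels):
        # Return a {metric_name: value} dict; here a single accuracy metric.
        acc = th.mean((pred.argmax(dim=-1) == labels).float()).item()
        return {"accuracy": acc}
```

A ranking-based link prediction evaluator follows the same pattern but
extends `GSgnnLPRankingEvalInterface`, whose `evaluate()` and
`compute_score()` take rankings instead of predictions and labels.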

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
3 people authored Apr 29, 2024
1 parent 28e3b0c commit 51c6a95
Showing 30 changed files with 665 additions and 815 deletions.
14 changes: 7 additions & 7 deletions docs/source/advanced/own-models.rst
@@ -263,13 +263,13 @@ The GraphStorm trainers can have evaluators and task trackers associated.
.. code-block:: python
# Optional: set up a evaluator
-    evaluator = GSgnnAccEvaluator(config.eval_frequency,
-                                  config.eval_metric,
-                                  config.multilabel,
-                                  config.use_early_stop,
-                                  config.early_stop_burnin_rounds,
-                                  config.early_stop_rounds,
-                                  config.early_stop_strategy)
+    evaluator = GSgnnClassificationEvaluator(config.eval_frequency,
+                                             config.eval_metric,
+                                             config.multilabel,
+                                             config.use_early_stop,
+                                             config.early_stop_burnin_rounds,
+                                             config.early_stop_rounds,
+                                             config.early_stop_strategy)
trainer.setup_evaluator(evaluator)
# Optional: set up a task tracker to show the progress of training.
tracker = GSSageMakerTaskTracker(config.eval_frequency)
16 changes: 9 additions & 7 deletions docs/source/api/graphstorm.eval.rst
@@ -7,8 +7,10 @@ graphstorm.eval
Learning (GML) tasks.

If users want to implement customized evaluators or evaluation methods, a best practice is to
-extend base evaluators, i.e., the ``GSgnnInstanceEvaluator`` class for node or edge prediction
-tasks, and ``GSgnnLPEvaluator`` for link prediction tasks, and then implement the abstract methods.
+extend the base evaluator, i.e., the ``GSgnnBaseEvaluator``, and the corresponding evaluation
+interfaces, e.g., ``GSgnnPredictionEvalInterface`` for prediction evaluation, and
+``GSgnnLPRankingEvalInterface`` for ranking-based link prediction evaluation, and then
+implement the abstract methods defined in those interface classes.

.. currentmodule:: graphstorm.eval

@@ -20,8 +22,9 @@ Base Evaluators
:nosignatures:
:template: evaltemplate.rst

-   GSgnnInstanceEvaluator
-   GSgnnLPEvaluator
+   GSgnnBaseEvaluator
+   GSgnnPredictionEvalInterface
+   GSgnnLPRankingEvalInterface

Evaluators
-----------
@@ -31,8 +34,7 @@ Evaluators
:nosignatures:
:template: evaltemplate.rst

-   GSgnnLPEvaluator
+   GSgnnClassificationEvaluator
+   GSgnnRegressionEvaluator
    GSgnnMrrLPEvaluator
    GSgnnPerEtypeMrrLPEvaluator
-   GSgnnAccEvaluator
-   GSgnnRegressionEvaluator
16 changes: 8 additions & 8 deletions examples/customized_models/HGT/hgt_nc.py
@@ -14,7 +14,7 @@
from graphstorm.inference import GSgnnNodePredictionInferrer
from graphstorm.dataloading import GSgnnNodeTrainData, GSgnnNodeInferData
from graphstorm.dataloading import GSgnnNodeDataLoader
-from graphstorm.eval import GSgnnAccEvaluator
+from graphstorm.eval import GSgnnClassificationEvaluator
from graphstorm.tracker import GSSageMakerTaskTracker
from graphstorm.utils import get_device

@@ -326,13 +326,13 @@ def main(args):
train_task=False)

# Optional: set up a evaluator
-    evaluator = GSgnnAccEvaluator(config.eval_frequency,
-                                  config.eval_metric,
-                                  config.multilabel,
-                                  config.use_early_stop,
-                                  config.early_stop_burnin_rounds,
-                                  config.early_stop_rounds,
-                                  config.early_stop_strategy)
+    evaluator = GSgnnClassificationEvaluator(config.eval_frequency,
+                                             config.eval_metric,
+                                             config.multilabel,
+                                             config.use_early_stop,
+                                             config.early_stop_burnin_rounds,
+                                             config.early_stop_rounds,
+                                             config.early_stop_strategy)
trainer.setup_evaluator(evaluator)
# Optional: set up a task tracker to show the progress of training.
tracker = GSSageMakerTaskTracker(config.eval_frequency)
14 changes: 6 additions & 8 deletions examples/peft_llm_gnn/main_lp.py
@@ -54,14 +54,12 @@ def main(config_args):
trainer.setup_device(device=get_device())

# set evaluator
-    evaluator = GSgnnMrrLPEvaluator(config.eval_frequency,
-                                    train_data,
-                                    config.num_negative_edges_eval,
-                                    config.lp_decoder_type,
-                                    config.use_early_stop,
-                                    config.early_stop_burnin_rounds,
-                                    config.early_stop_rounds,
-                                    config.early_stop_strategy
+    evaluator = GSgnnMrrLPEvaluator(
+        eval_frequency=config.eval_frequency,
+        use_early_stop=config.use_early_stop,
+        early_stop_burnin_rounds=config.early_stop_burnin_rounds,
+        early_stop_rounds=config.early_stop_rounds,
+        early_stop_strategy=config.early_stop_strategy
)
# disbale validation for efficiency
# trainer.setup_evaluator(evaluator)
4 changes: 2 additions & 2 deletions examples/peft_llm_gnn/main_nc.py
@@ -3,7 +3,7 @@
from graphstorm.config import get_argument_parser
from graphstorm.config import GSConfig
from graphstorm.dataloading import GSgnnNodeDataLoader
-from graphstorm.eval import GSgnnAccEvaluator
+from graphstorm.eval import GSgnnClassificationEvaluator
from graphstorm.dataloading import GSgnnNodeTrainData
from graphstorm.utils import get_device
from graphstorm.inference import GSgnnNodePredictionInferrer
@@ -52,7 +52,7 @@ def main(config_args):
trainer.setup_device(device=get_device())

# set evaluator
-    evaluator = GSgnnAccEvaluator(
+    evaluator = GSgnnClassificationEvaluator(
config.eval_frequency,
config.eval_metric,
config.multilabel,
5 changes: 2 additions & 3 deletions examples/standalone_mode_demo.ipynb
@@ -39,7 +39,7 @@
"from graphstorm.dataloading import GSgnnNodeTrainData, GSgnnNodeDataLoader, GSgnnNodeInferData\n",
"from graphstorm.model import GSgnnNodeModel, GSNodeEncoderInputLayer, EntityClassifier, ClassifyLossFunc, RelationalGCNEncoder\n",
"from graphstorm.inference import GSgnnNodePredictionInferrer\n",
"from graphstorm.eval import GSgnnAccEvaluator"
"from graphstorm.eval import GSgnnClassificationEvaluator"
]
},
{
@@ -315,9 +315,8 @@
"trainer.setup_device(device=device)\n",
"\n",
"# set up evaluator for the trainer:\n",
"evaluator = GSgnnAccEvaluator(\n",
"evaluator = GSgnnClassificationEvaluator(\n",
" eval_frequency=10000,\n",
" eval_metric=['accuracy'],\n",
" multilabel=multilabel)\n",
"\n",
"trainer.setup_evaluator(evaluator)"
4 changes: 2 additions & 2 deletions examples/temporal_graph_learning/main_nc.py
@@ -3,7 +3,7 @@
from graphstorm.config import get_argument_parser
from graphstorm.config import GSConfig
from graphstorm.dataloading import GSgnnNodeDataLoader
-from graphstorm.eval import GSgnnAccEvaluator
+from graphstorm.eval import GSgnnClassificationEvaluator
from graphstorm.dataloading import GSgnnNodeTrainData
from graphstorm.utils import get_device
from graphstorm.trainer import GSgnnNodePredictionTrainer
@@ -45,7 +45,7 @@ def main(config_args):
trainer.setup_device(device=get_device())

# set evaluator
-    evaluator = GSgnnAccEvaluator(
+    evaluator = GSgnnClassificationEvaluator(
config.eval_frequency,
config.eval_metric,
config.multilabel,
3 changes: 2 additions & 1 deletion python/graphstorm/config/argument.py
@@ -352,7 +352,7 @@ def verify_arguments(self, is_train):
_ = self.log_report_frequency

_ = self.task_type
-        # For classification tasks.
+        # For classification/regression tasks.
if self.task_type in [BUILTIN_TASK_NODE_CLASSIFICATION, BUILTIN_TASK_EDGE_CLASSIFICATION]:
_ = self.label_field
_ = self.num_classes
@@ -368,6 +368,7 @@ def verify_arguments(self, is_train):
BUILTIN_TASK_LINK_PREDICTION] and is_train:
_ = self.exclude_training_targets
_ = self.reverse_edge_types_map
+        # For link prediction tasks.
if self.task_type == BUILTIN_TASK_LINK_PREDICTION:
_ = self.gamma
_ = self.lp_decoder_type
7 changes: 3 additions & 4 deletions python/graphstorm/eval/__init__.py
@@ -23,8 +23,7 @@
from .eval_func import SUPPORTED_REGRESSION_METRICS
from .eval_func import SUPPORTED_LINK_PREDICTION_METRICS

-from .evaluator import GSgnnInstanceEvaluator
-from .evaluator import GSgnnLPEvaluator
-from .evaluator import GSgnnMrrLPEvaluator, GSgnnPerEtypeMrrLPEvaluator
-from .evaluator import GSgnnAccEvaluator
+from .evaluator import GSgnnMrrLPEvaluator
+from .evaluator import GSgnnPerEtypeMrrLPEvaluator
+from .evaluator import GSgnnClassificationEvaluator
from .evaluator import GSgnnRegressionEvaluator
30 changes: 30 additions & 0 deletions python/graphstorm/eval/eval_func.py
@@ -103,6 +103,12 @@ def __init__(self):
self.metric_function["mse"] = compute_mse
self.metric_function["mae"] = compute_mae

+        # This is the operator used to measure each metric performance in evaluation
+        self.metric_eval_function = {}
+        self.metric_eval_function["rmse"] = compute_rmse
+        self.metric_eval_function["mse"] = compute_mse
+        self.metric_eval_function["mae"] = compute_mae

def assert_supported_metric(self, metric):
""" check if the given metric is supported.
"""
@@ -135,6 +141,14 @@ def __init__(self):
self.metric_comparator = {}
self.metric_comparator["mrr"] = operator.le

+        # This is the operator used to measure each metric performance
+        self.metric_function = {}
+        self.metric_function["mrr"] = compute_mrr
+
+        # This is the operator used to measure each metric performance in evaluation
+        self.metric_eval_function = {}
+        self.metric_eval_function["mrr"] = compute_mrr

def assert_supported_metric(self, metric):
""" check if the given metric is supported.
"""
@@ -583,3 +597,19 @@ def compute_mae(pred, labels):

diff = th.abs(pred.cpu() - labels.cpu())
return th.mean(diff).cpu().item()

+def compute_mrr(ranking):
+    """ Get the link prediction MRR (mean reciprocal rank) metric.
+
+    Parameters
+    ----------
+    ranking: torch.Tensor
+        Ranking of each positive edge.
+
+    Returns
+    -------
+    metrics: torch.Tensor
+        The MRR as a scalar tensor.
+    """
+    # MRR is the mean of the reciprocal ranks of the positive edges.
+    reciprocal_ranks = th.div(1.0, ranking)
+    metrics = th.div(th.sum(reciprocal_ranks), len(reciprocal_ranks))
+    return metrics
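
For reference, a quick sanity check of the new `compute_mrr` helper
(the input values are made up for illustration; the module path follows
the file shown above):

```python
import torch as th

from graphstorm.eval.eval_func import compute_mrr

# Three positive edges ranked 1st, 2nd, and 4th among their candidates.
ranking = th.tensor([1.0, 2.0, 4.0])

# MRR = (1/1 + 1/2 + 1/4) / 3 = 1.75 / 3 ≈ 0.5833
print(compute_mrr(ranking))  # tensor(0.5833)
```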