Replace trainer function with Trainer class (#185)

--------- Co-authored-by: frostedoyster <[email protected]> Co-authored-by: Filippo Bigi <[email protected]> Co-authored-by: Arslan Mazitov <[email protected]>
metatensor · May 29, 2024 · e5b5c0a · e5b5c0a
1 parent 2bc152f
commit e5b5c0a
Showing 80 changed files with 2,445 additions and 2,225 deletions.
diff --git a/docs/src/dev-docs/adding-models.rst b/docs/src/dev-docs/adding-models.rst
diff --git a/docs/src/dev-docs/architecture-life-cycle.rst b/docs/src/dev-docs/architecture-life-cycle.rst
@@ -33,7 +33,7 @@ repository. To qualify as an experimental architecture, certain criteria must be
    a public git repository or another public URL with a repository is acceptable.
 
 For detailed instructions on adding a new architecture, refer to
-:ref:`adding-new-models`.
+:ref:`adding-new-architecture`.
 
 Stable Architectures
 --------------------

diff --git a/docs/src/dev-docs/index.rst b/docs/src/dev-docs/index.rst
@@ -9,7 +9,7 @@ module.
 .. toctree::
    :maxdepth: 1
 
-   adding-models
    architecture-life-cycle
+   new-architecture
    cli/index
    utils/index
diff --git a/docs/src/dev-docs/new-architecture.rst b/docs/src/dev-docs/new-architecture.rst
@@ -0,0 +1,144 @@
+.. _adding-new-architecture:
+
+Adding a new architecture
+=========================
+
+To work with ``metatensor-models``, any architecture has to follow the same public API
+so that it can be called correctly within the :py:func:`metatensor.models.cli.train`
+function, which processes the user's options. In brief, the core of the ``train``
+function looks similar to these lines:
+
+.. code-block:: python
+
+    from architecture import __model__ as Model
+    from architecture import __trainer__ as Trainer
+
+    hypers = {}
+    dataset_info = DatasetInfo()
+
+    if "continue_from":
+        model = Model.load_checkpoint("path")
+        model = model.restart(dataset_info)
+    else:
+        model = Model(hypers["architecture"], dataset_info)
+
+    trainer = Trainer(hypers["training"])
+
+    trainer.train(
+        model=model,
+        devices=[],
+        train_datasets=[],
+        validation_datasets=[],
+        checkpoint_dir="path",
+    )
+
+    model.save_checkpoint("final.ckpt")
+
+    mts_atomistic_model = model.export()
+    mts_atomistic_model.export("path", collect_extensions="extensions-dir/")
+
+
+In order to follow this, a new architecture has to define two classes:
+
+- a ``Model`` class, defining the core of the architecture. This class must implement
+  the interface documented below in :py:class:`ModelInterface`.
+- a ``Trainer`` class, used to train an architecture and produce a model that can be
+  evaluated and exported. This class must implement the interface documented below in
+  :py:class:`TrainerInterface`.
+
+The ``ModelInterface`` is the main model class and must implement the
+``save_checkpoint()``, ``load_checkpoint()``, ``restart()`` and ``export()``
+methods.
+
+.. code-block:: python
+
+    class ModelInterface:
+
+        __supported_devices__ = ["cuda", "cpu"]
+        __supported_dtypes__ = [torch.float64, torch.float32]
+
+        def __init__(self, model_hypers, dataset_info: DatasetInfo):
+            self.hypers = model_hypers
+            self.dataset_info = dataset_info
+
+        def save_checkpoint(self, path: Union[str, Path]):
+            pass
+
+        @classmethod
+        def load_checkpoint(cls, path: Union[str, Path]) -> "ModelInterface":
+            pass
+
+        def restart(self, dataset_info: DatasetInfo) -> "ModelInterface":
+            """Restart training.
+
+            This function is called whenever training restarts, with the same or a
+            different dataset.
+
+            It enables transfer learning (changing the targets), and fine-tuning (same
+            targets, different dataset).
+            """
+            pass
+
+        def export(self) -> MetatensorAtomisticModel:
+            pass
+
+Note that the ``ModelInterface`` does not necessarily have to inherit from
+:py:class:`torch.nn.Module` since training can be performed in any way.
+``__supported_devices__`` and ``__supported_dtypes__`` can be defined to set the
+capabilities of the model. These two lists should be sorted in order of preference since
+``metatensor-models`` will use them to determine, based on the user's request and the
+machine's availability, the optimal ``dtype`` and ``device`` for training.
+
+The ``export()`` method is required to transform a trained model into a standalone file
+to be used in combination with molecular dynamics engines to run simulations. We provide
+a helper function :py:func:`metatensor.models.utils.export.export` to export a torch
+model to a :py:class:`MetatensorAtomisticModel
+<metatensor.torch.atomistic.MetatensorAtomisticModel>`.
+
+The ``TrainerInterface`` class should have the following signature, with a required
+``train()`` method.
+
+.. code-block:: python
+
+    class TrainerInterface:
+        def __init__(self, train_hypers):
+            self.hypers = train_hypers
+
+        def train(
+            self,
+            model: ModelInterface,
+            devices: List[torch.device],
+            train_datasets: List[Union[Dataset, torch.utils.data.Subset]],
+            validation_datasets: List[Union[Dataset, torch.utils.data.Subset]],
+            checkpoint_dir: str,
+        ): ...
+
+The names of the ``ModelInterface`` and the ``TrainerInterface`` are free to choose but
+should be linked to constants in the ``__init__.py`` of each architecture. On top of
+these two constants, the ``__init__.py`` must contain constants for the original
+``__authors__`` and current ``__maintainers__`` of the architecture.
+
+.. code-block:: python
+
+    from .model import CustomSOTAModel
+    from .trainer import Trainer
+
+    __model__ = CustomSOTAModel
+    __trainer__ = Trainer
+
+    __authors__ = [
+        ("Jane Roe <[email protected]>", "@janeroe"),
+        ("John Doe <[email protected]>", "@johndoe"),
+    ]
+
+    __maintainers__ = [("Joe Bloggs <[email protected]>", "@joebloggs")]
+
+
+:param __model__: Mapping of the custom ``ModelInterface`` to a general one to be loaded
+    by metatensor-models.
+:param __trainer__: Same as ``__model__`` but for the ``Trainer`` class.
+:param __authors__: List of tuples denoting the original authors of an architecture,
+    each with an email address and GitHub handle. The authors do not necessarily have to
+    be in charge of maintaining the architecture.
+:param __maintainers__: List of tuples denoting the current maintainers of the
+    architecture. Uses the same style as the ``__authors__`` constant.
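
The interface documented in `new-architecture.rst` above can be exercised with a small toy architecture. The following is only a sketch under stated assumptions: `ToyModel` and `ToyTrainer` are hypothetical names, the checkpoint is a plain pickle file, the "training" loop just increments a list of numbers, and no `torch` or `metatensor` objects are involved. A real architecture would list `torch` dtypes in `__supported_dtypes__` and return a `MetatensorAtomisticModel` from `export()`.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union


class ToyModel:
    """Hypothetical model following the documented ``ModelInterface``."""

    # Sorted in order of preference, as the documentation asks.
    __supported_devices__ = ["cpu"]
    __supported_dtypes__ = ["float64"]  # assumption: real code lists torch dtypes

    def __init__(self, model_hypers: Dict[str, Any], dataset_info: Any):
        self.hypers = model_hypers
        self.dataset_info = dataset_info
        self.weights: List[float] = [0.0] * model_hypers.get("size", 1)

    def save_checkpoint(self, path: Union[str, Path]) -> None:
        # Store everything needed to rebuild the model later.
        state = {
            "hypers": self.hypers,
            "dataset_info": self.dataset_info,
            "weights": self.weights,
        }
        with open(path, "wb") as fd:
            pickle.dump(state, fd)

    @classmethod
    def load_checkpoint(cls, path: Union[str, Path]) -> "ToyModel":
        with open(path, "rb") as fd:
            state = pickle.load(fd)
        model = cls(state["hypers"], state["dataset_info"])
        model.weights = state["weights"]
        return model

    def restart(self, dataset_info: Any) -> "ToyModel":
        # Keep the trained weights but switch to the new dataset description.
        self.dataset_info = dataset_info
        return self

    def export(self):
        raise NotImplementedError("a real architecture exports a model here")


class ToyTrainer:
    """Hypothetical trainer following the documented ``TrainerInterface``."""

    def __init__(self, train_hypers: Dict[str, Any]):
        self.hypers = train_hypers

    def train(self, model, devices, train_datasets, validation_datasets,
              checkpoint_dir: str) -> None:
        # Stand-in for a training loop: add 1.0 to every weight per epoch.
        for _ in range(self.hypers.get("num_epochs", 1)):
            model.weights = [w + 1.0 for w in model.weights]
        model.save_checkpoint(Path(checkpoint_dir) / "model.ckpt")


# Usage: train, reload the checkpoint, then restart with new dataset info.
with tempfile.TemporaryDirectory() as tmp:
    model = ToyModel({"size": 2}, {"targets": ["energy"]})
    ToyTrainer({"num_epochs": 3}).train(model, [], [], [], tmp)
    restored = ToyModel.load_checkpoint(Path(tmp) / "model.ckpt")
    restored = restored.restart({"targets": ["forces"]})
```

Returning the model from `restart()` mirrors the `model = model.restart(dataset_info)` call shown in the documentation's `train` outline.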
diff --git a/docs/src/dev-docs/utils/dtype.rst b/docs/src/dev-docs/utils/dtype.rst
@@ -0,0 +1,7 @@
+Dtype
+#####
+
+.. automodule:: metatensor.models.utils.dtype
+    :members:
+    :undoc-members:
+    :show-inheritance:
diff --git a/docs/src/dev-docs/utils/index.rst b/docs/src/dev-docs/utils/index.rst
@@ -10,14 +10,14 @@ This is the API for the ``utils`` module of ``metatensor-models``.
    architectures
    composition
    devices
+   dtype
    errors
    evaluate_model
    external_naming
    export
    io
    logging
    loss
-   merge_capabilities
    metrics
    neighbor_lists
    omegaconf

diff --git a/docs/src/dev-docs/utils/merge_capabilities.rst b/docs/src/dev-docs/utils/merge_capabilities.rst
diff --git a/pyproject.toml b/pyproject.toml
@@ -61,7 +61,7 @@ alchemical-model = [
   "torch_alchemical @ git+https://github.com/abmazitov/torch_alchemical.git@51ff519",
 ]
 pet = [
-  "pet @ git+https://github.com/spozdn/pet.git@ad3dc8a",
+  "pet @ git+https://github.com/spozdn/pet.git@9f6119d",
 ]
 
 [tool.setuptools.packages.find]

diff --git a/src/metatensor/models/__main__.py b/src/metatensor/models/__main__.py
@@ -1,23 +1,40 @@
 """The main entry point for the metatensor-models command line interface."""
 
 import argparse
+import importlib
 import logging
 import os
 import sys
 import traceback
-import warnings
 from datetime import datetime
 from pathlib import Path
 
+import metatensor.torch
 from omegaconf import OmegaConf
 
 from . import __version__
 from .cli.eval import _add_eval_model_parser, eval_model
 from .cli.export import _add_export_model_parser, export_model
 from .cli.train import _add_train_model_parser, train_model
+from .utils.architectures import check_architecture_name
 from .utils.logging import setup_logging
 
 
+# This import is necessary to avoid errors when loading an
+# exported alchemical model, which depends on sphericart-torch.
+# TODO: Remove this when https://github.com/lab-cosmo/metatensor/issues/512
+# is ready
+try:
+    import sphericart.torch  # noqa: F401
+except ImportError:
+    pass
+
+try:
+    import rascaline.torch  # noqa: F401
+except ImportError:
+    pass
+
+
 logger = logging.getLogger(__name__)
 
 
@@ -69,14 +86,27 @@ def main():
     args = ap.parse_args()
     callable = args.__dict__.pop("callable")
     debug = args.__dict__.pop("debug")
+    logfile = None
 
     if debug:
         level = logging.DEBUG
     else:
         level = logging.INFO
-        warnings.filterwarnings("ignore")  # ignore all warnings if not in debug mode
 
-    if callable == "train_model":
+    if callable == "eval_model":
+        args.__dict__["model"] = metatensor.torch.atomistic.load_atomistic_model(
+            path=args.__dict__.pop("path"),
+            extensions_directory=args.__dict__.pop("extensions_directory"),
+        )
+    elif callable == "export_model":
+        architecture_name = args.__dict__.pop("architecture_name")
+        check_architecture_name(architecture_name)
+        architecture = importlib.import_module(f"metatensor.models.{architecture_name}")
+
+        args.__dict__["model"] = architecture.__model__.load_checkpoint(
+            args.__dict__.pop("path")
+        )
+    elif callable == "train_model":
         # define and create `checkpoint_dir` based on current directory and date/time
         checkpoint_dir = _datetime_output_path(now=datetime.now())
         os.makedirs(checkpoint_dir)
@@ -92,7 +122,7 @@ def main():
 
         args.options = OmegaConf.merge(args.options, override_options)
     else:
-        logfile = None
+        raise ValueError("internal error when selecting a sub-command.")
 
     with setup_logging(logger, logfile=logfile, level=level):
         try:
@@ -104,11 +134,11 @@ def main():
                 train_model(**args.__dict__)
             else:
                 raise ValueError("internal error when selecting a sub-command.")
-        except Exception as e:
+        except Exception as err:
             if debug:
                 traceback.print_exc()
             else:
-                sys.exit(f"\033[31mERROR: {e}\033[0m")  # format error in red!
+                sys.exit(str(err))
 
 
 if __name__ == "__main__":
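
The `export_model` branch added to `main()` above resolves the architecture at run time through its `__model__` constant. A minimal sketch of that lookup pattern follows, assuming a stand-in module: `toy_arch`, `DummyModel`, `KNOWN_ARCHITECTURES` and `load_model` are hypothetical names, and the module is registered in `sys.modules` by hand so the example runs without `metatensor-models` installed. The name check is only a placeholder for what `check_architecture_name` does in the commit.

```python
import importlib
import sys
import types


class DummyModel:
    """Hypothetical model class exposing the ``load_checkpoint`` classmethod."""

    def __init__(self, path: str):
        self.path = path

    @classmethod
    def load_checkpoint(cls, path: str) -> "DummyModel":
        return cls(path)


# Register a fake architecture module so that importlib can find it.
fake_module = types.ModuleType("toy_arch")
fake_module.__model__ = DummyModel
sys.modules["toy_arch"] = fake_module

KNOWN_ARCHITECTURES = {"toy_arch"}  # assumption: stand-in for the real name check


def load_model(architecture_name: str, path: str):
    """Import the architecture by name and load a checkpoint via ``__model__``."""
    if architecture_name not in KNOWN_ARCHITECTURES:
        raise ValueError(f"unknown architecture {architecture_name!r}")
    architecture = importlib.import_module(architecture_name)
    return architecture.__model__.load_checkpoint(path)


model = load_model("toy_arch", "model.ckpt")
```

Because the CLI only relies on `__model__` and `load_checkpoint`, it never needs to know the concrete class name an architecture chose.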

diff --git a/src/metatensor/models/cli/eval.py b/src/metatensor/models/cli/eval.py
@@ -18,9 +18,7 @@
     write_predictions,
 )
 from ..utils.errors import ArchitectureError
-from ..utils.evaluate_model import evaluate_model
-from ..utils.export import is_exported
-from ..utils.io import load
+from ..utils.evaluate_model import _get_outputs, evaluate_model
 from ..utils.logging import MetricLogger
 from ..utils.metrics import RMSEAccumulator
 from ..utils.neighbor_lists import get_system_with_neighbor_lists
@@ -49,15 +47,27 @@ def _add_eval_model_parser(subparser: argparse._SubParsersAction) -> None:
     )
     parser.set_defaults(callable="eval_model")
     parser.add_argument(
-        "model",
-        type=load,
+        "path",
+        type=str,
         help="Saved exported model to be evaluated.",
     )
     parser.add_argument(
         "options",
         type=OmegaConf.load,
         help="Eval options file to define a dataset for evaluation.",
     )
+    parser.add_argument(
+        "-e",
+        "--extdir",
+        type=str,
+        required=False,
+        dest="extensions_directory",
+        default=None,
+        help=(
+            "path to a directory containing all extensions required by the exported "
+            "model"
+        ),
+    )
     parser.add_argument(
         "-o",
         "--output",
@@ -186,7 +196,8 @@ def _eval_targets(
     rmse_values = rmse_accumulator.finalize(not_per_atom=["positions_gradients"])
     # print the RMSEs with MetricLogger
     metric_logger = MetricLogger(
-        model_capabilities=model.capabilities(),
+        logobj=logger,
+        model_outputs=_get_outputs(model),
         initial_metrics=rmse_values,
     )
     metric_logger.log(rmse_values)
@@ -200,7 +211,9 @@ def _eval_targets(
 
 
 def eval_model(
-    model: torch.nn.Module, options: DictConfig, output: Union[Path, str] = "output.xyz"
+    model: Union[MetatensorAtomisticModel, torch.jit._script.RecursiveScriptModule],
+    options: DictConfig,
+    output: Union[Path, str] = "output.xyz",
 ) -> None:
     """Evaluate an exported model on a given data set.
 
@@ -212,12 +225,6 @@ def eval_model(
     :param options: DictConfig to define a test dataset taken for the evaluation.
     :param output: Path to save the predicted values
     """
-    if not is_exported(model):
-        raise ValueError(
-            "The model must already be exported to be used in `eval`. "
-            "If you are trying to evaluate a checkpoint, export it first "
-            "with the `metatensor-models export` command."
-        )
     logger.info("Setting up evaluation set.")
 
     # TODO: once https://github.com/lab-cosmo/metatensor/pull/551 is merged and released