Add callback doc (#130)
* Add callback doc

* Fix cross-reference issue

* Fix the intersphinx_mapping configuration format in docs/conf.py

* cleans up callback design

* spelling error

* pass function pointer not return

* Modify runtime_view scheme

---------

Co-authored-by: ryan <[email protected]>
yzhang-23 and ryanmrichard authored Jan 25, 2024
1 parent 8799d74 commit d312e39
Showing 6 changed files with 270 additions and 4 deletions.
12 changes: 12 additions & 0 deletions docs/source/background/abbreviations.rst
@@ -76,6 +76,18 @@ MPI
Message Passing Interface. An API standard defining functions and utilities
useful for writing software using distributed parallelism.

.. _raii:

****
RAII
****

Resource acquisition is initialization. In C++ RAII has come to mean that
resources (such as memory, file handles, basically anything whose use needs to
be managed) should be tied to the lifetime of an object. This ensures that when
the object is deleted the resources are released, which in turn helps avoid
leaks.

.. _simd:

****
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -171,7 +171,7 @@
# -- Options for intersphinx extension ---------------------------------------

# Example configuration for intersphinx: refer to the Python standard library.
-intersphinx_mapping = {'https://docs.python.org/': None}
+intersphinx_mapping = {'python': ('https://docs.python.org/3', None)}

# -- Options for todo extension ----------------------------------------------

Binary file modified docs/source/developer/design/assets/runtime_view.png
35 changes: 32 additions & 3 deletions docs/source/developer/design/runtime_view.rst
@@ -47,8 +47,12 @@ number one supercomputer in the world, or anything in between.
- Hardware

#. Multi-process operations need to go through ``RuntimeView``.
-#. MPI compatability.
+#. MPI compatibility.
#. Flexibility of backend.
#. Setup/teardown of parallel resources

- See :ref:`understanding_runtime_initialization_finalization` for more
details, but basically we need callbacks.

************************
RuntimeView Architecture
@@ -72,11 +76,21 @@ addresses the above consideration by (numbering is from above):
``GPU`` objects in a particular ``ResourceSet``.
- This facilitates selecting start/end points.

-#. MADNESS is built on MPI. MPI is exposed through MADNESS.
+#. MPI support happens via the ``CommPP`` class.

#. The use of the PIMPL design allows us to hide many of the backend types. It
also facilitates writing an implementation for a different backend down the
line (although the API would need to change too).

#. Storing of callbacks allows us to tie the lifetime of the ``RuntimeView`` to
the teardown of parallel resources, i.e., ``RuntimeView`` will automatically
finalize any parallel resources which depend on ``RuntimeView`` before
finalizing itself.

- Note, finalization callbacks are stored in a stack to ensure a controlled
teardown order as is usually needed for libraries with initialize/finalize
functions.

Some finer points:

- The scheduler is envisioned as taking task graphs and scheduling them in a
@@ -92,7 +106,7 @@ Some finer points:
Proposed APIs
*************

-Examples of all-to-all communications
+Examples of all-to-all communications:

.. code-block:: c++

@@ -106,6 +120,21 @@ Examples of all-to-all communications
// This is an all reduce
auto output2 = rt.reduce(data, op);


Example of tying another library's parallel runtime teardown to the lifetime of
a ``RuntimeView`` (note this is only relevant when ParallelZone starts MPI):

.. code-block:: c++

// Create a RuntimeView object
RuntimeView rt;

// Initialize the other library
other_library_initialize();

// Register the corresponding finalization routine with the RuntimeView
rt.stack_callback(other_library_finalize);

.. note::

As written the APIs assume the data is going to/from RAM. If we eventually
1 change: 1 addition & 0 deletions docs/source/developer/index.rst
@@ -28,3 +28,4 @@ developers may also find the more general `NWChemEx Developer Documentation
:caption: Contents:

design/index
initialize_finalize
224 changes: 224 additions & 0 deletions docs/source/developer/initialize_finalize.rst
@@ -0,0 +1,224 @@
.. Copyright 2024 NWChemEx-Project
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
.. http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
.. _understanding_runtime_initialization_finalization:

#################################################
Understanding Runtime Initialization/Finalization
#################################################

:ref:`mpi` requires users to call ``MPI_Init`` to start MPI and
``MPI_Finalize`` to end it. MPI requires that each of these functions be called
only once, regardless of how many code units actually use MPI, i.e., managing
the lifetime of resources such as MPI processes and adhering to :ref:`raii` can
be tricky. This page works through some scenarios to help the reader become
better acquainted with the complexities.

************
RAII and MPI
************

ParallelZone opts to manage MPI through RAII. To do this we associate the
lifetime of MPI with the lifetime of a ``Runtime`` object. When a ``Runtime``
object is created it is either initialized with an existing MPI communicator or
it initializes MPI and then uses the MPI communicator resulting from
initialization. Each ``Runtime`` object internally tracks whether it initialized
MPI or not. When a ``Runtime`` object is destructed it will only call
``MPI_Finalize`` if ``*this`` initialized MPI.

.. note::

At present there is no user-accessible ``Runtime`` object, rather users
interact with an implicit ``Runtime`` through ``RuntimeView`` objects. When
all ``RuntimeView`` objects go out of scope the implicit ``Runtime`` object
is destructed. This decision stems from not wanting accidental/implicit
copies to inadvertently shut down MPI.

.. _traditional_solution:

********************
Traditional Solution
********************

Many existing libraries deal with the MPI problem in one of two ways:

1. Assume the user will manage MPI. Thus the library requires the user to
provide an already initialized MPI communicator.
2. Define functions like ``initialize`` / ``finalize`` which wrap MPI's
``MPI_Init`` / ``MPI_Finalize`` functions respectively.

From the perspective of PZ, Scenario 1 is the easiest to deal with because it
means PZ is free to manage the lifetime of MPI however it wants, so long as MPI
is finalized after the library is done with it. Scenario 1 works well with our
RAII solution "out of the box" and is not considered further.

Scenario 2 is much harder because we know the library's ``initialize`` and
``finalize`` functions will contain MPI functions. This is because they will
minimally contain ``MPI_Init`` and ``MPI_Finalize``, but the functions may also
check if MPI has been initialized and finalized (this is a common practice to
avoid accidentally calling ``MPI_Init``/``MPI_Finalize`` after MPI has already
been initialized/finalized). It is also conceivable that these functions do
additional initialization/finalization which requires MPI to be initialized, but
not yet finalized, e.g., calls to synchronize data.

.. _raii_interacting_with_traditional_solution:

******************************************
RAII Interacting With Traditional Solution
******************************************

In :ref:`traditional_solution` we noted that when a library provides its own
``initialize`` / ``finalize`` functions (which we called "Scenario 2") RAII
interactions become more complicated. It's worth noting that Scenario 2 has two
sub-scenarios:

a. The user should only call ``initialize`` and ``finalize`` if the library is
managing MPI.
b. The user should always call ``initialize`` and ``finalize``.

Each of these sub-scenarios can interact with ParallelZone in one of two
states: PZ started MPI or PZ did not start MPI. Sub-scenario a is essentially
the same as Scenario 1 in :ref:`traditional_solution` if ParallelZone starts
MPI. If, however, the library starts MPI we have:

.. code-block:: c++

initialize(); // library starts MPI
auto comm = get_mpi_communicator_from_library();
RuntimeView rv(comm); // PZ uses MPI from library

finalize(); // library releases MPI
// end of code, rv is released

This is fine so long as destruction of ``rv`` is guaranteed not to use any
MPI functions (which we ultimately will not be able to guarantee, but we'll get
to that). For now we note that there is a better way to write this which will
work even if ``rv`` calls MPI functions, namely we force ``rv`` to go out of
scope before ``finalize`` is called:

.. code-block:: c++


initialize(); // library starts MPI
{
auto comm = get_mpi_communicator_from_library();
RuntimeView rv(comm); // PZ uses MPI from library
// rv is released
}

finalize(); // library releases MPI
// end of code

Moving on to Scenario 2b, if the library starts MPI it is identical to when the
library starts MPI in Scenario 2a and no further comment is necessary. The
remaining condition is Scenario 2b with ParallelZone starting MPI:

.. code-block:: c++

RuntimeView rv; // ParallelZone starts MPI
auto comm = rv.mpi_comm();
initialize(comm); // library uses MPI from PZ

finalize();
// end of code, PZ releases MPI

This is okay as long as ``rv`` is guaranteed to be in scope when ``finalize``
is called.

***********************
RAII Plus Encapsulation
***********************

:ref:`raii_interacting_with_traditional_solution` showed that our RAII solution
is fine as long as we control the order of destruction. This is a detail we'd
rather not leak to the user, especially if more initialization/finalization
functions are added later (or if some are removed). With the traditional
solution we can easily encapsulate this detail with something like:

.. code-block:: c++

void initialize() {
library_a::initialize();
library_b::initialize();
}

void finalize() {
library_b::finalize();
library_a::finalize();
}

// User's code
initialize(); // A initializes MPI, then B uses A's MPI

finalize(); // B cleans up, then A finalizes MPI

As shown, the order of initialization/finalization is guaranteed by creating
wrappers around sub-library initialization/finalization. Users rely on the
wrappers and never need to worry about the order.

So now what about RAII? Let's start with Scenario 2b, and ParallelZone starting
MPI:

.. code-block:: c++

RuntimeView initialize() {
RuntimeView rv; // PZ starts MPI
library_a::initialize(rv.mpi_comm());
return rv; // Must keep rv alive
}

void finalize() {
library_a::finalize();
}

// User's code
auto rv = initialize();

finalize(); // library finalizes
// end of code, PZ ends MPI

While this works, it violates RAII because the user needs to remember to call
``finalize`` before the code ends or else there will be a resource leak. The
entire point of RAII is to avoid the possibility of leaks. If we want our
``RuntimeView`` to adhere to RAII we must find a way for the destructor of
``rv`` to call ``finalize`` before it stops MPI. The easiest way to do this is
with callbacks:

.. code-block:: c++

RuntimeView initialize() {
RuntimeView rv; // PZ starts MPI
library_a::initialize(rv.mpi_comm());

// Register that rv must call finalize upon destruction
rv.stack_callback(library_a::finalize);
return rv; // Must keep rv alive
}

// User's code
auto rv = initialize();

// end of code, PZ's dtor calls library_a::finalize() then ends MPI


*******
Summary
*******

- MPI leaks initialization/finalization concerns to all dependencies.
- This has led to many libraries leaking those same details to their
dependencies too.
- When ParallelZone manages MPI we can use RAII to avoid leaking those details
to our dependencies.
- RAII however requires that ``RuntimeView`` be able to hold callbacks.
