Add callback doc (#130)
* Add callback doc

* Fix cross-reference issue

* Fix the intersphinx_mapping configuration format in docs/conf.py

* cleans up callback design

* spelling error

* pass function pointer not return

* Modify runtime_view scheme

---------

Co-authored-by: ryan <[email protected]>
yzhang-23 and ryanmrichard authored Jan 25, 2024
1 parent 8799d74 commit d312e39
Showing 6 changed files with 270 additions and 4 deletions.
12 changes: 12 additions & 0 deletions docs/source/background/abbreviations.rst
@@ -76,6 +76,18 @@ MPI
Message Passing Interface. An API standard defining functions and utilities
useful for writing software using distributed parallelism.

.. _raii:

****
RAII
****

Resource acquisition is initialization. In C++ RAII has come to mean that
resources (such as memory, file handles, basically anything whose use needs to
be managed) should be tied to the lifetime of an object. This ensures that when
the object is deleted the resources are released, which in turn helps avoid
leaks.

.. _simd:

****
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -171,7 +171,7 @@
# -- Options for intersphinx extension ---------------------------------------

# Example configuration for intersphinx: refer to the Python standard library.
-intersphinx_mapping = {'https://docs.python.org/': None}
+intersphinx_mapping = {'python': ('https://docs.python.org/3', None)}

# -- Options for todo extension ----------------------------------------------

Binary file modified docs/source/developer/design/assets/runtime_view.png
35 changes: 32 additions & 3 deletions docs/source/developer/design/runtime_view.rst
@@ -47,8 +47,12 @@ number one supercomputer in the world, or anything in between.
- Hardware

#. Multi-process operations need to go through ``RuntimeView``.
-#. MPI compatability.
+#. MPI compatibility.
#. Flexibility of backend.
#. Setup/teardown of parallel resources

- See :ref:`understanding_runtime_initialization_finalization` for more
details, but basically we need callbacks.

************************
RuntimeView Architecture
@@ -72,11 +76,21 @@ addresses the above consideration by (numbering is from above):
``GPU`` objects in a particular ``ResourceSet``.
- This facilitates selecting start/end points.

-#. MADNESS is built on MPI. MPI is exposed through MADNESS.
+#. MPI support happens via the ``CommPP`` class.

#. The use of the PIMPL design allows us to hide many of the backend types. It
also facilitates writing an implementation for a different backend down the
line (although the API would need to change too).

#. Storing of callbacks allows us to tie the lifetime of the ``RuntimeView`` to
the teardown of parallel resources, i.e., ``RuntimeView`` will automatically
finalize any parallel resources which depend on ``RuntimeView`` before
finalizing itself.

- Note, finalization callbacks are stored in a stack to ensure a controlled
teardown order as is usually needed for libraries with initialize/finalize
functions.

Some finer points:

- The scheduler is envisioned as taking task graphs and scheduling them in a
@@ -92,7 +106,7 @@ Some finer points:
Proposed APIs
*************

-Examples of all-to-all communications
+Examples of all-to-all communications:

.. code-block:: c++

@@ -106,6 +120,21 @@ Examples of all-to-all communications
// This is an all reduce
auto output2 = rt.reduce(data, op);


Example of tying another library's parallel runtime teardown to the lifetime of
a ``RuntimeView`` (note this is only relevant when ParallelZone starts MPI):

.. code-block:: c++

// Create a RuntimeView object
RuntimeView rt;

// Initialize the other library
other_library_initialize();

// Register the corresponding finalization routine with the RuntimeView
rt.stack_callback(other_library_finalize);

.. note::

As written the APIs assume the data is going to/from RAM. If we eventually
1 change: 1 addition & 0 deletions docs/source/developer/index.rst
@@ -28,3 +28,4 @@ developers may also find the more general `NWChemEx Developer Documentation
:caption: Contents:

design/index
initialize_finalize
224 changes: 224 additions & 0 deletions docs/source/developer/initialize_finalize.rst
@@ -0,0 +1,224 @@
.. Copyright 2024 NWChemEx-Project
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
.. http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
.. _understanding_runtime_initialization_finalization:

#################################################
Understanding Runtime Initialization/Finalization
#################################################

:ref:`mpi` requires users to call ``MPI_Init`` to start MPI and
``MPI_Finalize`` to end it. MPI requires that each of these functions be called
only once, regardless of how many code units actually use MPI, i.e., managing
the lifetime of resources such as MPI processes and adhering to :ref:`raii` can
be tricky. This page works through some scenarios to help the reader become
better acquainted with the complexities.

************
RAII and MPI
************

ParallelZone opts to manage MPI through RAII. To do this we associate the
lifetime of MPI with the lifetime of a ``Runtime`` object. When a ``Runtime``
object is created it is either initialized with an existing MPI communicator or
it initializes MPI and then uses the MPI communicator resulting from
initialization. Each ``Runtime`` object internally tracks whether it initialized
MPI or not. When a ``Runtime`` object is destructed it will only call
``MPI_Finalize`` if ``*this`` initialized MPI.

.. note::

At present there is no user-accessible ``Runtime`` object, rather users
interact with an implicit ``Runtime`` through ``RuntimeView`` objects. When
all ``RuntimeView`` objects go out of scope the implicit ``Runtime`` object
is destructed. This decision stems from not wanting accidental/implicit
copies to inadvertently shut down MPI.

.. _traditional_solution:

********************
Traditional Solution
********************

Many existing libraries deal with the MPI problem in one of two ways:

1. Assume the user will manage MPI. Thus the library requires the user to
provide an already initialized MPI communicator.
2. Define functions like ``initialize`` / ``finalize`` which wrap MPI's
``MPI_Init`` / ``MPI_Finalize`` functions respectively.

From the perspective of PZ, Scenario 1 is the easiest to deal with because it
means PZ is free to manage the lifetime of MPI however it wants, so long as MPI
is finalized after the library is done with it. Scenario 1 works well with our
RAII solution "out of the box" and is not considered further.

Scenario 2 is much harder because we know the library's ``initialize`` and
``finalize`` functions will contain MPI functions. This is because they will
minimally contain ``MPI_Init`` and ``MPI_Finalize``, but the functions may also
check if MPI has been initialized and finalized (this is a common practice to
avoid accidentally calling ``MPI_Init``/``MPI_Finalize`` after MPI has already
been initialized/finalized). It is also conceivable that these functions do
additional initialization/finalization which requires MPI to be initialized, but
not yet finalized, e.g., calls to synchronize data.

.. _raii_interacting_with_traditional_solution:

******************************************
RAII Interacting With Traditional Solution
******************************************

In :ref:`traditional_solution` we noted that when a library provides its own
``initialize`` / ``finalize`` functions (which we called "Scenario 2") RAII
interactions become more complicated. It's worth noting that Scenario 2 has two
sub-scenarios:

a. The user should only call ``initialize`` and ``finalize`` if the library is
managing MPI.
b. The user should always call ``initialize`` and ``finalize``.

Each of these sub-scenarios can interact with ParallelZone in one of two
states: PZ started MPI or PZ did not start MPI. Sub-scenario a is essentially
the same as Scenario 1 in :ref:`traditional_solution` if ParallelZone starts
MPI. If, however, the library starts MPI we have:

.. code-block:: c++

initialize(); // library starts MPI
auto comm = get_mpi_communicator_from_library();
RuntimeView rv(comm); // PZ uses MPI from library

finalize(); // library releases MPI
// end of code, rv is released

This is fine so long as destruction of ``rv`` is guaranteed not to use any
MPI functions (which we ultimately will not be able to guarantee, but we'll get
to that). For now we note that there is a better way to write this which will
work even if ``rv`` calls MPI functions, namely we force ``rv`` to go out of
scope before ``finalize`` is called:

.. code-block:: c++


initialize(); // library starts MPI
{
auto comm = get_mpi_communicator_from_library();
RuntimeView rv(comm); // PZ uses MPI from library
// rv is released
}

finalize(); // library releases MPI
// end of code

Moving on to Scenario 2b, if the library starts MPI it is identical to when the
library starts MPI in Scenario 2a and no further comment is necessary. The
remaining condition is Scenario 2b with ParallelZone starting MPI:

.. code-block:: c++

RuntimeView rv; // ParallelZone starts MPI
auto comm = rv.mpi_comm();
initialize(comm); // library uses MPI from PZ

finalize();
// end of code, PZ releases MPI

This is okay as long as ``rv`` is guaranteed to be in scope when ``finalize``
is called.

***********************
RAII Plus Encapsulation
***********************

:ref:`raii_interacting_with_traditional_solution` showed that our RAII solution
is fine as long as we control the order of destruction. This is a detail we'd
rather not leak to the user, especially if more initialization/finalization
functions are added later (or if some are removed). With the traditional
solution we can easily encapsulate this detail with something like:

.. code-block:: c++

void initialize() {
library_a::initialize();
library_b::initialize();
}

void finalize() {
library_b::finalize();
library_a::finalize();
}

// User's code
initialize(); // A initializes MPI, then B uses A's MPI

finalize(); // B cleans up, then A finalizes MPI

As shown, the order of initialization/finalization is guaranteed by creating
wrappers around sub-library initialization/finalization. Users rely on the
wrappers and never need to worry about the order.

So now what about RAII? Let's start with Scenario 2b, and ParallelZone starting
MPI:

.. code-block:: c++

RuntimeView initialize() {
RuntimeView rv; // PZ starts MPI
library_a::initialize(rv.mpi_comm());
return rv; // Must keep rv alive
}

void finalize() {
library_a::finalize();
}

// User's code
auto rv = initialize();

finalize(); // library finalizes
// end of code, PZ ends MPI

While this works, it violates RAII because the user needs to remember to call
``finalize`` before the code ends or else there will be a resource leak. The
entire point of RAII is to avoid the possibility of leaks. If we want our
``RuntimeView`` to adhere to RAII we must find a way for the destructor of
``rv`` to call ``finalize`` before it stops MPI. The easiest way to do this is
with callbacks:

.. code-block:: c++

RuntimeView initialize() {
RuntimeView rv; // PZ starts MPI
library_a::initialize(rv.mpi_comm());

// Register that rv must call finalize upon destruction
rv.stack_callback(library_a::finalize);
return rv; // Must keep rv alive
}

// User's code
auto rv = initialize();

// end of code, PZ's dtor calls library_a::finalize() then ends MPI


*******
Summary
*******

- MPI leaks initialization/finalization concerns to all dependencies.
- This has led to many libraries leaking those same details to their
dependencies too.
- When ParallelZone manages MPI we can use RAII to avoid leaking those details
to our dependencies.
- RAII however requires that ``RuntimeView`` be able to hold callbacks.
