start of some docs describing the GPU flow (#2843)

AMReX-Astro · Jun 18, 2024 · bf363d8
1 parent 627cc1e
Showing 3 changed files with 78 additions and 0 deletions.
diff --git a/Docs/source/gpu.rst b/Docs/source/gpu.rst
@@ -0,0 +1,58 @@
+*********************
+GPU Programming Model
+*********************
+
+CPUs and GPUs have separate memory, which means that working on both
+the host and device may involve managing the transfer of data between
+the memory on the host and that on the GPU.
+
+In Castro, the core design when running on GPUs is that all of the
+computation should be done on the GPU.
+
+When we compile with ``USE_CUDA=TRUE`` or ``USE_HIP=TRUE``, AMReX will allocate
+a pool of memory on the GPUs and all of the ``StateData`` will be stored there.
+As long as we do all of the computation on the GPUs, we don't need
+to manage any of the data movement manually.
+
+.. note::
+
+   We can tell AMReX to allocate the data using managed memory by
+   setting:
+
+   ::
+
+      amrex.the_arena_is_managed = 1
+
+   This is generally not needed.
+
+The programming model used throughout Castro is C++ lambdas that capture
+by value.  We access the ``FArrayBox`` stored in the ``StateData``
+``MultiFab`` by creating an ``Array4`` object.  The ``Array4`` does
+not directly store a copy of the data, but instead has a pointer to
+the data in the ``FArrayBox``.  When we capture the ``Array4`` by
+value in the GPU kernel, the GPU gets access to the pointer to the
+underlying data.
+
+
+Most AMReX functions will work on the data directly on the GPU (like
+``.setVal()``).
+
+In rare instances where we might need to operate on the data on the
+host, we can force a copy to the host, do the work, and then copy
+back.  For an example, see the reduction done in ``Gravity.cpp``.
+
+.. note::
+
+   For a thorough discussion of how the AMReX GPU offloading works
+   see :cite:`amrex-ecp`.
+
+
+Runtime parameters
+------------------
+
+The main exception to keeping all data on the GPUs at all times is the
+runtime parameters.  At the moment, these are allocated as managed
+memory and stored in global memory.  This is simply to make it easier
+to read them in and initialize them on the CPU at runtime.
+
+
diff --git a/Docs/source/index.rst b/Docs/source/index.rst
@@ -21,6 +21,7 @@ https://github.com/amrex-astro/Castro
    mpi_plus_x
    FlowChart
    software
+   gpu
    problem_setups
    timestepping
    creating_a_problem

diff --git a/Docs/source/refs.bib b/Docs/source/refs.bib
@@ -1131,3 +1131,22 @@ @ARTICLE{doubledet2024
 title = {Sensitivity of Simulations of Double-detonation Type Ia Supernovae to Integration Methodology},
 journal = {The Astrophysical Journal},
 }
+
+
+@ARTICLE{amrex-ecp,
+       author = {{Myers}, Andrew and {Zhang}, Weiqun and {Almgren}, Ann and {Antoun}, Thierry and {Bell}, John and {Huebl}, Axel and {Sinn}, Alexander},
+        title = "{AMReX and pyAMReX: Looking Beyond ECP}",
+      journal = {arXiv e-prints},
+     keywords = {Computer Science - Distributed, Parallel, and Cluster Computing},
+         year = 2024,
+        month = mar,
+          eid = {arXiv:2403.12179},
+        pages = {arXiv:2403.12179},
+          doi = {10.48550/arXiv.2403.12179},
+archivePrefix = {arXiv},
+       eprint = {2403.12179},
+ primaryClass = {cs.DC},
+       adsurl = {https://ui.adsabs.harvard.edu/abs/2024arXiv240312179M},
+      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
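
The capture-by-value model described in the new `gpu.rst` can be illustrated with a small host-only sketch. This is not AMReX code: `View3` and `for_each_zone` are hypothetical stand-ins for an `Array4`-like view and a kernel launch, and `fill_and_sum` is an illustrative helper. The only point is that copying the view into the lambda copies a pointer, not the data, so writes made through the captured copy land in the original storage.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an Array4-like object: a non-owning view that
// stores a pointer and extents, not a copy of the data.
struct View3 {
    double* p;
    int nx, ny, nz;

    // const member returning a mutable reference: the view itself is not
    // modified, only the data it points to.
    double& operator()(int i, int j, int k) const {
        return p[(static_cast<std::size_t>(k) * ny + j) * nx + i];
    }
};

// Hypothetical stand-in for a kernel launch: calls f(i, j, k) for every zone.
// Here it is a plain serial loop on the host.
template <typename F>
void for_each_zone(int nx, int ny, int nz, F f) {
    for (int k = 0; k < nz; ++k) {
        for (int j = 0; j < ny; ++j) {
            for (int i = 0; i < nx; ++i) {
                f(i, j, k);
            }
        }
    }
}

// Fill the storage through a by-value copy of the view, then sum the
// original storage to show that the writes reached it.
double fill_and_sum(int nx, int ny, int nz) {
    // Plays the role of the data owned by the FArrayBox.
    std::vector<double> storage(static_cast<std::size_t>(nx) * ny * nz, 0.0);

    View3 a{storage.data(), nx, ny, nz};

    // [=] copies the view (pointer + extents) into the lambda.
    for_each_zone(nx, ny, nz, [=](int i, int j, int k) {
        a(i, j, k) = static_cast<double>(i + j + k);
    });

    double sum = 0.0;
    for (double v : storage) {
        sum += v;
    }
    return sum;
}
```

In this sketch the lambda's copy of `a` is const, yet the underlying values can still be modified because only the pointer is captured; on a GPU the same idea requires that the pointer refer to device-accessible memory.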