From d8a83297aba5eba4ee80c0824b4b9f5f3f1fb304 Mon Sep 17 00:00:00 2001 From: Andrey Alekseenko Date: Wed, 20 Nov 2024 14:09:32 +0100 Subject: [PATCH 1/2] 4: Extract the table into separate file to make docstrfmt happy --- content/4-gpu-concepts-table.rst | 29 +++++++++++++++++++++++++++++ content/4-gpu-concepts.rst | 32 +------------------------------- 2 files changed, 30 insertions(+), 31 deletions(-) create mode 100644 content/4-gpu-concepts-table.rst diff --git a/content/4-gpu-concepts-table.rst b/content/4-gpu-concepts-table.rst new file mode 100644 index 00000000..564b21c6 --- /dev/null +++ b/content/4-gpu-concepts-table.rst @@ -0,0 +1,29 @@ +.. table:: Software mapping naming + :align: center + + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | CUDA | HIP | OpenCL | SYCL | + +=========================+=========================+===========================+===================================================+ + | grid of threads | NDRange | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | block | work-group | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | warp | wavefront | sub-group | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | thread | work-item | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | registers | private memory | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | shared memory | local data share | local memory | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | threadIdx.\{x,y,z\} | get_local_id(\{0,1,2\}) | nd_item::get_local(\{2,1,0\}) [#syclindex]_ | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | blockIdx.\{x,y,z\} | get_group_id(\{0,1,2\}) | nd_item::get_group(\{2,1,0\}) [#syclindex]_ | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | blockDim.\{x,y,z\} | get_local_size(\{0,1,2\}) | nd_item::get_local_range(\{2,1,0\}) [#syclindex]_ | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + +.. [#syclindex] In SYCL, the thread indexing is inverted. In a 3D grid, physically adjacent threads have consecutive X (0) index in CUDA, HIP, and OpenCL, but consecutive Z (2) index in SYCL. + In a 2D grid, CUDA, HIP, and OpenCL still has contiguous indexing along X (0) dimension, while in SYCL it is Y (1). + Same applies to block dimensions and indexing. + diff --git a/content/4-gpu-concepts.rst b/content/4-gpu-concepts.rst index 8d3230be..89a3ded0 100644 --- a/content/4-gpu-concepts.rst +++ b/content/4-gpu-concepts.rst @@ -236,37 +236,7 @@ Terminology At the moment there are three major GPU producers: NVIDIA, Intel, and AMD. While the basic concept behind GPUs is pretty similar they use different names for the various parts. 
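To make the table's footnote on index ordering concrete, here is a minimal SYCL (C++) sketch that computes a 2D global index, with the equivalent CUDA/HIP expressions shown as comments. It is purely illustrative (the launch sizes and the names ``row`` and ``col`` are arbitrary) and is not part of the lesson's example code.

.. code-block:: c++

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q;
        // 2D launch: global range {32, 128}, work-group ("block") size {8, 16}
        q.parallel_for(sycl::nd_range<2>{{32, 128}, {8, 16}},
                       [=](sycl::nd_item<2> item) {
            // CUDA/HIP: contiguous threads vary along x:
            //   int col = blockIdx.x * blockDim.x + threadIdx.x;
            //   int row = blockIdx.y * blockDim.y + threadIdx.y;
            // SYCL: the contiguous dimension is the *last* index (1 in 2D):
            size_t col = item.get_group(1) * item.get_local_range(1) + item.get_local_id(1);
            size_t row = item.get_group(0) * item.get_local_range(0) + item.get_local_id(0);
            (void)col; (void)row;  // indexing only, no actual work done here
        }).wait();
        return 0;
    }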
Furthermore there are software environments for GPU programming, some from the producers and some from external groups all having different naming as well. Below there is a short compilation of the some terms used across different platforms and software environments. - -.. table:: Software mapping naming - :align: center - - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | CUDA | HIP | OpenCL | SYCL | - +=========================+=========================+===========================+===================================================+ - | grid of threads | NDRange | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | block | work-group | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | warp | wavefront | sub-group | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | thread | work-item | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | registers | private memory | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | shared memory | local data share | local memory | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | threadIdx.\{x,y,z\} | get_local_id(\{0,1,2\}) | nd_item::get_local(\{2,1,0\}) [#syclindex]_ | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | blockIdx.\{x,y,z\} | get_group_id(\{0,1,2\}) | nd_item::get_group(\{2,1,0\}) [#syclindex]_ | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | blockDim.\{x,y,z\} | get_local_size(\{0,1,2\}) | nd_item::get_local_range(\{2,1,0\}) [#syclindex]_ | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - -.. [#syclindex] In SYCL, the thread indexing is inverted. In a 3D grid, physically adjacent threads have consecutive X (0) index in CUDA, HIP, and OpenCL, but consecutive Z (2) index in SYCL. - In a 2D grid, CUDA, HIP, and OpenCL still has contiguous indexing along X (0) dimension, while in SYCL it is Y (1). - Same applies to block dimensions and indexing. - - +.. 
include:: 4-gpu-concepts-table.rst Exercises --------- From b8a4cdf6a49200c32a465eb16c293137cc802910 Mon Sep 17 00:00:00 2001 From: Andrey Alekseenko Date: Wed, 20 Nov 2024 14:11:56 +0100 Subject: [PATCH 2/2] Apply docstrfmt to ReST files --- content/0-setup.rst | 112 +- content/1-gpu-history.rst | 136 +- content/10-multiple_gpu.rst | 474 ++++--- content/11-gpu-porting.rst | 636 +++++----- content/12-recommendations.rst | 106 +- content/13-examples.rst | 979 ++++++++------- content/2-gpu-ecosystem.rst | 566 +++++---- content/3-gpu-problems.rst | 350 +++--- content/4-gpu-concepts.rst | 481 ++++--- content/5-intro-to-gpu-prog-models.rst | 285 +++-- content/6-directive-based-models.rst | 1454 +++++++++++----------- content/7-non-portable-kernel-models.rst | 849 ++++++++----- content/8-portable-kernel-models.rst | 841 ++++++++----- content/9-language-support.rst | 842 +++++++------ content/glossary.rst | 84 +- content/guide.rst | 188 +-- content/index.rst | 193 ++- content/quick-reference.rst | 2 +- requirements.txt | 1 + 19 files changed, 4731 insertions(+), 3848 deletions(-) diff --git a/content/0-setup.rst b/content/0-setup.rst index 6465f04a..c3c42bac 100644 --- a/content/0-setup.rst +++ b/content/0-setup.rst @@ -6,18 +6,18 @@ Setup Local installation ------------------ -Since this lesson is taught using an HPC cluster, no software installation on your own computer is needed. - +Since this lesson is taught using an HPC cluster, no software installation on your own +computer is needed. Running on LUMI --------------- -Interactive job, 1 node, 1 GPU, 1 hour: +Interactive job, 1 node, 1 GPU, 1 hour: .. code-block:: console - $ salloc -A project_465001310 -N 1 -t 1:00:00 -p standard-g --gpus-per-node=1 - $ srun + $ salloc -A project_465001310 -N 1 -t 1:00:00 -p standard-g --gpus-per-node=1 + $ srun Exit interactive allocation with ``exit``. @@ -25,104 +25,108 @@ Interacive terminal session on compute node: .. code-block:: console - $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash - $ + $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash + $ Corresponding batch script ``submit.sh``: .. code-block:: bash - #!/bin/bash -l - #SBATCH --account=project_465001310 - #SBATCH --job-name=example-job - #SBATCH --output=examplejob.o%j - #SBATCH --error=examplejob.e%j - #SBATCH --partition=standard-g - #SBATCH --nodes=1 - #SBATCH --gpus-per-node=1 - #SBATCH --ntasks-per-node=1 - #SBATCH --time=1:00:00 + #!/bin/bash -l + #SBATCH --account=project_465001310 + #SBATCH --job-name=example-job + #SBATCH --output=examplejob.o%j + #SBATCH --error=examplejob.e%j + #SBATCH --partition=standard-g + #SBATCH --nodes=1 + #SBATCH --gpus-per-node=1 + #SBATCH --ntasks-per-node=1 + #SBATCH --time=1:00:00 - srun + srun - Submit the job: ``sbatch submit.sh`` - Monitor your job: ``squeue --me`` - Kill job: ``scancel `` - - Running Julia on LUMI -^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~ -In order to run Julia with ``AMDGPU.jl`` on LUMI, we use the following directory structure and assume it is our working directory. +In order to run Julia with ``AMDGPU.jl`` on LUMI, we use the following directory +structure and assume it is our working directory. .. code-block:: console - . - ├── Project.toml # Julia environment - ├── script.jl # Julia script - └── submit.sh # Slurm batch script + . 
+ ├── Project.toml # Julia environment + ├── script.jl # Julia script + └── submit.sh # Slurm batch script An example of a ``Project.toml`` project file. .. code-block:: console - [deps] - AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e" + [deps] + AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e" -For the ``submit.sh`` batch script, include additional content to the batch script mentioned above. +For the ``submit.sh`` batch script, include additional content to the batch script +mentioned above. .. code-block:: bash - #SBATCH --cpus-per-task=2 - #SBATCH --mem-per-cpu=1750 + #SBATCH --cpus-per-task=2 + #SBATCH --mem-per-cpu=1750 - module use /appl/local/csc/modulefiles + module use /appl/local/csc/modulefiles - module load julia - module load julia-amdgpu + module load julia + module load julia-amdgpu - julia --project=. -e 'using Pkg; Pkg.instantiate()' - julia --project=. script.jl + julia --project=. -e 'using Pkg; Pkg.instantiate()' + julia --project=. script.jl An example of the ``script.jl`` code is provided below. .. code-block:: julia - using AMDGPU - - A = rand(2^9, 2^9) - A_d = ROCArray(A) - B_d = A_d * A_d - - println("----EOF----") + using AMDGPU + A = rand(2^9, 2^9) + A_d = ROCArray(A) + B_d = A_d * A_d + println("----EOF----") Running on Google Colab ----------------------- -Google Colaboratory, commonly referred to as "Colab", is a cloud-based Jupyter notebook environment which runs in your web browser. Using it requires login with a Google account. +Google Colaboratory, commonly referred to as "Colab", is a cloud-based Jupyter notebook +environment which runs in your web browser. Using it requires login with a Google +account. This is how you can get access to NVIDIA GPUs on Colab: - Visit https://colab.research.google.com/ and sign in to your Google account - In the menu in front of you, click "New notebook" in the bottom right corner -- After the notebook loads, go to the "Runtime" menu at the top and select "Change runtime type" -- Select "GPU" under "Hardware accelerator" and choose an available type of NVIDIA GPU (e.g. T4) -- Click "Save". The runtime takes a few seconds to load - you can see the status in the top right corner -- After the runtime has loaded, you can type ``!nvidia-smi`` to see information about the GPU. +- After the notebook loads, go to the "Runtime" menu at the top and select "Change + runtime type" +- Select "GPU" under "Hardware accelerator" and choose an available type of NVIDIA GPU + (e.g. T4) +- Click "Save". The runtime takes a few seconds to load - you can see the status in the + top right corner +- After the runtime has loaded, you can type ``!nvidia-smi`` to see information about + the GPU. - You can now write Python code that runs on GPUs through e.g. the numba library. - Access to code examples ----------------------- -Some exercises in this lesson rely on source code that you should download and modify in your own home directory on the cluster. All code examples are available in the same GitHub repository as this lesson itself. To download it you should use Git: +Some exercises in this lesson rely on source code that you should download and modify in +your own home directory on the cluster. All code examples are available in the same +GitHub repository as this lesson itself. To download it you should use Git: .. 
code-block:: console - $ git clone https://github.com/ENCCS/gpu-programming.git - $ cd gpu-programming/content/examples/ - $ ls - + $ git clone https://github.com/ENCCS/gpu-programming.git + $ cd gpu-programming/content/examples/ + $ ls diff --git a/content/1-gpu-history.rst b/content/1-gpu-history.rst index 9ee4986c..a670efa4 100644 --- a/content/1-gpu-history.rst +++ b/content/1-gpu-history.rst @@ -1,131 +1,141 @@ .. _gpu-history: - Why GPUs? ========= - .. questions:: - - What is Moore's law? - - What problem do GPUs solve? + - What is Moore's law? + - What problem do GPUs solve? .. objectives:: - - Explain the historical development of microprocessors and how GPUs enable - continued scaling in computational power + - Explain the historical development of microprocessors and how GPUs enable + continued scaling in computational power .. instructor-note:: - - 15 min teaching - - 0 min exercises - + - 15 min teaching + - 0 min exercises Moore's law ----------- -It states that the number of transistors in a dense integrated circuit doubles about every two years. -More transistors means smaller size of a single element, so higher core frequency can be achieved. -However, power consumption scales with frequency to the third power, therefore the growth in the core frequency has slowed down significantly. -Higher performance of a single node has to rely on its more complicated structure and still can be achieved with SIMD (single instruction multiple data), branch prediction, etc. +It states that the number of transistors in a dense integrated circuit doubles about +every two years. More transistors means smaller size of a single element, so higher core +frequency can be achieved. However, power consumption scales with frequency to the third +power, therefore the growth in the core frequency has slowed down significantly. Higher +performance of a single node has to rely on its more complicated structure and still can +be achieved with SIMD (single instruction multiple data), branch prediction, etc. .. figure:: img/history/microprocessor-trend-data.png - :align: center + :align: center - The evolution of microprocessors. - The number of transistors per chip doubles roughly every 2 years. - However, it can no longer be explored by the core frequency due to the power consumption limits. - Before 2000, the increase in the single core clock frequency was the major source of the - increase in the performance. Mid 2000 mark a transition towards multi-core processors. + The evolution of microprocessors. The number of transistors per chip doubles roughly + every 2 years. However, it can no longer be explored by the core frequency due to + the power consumption limits. Before 2000, the increase in the single core clock + frequency was the major source of the increase in the performance. Mid 2000 mark a + transition towards multi-core processors. Increasing performance has been sustained with two main strategies over the years: - - Increase the single processor performance: + - Increase the single processor performance: - More recently, increase the number of physical cores. - Computing in parallel --------------------- -The underlying idea of parallel computing is to split a computational problem into smaller -subtasks. Many subtasks can then be solved *simultaneously* by multiple processing units. +The underlying idea of parallel computing is to split a computational problem into +smaller subtasks. Many subtasks can then be solved *simultaneously* by multiple +processing units. .. 
figure:: img/history/compp.png - :align: center - - Computing in parallel. + :align: center -How a problem is split into smaller subtasks strongly depends on the problem. -There are various paradigms and programming approaches to do this. + Computing in parallel. +How a problem is split into smaller subtasks strongly depends on the problem. There are +various paradigms and programming approaches to do this. Graphics processing units ------------------------- -Graphics processing units (GPU) have been the most common accelerators during the last few years, the term GPU sometimes is used interchangeably with the term *accelerator*. -GPUs were initially developed for highly-parallel task of graphic processing. -But over the years, they were used more and more in HPC. +Graphics processing units (GPU) have been the most common accelerators during the last +few years, the term GPU sometimes is used interchangeably with the term *accelerator*. +GPUs were initially developed for highly-parallel task of graphic processing. But over +the years, they were used more and more in HPC. -GPUs are a specialized parallel hardware for floating point operations. -They are basically co-processors (helpers) for traditional CPUs: CPU still controls the work flow -but it delegates highly-parallel tasks to the GPU. -GPUs are based on highly parallel architectures, which allows taking advantage of the -increasing number of transistors. +GPUs are a specialized parallel hardware for floating point operations. They are +basically co-processors (helpers) for traditional CPUs: CPU still controls the work flow +but it delegates highly-parallel tasks to the GPU. GPUs are based on highly parallel +architectures, which allows taking advantage of the increasing number of transistors. -Using GPUs allows one to achieve extreme performance per node. -As a result, the single GPU-equipped workstation can outperform small CPU-based clusters -for some type of computational tasks. The drawback is: usually major rewrites of programs is required +Using GPUs allows one to achieve extreme performance per node. As a result, the single +GPU-equipped workstation can outperform small CPU-based clusters for some type of +computational tasks. The drawback is: usually major rewrites of programs is required with an accompanying change in the programming paradigm. .. callout:: Host vs device - GPU-enabled systems require a heterogeneous programming model that involves both - CPU and GPU, where the CPU and its memory are referred to as the host, - and the GPU and its memory as the device. + GPU-enabled systems require a heterogeneous programming model that involves both + CPU and GPU, where the CPU and its memory are referred to as the host, + and the GPU and its memory as the device. .. figure:: img/history/CPU_and_GPU_separated.png - :align: center - - Figure adapted from the Carpentry `GPU Programming lesson `__. + :align: center + Figure adapted from the Carpentry `GPU Programming lesson + `__. A look at the Top-500 list -------------------------- -The `TOP500 project `__ ranks and details the 500 most powerful non-distributed computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The snapshot below shows the top-5 HPC systems as of June 2024, where the columns show: +The `TOP500 project `__ ranks and details the 500 most powerful +non-distributed computer systems in the world. 
The project was started in 1993 and +publishes an updated list of the supercomputers twice a year. The snapshot below shows +the top-5 HPC systems as of June 2024, where the columns show: -- **Cores** - Number of processors +- **Cores** - Number of processors - **Rmax** - Maximal LINPACK performance achieved - **Rpeak** - Theoretical peak performance - **Power** - Power consumption .. figure:: img/history/top-5.png - :align: center + :align: center - Snapshot from the `TOP500 list from June, 2024 `__. - -All systems in the top-5 positions contain GPUs from AMD, Intel, or NVIDIA, except for Fugaku which instead relies on custom-built Arm A64FX CPUs. + Snapshot from the `TOP500 list from June, 2024 + `__. +All systems in the top-5 positions contain GPUs from AMD, Intel, or NVIDIA, except for +Fugaku which instead relies on custom-built Arm A64FX CPUs. Why GPUs? --------- -- **Speed**: GPU computing can significantly accelerate many types of scientific workloads. -- **Improved energy efficiency**: Compared to CPUs, GPUs can perform more calculations per watt of power consumed, - which can result in significant energy savings. This is indeed evident from the `Green500 list `__. -- **Cost-effectiveness**: GPUs can be more cost-effective than traditional CPU-based systems for certain workloads. - +- **Speed**: GPU computing can significantly accelerate many types of scientific + workloads. +- **Improved energy efficiency**: Compared to CPUs, GPUs can perform more calculations + per watt of power consumed, which can result in significant energy savings. This is + indeed evident from the `Green500 list + `__. +- **Cost-effectiveness**: GPUs can be more cost-effective than traditional CPU-based + systems for certain workloads. Limitations and drawbacks ------------------------- -- **Only for certain workloads**: Not all workloads can be efficiently parallelized and accelerated on GPUs. Certain types of workloads, such as those with irregular data access patterns or high branching behavior, may not see significant performance improvements on GPUs. -- **Steeper learning curve**: Depending on the GPU programming API that you choose, GPU computing could require specialized skills in GPU programming and knowledge of GPU architecture, leading to a steeper learning curve compared to CPU programming. Fortunately, if you study this training material closely you will become productive with GPU programming quickly! - - +- **Only for certain workloads**: Not all workloads can be efficiently parallelized and + accelerated on GPUs. Certain types of workloads, such as those with irregular data + access patterns or high branching behavior, may not see significant performance + improvements on GPUs. +- **Steeper learning curve**: Depending on the GPU programming API that you choose, GPU + computing could require specialized skills in GPU programming and knowledge of GPU + architecture, leading to a steeper learning curve compared to CPU programming. + Fortunately, if you study this training material closely you will become productive + with GPU programming quickly! .. 
keypoints:: - - GPUs are accelerators for some types of tasks - - Highly parallilizable compute-intensive tasks are suitable for GPUs - - New programming skills are needed to use GPUs efficiently + - GPUs are accelerators for some types of tasks + - Highly parallilizable compute-intensive tasks are suitable for GPUs + - New programming skills are needed to use GPUs efficiently diff --git a/content/10-multiple_gpu.rst b/content/10-multiple_gpu.rst index 73684985..0e2309ba 100644 --- a/content/10-multiple_gpu.rst +++ b/content/10-multiple_gpu.rst @@ -5,33 +5,61 @@ Multiple GPU programming with MPI .. questions:: - - What approach should be adopted to extend the synchronous OpenACC and OpenMP offloading models to utilise multiple GPUs across multiple nodes? + - What approach should be adopted to extend the synchronous OpenACC and OpenMP offloading models to utilise multiple GPUs across multiple nodes? .. objectives:: - - To learn about combining MPI with either OpenACC or OpenMP offloading models. - - To learn about implementing GPU-awareness MPI approach. + - To learn about combining MPI with either OpenACC or OpenMP offloading models. + - To learn about implementing GPU-awareness MPI approach. .. instructor-note:: - - 30 min teaching - - 30 min exercises + - 30 min teaching + - 30 min exercises Introduction ------------ -Exploring multiple GPUs (Graphics Processing Units) across distributed nodes offers the potential to fully leveraging the capacity of modern HPC (High-Performance Computing) systems at a large scale. Here one of the approaches to accelerate computing on distributed systems is to combine MPI (Message Passing Interface) with a GPU programming model such as OpenACC and OpenMP application programming interfaces (APIs). This combination is motivated by both the simplicity of these APIs, and the widespread use of MPI. - -In this guide we provide readers, who are familiar with MPI, with insights on implementing a hybrid model in which the MPI communication framework is combined with either OpenACC or OpenMP APIs. A special focus will be on performing point-to-point (e.g. `MPI_Send` and `MPI_Recv`) and collective operations (e.g. `MPI_Allreduce`) from OpenACC and OpenMP APIs. Here we address two scenarios: (i) a scenario in which MPI operations are performed in the CPU-host followed by an offload to the GPU-device; and (ii) a scenario in which MPI operations are performed between a pair of GPUs without involving the CPU-host memory. The latter scenario is referred to as GPU-awareness MPI, and has the advantage of reducing the computing time caused by transferring data via the host-memory during heterogeneous communications, thus rendering HPC applications efficient. - -This guide is organized as follows: we first introduce how to assign each MPI rank to a GPU device within the same node. We consider a situation in which the host and the device have a distinct memory. This is followed by a presentation on the hybrid MPI-OpenACC/OpenMP offloading with and without the GPU-awareness MPI. Exercises to help understanding these concepts are provided at the end. +Exploring multiple GPUs (Graphics Processing Units) across distributed nodes offers the +potential to fully leveraging the capacity of modern HPC (High-Performance Computing) +systems at a large scale. Here one of the approaches to accelerate computing on +distributed systems is to combine MPI (Message Passing Interface) with a GPU programming +model such as OpenACC and OpenMP application programming interfaces (APIs). 
This +combination is motivated by both the simplicity of these APIs, and the widespread use of +MPI. + +In this guide we provide readers, who are familiar with MPI, with insights on +implementing a hybrid model in which the MPI communication framework is combined with +either OpenACC or OpenMP APIs. A special focus will be on performing point-to-point +(e.g. `MPI_Send` and `MPI_Recv`) and collective operations (e.g. `MPI_Allreduce`) from +OpenACC and OpenMP APIs. Here we address two scenarios: (i) a scenario in which MPI +operations are performed in the CPU-host followed by an offload to the GPU-device; and +(ii) a scenario in which MPI operations are performed between a pair of GPUs without +involving the CPU-host memory. The latter scenario is referred to as GPU-awareness MPI, +and has the advantage of reducing the computing time caused by transferring data via the +host-memory during heterogeneous communications, thus rendering HPC applications +efficient. + +This guide is organized as follows: we first introduce how to assign each MPI rank to a +GPU device within the same node. We consider a situation in which the host and the +device have a distinct memory. This is followed by a presentation on the hybrid +MPI-OpenACC/OpenMP offloading with and without the GPU-awareness MPI. Exercises to help +understanding these concepts are provided at the end. Assigning MPI-ranks to GPU-devices ---------------------------------- -Accelerating MPI applications to utilise multiple GPUs on distributed nodes requires as a first step assigning each MPI rank to a GPU device, such that two MPI ranks do not use the same GPU device. This is necessarily in order to prevent the application from a potential crash. This is because GPUs are designed to handle multiple threading tasks, but not multiple MPI ranks. +Accelerating MPI applications to utilise multiple GPUs on distributed nodes requires as +a first step assigning each MPI rank to a GPU device, such that two MPI ranks do not use +the same GPU device. This is necessarily in order to prevent the application from a +potential crash. This is because GPUs are designed to handle multiple threading tasks, +but not multiple MPI ranks. -One of the way to ensure that two MPI ranks do not use the same GPU, is to determine which MPI processes run on the same node, such that each process can be assigned to a GPU device within the same node. This can be done, for instance, by splitting the world communicator into sub-groups of communicators (or sub-communicators) using the routine `MPI_COMM_SPLIT_TYPE()`. +One of the way to ensure that two MPI ranks do not use the same GPU, is to determine +which MPI processes run on the same node, such that each process can be assigned to a +GPU device within the same node. This can be done, for instance, by splitting the world +communicator into sub-groups of communicators (or sub-communicators) using the routine +`MPI_COMM_SPLIT_TYPE()`. .. tabs:: @@ -47,314 +75,384 @@ One of the way to ensure that two MPI ranks do not use the same GPU, is to deter :language: C++ :lines: 17-22 -Here, the size of each sub-communicator corresponds to the number of GPUs per node (which is also the number of tasks per node), and each sub-communicator contains a list of processes indicated by a rank. These processes have a shared-memory region defined by the argument `MPI_COMM_TYPE_SHARED` (see the `MPI report `_) for more details). 
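For orientation, a minimal stand-alone C++ sketch of this splitting step and the subsequent device assignment is given below (MPI plus OpenMP offloading). It is not the lesson's own example code, which is included from ``examples/mpi_acc`` and ``examples/mpi_omp`` below; the names ``host_comm`` and ``myDevice`` follow the text, everything else is illustrative.

.. code-block:: c++

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Split the world communicator: ranks that share a node (shared memory)
        // end up in the same sub-communicator.
        MPI_Comm host_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &host_comm);

        int local_rank;  // 0 .. (number of ranks on this node - 1)
        MPI_Comm_rank(host_comm, &local_rank);

        // Assign each local rank its own GPU
        // (round-robin if there happen to be more ranks than devices).
        int num_devices = omp_get_num_devices();
        int myDevice = (num_devices > 0) ? local_rank % num_devices : 0;
        omp_set_default_device(myDevice);

        std::printf("global rank %d -> local rank %d -> device %d of %d\n",
                    rank, local_rank, myDevice, num_devices);

        MPI_Comm_free(&host_comm);
        MPI_Finalize();
        return 0;
    }

On LUMI, such a code would be built with the ``CC`` wrapper and the OpenMP offload flags listed under *Compilation process* below.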
Calling the routine `MPI_COMM_SPLIT_TYPE()` returns a sub-communicator labelled in the code above *”host_comm”*, and in which MPI-ranks are ranked from 0 to number of processes per node -1. These MPI ranks are in turn assigned to different GPU devices within the same node. This procedure is done according to which directive-based model is implemented. The retrieved MPI ranks are then stored in the variable **myDevice**. The variable is passed to an OpenACC or OpenMP routine as indicated in the code below. +Here, the size of each sub-communicator corresponds to the number of GPUs per node +(which is also the number of tasks per node), and each sub-communicator contains a list +of processes indicated by a rank. These processes have a shared-memory region defined by +the argument `MPI_COMM_TYPE_SHARED` (see the `MPI report +`_) for more details). Calling +the routine `MPI_COMM_SPLIT_TYPE()` returns a sub-communicator labelled in the code +above *”host_comm”*, and in which MPI-ranks are ranked from 0 to number of processes per +node -1. These MPI ranks are in turn assigned to different GPU devices within the same +node. This procedure is done according to which directive-based model is implemented. +The retrieved MPI ranks are then stored in the variable **myDevice**. The variable is +passed to an OpenACC or OpenMP routine as indicated in the code below. .. typealong:: Example: ``Assign device`` - .. tabs:: + .. tabs:: - .. tab:: Fortran OpenACC + .. tab:: Fortran OpenACC - .. literalinclude:: examples/mpi_acc/assignDevice_acc.f90 - :language: fortran - :lines: 34-40 + .. literalinclude:: examples/mpi_acc/assignDevice_acc.f90 + :language: fortran + :lines: 34-40 - .. tab:: Fortran OpenMP + .. tab:: Fortran OpenMP - .. literalinclude:: examples/mpi_omp/assignDevice_omp.f90 - :language: fortran - :lines: 34-40 + .. literalinclude:: examples/mpi_omp/assignDevice_omp.f90 + :language: fortran + :lines: 34-40 - .. tab:: C++ OpenMP + .. tab:: C++ OpenMP - .. literalinclude:: examples/mpi_omp/assignDevice_omp.cpp - :language: C++ - :lines: 29-34 + .. literalinclude:: examples/mpi_omp/assignDevice_omp.cpp + :language: C++ + :lines: 29-34 + +Another useful function for retrieving the device number of a specific device, which is +useful, e.g., to map data to a specific device is -Another useful function for retrieving the device number of a specific device, which is useful, e.g., to map data to a specific device is - .. tabs:: - .. tab:: OpenACC - - .. code-block:: fortran - - acc_get_device_num() + .. tab:: OpenACC + + .. code-block:: fortran - .. tab:: OpenMP + acc_get_device_num() - .. code-block:: fortran - - omp_get_device_num() + .. tab:: OpenMP + + .. code-block:: fortran + + omp_get_device_num() The syntax of assigning MPI ranks to GPU devices is summarised below .. typealong:: Example: ``Set device`` - .. tabs:: + .. tabs:: - .. tab:: Fortran OpenACC + .. tab:: Fortran OpenACC - .. literalinclude:: examples/mpi_acc/assignDevice_acc.f90 - :language: fortran - :lines: 15-40 + .. literalinclude:: examples/mpi_acc/assignDevice_acc.f90 + :language: fortran + :lines: 15-40 - .. tab:: Fortran OpenMP + .. tab:: Fortran OpenMP - .. literalinclude:: examples/mpi_omp/assignDevice_omp.f90 - :language: fortran - :lines: 15-40 + .. literalinclude:: examples/mpi_omp/assignDevice_omp.f90 + :language: fortran + :lines: 15-40 - .. tab:: C++ OpenMP + .. tab:: C++ OpenMP - .. literalinclude:: examples/mpi_omp/assignDevice_omp.cpp - :language: C++ - :lines: 8-34 + .. 
literalinclude:: examples/mpi_omp/assignDevice_omp.cpp + :language: C++ + :lines: 8-34 Hybrid MPI-OpenACC/OpenMP without GPU-awareness approach -------------------------------------------------------- -After covering how to assign each MPI-rank to a GPU device, we now address the concept of combining MPI with either -OpenACC or OpenMP offloading. In this approach, calling an MPI routine from an OpenACC or OpenMP API requires updating the data in the CPU host before and after an MPI call. In this scenario, the data is copied back and forth between the host and the device before and after each MPI call. In the hybrid MPI-OpenACC model, the procedure is defined by specifying the directive `update host()` for copying the data from the device to the host before an MPI call; and by the directive `update device()` specified after an MPI call for copying the data back to the device. Similarly in the hybrid MPI-OpenMP. Here, updating the data in the host can be done by specifying the OpenMP directives `update device() from()` and `update device() to()`, respectively, for copying the data from the device to the host and back to the device. - -To illustrate the concept of the hybrid MPI-OpenACC/OpenMP, we show below an example of an implementation that involves the MPI functions `MPI_Send()` and `MPI_Recv()`. - +After covering how to assign each MPI-rank to a GPU device, we now address the concept +of combining MPI with either OpenACC or OpenMP offloading. In this approach, calling an +MPI routine from an OpenACC or OpenMP API requires updating the data in the CPU host +before and after an MPI call. In this scenario, the data is copied back and forth +between the host and the device before and after each MPI call. In the hybrid +MPI-OpenACC model, the procedure is defined by specifying the directive `update host()` +for copying the data from the device to the host before an MPI call; and by the +directive `update device()` specified after an MPI call for copying the data back to the +device. Similarly in the hybrid MPI-OpenMP. Here, updating the data in the host can be +done by specifying the OpenMP directives `update device() from()` and `update device() +to()`, respectively, for copying the data from the device to the host and back to the +device. + +To illustrate the concept of the hybrid MPI-OpenACC/OpenMP, we show below an example of +an implementation that involves the MPI functions `MPI_Send()` and `MPI_Recv()`. .. typealong:: Example: ``Update host/device directives`` - .. tabs:: + .. tabs:: - .. tab:: Fortran OpenACC + .. tab:: Fortran OpenACC - .. literalinclude:: examples/mpi_acc/mpiacc.f90 - :language: fortran - :lines: 62-77 + .. literalinclude:: examples/mpi_acc/mpiacc.f90 + :language: fortran + :lines: 62-77 - .. tab:: Fortran OpenMP + .. tab:: Fortran OpenMP - .. literalinclude:: examples/mpi_omp/mpiomp.f90 - :language: fortran - :lines: 63-78 + .. literalinclude:: examples/mpi_omp/mpiomp.f90 + :language: fortran + :lines: 63-78 - .. tab:: C++ OpenMP + .. tab:: C++ OpenMP - .. literalinclude:: examples/mpi_omp/mpiomp.cpp - :language: C++ - :lines: 63-78 + .. literalinclude:: examples/mpi_omp/mpiomp.cpp + :language: C++ + :lines: 63-78 Here we present a code example that combines MPI with OpenACC/OpenMP API. .. typealong:: Example: ``Update host/device directives`` - .. tabs:: + .. tabs:: - .. tab:: Fortan OpenACC - - .. literalinclude:: examples/mpi_acc/mpiacc.f90 - :language: fortran - :lines: 60-94 + .. tab:: Fortan OpenACC - .. tab:: Fortran OpenMP + .. 
literalinclude:: examples/mpi_acc/mpiacc.f90 + :language: fortran + :lines: 60-94 - .. literalinclude:: examples/mpi_omp/mpiomp.f90 - :language: fortran - :lines: 61-97 + .. tab:: Fortran OpenMP - .. tab:: C++ OpenMP + .. literalinclude:: examples/mpi_omp/mpiomp.f90 + :language: fortran + :lines: 61-97 - .. literalinclude:: examples/mpi_omp/mpiomp.cpp - :language: C++ - :lines: 60-97 + .. tab:: C++ OpenMP -Despite the simplicity of implementing the hybrid MPI-OpenACC/OpenMP offloading, it suffers from a low performance caused by an explicit transfer of data between the host and the device before and after calling an MPI routine. This constitutes a bottleneck in GPU-programming. To improve the performance affected by the host staging during the data transfer, one can implement the GPU-awareness MPI approach as described in the following section. - -Hybrid MPI-OpenACC/OpenMP with GPU-awareness approach ------------------------------------------------------ + .. literalinclude:: examples/mpi_omp/mpiomp.cpp + :language: C++ + :lines: 60-97 -The concept of the GPU-aware MPI enables an MPI library to directly access the GPU-device memory without necessarily using the CPU-host memory as an intermediate buffer (see e.g. `OpenMPI documentation `__). This offers the benefit of transferring data from one GPU to another GPU without the involvement of the CPU-host memory. - -To be specific, in the GPU-awareness approach, the device pointers point to the data allocated in the GPU memory space (data should be present in the GPU device). Here, the pointers are passed as arguments to an MPI routine that is supported by the GPU memory. As MPI routines can directly access GPU memory, it offers the possibility of communicating between pairs of GPUs without transferring data back to the host. +Despite the simplicity of implementing the hybrid MPI-OpenACC/OpenMP offloading, it +suffers from a low performance caused by an explicit transfer of data between the host +and the device before and after calling an MPI routine. This constitutes a bottleneck in +GPU-programming. To improve the performance affected by the host staging during the data +transfer, one can implement the GPU-awareness MPI approach as described in the following +section. -In the hybrid MPI-OpenACC model, the concept is defined by combining the directive `host_data` together with the clause -`use_device(list_array)`. This combination enables the access to the arrays listed in the clause `use_device(list_array)` from the host (see `here `__). The list of arrays, which are already present in the GPU-device memory, are directly passed to an MPI routine without a need of a staging host-memory for copying the data. Note that for initially copying data to GPU, we use unstructured data blocks characterized by the directives `enter data` and `exit data`. The unstructured data has the advantage of allowing to allocate and deallocate arrays within a data region. +Hybrid MPI-OpenACC/OpenMP with GPU-awareness approach +----------------------------------------------------- -To illustrate the concept of the GPU-awareness MPI, we show below two examples that make use of point-to-point and collective operations from OpenACC and OpenMP APIs. In the first code example, the device pointer **f** is passed to the MPI functions `MPI_Send()` and `MPI_Recv()`; and in the second one, the pointer **SumToT** is passed to the MPI function `MPI_Allreduce`. 
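Before looking at the lesson's examples, here is a minimal, self-contained C++/OpenMP sketch of the same pattern (it is not the course's example code; the buffer names ``f`` and ``SumToT`` follow the text, while the sizes and ranks are illustrative). It assumes a GPU-aware MPI library, e.g. MPICH with ``MPICH_GPU_SUPPORT_ENABLED=1`` as described under *Compilation process* below, and at least two MPI ranks for the send/receive part.

.. code-block:: c++

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int n = 1024;
        double* f = new double[n];
        double* SumToT = new double[1];
        for (int i = 0; i < n; ++i) f[i] = rank;
        SumToT[0] = static_cast<double>(rank);

        // Copy the buffers to the GPU once (unstructured data region).
        #pragma omp target enter data map(to: f[0:n], SumToT[0:1])

        // ... device kernels operating on f and SumToT would go here ...

        // Expose the *device* addresses to the host, so that a GPU-aware MPI
        // library can read and write GPU memory directly (no host staging).
        #pragma omp target data use_device_ptr(f, SumToT)
        {
            if (rank == 0 && nranks > 1)
                MPI_Send(f, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(f, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            MPI_Allreduce(MPI_IN_PLACE, SumToT, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }

        // Copy the results back to the host only when they are needed there.
        #pragma omp target exit data map(from: f[0:n], SumToT[0:1])
        std::printf("rank %d: SumToT = %f\n", rank, SumToT[0]);

        delete[] f;
        delete[] SumToT;
        MPI_Finalize();
        return 0;
    }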
Here, the MPI operations `MPI_Send` and `MPI_Recv` as well as `MPI_Allreduce` are performed between a pair of GPUs without passing through the CPU-host memory. +The concept of the GPU-aware MPI enables an MPI library to directly access the +GPU-device memory without necessarily using the CPU-host memory as an intermediate +buffer (see e.g. `OpenMPI documentation +`__). This offers +the benefit of transferring data from one GPU to another GPU without the involvement of +the CPU-host memory. + +To be specific, in the GPU-awareness approach, the device pointers point to the data +allocated in the GPU memory space (data should be present in the GPU device). Here, the +pointers are passed as arguments to an MPI routine that is supported by the GPU memory. +As MPI routines can directly access GPU memory, it offers the possibility of +communicating between pairs of GPUs without transferring data back to the host. + +In the hybrid MPI-OpenACC model, the concept is defined by combining the directive +`host_data` together with the clause `use_device(list_array)`. This combination enables +the access to the arrays listed in the clause `use_device(list_array)` from the host +(see `here +`__). +The list of arrays, which are already present in the GPU-device memory, are directly +passed to an MPI routine without a need of a staging host-memory for copying the data. +Note that for initially copying data to GPU, we use unstructured data blocks +characterized by the directives `enter data` and `exit data`. The unstructured data has +the advantage of allowing to allocate and deallocate arrays within a data region. + +To illustrate the concept of the GPU-awareness MPI, we show below two examples that make +use of point-to-point and collective operations from OpenACC and OpenMP APIs. In the +first code example, the device pointer **f** is passed to the MPI functions `MPI_Send()` +and `MPI_Recv()`; and in the second one, the pointer **SumToT** is passed to the MPI +function `MPI_Allreduce`. Here, the MPI operations `MPI_Send` and `MPI_Recv` as well as +`MPI_Allreduce` are performed between a pair of GPUs without passing through the +CPU-host memory. .. typealong:: Example: ``GPU-awareness: MPI_Send & MPI_Recv`` - .. tabs:: + .. tabs:: - .. tab:: GPU-aware MPI with OpenACC (Fortran) - - .. literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 - :language: fortran - :lines: 65-74 + .. tab:: GPU-aware MPI with OpenACC (Fortran) - .. tab:: GPU-aware MPI with OpenMP (Fortran) - - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 - :language: fortran - :lines: 66-75 + .. literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 + :language: fortran + :lines: 65-74 - .. tab:: GPU-aware MPI with OpenMP (C++) - - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.cpp - :language: C++ - :lines: 66-76 + .. tab:: GPU-aware MPI with OpenMP (Fortran) + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 + :language: fortran + :lines: 66-75 + + .. tab:: GPU-aware MPI with OpenMP (C++) + + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.cpp + :language: C++ + :lines: 66-76 .. typealong:: Example: ``GPU-awareness: MPI_Allreduce`` - .. tabs:: + .. tabs:: - .. tab:: GPU-aware MPI with OpenACC (Fortran) - - .. literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 - :language: fortran - :lines: 90-94 + .. tab:: GPU-aware MPI with OpenACC (Fortran) - .. tab:: GPU-aware MPI with OpenMP (Fortran) - - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 - :language: fortran - :lines: 93-97 + .. 
literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 + :language: fortran + :lines: 90-94 - .. tab:: GPU-aware MPI with OpenMP (C++) - - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.cpp - :language: C++ - :lines: 90-97 + .. tab:: GPU-aware MPI with OpenMP (Fortran) + + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 + :language: fortran + :lines: 93-97 -We provide below a code example that illustrates the implementation of the MPI functions `MPI_Send()`, `MPI_Recv()` and `MPI_Allreduce()` within an OpenACC/OpenMP API. This implementation is specifically designed to support GPU-aware MPI operations. + .. tab:: GPU-aware MPI with OpenMP (C++) + + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.cpp + :language: C++ + :lines: 90-97 + +We provide below a code example that illustrates the implementation of the MPI functions +`MPI_Send()`, `MPI_Recv()` and `MPI_Allreduce()` within an OpenACC/OpenMP API. This +implementation is specifically designed to support GPU-aware MPI operations. .. typealong:: Example: ``GPU-awareness approach`` - .. tabs:: + .. tabs:: - .. tab:: GPU-aware MPI with OpenACC (Fortran) + .. tab:: GPU-aware MPI with OpenACC (Fortran) - .. literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 - :language: fortran - :lines: 60-97 + .. literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 + :language: fortran + :lines: 60-97 - .. tab:: GPU-aware MPI with OpenMP (Fortran) + .. tab:: GPU-aware MPI with OpenMP (Fortran) - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 - :language: fortran - :lines: 60-100 + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 + :language: fortran + :lines: 60-100 - .. tab:: GPU-aware MPI with OpenMP (C++) + .. tab:: GPU-aware MPI with OpenMP (C++) - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 - :language: C++ - :lines: 61-99 + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 + :language: C++ + :lines: 61-99 -The GPU-aware MPI with OpenACC/OpenMP APIs has the capability of directly communicating between a pair of GPUs within a single node. However, performing the GPU-to-GPU communication across multiple nodes requires the the GPUDirect RDMA (Remote Direct Memory Access) technology. This technology can further improve performance by reducing latency. +The GPU-aware MPI with OpenACC/OpenMP APIs has the capability of directly communicating +between a pair of GPUs within a single node. However, performing the GPU-to-GPU +communication across multiple nodes requires the the GPUDirect RDMA (Remote Direct +Memory Access) technology. This technology can further improve performance by reducing +latency. Compilation process ------------------- -The compilation process of the hybrid MPI-OpenACC and MPI-OpenMP offloading is described below. This description is given for a Cray compiler of the wrapper `ftn`. On LUMI-G, the following modules may be necessary before compiling (see the `LUMI documentation `_ for further details about the available programming environments): +The compilation process of the hybrid MPI-OpenACC and MPI-OpenMP offloading is described +below. This description is given for a Cray compiler of the wrapper `ftn`. On LUMI-G, +the following modules may be necessary before compiling (see the `LUMI documentation +`_ for further details +about the available programming environments): .. 
code-block:: console - $ ml LUMI/24.03 - $ ml PrgEnv-cray - $ ml cray-mpich - $ ml rocm - $ ml craype-accel-amd-gfx90a - + $ ml LUMI/24.03 + $ ml PrgEnv-cray + $ ml cray-mpich + $ ml rocm + $ ml craype-accel-amd-gfx90a .. typealong:: Example: ``Compilation process`` - .. tabs:: + .. tabs:: - .. tab:: Compiling MPI-OpenACC (Fortran) - .. code-block:: console + .. tab:: Compiling MPI-OpenACC (Fortran) + .. code-block:: console - $ ftn -hacc -o mycode.mpiacc.exe mycode_mpiacc.f90 + $ ftn -hacc -o mycode.mpiacc.exe mycode_mpiacc.f90 - .. tab:: Compiling MPI-OpenMP (Fortran) - .. code-block:: console + .. tab:: Compiling MPI-OpenMP (Fortran) + .. code-block:: console - $ ftn -homp -o mycode.mpiomp.exe mycode_mpiomp.f90 + $ ftn -homp -o mycode.mpiomp.exe mycode_mpiomp.f90 - .. tab:: Compiling MPI-OpenMP (C++) - .. code-block:: console + .. tab:: Compiling MPI-OpenMP (C++) + .. code-block:: console - $ CC -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target -march=gfx90a -o mycode.mpiomp.exe mycode_mpiomp.cpp + $ CC -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target -march=gfx90a -o mycode.mpiomp.exe mycode_mpiomp.cpp -Here, the flags `hacc` and `homp` enable the OpenACC and OpenMP directives in the hybrid MPI-OpenACC and MPI-OpenMP applications, respectively. +Here, the flags `hacc` and `homp` enable the OpenACC and OpenMP directives in the hybrid +MPI-OpenACC and MPI-OpenMP applications, respectively. **Enabling GPU-aware support** -To enable the GPU-aware support in MPICH library, one needs to set the following environment variable before running the application. +To enable the GPU-aware support in MPICH library, one needs to set the following +environment variable before running the application. .. code-block:: - $ export MPICH_GPU_SUPPORT_ENABLED=1 - + $ export MPICH_GPU_SUPPORT_ENABLED=1 Conclusion ---------- -In conclusion, we have presented an overview of a GPU-hybrid programming by integrating GPU-directive models, specifically OpenACC and OpenMP APIs, with the MPI library. The approach adopted here allows us to utilise multiple GPU-devices not only within a single node but it extends to distributed nodes. In particular, we have addressed GPU-aware MPI approach, which has the advantage of enabling a direct interaction between an MPI library and a GPU-device memory. In other words, it permits performing MPI operations between a pair of GPUs, thus reducing the computing time caused by the data locality. - + +In conclusion, we have presented an overview of a GPU-hybrid programming by integrating +GPU-directive models, specifically OpenACC and OpenMP APIs, with the MPI library. The +approach adopted here allows us to utilise multiple GPU-devices not only within a single +node but it extends to distributed nodes. In particular, we have addressed GPU-aware MPI +approach, which has the advantage of enabling a direct interaction between an MPI +library and a GPU-device memory. In other words, it permits performing MPI operations +between a pair of GPUs, thus reducing the computing time caused by the data locality. + Exercises --------- -We consider an MPI fortran code that solves a 2D-Laplace equation, and which is partially accelerated. The focus of the exercises is to complete the acceleration using either OpenACC or OpenMP API by following these steps. +We consider an MPI fortran code that solves a 2D-Laplace equation, and which is +partially accelerated. The focus of the exercises is to complete the acceleration using +either OpenACC or OpenMP API by following these steps. .. 
callout:: Access exercise material - Code examples for the exercises below can be accessed in the `content/examples/exercise_multipleGPU` subdirectory of this repository. To access them, you need to clone the repository: + Code examples for the exercises below can be accessed in the `content/examples/exercise_multipleGPU` subdirectory of this repository. To access them, you need to clone the repository: - .. code-block:: console + .. code-block:: console - $ git clone https://github.com/ENCCS/gpu-programming.git - $ cd gpu-programming/content/examples/exercise_multipleGPU - $ ls + $ git clone https://github.com/ENCCS/gpu-programming.git + $ cd gpu-programming/content/examples/exercise_multipleGPU + $ ls .. challenge:: Exercise I: Set a GPU device - 1. Implement OpenACC/OpenMP functions that enable assigning each MPI rank to a GPU device. + 1. Implement OpenACC/OpenMP functions that enable assigning each MPI rank to a GPU device. - 1.1 Compile and run the code on multiple GPUs. + 1.1 Compile and run the code on multiple GPUs. .. challenge:: Exercise II: Apply traditional MPI-OpenACC/OpenMP - 2.1 Incorporate the OpenACC directives `*update host()*` and `*update device()*` before and after calling an MPI function, respectively. + 2.1 Incorporate the OpenACC directives `*update host()*` and `*update device()*` before and after calling an MPI function, respectively. - .. note:: - The OpenACC directive `*update host()*` is used to transfer data from GPU to CPU within a data region; while the directive `*update device()*` is used to transfer the data from CPU to GPU. + .. note:: + The OpenACC directive `*update host()*` is used to transfer data from GPU to CPU within a data region; while the directive `*update device()*` is used to transfer the data from CPU to GPU. - 2.2 Incorporate the OpenMP directives `*update device() from()*` and `*update device() to()*` before and after calling an MPI function, respectively. + 2.2 Incorporate the OpenMP directives `*update device() from()*` and `*update device() to()*` before and after calling an MPI function, respectively. - .. note:: - The OpenMP directive `*update device() from()*` is used to transfer data from GPU to CPU within a data region; while the directive `*update device() to()*` is used to transfer the data from CPU to GPU. + .. note:: + The OpenMP directive `*update device() from()*` is used to transfer data from GPU to CPU within a data region; while the directive `*update device() to()*` is used to transfer the data from CPU to GPU. - 2.3 Compile and run the code on multiple GPUs. + 2.3 Compile and run the code on multiple GPUs. .. challenge:: Exercise III: Implement GPU-aware support - 3.1 Incorporate the OpenACC directive `*host_data use_device()*` to pass a device pointer to an MPI function. + 3.1 Incorporate the OpenACC directive `*host_data use_device()*` to pass a device pointer to an MPI function. - 3.2 Incorporate the OpenMP directive `*data use_device_ptr()*` to pass a device pointer to an MPI function. + 3.2 Incorporate the OpenMP directive `*data use_device_ptr()*` to pass a device pointer to an MPI function. - 3.3 Compile and run the code on multiple GPUs. + 3.3 Compile and run the code on multiple GPUs. .. challenge:: Exercise IV: Evaluate the performance - 1. Evaluate the execution time of the accelerated codes in the exercises **II** and **III**, and compare it with that of a pure MPI implementation. + 1. 
Evaluate the execution time of the accelerated codes in the exercises **II** and **III**, and compare it with that of a pure MPI implementation. See also -------- -- `GPU-aware MPI `_. +- `GPU-aware MPI + `_. - `MPI documentation `_. -- `OpenACC specification `_. -- `OpenMP specification `_. -- `LUMI documentation `_. -- `OpenACC vs OpenMP offloading `_. +- `OpenACC specification + `_. +- `OpenMP specification + `_. +- `LUMI documentation + `_. +- `OpenACC vs OpenMP offloading + `_. - `OpenACC course `_. - - diff --git a/content/11-gpu-porting.rst b/content/11-gpu-porting.rst index 8a9576d3..f7c04301 100644 --- a/content/11-gpu-porting.rst +++ b/content/11-gpu-porting.rst @@ -5,273 +5,309 @@ Preparing code for GPU porting .. questions:: - - What are the key steps involved in porting code to take advantage of GPU parallel processing capability? - - How can I identify the computationally intensive parts of my code that can benefit from GPU acceleration? - - What are the considerations for refactoring loops to suit the GPU architecture and improve memory access patterns? - - Are there any tools that can translate automatically between different frameworks? + - What are the key steps involved in porting code to take advantage of GPU parallel processing capability? + - How can I identify the computationally intensive parts of my code that can benefit from GPU acceleration? + - What are the considerations for refactoring loops to suit the GPU architecture and improve memory access patterns? + - Are there any tools that can translate automatically between different frameworks? .. objectives:: - - Getting familiarized the steps involved in porting code to GPUs to take advantage of parallel processing capabilities. - - Giving some idea about refactoring loops and modifying operations to suit the GPU architecture and improve memory access patterns. - - Learn to use automatic translation tools to port from CUDA to HIP and from OpenACC to OpenMP + - Getting familiarized the steps involved in porting code to GPUs to take advantage of parallel processing capabilities. + - Giving some idea about refactoring loops and modifying operations to suit the GPU architecture and improve memory access patterns. + - Learn to use automatic translation tools to port from CUDA to HIP and from OpenACC to OpenMP .. instructor-note:: - - 30 min teaching - - 20 min exercises + - 30 min teaching + - 20 min exercises Porting from CPU to GPU ----------------------- -When porting code to take advantage of the parallel processing capability of GPUs, several steps need to be followed and some additional work is required before writing actual parallel code to be executed on the GPUs: - -* **Identify Targeted Parts**: Begin by identifying the parts of the code that contribute significantly to the execution time. These are often computationally intensive sections such as loops or matrix operations. The Pareto principle suggests that roughly 10-20% of the code accounts for 80-90% of the execution time. - -* **Equivalent GPU Libraries**: If the original code uses CPU libraries like BLAS, FFT, etc, it's crucial to identify the equivalent GPU libraries. For example, `cuBLAS` or `hipBLAS` can replace CPU-based BLAS libraries. Utilizing GPU-specific libraries ensures efficient GPU utilization. - -* **Refactor Loops**: When porting loops directly to GPUs, some refactoring is necessary to suit the GPU architecture. 
This typically involves splitting the loop into multiple steps or modifying operations to exploit the independence between iterations and improve memory access patterns. Each step of the original loop can be mapped to a kernel, executed by multiple GPU threads, with each thread corresponding to an iteration. - -* **Memory Access Optimization**: Consider the memory access patterns in the code. GPUs perform best when memory access is coalesced and aligned. Minimizing global memory accesses and maximizing utilization of shared memory or registers can significantly enhance performance. Review the code to ensure optimal memory access for GPU execution. +When porting code to take advantage of the parallel processing capability of GPUs, +several steps need to be followed and some additional work is required before writing +actual parallel code to be executed on the GPUs: + +- **Identify Targeted Parts**: Begin by identifying the parts of the code that + contribute significantly to the execution time. These are often computationally + intensive sections such as loops or matrix operations. The Pareto principle suggests + that roughly 10-20% of the code accounts for 80-90% of the execution time. +- **Equivalent GPU Libraries**: If the original code uses CPU libraries like BLAS, FFT, + etc, it's crucial to identify the equivalent GPU libraries. For example, `cuBLAS` or + `hipBLAS` can replace CPU-based BLAS libraries. Utilizing GPU-specific libraries + ensures efficient GPU utilization. +- **Refactor Loops**: When porting loops directly to GPUs, some refactoring is necessary + to suit the GPU architecture. This typically involves splitting the loop into multiple + steps or modifying operations to exploit the independence between iterations and + improve memory access patterns. Each step of the original loop can be mapped to a + kernel, executed by multiple GPU threads, with each thread corresponding to an + iteration. +- **Memory Access Optimization**: Consider the memory access patterns in the code. GPUs + perform best when memory access is coalesced and aligned. Minimizing global memory + accesses and maximizing utilization of shared memory or registers can significantly + enhance performance. Review the code to ensure optimal memory access for GPU + execution. Discussion -^^^^^^^^^^ - .. challenge:: How would this be ported? (n_soap ≈ 100, n_sites ⩾ 10000, k_max ≈ 20*n_sites ) - - Inspect the following Fortran code (if you don't read Fortran: do-loops == for-loops) - - .. 
code-block:: Fortran - - k2 = 0 - do i = 1, n_sites - do j = 1, n_neigh(i) - k2 = k2 + 1 - counter = 0 - counter2 = 0 - do n = 1, n_max - do np = n, n_max - do l = 0, l_max - if( skip_soap_component(l, np, n) )cycle - - counter = counter+1 - do m = 0, l - k = 1 + l*(l+1)/2 + m - counter2 = counter2 + 1 - multiplicity = multiplicity_array(counter2) - soap_rad_der(counter, k2) = soap_rad_der(counter, k2) + multiplicity * real( cnk_rad_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_rad_der(k, np, k2)) ) - soap_azi_der(counter, k2) = soap_azi_der(counter, k2) + multiplicity * real( cnk_azi_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_azi_der(k, np, k2)) ) - soap_pol_der(counter, k2) = soap_pol_der(counter, k2) + multiplicity * real( cnk_pol_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_pol_der(k, np, k2)) ) - end do - end do - end do - end do - - soap_rad_der(1:n_soap, k2) = soap_rad_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_rad_der(1:n_soap, k2) ) - soap_azi_der(1:n_soap, k2) = soap_azi_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_azi_der(1:n_soap, k2) ) - soap_pol_der(1:n_soap, k2) = soap_pol_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_pol_der(1:n_soap, k2) ) - - if( j == 1 )then - k3 = k2 - else - soap_cart_der(1, 1:n_soap, k2) = dsin(thetas(k2)) * dcos(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dcos(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) - dsin(phis(k2)) / rjs(k2) * soap_azi_der(1:n_soap, k2) - soap_cart_der(2, 1:n_soap, k2) = dsin(thetas(k2)) * dsin(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dsin(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) + dcos(phis(k2)) / rjs(k2) * soap_azi_der(1:n_soap, k2) - soap_cart_der(3, 1:n_soap, k2) = dcos(thetas(k2)) * soap_rad_der(1:n_soap, k2) + dsin(thetas(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) - soap_cart_der(1, 1:n_soap, k3) = soap_cart_der(1, 1:n_soap, k3) - soap_cart_der(1, 1:n_soap, k2) - soap_cart_der(2, 1:n_soap, k3) = soap_cart_der(2, 1:n_soap, k3) - soap_cart_der(2, 1:n_soap, k2) - soap_cart_der(3, 1:n_soap, k3) = soap_cart_der(3, 1:n_soap, k3) - soap_cart_der(3, 1:n_soap, k2) - end if - end do - end do - - Some steps at first glance: - - * the code could (has to) be splitted in 3-4 kernels. Why? - * check if there are any variables that could lead to false dependencies between iterations, like the index `k2` - * is it efficient for GPUs to split the work over the index `i`? What about the memory access? Note the arrays are `2D` in Fortran - * is it possible to collapse some loops? Combining nested loops can reduce overhead and improve memory access patterns, leading to better GPU performance. - * what is the best memory access in a GPU? Review memory access patterns in the code. Minimize global memory access by utilizing shared memory or registers where appropriate. Ensure memory access is coalesced and aligned, maximizing GPU memory throughput - +~~~~~~~~~~ + + .. challenge:: How would this be ported? (n_soap ≈ 100, n_sites ⩾ 10000, k_max ≈ 20*n_sites ) + + Inspect the following Fortran code (if you don't read Fortran: do-loops == for-loops) + + .. 
code-block:: Fortran + + k2 = 0 + do i = 1, n_sites + do j = 1, n_neigh(i) + k2 = k2 + 1 + counter = 0 + counter2 = 0 + do n = 1, n_max + do np = n, n_max + do l = 0, l_max + if( skip_soap_component(l, np, n) )cycle + + counter = counter+1 + do m = 0, l + k = 1 + l*(l+1)/2 + m + counter2 = counter2 + 1 + multiplicity = multiplicity_array(counter2) + soap_rad_der(counter, k2) = soap_rad_der(counter, k2) + multiplicity * real( cnk_rad_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_rad_der(k, np, k2)) ) + soap_azi_der(counter, k2) = soap_azi_der(counter, k2) + multiplicity * real( cnk_azi_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_azi_der(k, np, k2)) ) + soap_pol_der(counter, k2) = soap_pol_der(counter, k2) + multiplicity * real( cnk_pol_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_pol_der(k, np, k2)) ) + end do + end do + end do + end do + + soap_rad_der(1:n_soap, k2) = soap_rad_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_rad_der(1:n_soap, k2) ) + soap_azi_der(1:n_soap, k2) = soap_azi_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_azi_der(1:n_soap, k2) ) + soap_pol_der(1:n_soap, k2) = soap_pol_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_pol_der(1:n_soap, k2) ) + + if( j == 1 )then + k3 = k2 + else + soap_cart_der(1, 1:n_soap, k2) = dsin(thetas(k2)) * dcos(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dcos(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) - dsin(phis(k2)) / rjs(k2) * soap_azi_der(1:n_soap, k2) + soap_cart_der(2, 1:n_soap, k2) = dsin(thetas(k2)) * dsin(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dsin(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) + dcos(phis(k2)) / rjs(k2) * soap_azi_der(1:n_soap, k2) + soap_cart_der(3, 1:n_soap, k2) = dcos(thetas(k2)) * soap_rad_der(1:n_soap, k2) + dsin(thetas(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) + soap_cart_der(1, 1:n_soap, k3) = soap_cart_der(1, 1:n_soap, k3) - soap_cart_der(1, 1:n_soap, k2) + soap_cart_der(2, 1:n_soap, k3) = soap_cart_der(2, 1:n_soap, k3) - soap_cart_der(2, 1:n_soap, k2) + soap_cart_der(3, 1:n_soap, k3) = soap_cart_der(3, 1:n_soap, k3) - soap_cart_der(3, 1:n_soap, k2) + end if + end do + end do + + Some steps at first glance: + + * the code could (has to) be splitted in 3-4 kernels. Why? + * check if there are any variables that could lead to false dependencies between iterations, like the index `k2` + * is it efficient for GPUs to split the work over the index `i`? What about the memory access? Note the arrays are `2D` in Fortran + * is it possible to collapse some loops? Combining nested loops can reduce overhead and improve memory access patterns, leading to better GPU performance. + * what is the best memory access in a GPU? Review memory access patterns in the code. Minimize global memory access by utilizing shared memory or registers where appropriate. Ensure memory access is coalesced and aligned, maximizing GPU memory throughput .. admonition:: Refactored code! - :class: dropdown - - - Registers are limited and the larger the kernel use more registers registers resulting in less active threads (small occupancy). - - In order to compute `soap_rad_der(is,k2)` the CUDA thread needs access to all the previous values `soap_rad_der(1:nsoap,k2)`. 
- - In order to compute `soap_cart_der(1, 1:n_soap, k3)` it is required to have access to all values `(k3+1:k2+n_neigh(i))`. - - Note the indices in the first part. The matrices are transposed for better access patterns. - - .. code-block:: Fortran - - !omp target teams distribute parallel do private (i) - do k2 = 1, k2_max - i=list_of_i(k2) - counter = 0 - counter2 = 0 - do n = 1, n_max - do np = n, n_max - do l = 0, l_max - if( skip_soap_component(l, np, n) ) then - cycle - endif - counter = counter+1 - do m = 0, l - k = 1 + l*(l+1)/2 + m - counter2 = counter2 + 1 - multiplicity = multiplicity_array(counter2) - tsoap_rad_der(k2,counter) = tsoap_rad_der(k2,counter) + multiplicity * real( tcnk_rad_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_rad_der(k2,k,np)) ) - tsoap_azi_der(k2,counter) = tsoap_azi_der(k2,counter) + multiplicity * real( tcnk_azi_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_azi_der(k2,k,np)) ) - tsoap_pol_der(k2,counter) = tsoap_pol_der(k2,counter) + multiplicity * real( tcnk_pol_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_pol_der(k2,k,np)) ) + + - Registers are limited and the larger the kernel use more registers registers + resulting in less active threads (small occupancy). + - In order to compute `soap_rad_der(is,k2)` the CUDA thread needs access to all the + previous values `soap_rad_der(1:nsoap,k2)`. + - In order to compute `soap_cart_der(1, 1:n_soap, k3)` it is required to have access + to all values `(k3+1:k2+n_neigh(i))`. + - Note the indices in the first part. The matrices are transposed for better access + patterns. + + .. code-block:: Fortran + + !omp target teams distribute parallel do private (i) + do k2 = 1, k2_max + i=list_of_i(k2) + counter = 0 + counter2 = 0 + do n = 1, n_max + do np = n, n_max + do l = 0, l_max + if( skip_soap_component(l, np, n) ) then + cycle + endif + counter = counter+1 + do m = 0, l + k = 1 + l*(l+1)/2 + m + counter2 = counter2 + 1 + multiplicity = multiplicity_array(counter2) + tsoap_rad_der(k2,counter) = tsoap_rad_der(k2,counter) + multiplicity * real( tcnk_rad_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_rad_der(k2,k,np)) ) + tsoap_azi_der(k2,counter) = tsoap_azi_der(k2,counter) + multiplicity * real( tcnk_azi_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_azi_der(k2,k,np)) ) + tsoap_pol_der(k2,counter) = tsoap_pol_der(k2,counter) + multiplicity * real( tcnk_pol_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_pol_der(k2,k,np)) ) + end do + end do + end do end do end do - end do - end do - end do - - ! Before the next part the variables are transposed again to their original layout. 
- - !omp target teams distribute private(i) - do k2 = 1, k2_max - i=list_of_i(k2) - locdot=0.d0 - - !omp parallel do reduction(+:locdot_rad_der,locdot_azi_der,locdot_pol_der) - do is=1,nsoap - locdot_rad_der=locdot_rad_der+soap(is, i) * soap_rad_der(is, k2) - locdot_azi_der=locdot_azi_der+soap(is, i) * soap_azi_der(is, k2) - locdot_pol_der=locdot_pol_der+soap(is, i) * soap_pol_der(is, k2) - enddo - dot_soap_rad_der(k2)= locdot_rad_der - dot_soap_azi_der(k2)= locdot_azi_der - dot_soap_pol_der(k2)= locdot_pol_der - end do - - !omp target teams distribute - do k2 = 1, k2_max - i=list_of_i(k2) - - !omp parallel do - do is=1,nsoap - soap_rad_der(is, k2) = soap_rad_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_rad_der(k2) - soap_azi_der(is, k2) = soap_azi_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_azi_der(k2) - soap_pol_der(is, k2) = soap_pol_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_pol_der(k2) - end do - end do - - !omp teams distribute private(k3) - do k2 = 1, k2_max - k3=list_k2k3(k2) - - !omp parallel do private (is) - do is=1,n_soap - if( k3 /= k2)then - soap_cart_der(1, is, k2) = dsin(thetas(k2)) * dcos(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dcos(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) - dsin(phis(k2)) / rjs(k2) * soap_azi_der(is, k2) - soap_cart_der(2, is, k2) = dsin(thetas(k2)) * dsin(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dsin(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) + dcos(phis(k2)) / rjs(k2) * soap_azi_der(is, k2) - soap_cart_der(3, is, k2) = dcos(thetas(k2)) * soap_rad_der(is, k2) + dsin(thetas(k2)) / rjs(k2) * soap_pol_der(is, k2) - end if - end do - end do - - !omp teams distribute private(k3) - do i = 1, n_sites - k3=list_k3(i) - - !omp parallel do private(is, k2) - do is=1,n_soap - do k2=k3+1,k3+n_neigh(i) - soap_cart_der(1, is, k3) = soap_cart_der(1, is, k3) - soap_cart_der(1, is, k2) - soap_cart_der(2, is, k3) = soap_cart_der(2, is, k3) - soap_cart_der(2, is, k2) - soap_cart_der(3, is, k3) = soap_cart_der(3, is, k3) - soap_cart_der(3, is, k2) - end do - end do - end do + ! Before the next part the variables are transposed again to their original layout. 
+ + !omp target teams distribute private(i) + do k2 = 1, k2_max + i=list_of_i(k2) + locdot=0.d0 + + !omp parallel do reduction(+:locdot_rad_der,locdot_azi_der,locdot_pol_der) + do is=1,nsoap + locdot_rad_der=locdot_rad_der+soap(is, i) * soap_rad_der(is, k2) + locdot_azi_der=locdot_azi_der+soap(is, i) * soap_azi_der(is, k2) + locdot_pol_der=locdot_pol_der+soap(is, i) * soap_pol_der(is, k2) + enddo + dot_soap_rad_der(k2)= locdot_rad_der + dot_soap_azi_der(k2)= locdot_azi_der + dot_soap_pol_der(k2)= locdot_pol_der + end do + + !omp target teams distribute + do k2 = 1, k2_max + i=list_of_i(k2) + + !omp parallel do + do is=1,nsoap + soap_rad_der(is, k2) = soap_rad_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_rad_der(k2) + soap_azi_der(is, k2) = soap_azi_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_azi_der(k2) + soap_pol_der(is, k2) = soap_pol_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_pol_der(k2) + end do + end do + + !omp teams distribute private(k3) + do k2 = 1, k2_max + k3=list_k2k3(k2) + + !omp parallel do private (is) + do is=1,n_soap + if( k3 /= k2)then + soap_cart_der(1, is, k2) = dsin(thetas(k2)) * dcos(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dcos(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) - dsin(phis(k2)) / rjs(k2) * soap_azi_der(is, k2) + soap_cart_der(2, is, k2) = dsin(thetas(k2)) * dsin(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dsin(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) + dcos(phis(k2)) / rjs(k2) * soap_azi_der(is, k2) + soap_cart_der(3, is, k2) = dcos(thetas(k2)) * soap_rad_der(is, k2) + dsin(thetas(k2)) / rjs(k2) * soap_pol_der(is, k2) + end if + end do + end do + + !omp teams distribute private(k3) + do i = 1, n_sites + k3=list_k3(i) + + !omp parallel do private(is, k2) + do is=1,n_soap + do k2=k3+1,k3+n_neigh(i) + soap_cart_der(1, is, k3) = soap_cart_der(1, is, k3) - soap_cart_der(1, is, k2) + soap_cart_der(2, is, k3) = soap_cart_der(2, is, k3) - soap_cart_der(2, is, k2) + soap_cart_der(3, is, k3) = soap_cart_der(3, is, k3) - soap_cart_der(3, is, k2) + end do + end do + end do .. keypoints:: - - Identify equivalent GPU libraries for CPU-based libraries and utilizing them to ensure efficient GPU utilization. - - Importance of identifying the computationally intensive parts of the code that contribute significantly to the execution time. - - The need to refactor loops to suit the GPU architecture. - - Significance of memory access optimization for efficient GPU execution, including coalesced and aligned memory access patterns. + - Identify equivalent GPU libraries for CPU-based libraries and utilizing them to ensure efficient GPU utilization. + - Importance of identifying the computationally intensive parts of the code that contribute significantly to the execution time. + - The need to refactor loops to suit the GPU architecture. + - Significance of memory access optimization for efficient GPU execution, including coalesced and aligned memory access patterns. Porting between different GPU frameworks ---------------------------------------- -You might also find yourself in a situation where you need to port a code from one particular -GPU framework to another. This section gives an overview of different tools that enable converting CUDA and -OpenACC codes to HIP and OpenMP, respectively. This conversion process enables an application to target various -GPU architectures, specifically, NVIDIA and AMD GPUs. 
Here we focus on
-`hipify `__ and
-`clacc `__ tools.
-This guide is adapted from the `NRIS documentation `__.
+You might also find yourself in a situation where you need to port a code from one
+particular GPU framework to another. This section gives an overview of different tools
+that enable converting CUDA and OpenACC codes to HIP and OpenMP, respectively. This
+conversion process enables an application to target various GPU architectures,
+specifically, NVIDIA and AMD GPUs. Here we focus on the `hipify
+`__ and
+`clacc `__ tools. This guide is adapted from the
+`NRIS documentation
+`__.

 Translating CUDA to HIP with Hipify
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-In this section, we cover the use of ``hipify-perl`` and ``hipify-clang`` tools to translate a CUDA code to HIP.
+In this section, we cover the use of the ``hipify-perl`` and ``hipify-clang`` tools to
+translate a CUDA code to HIP.

 Hipify-perl
-~~~~~~~~~~~
-
-The ``hipify-perl`` tool is a script based on perl that translates CUDA syntax into HIP syntax
-(see .e.g. `here `_ for more details).
-For instance, in a CUDA code that incorporates the CUDA functions ``cudaMalloc``` and ``cudaDeviceSynchronize``, the tool will substitute ``cudaMalloc`` with the HIP function ``hipMalloc``. Similarly the CUDA function ``cudaDeviceSynchronize`` will be substituted with the HIP function ``hipDeviceSynchronize``. We list below the basic steps to run ``hipify-perl`` on LUMI-G.
++++++++++++
+
+The ``hipify-perl`` tool is a Perl script that translates CUDA syntax into HIP syntax
+(see e.g. `here
+`_
+for more details). For instance, in a CUDA code that incorporates the CUDA functions
+``cudaMalloc`` and ``cudaDeviceSynchronize``, the tool will substitute ``cudaMalloc``
+with the HIP function ``hipMalloc``. Similarly, the CUDA function
+``cudaDeviceSynchronize`` will be substituted with the HIP function
+``hipDeviceSynchronize``. We list below the basic steps to run ``hipify-perl`` on
+LUMI-G.

 - **Step 1**: Generating the ``hipify-perl`` script

   .. code-block:: console

-     $ module load rocm/5.2.3
-     $ hipify-clang --perl
+        $ module load rocm/5.2.3
+        $ hipify-clang --perl

 - **Step 2**: Running the generated ``hipify-perl``

   .. code-block:: console

-     $ hipify-perl program.cu > program.cu.hip
+        $ hipify-perl program.cu > program.cu.hip

 - **Step 3**: Compiling the generated HIP code with ``hipcc``

   .. code-block:: console

-     $ hipcc --offload-arch=gfx90a -o program.hip.exe program.cu.hip
+        $ hipcc --offload-arch=gfx90a -o program.hip.exe program.cu.hip

-Despite the simplicity of the use of ``hipify-perl``, the tool might not be suitable for large applications, as it relies heavily on substituting CUDA strings with HIP strings (e.g. it substitutes ``*cuda*`` with ``*hip*``).
-In addition, ``hipify-perl`` lacks the ability of `distinguishing device/host function calls `_.
-The alternative here is to use the ``hipify-clang`` tool as will be described in the next section.
+Despite its simplicity, ``hipify-perl`` might not be suitable for large applications, as
+it relies heavily on substituting CUDA strings with HIP strings (e.g. it substitutes
+``*cuda*`` with ``*hip*``). In addition, ``hipify-perl`` is not capable of
+`distinguishing device/host function calls
+`_. The
+alternative is to use the ``hipify-clang`` tool, as described in the next section.
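
Before moving on, the following small, self-contained CUDA example may make the
substitution behaviour concrete. It is a made-up snippet (not part of the course
repository) that only allocates, copies and frees device memory; the comments indicate
the HIP names that ``hipify-perl`` is expected to produce for each call, and the CUDA
runtime header is likewise mapped to the HIP runtime header.

.. code-block:: cpp

   #include <cuda_runtime.h>   // expected to become <hip/hip_runtime.h>
   #include <vector>

   int main() {
     const size_t n = 1 << 20;
     std::vector<float> h_x(n, 1.0f);                 // host data

     float *d_x = nullptr;
     cudaMalloc((void **)&d_x, n * sizeof(float));    // -> hipMalloc
     cudaMemcpy(d_x, h_x.data(), n * sizeof(float),
                cudaMemcpyHostToDevice);              // -> hipMemcpy, hipMemcpyHostToDevice
     cudaDeviceSynchronize();                         // -> hipDeviceSynchronize
     cudaFree(d_x);                                   // -> hipFree
     return 0;
   }

Such one-to-one renaming is exactly what the string substitution handles well; the
constructs it cannot disambiguate (for instance, host functions whose names merely
contain ``cuda``) are where ``hipify-clang`` becomes the safer choice.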
Hipify-clang
-~~~~~~~~~~~~
-
-As described in the `HIPIFY documentation `_,
-the ``hipify-clang`` tool is based on clang for translating CUDA sources into HIP sources.
-The tool is more robust for translating CUDA codes compared to the ``hipify-perl`` tool.
-Furthermore, it facilitates the analysis of the code by providing assistance.
-
-In short, ``hipify-clang`` requires ``LLVM+CLANG`` and ``CUDA``. Details about building ``hipify-clang`` can be found `here `__. Note that ``hipify-clang`` is available on LUMI-G.
-The issue however might be related to the installation of CUDA-toolkit.
-To avoid any eventual issues with the installation procedure we opt for CUDA singularity container. Here we present a step-by-step guide for running ``hipify-clang``:
+++++++++++++
+
+As described in the `HIPIFY documentation
+`_,
+the ``hipify-clang`` tool is based on Clang and translates CUDA sources into HIP
+sources. It is more robust than ``hipify-perl`` for translating CUDA code, and it also
+assists the analysis of the code by reporting diagnostics during the translation.
+
+In short, ``hipify-clang`` requires ``LLVM+CLANG`` and ``CUDA``. Details about building
+``hipify-clang`` can be found `here `__. Note that
+``hipify-clang`` is available on LUMI-G. A possible issue, however, is the installation
+of the CUDA toolkit. To avoid such issues with the installation procedure, we opt for a
+CUDA Singularity container. Here we present a step-by-step guide for running
+``hipify-clang``:

 - **Step 1**: Pulling a CUDA singularity container, e.g.

   .. code-block:: console

-     $ singularity pull docker://nvcr.io/nvidia/cuda:11.4.3-devel-ubuntu20.04
+        $ singularity pull docker://nvcr.io/nvidia/cuda:11.4.3-devel-ubuntu20.04

 - **Step 2**: Loading a rocm module and launching the CUDA singularity

   .. code-block:: console

-     $ module load rocm/5.2.3
-     $ singularity shell -B $PWD,/opt:/opt cuda_11.4.0-devel-ubuntu20.04.sif
+        $ module load rocm/5.2.3
+        $ singularity shell -B $PWD,/opt:/opt cuda_11.4.0-devel-ubuntu20.04.sif

-  where the current directory ``$PWD`` in the host is mounted to that of the container, and the directory ``/opt`` in the host is mounted to the that inside the container.
+  where the current directory ``$PWD`` in the host is mounted to that of the container,
+  and the directory ``/opt`` in the host is mounted to the one inside the container.

-- **Step 3**: Setting the environment variable ``$PATH``.
-  In order to run ``hipify-clang`` from inside the container, one can set the environment variable ``$PATH`` that defines the path to look for the binary ``hipify-clang``.
+- **Step 3**: Setting the environment variable ``$PATH``. In order to run
+  ``hipify-clang`` from inside the container, one can extend the environment variable
+  ``$PATH`` so that it includes the directory containing the ``hipify-clang`` binary.

   .. code-block:: console

-     $ export PATH=/opt/rocm-5.2.3/bin:$PATH
+        $ export PATH=/opt/rocm-5.2.3/bin:$PATH

 Note that the ROCm version we used is ``rocm-5.2.3``.

@@ -279,149 +315,179 @@ To avoid any eventual issues with the installation procedure we opt for CUDA sin

   .. code-block:: console

-     $ hipify-clang program.cu -o hip_program.cu.hip --cuda-path=/usr/local/cuda-11.4 -I /usr/local/cuda-11.4/include
+        $ hipify-clang program.cu -o hip_program.cu.hip --cuda-path=/usr/local/cuda-11.4 -I /usr/local/cuda-11.4/include

-  Here the cuda path and the path to the ``*includes*`` and ``*defines*`` files should be specified.
The CUDA source code and the generated output code are `program.cu` and `hip_program.cu.hip`, respectively.
+  Here the CUDA path and the path to the ``*includes*`` and ``*defines*`` files should
+  be specified. The CUDA source code and the generated output code are `program.cu` and
+  `hip_program.cu.hip`, respectively.

-  The syntax for the compilation process of the generated hip code is similar to the one described in the previous section (see the **Step 3** in the hipify-perl section).
+  The compilation of the generated HIP code is similar to the procedure described in the
+  previous section (see **Step 3** in the hipify-perl section).

-Code examples for the ``Hipify`` exercises can be accessed in the `content/examples/exercise_hipify` subdirectory by cloning this repository:
+Code examples for the ``Hipify`` exercises can be accessed in the
+`content/examples/exercise_hipify` subdirectory by cloning this repository:

-  .. code-block:: console
+  .. code-block:: console

-     $ git clone https://github.com/ENCCS/gpu-programming.git
-     $ cd gpu-programming/content/examples/exercise_hipify
-     $ ls
+     $ git clone https://github.com/ENCCS/gpu-programming.git
+     $ cd gpu-programming/content/examples/exercise_hipify
+     $ ls

.. challenge:: Exercise I: Translate a CUDA code to HIP with ``hipify-perl``

-   1.1 Generate the ``hipify-perl`` tool.
+   1.1 Generate the ``hipify-perl`` tool.

-   1.2 Convert the CUDA code ``vec_add_cuda.cu`` located in ``/exercise_hipify/Hipify_perl`` with the ``Hipify-perl`` tool to HIP.
+   1.2 Convert the CUDA code ``vec_add_cuda.cu`` located in ``/exercise_hipify/Hipify_perl`` with the ``Hipify-perl`` tool to HIP.

-   1.3 Compile the generated HIP code with the ``hipcc`` compiler wrapper and run it.
+   1.3 Compile the generated HIP code with the ``hipcc`` compiler wrapper and run it.

.. challenge:: Exercise II: Translate a CUDA code to HIP with ``hipify-clang``

-   2.1 Convert the CUDA code ``vec_add_cuda.cu`` located in ``/exercise_hipify/Hipify_clang`` with the ``Hipify-clang`` tool to HIP.
-
-   2.2 Compile the generated HIP code with the ``hipcc`` compiler wrapper and run it.
+   2.1 Convert the CUDA code ``vec_add_cuda.cu`` located in ``/exercise_hipify/Hipify_clang`` with the ``Hipify-clang`` tool to HIP.
+   2.2 Compile the generated HIP code with the ``hipcc`` compiler wrapper and run it.

 Translating OpenACC to OpenMP with Clacc
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-`Clacc `_ is a tool to translate an OpenACC
-application to OpenMP offloading with the Clang/LLVM compiler environment.
-Note that the tool is specific to OpenACC C, while OpenACC Fortran is already supported on AMD GPU.
-As indicated in the `GitHub repository `_ the compiler ``Clacc`` is the ``Clang``'s executable in the subdirectory ``\bin`` of the ``\install`` directory as described below.
+`Clacc `_ is a tool to
+translate an OpenACC application to OpenMP offloading with the Clang/LLVM compiler
+environment. Note that the tool is specific to OpenACC C, while OpenACC Fortran is
+already supported on AMD GPUs. As indicated in the `GitHub repository
+`_, the ``Clacc`` compiler
+is the ``clang`` executable in the ``bin`` subdirectory of the ``install`` directory,
+as described below.

 In the following we present a step-by-step guide for building and using `Clacc`:

-- **Step 1**: Building and installing `Clacc `_.
+- **Step 1**: Building and installing `Clacc
+  `_.

  ..
code-block:: console - $ git clone -b clacc/main https://github.com/llvm-doe-org/llvm-project.git - $ cd llvm-project - $ mkdir build && cd build - $ cmake -DCMAKE_INSTALL_PREFIX=../install \ - -DCMAKE_BUILD_TYPE=Release \ - -DLLVM_ENABLE_PROJECTS="clang;lld" \ - -DLLVM_ENABLE_RUNTIMES=openmp \ - -DLLVM_TARGETS_TO_BUILD="host;AMDGPU" \ - -DCMAKE_C_COMPILER=gcc \ - -DCMAKE_CXX_COMPILER=g++ \ - ../llvm - $ make - $ make install + $ git clone -b clacc/main https://github.com/llvm-doe-org/llvm-project.git + $ cd llvm-project + $ mkdir build && cd build + $ cmake -DCMAKE_INSTALL_PREFIX=../install \ + -DCMAKE_BUILD_TYPE=Release \ + -DLLVM_ENABLE_PROJECTS="clang;lld" \ + -DLLVM_ENABLE_RUNTIMES=openmp \ + -DLLVM_TARGETS_TO_BUILD="host;AMDGPU" \ + -DCMAKE_C_COMPILER=gcc \ + -DCMAKE_CXX_COMPILER=g++ \ + ../llvm + $ make + $ make install -- **Step 2**: Setting up environment variables to be able to work from the ``/install`` directory, which is the simplest way. We assume that the ``/install`` directory is located in the path ``/project/project_xxxxxx/Clacc/llvm-project``. +- **Step 2**: Setting up environment variables to be able to work from the ``/install`` + directory, which is the simplest way. We assume that the ``/install`` directory is + located in the path ``/project/project_xxxxxx/Clacc/llvm-project``. -For more advanced usage, which includes for instance modifying ``Clacc``, we refer readers to `"Usage from Build directory" `_ +For more advanced usage, which includes for instance modifying ``Clacc``, we refer +readers to `"Usage from Build directory" +`_ - .. code-block:: console + .. code-block:: console - $ export PATH=/project/project_xxxxxx/Clacc/llvm-project/install/bin:$PATH - $ export LD_LIBRARY_PATH=/project/project_xxxxxx/Clacc/llvm-project/install/lib:$LD_LIBRARY_PATH + $ export PATH=/project/project_xxxxxx/Clacc/llvm-project/install/bin:$PATH + $ export LD_LIBRARY_PATH=/project/project_xxxxxx/Clacc/llvm-project/install/lib:$LD_LIBRARY_PATH -- **Step 3**: Source to source conversion of the `openACC_code.c` code to be printed out to the file `openMP_code.c`: +- **Step 3**: Source to source conversion of the `openACC_code.c` code to be printed out + to the file `openMP_code.c`: .. code-block:: console - $ clang -fopenacc-print=omp -fopenacc-structured-ref-count-omp=no-ompx-hold openACC_code.c > openMP_code.c + $ clang -fopenacc-print=omp -fopenacc-structured-ref-count-omp=no-ompx-hold openACC_code.c > openMP_code.c - Here the flag ``-fopenacc-structured-ref-count-omp=no-ompx-hold`` is introduced to disable the ``ompx_hold`` map type modifier, which is used by the OpenACC ``copy`` clause translation. The ``ompx_hold`` is an OpenMP extension that might not be supported yet by other compilers. + Here the flag ``-fopenacc-structured-ref-count-omp=no-ompx-hold`` is introduced to + disable the ``ompx_hold`` map type modifier, which is used by the OpenACC ``copy`` + clause translation. The ``ompx_hold`` is an OpenMP extension that might not be + supported yet by other compilers. -- **Step 4** Compiling the code with the `cc compiler wrapper `_ +- **Step 4** Compiling the code with the `cc compiler wrapper + `_ .. code-block:: - module load CrayEnv - module load PrgEnv-cray - module load craype-accel-amd-gfx90a - module load rocm/5.2.3 + module load CrayEnv + module load PrgEnv-cray + module load craype-accel-amd-gfx90a + module load rocm/5.2.3 - cc -fopenmp -o executable openMP_code.c + cc -fopenmp -o executable openMP_code.c .. 
callout:: Access exercise material - Code examples for the ``Clacc`` exercise can be accessed in the `content/examples/exercise_clacc` subdirectory by cloning this repository: + Code examples for the ``Clacc`` exercise can be accessed in the `content/examples/exercise_clacc` subdirectory by cloning this repository: - .. code-block:: console + .. code-block:: console - $ git clone https://github.com/ENCCS/gpu-programming.git - $ cd gpu-programming/content/examples/exercise_clacc - $ ls + $ git clone https://github.com/ENCCS/gpu-programming.git + $ cd gpu-programming/content/examples/exercise_clacc + $ ls .. challenge:: Exercise : Translate an OpenACC code to OpenMP - 1. Convert the OpenACC code ``openACC_code.c`` located in ``/exercise_clacc`` with the ``Clacc`` compiler. + 1. Convert the OpenACC code ``openACC_code.c`` located in ``/exercise_clacc`` with the ``Clacc`` compiler. - 2. Compile the generated OpenMP code with the ``cc`` compiler wrapper and run it. + 2. Compile the generated OpenMP code with the ``cc`` compiler wrapper and run it. Translating CUDA to SYCL/DPC++ with SYCLomatic -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Intel offers a tool for CUDA-to-SYCL code migration, included in the Intel oneAPI Basekit. +Intel offers a tool for CUDA-to-SYCL code migration, included in the Intel oneAPI +Basekit. -It is not installed on LUMI, but the general workflow is similar to the HIPify Clang and also requires an existing CUDA installation: +It is not installed on LUMI, but the general workflow is similar to the HIPify Clang and +also requires an existing CUDA installation: - .. code-block:: console + .. code-block:: console - $ dpct program.cu - $ cd dpct_output/ - $ icpx -fsycl program.dp.cpp + $ dpct program.cu + $ cd dpct_output/ + $ icpx -fsycl program.dp.cpp -SYCLomatic can migrate larger projects by using ``-in-root`` and ``-out-root`` flags to process directories recursively. It can also -use compilation database (supported by CMake and other build systems) to deal with more complex project layouts. +SYCLomatic can migrate larger projects by using ``-in-root`` and ``-out-root`` flags to +process directories recursively. It can also use compilation database (supported by +CMake and other build systems) to deal with more complex project layouts. -Please note that the code generated by SYCLomatic relies on oneAPI-specific extensions, and thus cannot be directly used with other -SYCL implementations, such as AdaptiveCpp (hipSYCL). The ``--no-incremental-migration`` flag can be added to ``dpct`` command to minimize, but not -completely avoid, the use of this compatibility layer. That would require manual effort, since some CUDA concepts cannot be directly -mapped to SYCL. +Please note that the code generated by SYCLomatic relies on oneAPI-specific extensions, +and thus cannot be directly used with other SYCL implementations, such as AdaptiveCpp +(hipSYCL). The ``--no-incremental-migration`` flag can be added to ``dpct`` command to +minimize, but not completely avoid, the use of this compatibility layer. That would +require manual effort, since some CUDA concepts cannot be directly mapped to SYCL. -Additionally, CUDA applications might assume certain hardware behavior, such as 32-wide warps. If the target hardware is different -(e.g., AMD MI250 GPUs, used in LUMI, have warp size of 64), the algorithms might need to be adjusted manually. +Additionally, CUDA applications might assume certain hardware behavior, such as 32-wide +warps. 
If the target hardware is different (e.g., AMD MI250 GPUs, used in LUMI, have
+warp size of 64), the algorithms might need to be adjusted manually.

 Conclusion
-^^^^^^^^^^
+~~~~~~~~~~

-This concludes a brief overview of the usage of available tools to convert CUDA codes to HIP and SYCL, and OpenACC codes to OpenMP offloading. In general the translation process for large applications might be incomplete and thus requires manual modification to complete the porting process. It is however worth noting that the accuracy of the translation process requires that applications are written correctly according to the CUDA and OpenACC syntaxes.
+This concludes a brief overview of the available tools for converting CUDA codes to HIP
+and SYCL, and OpenACC codes to OpenMP offloading. In general, the translation of a large
+application might be incomplete and thus requires manual work to finish the porting
+process. It is also worth noting that a reliable translation requires the original
+application to be written in correct CUDA and OpenACC syntax.

 See also
 --------

 - `Hipify GitHub `_
-- `HIPify Reference Guide v5.1 `_
-- `HIP example `_
-- `Porting CUDA to HIP `_
-- `Clacc Main repository README `_
-- `SYCLomatic main mage `_
-- `SYCLomatic documentation `_
+- `HIPify Reference Guide v5.1
+  `_
+- `HIP example
+  `_
+- `Porting CUDA to HIP
+  `_
+- `Clacc Main repository README
+  `_
+- `SYCLomatic main page
+  `_
+- `SYCLomatic documentation
+  `_

 .. keypoints::

-   - Useful tools exist to automatically translate tools from CUDA to HIP and SYCL and from OpenACC to OpenMP, but they may require manual modifications.
-
-
+   - Useful tools exist to automatically translate code from CUDA to HIP and SYCL and from OpenACC to OpenMP, but they may require manual modifications.
diff --git a/content/12-recommendations.rst b/content/12-recommendations.rst
index ef07536d..81f2482e 100644
--- a/content/12-recommendations.rst
+++ b/content/12-recommendations.rst
@@ -3,82 +3,88 @@ Recommendations

 .. questions::

-   - Which GPU programming framework is right for me and my project?
-
+   - Which GPU programming framework is right for me and my project?

 .. instructor-note::

-   - 30 min teaching
-   - 15 min exercises
-
+   - 30 min teaching
+   - 15 min exercises

 Portability
 -----------

-One of the critical factors when diving into GPU programming is the portability of the chosen framework.
-It's crucial to ensure that the framework you decide to utilize is compatible with the GPU or GPUs you intend
-to use. This might seem like a basic step, but it's essential to avoid unnecessary hardware-software mismatches
-that could lead to performance bottlenecks or, worse, a complete failure of the system.
+One of the critical factors when diving into GPU programming is the portability of the
+chosen framework. It's crucial to ensure that the framework you decide to utilize is
+compatible with the GPU or GPUs you intend to use. This might seem like a basic step,
+but it's essential to avoid unnecessary hardware-software mismatches that could lead to
+performance bottlenecks or, worse, a complete failure of the system.
+Moreover, if you're targeting multiple platforms or GPUs, it's wise to consider using +frameworks that support portable kernel-based models or those that come with high-level +language support. The benefit of these frameworks is that they allow for efficient +execution of your code on a variety of hardware configurations without needing +significant alterations. Programming Effort ------------------ -The amount of programming effort required is another factor to consider when choosing a GPU programming framework. -It's advisable to select a framework that supports the programming language you're comfortable with. -This consideration will ensure a smoother learning curve and a more efficient development process. +The amount of programming effort required is another factor to consider when choosing a +GPU programming framework. It's advisable to select a framework that supports the +programming language you're comfortable with. This consideration will ensure a smoother +learning curve and a more efficient development process. -Furthermore, it's important to check the availability of supportive resources for the chosen framework. -Comprehensive documentation, illustrative examples, and an active community are important when learning -a new framework or troubleshooting issues. They not only minimize the time spent on resolving bugs but also -foster continuous learning and mastery of the framework. +Furthermore, it's important to check the availability of supportive resources for the +chosen framework. Comprehensive documentation, illustrative examples, and an active +community are important when learning a new framework or troubleshooting issues. They +not only minimize the time spent on resolving bugs but also foster continuous learning +and mastery of the framework. Performance Requirements ------------------------ -Every application or project has unique performance requirements. Therefore, it's crucial to evaluate the -performance characteristics and optimization capabilities of the potential frameworks before choosing one. -Some frameworks offer extensive optimization features and can automatically tune your code to maximize its -performance. Knowing how well a framework can handle your specific workload requirements can save you +Every application or project has unique performance requirements. Therefore, it's +crucial to evaluate the performance characteristics and optimization capabilities of the +potential frameworks before choosing one. Some frameworks offer extensive optimization +features and can automatically tune your code to maximize its performance. Knowing how +well a framework can handle your specific workload requirements can save you considerable time and resources in the long run. Cost-benefit Analysis --------------------- -Before finalizing your choice of a GPU programming framework, it's recommended to perform a cost-benefit analysis. -Consider the specific requirements of your project, like the processing power needed, the complexity of the tasks, -the amount of data to be processed, and the cost associated with the potential framework. -Understanding these factors will help you determine the most suitable and cost-effective framework for your needs. +Before finalizing your choice of a GPU programming framework, it's recommended to +perform a cost-benefit analysis. Consider the specific requirements of your project, +like the processing power needed, the complexity of the tasks, the amount of data to be +processed, and the cost associated with the potential framework. 
Understanding these +factors will help you determine the most suitable and cost-effective framework for your +needs. Choosing Between Frameworks --------------------------- -The decision of choosing between different GPU programming frameworks often depends on several factors, including: - -- **The specifics of the problem**: Different problems might need different computational capabilities. - Understand your problem thoroughly and evaluate which framework is best equipped to handle it. - -- **Starting point**: If you're starting from scratch, you might have more flexibility in choosing your framework than - if you're building on top of existing code. - -- **Background knowledge of the programmer**: The familiarity of the programmer with certain programming languages or - frameworks plays a big role in the decision-making process. - -- **Time investment**: Some frameworks may have a steeper learning curve but offer more extensive capabilities, - while others might be easier to grasp but provide limited features. - -- **Performance needs**: Some applications require maximum computational power, while others might not. - The performance capabilities of the framework should align with the needs of your project. - -By keeping these considerations in mind, you can make a more informed decision and choose a GPU programming -framework that best suits your needs. +The decision of choosing between different GPU programming frameworks often depends on +several factors, including: + +- **The specifics of the problem**: Different problems might need different + computational capabilities. Understand your problem thoroughly and evaluate which + framework is best equipped to handle it. +- **Starting point**: If you're starting from scratch, you might have more flexibility + in choosing your framework than if you're building on top of existing code. +- **Background knowledge of the programmer**: The familiarity of the programmer with + certain programming languages or frameworks plays a big role in the decision-making + process. +- **Time investment**: Some frameworks may have a steeper learning curve but offer more + extensive capabilities, while others might be easier to grasp but provide limited + features. +- **Performance needs**: Some applications require maximum computational power, while + others might not. The performance capabilities of the framework should align with the + needs of your project. + +By keeping these considerations in mind, you can make a more informed decision and +choose a GPU programming framework that best suits your needs. .. discussion:: Question and discussion time - - Has your mental model of how GPUs work and how they are programmed changed? - - Do you have a better idea about what framework is right for your code? - - What questions do you have? Ask us anything! + - Has your mental model of how GPUs work and how they are programmed changed? + - Do you have a better idea about what framework is right for your code? + - What questions do you have? Ask us anything! diff --git a/content/13-examples.rst b/content/13-examples.rst index 3215613f..13d23938 100644 --- a/content/13-examples.rst +++ b/content/13-examples.rst @@ -5,671 +5,730 @@ GPU programming example: stencil computation .. questions:: - - How do I compile and run code developed using different programming models and frameworks? - - What can I expect from the GPU-ported programs in terms of performance gains / trends and how do I estimate this? 
+ - How do I compile and run code developed using different programming models and frameworks? + - What can I expect from the GPU-ported programs in terms of performance gains / trends and how do I estimate this? .. objectives:: - - To show a self-contained example of parallel computation executed on CPU and GPU using different programming models - - To show differences and consequences of implementing the same algorithm in natural "style" of different models/ frameworks - - To discuss how to assess theoretical and practical performance scaling of GPU codes + - To show a self-contained example of parallel computation executed on CPU and GPU using different programming models + - To show differences and consequences of implementing the same algorithm in natural "style" of different models/ frameworks + - To discuss how to assess theoretical and practical performance scaling of GPU codes .. instructor-note:: - - 35 min teaching - - 30 min exercises - + - 35 min teaching + - 30 min exercises Problem: heat flow in two-dimensional area -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------------------------ -Heat flows in objects according to local temperature differences, as if seeking local equilibrium. The following example defines a rectangular area with two always-warm sides (temperature 70 and 85), two cold sides (temperature 20 and 5) and a cold disk at the center. Because of heat diffusion, temperature of neighboring patches of the area is bound to equalize, changing the overall distribution: +Heat flows in objects according to local temperature differences, as if seeking local +equilibrium. The following example defines a rectangular area with two always-warm sides +(temperature 70 and 85), two cold sides (temperature 20 and 5) and a cold disk at the +center. Because of heat diffusion, temperature of neighboring patches of the area is +bound to equalize, changing the overall distribution: .. figure:: img/stencil/heat_montage.png - :align: center - - Over time, the temperature distribution progresses from the initial state toward an end state where upper triangle is warm and lower is cold. The average temperature tends to (70 + 85 + 20 + 5) / 4 = 45. + :align: center + Over time, the temperature distribution progresses from the initial state toward an + end state where upper triangle is warm and lower is cold. The average temperature + tends to (70 + 85 + 20 + 5) / 4 = 45. Technique: stencil computation -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------------ -Heat transfer in the system above is governed by the partial differential equation(s) describing local variation of the temperature field in time and space. That is, the rate of change of the temperature field :math:`u(x, y, t)` over two spatial dimensions :math:`x` and :math:`y` and time :math:`t` (with rate coefficient :math:`\alpha`) can be modelled via the equation +Heat transfer in the system above is governed by the partial differential equation(s) +describing local variation of the temperature field in time and space. That is, the rate +of change of the temperature field :math:`u(x, y, t)` over two spatial dimensions +:math:`x` and :math:`y` and time :math:`t` (with rate coefficient :math:`\alpha`) can be +modelled via the equation .. math:: - \frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) - -The standard way to numerically solve differential equations is to *discretize* them, i. e. 
to consider only a set/ grid of specific area points at specific moments in time. That way, partial derivatives :math:`{\partial u}` are converted into differences between adjacent grid points :math:`u^{m}(i,j)`, with :math:`m, i, j` denoting time and spatial grid points, respectively. Temperature change in time at a certain point can now be computed from the values of neighboring points at earlier time; the same expression, called *stencil*, is applied to every point on the grid. + + \frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) + +The standard way to numerically solve differential equations is to *discretize* them, i. +e. to consider only a set/ grid of specific area points at specific moments in time. +That way, partial derivatives :math:`{\partial u}` are converted into differences +between adjacent grid points :math:`u^{m}(i,j)`, with :math:`m, i, j` denoting time and +spatial grid points, respectively. Temperature change in time at a certain point can now +be computed from the values of neighboring points at earlier time; the same expression, +called *stencil*, is applied to every point on the grid. .. figure:: img/stencil/stencil.svg - :align: center + :align: center - This simplified model uses an 8x8 grid of data in light blue in state :math:`m`, each location of which has to be updated based on the indicated 5-point stencil in yellow to move to the next time point :math:`m+1`. + This simplified model uses an 8x8 grid of data in light blue in state :math:`m`, + each location of which has to be updated based on the indicated 5-point stencil in + yellow to move to the next time point :math:`m+1`. .. challenge:: Question: stencil applications - Stencil computation is a common occurrence in solving numerical problems. Have you already encountered it? Can you think of a problem that could be formulated this way in your field / area of expertise? - - .. solution:: - - One obvious choice is *convolution* operation, used in image processing to apply various filter kernels; in some contexts, "convolution" and "stencil" are used almost interchangeably. Other related use is for averaging/ pooling adjacent values. + Stencil computation is a common occurrence in solving numerical problems. Have you already encountered it? Can you think of a problem that could be formulated this way in your field / area of expertise? + .. solution:: + + One obvious choice is *convolution* operation, used in image processing to apply various filter kernels; in some contexts, "convolution" and "stencil" are used almost interchangeably. Other related use is for averaging/ pooling adjacent values. Technical considerations ------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~ **1. How fast and/ or accurate can the solution be?** -Spatial resolution of the temperature field is controlled by the number/ density of the grid points. As the full grid update is required to proceed from one time point to the next, stencil computation is the main target of parallelization (on CPU or GPU). +Spatial resolution of the temperature field is controlled by the number/ density of the +grid points. As the full grid update is required to proceed from one time point to the +next, stencil computation is the main target of parallelization (on CPU or GPU). 
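
To make the grid update concrete, a minimal serial sketch of one time step is shown
below. It applies the 5-point stencil expression discussed above to the interior points
of a row-major array; all names (``u``, ``u_new``, ``nx``, ``ny``) are illustrative only
and do not correspond to the variable names used in the course examples that follow.

.. code-block:: cpp

   #include <vector>

   // One explicit time step of the 2D heat equation on an nx-by-ny grid.
   // Boundary values are left untouched (fixed-temperature sides).
   void stencil_step(const std::vector<double> &u, std::vector<double> &u_new,
                     int nx, int ny, double dx, double dy, double dt, double alpha) {
     auto idx = [ny](int i, int j) { return i * ny + j; };   // row-major indexing
     for (int i = 1; i < nx - 1; ++i) {
       for (int j = 1; j < ny - 1; ++j) {
         const double d2x = (u[idx(i - 1, j)] - 2.0 * u[idx(i, j)] + u[idx(i + 1, j)]) / (dx * dx);
         const double d2y = (u[idx(i, j - 1)] - 2.0 * u[idx(i, j)] + u[idx(i, j + 1)]) / (dy * dy);
         u_new[idx(i, j)] = u[idx(i, j)] + dt * alpha * (d2x + d2y);
       }
     }
   }

Every iteration of the double loop is independent of the others, which is precisely what
makes this update a good target for thread- and GPU-parallelization.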
-Moreover, in many cases the chosen time step cannot be arbitrarily large, otherwise the numerical differentiation will fail, and dense/ accurate grids imply small time steps (see inset below), which makes efficient spatial update even more important. +Moreover, in many cases the chosen time step cannot be arbitrarily large, otherwise the +numerical differentiation will fail, and dense/ accurate grids imply small time steps +(see inset below), which makes efficient spatial update even more important. .. solution:: Optional: stencil expression and time-step limit - - Differential equation shown above can be discretized using different schemes. For this example, temperature values at each grid point :math:`u^{m}(i,j)` are updated from one time point (:math:`m`) to the next (:math:`m+1`), using the following expressions: - - .. math:: - u^{m+1}(i,j) = u^m(i,j) + \Delta t \alpha \nabla^2 u^m(i,j) , - - where - - .. math:: - \nabla^2 u &= \frac{u(i-1,j)-2u(i,j)+u(i+1,j)}{(\Delta x)^2} \\ - &+ \frac{u(i,j-1)-2u(i,j)+u(i,j+1)}{(\Delta y)^2} , - - and :math:`\Delta x`, :math:`\Delta y`, :math:`\Delta t` are step sizes in space and time, respectively. - - Time-update schemes often have a limit on the maximum allowed time step :math:`\Delta t`. For the current scheme, it is equal to - - .. math:: - \Delta t_{max} = \frac{(\Delta x)^2 (\Delta y)^2}{2 \alpha ((\Delta x)^2 + (\Delta y)^2)} -**2. What to do with area boundaries?** + Differential equation shown above can be discretized using different schemes. For this example, temperature values at each grid point :math:`u^{m}(i,j)` are updated from one time point (:math:`m`) to the next (:math:`m+1`), using the following expressions: -Naturally, stencil expression can't be applied directly to the outermost grid points that have no outer neighbors. This can be solved by either changing the expression for those points or by adding an additional layer of grid that is used in computing update, but not updated itself -- points of fixed temperature for the sides are being used in this example. + .. math:: + u^{m+1}(i,j) = u^m(i,j) + \Delta t \alpha \nabla^2 u^m(i,j) , -**3. How could the algorithm be optimized further?** + where -In `an earlier episode `_, importance of efficient memory access was already stressed. In the following examples, each grid point (and its neighbors) is treated mostly independently; however, this also means that for 5-point stencil each value of the grid point may be read up to 5 times from memory (even if it's the fast GPU memory). By rearranging the order of mathematical operations, it may be possible to reuse these values in a more efficient way. + .. math:: + \nabla^2 u &= \frac{u(i-1,j)-2u(i,j)+u(i+1,j)}{(\Delta x)^2} \\ + &+ \frac{u(i,j-1)-2u(i,j)+u(i,j+1)}{(\Delta y)^2} , -Another point to note is that even if the solution is propagated in small time steps, not every step might actually be needed for output. Once some *local* region of the field is updated, mathematically nothing prevents it from being updated for the second time step -- even if the rest of the field is still being recalculated -- as long as :math:`t = m-1` values for the region boundary are there when needed. (Of course, this is more complicated to implement and would only give benefits in certain cases.) + and :math:`\Delta x`, :math:`\Delta y`, :math:`\Delta t` are step sizes in space and time, respectively. + Time-update schemes often have a limit on the maximum allowed time step :math:`\Delta t`. For the current scheme, it is equal to -.. 
challenge:: Poll: which programming model/ framework are you most interested in today? + .. math:: + \Delta t_{max} = \frac{(\Delta x)^2 (\Delta y)^2}{2 \alpha ((\Delta x)^2 + (\Delta y)^2)} - - OpenMP offloading (C++) - - SYCL (C++) - - *Python* (``numba``/CUDA) - - Julia +**2. What to do with area boundaries?** +Naturally, stencil expression can't be applied directly to the outermost grid points +that have no outer neighbors. This can be solved by either changing the expression for +those points or by adding an additional layer of grid that is used in computing update, +but not updated itself -- points of fixed temperature for the sides are being used in +this example. + +**3. How could the algorithm be optimized further?** + +In `an earlier episode +`_, +importance of efficient memory access was already stressed. In the following examples, +each grid point (and its neighbors) is treated mostly independently; however, this also +means that for 5-point stencil each value of the grid point may be read up to 5 times +from memory (even if it's the fast GPU memory). By rearranging the order of mathematical +operations, it may be possible to reuse these values in a more efficient way. + +Another point to note is that even if the solution is propagated in small time steps, +not every step might actually be needed for output. Once some *local* region of the +field is updated, mathematically nothing prevents it from being updated for the second +time step -- even if the rest of the field is still being recalculated -- as long as +:math:`t = m-1` values for the region boundary are there when needed. (Of course, this +is more complicated to implement and would only give benefits in certain cases.) + +.. challenge:: Poll: which programming model/ framework are you most interested in today? + + - OpenMP offloading (C++) + - SYCL (C++) + - *Python* (``numba``/CUDA) + - Julia The following table will aid you in navigating the rest of this section: .. admonition:: Episode guide - - `Sequential and OpenMP-threaded code `__ in C++, including compilation/ running instructions - - `Naive GPU parallelization `__, including SYCL compilation instructions - - `GPU code with device data management `__ (OpenMP, SYCL) - - `Python implementation `__, including running instructions on `Google Colab `__ - - `Julia implementation `__, including running instructions - + - `Sequential and OpenMP-threaded code + `__ + in C++, including compilation/ running instructions + - `Naive GPU parallelization + `__, + including SYCL compilation instructions + - `GPU code with device data management + `__ + (OpenMP, SYCL) + - `Python implementation + `__, + including running instructions on `Google Colab + `__ + - `Julia implementation + `__, + including running instructions Sequential and thread-parallel program in C++ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +--------------------------------------------- .. callout:: Trying out code examples - Source files of the examples presented for the rest of this episode are available in the `content/examples/stencil/ `_ directory. - To download them to your preferred directory on the cluster (f.e. ``/scratch/project_<#>//``), you can use Git: - - .. code-block:: console + Source files of the examples presented for the rest of this episode are available in the `content/examples/stencil/ `_ directory. + To download them to your preferred directory on the cluster (f.e. 
``/scratch/project_<#>//``), you can use Git: - $ git clone https://github.com/ENCCS/gpu-programming.git - $ cd gpu-programming/content/examples/stencil/ - $ ls + .. code-block:: console - .. warning:: + $ git clone https://github.com/ENCCS/gpu-programming.git + $ cd gpu-programming/content/examples/stencil/ + $ ls - Don't forget to ``git pull`` for the latest updates if you already have the content from the first day of the workshop! + .. warning:: -If we assume the grid point values to be truly independent *for a single time step*, stencil application procedure may be straightforwardly written as a loop over the grid points, as shown below in tab "Stencil update". (General structure of the program and the default parameter values for the problem model are also provided for reference.) CPU-thread parallelism can then be enabled by a single OpenMP ``#pragma``: + Don't forget to ``git pull`` for the latest updates if you already have the content from the first day of the workshop! + +If we assume the grid point values to be truly independent *for a single time step*, +stencil application procedure may be straightforwardly written as a loop over the grid +points, as shown below in tab "Stencil update". (General structure of the program and +the default parameter values for the problem model are also provided for reference.) +CPU-thread parallelism can then be enabled by a single OpenMP ``#pragma``: .. tabs:: - .. tab:: Stencil update + .. tab:: Stencil update - .. literalinclude:: examples/stencil/base/core.cpp - :language: cpp - :emphasize-lines: 25 + .. literalinclude:: examples/stencil/base/core.cpp + :language: cpp + :emphasize-lines: 25 - .. tab:: Main function + .. tab:: Main function - .. literalinclude:: examples/stencil/base/main.cpp - :language: cpp - :emphasize-lines: 37 - - .. tab:: Default params + .. literalinclude:: examples/stencil/base/main.cpp + :language: cpp + :emphasize-lines: 37 - .. literalinclude:: examples/stencil/base/heat.h - :language: cpp - :lines: 7-34 + .. tab:: Default params + .. literalinclude:: examples/stencil/base/heat.h + :language: cpp + :lines: 7-34 .. solution:: Optional: compiling the executables - To compile executable files for the OpenMP-based variants, follow the instructions below: - - .. code-block:: console - - salloc -A project_465001310 -p small-g -N 1 -c 8 -n 1 --gpus-per-node=1 -t 1:00:00 - - module load LUMI/24.03 - module load partition/G - module load rocm/6.0.3 - module load PrgEnv-cray/8.5.0 - - cd base/ - make all - - Afterwards login into a compute node and test the executables (or just ``srun `` directly): - - .. code-block:: console - - $ srun --pty bash - - $ ./stencil - $ ./stencil_off - $ ./stencil_data - - $ exit - - If everything works well, the output should look similar to this: - - .. code-block:: console - - $ ./stencil - Average temperature, start: 59.763305 - Average temperature at end: 59.281239 - Control temperature at end: 59.281239 - Iterations took 0.566 seconds. - $ ./stencil_off - Average temperature, start: 59.763305 - Average temperature at end: 59.281239 - Control temperature at end: 59.281239 - Iterations took 3.792 seconds. - $ ./stencil_data - Average temperature, start: 59.763305 - Average temperature at end: 59.281239 - Control temperature at end: 59.281239 - Iterations took 1.211 seconds. - $ + To compile executable files for the OpenMP-based variants, follow the instructions below: + + .. 
code-block:: console + + salloc -A project_465001310 -p small-g -N 1 -c 8 -n 1 --gpus-per-node=1 -t 1:00:00 + + module load LUMI/24.03 + module load partition/G + module load rocm/6.0.3 + module load PrgEnv-cray/8.5.0 + cd base/ + make all + + Afterwards login into a compute node and test the executables (or just ``srun `` directly): + + .. code-block:: console + + $ srun --pty bash + + $ ./stencil + $ ./stencil_off + $ ./stencil_data + + $ exit + + If everything works well, the output should look similar to this: + + .. code-block:: console + + $ ./stencil + Average temperature, start: 59.763305 + Average temperature at end: 59.281239 + Control temperature at end: 59.281239 + Iterations took 0.566 seconds. + $ ./stencil_off + Average temperature, start: 59.763305 + Average temperature at end: 59.281239 + Control temperature at end: 59.281239 + Iterations took 3.792 seconds. + $ ./stencil_data + Average temperature, start: 59.763305 + Average temperature at end: 59.281239 + Control temperature at end: 59.281239 + Iterations took 1.211 seconds. + $ CPU parallelization: timings ----------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -(**NOTE**: for thread-parallel runs it is necessary to request multiple CPU cores. In LUMI-G partitions, this can be done by asking for multiple GPUs; an alternative is to use -C partitions.) +(**NOTE**: for thread-parallel runs it is necessary to request multiple CPU cores. In +LUMI-G partitions, this can be done by asking for multiple GPUs; an alternative is to +use -C partitions.) -For later comparison, some benchmarks of the OpenMP thread-parallel implementation are provided below: +For later comparison, some benchmarks of the OpenMP thread-parallel implementation are +provided below: .. list-table:: Run times of OpenMP-enabled executable, s - :widths: 25 25 25 - :header-rows: 1 - - * - Job size - - 1 CPU core - - 32 CPU cores - * - S:2000 T:500 - - 1.402 - - 0.064 - * - S:2000 T:5000 - - 13.895 - - 0.538 - * - S:2000 T:10000 - - 27.753 - - 1.071 - * - S:4000 T:500 - - 5.727 - - 0.633 - * - S:8000 T:500 - - 24.130 - - 16.616 - -A closer look reveals that the computation time scales very nicely with increasing **time steps**: + :widths: 25 25 25 + :header-rows: 1 + + - - Job size + - 1 CPU core + - 32 CPU cores + - - S:2000 T:500 + - 1.402 + - 0.064 + - - S:2000 T:5000 + - 13.895 + - 0.538 + - - S:2000 T:10000 + - 27.753 + - 1.071 + - - S:4000 T:500 + - 5.727 + - 0.633 + - - S:8000 T:500 + - 24.130 + - 16.616 + +A closer look reveals that the computation time scales very nicely with increasing +**time steps**: .. figure:: img/stencil/omp-cpu-scaling-step.png - :align: center - -However, for larger **grid sizes** the parallelization becomes inefficient -- as the individual chunks of the grid get too large to fit into CPU cache, threads become bound by the speed of RAM reads/writes: + :align: center -.. figure:: img/stencil/omp-cpu-scaling-grid.png - :align: center +However, for larger **grid sizes** the parallelization becomes inefficient -- as the +individual chunks of the grid get too large to fit into CPU cache, threads become bound +by the speed of RAM reads/writes: +.. figure:: img/stencil/omp-cpu-scaling-grid.png + :align: center .. challenge:: Discussion: heat flow computation scaling - 1. How is heat flow computation **expected** to scale with respect to the number of time steps? - - a. Linearly - b. Quadratically - c. Exponentially - - 2. How is stencil application (grid update) **expected** to scale with respect to the size of the grid side? 
- - a. Linearly - b. Quadratically - c. Exponentially - - 3. (Optional) Do you expect GPU-accelerated computations to follow the above-mentioned trends? Why/ why not? - - .. solution:: - - 1. The answer is a: since each time-step follows the previous one and involves a similar number of operations, then the update time per step will be more or less constant. - 2. The answer is b: since stencil application is independent for every grid point, the update time will be proportional to the number of points, i.e. side * side. + 1. How is heat flow computation **expected** to scale with respect to the number of time steps? + + a. Linearly + b. Quadratically + c. Exponentially + + 2. How is stencil application (grid update) **expected** to scale with respect to the size of the grid side? + + a. Linearly + b. Quadratically + c. Exponentially + + 3. (Optional) Do you expect GPU-accelerated computations to follow the above-mentioned trends? Why/ why not? + .. solution:: + + 1. The answer is a: since each time-step follows the previous one and involves a similar number of operations, then the update time per step will be more or less constant. + 2. The answer is b: since stencil application is independent for every grid point, the update time will be proportional to the number of points, i.e. side * side. GPU parallelization: first steps -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +-------------------------------- -Let's apply several techniques presented in previous episodes to make stencil update run on GPU. +Let's apply several techniques presented in previous episodes to make stencil update run +on GPU. -OpenMP (or OpenACC) offloading requires to define a region to be executed in parallel as well as data that shall be copied over/ used in GPU memory. -Similarly, SYCL programming model offers convenient ways to define execution kernels, as well as context to run them in (called queue). +OpenMP (or OpenACC) offloading requires to define a region to be executed in parallel as +well as data that shall be copied over/ used in GPU memory. Similarly, SYCL programming +model offers convenient ways to define execution kernels, as well as context to run them +in (called queue). Changes of stencil update code for OpenMP and SYCL are shown in the tabs below: .. tabs:: - .. tab:: OpenMP (naive) + .. tab:: OpenMP (naive) - .. literalinclude:: examples/stencil/base/core-off.cpp - :language: cpp - :emphasize-lines: 25-26 - - .. tab:: SYCL (naive) + .. literalinclude:: examples/stencil/base/core-off.cpp + :language: cpp + :emphasize-lines: 25-26 - .. literalinclude:: examples/stencil/sycl/core-naive.cpp - :language: cpp - :emphasize-lines: 24-27,29,43-45 + .. tab:: SYCL (naive) + .. literalinclude:: examples/stencil/sycl/core-naive.cpp + :language: cpp + :emphasize-lines: 24-27,29,43-45 .. callout:: Loading SYCL modules on LUMI - - As SYCL is placed on top of ROCm/HIP (or CUDA) software stack, running SYCL executables may require respective modules to be loaded. On current nodes, it can be done as follows: - - .. code-block:: console - - # salloc -A project_465001310 -p small-g -N 1 -c 8 -n 1 --gpus-per-node=1 -t 1:00:00 - - module load LUMI/24.03 - module load partition/G - module load rocm/6.0.3 - module use /appl/local/csc/modulefiles - module load acpp/24.06.0 + + As SYCL is placed on top of ROCm/HIP (or CUDA) software stack, running SYCL executables may require respective modules to be loaded. On current nodes, it can be done as follows: + + .. 
code-block:: console + + # salloc -A project_465001310 -p small-g -N 1 -c 8 -n 1 --gpus-per-node=1 -t 1:00:00 + + module load LUMI/24.03 + module load partition/G + module load rocm/6.0.3 + module use /appl/local/csc/modulefiles + module load acpp/24.06.0 .. solution:: Optional: compiling the SYCL executables - As previously, you are welcome to generate your own executables: - - .. code-block:: console - - $ cd ../sycl/ - (give the following lines some time, probably a couple of min) - $ acpp -O2 -o stencil_naive core-naive.cpp io.cpp main-naive.cpp pngwriter.c setup.cpp utilities.cpp - $ acpp -O2 -o stencil core.cpp io.cpp main.cpp pngwriter.c setup.cpp utilities.cpp - - $ srun stencil_naive - $ srun stencil - - If everything works well, the output should look similar to this: - - .. code-block:: console - - $ srun stencil_naive - Average temperature, start: 59.763305 - Average temperature at end: 59.281239 - Control temperature at end: 59.281239 - Iterations took 2.086 seconds. - $ srun stencil - Average temperature, start: 59.763305 - Average temperature at end: 59.281239 - Control temperature at end: 59.281239 - Iterations took 0.052 seconds. + As previously, you are welcome to generate your own executables: + + .. code-block:: console + + $ cd ../sycl/ + (give the following lines some time, probably a couple of min) + $ acpp -O2 -o stencil_naive core-naive.cpp io.cpp main-naive.cpp pngwriter.c setup.cpp utilities.cpp + $ acpp -O2 -o stencil core.cpp io.cpp main.cpp pngwriter.c setup.cpp utilities.cpp + + $ srun stencil_naive + $ srun stencil + + If everything works well, the output should look similar to this: + .. code-block:: console + + $ srun stencil_naive + Average temperature, start: 59.763305 + Average temperature at end: 59.281239 + Control temperature at end: 59.281239 + Iterations took 2.086 seconds. + $ srun stencil + Average temperature, start: 59.763305 + Average temperature at end: 59.281239 + Control temperature at end: 59.281239 + Iterations took 0.052 seconds. .. challenge:: Exercise: naive GPU ports - Test your compiled executables ``base/stencil``, ``base/stencil_off`` and ``sycl/stencil_naive``. Try changing problem size parameters: - - - ``srun stencil_naive 2000 2000 5000`` - - Things to look for: - - - How computation times change? - - Do the results align to your expectations? - - - .. solution:: - - You might notice that the GPU-"ported" versions actually run slower than the single-CPU-core version! In fact, the scaling behavior of all three variants is similar and expected, which is a good sign; only the "computation unit cost" is different. You can compare benchmark summaries in the tabs below: + Test your compiled executables ``base/stencil``, ``base/stencil_off`` and ``sycl/stencil_naive``. Try changing problem size parameters: + + - ``srun stencil_naive 2000 2000 5000`` + + Things to look for: + + - How computation times change? + - Do the results align to your expectations? + - .. tabs:: + .. solution:: - .. tab:: Sequential + You might notice that the GPU-"ported" versions actually run slower than the single-CPU-core version! In fact, the scaling behavior of all three variants is similar and expected, which is a good sign; only the "computation unit cost" is different. You can compare benchmark summaries in the tabs below: - .. figure:: img/stencil/cpu-seq-scaling.png - :align: center + .. tabs:: - .. tab:: OpenMP (naive) + .. tab:: Sequential - .. figure:: img/stencil/omp-gpu-naive-scaling.png - :align: center + .. 
figure:: img/stencil/cpu-seq-scaling.png + :align: center - .. tab:: SYCL (naive) + .. tab:: OpenMP (naive) - .. figure:: img/stencil/omp-sycl-naive-scaling-new.png - :align: center + .. figure:: img/stencil/omp-gpu-naive-scaling.png + :align: center + .. tab:: SYCL (naive) + + .. figure:: img/stencil/omp-sycl-naive-scaling-new.png + :align: center GPU parallelization: data movement -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +---------------------------------- Why the porting approach above seems to be quite inefficient? On each step, we: -- re-allocate GPU memory, -- copy the data from CPU to GPU, -- perform the computation, +- re-allocate GPU memory, +- copy the data from CPU to GPU, +- perform the computation, - then copy the data back. -But overhead can be reduced by taking care to minimize data transfers between *host* and *device* memory: +But overhead can be reduced by taking care to minimize data transfers between *host* and +*device* memory: - allocate GPU memory once at the start of the program, - only copy the data from GPU to CPU when we need it, -- swap the GPU buffers between timesteps, like we do with CPU buffers. (OpenMP does this automatically.) +- swap the GPU buffers between timesteps, like we do with CPU buffers. (OpenMP does this + automatically.) -Changes of stencil update code as well as the main program are shown in tabs below. +Changes of stencil update code as well as the main program are shown in tabs below. .. tabs:: - .. tab:: OpenMP + .. tab:: OpenMP - .. literalinclude:: examples/stencil/base/core-data.cpp - :language: cpp - :emphasize-lines: 25,40-75 - - .. tab:: SYCL + .. literalinclude:: examples/stencil/base/core-data.cpp + :language: cpp + :emphasize-lines: 25,40-75 - .. literalinclude:: examples/stencil/sycl/core.cpp - :language: cpp - :emphasize-lines: 13-14,25,40-50 + .. tab:: SYCL - .. tab:: Python + .. literalinclude:: examples/stencil/sycl/core.cpp + :language: cpp + :emphasize-lines: 13-14,25,40-50 - .. literalinclude:: examples/stencil/python/core_cuda.py - :language: py - :lines: 6-34 - :emphasize-lines: 14-16,18 + .. tab:: Python - .. tab:: main() (SYCL) + .. literalinclude:: examples/stencil/python/core_cuda.py + :language: py + :lines: 6-34 + :emphasize-lines: 14-16,18 - .. literalinclude:: examples/stencil/sycl/main.cpp - :language: cpp - :emphasize-lines: 38-39,44-45,51,56,59,75-77 + .. tab:: main() (SYCL) + .. literalinclude:: examples/stencil/sycl/main.cpp + :language: cpp + :emphasize-lines: 38-39,44-45,51,56,59,75-77 .. challenge:: Exercise: updated GPU ports - Test your compiled executables ``base/stencil_data`` and ``sycl/stencil``. Try changing problem size parameters: - - - ``srun stencil 2000 2000 5000`` - - Things to look for: - - - How computation times change this time around? - - What largest grid and/or longest propagation time can you get in 10 s on your machine? - - - .. solution:: - - .. tabs:: - - .. tab:: OpenMP data mapping - - Using GPU offloading with mapped device data, it is possible to achieve performance gains compared to thread-parallel version for larger grid sizes, due to the fact that the latter version becomes essentially RAM-bound, but the former does not. - - .. figure:: img/stencil/omp-cpu-vs-gpu.png - :align: center - - .. tab:: SYCL device buffers - - Below you can find the summary graphs for step- and grid- scaling of the stencil update task. 
Because of the more explicit programming approach, SYCL GPU port is much faster than OpenMP-offloaded version, comparable with thread-parallel CPU version running on all cores of a single node. - - .. figure:: img/stencil/summary-scaling-step-new.png - :align: center - - .. figure:: img/stencil/summary-scaling-grid-new.png - :align: center + Test your compiled executables ``base/stencil_data`` and ``sycl/stencil``. Try changing problem size parameters: + + - ``srun stencil 2000 2000 5000`` + + Things to look for: + + - How computation times change this time around? + - What largest grid and/or longest propagation time can you get in 10 s on your machine? + + + .. solution:: + + .. tabs:: + + .. tab:: OpenMP data mapping + + Using GPU offloading with mapped device data, it is possible to achieve performance gains compared to thread-parallel version for larger grid sizes, due to the fact that the latter version becomes essentially RAM-bound, but the former does not. + .. figure:: img/stencil/omp-cpu-vs-gpu.png + :align: center + + .. tab:: SYCL device buffers + + Below you can find the summary graphs for step- and grid- scaling of the stencil update task. Because of the more explicit programming approach, SYCL GPU port is much faster than OpenMP-offloaded version, comparable with thread-parallel CPU version running on all cores of a single node. + + .. figure:: img/stencil/summary-scaling-step-new.png + :align: center + + .. figure:: img/stencil/summary-scaling-grid-new.png + :align: center Python: JIT and GPU acceleration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +-------------------------------- -As mentioned `previously `_, Numba package allows developers to just-in-time (JIT) compile Python code to run fast on CPUs, but can also be used for JIT compiling for (NVIDIA) GPUs. JIT seems to work well on loop-based, computationally heavy functions, so trying it out is a nice choice for initial source version: +As mentioned `previously +`_, Numba package +allows developers to just-in-time (JIT) compile Python code to run fast on CPUs, but can +also be used for JIT compiling for (NVIDIA) GPUs. JIT seems to work well on loop-based, +computationally heavy functions, so trying it out is a nice choice for initial source +version: .. tabs:: - .. tab:: Stencil update + .. tab:: Stencil update - .. literalinclude:: examples/stencil/python/core.py - :language: py - :lines: 6-29 - :emphasize-lines: 17 - - .. tab:: Data generation + .. literalinclude:: examples/stencil/python/core.py + :language: py + :lines: 6-29 + :emphasize-lines: 17 - .. literalinclude:: examples/stencil/python/heat.py - :language: py - :lines: 57-78 - :emphasize-lines: 1 + .. tab:: Data generation + .. literalinclude:: examples/stencil/python/heat.py + :language: py + :lines: 57-78 + :emphasize-lines: 1 -The alternative approach would be to rewrite stencil update code in NumPy style, exploiting loop vectorization. +The alternative approach would be to rewrite stencil update code in NumPy style, +exploiting loop vectorization. .. callout:: Trying out Python examples - You can run provided code examples on Google Colab using instructions provided in the `Setup `_, your local machine, or LUMI node (non-GPU variants). On LUMI, you can set up Python distribution as following: - - .. code-block:: console + You can run provided code examples on Google Colab using instructions provided in the `Setup `_, your local machine, or LUMI node (non-GPU variants). 
On LUMI, you can set up Python distribution as following: - $ module load cray-python/3.9.13.1 - (install needed dependencies locally) - $ pip3 install --user numba matplotlib - $ cd ../python/ - (make sure you have active allocation) - $ srun python3 main.py + .. code-block:: console + $ module load cray-python/3.9.13.1 + (install needed dependencies locally) + $ pip3 install --user numba matplotlib + $ cd ../python/ + (make sure you have active allocation) + $ srun python3 main.py Short summary of a typical Colab run is provided below: .. list-table:: Run times of Numba JIT-enabled Python program, s - :widths: 25 25 25 25 25 - :header-rows: 1 - - * - Job size - - JIT (LUMI) - - JIT (Colab) - - Job size - - no JIT (Colab) - * - S:2000 T:500 - - 1.648 - - 8.495 - - S:200 T:50 - - 5.318 - * - S:2000 T:200 - - 0.787 - - 3.524 - - S:200 T:20 - - 1.859 - * - S:1000 T:500 - - 0.547 - - 2.230 - - S:100 T:50 - - 1.156 - -Numba's ``@vectorize`` and ``@guvectorize`` decorators offer an interface to create CPU- (or GPU-) accelerated *Python* functions without explicit implementation details. However, such functions become increasingly complicated to write (and optimize by the compiler) with increasing complexity of the computations within. - -Numba also offers direct CUDA-based kernel programming, which can be the best choice for those already familiar with CUDA. Example for stencil update written in Numba CUDA is shown in the `data movement section `_, tab "Python". In this case, data transfer functions ``devdata = cuda.to_device(data)`` and ``devdata.copy_to_host(data)`` (see ``main_cuda.py``) are already provided by Numba package. - + :widths: 25 25 25 25 25 + :header-rows: 1 + + - - Job size + - JIT (LUMI) + - JIT (Colab) + - Job size + - no JIT (Colab) + - - S:2000 T:500 + - 1.648 + - 8.495 + - S:200 T:50 + - 5.318 + - - S:2000 T:200 + - 0.787 + - 3.524 + - S:200 T:20 + - 1.859 + - - S:1000 T:500 + - 0.547 + - 2.230 + - S:100 T:50 + - 1.156 + +Numba's ``@vectorize`` and ``@guvectorize`` decorators offer an interface to create CPU- +(or GPU-) accelerated *Python* functions without explicit implementation details. +However, such functions become increasingly complicated to write (and optimize by the +compiler) with increasing complexity of the computations within. + +Numba also offers direct CUDA-based kernel programming, which can be the best choice for +those already familiar with CUDA. Example for stencil update written in Numba CUDA is +shown in the `data movement section +`_, +tab "Python". In this case, data transfer functions ``devdata = cuda.to_device(data)`` +and ``devdata.copy_to_host(data)`` (see ``main_cuda.py``) are already provided by Numba +package. .. challenge:: Exercise: CUDA acceleration in Python - Using Google Colab (or your own machine), run provided Numba-CUDA Python program. Try changing problem size parameters: - - - ``args.rows, args.cols, args.nsteps = 2000, 2000, 5000`` for notebooks, - - [``srun``] ``python3 main.py 2000 2000 5000`` for command line. - - Things to look for: - - - How computation times change? - - Do you get better performance than from JIT-compiled CPU version? How far can you push the problem size? - - Are you able to monitor the GPU usage? - - - .. solution:: - - Some numbers from Colab: - - .. 
list-table:: Run times of Numba CUDA Python program, s - :widths: 25 25 25 25 - :header-rows: 1 - - * - Job size - - JIT (LUMI) - - JIT (Colab) - - CUDA (Colab) - * - S:2000 T:500 - - 1.648 - - 8.495 - - 1.079 - * - S:2000 T:2000 - - 6.133 - - 36.61 - - 3.931 - * - S:5000 T:500 - - 9.478 - - 57.19 - - 6.448 + Using Google Colab (or your own machine), run provided Numba-CUDA Python program. Try changing problem size parameters: + + - ``args.rows, args.cols, args.nsteps = 2000, 2000, 5000`` for notebooks, + - [``srun``] ``python3 main.py 2000 2000 5000`` for command line. + + Things to look for: + + - How computation times change? + - Do you get better performance than from JIT-compiled CPU version? How far can you push the problem size? + - Are you able to monitor the GPU usage? + + .. solution:: + + Some numbers from Colab: + + .. list-table:: Run times of Numba CUDA Python program, s + :widths: 25 25 25 25 + :header-rows: 1 + + * - Job size + - JIT (LUMI) + - JIT (Colab) + - CUDA (Colab) + * - S:2000 T:500 + - 1.648 + - 8.495 + - 1.079 + * - S:2000 T:2000 + - 6.133 + - 36.61 + - 3.931 + * - S:5000 T:500 + - 9.478 + - 57.19 + - 6.448 Julia GPU acceleration -~~~~~~~~~~~~~~~~~~~~~~ +---------------------- -A Julia version of the stencil example above can be found below (a simplified version of the HeatEquation module at https://github.com/ENCCS/HeatEquation.jl). -The source files are also available in the `content/examples/stencil/julia `_ directory of this repository. +A Julia version of the stencil example above can be found below (a simplified version of +the HeatEquation module at https://github.com/ENCCS/HeatEquation.jl). The source files +are also available in the `content/examples/stencil/julia +`_ +directory of this repository. To run the example on LUMI CPU partition, type: .. code-block:: console - $ # interactive CPU node - $ srun --account=project_465001310 --partition=standard --nodes=1 --cpus-per-task=32 --ntasks-per-node=1 --time=01:00:00 --pty bash - $ # load Julia env - $ module purge - $ module use /appl/local/csc/modulefiles - $ module load julia/1.9.0 - $ # in directory with Project.toml and source files, instantiate an environment to install packages - $ julia --project -e "using Pkg ; Pkg.instantiate()" - $ # finally run - $ julia --project main.jl + $ # interactive CPU node + $ srun --account=project_465001310 --partition=standard --nodes=1 --cpus-per-task=32 --ntasks-per-node=1 --time=01:00:00 --pty bash + $ # load Julia env + $ module purge + $ module use /appl/local/csc/modulefiles + $ module load julia/1.9.0 + $ # in directory with Project.toml and source files, instantiate an environment to install packages + $ julia --project -e "using Pkg ; Pkg.instantiate()" + $ # finally run + $ julia --project main.jl -To run on the GPU partition, use instead the ``srun`` command +To run on the GPU partition, use instead the ``srun`` command .. code-block:: console - $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash - + $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash .. callout:: Optional dependency - Note that the ``Plots.jl`` dependency is commented out in ``main.jl`` and ``Project.toml``. This saves ~2 minute precompilation time when you first instantiate the Julia environment. 
To generate plots, just uncomment the commented ``Plots.jl`` dependency in ``Project.toml``, instantiate again, and import and use ``Plots`` in ``main.jl``. + Note that the ``Plots.jl`` dependency is commented out in ``main.jl`` and ``Project.toml``. This saves ~2 minute precompilation time when you first instantiate the Julia environment. To generate plots, just uncomment the commented ``Plots.jl`` dependency in ``Project.toml``, instantiate again, and import and use ``Plots`` in ``main.jl``. .. tabs:: - .. tab:: main.jl - - .. literalinclude:: examples/stencil/julia/main.jl - :language: julia + .. tab:: main.jl - .. tab:: core.jl + .. literalinclude:: examples/stencil/julia/main.jl + :language: julia - .. literalinclude:: examples/stencil/julia/core.jl - :language: julia + .. tab:: core.jl - .. tab:: heat.jl + .. literalinclude:: examples/stencil/julia/core.jl + :language: julia - .. literalinclude:: examples/stencil/julia/heat.jl - :language: julia + .. tab:: heat.jl - .. tab:: Project.toml + .. literalinclude:: examples/stencil/julia/heat.jl + :language: julia - .. literalinclude:: examples/stencil/julia/Project.toml - :language: julia + .. tab:: Project.toml + .. literalinclude:: examples/stencil/julia/Project.toml + :language: julia .. challenge:: Exercise: Julia port to GPUs - Carefully inspect all Julia source files and consider the following questions: + Carefully inspect all Julia source files and consider the following questions: - 1. Which functions should be ported to run on GPU? - 2. Look at the :meth:`initialize!` function and how it uses the ``arraytype`` argument. This could be done more compactly and elegantly, but this solution solves scalar indexing errors. What are scalar indexing errors? - 3. Try to start sketching GPU-ported versions of the key functions. - 4. When you have a version running on a GPU (your own or the solution provided below), try benchmarking it by adding ``@btime`` in front of :meth:`simulate!` in ``main.jl``. Benchmark also the CPU version, and compare. + 1. Which functions should be ported to run on GPU? + 2. Look at the :meth:`initialize!` function and how it uses the ``arraytype`` argument. This could be done more compactly and elegantly, but this solution solves scalar indexing errors. What are scalar indexing errors? + 3. Try to start sketching GPU-ported versions of the key functions. + 4. When you have a version running on a GPU (your own or the solution provided below), try benchmarking it by adding ``@btime`` in front of :meth:`simulate!` in ``main.jl``. Benchmark also the CPU version, and compare. - .. solution:: Hints + .. solution:: Hints - - create a new function :meth:`evolve_gpu!` which contains the GPU kernelized version of :meth:`evolve!` - - in the loop over timesteps in :meth:`simulate!`, you will need a conditional like ``if typeof(curr.data) <: ROCArray`` to call your GPU-ported function - - you cannot pass the struct ``Field`` to the kernel. You will instead need to directly pass the array ``Field.data``. This also necessitates passing in other variables like ``curr.dx^2``, etc. + - create a new function :meth:`evolve_gpu!` which contains the GPU kernelized version of :meth:`evolve!` + - in the loop over timesteps in :meth:`simulate!`, you will need a conditional like ``if typeof(curr.data) <: ROCArray`` to call your GPU-ported function + - you cannot pass the struct ``Field`` to the kernel. You will instead need to directly pass the array ``Field.data``. 
This also necessitates passing in other variables like ``curr.dx^2``, etc. - .. solution:: More hints + .. solution:: More hints - - since the data is two-dimensional, you'll need ``i = (blockIdx().x - 1) * blockDim().x + threadIdx().x`` and ``j = (blockIdx().y - 1) * blockDim().y + threadIdx().y`` - - to not overindex the 2D array, you can use a conditional like ``if i > 1 && j > 1 && i < nx+2 && j < ny+2`` - - when calling the kernel, you can set the number of threads and blocks like ``xthreads = ythreads = 16`` and ``xblocks, yblocks = cld(curr.nx, xthreads), cld(curr.ny, ythreads)``, and then call it with, e.g., ``@roc threads=(xthreads, ythreads) blocks = (xblocks, yblocks) evolve_rocm!(curr.data, prev.data, curr.dx^2, curr.dy^2, nx, ny, a, dt)``. + - since the data is two-dimensional, you'll need ``i = (blockIdx().x - 1) * blockDim().x + threadIdx().x`` and ``j = (blockIdx().y - 1) * blockDim().y + threadIdx().y`` + - to not overindex the 2D array, you can use a conditional like ``if i > 1 && j > 1 && i < nx+2 && j < ny+2`` + - when calling the kernel, you can set the number of threads and blocks like ``xthreads = ythreads = 16`` and ``xblocks, yblocks = cld(curr.nx, xthreads), cld(curr.ny, ythreads)``, and then call it with, e.g., ``@roc threads=(xthreads, ythreads) blocks = (xblocks, yblocks) evolve_rocm!(curr.data, prev.data, curr.dx^2, curr.dy^2, nx, ny, a, dt)``. - .. solution:: + .. solution:: - 1. The :meth:`evolve!` and :meth:`simulate!` functions need to be ported. The ``main.jl`` file also needs to be updated to work with GPU arrays. - 2. "Scalar indexing" is where you iterate over a GPU array, which would be excruciatingly slow and is indeed only allowed in interactive REPL sessions. Without the if-statements in the :meth:`initialize!` function, the :meth:`generate_field!` method would be doing disallowed scalar indexing if you were running on a GPU. - 3. The GPU-ported version is found below. Try it out on both CPU and GPU and observe the speedup. Play around with array size to see if the speedup is affected. You can also play around with the ``xthreads`` and ``ythreads`` variables to see if it changes anything. + 1. The :meth:`evolve!` and :meth:`simulate!` functions need to be ported. The ``main.jl`` file also needs to be updated to work with GPU arrays. + 2. "Scalar indexing" is where you iterate over a GPU array, which would be excruciatingly slow and is indeed only allowed in interactive REPL sessions. Without the if-statements in the :meth:`initialize!` function, the :meth:`generate_field!` method would be doing disallowed scalar indexing if you were running on a GPU. + 3. The GPU-ported version is found below. Try it out on both CPU and GPU and observe the speedup. Play around with array size to see if the speedup is affected. You can also play around with the ``xthreads`` and ``ythreads`` variables to see if it changes anything. - .. tabs:: + .. tabs:: - .. tab:: main_gpu.jl + .. tab:: main_gpu.jl - .. literalinclude:: examples/stencil/julia/main_gpu.jl - :language: julia + .. literalinclude:: examples/stencil/julia/main_gpu.jl + :language: julia - .. tab:: core_gpu.jl - - .. literalinclude:: examples/stencil/julia/core_gpu.jl - :language: julia + .. tab:: core_gpu.jl + .. literalinclude:: examples/stencil/julia/core_gpu.jl + :language: julia See also -~~~~~~~~ +-------- -This section leans heavily on source code and material created for several other computing workshops -by `ENCCS `_ and `CSC `_ and adapted for the purposes of this lesson. 
-If you want to know more about specific programming models / framework, definitely check these out! +This section leans heavily on source code and material created for several other +computing workshops by `ENCCS `_ and `CSC `_ and +adapted for the purposes of this lesson. If you want to know more about specific +programming models / framework, definitely check these out! - `OpenMP for GPU offloading `_ - `Heterogeneous programming with SYCL `_ -- `Educational implementation of heat flow example (incl. MPI-aware CUDA) `_ - - - +- `Educational implementation of heat flow example (incl. MPI-aware CUDA) + `_ diff --git a/content/2-gpu-ecosystem.rst b/content/2-gpu-ecosystem.rst index 906e169c..487bd1bf 100644 --- a/content/2-gpu-ecosystem.rst +++ b/content/2-gpu-ecosystem.rst @@ -1,25 +1,22 @@ .. _gpu-ecosystem: - The GPU hardware and software ecosystem ======================================= - .. questions:: - - What are the differences between GPUs and CPUs? - - What GPU software stacks are available? What do they provide? + - What are the differences between GPUs and CPUs? + - What GPU software stacks are available? What do they provide? .. objectives:: - - Understand the fundamental differences between GPUs and CPUs - - Explore the major GPU software suites available, such as CUDA, ROCm, and oneAPI, and gain a basic understanding of them + - Understand the fundamental differences between GPUs and CPUs + - Explore the major GPU software suites available, such as CUDA, ROCm, and oneAPI, and gain a basic understanding of them .. instructor-note:: - - 20 min teaching - - 0 min exercises - + - 20 min teaching + - 0 min exercises Overview of GPU hardware ------------------------ @@ -27,278 +24,371 @@ Overview of GPU hardware .. figure:: img/hardware/CPUAndGPU.png :align: center - A comparison of the CPU and GPU architecture. - CPU (left) has complex core structure and pack several cores on a single chip. - GPU cores are very simple in comparison, they also share data and control between each other. - This allows to pack more cores on a single chip, thus achieving very high compute density. + A comparison of the CPU and GPU architecture. CPU (left) has complex core structure + and pack several cores on a single chip. GPU cores are very simple in comparison, + they also share data and control between each other. This allows to pack more cores + on a single chip, thus achieving very high compute density. .. admonition:: In short - :class: dropdown - - - Accelerators offer high performance due to their scalability and high density of compute elements. - - They have separate circuit boards connected to CPUs via PCIe bus, with their own memory. - - CPUs copy data from their own memory to the GPU memory, execute the program, and copy the results back. - - GPUs run thousands of threads simultaneously, quickly switching between them to hide memory operations. - - Effective data management and access pattern is critical on the GPU to avoid running out of memory. - - -Accelerators are a separate main circuit board with the processor, memory, power management, etc. -It is connected to the motherboard with CPUs via PCIe bus. -Having its own memory means that the data has to be copied to and from it (not neceseraly true anymore). -CPU acts as a main processor, controlling the execution workflow. -It copies the data from its own memory to the GPU memory, executes the program and copies the results back. 
-GPUs runs tens of thousands of threads simultaneously on thousands of cores and does not do much of the data management.
-With many cores trying to access the memory simultaneously and with little cache available, the accelerator can run out of memory very quickly.
-This makes the data management and its access pattern is essential on the GPU.
-Accelerators like to be overloaded with the number of threads, because they can switch between threads very quickly.
-This allows to hide the memory operations: while some threads wait, others can compute.
-
-
-A very important feature of the accelerators is their scalability.
-Computational cores on accelerators are usually grouped into multiprocessors.
-The multiprocessors share the data and logical elements.
-This allows to achieve a very high density of compute elements on a GPU.
-This also allows the scaling: more multiprocessors means more raw performance and this is very easy to achieve with more transistors available.
-
+    - Accelerators offer high performance due to their scalability and high density of
+      compute elements.
+    - They have separate circuit boards connected to CPUs via PCIe bus, with their own
+      memory.
+    - CPUs copy data from their own memory to the GPU memory, execute the program, and
+      copy the results back.
+    - GPUs run thousands of threads simultaneously, quickly switching between them to
+      hide memory operations.
+    - Effective data management and access pattern is critical on the GPU to avoid
+      running out of memory.
+
+An accelerator is a separate main circuit board with the processor, memory, power
+management, etc. It is connected to the motherboard with CPUs via PCIe bus. Having its
+own memory means that the data has to be copied to and from it (not necessarily true
+anymore). The CPU acts as the main processor, controlling the execution workflow. It
+copies the data from its own memory to the GPU memory, executes the program, and copies
+the results back. GPUs run tens of thousands of threads simultaneously on thousands of
+cores and do not do much of the data management. With many cores trying to access the
+memory simultaneously and with little cache available, the accelerator can run out of
+memory very quickly. This makes data management and its access pattern essential on the
+GPU. Accelerators like to be overloaded with threads, because they can switch between
+threads very quickly. This allows hiding the memory operations: while some threads wait,
+others can compute.
+
+A very important feature of the accelerators is their scalability. Computational cores
+on accelerators are usually grouped into multiprocessors. The multiprocessors share the
+data and logical elements. This allows achieving a very high density of compute elements
+on a GPU. This also allows for scaling: more multiprocessors mean more raw performance,
+and this is very easy to achieve with more transistors available.

How do GPUs differ from CPUs?
-----------------------------

-CPUs and GPUs were designed with different goals in mind. While the CPU
-is designed to excel at executing a sequence of operations, called a thread,
-as fast as possible and can execute a few tens of these threads in parallel,
-the GPU is designed to excel at executing many thousands of them in parallel.
-GPUs were initially developed for highly-parallel task of graphic processing
-and therefore designed such that more transistors are devoted to data processing
-rather than data caching and flow control. 
More transistors dedicated to -data processing is beneficial for highly parallel computations; the GPU can -hide memory access latencies with computation, instead of relying on large data caches -and complex flow control to avoid long memory access latencies, -both of which are expensive in terms of transistors. - - -.. list-table:: - :widths: 100 100 - :header-rows: 1 - - * - CPU - - GPU - * - General purpose - - Highly specialized for parallelism - * - Good for serial processing - - Good for parallel processing - * - Great for task parallelism - - Great for data parallelism - * - Low latency per thread - - High-throughput - * - Large area dedicated cache and control - - Hundreds of floating-point execution units - - +CPUs and GPUs were designed with different goals in mind. While the CPU is designed to +excel at executing a sequence of operations, called a thread, as fast as possible and +can execute a few tens of these threads in parallel, the GPU is designed to excel at +executing many thousands of them in parallel. GPUs were initially developed for +highly-parallel task of graphic processing and therefore designed such that more +transistors are devoted to data processing rather than data caching and flow control. +More transistors dedicated to data processing is beneficial for highly parallel +computations; the GPU can hide memory access latencies with computation, instead of +relying on large data caches and complex flow control to avoid long memory access +latencies, both of which are expensive in terms of transistors. + +.. list-table:: + :widths: 100 100 + :header-rows: 1 + + - - CPU + - GPU + - - General purpose + - Highly specialized for parallelism + - - Good for serial processing + - Good for parallel processing + - - Great for task parallelism + - Great for data parallelism + - - Low latency per thread + - High-throughput + - - Large area dedicated cache and control + - Hundreds of floating-point execution units GPU platforms ------------- -GPUs come together with software stacks or APIs that work in conjunction with the hardware and give a standard way for the software to interact with the GPU hardware. They are used by software developers to write code that can take advantage of the parallel processing power of the GPU, and they provide a standard way for software to interact with the GPU hardware. Typically, they provide access to low-level functionality, such as memory management, data transfer between the CPU and the GPU, and the scheduling and execution of parallel processing tasks on the GPU. They may also provide higher level functions and libraries optimized for specific HPC workloads, like linear algebra or fast Fourier transforms. Finally, in order to facilitate the developers to optimize and write correct codes, debugging and profiling tools are also included. - -*NVIDIA*, *AMD*, and *Intel* are the major companies which design and produces GPUs for HPC providing each its own suite **CUDA**, **ROCm**, and respectively **oneAPI**. This way they can offer optimization, differentiation (offering unique features tailored to their devices), vendor lock-in, licensing, and royalty fees, which can result in better performance, profitability, and customer loyalty. -There are also cross-platform APIs such **DirectCompute** (only for Windows operating system), **OpenCL**, and **SYCL**. +GPUs come together with software stacks or APIs that work in conjunction with the +hardware and give a standard way for the software to interact with the GPU hardware. 
+They are used by software developers to write code that can take advantage of the +parallel processing power of the GPU, and they provide a standard way for software to +interact with the GPU hardware. Typically, they provide access to low-level +functionality, such as memory management, data transfer between the CPU and the GPU, and +the scheduling and execution of parallel processing tasks on the GPU. They may also +provide higher level functions and libraries optimized for specific HPC workloads, like +linear algebra or fast Fourier transforms. Finally, in order to facilitate the +developers to optimize and write correct codes, debugging and profiling tools are also +included. + +*NVIDIA*, *AMD*, and *Intel* are the major companies which design and produces GPUs for +HPC providing each its own suite **CUDA**, **ROCm**, and respectively **oneAPI**. This +way they can offer optimization, differentiation (offering unique features tailored to +their devices), vendor lock-in, licensing, and royalty fees, which can result in better +performance, profitability, and customer loyalty. There are also cross-platform APIs +such **DirectCompute** (only for Windows operating system), **OpenCL**, and **SYCL**. .. admonition:: CUDA - In short - :class: dropdown - - - CUDA: NVIDIA's parallel computing platform - - Components: CUDA Toolkit & CUDA driver - - Supports C, C++, and Fortran languages - - CUDA API Libraries: cuBLAS, cuFFT, cuRAND, cuSPARSE - - Accelerate complex computations on GPUs - - Compilers: nvcc, nvc, nvc++, nvfortran - - Support GPU and multicore CPU programming - - Compatible with OpenACC and OpenMP - - Debugging tools: cuda-gdb, compute-sanitizer - - Debug GPU and CPU code simultaneously - - Identify memory access issues - - Performance analysis tools: NVIDIA Nsight Systems, NVIDIA Nsight Compute - - Analyze system-wide and kernel-level performance - - Optimize CPU and GPU usage, memory bandwidth, instruction throughput - - Comprehensive CUDA ecosystem with extensive tools and features + + - CUDA: NVIDIA's parallel computing platform + - Components: CUDA Toolkit & CUDA driver + - Supports C, C++, and Fortran languages + - CUDA API Libraries: cuBLAS, cuFFT, cuRAND, cuSPARSE + - Accelerate complex computations on GPUs + - Compilers: nvcc, nvc, nvc++, nvfortran + - Support GPU and multicore CPU programming + - Compatible with OpenACC and OpenMP + - Debugging tools: cuda-gdb, compute-sanitizer + - Debug GPU and CPU code simultaneously + - Identify memory access issues + - Performance analysis tools: NVIDIA Nsight Systems, NVIDIA Nsight Compute + - Analyze system-wide and kernel-level performance + - Optimize CPU and GPU usage, memory bandwidth, instruction throughput + - Comprehensive CUDA ecosystem with extensive tools and features .. 
admonition:: ROCm - In short - :class: dropdown - - - ROCm: Open software platform for AMD accelerators - - Built for open portability across multiple vendors and architectures - - Offers libraries, compilers, and development tools for AMD GPUs - - Supports C, C++, and Fortran languages - - Support GPU and multicore CPU programming - - Debugging: ``roc-gdb`` command line tool - - Facilitates debugging of GPU programs - - Performance analysis: ``rocprof`` and ``roctracer`` tools - - Analyze and optimize program performance - - Supports various heterogeneous programming models such as **HIP**, **OpenMP**, and **OpenCL** - - Heterogeneous-Computing Interface for Portability (HIP) - - Enables source portability for NVIDIA and AMD platforms, Intel in plan - - Provides ``hipcc`` compiler driver and runtime libraries - - Libraries: Prefixed with ``roc`` for AMD platforms - - Can be called directly from HIP - - ``hip``-prefixed wrappers ensure portability with no performance cost + + - ROCm: Open software platform for AMD accelerators + - Built for open portability across multiple vendors and architectures + - Offers libraries, compilers, and development tools for AMD GPUs + - Supports C, C++, and Fortran languages + - Support GPU and multicore CPU programming + - Debugging: ``roc-gdb`` command line tool + - Facilitates debugging of GPU programs + - Performance analysis: ``rocprof`` and ``roctracer`` tools + - Analyze and optimize program performance + - Supports various heterogeneous programming models such as **HIP**, **OpenMP**, and + **OpenCL** + - Heterogeneous-Computing Interface for Portability (HIP) + - Enables source portability for NVIDIA and AMD platforms, Intel in plan + - Provides ``hipcc`` compiler driver and runtime libraries + - Libraries: Prefixed with ``roc`` for AMD platforms + - Can be called directly from HIP + - ``hip``-prefixed wrappers ensure portability with no performance cost .. 
admonition:: oneAPI - In short - :class: dropdown - - - Intel oneAPI: Unified software toolkit for optimizing and deploying applications across various architectures - - Supports CPUs, GPUs, and FPGAs - - Enables code reusability and performance portability - - Intel oneAPI Base Toolkit: Core set of tools and libraries for high-performance, data-centric applications - - Includes C++ compiler with SYCL support - - Features Collective Communications Library, Data Analytics Library, Deep Neural Networks Library, and more - - Additional toolkits: Intel oneAPI HPC Toolkit - - Contains compilers, debugging tools, MPI library, and performance analysis tool - - Multiple programming models and languages supported: - - OpenMP, Classic Fortran, C++, SYCL - - Unless custom Intel libraries are used, the code is portable to other OpenMP and SYCL frameworks - - DPC++ Compiler: Supports Intel, NVIDIA, and AMD GPUs - - Targets Intel GPUs using oneAPI Level Zero interface - - Added support for NVIDIA GPUs with CUDA and AMD GPUs with ROCm - - Debugging and performance analysis tools: Intel Adviser, Intel Vtune Profiler, Cluster Checker, Inspector, Intel Trace Analyzer and Collector, Intel Distribution for GDB - - Comprehensive and unified approach to heterogeneous computing - - Abstracts complexities and provides consistent programming interface - - Promotes code reusability, productivity, and performance portability + - Intel oneAPI: Unified software toolkit for optimizing and deploying applications across various architectures + - Supports CPUs, GPUs, and FPGAs + - Enables code reusability and performance portability + - Intel oneAPI Base Toolkit: Core set of tools and libraries for high-performance, data-centric applications + - Includes C++ compiler with SYCL support + - Features Collective Communications Library, Data Analytics Library, Deep + Neural Networks Library, and more + - Additional toolkits: Intel oneAPI HPC Toolkit + - Contains compilers, debugging tools, MPI library, and performance analysis + tool + - Multiple programming models and languages supported: + - OpenMP, Classic Fortran, C++, SYCL + - Unless custom Intel libraries are used, the code is portable to other OpenMP + and SYCL frameworks + - DPC++ Compiler: Supports Intel, NVIDIA, and AMD GPUs + - Targets Intel GPUs using oneAPI Level Zero interface + - Added support for NVIDIA GPUs with CUDA and AMD GPUs with ROCm + - Debugging and performance analysis tools: Intel Adviser, Intel Vtune Profiler, + Cluster Checker, Inspector, Intel Trace Analyzer and Collector, Intel Distribution + for GDB + - Comprehensive and unified approach to heterogeneous computing + - Abstracts complexities and provides consistent programming interface + - Promotes code reusability, productivity, and performance portability CUDA -^^^^ - -**Compute Unified Device Architecture** is the parallel computing platform from NVIDIA. The CUDA API provides a comprehensive set of functions and tools for developing high-performance applications that run on NVIDIA GPUs. It consists of two main components: the CUDA Toolkit and the CUDA driver. The toolkit provides a set of libraries, compilers, and development tools for programming and optimizing CUDA applications, while the driver is responsible for communication between the host CPU and the device GPU. CUDA is designed to work with programming languages such as C, C++, and Fortran. 
- -CUDA API provides many highly optimize libraries such as: **cuBLAS** (for linear algebra operations, such a dense matrix multiplication), **cuFFT** (for performing fast Fourier transforms), **cuRAND** (for generating pseudo-random numbers), **cuSPARSE** (for sparse matrices operations). Using these libraries, developers can quickly and easily accelerate complex computations on NVIDIA GPUs without having to write low-level GPU code themselves. - -There are several compilers that can be used for developing and executing code on NVIDIA GPUs: **nvcc**. The latest versions are based on the widely used LLVM (low level virtual machine) open source compiler infrastructure. nvcc produces optimized code for NVIDIA GPUs and drives a supported host compiler for AMD, Intel, OpenPOWER, and Arm CPUs. - -In addition to this are provided **nvc** (C11 compiler), **nvc++** (C++17 compiler), and **nvfortran** (ISO Fortran 2003 compiler). These compilers can as well create code for execution on the NVIDIA GPUs, and also support GPU and multicore CPU programming with parallel language features, OpeanACC and OpenMP. - - -When programming mistakes are inevitable they have to be fixed as soon as possible. The CUDA toolkit includes the command line tool **cuda-gdb** which can be used to find errors in the code. It is an extension to GDB, the GNU Project debugger. The existing GDB debugging features are inherently present for debugging the host code, and additional features have been provided to support debugging CUDA device code, allowing simultaneous debugging of both GPU and CPU code within the same application. The tool provides developers with a mechanism for debugging CUDA applications running on actual hardware. This enables developers to debug applications without the potential variations introduced by simulation and emulation environments. - -In addition to this the command line tool **compute-sanitizer** can be used to look exclusively for memory access problems: unallocated buffers, out of bounds accesses, race conditions, and uninitialized variables. - -Finally, in order to utilize the GPUs at maximum some performance analysis tools. NVIDIA provides NVIDIA Nsight Systems and NVIDIA Nsight Compute tools for helping the developers to optimize their applications. The former, NVIDIA Nsight Systems, is a system-wide performance analysis tool that provides detailed metrics on both CPU and GPU usage, memory bandwidth, and other system-level metrics. The latter, NVIDIA Nsight Compute, is a kernel-level performance analysis tool that allows developers to analyze the performance of individual CUDA kernels. It provides detailed metrics on kernel execution, including memory usage, instruction throughput, and occupancy. These tools have graphical which can be used for all steps of the performance analysis, however on supercomputers it is recommended to use the command line interface for collecting the information needed and then visualize and analyse the results using the graphical interface on personal computers. - -Apart from what was presented above there are many others tools and features provided by NVIDIA. The CUDA ecosystem is very well developed. - +~~~~ + +**Compute Unified Device Architecture** is the parallel computing platform from NVIDIA. +The CUDA API provides a comprehensive set of functions and tools for developing +high-performance applications that run on NVIDIA GPUs. It consists of two main +components: the CUDA Toolkit and the CUDA driver. 
The toolkit provides a set of
+libraries, compilers, and development tools for programming and optimizing CUDA
+applications, while the driver is responsible for communication between the host CPU and
+the device GPU. CUDA is designed to work with programming languages such as C, C++, and
+Fortran.
+
+The CUDA API provides many highly optimized libraries, such as: **cuBLAS** (for linear
+algebra operations, such as dense matrix multiplication), **cuFFT** (for performing fast
+Fourier transforms), **cuRAND** (for generating pseudo-random numbers), and **cuSPARSE**
+(for sparse matrix operations). Using these libraries, developers can quickly and easily
+accelerate complex computations on NVIDIA GPUs without having to write low-level GPU
+code themselves.
+
+Several compilers can be used for developing and executing code on NVIDIA GPUs, the main
+one being **nvcc**. The latest versions are based on the widely used LLVM (low level
+virtual machine) open source compiler infrastructure. nvcc produces optimized code for
+NVIDIA GPUs and drives a supported host compiler for AMD, Intel, OpenPOWER, and Arm CPUs.
+
+In addition, **nvc** (a C11 compiler), **nvc++** (a C++17 compiler), and **nvfortran**
+(an ISO Fortran 2003 compiler) are provided. These compilers can also create code for
+execution on NVIDIA GPUs, and support GPU and multicore CPU programming with parallel
+language features, OpenACC, and OpenMP.
+
+Programming mistakes are inevitable, and they have to be fixed as soon as possible. The
+CUDA toolkit includes the command line tool **cuda-gdb**, which can be used to find
+errors in the code. It is an extension to GDB, the GNU Project debugger. The existing
+GDB debugging features are inherently present for debugging the host code, and
+additional features have been provided to support debugging CUDA device code, allowing
+simultaneous debugging of both GPU and CPU code within the same application. The tool
+provides developers with a mechanism for debugging CUDA applications running on actual
+hardware. This enables developers to debug applications without the potential variations
+introduced by simulation and emulation environments.
+
+In addition, the command line tool **compute-sanitizer** can be used to look exclusively
+for memory access problems: unallocated buffers, out of bounds accesses, race
+conditions, and uninitialized variables.
+
+Finally, in order to utilize the GPUs to their maximum, performance analysis tools are
+needed. NVIDIA provides the NVIDIA Nsight Systems and NVIDIA Nsight Compute tools to
+help developers optimize their applications. The former, NVIDIA Nsight Systems, is a
+system-wide performance analysis tool that provides detailed metrics on both CPU and GPU
+usage, memory bandwidth, and other system-level metrics. The latter, NVIDIA Nsight
+Compute, is a kernel-level performance analysis tool that allows developers to analyze
+the performance of individual CUDA kernels. It provides detailed metrics on kernel
+execution, including memory usage, instruction throughput, and occupancy. These tools
+have graphical interfaces which can be used for all steps of the performance analysis;
+however, on supercomputers it is recommended to use the command line interface for
+collecting the information needed, and then to visualize and analyse the results using
+the graphical interface on a personal computer.
+
+Apart from what was presented above, there are many other tools and features provided by
+NVIDIA. The CUDA ecosystem is very well developed.
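+
+To give a flavour of how these libraries look in practice, below is a minimal host-side
+sketch (not part of the lesson's examples; file and variable names are illustrative)
+that offloads a single-precision AXPY operation, ``y = a*x + y``, to an NVIDIA GPU using
+**cuBLAS**. Error checking is omitted for brevity; it should compile with something like
+``nvcc saxpy_cublas.cpp -lcublas``.
+
+.. code-block:: cpp
+
+    #include <cublas_v2.h>
+    #include <cuda_runtime.h>
+
+    #include <iostream>
+    #include <vector>
+
+    int main()
+    {
+        const int n = 1 << 20;
+        const float a = 2.0f;
+        std::vector<float> x(n, 1.0f), y(n, 3.0f);
+
+        // Allocate device memory and copy the input data from the host
+        float *d_x, *d_y;
+        cudaMalloc((void **)&d_x, n * sizeof(float));
+        cudaMalloc((void **)&d_y, n * sizeof(float));
+        cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
+        cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
+
+        // cuBLAS computes y = a*x + y on the device; no hand-written kernel needed
+        cublasHandle_t handle;
+        cublasCreate(&handle);
+        cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);
+
+        // Copy the result back and release resources
+        cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
+        cublasDestroy(handle);
+        cudaFree(d_x);
+        cudaFree(d_y);
+
+        std::cout << "y[0] = " << y[0] << std::endl; // expected: 5
+        return 0;
+    }
+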
ROCm -^^^^ - - -ROCm is an open software platform allowing researchers to tap the power of AMD accelerators. -The ROCm platform is built on the foundation of open portability, supporting environments across multiple -accelerator vendors and architectures. In some way it is very similar to CUDA API. -It contains libraries, compilers, and development tools for programming and optimizing programs for AMD GPUs. -For debugging, it provides the command line tool ``rocgdb``, while for performance analysis ``rocprof`` and ``roctracer``. -In order to produce code for the AMD GPUs, one can use the Heterogeneous-Computing Interface for Portability (HIP). -HIP is a C++ runtime API and a set of tools that allows developers to write portable GPU-accelerated code for both NVIDIA and AMD platforms. -It provides the ``hipcc`` compiler driver, which will call the appropriate toolchain depending on the desired platform. -On the AMD ROCm platform, HIP provides a header and runtime library built on top of the HIP-Clang (ROCm compiler). -On an NVIDIA platform, HIP provides a header file which translates from the HIP runtime APIs to CUDA runtime APIs. -The header file contains mostly inlined functions and thus has very low overhead. -The code is then compiled with ``nvcc``, the standard C++ compiler provided with CUDA. -On AMD platforms, libraries are prefixed by ``roc``, which can be called directly from HIP. In order to make portable calls, -one can call the libraries using ``hip``-prefixed wrappers. These wrappers can be used at no performance cost and ensure that -HIP code can be used on other platforms with no changes. Libraries included in the ROCm, are almost one-to-one equivalent to the ones supplied with CUDA. - -ROCm also integrates with popular machine learning frameworks such as TensorFlow and PyTorch and provides optimized libraries and drivers to accelerate machine learning workloads on AMD GPUs enabling the researchers to leverage the power of ROCm and AMD accelerators to train and deploy machine learning models efficiently. - +~~~~ + +ROCm is an open software platform allowing researchers to tap the power of AMD +accelerators. The ROCm platform is built on the foundation of open portability, +supporting environments across multiple accelerator vendors and architectures. In some +way it is very similar to CUDA API. It contains libraries, compilers, and development +tools for programming and optimizing programs for AMD GPUs. For debugging, it provides +the command line tool ``rocgdb``, while for performance analysis ``rocprof`` and +``roctracer``. In order to produce code for the AMD GPUs, one can use the +Heterogeneous-Computing Interface for Portability (HIP). HIP is a C++ runtime API and a +set of tools that allows developers to write portable GPU-accelerated code for both +NVIDIA and AMD platforms. It provides the ``hipcc`` compiler driver, which will call the +appropriate toolchain depending on the desired platform. On the AMD ROCm platform, HIP +provides a header and runtime library built on top of the HIP-Clang (ROCm compiler). On +an NVIDIA platform, HIP provides a header file which translates from the HIP runtime +APIs to CUDA runtime APIs. The header file contains mostly inlined functions and thus +has very low overhead. The code is then compiled with ``nvcc``, the standard C++ +compiler provided with CUDA. On AMD platforms, libraries are prefixed by ``roc``, which +can be called directly from HIP. 
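For illustration, a minimal sketch of what this can look like in practice is shown
below: the same ``daxpy`` update as in the cuBLAS example above, written against the
``hip``-prefixed BLAS wrapper so that it dispatches to rocBLAS on AMD hardware and to
cuBLAS on NVIDIA hardware. The header location and build command depend on the ROCm
version, and error checking is omitted for brevity.

.. code-block:: C++

   // Minimal hipBLAS daxpy sketch: y = a*x + y through the hip-prefixed wrapper.
   // Indicative build command: hipcc daxpy_hipblas.cpp -lhipblas
   #include <vector>

   #include <hip/hip_runtime.h>
   #include <hipblas/hipblas.h>   // older ROCm releases install this as <hipblas.h>

   int main()
   {
       const int n = 1 << 20;
       const double a = 2.0;
       std::vector<double> x(n, 1.0), y(n, 2.0);

       // Allocate device buffers and copy the input data to the GPU.
       double *d_x, *d_y;
       hipMalloc((void **)&d_x, n * sizeof(double));
       hipMalloc((void **)&d_y, n * sizeof(double));
       hipMemcpy(d_x, x.data(), n * sizeof(double), hipMemcpyHostToDevice);
       hipMemcpy(d_y, y.data(), n * sizeof(double), hipMemcpyHostToDevice);

       // The hip-prefixed call maps to rocBLAS (AMD) or cuBLAS (NVIDIA).
       hipblasHandle_t handle;
       hipblasCreate(&handle);
       hipblasDaxpy(handle, n, &a, d_x, 1, d_y, 1);
       hipblasDestroy(handle);

       // Copy the result back and release the device memory.
       hipMemcpy(y.data(), d_y, n * sizeof(double), hipMemcpyDeviceToHost);
       hipFree(d_x);
       hipFree(d_y);
       return 0;
   }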
In order to make portable calls, one can call the +libraries using ``hip``-prefixed wrappers. These wrappers can be used at no performance +cost and ensure that HIP code can be used on other platforms with no changes. Libraries +included in the ROCm, are almost one-to-one equivalent to the ones supplied with CUDA. + +ROCm also integrates with popular machine learning frameworks such as TensorFlow and +PyTorch and provides optimized libraries and drivers to accelerate machine learning +workloads on AMD GPUs enabling the researchers to leverage the power of ROCm and AMD +accelerators to train and deploy machine learning models efficiently. oneAPI -^^^^^^ - - -**Intel oneAPI** is a unified software toolkit developed by Intel that allows developers to optimize and deploy applications across a variety of architectures, including CPUs, GPUs, and FPGAs. It provides a comprehensive set of tools, libraries, and frameworks, enabling developers to leverage the full potential of heterogeneous computing environments. With oneAPI, the developers can write code once and deploy it across different hardware targets without the need for significant modifications or rewriting. This approach promotes code reusability, productivity, and performance portability, as it abstracts the complexities of heterogeneous computing and provides a consistent programming interface based on open standards. - -The core of suite is **Intel oneAPI Base Toolkit**, a set of tools and libraries for developing high-performance, data-centric applications across diverse architectures. It features an industry-leading C++ compiler that implements SYCL, an evolution of C++ for heterogeneous computing. It includes the **Collective Communications Library**, the **Data Analytics Library**, the **Deep Neural Networks Library**, the **DPC++/C++ Compiler**, the **DPC++ Library**, the **Math Kernel Library**, the **Threading Building Blocks**, debugging tool **Intel Distribution for GDB**, performance analysis tools **Intel Adviser** and **Intel Vtune Profiler**, the **Video Processing Library**, **Intel Distribution for Python**, the **DPC++ Compatibility Tool**, the **FPGA Add-on for oneAPI Base Toolkit**, the **Integrated Performance Primitives**. -This can be complemented with additional toolkits. The **Intel oneAPI HPC Toolkit** contains **DPC++/C++ Compiler**, **Fortran** and **C++** Compiler Classic, debugging tools **Cluster Checker** and **Inspector**, **Intel MPI Library**, and performance analysis tool **Intel Trace Analyzer and Collector**. - -oneAPI supports multiple programming models and programming languages. It enables developers to write **OpenMP** codes targeting multi-core CPUs and Intel GPUs using the Classic Fortran and C++ compilers and as well **SYCL** programs for GPUs and FPGAs using the **DPC++** compiler. Initially, the **DPC++** compiler only targeted Intel GPUs using the **oneAPI Level Zero** low-level programming interface, but now support for NVIDIA GPUs (using CUDA) and AMD GPUs (using ROCm) has been added. -Overall, Intel oneAPI offers a comprehensive and unified approach to heterogeneous computing, empowering developers to optimize and deploy applications across different architectures with ease. By abstracting the complexities and providing a consistent programming interface, oneAPI promotes code reusability, productivity, and performance portability, making it an invaluable toolkit for developers in the era of diverse computing platforms. 
-
-
+~~~~~~
+
+**Intel oneAPI** is a unified software toolkit developed by Intel that allows developers
+to optimize and deploy applications across a variety of architectures, including CPUs,
+GPUs, and FPGAs. It provides a comprehensive set of tools, libraries, and frameworks,
+enabling developers to leverage the full potential of heterogeneous computing
+environments. With oneAPI, developers can write code once and deploy it across different
+hardware targets without the need for significant modifications or rewriting. This
+approach promotes code reusability, productivity, and performance portability, as it
+abstracts the complexities of heterogeneous computing and provides a consistent
+programming interface based on open standards.
+
+The core of the suite is the **Intel oneAPI Base Toolkit**, a set of tools and libraries
+for developing high-performance, data-centric applications across diverse architectures.
+It features an industry-leading C++ compiler that implements SYCL, an evolution of C++
+for heterogeneous computing. It includes the **Collective Communications Library**, the
+**Data Analytics Library**, the **Deep Neural Networks Library**, the **DPC++/C++
+Compiler**, the **DPC++ Library**, the **Math Kernel Library**, the **Threading Building
+Blocks**, the debugging tool **Intel Distribution for GDB**, the performance analysis
+tools **Intel Advisor** and **Intel VTune Profiler**, the **Video Processing Library**,
+the **Intel Distribution for Python**, the **DPC++ Compatibility Tool**, the **FPGA
+Add-on for oneAPI Base Toolkit**, and the **Integrated Performance Primitives**. This
+can be complemented with additional toolkits. The **Intel oneAPI HPC Toolkit** contains
+the **DPC++/C++ Compiler**, the **Fortran** and **C++** Compiler Classic, the debugging
+tools **Cluster Checker** and **Inspector**, the **Intel MPI Library**, and the
+performance analysis tool **Intel Trace Analyzer and Collector**.
+
+oneAPI supports multiple programming models and programming languages. It enables
+developers to write **OpenMP** code targeting multi-core CPUs and Intel GPUs using the
+Classic Fortran and C++ compilers, as well as **SYCL** programs for GPUs and FPGAs using
+the **DPC++** compiler. Initially, the **DPC++** compiler only targeted Intel GPUs using
+the **oneAPI Level Zero** low-level programming interface, but support for NVIDIA GPUs
+(using CUDA) and AMD GPUs (using ROCm) has since been added. Overall, Intel oneAPI
+offers a comprehensive and unified approach to heterogeneous computing, empowering
+developers to optimize and deploy applications across different architectures with ease.
+By abstracting the complexities and providing a consistent programming interface, oneAPI
+promotes code reusability, productivity, and performance portability, making it an
+invaluable toolkit for developers in the era of diverse computing platforms.

Differences and similarities
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-GPUs in general support different features, even among the same producer. In general newer cards come with extra
-features and sometimes old features are not supported anymore. It is important when compiling to create binaries
-targeting the specific architecture when compiling. A binary built for a newer card will not run on older devices,
-while a binary build for older devices might not run efficiently on newer architectures. In CUDA the compute
-capability which is targeted is specified by the ``-arch=sm_XY``, where ``X`` specifies the major architecture and it is between 1 and 9, and ``Y`` the minor.
When using HIP on NVIDIA platforms one needs to use compiling option ``--gpu-architecture=sm_XY``, while on AMD platforms ``--offload-arch=gfxabc`` ( where ``abc`` is the architecture code such as ``90a`` for the MI200 series or ``908`` for MI100 series). -Note that in the case of portable (single source) programs one would specify ``openmp`` as well as target for -compilation, enabling to run the same code on multicore CPU. - - +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +GPUs in general support different features, even among the same producer. In general +newer cards come with extra features and sometimes old features are not supported +anymore. It is important when compiling to create binaries targeting the specific +architecture when compiling. A binary built for a newer card will not run on older +devices, while a binary build for older devices might not run efficiently on newer +architectures. In CUDA the compute capability which is targeted is specified by the +``-arch=sm_XY``, where ``X`` specifies the major architecture and it is between 1 and 9, +and ``Y`` the minor. When using HIP on NVIDIA platforms one needs to use compiling +option ``--gpu-architecture=sm_XY``, while on AMD platforms ``--offload-arch=gfxabc`` ( +where ``abc`` is the architecture code such as ``90a`` for the MI200 series or ``908`` +for MI100 series). Note that in the case of portable (single source) programs one would +specify ``openmp`` as well as target for compilation, enabling to run the same code on +multicore CPU. Terminology -^^^^^^^^^^^ +~~~~~~~~~~~ .. list-table:: Hardware - :widths: 25 25 50 - :header-rows: 1 - - * - NVIDIA - - AMD - - Intel - * - Streaming processor/streaming core - - SIMD lane - - Processing element - * - SIMT unit - - SIMD unit - - Vector engine (XVE) - * - Streaming Multiprocessor (SM) - - Computing Unit (CU) - - Xe-core / Execution unit (EU) - * - GPU processing clusters (GPC) - - Compute Engine - - Xe-slice - -Please keep in mind, that this table is only a rough approximation. -Each GPU architecture is different, and it's impossible to make a 1-to-1 mapping between terms used by different vendors. - - + :widths: 25 25 50 + :header-rows: 1 + + - - NVIDIA + - AMD + - Intel + - - Streaming processor/streaming core + - SIMD lane + - Processing element + - - SIMT unit + - SIMD unit + - Vector engine (XVE) + - - Streaming Multiprocessor (SM) + - Computing Unit (CU) + - Xe-core / Execution unit (EU) + - - GPU processing clusters (GPC) + - Compute Engine + - Xe-slice + +Please keep in mind, that this table is only a rough approximation. Each GPU +architecture is different, and it's impossible to make a 1-to-1 mapping between terms +used by different vendors. Summary ------- -- GPUs are designed to execute thousands of threads simultaneously, making them highly parallel processors. In contrast, CPUs excel at executing a smaller number of threads in parallel. -- GPUs allocate a larger portion of transistors to data processing rather than data caching and flow control. This prioritization of data processing enables GPUs to effectively handle parallel computations and hide memory access latencies through computation. -- GPU producers provide comprehensive toolkits, libraries, and compilers for developing high-performance applications that leverage the parallel processing power of GPUs. Examples include CUDA (NVIDIA), ROCm (AMD), and oneAPI (Intel). 
-- These platforms offer debugging tools (e.g., ``cuda-gdb``, ``rocgdb``) and performance analysis tools (e.g., NVIDIA Nsight Systems, NVIDIA Nsight Compute, ``rocprof``, ``roctracer``) to facilitate code optimization and ensure efficient utilization of GPU resources. - - +- GPUs are designed to execute thousands of threads simultaneously, making them highly + parallel processors. In contrast, CPUs excel at executing a smaller number of threads + in parallel. +- GPUs allocate a larger portion of transistors to data processing rather than data + caching and flow control. This prioritization of data processing enables GPUs to + effectively handle parallel computations and hide memory access latencies through + computation. +- GPU producers provide comprehensive toolkits, libraries, and compilers for developing + high-performance applications that leverage the parallel processing power of GPUs. + Examples include CUDA (NVIDIA), ROCm (AMD), and oneAPI (Intel). +- These platforms offer debugging tools (e.g., ``cuda-gdb``, ``rocgdb``) and performance + analysis tools (e.g., NVIDIA Nsight Systems, NVIDIA Nsight Compute, ``rocprof``, + ``roctracer``) to facilitate code optimization and ensure efficient utilization of GPU + resources. Exercises --------- .. challenge:: GPUs and memory - Which statement about the relationship between GPUs and memory is true? + Which statement about the relationship between GPUs and memory is true? - - A) GPUs are not affected by memory access latencies. - - B) GPUs can run out of memory quickly with many cores trying to access the memory simultaneously. - - C) GPUs have an unlimited cache size. - - D) GPUs prefer to run with a minimal number of threads to manage memory effectively. + - A) GPUs are not affected by memory access latencies. + - B) GPUs can run out of memory quickly with many cores trying to access the memory simultaneously. + - C) GPUs have an unlimited cache size. + - D) GPUs prefer to run with a minimal number of threads to manage memory effectively. - .. solution:: - - The correct answer is B). This is true because GPUs run many threads simultaneously on thousands of - cores, and with limited cache available, this can lead to the GPU running out of memory quickly if many - cores are trying to access the memory simultaneously. This is why data management and access patterns - are essential in GPU computing. + .. solution:: + The correct answer is B). This is true because GPUs run many threads simultaneously on thousands of + cores, and with limited cache available, this can lead to the GPU running out of memory quickly if many + cores are trying to access the memory simultaneously. This is why data management and access patterns + are essential in GPU computing. .. keypoints:: - - GPUs vs. CPUs, key differences between them - - GPU software suites, support specific GPU features, programming models, compatibility - - Applications of GPUs - + - GPUs vs. CPUs, key differences between them + - GPU software suites, support specific GPU features, programming models, compatibility + - Applications of GPUs diff --git a/content/3-gpu-problems.rst b/content/3-gpu-problems.rst index 405a7bf1..d8b50736 100644 --- a/content/3-gpu-problems.rst +++ b/content/3-gpu-problems.rst @@ -1,257 +1,267 @@ .. _gpu-problems: - What problems fit to GPU? ========================= .. questions:: - - What are the strengths and weaknesses of GPUs? - - What makes a particular problem suitable for GPU-porting? - - Why are GPUs so ubiquitous in machine learning applications? 
+ - What are the strengths and weaknesses of GPUs? + - What makes a particular problem suitable for GPU-porting? + - Why are GPUs so ubiquitous in machine learning applications? .. objectives:: - - Get a feeling for the type of use cases that GPUs excel at. + - Get a feeling for the type of use cases that GPUs excel at. .. instructor-note:: - - 10 min teaching - - 10 min exercises - - + - 10 min teaching + - 10 min exercises What are GPUs good for? ----------------------- +Answer from `Stack Exchange +`__: -Answer from `Stack Exchange `__: - - *From a metaphorical point of view, the GPU can be seen as a person lying on a bed - of nails. The person lying on top is the data and in the base of each nail there - is a processor, so the nail is actually an arrow pointing from processor to memory. - All nails are in a regular pattern, like a grid. If the body is well spread, - it feels good (performance is good), if the body only touches some spots of the - nail bed, then the pain is bad (bad performance).* + *From a metaphorical point of view, the GPU can be seen as a person lying on a bed + of nails. The person lying on top is the data and in the base of each nail there is + a processor, so the nail is actually an arrow pointing from processor to memory. All + nails are in a regular pattern, like a grid. If the body is well spread, it feels + good (performance is good), if the body only touches some spots of the nail bed, + then the pain is bad (bad performance).* - -GPU computing is well-suited to problems that involve large amounts of data parallelism. +GPU computing is well-suited to problems that involve large amounts of data parallelism. Specifically, you can expect good performance on GPUs for: -- **Large-scale matrix and vector operations**: Common in machine learning, scientific computing, and image processing. -- **Fourier transforms**: Also common in machine learning, scientific computing, and image processing. -- **Monte Carlo simulations**: Used across finance, physics, and other fields to simulate complex systems. +- **Large-scale matrix and vector operations**: Common in machine learning, scientific + computing, and image processing. +- **Fourier transforms**: Also common in machine learning, scientific computing, and + image processing. +- **Monte Carlo simulations**: Used across finance, physics, and other fields to + simulate complex systems. - **Molecular dynamics simulations**: Used in chemistry, biochemistry and physics. - **Computational fluid dynamics**: Used in engineering, physics, and other fields. - **Convolutional neural networks** and **computer vision algorithms**. - **Big data analytics**: Clustering, classification, regression, etc. - **Graphics rendering**: Original use-case for GPUs. - What are GPUs not good for? --------------------------- - -Not all programming problems can efficiently leverage the parallelism offered by GPUs. +Not all programming problems can efficiently leverage the parallelism offered by GPUs. Some types of problems that do not fit well on a GPU include: -- **Sequential tasks**: Problems that require a series of dependent steps, - where each step relies on the outcome of the previous step, are not well-suited - for parallel processing. Examples include recursive algorithms, certain dynamic - programming problems, and some graph traversal algorithms. - -- **Fine-grained branching**: GPUs perform best when the code being executed across - different threads follows a similar control flow. 
When there is extensive - branching (i.e., many ``if`` statements) within a kernel or algorithm, performance - may suffer due to the divergence in execution paths among the GPU threads. - -- **Low arithmetic intensity**: GPUs excel at performing a large number of mathematical - operations quickly. If a problem has low arithmetic intensity (i.e., a low ratio of - arithmetic operations to memory accesses), the GPU may not be able to efficiently utilize - its computational power, leading to underperformance. - -- **Small data sets**: If the problem involves a small data set that does not require significant - parallelism, using a GPU may not result in noticeable performance gains. In such cases, - the overhead of transferring data between the CPU and GPU, and the time spent initializing the GPU, - may outweigh any potential benefits. - -- **Limited parallelism**: Some algorithms have inherent limitations on the degree of parallelism that can be - achieved. In these cases, using a GPU may not lead to significant performance improvements. - -- **Memory-bound problems**: GPUs generally have less memory available compared to CPUs, and their memory bandwidth - can be a limiting factor. If a problem requires a large amount of memory or involves memory-intensive operations, - it may not be well-suited for a GPU. - +- **Sequential tasks**: Problems that require a series of dependent steps, where each + step relies on the outcome of the previous step, are not well-suited for parallel + processing. Examples include recursive algorithms, certain dynamic programming + problems, and some graph traversal algorithms. +- **Fine-grained branching**: GPUs perform best when the code being executed across + different threads follows a similar control flow. When there is extensive branching + (i.e., many ``if`` statements) within a kernel or algorithm, performance may suffer + due to the divergence in execution paths among the GPU threads. +- **Low arithmetic intensity**: GPUs excel at performing a large number of mathematical + operations quickly. If a problem has low arithmetic intensity (i.e., a low ratio of + arithmetic operations to memory accesses), the GPU may not be able to efficiently + utilize its computational power, leading to underperformance. +- **Small data sets**: If the problem involves a small data set that does not require + significant parallelism, using a GPU may not result in noticeable performance gains. + In such cases, the overhead of transferring data between the CPU and GPU, and the time + spent initializing the GPU, may outweigh any potential benefits. +- **Limited parallelism**: Some algorithms have inherent limitations on the degree of + parallelism that can be achieved. In these cases, using a GPU may not lead to + significant performance improvements. +- **Memory-bound problems**: GPUs generally have less memory available compared to CPUs, + and their memory bandwidth can be a limiting factor. If a problem requires a large + amount of memory or involves memory-intensive operations, it may not be well-suited + for a GPU. Examples of GPU acceleration ---------------------------- -To give a flavor of what type of performance gains we can achieve by porting a calculations to a GPU -(if we're lucky!), let's look at a few case examples. +To give a flavor of what type of performance gains we can achieve by porting a +calculations to a GPU (if we're lucky!), let's look at a few case examples. .. 
discussion:: Effect of array size - - Consider the case of matrix multiplication in the Julia language: - - .. code-block:: julia - - using AMDGPU - using BenchmarkTools - - N = [9, 10, 11, 12] - - for n in N - A = rand(2^n, 2^n); A_d = ROCArray(A); - - @btime $A * $A; - - @btime begin - $A_d * $A_d; - AMDGPU.synchronize() - end - end - - - - How much faster do you think the GPU version is compared to running on a single CPU core? - - Julia automatically parallelises matrix multiplication over available CPU cores. Will the GPU version be faster than running on 64 cores? - - Does the size of the array affect how much the performance improves? - - .. solution:: - - Example results from running on LUMI (MI250X AMD GPU, 64-core AMD Trento CPUs): - - .. list-table:: GPU acceleration for matrix multiply in Julia - :widths: 25 25 25 25 25 - :header-rows: 1 - - * - Matrix size - - 1 CPU core - - 64 CPU cores - - 1 GPU - - GPU speedup - * - (512, 512) - - 5.472 ms - - 517.722 μs - - 115.805 μs - - ~47x / ~5x - * - (1024, 1024) - - 43.364 ms - - 2.929 ms - - 173.316 μs - - ~250x / ~17x - * - (2048, 2048) - - 344.364 ms - - 30.081 ms - - 866.348 μs - - ~400x / ~35x - * - (4096, 4096) - - 3.221 s - - 159.563 ms - - 5.910 ms - - ~550x / ~27x + Consider the case of matrix multiplication in the Julia language: + + .. code-block:: julia + + using AMDGPU + using BenchmarkTools + + N = [9, 10, 11, 12] + + for n in N + A = rand(2^n, 2^n); A_d = ROCArray(A); + + @btime $A * $A; + + @btime begin + $A_d * $A_d; + AMDGPU.synchronize() + end + end + + + - How much faster do you think the GPU version is compared to running on a single CPU core? + - Julia automatically parallelises matrix multiplication over available CPU cores. Will the GPU version be faster than running on 64 cores? + - Does the size of the array affect how much the performance improves? + + .. solution:: + + Example results from running on LUMI (MI250X AMD GPU, 64-core AMD Trento CPUs): + + .. list-table:: GPU acceleration for matrix multiply in Julia + :widths: 25 25 25 25 25 + :header-rows: 1 + + * - Matrix size + - 1 CPU core + - 64 CPU cores + - 1 GPU + - GPU speedup + * - (512, 512) + - 5.472 ms + - 517.722 μs + - 115.805 μs + - ~47x / ~5x + * - (1024, 1024) + - 43.364 ms + - 2.929 ms + - 173.316 μs + - ~250x / ~17x + * - (2048, 2048) + - 344.364 ms + - 30.081 ms + - 866.348 μs + - ~400x / ~35x + * - (4096, 4096) + - 3.221 s + - 159.563 ms + - 5.910 ms + - ~550x / ~27x Electronic structure calculations -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -VASP is a popular software package used for electronic structure calculations. The figures below show the speedup observed in a recent benchmark study on the Perlmutter and Cori supercomputers, along with an analysis of total energy usage. +VASP is a popular software package used for electronic structure calculations. The +figures below show the speedup observed in a recent benchmark study on the Perlmutter +and Cori supercomputers, along with an analysis of total energy usage. .. figure:: img/problems/vasp_gpu.png - :align: center + :align: center - VASP GPU speedup for benchmark Si128 acfdtr. The horizontal axis shows the number of nodes, and the vertical axis shows the GPU speedup of VASP (Time(CPU)/Time(GPU)). (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs). + VASP GPU speedup for benchmark Si128 acfdtr. The horizontal axis shows the number of + nodes, and the vertical axis shows the GPU speedup of VASP (Time(CPU)/Time(GPU)). 
+ (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs). .. figure:: img/problems/vasp_energy.png - :align: center - - Total energy usage comparison when running VASP on Perlmutter and Cori. The vertical axis shows the energy used by VASP benchmark jobs on Perlmutter GPUs (blue bars), CPUs (red bars), Cori KNL (yellow bars), and Cori Haswell (green bars) in ratio to the Cori Haswell usage. (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs) - + :align: center + Total energy usage comparison when running VASP on Perlmutter and Cori. The vertical + axis shows the energy used by VASP benchmark jobs on Perlmutter GPUs (blue bars), + CPUs (red bars), Cori KNL (yellow bars), and Cori Haswell (green bars) in ratio to + the Cori Haswell usage. (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs) Computational Chemistry -^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~ + +A great deal of computational resources are spent in Quantum Chemical calculations which +involve the solution of the Hartree-Fock eigenvalue problem, which requires the +diagonalization of the Fock matrix whose elements are given by: -A great deal of computational resources are spent in Quantum Chemical calculations which involve -the solution of the Hartree-Fock eigenvalue problem, which requires the diagonalization of the -Fock matrix whose elements are given by: - .. math:: + F_{\alpha \beta} = H^{\textrm{core}}_{\alpha \beta} + \sum_{\gamma \delta}D_{\gamma \delta} \left [ (\alpha \beta|\gamma \delta) - \frac{1}{2} (\alpha \delta|\gamma \beta) \right ], -The first term is related to the one electron contributions and the second term is related to the -electron repulsion integrals (ERIs), in parenthesis, weighted by the by the density matrix -:math:`D_{\gamma \delta}`. One of the most expensive parts in the solution of the Hartree-Fock equations is the -processing (digestion) of the ERIs, one algorithm to do this task is as follows: +The first term is related to the one electron contributions and the second term is +related to the electron repulsion integrals (ERIs), in parenthesis, weighted by the by +the density matrix :math:`D_{\gamma \delta}`. One of the most expensive parts in the +solution of the Hartree-Fock equations is the processing (digestion) of the ERIs, one +algorithm to do this task is as follows: .. figure:: img/concepts/algorithms.svg :width: 200 :align: center - Algorithm for processing ERIs [see `JCTC, 17, 7486, (2021) `__ for details] - -This algorithm is suitable for GPUs as it involves many arithmetic operations. In addition to this, -there are symmetries and properties of the integrals that could be used to rearrange the loops in -an efficient manner that fit GPU architectures. + Algorithm for processing ERIs [see `JCTC, 17, 7486, (2021) + `__ for details] +This algorithm is suitable for GPUs as it involves many arithmetic operations. In +addition to this, there are symmetries and properties of the integrals that could be +used to rearrange the loops in an efficient manner that fit GPU architectures. Humanities -^^^^^^^^^^ +~~~~~~~~~~ -A brief introduction into some of the work that is being done in the humanities that can benefit from utilizing GPUs. +A brief introduction into some of the work that is being done in the humanities that can +benefit from utilizing GPUs. 
**Language models and NLP (natural language processing)** -With the recent popularity of ChatGPT, the use of language models has come into the mainstream, -however such models have been used in the humanities many years already. One of the biggest goals of humanities -researchers is working with textual data which has increased exponentially over recent years due to the rise in -social media. Analyzing such textual data to gain insights into questions of sociology, linguistics and various -other fields have become increasingly reliant on using language models. Along with language models, -the need for GPU access has become essential. - +With the recent popularity of ChatGPT, the use of language models has come into the +mainstream, however such models have been used in the humanities many years already. One +of the biggest goals of humanities researchers is working with textual data which has +increased exponentially over recent years due to the rise in social media. Analyzing +such textual data to gain insights into questions of sociology, linguistics and various +other fields have become increasingly reliant on using language models. Along with +language models, the need for GPU access has become essential. **Archeology** -The field of archeology also makes use of GPUs in their 3D modelling -and rendering work. The biggest problem with archeological sites is that once they are excavated, -they are destroyed, so any researchers who aren't present at the site, would lose valuable insights into how -it looked when it was found. However, with recent developments in technology and accessibility to high-performance -computing, they are able to generate extremely detailed renderings of the excavation sites which act as a way to -preserve the site for future researchers to gain critical insights and contribute to the research. +The field of archeology also makes use of GPUs in their 3D modelling and rendering work. +The biggest problem with archeological sites is that once they are excavated, they are +destroyed, so any researchers who aren't present at the site, would lose valuable +insights into how it looked when it was found. However, with recent developments in +technology and accessibility to high-performance computing, they are able to generate +extremely detailed renderings of the excavation sites which act as a way to preserve the +site for future researchers to gain critical insights and contribute to the research. **Cognitive Science** -Techniques such as Markov Chain Monte Carlo (MCMC) sampling have proven to be invaluable in studies that delve into human behavior or population dynamics. MCMC sampling allows researchers to simulate and analyze complex systems by iteratively sampling from a Markov chain, enabling the exploration of high-dimensional parameter spaces. This method is particularly useful when studying human behavior, as it can capture the inherent randomness and interdependencies that characterize social systems. By leveraging MCMC sampling, researchers can gain insights into various aspects of human behavior, such as decision-making, social interactions, and the spread of information or diseases within populations. - -By offloading the computational workload to GPUs, researchers can experience substantial speedup in the execution of MCMC algorithms. This speedup allows for more extensive exploration of parameter spaces and facilitates the analysis of larger datasets, leading to more accurate and detailed insights into human behavior or population dynamics. 
Examples of studies done using these methods can be found at the `Center for Humanities Computing Aarhus `__ (CHCAA) and `Interacting Minds Centre `__ (IMC) at Aarhus University. - - +Techniques such as Markov Chain Monte Carlo (MCMC) sampling have proven to be invaluable +in studies that delve into human behavior or population dynamics. MCMC sampling allows +researchers to simulate and analyze complex systems by iteratively sampling from a +Markov chain, enabling the exploration of high-dimensional parameter spaces. This method +is particularly useful when studying human behavior, as it can capture the inherent +randomness and interdependencies that characterize social systems. By leveraging MCMC +sampling, researchers can gain insights into various aspects of human behavior, such as +decision-making, social interactions, and the spread of information or diseases within +populations. + +By offloading the computational workload to GPUs, researchers can experience substantial +speedup in the execution of MCMC algorithms. This speedup allows for more extensive +exploration of parameter spaces and facilitates the analysis of larger datasets, leading +to more accurate and detailed insights into human behavior or population dynamics. +Examples of studies done using these methods can be found at the `Center for Humanities +Computing Aarhus `__ (CHCAA) and `Interacting Minds Centre +`__ (IMC) at Aarhus University. Exercises --------- .. challenge:: Discussion - - What type of problems have you used GPUs for? - - How large was the performance boost? - + - What type of problems have you used GPUs for? + - How large was the performance boost? .. challenge:: Good and bad use cases for GPU porting - Which of the following computational tasks is likely to gain the least performance benefit from being ported to a GPU? - - 1. Training a large, deep neural network. - 2. Performing a Monte Carlo simulation with a large number of independent trials. - 3. Executing an algorithm with heavy use of recursion and frequent branching. - 4. Processing a large image with a convolutional filter. - - .. solution:: + Which of the following computational tasks is likely to gain the least performance benefit from being ported to a GPU? - The right answer is option 3. GPUs do not handle recursion and branching as effectively as more - data-heavy algorithms. + 1. Training a large, deep neural network. + 2. Performing a Monte Carlo simulation with a large number of independent trials. + 3. Executing an algorithm with heavy use of recursion and frequent branching. + 4. Processing a large image with a convolutional filter. + .. solution:: + The right answer is option 3. GPUs do not handle recursion and branching as effectively as more + data-heavy algorithms. .. keypoints:: - - GPUs excel in processing tasks with high data parallelism, such as large-scale matrix operations, Fourier transforms, and big data analytics. - - GPUs struggle with sequential tasks, problems with extensive control flow divergence, low arithmetic intensity tasks, small data sets, and memory-bound problems. + - GPUs excel in processing tasks with high data parallelism, such as large-scale matrix operations, Fourier transforms, and big data analytics. + - GPUs struggle with sequential tasks, problems with extensive control flow divergence, low arithmetic intensity tasks, small data sets, and memory-bound problems. 
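To make these keypoints more concrete, here is a small, purely illustrative C++ sketch
(the function names and the recurrence are arbitrary). The first loop has fully
independent iterations, so each element could be handled by a separate GPU thread; the
second carries a dependence from one iteration to the next and cannot be split across
threads in the same straightforward way.

.. code-block:: C++

   #include <cstddef>
   #include <vector>

   // Data parallel: every iteration is independent, so the loop maps naturally
   // onto thousands of GPU threads (one element per thread).
   void scale_all(std::vector<double> &v, double a)
   {
       for (std::size_t i = 0; i < v.size(); ++i)
           v[i] *= a;
   }

   // Sequential: each iteration needs the value produced by the previous one,
   // so the iterations cannot simply be distributed across GPU threads.
   double recurrence(const std::vector<double> &v)
   {
       double s = 0.0;
       for (std::size_t i = 0; i < v.size(); ++i)
           s = 0.5 * s + v[i];
       return s;
   }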
diff --git a/content/4-gpu-concepts.rst b/content/4-gpu-concepts.rst index 89a3ded0..77a81fd9 100644 --- a/content/4-gpu-concepts.rst +++ b/content/4-gpu-concepts.rst @@ -1,240 +1,335 @@ .. _gpu-concepts: - GPU programming concepts ======================== - .. questions:: - - What types of parallel computing is possible? - - How does data parallelism differ from task parallelism, and how are they utilized in parallel computing? - - How is the work parallelized and executed on GPUs? - - What are general considerations for an efficient code running on GPUs? + - What types of parallel computing is possible? + - How does data parallelism differ from task parallelism, and how are they utilized in parallel computing? + - How is the work parallelized and executed on GPUs? + - What are general considerations for an efficient code running on GPUs? .. objectives:: - - Understand parallel computing principles and architectures. - - Differentiate data parallelism from task parallelism. - - Learn the GPU execution model. - - Parallelize and execute work on GPUs. - - Develop efficient GPU code for high performance. + - Understand parallel computing principles and architectures. + - Differentiate data parallelism from task parallelism. + - Learn the GPU execution model. + - Parallelize and execute work on GPUs. + - Develop efficient GPU code for high performance. .. instructor-note:: - - 25 min teaching - - 0 min exercises - + - 25 min teaching + - 0 min exercises Different types of parallelism ------------------------------ - Distributed- vs. Shared-Memory Architecture ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Most of computing problems are not trivially parallelizable, which means that the subtasks -need to have access from time to time to some of the results computed by other subtasks. -The way subtasks exchange needed information depends on the available hardware. +Most of computing problems are not trivially parallelizable, which means that the +subtasks need to have access from time to time to some of the results computed by other +subtasks. The way subtasks exchange needed information depends on the available +hardware. .. figure:: img/history/distributed_vs_shared.png - :align: center - - Distributed- vs shared-memory parallel computing. + :align: center -In a distributed memory environment each processing unit operates independently from the -others. It has its own memory and it **cannot** access the memory in other nodes. -The communication is done via network and each computing unit runs a separate copy of the -operating system. In a shared memory machine all processing units have access to the memory -and can read or modify the variables within. + Distributed- vs shared-memory parallel computing. +In a distributed memory environment each processing unit operates independently from the +others. It has its own memory and it **cannot** access the memory in other nodes. The +communication is done via network and each computing unit runs a separate copy of the +operating system. In a shared memory machine all processing units have access to the +memory and can read or modify the variables within. Processes and Threads ~~~~~~~~~~~~~~~~~~~~~ -The type of environment (distributed- or shared-memory) determines the programming model. -There are two types of parallelism possible, process based and thread based. +The type of environment (distributed- or shared-memory) determines the programming +model. There are two types of parallelism possible, process based and thread based. .. 
figure:: img/history/processes-threads.png - :align: center - -For distributed memory machines, a process-based parallel programming model is employed. -The processes are independent execution units which have their *own memory* address spaces. -They are created when the parallel program is started and they are only terminated at the -end. The communication between them is done explicitly via message passing like MPI. - -On the shared memory architectures it is possible to use a thread based parallelism. -The threads are light execution units and can be created and destroyed at a relatively -small cost. The threads have their own state information, but they *share* the *same memory* -address space. When needed the communication is done though the shared memory. - - -Both approaches have their advantages and disadvantages. Distributed machines are -relatively cheap to build and they have an "infinite " capacity. In principle one could -add more and more computing units. In practice the more computing units are used the more -time consuming is the communication. The shared memory systems can achieve good performance -and the programming model is quite simple. However they are limited by the memory capacity -and by the access speed. In addition in the shared parallel model it is much easier to -create race conditions. + :align: center +For distributed memory machines, a process-based parallel programming model is employed. +The processes are independent execution units which have their *own memory* address +spaces. They are created when the parallel program is started and they are only +terminated at the end. The communication between them is done explicitly via message +passing like MPI. + +On the shared memory architectures it is possible to use a thread based parallelism. The +threads are light execution units and can be created and destroyed at a relatively small +cost. The threads have their own state information, but they *share* the *same memory* +address space. When needed the communication is done though the shared memory. + +Both approaches have their advantages and disadvantages. Distributed machines are +relatively cheap to build and they have an "infinite " capacity. In principle one could +add more and more computing units. In practice the more computing units are used the +more time consuming is the communication. The shared memory systems can achieve good +performance and the programming model is quite simple. However they are limited by the +memory capacity and by the access speed. In addition in the shared parallel model it is +much easier to create race conditions. Exposing parallelism -------------------- -There are two types of parallelism that can be explored. -The data parallelism is when the data can be distributed across computational units that can run in parallel. -The units process the data by applying the same or very similar operation to different data elements. -A common example is applying a blur filter to an image --- the same function is applied to all the pixels on an image. -This parallelism is natural for the GPU, where the same instruction set is executed in multiple :abbr:`threads`. +There are two types of parallelism that can be explored. The data parallelism is when +the data can be distributed across computational units that can run in parallel. The +units process the data by applying the same or very similar operation to different data +elements. 
A common example is applying a blur filter to an image --- the same function +is applied to all the pixels on an image. This parallelism is natural for the GPU, where +the same instruction set is executed in multiple :abbr:`threads`. .. figure:: img/concepts/ENCCS-OpenACC-CUDA_TaskParallelism_Explanation.png :align: center :scale: 40 % - Data parallelism and task parallelism. - The data parallelism is when the same operation applies to multiple data (e.g. multiple elements of an array are transformed). - The task parallelism implies that there are more than one independent task that, in principle, can be executed in parallel. - -Data parallelism can usually be explored by the GPUs quite easily. -The most basic approach would be finding a loop over many data elements and converting it into a GPU kernel. -If the number of elements in the data set is fairly large (tens or hundred of thousands elements), the GPU should perform quite well. Although it would be odd to expect absolute maximum performance from such a naive approach, it is often the one to take. Getting absolute maximum out of the data parallelism requires good understanding of how GPU works. - - -Another type of parallelism is a task parallelism. -This is when an application consists of more than one task that requiring to perform different operations with (the same or) different data. -An example of task parallelism is cooking: slicing vegetables and grilling are very different tasks and can be done at the same time. -Note that the tasks can consume totally different resources, which also can be explored. + Data parallelism and task parallelism. The data parallelism is when the same + operation applies to multiple data (e.g. multiple elements of an array are + transformed). The task parallelism implies that there are more than one independent + task that, in principle, can be executed in parallel. + +Data parallelism can usually be explored by the GPUs quite easily. The most basic +approach would be finding a loop over many data elements and converting it into a GPU +kernel. If the number of elements in the data set is fairly large (tens or hundred of +thousands elements), the GPU should perform quite well. Although it would be odd to +expect absolute maximum performance from such a naive approach, it is often the one to +take. Getting absolute maximum out of the data parallelism requires good understanding +of how GPU works. + +Another type of parallelism is a task parallelism. This is when an application consists +of more than one task that requiring to perform different operations with (the same or) +different data. An example of task parallelism is cooking: slicing vegetables and +grilling are very different tasks and can be done at the same time. Note that the tasks +can consume totally different resources, which also can be explored. .. admonition:: In short - :class: dropdown - - - Computing problems can be parallelized in distributed memory or shared memory architectures. - - In distributed memory, each unit operates independently, with no direct memory access between nodes. - - In shared memory, units have access to the same memory and can communicate through shared variables. - - Parallel programming can be process-based (distributed memory) or thread-based (shared memory). - - Process-based parallelism uses independent processes with separate memory spaces and explicit message passing. - - Thread-based parallelism uses lightweight threads that share the same memory space and communicate through shared memory. 
- - Data parallelism distributes data across computational units, processing them with the same or similar operations. - - Task parallelism involves multiple independent tasks that perform different operations on the same or different data. - - Task parallelism involves executing different tasks concurrently, leveraging different resources. + - Computing problems can be parallelized in distributed memory or shared memory + architectures. + - In distributed memory, each unit operates independently, with no direct memory + access between nodes. + - In shared memory, units have access to the same memory and can communicate through + shared variables. + - Parallel programming can be process-based (distributed memory) or thread-based + (shared memory). + - Process-based parallelism uses independent processes with separate memory spaces + and explicit message passing. + - Thread-based parallelism uses lightweight threads that share the same memory space + and communicate through shared memory. + - Data parallelism distributes data across computational units, processing them with + the same or similar operations. + - Task parallelism involves multiple independent tasks that perform different + operations on the same or different data. + - Task parallelism involves executing different tasks concurrently, leveraging + different resources. GPU Execution Model ------------------- -In order to obtain maximum performance it is important to understand how GPUs execute the programs. As mentioned before a CPU is a flexible device oriented towards general purpose usage. It's fast and versatile, designed to run operating systems and various, very different types of applications. It has lots of features, such as better control logic, caches and cache coherence, that are not related to pure computing. CPUs optimize the execution by trying to achieve low latency via heavy caching and branch prediction. +In order to obtain maximum performance it is important to understand how GPUs execute +the programs. As mentioned before a CPU is a flexible device oriented towards general +purpose usage. It's fast and versatile, designed to run operating systems and various, +very different types of applications. It has lots of features, such as better control +logic, caches and cache coherence, that are not related to pure computing. CPUs optimize +the execution by trying to achieve low latency via heavy caching and branch prediction. .. figure:: img/concepts/cpu-gpu-highway.png :align: center :scale: 40 % - Cars and roads analogy for the CPU and GPU behavior. The compact road is analogous to the CPU - (low latency, low throughput) and the broader road is analogous to the GPU (high latency, high throughput). - -In contrast the GPUs contain a relatively small amount of transistors dedicated to control and caching, and a much larger fraction of transistors dedicated to the mathematical operations. Since the cores in a GPU are designed just for 3D graphics, they can be made much simpler and there can be a very larger number of cores. The current GPUs contain thousands of CUDA cores. Performance in GPUs is obtain by having a very high degree of parallelism. Lots of threads are launched in parallel. For good performance there should be at least several times more than the number of CUDA cores. GPU :abbr:`threads` are much lighter than the usual CPU threads and they have very little penalty for context switching. This way when some threads are performing some memory operations (reading or writing) others execute instructions. 
- - + Cars and roads analogy for the CPU and GPU behavior. The compact road is analogous + to the CPU (low latency, low throughput) and the broader road is analogous to the + GPU (high latency, high throughput). + +In contrast the GPUs contain a relatively small amount of transistors dedicated to +control and caching, and a much larger fraction of transistors dedicated to the +mathematical operations. Since the cores in a GPU are designed just for 3D graphics, +they can be made much simpler and there can be a very larger number of cores. The +current GPUs contain thousands of CUDA cores. Performance in GPUs is obtain by having a +very high degree of parallelism. Lots of threads are launched in parallel. For good +performance there should be at least several times more than the number of CUDA cores. +GPU :abbr:`threads` are much lighter than the usual CPU threads and they have very +little penalty for context switching. This way when some threads are performing some +memory operations (reading or writing) others execute instructions. CUDA Threads, Warps, Blocks --------------------------- -In order to understand the GPU execution model let's look at the so called `axpy` operation. On a single CPU core this operation would be executed in a serial manner in a `for/do` loop going over each element on the array, `id`, and computing `y[id]=y[id]+a*x[id]`. +In order to understand the GPU execution model let's look at the so called `axpy` +operation. On a single CPU core this operation would be executed in a serial manner in a +`for/do` loop going over each element on the array, `id`, and computing +`y[id]=y[id]+a*x[id]`. .. code-block:: C++ - - void axpy_(int n, double a, double *x, double *y) - { - for(int id=0;id