Apply docstrfmt to ReST files #92
=================================
``content/0-setup.rst``:
Local installation
------------------

Since this lesson is taught using an HPC cluster, no software installation on your own
computer is needed.

Running on LUMI
---------------

Interactive job, 1 node, 1 GPU, 1 hour:

.. code-block:: console

    $ salloc -A project_465001310 -N 1 -t 1:00:00 -p standard-g --gpus-per-node=1
    $ srun <some-command>

Exit interactive allocation with ``exit``.

Interactive terminal session on a compute node:

.. code-block:: console

    $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash
    $ <some-command>

Corresponding batch script ``submit.sh``:

.. code-block:: bash

    #!/bin/bash -l
    #SBATCH --account=project_465001310
    #SBATCH --job-name=example-job
    #SBATCH --output=examplejob.o%j
    #SBATCH --error=examplejob.e%j
    #SBATCH --partition=standard-g
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=1:00:00

    srun <some_command>

- Submit the job: ``sbatch submit.sh``
- Monitor your job: ``squeue --me``
- Kill job: ``scancel <JOB_ID>``
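Taken together, a typical session might look like the following sketch. The job ID
``123456`` is purely illustrative; the state codes shown by ``squeue`` are ``PD``
(pending) and ``R`` (running).

.. code-block:: console

    $ sbatch submit.sh
    Submitted batch job 123456
    $ squeue --me
    $ scancel 123456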



Running Julia on LUMI
~~~~~~~~~~~~~~~~~~~~~

In order to run Julia with ``AMDGPU.jl`` on LUMI, we use the following directory
structure and assume it is our working directory.

.. code-block:: console

    .
    ├── Project.toml  # Julia environment
    ├── script.jl     # Julia script
    └── submit.sh     # Slurm batch script

An example of a ``Project.toml`` project file:

.. code-block:: toml

    [deps]
    AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"

For the ``submit.sh`` batch script, add the following content to the batch script shown
above.

.. code-block:: bash

    #SBATCH --cpus-per-task=2
    #SBATCH --mem-per-cpu=1750

    module use /appl/local/csc/modulefiles

    module load julia
    module load julia-amdgpu

    julia --project=. -e 'using Pkg; Pkg.instantiate()'
    julia --project=. script.jl
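Putting the two pieces together, the complete ``submit.sh`` for the Julia example is
simply the generic batch script from above followed by the Julia-specific additions:

.. code-block:: bash

    #!/bin/bash -l
    #SBATCH --account=project_465001310
    #SBATCH --job-name=example-job
    #SBATCH --output=examplejob.o%j
    #SBATCH --error=examplejob.e%j
    #SBATCH --partition=standard-g
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=1:00:00
    #SBATCH --cpus-per-task=2
    #SBATCH --mem-per-cpu=1750

    module use /appl/local/csc/modulefiles
    module load julia
    module load julia-amdgpu

    julia --project=. -e 'using Pkg; Pkg.instantiate()'
    julia --project=. script.jl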

An example of the ``script.jl`` code is provided below.

.. code-block:: julia

    using AMDGPU

    A = rand(2^9, 2^9)
    A_d = ROCArray(A)
    B_d = A_d * A_d

    println("----EOF----")

Running on Google Colab
-----------------------

Google Colaboratory, commonly referred to as "Colab", is a cloud-based Jupyter notebook
environment which runs in your web browser. Using it requires login with a Google
account.

This is how you can get access to NVIDIA GPUs on Colab:

- Visit https://colab.research.google.com/ and sign in to your Google account
- In the menu in front of you, click "New notebook" in the bottom right corner
- After the notebook loads, go to the "Runtime" menu at the top and select "Change
  runtime type"
- Select "GPU" under "Hardware accelerator" and choose an available type of NVIDIA GPU
  (e.g. T4)
- Click "Save". The runtime takes a few seconds to load - you can see the status in the
  top right corner
- After the runtime has loaded, you can type ``!nvidia-smi`` to see information about
  the GPU.
- You can now write Python code that runs on GPUs through e.g. the numba library.


Access to code examples
-----------------------

Some exercises in this lesson rely on source code that you should download and modify in
your own home directory on the cluster. All code examples are available in the same
GitHub repository as this lesson itself. To download it you should use Git:

.. code-block:: console

    $ git clone https://github.com/ENCCS/gpu-programming.git
    $ cd gpu-programming/content/examples/
    $ ls
``content/1-gpu-history.rst``:
.. _gpu-history:


Why GPUs?
=========


.. questions::

    - What is Moore's law?
    - What problem do GPUs solve?

.. objectives::

    - Explain the historical development of microprocessors and how GPUs enable
      continued scaling in computational power

.. instructor-note::

    - 15 min teaching
    - 0 min exercises

Moore's law
-----------

It states that the number of transistors in a dense integrated circuit doubles about
every two years. More transistors mean that each element can be made smaller, so higher
core frequencies can be achieved. However, power consumption scales with frequency to
the third power, so the growth in core frequencies has slowed down significantly. Higher
single-node performance therefore relies on a more complicated processor structure and
can still be achieved with SIMD (single instruction, multiple data), branch prediction,
etc.

.. figure:: img/history/microprocessor-trend-data.png
    :align: center

    The evolution of microprocessors. The number of transistors per chip doubles roughly
    every 2 years. However, this can no longer be exploited through higher core
    frequencies due to power-consumption limits. Before 2000, the increase in
    single-core clock frequency was the major source of performance gains; the mid-2000s
    mark a transition towards multi-core processors.

Increasing performance has been sustained with two main strategies over the years:

- Increase the performance of a single processor
- More recently, increase the number of physical cores


Computing in parallel
---------------------

The underlying idea of parallel computing is to split a computational problem into
smaller subtasks. Many subtasks can then be solved *simultaneously* by multiple
processing units.

.. figure:: img/history/compp.png
    :align: center

    Computing in parallel.

How a problem is split into smaller subtasks strongly depends on the problem. There are
various paradigms and programming approaches to do this.
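As a toy illustration (not part of the lesson code), the shell itself can run subtasks
simultaneously: below, the sum of the integers 1 to 1000 is split into four partial
sums computed by background jobs, whose results are then combined.

.. code-block:: bash

    # Split the sum 1..1000 into four subtasks and run them in parallel.
    for start in 1 251 501 751; do
        end=$((start + 249))
        # Each background job computes one partial sum into its own file.
        ( seq "$start" "$end" | awk '{s+=$1} END {print s}' > "partial_$start.txt" ) &
    done
    wait  # block until all subtasks have finished

    # Combine the partial results.
    total=0
    for f in partial_1.txt partial_251.txt partial_501.txt partial_751.txt; do
        total=$((total + $(cat "$f")))
    done
    rm -f partial_1.txt partial_251.txt partial_501.txt partial_751.txt
    echo "$total"  # prints 500500

Real parallel programs follow the same pattern - partition, compute independently,
combine - but with threads, MPI ranks, or GPU threads instead of shell jobs.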

Graphics processing units
-------------------------

Graphics processing units (GPUs) have been the most common accelerators during the last
few years; the term GPU is sometimes used interchangeably with the term *accelerator*.
GPUs were initially developed for the highly parallel task of graphics processing, but
over the years they have been used more and more in HPC.

GPUs are specialized parallel hardware for floating-point operations. They are basically
co-processors (helpers) for traditional CPUs: the CPU still controls the workflow but
delegates highly parallel tasks to the GPU. GPUs are based on highly parallel
architectures, which allows them to take advantage of the increasing number of
transistors.

Using GPUs allows one to achieve extreme performance per node. As a result, a single
GPU-equipped workstation can outperform small CPU-based clusters for some types of
computational tasks. The drawback is that major rewrites of programs are usually
required, with an accompanying change in the programming paradigm.

.. callout:: Host vs device

    GPU-enabled systems require a heterogeneous programming model that involves both
    CPU and GPU, where the CPU and its memory are referred to as the host,
    and the GPU and its memory as the device.

.. figure:: img/history/CPU_and_GPU_separated.png
    :align: center

    Figure adapted from the Carpentry `GPU Programming lesson
    <https://carpentries-incubator.github.io/lesson-gpu-programming/>`__.

A look at the Top-500 list
--------------------------

The `TOP500 project <https://www.top500.org/>`__ ranks and details the 500 most powerful
non-distributed computer systems in the world. The project was started in 1993 and
publishes an updated list of the supercomputers twice a year. The snapshot below shows
the top-5 HPC systems as of June 2024, where the columns show:

- **Cores** - Number of processors
- **Rmax** - Maximal LINPACK performance achieved
- **Rpeak** - Theoretical peak performance
- **Power** - Power consumption

.. figure:: img/history/top-5.png
    :align: center

    Snapshot from the `TOP500 list from June, 2024
    <https://www.top500.org/lists/top500/2024/06/>`__.

All systems in the top-5 positions contain GPUs from AMD, Intel, or NVIDIA, except for
Fugaku which instead relies on custom-built Arm A64FX CPUs.

Why GPUs?
---------

- **Speed**: GPU computing can significantly accelerate many types of scientific
  workloads.
- **Improved energy efficiency**: Compared to CPUs, GPUs can perform more calculations
  per watt of power consumed, which can result in significant energy savings. This is
  indeed evident from the `Green500 list
  <https://www.top500.org/lists/green500/2024/06/>`__.
- **Cost-effectiveness**: GPUs can be more cost-effective than traditional CPU-based
  systems for certain workloads.

Limitations and drawbacks
-------------------------

- **Only for certain workloads**: Not all workloads can be efficiently parallelized and
  accelerated on GPUs. Certain types of workloads, such as those with irregular data
  access patterns or high branching behavior, may not see significant performance
  improvements on GPUs.
- **Steeper learning curve**: Depending on the GPU programming API that you choose, GPU
  computing could require specialized skills in GPU programming and knowledge of GPU
  architecture, leading to a steeper learning curve compared to CPU programming.
  Fortunately, if you study this training material closely you will become productive
  with GPU programming quickly!

.. keypoints::

    - GPUs are accelerators for some types of tasks
    - Highly parallelizable compute-intensive tasks are suitable for GPUs
    - New programming skills are needed to use GPUs efficiently