Apply docstrfmt to ReST files #92
=================================
``content/0-setup.rst``:
Local installation
------------------

Since this lesson is taught using an HPC cluster, no software installation on your own
computer is needed.

Running on LUMI
---------------

Interactive job, 1 node, 1 GPU, 1 hour:

.. code-block:: console

    $ salloc -A project_465001310 -N 1 -t 1:00:00 -p standard-g --gpus-per-node=1
    $ srun <some-command>

Exit interactive allocation with ``exit``.

Interactive terminal session on a compute node:

.. code-block:: console

    $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash
    $ <some-command>

Corresponding batch script ``submit.sh``:

.. code-block:: bash

    #!/bin/bash -l
    #SBATCH --account=project_465001310
    #SBATCH --job-name=example-job
    #SBATCH --output=examplejob.o%j
    #SBATCH --error=examplejob.e%j
    #SBATCH --partition=standard-g
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=1:00:00

    srun <some_command>

- Submit the job: ``sbatch submit.sh``
- Monitor your job: ``squeue --me``
- Kill job: ``scancel <JOB_ID>``
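Taken together, a typical session might look like the following sketch. The job ID
``123456`` is purely illustrative; the state codes shown by ``squeue`` are ``PD``
(pending) and ``R`` (running).

.. code-block:: console

    $ sbatch submit.sh
    Submitted batch job 123456
    $ squeue --me
    $ scancel 123456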



Running Julia on LUMI
~~~~~~~~~~~~~~~~~~~~~

In order to run Julia with ``AMDGPU.jl`` on LUMI, we use the following directory
structure and assume it is our working directory.

.. code-block:: console

    .
    ├── Project.toml  # Julia environment
    ├── script.jl     # Julia script
    └── submit.sh     # Slurm batch script

An example of a ``Project.toml`` project file:

.. code-block:: toml

    [deps]
    AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"

For the ``submit.sh`` batch script, add the following content to the batch script shown
above.

.. code-block:: bash

    #SBATCH --cpus-per-task=2
    #SBATCH --mem-per-cpu=1750

    module use /appl/local/csc/modulefiles

    module load julia
    module load julia-amdgpu

    julia --project=. -e 'using Pkg; Pkg.instantiate()'
    julia --project=. script.jl
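Putting the two pieces together, the complete ``submit.sh`` for the Julia example is
simply the generic batch script from above followed by the Julia-specific additions:

.. code-block:: bash

    #!/bin/bash -l
    #SBATCH --account=project_465001310
    #SBATCH --job-name=example-job
    #SBATCH --output=examplejob.o%j
    #SBATCH --error=examplejob.e%j
    #SBATCH --partition=standard-g
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=1:00:00
    #SBATCH --cpus-per-task=2
    #SBATCH --mem-per-cpu=1750

    module use /appl/local/csc/modulefiles
    module load julia
    module load julia-amdgpu

    julia --project=. -e 'using Pkg; Pkg.instantiate()'
    julia --project=. script.jl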

An example of the ``script.jl`` code is provided below.

.. code-block:: julia

    using AMDGPU

    A = rand(2^9, 2^9)
    A_d = ROCArray(A)
    B_d = A_d * A_d

    println("----EOF----")

Running on Google Colab
-----------------------

Google Colaboratory, commonly referred to as "Colab", is a cloud-based Jupyter notebook
environment which runs in your web browser. Using it requires login with a Google
account.

This is how you can get access to NVIDIA GPUs on Colab:

- Visit https://colab.research.google.com/ and sign in to your Google account
- In the menu in front of you, click "New notebook" in the bottom right corner
- After the notebook loads, go to the "Runtime" menu at the top and select "Change
  runtime type"
- Select "GPU" under "Hardware accelerator" and choose an available type of NVIDIA GPU
  (e.g. T4)
- Click "Save". The runtime takes a few seconds to load - you can see the status in the
  top right corner
- After the runtime has loaded, you can type ``!nvidia-smi`` to see information about
  the GPU.
- You can now write Python code that runs on GPUs through e.g. the numba library.


Access to code examples
-----------------------

Some exercises in this lesson rely on source code that you should download and modify in
your own home directory on the cluster. All code examples are available in the same
GitHub repository as this lesson itself. To download it you should use Git:

.. code-block:: console

    $ git clone https://github.com/ENCCS/gpu-programming.git
    $ cd gpu-programming/content/examples/
    $ ls
``content/1-gpu-history.rst``:
.. _gpu-history:


Why GPUs?
=========


.. questions::

    - What is Moore's law?
    - What problem do GPUs solve?

.. objectives::

    - Explain the historical development of microprocessors and how GPUs enable
      continued scaling in computational power

.. instructor-note::

    - 15 min teaching
    - 0 min exercises

Moore's law
-----------

It states that the number of transistors in a dense integrated circuit doubles about
every two years. More transistors mean that each element can be made smaller, so higher
core frequencies can be achieved. However, power consumption scales with frequency to
the third power, so the growth in core frequencies has slowed down significantly. Higher
single-node performance therefore relies on a more complicated processor structure and
can still be achieved with SIMD (single instruction, multiple data), branch prediction,
etc.

.. figure:: img/history/microprocessor-trend-data.png
    :align: center

    The evolution of microprocessors. The number of transistors per chip doubles roughly
    every 2 years. However, this can no longer be exploited through higher core
    frequencies due to power-consumption limits. Before 2000, the increase in
    single-core clock frequency was the major source of performance gains; the mid-2000s
    mark a transition towards multi-core processors.

Increasing performance has been sustained with two main strategies over the years:

- Increase the performance of a single processor
- More recently, increase the number of physical cores


Computing in parallel
---------------------

The underlying idea of parallel computing is to split a computational problem into
smaller subtasks. Many subtasks can then be solved *simultaneously* by multiple
processing units.

.. figure:: img/history/compp.png
    :align: center

    Computing in parallel.

How a problem is split into smaller subtasks strongly depends on the problem. There are
various paradigms and programming approaches to do this.
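As a toy illustration (not part of the lesson code), the shell itself can run subtasks
simultaneously: below, the sum of the integers 1 to 1000 is split into four partial
sums computed by background jobs, whose results are then combined.

.. code-block:: bash

    # Split the sum 1..1000 into four subtasks and run them in parallel.
    for start in 1 251 501 751; do
        end=$((start + 249))
        # Each background job computes one partial sum into its own file.
        ( seq "$start" "$end" | awk '{s+=$1} END {print s}' > "partial_$start.txt" ) &
    done
    wait  # block until all subtasks have finished

    # Combine the partial results.
    total=0
    for f in partial_1.txt partial_251.txt partial_501.txt partial_751.txt; do
        total=$((total + $(cat "$f")))
    done
    rm -f partial_1.txt partial_251.txt partial_501.txt partial_751.txt
    echo "$total"  # prints 500500

Real parallel programs follow the same pattern - partition, compute independently,
combine - but with threads, MPI ranks, or GPU threads instead of shell jobs.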

Graphics processing units
-------------------------

Graphics processing units (GPUs) have been the most common accelerators during the last
few years; the term GPU is sometimes used interchangeably with the term *accelerator*.
GPUs were initially developed for the highly parallel task of graphics processing, but
over the years they have been used more and more in HPC.

GPUs are specialized parallel hardware for floating-point operations. They are basically
co-processors (helpers) for traditional CPUs: the CPU still controls the workflow but
delegates highly parallel tasks to the GPU. GPUs are based on highly parallel
architectures, which allows them to take advantage of the increasing number of
transistors.

Using GPUs allows one to achieve extreme performance per node. As a result, a single
GPU-equipped workstation can outperform small CPU-based clusters for some types of
computational tasks. The drawback is that major rewrites of programs are usually
required, with an accompanying change in the programming paradigm.

.. callout:: Host vs device

    GPU-enabled systems require a heterogeneous programming model that involves both
    CPU and GPU, where the CPU and its memory are referred to as the host,
    and the GPU and its memory as the device.

.. figure:: img/history/CPU_and_GPU_separated.png
    :align: center

    Figure adapted from the Carpentry `GPU Programming lesson
    <https://carpentries-incubator.github.io/lesson-gpu-programming/>`__.

A look at the Top-500 list
--------------------------

The `TOP500 project <https://www.top500.org/>`__ ranks and details the 500 most powerful
non-distributed computer systems in the world. The project was started in 1993 and
publishes an updated list of the supercomputers twice a year. The snapshot below shows
the top-5 HPC systems as of June 2024, where the columns show:

- **Cores** - Number of processors
- **Rmax** - Maximal LINPACK performance achieved
- **Rpeak** - Theoretical peak performance
- **Power** - Power consumption

.. figure:: img/history/top-5.png
    :align: center

    Snapshot from the `TOP500 list from June, 2024
    <https://www.top500.org/lists/top500/2024/06/>`__.

All systems in the top-5 positions contain GPUs from AMD, Intel, or NVIDIA, except for
Fugaku which instead relies on custom-built Arm A64FX CPUs.

Why GPUs?
---------

- **Speed**: GPU computing can significantly accelerate many types of scientific
  workloads.
- **Improved energy efficiency**: Compared to CPUs, GPUs can perform more calculations
  per watt of power consumed, which can result in significant energy savings. This is
  indeed evident from the `Green500 list
  <https://www.top500.org/lists/green500/2024/06/>`__.
- **Cost-effectiveness**: GPUs can be more cost-effective than traditional CPU-based
  systems for certain workloads.

Limitations and drawbacks
-------------------------

- **Only for certain workloads**: Not all workloads can be efficiently parallelized and
  accelerated on GPUs. Certain types of workloads, such as those with irregular data
  access patterns or high branching behavior, may not see significant performance
  improvements on GPUs.
- **Steeper learning curve**: Depending on the GPU programming API that you choose, GPU
  computing could require specialized skills in GPU programming and knowledge of GPU
  architecture, leading to a steeper learning curve compared to CPU programming.
  Fortunately, if you study this training material closely you will become productive
  with GPU programming quickly!

.. keypoints::

    - GPUs are accelerators for some types of tasks
    - Highly parallelizable compute-intensive tasks are suitable for GPUs
    - New programming skills are needed to use GPUs efficiently