From d8a83297aba5eba4ee80c0824b4b9f5f3f1fb304 Mon Sep 17 00:00:00 2001 From: Andrey Alekseenko Date: Wed, 20 Nov 2024 14:09:32 +0100 Subject: [PATCH 1/2] 4: Extract the table into separate file to make docstrfmt happy --- content/4-gpu-concepts-table.rst | 29 +++++++++++++++++++++++++++++ content/4-gpu-concepts.rst | 32 +------------------------------- 2 files changed, 30 insertions(+), 31 deletions(-) create mode 100644 content/4-gpu-concepts-table.rst diff --git a/content/4-gpu-concepts-table.rst b/content/4-gpu-concepts-table.rst new file mode 100644 index 00000000..564b21c6 --- /dev/null +++ b/content/4-gpu-concepts-table.rst @@ -0,0 +1,29 @@ +.. table:: Software mapping naming + :align: center + + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | CUDA | HIP | OpenCL | SYCL | + +=========================+=========================+===========================+===================================================+ + | grid of threads | NDRange | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | block | work-group | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | warp | wavefront | sub-group | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | thread | work-item | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | registers | private memory | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | shared memory | local data share | local memory | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | threadIdx.\{x,y,z\} | get_local_id(\{0,1,2\}) | nd_item::get_local(\{2,1,0\}) [#syclindex]_ | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | blockIdx.\{x,y,z\} | get_group_id(\{0,1,2\}) | nd_item::get_group(\{2,1,0\}) [#syclindex]_ | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + | blockDim.\{x,y,z\} | get_local_size(\{0,1,2\}) | nd_item::get_local_range(\{2,1,0\}) [#syclindex]_ | + +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ + +.. [#syclindex] In SYCL, the thread indexing is inverted. In a 3D grid, physically adjacent threads have consecutive X (0) index in CUDA, HIP, and OpenCL, but consecutive Z (2) index in SYCL. + In a 2D grid, CUDA, HIP, and OpenCL still has contiguous indexing along X (0) dimension, while in SYCL it is Y (1). + Same applies to block dimensions and indexing. + diff --git a/content/4-gpu-concepts.rst b/content/4-gpu-concepts.rst index 8d3230be..89a3ded0 100644 --- a/content/4-gpu-concepts.rst +++ b/content/4-gpu-concepts.rst @@ -236,37 +236,7 @@ Terminology At the moment there are three major GPU producers: NVIDIA, Intel, and AMD. While the basic concept behind GPUs is pretty similar they use different names for the various parts. 
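To make the table's footnote on index ordering concrete, here is a minimal SYCL (C++) sketch that computes a 2D global index, with the equivalent CUDA/HIP expressions shown as comments. It is purely illustrative (the launch sizes and the names ``row`` and ``col`` are arbitrary) and is not part of the lesson's example code.

.. code-block:: c++

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q;
        // 2D launch: global range {32, 128}, work-group ("block") size {8, 16}
        q.parallel_for(sycl::nd_range<2>{{32, 128}, {8, 16}},
                       [=](sycl::nd_item<2> item) {
            // CUDA/HIP: contiguous threads vary along x:
            //   int col = blockIdx.x * blockDim.x + threadIdx.x;
            //   int row = blockIdx.y * blockDim.y + threadIdx.y;
            // SYCL: the contiguous dimension is the *last* index (1 in 2D):
            size_t col = item.get_group(1) * item.get_local_range(1) + item.get_local_id(1);
            size_t row = item.get_group(0) * item.get_local_range(0) + item.get_local_id(0);
            (void)col; (void)row;  // indexing only, no actual work done here
        }).wait();
        return 0;
    }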
Furthermore there are software environments for GPU programming, some from the producers and some from external groups all having different naming as well. Below there is a short compilation of the some terms used across different platforms and software environments. - -.. table:: Software mapping naming - :align: center - - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | CUDA | HIP | OpenCL | SYCL | - +=========================+=========================+===========================+===================================================+ - | grid of threads | NDRange | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | block | work-group | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | warp | wavefront | sub-group | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | thread | work-item | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | registers | private memory | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | shared memory | local data share | local memory | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | threadIdx.\{x,y,z\} | get_local_id(\{0,1,2\}) | nd_item::get_local(\{2,1,0\}) [#syclindex]_ | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | blockIdx.\{x,y,z\} | get_group_id(\{0,1,2\}) | nd_item::get_group(\{2,1,0\}) [#syclindex]_ | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - | blockDim.\{x,y,z\} | get_local_size(\{0,1,2\}) | nd_item::get_local_range(\{2,1,0\}) [#syclindex]_ | - +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ - -.. [#syclindex] In SYCL, the thread indexing is inverted. In a 3D grid, physically adjacent threads have consecutive X (0) index in CUDA, HIP, and OpenCL, but consecutive Z (2) index in SYCL. - In a 2D grid, CUDA, HIP, and OpenCL still has contiguous indexing along X (0) dimension, while in SYCL it is Y (1). - Same applies to block dimensions and indexing. - - +.. 
include:: 4-gpu-concepts-table.rst Exercises --------- From b8a4cdf6a49200c32a465eb16c293137cc802910 Mon Sep 17 00:00:00 2001 From: Andrey Alekseenko Date: Wed, 20 Nov 2024 14:11:56 +0100 Subject: [PATCH 2/2] Apply docstrfmt to ReST files --- content/0-setup.rst | 112 +- content/1-gpu-history.rst | 136 +- content/10-multiple_gpu.rst | 474 ++++--- content/11-gpu-porting.rst | 636 +++++----- content/12-recommendations.rst | 106 +- content/13-examples.rst | 979 ++++++++------- content/2-gpu-ecosystem.rst | 566 +++++---- content/3-gpu-problems.rst | 350 +++--- content/4-gpu-concepts.rst | 481 ++++--- content/5-intro-to-gpu-prog-models.rst | 285 +++-- content/6-directive-based-models.rst | 1454 +++++++++++----------- content/7-non-portable-kernel-models.rst | 849 ++++++++----- content/8-portable-kernel-models.rst | 841 ++++++++----- content/9-language-support.rst | 842 +++++++------ content/glossary.rst | 84 +- content/guide.rst | 188 +-- content/index.rst | 193 ++- content/quick-reference.rst | 2 +- requirements.txt | 1 + 19 files changed, 4731 insertions(+), 3848 deletions(-) diff --git a/content/0-setup.rst b/content/0-setup.rst index 6465f04a..c3c42bac 100644 --- a/content/0-setup.rst +++ b/content/0-setup.rst @@ -6,18 +6,18 @@ Setup Local installation ------------------ -Since this lesson is taught using an HPC cluster, no software installation on your own computer is needed. - +Since this lesson is taught using an HPC cluster, no software installation on your own +computer is needed. Running on LUMI --------------- -Interactive job, 1 node, 1 GPU, 1 hour: +Interactive job, 1 node, 1 GPU, 1 hour: .. code-block:: console - $ salloc -A project_465001310 -N 1 -t 1:00:00 -p standard-g --gpus-per-node=1 - $ srun + $ salloc -A project_465001310 -N 1 -t 1:00:00 -p standard-g --gpus-per-node=1 + $ srun Exit interactive allocation with ``exit``. @@ -25,104 +25,108 @@ Interacive terminal session on compute node: .. code-block:: console - $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash - $ + $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash + $ Corresponding batch script ``submit.sh``: .. code-block:: bash - #!/bin/bash -l - #SBATCH --account=project_465001310 - #SBATCH --job-name=example-job - #SBATCH --output=examplejob.o%j - #SBATCH --error=examplejob.e%j - #SBATCH --partition=standard-g - #SBATCH --nodes=1 - #SBATCH --gpus-per-node=1 - #SBATCH --ntasks-per-node=1 - #SBATCH --time=1:00:00 + #!/bin/bash -l + #SBATCH --account=project_465001310 + #SBATCH --job-name=example-job + #SBATCH --output=examplejob.o%j + #SBATCH --error=examplejob.e%j + #SBATCH --partition=standard-g + #SBATCH --nodes=1 + #SBATCH --gpus-per-node=1 + #SBATCH --ntasks-per-node=1 + #SBATCH --time=1:00:00 - srun + srun - Submit the job: ``sbatch submit.sh`` - Monitor your job: ``squeue --me`` - Kill job: ``scancel `` - - Running Julia on LUMI -^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~ -In order to run Julia with ``AMDGPU.jl`` on LUMI, we use the following directory structure and assume it is our working directory. +In order to run Julia with ``AMDGPU.jl`` on LUMI, we use the following directory +structure and assume it is our working directory. .. code-block:: console - . - ├── Project.toml # Julia environment - ├── script.jl # Julia script - └── submit.sh # Slurm batch script + . 
+ ├── Project.toml # Julia environment + ├── script.jl # Julia script + └── submit.sh # Slurm batch script An example of a ``Project.toml`` project file. .. code-block:: console - [deps] - AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e" + [deps] + AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e" -For the ``submit.sh`` batch script, include additional content to the batch script mentioned above. +For the ``submit.sh`` batch script, include additional content to the batch script +mentioned above. .. code-block:: bash - #SBATCH --cpus-per-task=2 - #SBATCH --mem-per-cpu=1750 + #SBATCH --cpus-per-task=2 + #SBATCH --mem-per-cpu=1750 - module use /appl/local/csc/modulefiles + module use /appl/local/csc/modulefiles - module load julia - module load julia-amdgpu + module load julia + module load julia-amdgpu - julia --project=. -e 'using Pkg; Pkg.instantiate()' - julia --project=. script.jl + julia --project=. -e 'using Pkg; Pkg.instantiate()' + julia --project=. script.jl An example of the ``script.jl`` code is provided below. .. code-block:: julia - using AMDGPU - - A = rand(2^9, 2^9) - A_d = ROCArray(A) - B_d = A_d * A_d - - println("----EOF----") + using AMDGPU + A = rand(2^9, 2^9) + A_d = ROCArray(A) + B_d = A_d * A_d + println("----EOF----") Running on Google Colab ----------------------- -Google Colaboratory, commonly referred to as "Colab", is a cloud-based Jupyter notebook environment which runs in your web browser. Using it requires login with a Google account. +Google Colaboratory, commonly referred to as "Colab", is a cloud-based Jupyter notebook +environment which runs in your web browser. Using it requires login with a Google +account. This is how you can get access to NVIDIA GPUs on Colab: - Visit https://colab.research.google.com/ and sign in to your Google account - In the menu in front of you, click "New notebook" in the bottom right corner -- After the notebook loads, go to the "Runtime" menu at the top and select "Change runtime type" -- Select "GPU" under "Hardware accelerator" and choose an available type of NVIDIA GPU (e.g. T4) -- Click "Save". The runtime takes a few seconds to load - you can see the status in the top right corner -- After the runtime has loaded, you can type ``!nvidia-smi`` to see information about the GPU. +- After the notebook loads, go to the "Runtime" menu at the top and select "Change + runtime type" +- Select "GPU" under "Hardware accelerator" and choose an available type of NVIDIA GPU + (e.g. T4) +- Click "Save". The runtime takes a few seconds to load - you can see the status in the + top right corner +- After the runtime has loaded, you can type ``!nvidia-smi`` to see information about + the GPU. - You can now write Python code that runs on GPUs through e.g. the numba library. - Access to code examples ----------------------- -Some exercises in this lesson rely on source code that you should download and modify in your own home directory on the cluster. All code examples are available in the same GitHub repository as this lesson itself. To download it you should use Git: +Some exercises in this lesson rely on source code that you should download and modify in +your own home directory on the cluster. All code examples are available in the same +GitHub repository as this lesson itself. To download it you should use Git: .. 
code-block:: console - $ git clone https://github.com/ENCCS/gpu-programming.git - $ cd gpu-programming/content/examples/ - $ ls - + $ git clone https://github.com/ENCCS/gpu-programming.git + $ cd gpu-programming/content/examples/ + $ ls diff --git a/content/1-gpu-history.rst b/content/1-gpu-history.rst index 9ee4986c..a670efa4 100644 --- a/content/1-gpu-history.rst +++ b/content/1-gpu-history.rst @@ -1,131 +1,141 @@ .. _gpu-history: - Why GPUs? ========= - .. questions:: - - What is Moore's law? - - What problem do GPUs solve? + - What is Moore's law? + - What problem do GPUs solve? .. objectives:: - - Explain the historical development of microprocessors and how GPUs enable - continued scaling in computational power + - Explain the historical development of microprocessors and how GPUs enable + continued scaling in computational power .. instructor-note:: - - 15 min teaching - - 0 min exercises - + - 15 min teaching + - 0 min exercises Moore's law ----------- -It states that the number of transistors in a dense integrated circuit doubles about every two years. -More transistors means smaller size of a single element, so higher core frequency can be achieved. -However, power consumption scales with frequency to the third power, therefore the growth in the core frequency has slowed down significantly. -Higher performance of a single node has to rely on its more complicated structure and still can be achieved with SIMD (single instruction multiple data), branch prediction, etc. +It states that the number of transistors in a dense integrated circuit doubles about +every two years. More transistors means smaller size of a single element, so higher core +frequency can be achieved. However, power consumption scales with frequency to the third +power, therefore the growth in the core frequency has slowed down significantly. Higher +performance of a single node has to rely on its more complicated structure and still can +be achieved with SIMD (single instruction multiple data), branch prediction, etc. .. figure:: img/history/microprocessor-trend-data.png - :align: center + :align: center - The evolution of microprocessors. - The number of transistors per chip doubles roughly every 2 years. - However, it can no longer be explored by the core frequency due to the power consumption limits. - Before 2000, the increase in the single core clock frequency was the major source of the - increase in the performance. Mid 2000 mark a transition towards multi-core processors. + The evolution of microprocessors. The number of transistors per chip doubles roughly + every 2 years. However, it can no longer be explored by the core frequency due to + the power consumption limits. Before 2000, the increase in the single core clock + frequency was the major source of the increase in the performance. Mid 2000 mark a + transition towards multi-core processors. Increasing performance has been sustained with two main strategies over the years: - - Increase the single processor performance: + - Increase the single processor performance: - More recently, increase the number of physical cores. - Computing in parallel --------------------- -The underlying idea of parallel computing is to split a computational problem into smaller -subtasks. Many subtasks can then be solved *simultaneously* by multiple processing units. +The underlying idea of parallel computing is to split a computational problem into +smaller subtasks. Many subtasks can then be solved *simultaneously* by multiple +processing units. .. 
figure:: img/history/compp.png - :align: center - - Computing in parallel. + :align: center -How a problem is split into smaller subtasks strongly depends on the problem. -There are various paradigms and programming approaches to do this. + Computing in parallel. +How a problem is split into smaller subtasks strongly depends on the problem. There are +various paradigms and programming approaches to do this. Graphics processing units ------------------------- -Graphics processing units (GPU) have been the most common accelerators during the last few years, the term GPU sometimes is used interchangeably with the term *accelerator*. -GPUs were initially developed for highly-parallel task of graphic processing. -But over the years, they were used more and more in HPC. +Graphics processing units (GPU) have been the most common accelerators during the last +few years, the term GPU sometimes is used interchangeably with the term *accelerator*. +GPUs were initially developed for highly-parallel task of graphic processing. But over +the years, they were used more and more in HPC. -GPUs are a specialized parallel hardware for floating point operations. -They are basically co-processors (helpers) for traditional CPUs: CPU still controls the work flow -but it delegates highly-parallel tasks to the GPU. -GPUs are based on highly parallel architectures, which allows taking advantage of the -increasing number of transistors. +GPUs are a specialized parallel hardware for floating point operations. They are +basically co-processors (helpers) for traditional CPUs: CPU still controls the work flow +but it delegates highly-parallel tasks to the GPU. GPUs are based on highly parallel +architectures, which allows taking advantage of the increasing number of transistors. -Using GPUs allows one to achieve extreme performance per node. -As a result, the single GPU-equipped workstation can outperform small CPU-based clusters -for some type of computational tasks. The drawback is: usually major rewrites of programs is required +Using GPUs allows one to achieve extreme performance per node. As a result, the single +GPU-equipped workstation can outperform small CPU-based clusters for some type of +computational tasks. The drawback is: usually major rewrites of programs is required with an accompanying change in the programming paradigm. .. callout:: Host vs device - GPU-enabled systems require a heterogeneous programming model that involves both - CPU and GPU, where the CPU and its memory are referred to as the host, - and the GPU and its memory as the device. + GPU-enabled systems require a heterogeneous programming model that involves both + CPU and GPU, where the CPU and its memory are referred to as the host, + and the GPU and its memory as the device. .. figure:: img/history/CPU_and_GPU_separated.png - :align: center - - Figure adapted from the Carpentry `GPU Programming lesson `__. + :align: center + Figure adapted from the Carpentry `GPU Programming lesson + `__. A look at the Top-500 list -------------------------- -The `TOP500 project `__ ranks and details the 500 most powerful non-distributed computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The snapshot below shows the top-5 HPC systems as of June 2024, where the columns show: +The `TOP500 project `__ ranks and details the 500 most powerful +non-distributed computer systems in the world. 
The project was started in 1993 and +publishes an updated list of the supercomputers twice a year. The snapshot below shows +the top-5 HPC systems as of June 2024, where the columns show: -- **Cores** - Number of processors +- **Cores** - Number of processors - **Rmax** - Maximal LINPACK performance achieved - **Rpeak** - Theoretical peak performance - **Power** - Power consumption .. figure:: img/history/top-5.png - :align: center + :align: center - Snapshot from the `TOP500 list from June, 2024 `__. - -All systems in the top-5 positions contain GPUs from AMD, Intel, or NVIDIA, except for Fugaku which instead relies on custom-built Arm A64FX CPUs. + Snapshot from the `TOP500 list from June, 2024 + `__. +All systems in the top-5 positions contain GPUs from AMD, Intel, or NVIDIA, except for +Fugaku which instead relies on custom-built Arm A64FX CPUs. Why GPUs? --------- -- **Speed**: GPU computing can significantly accelerate many types of scientific workloads. -- **Improved energy efficiency**: Compared to CPUs, GPUs can perform more calculations per watt of power consumed, - which can result in significant energy savings. This is indeed evident from the `Green500 list `__. -- **Cost-effectiveness**: GPUs can be more cost-effective than traditional CPU-based systems for certain workloads. - +- **Speed**: GPU computing can significantly accelerate many types of scientific + workloads. +- **Improved energy efficiency**: Compared to CPUs, GPUs can perform more calculations + per watt of power consumed, which can result in significant energy savings. This is + indeed evident from the `Green500 list + `__. +- **Cost-effectiveness**: GPUs can be more cost-effective than traditional CPU-based + systems for certain workloads. Limitations and drawbacks ------------------------- -- **Only for certain workloads**: Not all workloads can be efficiently parallelized and accelerated on GPUs. Certain types of workloads, such as those with irregular data access patterns or high branching behavior, may not see significant performance improvements on GPUs. -- **Steeper learning curve**: Depending on the GPU programming API that you choose, GPU computing could require specialized skills in GPU programming and knowledge of GPU architecture, leading to a steeper learning curve compared to CPU programming. Fortunately, if you study this training material closely you will become productive with GPU programming quickly! - - +- **Only for certain workloads**: Not all workloads can be efficiently parallelized and + accelerated on GPUs. Certain types of workloads, such as those with irregular data + access patterns or high branching behavior, may not see significant performance + improvements on GPUs. +- **Steeper learning curve**: Depending on the GPU programming API that you choose, GPU + computing could require specialized skills in GPU programming and knowledge of GPU + architecture, leading to a steeper learning curve compared to CPU programming. + Fortunately, if you study this training material closely you will become productive + with GPU programming quickly! .. 
keypoints:: - - GPUs are accelerators for some types of tasks - - Highly parallilizable compute-intensive tasks are suitable for GPUs - - New programming skills are needed to use GPUs efficiently + - GPUs are accelerators for some types of tasks + - Highly parallilizable compute-intensive tasks are suitable for GPUs + - New programming skills are needed to use GPUs efficiently diff --git a/content/10-multiple_gpu.rst b/content/10-multiple_gpu.rst index 73684985..0e2309ba 100644 --- a/content/10-multiple_gpu.rst +++ b/content/10-multiple_gpu.rst @@ -5,33 +5,61 @@ Multiple GPU programming with MPI .. questions:: - - What approach should be adopted to extend the synchronous OpenACC and OpenMP offloading models to utilise multiple GPUs across multiple nodes? + - What approach should be adopted to extend the synchronous OpenACC and OpenMP offloading models to utilise multiple GPUs across multiple nodes? .. objectives:: - - To learn about combining MPI with either OpenACC or OpenMP offloading models. - - To learn about implementing GPU-awareness MPI approach. + - To learn about combining MPI with either OpenACC or OpenMP offloading models. + - To learn about implementing GPU-awareness MPI approach. .. instructor-note:: - - 30 min teaching - - 30 min exercises + - 30 min teaching + - 30 min exercises Introduction ------------ -Exploring multiple GPUs (Graphics Processing Units) across distributed nodes offers the potential to fully leveraging the capacity of modern HPC (High-Performance Computing) systems at a large scale. Here one of the approaches to accelerate computing on distributed systems is to combine MPI (Message Passing Interface) with a GPU programming model such as OpenACC and OpenMP application programming interfaces (APIs). This combination is motivated by both the simplicity of these APIs, and the widespread use of MPI. - -In this guide we provide readers, who are familiar with MPI, with insights on implementing a hybrid model in which the MPI communication framework is combined with either OpenACC or OpenMP APIs. A special focus will be on performing point-to-point (e.g. `MPI_Send` and `MPI_Recv`) and collective operations (e.g. `MPI_Allreduce`) from OpenACC and OpenMP APIs. Here we address two scenarios: (i) a scenario in which MPI operations are performed in the CPU-host followed by an offload to the GPU-device; and (ii) a scenario in which MPI operations are performed between a pair of GPUs without involving the CPU-host memory. The latter scenario is referred to as GPU-awareness MPI, and has the advantage of reducing the computing time caused by transferring data via the host-memory during heterogeneous communications, thus rendering HPC applications efficient. - -This guide is organized as follows: we first introduce how to assign each MPI rank to a GPU device within the same node. We consider a situation in which the host and the device have a distinct memory. This is followed by a presentation on the hybrid MPI-OpenACC/OpenMP offloading with and without the GPU-awareness MPI. Exercises to help understanding these concepts are provided at the end. +Exploring multiple GPUs (Graphics Processing Units) across distributed nodes offers the +potential to fully leveraging the capacity of modern HPC (High-Performance Computing) +systems at a large scale. Here one of the approaches to accelerate computing on +distributed systems is to combine MPI (Message Passing Interface) with a GPU programming +model such as OpenACC and OpenMP application programming interfaces (APIs). 
This +combination is motivated by both the simplicity of these APIs, and the widespread use of +MPI. + +In this guide we provide readers, who are familiar with MPI, with insights on +implementing a hybrid model in which the MPI communication framework is combined with +either OpenACC or OpenMP APIs. A special focus will be on performing point-to-point +(e.g. `MPI_Send` and `MPI_Recv`) and collective operations (e.g. `MPI_Allreduce`) from +OpenACC and OpenMP APIs. Here we address two scenarios: (i) a scenario in which MPI +operations are performed in the CPU-host followed by an offload to the GPU-device; and +(ii) a scenario in which MPI operations are performed between a pair of GPUs without +involving the CPU-host memory. The latter scenario is referred to as GPU-awareness MPI, +and has the advantage of reducing the computing time caused by transferring data via the +host-memory during heterogeneous communications, thus rendering HPC applications +efficient. + +This guide is organized as follows: we first introduce how to assign each MPI rank to a +GPU device within the same node. We consider a situation in which the host and the +device have a distinct memory. This is followed by a presentation on the hybrid +MPI-OpenACC/OpenMP offloading with and without the GPU-awareness MPI. Exercises to help +understanding these concepts are provided at the end. Assigning MPI-ranks to GPU-devices ---------------------------------- -Accelerating MPI applications to utilise multiple GPUs on distributed nodes requires as a first step assigning each MPI rank to a GPU device, such that two MPI ranks do not use the same GPU device. This is necessarily in order to prevent the application from a potential crash. This is because GPUs are designed to handle multiple threading tasks, but not multiple MPI ranks. +Accelerating MPI applications to utilise multiple GPUs on distributed nodes requires as +a first step assigning each MPI rank to a GPU device, such that two MPI ranks do not use +the same GPU device. This is necessarily in order to prevent the application from a +potential crash. This is because GPUs are designed to handle multiple threading tasks, +but not multiple MPI ranks. -One of the way to ensure that two MPI ranks do not use the same GPU, is to determine which MPI processes run on the same node, such that each process can be assigned to a GPU device within the same node. This can be done, for instance, by splitting the world communicator into sub-groups of communicators (or sub-communicators) using the routine `MPI_COMM_SPLIT_TYPE()`. +One of the way to ensure that two MPI ranks do not use the same GPU, is to determine +which MPI processes run on the same node, such that each process can be assigned to a +GPU device within the same node. This can be done, for instance, by splitting the world +communicator into sub-groups of communicators (or sub-communicators) using the routine +`MPI_COMM_SPLIT_TYPE()`. .. tabs:: @@ -47,314 +75,384 @@ One of the way to ensure that two MPI ranks do not use the same GPU, is to deter :language: C++ :lines: 17-22 -Here, the size of each sub-communicator corresponds to the number of GPUs per node (which is also the number of tasks per node), and each sub-communicator contains a list of processes indicated by a rank. These processes have a shared-memory region defined by the argument `MPI_COMM_TYPE_SHARED` (see the `MPI report `_) for more details). 
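For orientation, a minimal stand-alone C++ sketch of this splitting step and the subsequent device assignment is given below (MPI plus OpenMP offloading). It is not the lesson's own example code, which is included from ``examples/mpi_acc`` and ``examples/mpi_omp`` below; the names ``host_comm`` and ``myDevice`` follow the text, everything else is illustrative.

.. code-block:: c++

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Split the world communicator: ranks that share a node (shared memory)
        // end up in the same sub-communicator.
        MPI_Comm host_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &host_comm);

        int local_rank;  // 0 .. (number of ranks on this node - 1)
        MPI_Comm_rank(host_comm, &local_rank);

        // Assign each local rank its own GPU
        // (round-robin if there happen to be more ranks than devices).
        int num_devices = omp_get_num_devices();
        int myDevice = (num_devices > 0) ? local_rank % num_devices : 0;
        omp_set_default_device(myDevice);

        std::printf("global rank %d -> local rank %d -> device %d of %d\n",
                    rank, local_rank, myDevice, num_devices);

        MPI_Comm_free(&host_comm);
        MPI_Finalize();
        return 0;
    }

On LUMI, such a code would be built with the ``CC`` wrapper and the OpenMP offload flags listed under *Compilation process* below.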
Calling the routine `MPI_COMM_SPLIT_TYPE()` returns a sub-communicator labelled in the code above *”host_comm”*, and in which MPI-ranks are ranked from 0 to number of processes per node -1. These MPI ranks are in turn assigned to different GPU devices within the same node. This procedure is done according to which directive-based model is implemented. The retrieved MPI ranks are then stored in the variable **myDevice**. The variable is passed to an OpenACC or OpenMP routine as indicated in the code below. +Here, the size of each sub-communicator corresponds to the number of GPUs per node +(which is also the number of tasks per node), and each sub-communicator contains a list +of processes indicated by a rank. These processes have a shared-memory region defined by +the argument `MPI_COMM_TYPE_SHARED` (see the `MPI report +`_) for more details). Calling +the routine `MPI_COMM_SPLIT_TYPE()` returns a sub-communicator labelled in the code +above *”host_comm”*, and in which MPI-ranks are ranked from 0 to number of processes per +node -1. These MPI ranks are in turn assigned to different GPU devices within the same +node. This procedure is done according to which directive-based model is implemented. +The retrieved MPI ranks are then stored in the variable **myDevice**. The variable is +passed to an OpenACC or OpenMP routine as indicated in the code below. .. typealong:: Example: ``Assign device`` - .. tabs:: + .. tabs:: - .. tab:: Fortran OpenACC + .. tab:: Fortran OpenACC - .. literalinclude:: examples/mpi_acc/assignDevice_acc.f90 - :language: fortran - :lines: 34-40 + .. literalinclude:: examples/mpi_acc/assignDevice_acc.f90 + :language: fortran + :lines: 34-40 - .. tab:: Fortran OpenMP + .. tab:: Fortran OpenMP - .. literalinclude:: examples/mpi_omp/assignDevice_omp.f90 - :language: fortran - :lines: 34-40 + .. literalinclude:: examples/mpi_omp/assignDevice_omp.f90 + :language: fortran + :lines: 34-40 - .. tab:: C++ OpenMP + .. tab:: C++ OpenMP - .. literalinclude:: examples/mpi_omp/assignDevice_omp.cpp - :language: C++ - :lines: 29-34 + .. literalinclude:: examples/mpi_omp/assignDevice_omp.cpp + :language: C++ + :lines: 29-34 + +Another useful function for retrieving the device number of a specific device, which is +useful, e.g., to map data to a specific device is -Another useful function for retrieving the device number of a specific device, which is useful, e.g., to map data to a specific device is - .. tabs:: - .. tab:: OpenACC - - .. code-block:: fortran - - acc_get_device_num() + .. tab:: OpenACC + + .. code-block:: fortran - .. tab:: OpenMP + acc_get_device_num() - .. code-block:: fortran - - omp_get_device_num() + .. tab:: OpenMP + + .. code-block:: fortran + + omp_get_device_num() The syntax of assigning MPI ranks to GPU devices is summarised below .. typealong:: Example: ``Set device`` - .. tabs:: + .. tabs:: - .. tab:: Fortran OpenACC + .. tab:: Fortran OpenACC - .. literalinclude:: examples/mpi_acc/assignDevice_acc.f90 - :language: fortran - :lines: 15-40 + .. literalinclude:: examples/mpi_acc/assignDevice_acc.f90 + :language: fortran + :lines: 15-40 - .. tab:: Fortran OpenMP + .. tab:: Fortran OpenMP - .. literalinclude:: examples/mpi_omp/assignDevice_omp.f90 - :language: fortran - :lines: 15-40 + .. literalinclude:: examples/mpi_omp/assignDevice_omp.f90 + :language: fortran + :lines: 15-40 - .. tab:: C++ OpenMP + .. tab:: C++ OpenMP - .. literalinclude:: examples/mpi_omp/assignDevice_omp.cpp - :language: C++ - :lines: 8-34 + .. 
literalinclude:: examples/mpi_omp/assignDevice_omp.cpp + :language: C++ + :lines: 8-34 Hybrid MPI-OpenACC/OpenMP without GPU-awareness approach -------------------------------------------------------- -After covering how to assign each MPI-rank to a GPU device, we now address the concept of combining MPI with either -OpenACC or OpenMP offloading. In this approach, calling an MPI routine from an OpenACC or OpenMP API requires updating the data in the CPU host before and after an MPI call. In this scenario, the data is copied back and forth between the host and the device before and after each MPI call. In the hybrid MPI-OpenACC model, the procedure is defined by specifying the directive `update host()` for copying the data from the device to the host before an MPI call; and by the directive `update device()` specified after an MPI call for copying the data back to the device. Similarly in the hybrid MPI-OpenMP. Here, updating the data in the host can be done by specifying the OpenMP directives `update device() from()` and `update device() to()`, respectively, for copying the data from the device to the host and back to the device. - -To illustrate the concept of the hybrid MPI-OpenACC/OpenMP, we show below an example of an implementation that involves the MPI functions `MPI_Send()` and `MPI_Recv()`. - +After covering how to assign each MPI-rank to a GPU device, we now address the concept +of combining MPI with either OpenACC or OpenMP offloading. In this approach, calling an +MPI routine from an OpenACC or OpenMP API requires updating the data in the CPU host +before and after an MPI call. In this scenario, the data is copied back and forth +between the host and the device before and after each MPI call. In the hybrid +MPI-OpenACC model, the procedure is defined by specifying the directive `update host()` +for copying the data from the device to the host before an MPI call; and by the +directive `update device()` specified after an MPI call for copying the data back to the +device. Similarly in the hybrid MPI-OpenMP. Here, updating the data in the host can be +done by specifying the OpenMP directives `update device() from()` and `update device() +to()`, respectively, for copying the data from the device to the host and back to the +device. + +To illustrate the concept of the hybrid MPI-OpenACC/OpenMP, we show below an example of +an implementation that involves the MPI functions `MPI_Send()` and `MPI_Recv()`. .. typealong:: Example: ``Update host/device directives`` - .. tabs:: + .. tabs:: - .. tab:: Fortran OpenACC + .. tab:: Fortran OpenACC - .. literalinclude:: examples/mpi_acc/mpiacc.f90 - :language: fortran - :lines: 62-77 + .. literalinclude:: examples/mpi_acc/mpiacc.f90 + :language: fortran + :lines: 62-77 - .. tab:: Fortran OpenMP + .. tab:: Fortran OpenMP - .. literalinclude:: examples/mpi_omp/mpiomp.f90 - :language: fortran - :lines: 63-78 + .. literalinclude:: examples/mpi_omp/mpiomp.f90 + :language: fortran + :lines: 63-78 - .. tab:: C++ OpenMP + .. tab:: C++ OpenMP - .. literalinclude:: examples/mpi_omp/mpiomp.cpp - :language: C++ - :lines: 63-78 + .. literalinclude:: examples/mpi_omp/mpiomp.cpp + :language: C++ + :lines: 63-78 Here we present a code example that combines MPI with OpenACC/OpenMP API. .. typealong:: Example: ``Update host/device directives`` - .. tabs:: + .. tabs:: - .. tab:: Fortan OpenACC - - .. literalinclude:: examples/mpi_acc/mpiacc.f90 - :language: fortran - :lines: 60-94 + .. tab:: Fortan OpenACC - .. tab:: Fortran OpenMP + .. 
literalinclude:: examples/mpi_acc/mpiacc.f90 + :language: fortran + :lines: 60-94 - .. literalinclude:: examples/mpi_omp/mpiomp.f90 - :language: fortran - :lines: 61-97 + .. tab:: Fortran OpenMP - .. tab:: C++ OpenMP + .. literalinclude:: examples/mpi_omp/mpiomp.f90 + :language: fortran + :lines: 61-97 - .. literalinclude:: examples/mpi_omp/mpiomp.cpp - :language: C++ - :lines: 60-97 + .. tab:: C++ OpenMP -Despite the simplicity of implementing the hybrid MPI-OpenACC/OpenMP offloading, it suffers from a low performance caused by an explicit transfer of data between the host and the device before and after calling an MPI routine. This constitutes a bottleneck in GPU-programming. To improve the performance affected by the host staging during the data transfer, one can implement the GPU-awareness MPI approach as described in the following section. - -Hybrid MPI-OpenACC/OpenMP with GPU-awareness approach ------------------------------------------------------ + .. literalinclude:: examples/mpi_omp/mpiomp.cpp + :language: C++ + :lines: 60-97 -The concept of the GPU-aware MPI enables an MPI library to directly access the GPU-device memory without necessarily using the CPU-host memory as an intermediate buffer (see e.g. `OpenMPI documentation `__). This offers the benefit of transferring data from one GPU to another GPU without the involvement of the CPU-host memory. - -To be specific, in the GPU-awareness approach, the device pointers point to the data allocated in the GPU memory space (data should be present in the GPU device). Here, the pointers are passed as arguments to an MPI routine that is supported by the GPU memory. As MPI routines can directly access GPU memory, it offers the possibility of communicating between pairs of GPUs without transferring data back to the host. +Despite the simplicity of implementing the hybrid MPI-OpenACC/OpenMP offloading, it +suffers from a low performance caused by an explicit transfer of data between the host +and the device before and after calling an MPI routine. This constitutes a bottleneck in +GPU-programming. To improve the performance affected by the host staging during the data +transfer, one can implement the GPU-awareness MPI approach as described in the following +section. -In the hybrid MPI-OpenACC model, the concept is defined by combining the directive `host_data` together with the clause -`use_device(list_array)`. This combination enables the access to the arrays listed in the clause `use_device(list_array)` from the host (see `here `__). The list of arrays, which are already present in the GPU-device memory, are directly passed to an MPI routine without a need of a staging host-memory for copying the data. Note that for initially copying data to GPU, we use unstructured data blocks characterized by the directives `enter data` and `exit data`. The unstructured data has the advantage of allowing to allocate and deallocate arrays within a data region. +Hybrid MPI-OpenACC/OpenMP with GPU-awareness approach +----------------------------------------------------- -To illustrate the concept of the GPU-awareness MPI, we show below two examples that make use of point-to-point and collective operations from OpenACC and OpenMP APIs. In the first code example, the device pointer **f** is passed to the MPI functions `MPI_Send()` and `MPI_Recv()`; and in the second one, the pointer **SumToT** is passed to the MPI function `MPI_Allreduce`. 
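Before looking at the lesson's examples, here is a minimal, self-contained C++/OpenMP sketch of the same pattern (it is not the course's example code; the buffer names ``f`` and ``SumToT`` follow the text, while the sizes and ranks are illustrative). It assumes a GPU-aware MPI library, e.g. MPICH with ``MPICH_GPU_SUPPORT_ENABLED=1`` as described under *Compilation process* below, and at least two MPI ranks for the send/receive part.

.. code-block:: c++

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int n = 1024;
        double* f = new double[n];
        double* SumToT = new double[1];
        for (int i = 0; i < n; ++i) f[i] = rank;
        SumToT[0] = static_cast<double>(rank);

        // Copy the buffers to the GPU once (unstructured data region).
        #pragma omp target enter data map(to: f[0:n], SumToT[0:1])

        // ... device kernels operating on f and SumToT would go here ...

        // Expose the *device* addresses to the host, so that a GPU-aware MPI
        // library can read and write GPU memory directly (no host staging).
        #pragma omp target data use_device_ptr(f, SumToT)
        {
            if (rank == 0 && nranks > 1)
                MPI_Send(f, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(f, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            MPI_Allreduce(MPI_IN_PLACE, SumToT, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }

        // Copy the results back to the host only when they are needed there.
        #pragma omp target exit data map(from: f[0:n], SumToT[0:1])
        std::printf("rank %d: SumToT = %f\n", rank, SumToT[0]);

        delete[] f;
        delete[] SumToT;
        MPI_Finalize();
        return 0;
    }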
Here, the MPI operations `MPI_Send` and `MPI_Recv` as well as `MPI_Allreduce` are performed between a pair of GPUs without passing through the CPU-host memory. +The concept of the GPU-aware MPI enables an MPI library to directly access the +GPU-device memory without necessarily using the CPU-host memory as an intermediate +buffer (see e.g. `OpenMPI documentation +`__). This offers +the benefit of transferring data from one GPU to another GPU without the involvement of +the CPU-host memory. + +To be specific, in the GPU-awareness approach, the device pointers point to the data +allocated in the GPU memory space (data should be present in the GPU device). Here, the +pointers are passed as arguments to an MPI routine that is supported by the GPU memory. +As MPI routines can directly access GPU memory, it offers the possibility of +communicating between pairs of GPUs without transferring data back to the host. + +In the hybrid MPI-OpenACC model, the concept is defined by combining the directive +`host_data` together with the clause `use_device(list_array)`. This combination enables +the access to the arrays listed in the clause `use_device(list_array)` from the host +(see `here +`__). +The list of arrays, which are already present in the GPU-device memory, are directly +passed to an MPI routine without a need of a staging host-memory for copying the data. +Note that for initially copying data to GPU, we use unstructured data blocks +characterized by the directives `enter data` and `exit data`. The unstructured data has +the advantage of allowing to allocate and deallocate arrays within a data region. + +To illustrate the concept of the GPU-awareness MPI, we show below two examples that make +use of point-to-point and collective operations from OpenACC and OpenMP APIs. In the +first code example, the device pointer **f** is passed to the MPI functions `MPI_Send()` +and `MPI_Recv()`; and in the second one, the pointer **SumToT** is passed to the MPI +function `MPI_Allreduce`. Here, the MPI operations `MPI_Send` and `MPI_Recv` as well as +`MPI_Allreduce` are performed between a pair of GPUs without passing through the +CPU-host memory. .. typealong:: Example: ``GPU-awareness: MPI_Send & MPI_Recv`` - .. tabs:: + .. tabs:: - .. tab:: GPU-aware MPI with OpenACC (Fortran) - - .. literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 - :language: fortran - :lines: 65-74 + .. tab:: GPU-aware MPI with OpenACC (Fortran) - .. tab:: GPU-aware MPI with OpenMP (Fortran) - - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 - :language: fortran - :lines: 66-75 + .. literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 + :language: fortran + :lines: 65-74 - .. tab:: GPU-aware MPI with OpenMP (C++) - - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.cpp - :language: C++ - :lines: 66-76 + .. tab:: GPU-aware MPI with OpenMP (Fortran) + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 + :language: fortran + :lines: 66-75 + + .. tab:: GPU-aware MPI with OpenMP (C++) + + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.cpp + :language: C++ + :lines: 66-76 .. typealong:: Example: ``GPU-awareness: MPI_Allreduce`` - .. tabs:: + .. tabs:: - .. tab:: GPU-aware MPI with OpenACC (Fortran) - - .. literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 - :language: fortran - :lines: 90-94 + .. tab:: GPU-aware MPI with OpenACC (Fortran) - .. tab:: GPU-aware MPI with OpenMP (Fortran) - - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 - :language: fortran - :lines: 93-97 + .. 
literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 + :language: fortran + :lines: 90-94 - .. tab:: GPU-aware MPI with OpenMP (C++) - - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.cpp - :language: C++ - :lines: 90-97 + .. tab:: GPU-aware MPI with OpenMP (Fortran) + + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 + :language: fortran + :lines: 93-97 -We provide below a code example that illustrates the implementation of the MPI functions `MPI_Send()`, `MPI_Recv()` and `MPI_Allreduce()` within an OpenACC/OpenMP API. This implementation is specifically designed to support GPU-aware MPI operations. + .. tab:: GPU-aware MPI with OpenMP (C++) + + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.cpp + :language: C++ + :lines: 90-97 + +We provide below a code example that illustrates the implementation of the MPI functions +`MPI_Send()`, `MPI_Recv()` and `MPI_Allreduce()` within an OpenACC/OpenMP API. This +implementation is specifically designed to support GPU-aware MPI operations. .. typealong:: Example: ``GPU-awareness approach`` - .. tabs:: + .. tabs:: - .. tab:: GPU-aware MPI with OpenACC (Fortran) + .. tab:: GPU-aware MPI with OpenACC (Fortran) - .. literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 - :language: fortran - :lines: 60-97 + .. literalinclude:: examples/mpi_acc/mpiacc_gpuaware.f90 + :language: fortran + :lines: 60-97 - .. tab:: GPU-aware MPI with OpenMP (Fortran) + .. tab:: GPU-aware MPI with OpenMP (Fortran) - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 - :language: fortran - :lines: 60-100 + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 + :language: fortran + :lines: 60-100 - .. tab:: GPU-aware MPI with OpenMP (C++) + .. tab:: GPU-aware MPI with OpenMP (C++) - .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 - :language: C++ - :lines: 61-99 + .. literalinclude:: examples/mpi_omp/mpiomp_gpuaware.f90 + :language: C++ + :lines: 61-99 -The GPU-aware MPI with OpenACC/OpenMP APIs has the capability of directly communicating between a pair of GPUs within a single node. However, performing the GPU-to-GPU communication across multiple nodes requires the the GPUDirect RDMA (Remote Direct Memory Access) technology. This technology can further improve performance by reducing latency. +The GPU-aware MPI with OpenACC/OpenMP APIs has the capability of directly communicating +between a pair of GPUs within a single node. However, performing the GPU-to-GPU +communication across multiple nodes requires the the GPUDirect RDMA (Remote Direct +Memory Access) technology. This technology can further improve performance by reducing +latency. Compilation process ------------------- -The compilation process of the hybrid MPI-OpenACC and MPI-OpenMP offloading is described below. This description is given for a Cray compiler of the wrapper `ftn`. On LUMI-G, the following modules may be necessary before compiling (see the `LUMI documentation `_ for further details about the available programming environments): +The compilation process of the hybrid MPI-OpenACC and MPI-OpenMP offloading is described +below. This description is given for a Cray compiler of the wrapper `ftn`. On LUMI-G, +the following modules may be necessary before compiling (see the `LUMI documentation +`_ for further details +about the available programming environments): .. 
code-block:: console - $ ml LUMI/24.03 - $ ml PrgEnv-cray - $ ml cray-mpich - $ ml rocm - $ ml craype-accel-amd-gfx90a - + $ ml LUMI/24.03 + $ ml PrgEnv-cray + $ ml cray-mpich + $ ml rocm + $ ml craype-accel-amd-gfx90a .. typealong:: Example: ``Compilation process`` - .. tabs:: + .. tabs:: - .. tab:: Compiling MPI-OpenACC (Fortran) - .. code-block:: console + .. tab:: Compiling MPI-OpenACC (Fortran) + .. code-block:: console - $ ftn -hacc -o mycode.mpiacc.exe mycode_mpiacc.f90 + $ ftn -hacc -o mycode.mpiacc.exe mycode_mpiacc.f90 - .. tab:: Compiling MPI-OpenMP (Fortran) - .. code-block:: console + .. tab:: Compiling MPI-OpenMP (Fortran) + .. code-block:: console - $ ftn -homp -o mycode.mpiomp.exe mycode_mpiomp.f90 + $ ftn -homp -o mycode.mpiomp.exe mycode_mpiomp.f90 - .. tab:: Compiling MPI-OpenMP (C++) - .. code-block:: console + .. tab:: Compiling MPI-OpenMP (C++) + .. code-block:: console - $ CC -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target -march=gfx90a -o mycode.mpiomp.exe mycode_mpiomp.cpp + $ CC -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target -march=gfx90a -o mycode.mpiomp.exe mycode_mpiomp.cpp -Here, the flags `hacc` and `homp` enable the OpenACC and OpenMP directives in the hybrid MPI-OpenACC and MPI-OpenMP applications, respectively. +Here, the flags `hacc` and `homp` enable the OpenACC and OpenMP directives in the hybrid +MPI-OpenACC and MPI-OpenMP applications, respectively. **Enabling GPU-aware support** -To enable the GPU-aware support in MPICH library, one needs to set the following environment variable before running the application. +To enable the GPU-aware support in MPICH library, one needs to set the following +environment variable before running the application. .. code-block:: - $ export MPICH_GPU_SUPPORT_ENABLED=1 - + $ export MPICH_GPU_SUPPORT_ENABLED=1 Conclusion ---------- -In conclusion, we have presented an overview of a GPU-hybrid programming by integrating GPU-directive models, specifically OpenACC and OpenMP APIs, with the MPI library. The approach adopted here allows us to utilise multiple GPU-devices not only within a single node but it extends to distributed nodes. In particular, we have addressed GPU-aware MPI approach, which has the advantage of enabling a direct interaction between an MPI library and a GPU-device memory. In other words, it permits performing MPI operations between a pair of GPUs, thus reducing the computing time caused by the data locality. - + +In conclusion, we have presented an overview of a GPU-hybrid programming by integrating +GPU-directive models, specifically OpenACC and OpenMP APIs, with the MPI library. The +approach adopted here allows us to utilise multiple GPU-devices not only within a single +node but it extends to distributed nodes. In particular, we have addressed GPU-aware MPI +approach, which has the advantage of enabling a direct interaction between an MPI +library and a GPU-device memory. In other words, it permits performing MPI operations +between a pair of GPUs, thus reducing the computing time caused by the data locality. + Exercises --------- -We consider an MPI fortran code that solves a 2D-Laplace equation, and which is partially accelerated. The focus of the exercises is to complete the acceleration using either OpenACC or OpenMP API by following these steps. +We consider an MPI fortran code that solves a 2D-Laplace equation, and which is +partially accelerated. The focus of the exercises is to complete the acceleration using +either OpenACC or OpenMP API by following these steps. .. 
callout:: Access exercise material - Code examples for the exercises below can be accessed in the `content/examples/exercise_multipleGPU` subdirectory of this repository. To access them, you need to clone the repository: + Code examples for the exercises below can be accessed in the `content/examples/exercise_multipleGPU` subdirectory of this repository. To access them, you need to clone the repository: - .. code-block:: console + .. code-block:: console - $ git clone https://github.com/ENCCS/gpu-programming.git - $ cd gpu-programming/content/examples/exercise_multipleGPU - $ ls + $ git clone https://github.com/ENCCS/gpu-programming.git + $ cd gpu-programming/content/examples/exercise_multipleGPU + $ ls .. challenge:: Exercise I: Set a GPU device - 1. Implement OpenACC/OpenMP functions that enable assigning each MPI rank to a GPU device. + 1. Implement OpenACC/OpenMP functions that enable assigning each MPI rank to a GPU device. - 1.1 Compile and run the code on multiple GPUs. + 1.1 Compile and run the code on multiple GPUs. .. challenge:: Exercise II: Apply traditional MPI-OpenACC/OpenMP - 2.1 Incorporate the OpenACC directives `*update host()*` and `*update device()*` before and after calling an MPI function, respectively. + 2.1 Incorporate the OpenACC directives `*update host()*` and `*update device()*` before and after calling an MPI function, respectively. - .. note:: - The OpenACC directive `*update host()*` is used to transfer data from GPU to CPU within a data region; while the directive `*update device()*` is used to transfer the data from CPU to GPU. + .. note:: + The OpenACC directive `*update host()*` is used to transfer data from GPU to CPU within a data region; while the directive `*update device()*` is used to transfer the data from CPU to GPU. - 2.2 Incorporate the OpenMP directives `*update device() from()*` and `*update device() to()*` before and after calling an MPI function, respectively. + 2.2 Incorporate the OpenMP directives `*update device() from()*` and `*update device() to()*` before and after calling an MPI function, respectively. - .. note:: - The OpenMP directive `*update device() from()*` is used to transfer data from GPU to CPU within a data region; while the directive `*update device() to()*` is used to transfer the data from CPU to GPU. + .. note:: + The OpenMP directive `*update device() from()*` is used to transfer data from GPU to CPU within a data region; while the directive `*update device() to()*` is used to transfer the data from CPU to GPU. - 2.3 Compile and run the code on multiple GPUs. + 2.3 Compile and run the code on multiple GPUs. .. challenge:: Exercise III: Implement GPU-aware support - 3.1 Incorporate the OpenACC directive `*host_data use_device()*` to pass a device pointer to an MPI function. + 3.1 Incorporate the OpenACC directive `*host_data use_device()*` to pass a device pointer to an MPI function. - 3.2 Incorporate the OpenMP directive `*data use_device_ptr()*` to pass a device pointer to an MPI function. + 3.2 Incorporate the OpenMP directive `*data use_device_ptr()*` to pass a device pointer to an MPI function. - 3.3 Compile and run the code on multiple GPUs. + 3.3 Compile and run the code on multiple GPUs. .. challenge:: Exercise IV: Evaluate the performance - 1. Evaluate the execution time of the accelerated codes in the exercises **II** and **III**, and compare it with that of a pure MPI implementation. + 1. 
Evaluate the execution time of the accelerated codes in the exercises **II** and **III**, and compare it with that of a pure MPI implementation. See also -------- -- `GPU-aware MPI `_. +- `GPU-aware MPI + `_. - `MPI documentation `_. -- `OpenACC specification `_. -- `OpenMP specification `_. -- `LUMI documentation `_. -- `OpenACC vs OpenMP offloading `_. +- `OpenACC specification + `_. +- `OpenMP specification + `_. +- `LUMI documentation + `_. +- `OpenACC vs OpenMP offloading + `_. - `OpenACC course `_. - - diff --git a/content/11-gpu-porting.rst b/content/11-gpu-porting.rst index 8a9576d3..f7c04301 100644 --- a/content/11-gpu-porting.rst +++ b/content/11-gpu-porting.rst @@ -5,273 +5,309 @@ Preparing code for GPU porting .. questions:: - - What are the key steps involved in porting code to take advantage of GPU parallel processing capability? - - How can I identify the computationally intensive parts of my code that can benefit from GPU acceleration? - - What are the considerations for refactoring loops to suit the GPU architecture and improve memory access patterns? - - Are there any tools that can translate automatically between different frameworks? + - What are the key steps involved in porting code to take advantage of GPU parallel processing capability? + - How can I identify the computationally intensive parts of my code that can benefit from GPU acceleration? + - What are the considerations for refactoring loops to suit the GPU architecture and improve memory access patterns? + - Are there any tools that can translate automatically between different frameworks? .. objectives:: - - Getting familiarized the steps involved in porting code to GPUs to take advantage of parallel processing capabilities. - - Giving some idea about refactoring loops and modifying operations to suit the GPU architecture and improve memory access patterns. - - Learn to use automatic translation tools to port from CUDA to HIP and from OpenACC to OpenMP + - Getting familiarized the steps involved in porting code to GPUs to take advantage of parallel processing capabilities. + - Giving some idea about refactoring loops and modifying operations to suit the GPU architecture and improve memory access patterns. + - Learn to use automatic translation tools to port from CUDA to HIP and from OpenACC to OpenMP .. instructor-note:: - - 30 min teaching - - 20 min exercises + - 30 min teaching + - 20 min exercises Porting from CPU to GPU ----------------------- -When porting code to take advantage of the parallel processing capability of GPUs, several steps need to be followed and some additional work is required before writing actual parallel code to be executed on the GPUs: - -* **Identify Targeted Parts**: Begin by identifying the parts of the code that contribute significantly to the execution time. These are often computationally intensive sections such as loops or matrix operations. The Pareto principle suggests that roughly 10-20% of the code accounts for 80-90% of the execution time. - -* **Equivalent GPU Libraries**: If the original code uses CPU libraries like BLAS, FFT, etc, it's crucial to identify the equivalent GPU libraries. For example, `cuBLAS` or `hipBLAS` can replace CPU-based BLAS libraries. Utilizing GPU-specific libraries ensures efficient GPU utilization. - -* **Refactor Loops**: When porting loops directly to GPUs, some refactoring is necessary to suit the GPU architecture. 
This typically involves splitting the loop into multiple steps or modifying operations to exploit the independence between iterations and improve memory access patterns. Each step of the original loop can be mapped to a kernel, executed by multiple GPU threads, with each thread corresponding to an iteration. - -* **Memory Access Optimization**: Consider the memory access patterns in the code. GPUs perform best when memory access is coalesced and aligned. Minimizing global memory accesses and maximizing utilization of shared memory or registers can significantly enhance performance. Review the code to ensure optimal memory access for GPU execution. +When porting code to take advantage of the parallel processing capability of GPUs, +several steps need to be followed and some additional work is required before writing +actual parallel code to be executed on the GPUs: + +- **Identify Targeted Parts**: Begin by identifying the parts of the code that + contribute significantly to the execution time. These are often computationally + intensive sections such as loops or matrix operations. The Pareto principle suggests + that roughly 10-20% of the code accounts for 80-90% of the execution time. +- **Equivalent GPU Libraries**: If the original code uses CPU libraries like BLAS, FFT, + etc, it's crucial to identify the equivalent GPU libraries. For example, `cuBLAS` or + `hipBLAS` can replace CPU-based BLAS libraries. Utilizing GPU-specific libraries + ensures efficient GPU utilization. +- **Refactor Loops**: When porting loops directly to GPUs, some refactoring is necessary + to suit the GPU architecture. This typically involves splitting the loop into multiple + steps or modifying operations to exploit the independence between iterations and + improve memory access patterns. Each step of the original loop can be mapped to a + kernel, executed by multiple GPU threads, with each thread corresponding to an + iteration. +- **Memory Access Optimization**: Consider the memory access patterns in the code. GPUs + perform best when memory access is coalesced and aligned. Minimizing global memory + accesses and maximizing utilization of shared memory or registers can significantly + enhance performance. Review the code to ensure optimal memory access for GPU + execution. Discussion -^^^^^^^^^^ - .. challenge:: How would this be ported? (n_soap ≈ 100, n_sites ⩾ 10000, k_max ≈ 20*n_sites ) - - Inspect the following Fortran code (if you don't read Fortran: do-loops == for-loops) - - .. 
code-block:: Fortran - - k2 = 0 - do i = 1, n_sites - do j = 1, n_neigh(i) - k2 = k2 + 1 - counter = 0 - counter2 = 0 - do n = 1, n_max - do np = n, n_max - do l = 0, l_max - if( skip_soap_component(l, np, n) )cycle - - counter = counter+1 - do m = 0, l - k = 1 + l*(l+1)/2 + m - counter2 = counter2 + 1 - multiplicity = multiplicity_array(counter2) - soap_rad_der(counter, k2) = soap_rad_der(counter, k2) + multiplicity * real( cnk_rad_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_rad_der(k, np, k2)) ) - soap_azi_der(counter, k2) = soap_azi_der(counter, k2) + multiplicity * real( cnk_azi_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_azi_der(k, np, k2)) ) - soap_pol_der(counter, k2) = soap_pol_der(counter, k2) + multiplicity * real( cnk_pol_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_pol_der(k, np, k2)) ) - end do - end do - end do - end do - - soap_rad_der(1:n_soap, k2) = soap_rad_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_rad_der(1:n_soap, k2) ) - soap_azi_der(1:n_soap, k2) = soap_azi_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_azi_der(1:n_soap, k2) ) - soap_pol_der(1:n_soap, k2) = soap_pol_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_pol_der(1:n_soap, k2) ) - - if( j == 1 )then - k3 = k2 - else - soap_cart_der(1, 1:n_soap, k2) = dsin(thetas(k2)) * dcos(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dcos(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) - dsin(phis(k2)) / rjs(k2) * soap_azi_der(1:n_soap, k2) - soap_cart_der(2, 1:n_soap, k2) = dsin(thetas(k2)) * dsin(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dsin(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) + dcos(phis(k2)) / rjs(k2) * soap_azi_der(1:n_soap, k2) - soap_cart_der(3, 1:n_soap, k2) = dcos(thetas(k2)) * soap_rad_der(1:n_soap, k2) + dsin(thetas(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) - soap_cart_der(1, 1:n_soap, k3) = soap_cart_der(1, 1:n_soap, k3) - soap_cart_der(1, 1:n_soap, k2) - soap_cart_der(2, 1:n_soap, k3) = soap_cart_der(2, 1:n_soap, k3) - soap_cart_der(2, 1:n_soap, k2) - soap_cart_der(3, 1:n_soap, k3) = soap_cart_der(3, 1:n_soap, k3) - soap_cart_der(3, 1:n_soap, k2) - end if - end do - end do - - Some steps at first glance: - - * the code could (has to) be splitted in 3-4 kernels. Why? - * check if there are any variables that could lead to false dependencies between iterations, like the index `k2` - * is it efficient for GPUs to split the work over the index `i`? What about the memory access? Note the arrays are `2D` in Fortran - * is it possible to collapse some loops? Combining nested loops can reduce overhead and improve memory access patterns, leading to better GPU performance. - * what is the best memory access in a GPU? Review memory access patterns in the code. Minimize global memory access by utilizing shared memory or registers where appropriate. Ensure memory access is coalesced and aligned, maximizing GPU memory throughput - +~~~~~~~~~~ + + .. challenge:: How would this be ported? (n_soap ≈ 100, n_sites ⩾ 10000, k_max ≈ 20*n_sites ) + + Inspect the following Fortran code (if you don't read Fortran: do-loops == for-loops) + + .. 
code-block:: Fortran + + k2 = 0 + do i = 1, n_sites + do j = 1, n_neigh(i) + k2 = k2 + 1 + counter = 0 + counter2 = 0 + do n = 1, n_max + do np = n, n_max + do l = 0, l_max + if( skip_soap_component(l, np, n) )cycle + + counter = counter+1 + do m = 0, l + k = 1 + l*(l+1)/2 + m + counter2 = counter2 + 1 + multiplicity = multiplicity_array(counter2) + soap_rad_der(counter, k2) = soap_rad_der(counter, k2) + multiplicity * real( cnk_rad_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_rad_der(k, np, k2)) ) + soap_azi_der(counter, k2) = soap_azi_der(counter, k2) + multiplicity * real( cnk_azi_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_azi_der(k, np, k2)) ) + soap_pol_der(counter, k2) = soap_pol_der(counter, k2) + multiplicity * real( cnk_pol_der(k, n, k2) * conjg(cnk(k, np, i)) + cnk(k, n, i) * conjg(cnk_pol_der(k, np, k2)) ) + end do + end do + end do + end do + + soap_rad_der(1:n_soap, k2) = soap_rad_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_rad_der(1:n_soap, k2) ) + soap_azi_der(1:n_soap, k2) = soap_azi_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_azi_der(1:n_soap, k2) ) + soap_pol_der(1:n_soap, k2) = soap_pol_der(1:n_soap, k2) / sqrt_dot_p(i) - soap(1:n_soap, i) / sqrt_dot_p(i)**3 * dot_product( soap(1:n_soap, i), soap_pol_der(1:n_soap, k2) ) + + if( j == 1 )then + k3 = k2 + else + soap_cart_der(1, 1:n_soap, k2) = dsin(thetas(k2)) * dcos(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dcos(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) - dsin(phis(k2)) / rjs(k2) * soap_azi_der(1:n_soap, k2) + soap_cart_der(2, 1:n_soap, k2) = dsin(thetas(k2)) * dsin(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dsin(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) + dcos(phis(k2)) / rjs(k2) * soap_azi_der(1:n_soap, k2) + soap_cart_der(3, 1:n_soap, k2) = dcos(thetas(k2)) * soap_rad_der(1:n_soap, k2) + dsin(thetas(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) + soap_cart_der(1, 1:n_soap, k3) = soap_cart_der(1, 1:n_soap, k3) - soap_cart_der(1, 1:n_soap, k2) + soap_cart_der(2, 1:n_soap, k3) = soap_cart_der(2, 1:n_soap, k3) - soap_cart_der(2, 1:n_soap, k2) + soap_cart_der(3, 1:n_soap, k3) = soap_cart_der(3, 1:n_soap, k3) - soap_cart_der(3, 1:n_soap, k2) + end if + end do + end do + + Some steps at first glance: + + * the code could (has to) be splitted in 3-4 kernels. Why? + * check if there are any variables that could lead to false dependencies between iterations, like the index `k2` + * is it efficient for GPUs to split the work over the index `i`? What about the memory access? Note the arrays are `2D` in Fortran + * is it possible to collapse some loops? Combining nested loops can reduce overhead and improve memory access patterns, leading to better GPU performance. + * what is the best memory access in a GPU? Review memory access patterns in the code. Minimize global memory access by utilizing shared memory or registers where appropriate. Ensure memory access is coalesced and aligned, maximizing GPU memory throughput .. admonition:: Refactored code! - :class: dropdown - - - Registers are limited and the larger the kernel use more registers registers resulting in less active threads (small occupancy). - - In order to compute `soap_rad_der(is,k2)` the CUDA thread needs access to all the previous values `soap_rad_der(1:nsoap,k2)`. 
- - In order to compute `soap_cart_der(1, 1:n_soap, k3)` it is required to have access to all values `(k3+1:k2+n_neigh(i))`. - - Note the indices in the first part. The matrices are transposed for better access patterns. - - .. code-block:: Fortran - - !omp target teams distribute parallel do private (i) - do k2 = 1, k2_max - i=list_of_i(k2) - counter = 0 - counter2 = 0 - do n = 1, n_max - do np = n, n_max - do l = 0, l_max - if( skip_soap_component(l, np, n) ) then - cycle - endif - counter = counter+1 - do m = 0, l - k = 1 + l*(l+1)/2 + m - counter2 = counter2 + 1 - multiplicity = multiplicity_array(counter2) - tsoap_rad_der(k2,counter) = tsoap_rad_der(k2,counter) + multiplicity * real( tcnk_rad_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_rad_der(k2,k,np)) ) - tsoap_azi_der(k2,counter) = tsoap_azi_der(k2,counter) + multiplicity * real( tcnk_azi_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_azi_der(k2,k,np)) ) - tsoap_pol_der(k2,counter) = tsoap_pol_der(k2,counter) + multiplicity * real( tcnk_pol_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_pol_der(k2,k,np)) ) + + - Registers are limited and the larger the kernel use more registers registers + resulting in less active threads (small occupancy). + - In order to compute `soap_rad_der(is,k2)` the CUDA thread needs access to all the + previous values `soap_rad_der(1:nsoap,k2)`. + - In order to compute `soap_cart_der(1, 1:n_soap, k3)` it is required to have access + to all values `(k3+1:k2+n_neigh(i))`. + - Note the indices in the first part. The matrices are transposed for better access + patterns. + + .. code-block:: Fortran + + !omp target teams distribute parallel do private (i) + do k2 = 1, k2_max + i=list_of_i(k2) + counter = 0 + counter2 = 0 + do n = 1, n_max + do np = n, n_max + do l = 0, l_max + if( skip_soap_component(l, np, n) ) then + cycle + endif + counter = counter+1 + do m = 0, l + k = 1 + l*(l+1)/2 + m + counter2 = counter2 + 1 + multiplicity = multiplicity_array(counter2) + tsoap_rad_der(k2,counter) = tsoap_rad_der(k2,counter) + multiplicity * real( tcnk_rad_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_rad_der(k2,k,np)) ) + tsoap_azi_der(k2,counter) = tsoap_azi_der(k2,counter) + multiplicity * real( tcnk_azi_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_azi_der(k2,k,np)) ) + tsoap_pol_der(k2,counter) = tsoap_pol_der(k2,counter) + multiplicity * real( tcnk_pol_der(k2,k,n) * conjg(tcnk(i,k,np)) + tcnk(i,k,n) * conjg(tcnk_pol_der(k2,k,np)) ) + end do + end do + end do end do end do - end do - end do - end do - - ! Before the next part the variables are transposed again to their original layout. 
- - !omp target teams distribute private(i) - do k2 = 1, k2_max - i=list_of_i(k2) - locdot=0.d0 - - !omp parallel do reduction(+:locdot_rad_der,locdot_azi_der,locdot_pol_der) - do is=1,nsoap - locdot_rad_der=locdot_rad_der+soap(is, i) * soap_rad_der(is, k2) - locdot_azi_der=locdot_azi_der+soap(is, i) * soap_azi_der(is, k2) - locdot_pol_der=locdot_pol_der+soap(is, i) * soap_pol_der(is, k2) - enddo - dot_soap_rad_der(k2)= locdot_rad_der - dot_soap_azi_der(k2)= locdot_azi_der - dot_soap_pol_der(k2)= locdot_pol_der - end do - - !omp target teams distribute - do k2 = 1, k2_max - i=list_of_i(k2) - - !omp parallel do - do is=1,nsoap - soap_rad_der(is, k2) = soap_rad_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_rad_der(k2) - soap_azi_der(is, k2) = soap_azi_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_azi_der(k2) - soap_pol_der(is, k2) = soap_pol_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_pol_der(k2) - end do - end do - - !omp teams distribute private(k3) - do k2 = 1, k2_max - k3=list_k2k3(k2) - - !omp parallel do private (is) - do is=1,n_soap - if( k3 /= k2)then - soap_cart_der(1, is, k2) = dsin(thetas(k2)) * dcos(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dcos(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) - dsin(phis(k2)) / rjs(k2) * soap_azi_der(is, k2) - soap_cart_der(2, is, k2) = dsin(thetas(k2)) * dsin(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dsin(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) + dcos(phis(k2)) / rjs(k2) * soap_azi_der(is, k2) - soap_cart_der(3, is, k2) = dcos(thetas(k2)) * soap_rad_der(is, k2) + dsin(thetas(k2)) / rjs(k2) * soap_pol_der(is, k2) - end if - end do - end do - - !omp teams distribute private(k3) - do i = 1, n_sites - k3=list_k3(i) - - !omp parallel do private(is, k2) - do is=1,n_soap - do k2=k3+1,k3+n_neigh(i) - soap_cart_der(1, is, k3) = soap_cart_der(1, is, k3) - soap_cart_der(1, is, k2) - soap_cart_der(2, is, k3) = soap_cart_der(2, is, k3) - soap_cart_der(2, is, k2) - soap_cart_der(3, is, k3) = soap_cart_der(3, is, k3) - soap_cart_der(3, is, k2) - end do - end do - end do + ! Before the next part the variables are transposed again to their original layout. 
+ + !omp target teams distribute private(i) + do k2 = 1, k2_max + i=list_of_i(k2) + locdot=0.d0 + + !omp parallel do reduction(+:locdot_rad_der,locdot_azi_der,locdot_pol_der) + do is=1,nsoap + locdot_rad_der=locdot_rad_der+soap(is, i) * soap_rad_der(is, k2) + locdot_azi_der=locdot_azi_der+soap(is, i) * soap_azi_der(is, k2) + locdot_pol_der=locdot_pol_der+soap(is, i) * soap_pol_der(is, k2) + enddo + dot_soap_rad_der(k2)= locdot_rad_der + dot_soap_azi_der(k2)= locdot_azi_der + dot_soap_pol_der(k2)= locdot_pol_der + end do + + !omp target teams distribute + do k2 = 1, k2_max + i=list_of_i(k2) + + !omp parallel do + do is=1,nsoap + soap_rad_der(is, k2) = soap_rad_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_rad_der(k2) + soap_azi_der(is, k2) = soap_azi_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_azi_der(k2) + soap_pol_der(is, k2) = soap_pol_der(is, k2) / sqrt_dot_p(i) - soap(is, i) / sqrt_dot_p(i)**3 * dot_soap_pol_der(k2) + end do + end do + + !omp teams distribute private(k3) + do k2 = 1, k2_max + k3=list_k2k3(k2) + + !omp parallel do private (is) + do is=1,n_soap + if( k3 /= k2)then + soap_cart_der(1, is, k2) = dsin(thetas(k2)) * dcos(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dcos(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) - dsin(phis(k2)) / rjs(k2) * soap_azi_der(is, k2) + soap_cart_der(2, is, k2) = dsin(thetas(k2)) * dsin(phis(k2)) * soap_rad_der(1:n_soap, k2) - dcos(thetas(k2)) * dsin(phis(k2)) / rjs(k2) * soap_pol_der(1:n_soap, k2) + dcos(phis(k2)) / rjs(k2) * soap_azi_der(is, k2) + soap_cart_der(3, is, k2) = dcos(thetas(k2)) * soap_rad_der(is, k2) + dsin(thetas(k2)) / rjs(k2) * soap_pol_der(is, k2) + end if + end do + end do + + !omp teams distribute private(k3) + do i = 1, n_sites + k3=list_k3(i) + + !omp parallel do private(is, k2) + do is=1,n_soap + do k2=k3+1,k3+n_neigh(i) + soap_cart_der(1, is, k3) = soap_cart_der(1, is, k3) - soap_cart_der(1, is, k2) + soap_cart_der(2, is, k3) = soap_cart_der(2, is, k3) - soap_cart_der(2, is, k2) + soap_cart_der(3, is, k3) = soap_cart_der(3, is, k3) - soap_cart_der(3, is, k2) + end do + end do + end do .. keypoints:: - - Identify equivalent GPU libraries for CPU-based libraries and utilizing them to ensure efficient GPU utilization. - - Importance of identifying the computationally intensive parts of the code that contribute significantly to the execution time. - - The need to refactor loops to suit the GPU architecture. - - Significance of memory access optimization for efficient GPU execution, including coalesced and aligned memory access patterns. + - Identify equivalent GPU libraries for CPU-based libraries and utilizing them to ensure efficient GPU utilization. + - Importance of identifying the computationally intensive parts of the code that contribute significantly to the execution time. + - The need to refactor loops to suit the GPU architecture. + - Significance of memory access optimization for efficient GPU execution, including coalesced and aligned memory access patterns. Porting between different GPU frameworks ---------------------------------------- -You might also find yourself in a situation where you need to port a code from one particular -GPU framework to another. This section gives an overview of different tools that enable converting CUDA and -OpenACC codes to HIP and OpenMP, respectively. This conversion process enables an application to target various -GPU architectures, specifically, NVIDIA and AMD GPUs. 
Here we focus on
-`hipify `__ and
-`clacc `__ tools.
-This guide is adapted from the `NRIS documentation `__.
+You might also find yourself in a situation where you need to port a code from one
+particular GPU framework to another. This section gives an overview of different tools
+that enable converting CUDA and OpenACC codes to HIP and OpenMP, respectively. This
+conversion process enables an application to target various GPU architectures,
+specifically, NVIDIA and AMD GPUs. Here we focus on the `hipify
+`__ and
+`clacc `__ tools. This guide is adapted from the
+`NRIS documentation
+`__.

 Translating CUDA to HIP with Hipify
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-In this section, we cover the use of ``hipify-perl`` and ``hipify-clang`` tools to translate a CUDA code to HIP.
+In this section, we cover the use of the ``hipify-perl`` and ``hipify-clang`` tools to
+translate a CUDA code to HIP.

 Hipify-perl
-~~~~~~~~~~~
-
-The ``hipify-perl`` tool is a script based on perl that translates CUDA syntax into HIP syntax
-(see .e.g. `here `_ for more details).
-For instance, in a CUDA code that incorporates the CUDA functions ``cudaMalloc``` and ``cudaDeviceSynchronize``, the tool will substitute ``cudaMalloc`` with the HIP function ``hipMalloc``. Similarly the CUDA function ``cudaDeviceSynchronize`` will be substituted with the HIP function ``hipDeviceSynchronize``. We list below the basic steps to run ``hipify-perl`` on LUMI-G.
++++++++++++
+
+The ``hipify-perl`` tool is a Perl script that translates CUDA syntax into HIP syntax
+(see e.g. `here
+`_
+for more details). For instance, in a CUDA code that incorporates the CUDA functions
+``cudaMalloc`` and ``cudaDeviceSynchronize``, the tool will substitute ``cudaMalloc``
+with the HIP function ``hipMalloc``. Similarly, the CUDA function
+``cudaDeviceSynchronize`` will be substituted with the HIP function
+``hipDeviceSynchronize``. We list below the basic steps to run ``hipify-perl`` on
+LUMI-G.

 - **Step 1**: Generating the ``hipify-perl`` script

   .. code-block:: console

-     $ module load rocm/5.2.3
-     $ hipify-clang --perl
+        $ module load rocm/5.2.3
+        $ hipify-clang --perl

 - **Step 2**: Running the generated ``hipify-perl``

   .. code-block:: console

-     $ hipify-perl program.cu > program.cu.hip
+        $ hipify-perl program.cu > program.cu.hip

 - **Step 3**: Compiling the generated HIP code with ``hipcc``

   .. code-block:: console

-     $ hipcc --offload-arch=gfx90a -o program.hip.exe program.cu.hip
+        $ hipcc --offload-arch=gfx90a -o program.hip.exe program.cu.hip

-Despite the simplicity of the use of ``hipify-perl``, the tool might not be suitable for large applications, as it relies heavily on substituting CUDA strings with HIP strings (e.g. it substitutes ``*cuda*`` with ``*hip*``).
-In addition, ``hipify-perl`` lacks the ability of `distinguishing device/host function calls `_.
-The alternative here is to use the ``hipify-clang`` tool as will be described in the next section.
+Despite its simplicity, ``hipify-perl`` might not be suitable for large applications, as
+it relies heavily on substituting CUDA strings with HIP strings (e.g. it substitutes
+``*cuda*`` with ``*hip*``). In addition, ``hipify-perl`` is not capable of
+`distinguishing device/host function calls
+`_. The
+alternative is to use the ``hipify-clang`` tool, as described in the next section.
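
Before moving on, the following small, self-contained CUDA example may make the
substitution behaviour concrete. It is a made-up snippet (not part of the course
repository) that only allocates, copies and frees device memory; the comments indicate
the HIP names that ``hipify-perl`` is expected to produce for each call, and the CUDA
runtime header is likewise mapped to the HIP runtime header.

.. code-block:: cpp

   #include <cuda_runtime.h>   // expected to become <hip/hip_runtime.h>
   #include <vector>

   int main() {
     const size_t n = 1 << 20;
     std::vector<float> h_x(n, 1.0f);                 // host data

     float *d_x = nullptr;
     cudaMalloc((void **)&d_x, n * sizeof(float));    // -> hipMalloc
     cudaMemcpy(d_x, h_x.data(), n * sizeof(float),
                cudaMemcpyHostToDevice);              // -> hipMemcpy, hipMemcpyHostToDevice
     cudaDeviceSynchronize();                         // -> hipDeviceSynchronize
     cudaFree(d_x);                                   // -> hipFree
     return 0;
   }

Such one-to-one renaming is exactly what the string substitution handles well; the
constructs it cannot disambiguate (for instance, host functions whose names merely
contain ``cuda``) are where ``hipify-clang`` becomes the safer choice.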
Hipify-clang
-~~~~~~~~~~~~
-
-As described in the `HIPIFY documentation `_,
-the ``hipify-clang`` tool is based on clang for translating CUDA sources into HIP sources.
-The tool is more robust for translating CUDA codes compared to the ``hipify-perl`` tool.
-Furthermore, it facilitates the analysis of the code by providing assistance.
-
-In short, ``hipify-clang`` requires ``LLVM+CLANG`` and ``CUDA``. Details about building ``hipify-clang`` can be found `here `__. Note that ``hipify-clang`` is available on LUMI-G.
-The issue however might be related to the installation of CUDA-toolkit.
-To avoid any eventual issues with the installation procedure we opt for CUDA singularity container. Here we present a step-by-step guide for running ``hipify-clang``:
+++++++++++++
+
+As described in the `HIPIFY documentation
+`_,
+the ``hipify-clang`` tool is based on Clang and translates CUDA sources into HIP
+sources. It is more robust than ``hipify-perl`` for translating CUDA code, and it also
+assists the analysis of the code by reporting diagnostics during the translation.
+
+In short, ``hipify-clang`` requires ``LLVM+CLANG`` and ``CUDA``. Details about building
+``hipify-clang`` can be found `here `__. Note that
+``hipify-clang`` is available on LUMI-G. A possible issue, however, is the installation
+of the CUDA toolkit. To avoid such issues with the installation procedure, we opt for a
+CUDA Singularity container. Here we present a step-by-step guide for running
+``hipify-clang``:

 - **Step 1**: Pulling a CUDA singularity container, e.g.

   .. code-block:: console

-     $ singularity pull docker://nvcr.io/nvidia/cuda:11.4.3-devel-ubuntu20.04
+        $ singularity pull docker://nvcr.io/nvidia/cuda:11.4.3-devel-ubuntu20.04

 - **Step 2**: Loading a rocm module and launching the CUDA singularity

   .. code-block:: console

-     $ module load rocm/5.2.3
-     $ singularity shell -B $PWD,/opt:/opt cuda_11.4.0-devel-ubuntu20.04.sif
+        $ module load rocm/5.2.3
+        $ singularity shell -B $PWD,/opt:/opt cuda_11.4.0-devel-ubuntu20.04.sif

-  where the current directory ``$PWD`` in the host is mounted to that of the container, and the directory ``/opt`` in the host is mounted to the that inside the container.
+  where the current directory ``$PWD`` in the host is mounted to that of the container,
+  and the directory ``/opt`` in the host is mounted to the one inside the container.

-- **Step 3**: Setting the environment variable ``$PATH``.
-  In order to run ``hipify-clang`` from inside the container, one can set the environment variable ``$PATH`` that defines the path to look for the binary ``hipify-clang``.
+- **Step 3**: Setting the environment variable ``$PATH``. In order to run
+  ``hipify-clang`` from inside the container, one can extend the environment variable
+  ``$PATH`` so that it includes the directory containing the ``hipify-clang`` binary.

   .. code-block:: console

-     $ export PATH=/opt/rocm-5.2.3/bin:$PATH
+        $ export PATH=/opt/rocm-5.2.3/bin:$PATH

 Note that the ROCm version we used is ``rocm-5.2.3``.

@@ -279,149 +315,179 @@ To avoid any eventual issues with the installation procedure we opt for CUDA sin

   .. code-block:: console

-     $ hipify-clang program.cu -o hip_program.cu.hip --cuda-path=/usr/local/cuda-11.4 -I /usr/local/cuda-11.4/include
+        $ hipify-clang program.cu -o hip_program.cu.hip --cuda-path=/usr/local/cuda-11.4 -I /usr/local/cuda-11.4/include

-  Here the cuda path and the path to the ``*includes*`` and ``*defines*`` files should be specified.
The CUDA source code and the generated output code are `program.cu` and `hip_program.cu.hip`, respectively.
+  Here the CUDA path and the path to the ``*includes*`` and ``*defines*`` files should
+  be specified. The CUDA source code and the generated output code are `program.cu` and
+  `hip_program.cu.hip`, respectively.

-  The syntax for the compilation process of the generated hip code is similar to the one described in the previous section (see the **Step 3** in the hipify-perl section).
+  The compilation of the generated HIP code is similar to the procedure described in the
+  previous section (see **Step 3** in the hipify-perl section).

-Code examples for the ``Hipify`` exercises can be accessed in the `content/examples/exercise_hipify` subdirectory by cloning this repository:
+Code examples for the ``Hipify`` exercises can be accessed in the
+`content/examples/exercise_hipify` subdirectory by cloning this repository:

-  .. code-block:: console
+  .. code-block:: console

-     $ git clone https://github.com/ENCCS/gpu-programming.git
-     $ cd gpu-programming/content/examples/exercise_hipify
-     $ ls
+     $ git clone https://github.com/ENCCS/gpu-programming.git
+     $ cd gpu-programming/content/examples/exercise_hipify
+     $ ls

.. challenge:: Exercise I: Translate a CUDA code to HIP with ``hipify-perl``

-   1.1 Generate the ``hipify-perl`` tool.
+   1.1 Generate the ``hipify-perl`` tool.

-   1.2 Convert the CUDA code ``vec_add_cuda.cu`` located in ``/exercise_hipify/Hipify_perl`` with the ``Hipify-perl`` tool to HIP.
+   1.2 Convert the CUDA code ``vec_add_cuda.cu`` located in ``/exercise_hipify/Hipify_perl`` with the ``Hipify-perl`` tool to HIP.

-   1.3 Compile the generated HIP code with the ``hipcc`` compiler wrapper and run it.
+   1.3 Compile the generated HIP code with the ``hipcc`` compiler wrapper and run it.

.. challenge:: Exercise II: Translate a CUDA code to HIP with ``hipify-clang``

-   2.1 Convert the CUDA code ``vec_add_cuda.cu`` located in ``/exercise_hipify/Hipify_clang`` with the ``Hipify-clang`` tool to HIP.
-
-   2.2 Compile the generated HIP code with the ``hipcc`` compiler wrapper and run it.
+   2.1 Convert the CUDA code ``vec_add_cuda.cu`` located in ``/exercise_hipify/Hipify_clang`` with the ``Hipify-clang`` tool to HIP.
+   2.2 Compile the generated HIP code with the ``hipcc`` compiler wrapper and run it.

 Translating OpenACC to OpenMP with Clacc
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-`Clacc `_ is a tool to translate an OpenACC
-application to OpenMP offloading with the Clang/LLVM compiler environment.
-Note that the tool is specific to OpenACC C, while OpenACC Fortran is already supported on AMD GPU.
-As indicated in the `GitHub repository `_ the compiler ``Clacc`` is the ``Clang``'s executable in the subdirectory ``\bin`` of the ``\install`` directory as described below.
+`Clacc `_ is a tool to
+translate an OpenACC application to OpenMP offloading with the Clang/LLVM compiler
+environment. Note that the tool is specific to OpenACC C, while OpenACC Fortran is
+already supported on AMD GPUs. As indicated in the `GitHub repository
+`_, the ``Clacc`` compiler
+is the ``clang`` executable in the ``bin`` subdirectory of the ``install`` directory,
+as described below.

 In the following we present a step-by-step guide for building and using `Clacc`:

-- **Step 1**: Building and installing `Clacc `_.
+- **Step 1**: Building and installing `Clacc
+  `_.

  ..
code-block:: console - $ git clone -b clacc/main https://github.com/llvm-doe-org/llvm-project.git - $ cd llvm-project - $ mkdir build && cd build - $ cmake -DCMAKE_INSTALL_PREFIX=../install \ - -DCMAKE_BUILD_TYPE=Release \ - -DLLVM_ENABLE_PROJECTS="clang;lld" \ - -DLLVM_ENABLE_RUNTIMES=openmp \ - -DLLVM_TARGETS_TO_BUILD="host;AMDGPU" \ - -DCMAKE_C_COMPILER=gcc \ - -DCMAKE_CXX_COMPILER=g++ \ - ../llvm - $ make - $ make install + $ git clone -b clacc/main https://github.com/llvm-doe-org/llvm-project.git + $ cd llvm-project + $ mkdir build && cd build + $ cmake -DCMAKE_INSTALL_PREFIX=../install \ + -DCMAKE_BUILD_TYPE=Release \ + -DLLVM_ENABLE_PROJECTS="clang;lld" \ + -DLLVM_ENABLE_RUNTIMES=openmp \ + -DLLVM_TARGETS_TO_BUILD="host;AMDGPU" \ + -DCMAKE_C_COMPILER=gcc \ + -DCMAKE_CXX_COMPILER=g++ \ + ../llvm + $ make + $ make install -- **Step 2**: Setting up environment variables to be able to work from the ``/install`` directory, which is the simplest way. We assume that the ``/install`` directory is located in the path ``/project/project_xxxxxx/Clacc/llvm-project``. +- **Step 2**: Setting up environment variables to be able to work from the ``/install`` + directory, which is the simplest way. We assume that the ``/install`` directory is + located in the path ``/project/project_xxxxxx/Clacc/llvm-project``. -For more advanced usage, which includes for instance modifying ``Clacc``, we refer readers to `"Usage from Build directory" `_ +For more advanced usage, which includes for instance modifying ``Clacc``, we refer +readers to `"Usage from Build directory" +`_ - .. code-block:: console + .. code-block:: console - $ export PATH=/project/project_xxxxxx/Clacc/llvm-project/install/bin:$PATH - $ export LD_LIBRARY_PATH=/project/project_xxxxxx/Clacc/llvm-project/install/lib:$LD_LIBRARY_PATH + $ export PATH=/project/project_xxxxxx/Clacc/llvm-project/install/bin:$PATH + $ export LD_LIBRARY_PATH=/project/project_xxxxxx/Clacc/llvm-project/install/lib:$LD_LIBRARY_PATH -- **Step 3**: Source to source conversion of the `openACC_code.c` code to be printed out to the file `openMP_code.c`: +- **Step 3**: Source to source conversion of the `openACC_code.c` code to be printed out + to the file `openMP_code.c`: .. code-block:: console - $ clang -fopenacc-print=omp -fopenacc-structured-ref-count-omp=no-ompx-hold openACC_code.c > openMP_code.c + $ clang -fopenacc-print=omp -fopenacc-structured-ref-count-omp=no-ompx-hold openACC_code.c > openMP_code.c - Here the flag ``-fopenacc-structured-ref-count-omp=no-ompx-hold`` is introduced to disable the ``ompx_hold`` map type modifier, which is used by the OpenACC ``copy`` clause translation. The ``ompx_hold`` is an OpenMP extension that might not be supported yet by other compilers. + Here the flag ``-fopenacc-structured-ref-count-omp=no-ompx-hold`` is introduced to + disable the ``ompx_hold`` map type modifier, which is used by the OpenACC ``copy`` + clause translation. The ``ompx_hold`` is an OpenMP extension that might not be + supported yet by other compilers. -- **Step 4** Compiling the code with the `cc compiler wrapper `_ +- **Step 4** Compiling the code with the `cc compiler wrapper + `_ .. code-block:: - module load CrayEnv - module load PrgEnv-cray - module load craype-accel-amd-gfx90a - module load rocm/5.2.3 + module load CrayEnv + module load PrgEnv-cray + module load craype-accel-amd-gfx90a + module load rocm/5.2.3 - cc -fopenmp -o executable openMP_code.c + cc -fopenmp -o executable openMP_code.c .. 
callout:: Access exercise material - Code examples for the ``Clacc`` exercise can be accessed in the `content/examples/exercise_clacc` subdirectory by cloning this repository: + Code examples for the ``Clacc`` exercise can be accessed in the `content/examples/exercise_clacc` subdirectory by cloning this repository: - .. code-block:: console + .. code-block:: console - $ git clone https://github.com/ENCCS/gpu-programming.git - $ cd gpu-programming/content/examples/exercise_clacc - $ ls + $ git clone https://github.com/ENCCS/gpu-programming.git + $ cd gpu-programming/content/examples/exercise_clacc + $ ls .. challenge:: Exercise : Translate an OpenACC code to OpenMP - 1. Convert the OpenACC code ``openACC_code.c`` located in ``/exercise_clacc`` with the ``Clacc`` compiler. + 1. Convert the OpenACC code ``openACC_code.c`` located in ``/exercise_clacc`` with the ``Clacc`` compiler. - 2. Compile the generated OpenMP code with the ``cc`` compiler wrapper and run it. + 2. Compile the generated OpenMP code with the ``cc`` compiler wrapper and run it. Translating CUDA to SYCL/DPC++ with SYCLomatic -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Intel offers a tool for CUDA-to-SYCL code migration, included in the Intel oneAPI Basekit. +Intel offers a tool for CUDA-to-SYCL code migration, included in the Intel oneAPI +Basekit. -It is not installed on LUMI, but the general workflow is similar to the HIPify Clang and also requires an existing CUDA installation: +It is not installed on LUMI, but the general workflow is similar to the HIPify Clang and +also requires an existing CUDA installation: - .. code-block:: console + .. code-block:: console - $ dpct program.cu - $ cd dpct_output/ - $ icpx -fsycl program.dp.cpp + $ dpct program.cu + $ cd dpct_output/ + $ icpx -fsycl program.dp.cpp -SYCLomatic can migrate larger projects by using ``-in-root`` and ``-out-root`` flags to process directories recursively. It can also -use compilation database (supported by CMake and other build systems) to deal with more complex project layouts. +SYCLomatic can migrate larger projects by using ``-in-root`` and ``-out-root`` flags to +process directories recursively. It can also use compilation database (supported by +CMake and other build systems) to deal with more complex project layouts. -Please note that the code generated by SYCLomatic relies on oneAPI-specific extensions, and thus cannot be directly used with other -SYCL implementations, such as AdaptiveCpp (hipSYCL). The ``--no-incremental-migration`` flag can be added to ``dpct`` command to minimize, but not -completely avoid, the use of this compatibility layer. That would require manual effort, since some CUDA concepts cannot be directly -mapped to SYCL. +Please note that the code generated by SYCLomatic relies on oneAPI-specific extensions, +and thus cannot be directly used with other SYCL implementations, such as AdaptiveCpp +(hipSYCL). The ``--no-incremental-migration`` flag can be added to ``dpct`` command to +minimize, but not completely avoid, the use of this compatibility layer. That would +require manual effort, since some CUDA concepts cannot be directly mapped to SYCL. -Additionally, CUDA applications might assume certain hardware behavior, such as 32-wide warps. If the target hardware is different -(e.g., AMD MI250 GPUs, used in LUMI, have warp size of 64), the algorithms might need to be adjusted manually. +Additionally, CUDA applications might assume certain hardware behavior, such as 32-wide +warps. 
If the target hardware is different (e.g., AMD MI250 GPUs, used in LUMI, have
+warp size of 64), the algorithms might need to be adjusted manually.

 Conclusion
-^^^^^^^^^^
+~~~~~~~~~~

-This concludes a brief overview of the usage of available tools to convert CUDA codes to HIP and SYCL, and OpenACC codes to OpenMP offloading. In general the translation process for large applications might be incomplete and thus requires manual modification to complete the porting process. It is however worth noting that the accuracy of the translation process requires that applications are written correctly according to the CUDA and OpenACC syntaxes.
+This concludes a brief overview of the available tools for converting CUDA codes to HIP
+and SYCL, and OpenACC codes to OpenMP offloading. In general, the translation of a large
+application might be incomplete and thus requires manual work to finish the porting
+process. It is also worth noting that a reliable translation requires the original
+application to be written in correct CUDA and OpenACC syntax.

 See also
 --------

 - `Hipify GitHub `_
-- `HIPify Reference Guide v5.1 `_
-- `HIP example `_
-- `Porting CUDA to HIP `_
-- `Clacc Main repository README `_
-- `SYCLomatic main mage `_
-- `SYCLomatic documentation `_
+- `HIPify Reference Guide v5.1
+  `_
+- `HIP example
+  `_
+- `Porting CUDA to HIP
+  `_
+- `Clacc Main repository README
+  `_
+- `SYCLomatic main page
+  `_
+- `SYCLomatic documentation
+  `_

 .. keypoints::

-   - Useful tools exist to automatically translate tools from CUDA to HIP and SYCL and from OpenACC to OpenMP, but they may require manual modifications.
-
-
+   - Useful tools exist to automatically translate code from CUDA to HIP and SYCL and from OpenACC to OpenMP, but they may require manual modifications.
diff --git a/content/12-recommendations.rst b/content/12-recommendations.rst
index ef07536d..81f2482e 100644
--- a/content/12-recommendations.rst
+++ b/content/12-recommendations.rst
@@ -3,82 +3,88 @@ Recommendations

 .. questions::

-   - Which GPU programming framework is right for me and my project?
-
+   - Which GPU programming framework is right for me and my project?

 .. instructor-note::

-   - 30 min teaching
-   - 15 min exercises
-
+   - 30 min teaching
+   - 15 min exercises

 Portability
 -----------

-One of the critical factors when diving into GPU programming is the portability of the chosen framework.
-It's crucial to ensure that the framework you decide to utilize is compatible with the GPU or GPUs you intend
-to use. This might seem like a basic step, but it's essential to avoid unnecessary hardware-software mismatches
-that could lead to performance bottlenecks or, worse, a complete failure of the system.
+One of the critical factors when diving into GPU programming is the portability of the
+chosen framework. It's crucial to ensure that the framework you decide to utilize is
+compatible with the GPU or GPUs you intend to use. This might seem like a basic step,
+but it's essential to avoid unnecessary hardware-software mismatches that could lead to
+performance bottlenecks or, worse, a complete failure of the system.
+Moreover, if you're targeting multiple platforms or GPUs, it's wise to consider using +frameworks that support portable kernel-based models or those that come with high-level +language support. The benefit of these frameworks is that they allow for efficient +execution of your code on a variety of hardware configurations without needing +significant alterations. Programming Effort ------------------ -The amount of programming effort required is another factor to consider when choosing a GPU programming framework. -It's advisable to select a framework that supports the programming language you're comfortable with. -This consideration will ensure a smoother learning curve and a more efficient development process. +The amount of programming effort required is another factor to consider when choosing a +GPU programming framework. It's advisable to select a framework that supports the +programming language you're comfortable with. This consideration will ensure a smoother +learning curve and a more efficient development process. -Furthermore, it's important to check the availability of supportive resources for the chosen framework. -Comprehensive documentation, illustrative examples, and an active community are important when learning -a new framework or troubleshooting issues. They not only minimize the time spent on resolving bugs but also -foster continuous learning and mastery of the framework. +Furthermore, it's important to check the availability of supportive resources for the +chosen framework. Comprehensive documentation, illustrative examples, and an active +community are important when learning a new framework or troubleshooting issues. They +not only minimize the time spent on resolving bugs but also foster continuous learning +and mastery of the framework. Performance Requirements ------------------------ -Every application or project has unique performance requirements. Therefore, it's crucial to evaluate the -performance characteristics and optimization capabilities of the potential frameworks before choosing one. -Some frameworks offer extensive optimization features and can automatically tune your code to maximize its -performance. Knowing how well a framework can handle your specific workload requirements can save you +Every application or project has unique performance requirements. Therefore, it's +crucial to evaluate the performance characteristics and optimization capabilities of the +potential frameworks before choosing one. Some frameworks offer extensive optimization +features and can automatically tune your code to maximize its performance. Knowing how +well a framework can handle your specific workload requirements can save you considerable time and resources in the long run. Cost-benefit Analysis --------------------- -Before finalizing your choice of a GPU programming framework, it's recommended to perform a cost-benefit analysis. -Consider the specific requirements of your project, like the processing power needed, the complexity of the tasks, -the amount of data to be processed, and the cost associated with the potential framework. -Understanding these factors will help you determine the most suitable and cost-effective framework for your needs. +Before finalizing your choice of a GPU programming framework, it's recommended to +perform a cost-benefit analysis. Consider the specific requirements of your project, +like the processing power needed, the complexity of the tasks, the amount of data to be +processed, and the cost associated with the potential framework. 
Understanding these +factors will help you determine the most suitable and cost-effective framework for your +needs. Choosing Between Frameworks --------------------------- -The decision of choosing between different GPU programming frameworks often depends on several factors, including: - -- **The specifics of the problem**: Different problems might need different computational capabilities. - Understand your problem thoroughly and evaluate which framework is best equipped to handle it. - -- **Starting point**: If you're starting from scratch, you might have more flexibility in choosing your framework than - if you're building on top of existing code. - -- **Background knowledge of the programmer**: The familiarity of the programmer with certain programming languages or - frameworks plays a big role in the decision-making process. - -- **Time investment**: Some frameworks may have a steeper learning curve but offer more extensive capabilities, - while others might be easier to grasp but provide limited features. - -- **Performance needs**: Some applications require maximum computational power, while others might not. - The performance capabilities of the framework should align with the needs of your project. - -By keeping these considerations in mind, you can make a more informed decision and choose a GPU programming -framework that best suits your needs. +The decision of choosing between different GPU programming frameworks often depends on +several factors, including: + +- **The specifics of the problem**: Different problems might need different + computational capabilities. Understand your problem thoroughly and evaluate which + framework is best equipped to handle it. +- **Starting point**: If you're starting from scratch, you might have more flexibility + in choosing your framework than if you're building on top of existing code. +- **Background knowledge of the programmer**: The familiarity of the programmer with + certain programming languages or frameworks plays a big role in the decision-making + process. +- **Time investment**: Some frameworks may have a steeper learning curve but offer more + extensive capabilities, while others might be easier to grasp but provide limited + features. +- **Performance needs**: Some applications require maximum computational power, while + others might not. The performance capabilities of the framework should align with the + needs of your project. + +By keeping these considerations in mind, you can make a more informed decision and +choose a GPU programming framework that best suits your needs. .. discussion:: Question and discussion time - - Has your mental model of how GPUs work and how they are programmed changed? - - Do you have a better idea about what framework is right for your code? - - What questions do you have? Ask us anything! + - Has your mental model of how GPUs work and how they are programmed changed? + - Do you have a better idea about what framework is right for your code? + - What questions do you have? Ask us anything! diff --git a/content/13-examples.rst b/content/13-examples.rst index 3215613f..13d23938 100644 --- a/content/13-examples.rst +++ b/content/13-examples.rst @@ -5,671 +5,730 @@ GPU programming example: stencil computation .. questions:: - - How do I compile and run code developed using different programming models and frameworks? - - What can I expect from the GPU-ported programs in terms of performance gains / trends and how do I estimate this? 
+ - How do I compile and run code developed using different programming models and frameworks? + - What can I expect from the GPU-ported programs in terms of performance gains / trends and how do I estimate this? .. objectives:: - - To show a self-contained example of parallel computation executed on CPU and GPU using different programming models - - To show differences and consequences of implementing the same algorithm in natural "style" of different models/ frameworks - - To discuss how to assess theoretical and practical performance scaling of GPU codes + - To show a self-contained example of parallel computation executed on CPU and GPU using different programming models + - To show differences and consequences of implementing the same algorithm in natural "style" of different models/ frameworks + - To discuss how to assess theoretical and practical performance scaling of GPU codes .. instructor-note:: - - 35 min teaching - - 30 min exercises - + - 35 min teaching + - 30 min exercises Problem: heat flow in two-dimensional area -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------------------------ -Heat flows in objects according to local temperature differences, as if seeking local equilibrium. The following example defines a rectangular area with two always-warm sides (temperature 70 and 85), two cold sides (temperature 20 and 5) and a cold disk at the center. Because of heat diffusion, temperature of neighboring patches of the area is bound to equalize, changing the overall distribution: +Heat flows in objects according to local temperature differences, as if seeking local +equilibrium. The following example defines a rectangular area with two always-warm sides +(temperature 70 and 85), two cold sides (temperature 20 and 5) and a cold disk at the +center. Because of heat diffusion, temperature of neighboring patches of the area is +bound to equalize, changing the overall distribution: .. figure:: img/stencil/heat_montage.png - :align: center - - Over time, the temperature distribution progresses from the initial state toward an end state where upper triangle is warm and lower is cold. The average temperature tends to (70 + 85 + 20 + 5) / 4 = 45. + :align: center + Over time, the temperature distribution progresses from the initial state toward an + end state where upper triangle is warm and lower is cold. The average temperature + tends to (70 + 85 + 20 + 5) / 4 = 45. Technique: stencil computation -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------------ -Heat transfer in the system above is governed by the partial differential equation(s) describing local variation of the temperature field in time and space. That is, the rate of change of the temperature field :math:`u(x, y, t)` over two spatial dimensions :math:`x` and :math:`y` and time :math:`t` (with rate coefficient :math:`\alpha`) can be modelled via the equation +Heat transfer in the system above is governed by the partial differential equation(s) +describing local variation of the temperature field in time and space. That is, the rate +of change of the temperature field :math:`u(x, y, t)` over two spatial dimensions +:math:`x` and :math:`y` and time :math:`t` (with rate coefficient :math:`\alpha`) can be +modelled via the equation .. math:: - \frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) - -The standard way to numerically solve differential equations is to *discretize* them, i. e. 
to consider only a set/ grid of specific area points at specific moments in time. That way, partial derivatives :math:`{\partial u}` are converted into differences between adjacent grid points :math:`u^{m}(i,j)`, with :math:`m, i, j` denoting time and spatial grid points, respectively. Temperature change in time at a certain point can now be computed from the values of neighboring points at earlier time; the same expression, called *stencil*, is applied to every point on the grid. + + \frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) + +The standard way to numerically solve differential equations is to *discretize* them, i. +e. to consider only a set/ grid of specific area points at specific moments in time. +That way, partial derivatives :math:`{\partial u}` are converted into differences +between adjacent grid points :math:`u^{m}(i,j)`, with :math:`m, i, j` denoting time and +spatial grid points, respectively. Temperature change in time at a certain point can now +be computed from the values of neighboring points at earlier time; the same expression, +called *stencil*, is applied to every point on the grid. .. figure:: img/stencil/stencil.svg - :align: center + :align: center - This simplified model uses an 8x8 grid of data in light blue in state :math:`m`, each location of which has to be updated based on the indicated 5-point stencil in yellow to move to the next time point :math:`m+1`. + This simplified model uses an 8x8 grid of data in light blue in state :math:`m`, + each location of which has to be updated based on the indicated 5-point stencil in + yellow to move to the next time point :math:`m+1`. .. challenge:: Question: stencil applications - Stencil computation is a common occurrence in solving numerical problems. Have you already encountered it? Can you think of a problem that could be formulated this way in your field / area of expertise? - - .. solution:: - - One obvious choice is *convolution* operation, used in image processing to apply various filter kernels; in some contexts, "convolution" and "stencil" are used almost interchangeably. Other related use is for averaging/ pooling adjacent values. + Stencil computation is a common occurrence in solving numerical problems. Have you already encountered it? Can you think of a problem that could be formulated this way in your field / area of expertise? + .. solution:: + + One obvious choice is *convolution* operation, used in image processing to apply various filter kernels; in some contexts, "convolution" and "stencil" are used almost interchangeably. Other related use is for averaging/ pooling adjacent values. Technical considerations ------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~ **1. How fast and/ or accurate can the solution be?** -Spatial resolution of the temperature field is controlled by the number/ density of the grid points. As the full grid update is required to proceed from one time point to the next, stencil computation is the main target of parallelization (on CPU or GPU). +Spatial resolution of the temperature field is controlled by the number/ density of the +grid points. As the full grid update is required to proceed from one time point to the +next, stencil computation is the main target of parallelization (on CPU or GPU). 
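
To make the grid update concrete, a minimal serial sketch of one time step is shown
below. It applies the 5-point stencil expression discussed above to the interior points
of a row-major array; all names (``u``, ``u_new``, ``nx``, ``ny``) are illustrative only
and do not correspond to the variable names used in the course examples that follow.

.. code-block:: cpp

   #include <vector>

   // One explicit time step of the 2D heat equation on an nx-by-ny grid.
   // Boundary values are left untouched (fixed-temperature sides).
   void stencil_step(const std::vector<double> &u, std::vector<double> &u_new,
                     int nx, int ny, double dx, double dy, double dt, double alpha) {
     auto idx = [ny](int i, int j) { return i * ny + j; };   // row-major indexing
     for (int i = 1; i < nx - 1; ++i) {
       for (int j = 1; j < ny - 1; ++j) {
         const double d2x = (u[idx(i - 1, j)] - 2.0 * u[idx(i, j)] + u[idx(i + 1, j)]) / (dx * dx);
         const double d2y = (u[idx(i, j - 1)] - 2.0 * u[idx(i, j)] + u[idx(i, j + 1)]) / (dy * dy);
         u_new[idx(i, j)] = u[idx(i, j)] + dt * alpha * (d2x + d2y);
       }
     }
   }

Every iteration of the double loop is independent of the others, which is precisely what
makes this update a good target for thread- and GPU-parallelization.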
-Moreover, in many cases the chosen time step cannot be arbitrarily large, otherwise the numerical differentiation will fail, and dense/ accurate grids imply small time steps (see inset below), which makes efficient spatial update even more important. +Moreover, in many cases the chosen time step cannot be arbitrarily large, otherwise the +numerical differentiation will fail, and dense/ accurate grids imply small time steps +(see inset below), which makes efficient spatial update even more important. .. solution:: Optional: stencil expression and time-step limit - - Differential equation shown above can be discretized using different schemes. For this example, temperature values at each grid point :math:`u^{m}(i,j)` are updated from one time point (:math:`m`) to the next (:math:`m+1`), using the following expressions: - - .. math:: - u^{m+1}(i,j) = u^m(i,j) + \Delta t \alpha \nabla^2 u^m(i,j) , - - where - - .. math:: - \nabla^2 u &= \frac{u(i-1,j)-2u(i,j)+u(i+1,j)}{(\Delta x)^2} \\ - &+ \frac{u(i,j-1)-2u(i,j)+u(i,j+1)}{(\Delta y)^2} , - - and :math:`\Delta x`, :math:`\Delta y`, :math:`\Delta t` are step sizes in space and time, respectively. - - Time-update schemes often have a limit on the maximum allowed time step :math:`\Delta t`. For the current scheme, it is equal to - - .. math:: - \Delta t_{max} = \frac{(\Delta x)^2 (\Delta y)^2}{2 \alpha ((\Delta x)^2 + (\Delta y)^2)} -**2. What to do with area boundaries?** + Differential equation shown above can be discretized using different schemes. For this example, temperature values at each grid point :math:`u^{m}(i,j)` are updated from one time point (:math:`m`) to the next (:math:`m+1`), using the following expressions: -Naturally, stencil expression can't be applied directly to the outermost grid points that have no outer neighbors. This can be solved by either changing the expression for those points or by adding an additional layer of grid that is used in computing update, but not updated itself -- points of fixed temperature for the sides are being used in this example. + .. math:: + u^{m+1}(i,j) = u^m(i,j) + \Delta t \alpha \nabla^2 u^m(i,j) , -**3. How could the algorithm be optimized further?** + where -In `an earlier episode `_, importance of efficient memory access was already stressed. In the following examples, each grid point (and its neighbors) is treated mostly independently; however, this also means that for 5-point stencil each value of the grid point may be read up to 5 times from memory (even if it's the fast GPU memory). By rearranging the order of mathematical operations, it may be possible to reuse these values in a more efficient way. + .. math:: + \nabla^2 u &= \frac{u(i-1,j)-2u(i,j)+u(i+1,j)}{(\Delta x)^2} \\ + &+ \frac{u(i,j-1)-2u(i,j)+u(i,j+1)}{(\Delta y)^2} , -Another point to note is that even if the solution is propagated in small time steps, not every step might actually be needed for output. Once some *local* region of the field is updated, mathematically nothing prevents it from being updated for the second time step -- even if the rest of the field is still being recalculated -- as long as :math:`t = m-1` values for the region boundary are there when needed. (Of course, this is more complicated to implement and would only give benefits in certain cases.) + and :math:`\Delta x`, :math:`\Delta y`, :math:`\Delta t` are step sizes in space and time, respectively. + Time-update schemes often have a limit on the maximum allowed time step :math:`\Delta t`. For the current scheme, it is equal to -.. 
challenge:: Poll: which programming model/ framework are you most interested in today? + .. math:: + \Delta t_{max} = \frac{(\Delta x)^2 (\Delta y)^2}{2 \alpha ((\Delta x)^2 + (\Delta y)^2)} - - OpenMP offloading (C++) - - SYCL (C++) - - *Python* (``numba``/CUDA) - - Julia +**2. What to do with area boundaries?** +Naturally, stencil expression can't be applied directly to the outermost grid points +that have no outer neighbors. This can be solved by either changing the expression for +those points or by adding an additional layer of grid that is used in computing update, +but not updated itself -- points of fixed temperature for the sides are being used in +this example. + +**3. How could the algorithm be optimized further?** + +In `an earlier episode +`_, +importance of efficient memory access was already stressed. In the following examples, +each grid point (and its neighbors) is treated mostly independently; however, this also +means that for 5-point stencil each value of the grid point may be read up to 5 times +from memory (even if it's the fast GPU memory). By rearranging the order of mathematical +operations, it may be possible to reuse these values in a more efficient way. + +Another point to note is that even if the solution is propagated in small time steps, +not every step might actually be needed for output. Once some *local* region of the +field is updated, mathematically nothing prevents it from being updated for the second +time step -- even if the rest of the field is still being recalculated -- as long as +:math:`t = m-1` values for the region boundary are there when needed. (Of course, this +is more complicated to implement and would only give benefits in certain cases.) + +.. challenge:: Poll: which programming model/ framework are you most interested in today? + + - OpenMP offloading (C++) + - SYCL (C++) + - *Python* (``numba``/CUDA) + - Julia The following table will aid you in navigating the rest of this section: .. admonition:: Episode guide - - `Sequential and OpenMP-threaded code `__ in C++, including compilation/ running instructions - - `Naive GPU parallelization `__, including SYCL compilation instructions - - `GPU code with device data management `__ (OpenMP, SYCL) - - `Python implementation `__, including running instructions on `Google Colab `__ - - `Julia implementation `__, including running instructions - + - `Sequential and OpenMP-threaded code + `__ + in C++, including compilation/ running instructions + - `Naive GPU parallelization + `__, + including SYCL compilation instructions + - `GPU code with device data management + `__ + (OpenMP, SYCL) + - `Python implementation + `__, + including running instructions on `Google Colab + `__ + - `Julia implementation + `__, + including running instructions Sequential and thread-parallel program in C++ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +--------------------------------------------- .. callout:: Trying out code examples - Source files of the examples presented for the rest of this episode are available in the `content/examples/stencil/ `_ directory. - To download them to your preferred directory on the cluster (f.e. ``/scratch/project_<#>//``), you can use Git: - - .. code-block:: console + Source files of the examples presented for the rest of this episode are available in the `content/examples/stencil/ `_ directory. + To download them to your preferred directory on the cluster (f.e. 
``/scratch/project_<#>//``), you can use Git: - $ git clone https://github.com/ENCCS/gpu-programming.git - $ cd gpu-programming/content/examples/stencil/ - $ ls + .. code-block:: console - .. warning:: + $ git clone https://github.com/ENCCS/gpu-programming.git + $ cd gpu-programming/content/examples/stencil/ + $ ls - Don't forget to ``git pull`` for the latest updates if you already have the content from the first day of the workshop! + .. warning:: -If we assume the grid point values to be truly independent *for a single time step*, stencil application procedure may be straightforwardly written as a loop over the grid points, as shown below in tab "Stencil update". (General structure of the program and the default parameter values for the problem model are also provided for reference.) CPU-thread parallelism can then be enabled by a single OpenMP ``#pragma``: + Don't forget to ``git pull`` for the latest updates if you already have the content from the first day of the workshop! + +If we assume the grid point values to be truly independent *for a single time step*, +stencil application procedure may be straightforwardly written as a loop over the grid +points, as shown below in tab "Stencil update". (General structure of the program and +the default parameter values for the problem model are also provided for reference.) +CPU-thread parallelism can then be enabled by a single OpenMP ``#pragma``: .. tabs:: - .. tab:: Stencil update + .. tab:: Stencil update - .. literalinclude:: examples/stencil/base/core.cpp - :language: cpp - :emphasize-lines: 25 + .. literalinclude:: examples/stencil/base/core.cpp + :language: cpp + :emphasize-lines: 25 - .. tab:: Main function + .. tab:: Main function - .. literalinclude:: examples/stencil/base/main.cpp - :language: cpp - :emphasize-lines: 37 - - .. tab:: Default params + .. literalinclude:: examples/stencil/base/main.cpp + :language: cpp + :emphasize-lines: 37 - .. literalinclude:: examples/stencil/base/heat.h - :language: cpp - :lines: 7-34 + .. tab:: Default params + .. literalinclude:: examples/stencil/base/heat.h + :language: cpp + :lines: 7-34 .. solution:: Optional: compiling the executables - To compile executable files for the OpenMP-based variants, follow the instructions below: - - .. code-block:: console - - salloc -A project_465001310 -p small-g -N 1 -c 8 -n 1 --gpus-per-node=1 -t 1:00:00 - - module load LUMI/24.03 - module load partition/G - module load rocm/6.0.3 - module load PrgEnv-cray/8.5.0 - - cd base/ - make all - - Afterwards login into a compute node and test the executables (or just ``srun `` directly): - - .. code-block:: console - - $ srun --pty bash - - $ ./stencil - $ ./stencil_off - $ ./stencil_data - - $ exit - - If everything works well, the output should look similar to this: - - .. code-block:: console - - $ ./stencil - Average temperature, start: 59.763305 - Average temperature at end: 59.281239 - Control temperature at end: 59.281239 - Iterations took 0.566 seconds. - $ ./stencil_off - Average temperature, start: 59.763305 - Average temperature at end: 59.281239 - Control temperature at end: 59.281239 - Iterations took 3.792 seconds. - $ ./stencil_data - Average temperature, start: 59.763305 - Average temperature at end: 59.281239 - Control temperature at end: 59.281239 - Iterations took 1.211 seconds. - $ + To compile executable files for the OpenMP-based variants, follow the instructions below: + + .. 
code-block:: console + + salloc -A project_465001310 -p small-g -N 1 -c 8 -n 1 --gpus-per-node=1 -t 1:00:00 + + module load LUMI/24.03 + module load partition/G + module load rocm/6.0.3 + module load PrgEnv-cray/8.5.0 + cd base/ + make all + + Afterwards login into a compute node and test the executables (or just ``srun `` directly): + + .. code-block:: console + + $ srun --pty bash + + $ ./stencil + $ ./stencil_off + $ ./stencil_data + + $ exit + + If everything works well, the output should look similar to this: + + .. code-block:: console + + $ ./stencil + Average temperature, start: 59.763305 + Average temperature at end: 59.281239 + Control temperature at end: 59.281239 + Iterations took 0.566 seconds. + $ ./stencil_off + Average temperature, start: 59.763305 + Average temperature at end: 59.281239 + Control temperature at end: 59.281239 + Iterations took 3.792 seconds. + $ ./stencil_data + Average temperature, start: 59.763305 + Average temperature at end: 59.281239 + Control temperature at end: 59.281239 + Iterations took 1.211 seconds. + $ CPU parallelization: timings ----------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -(**NOTE**: for thread-parallel runs it is necessary to request multiple CPU cores. In LUMI-G partitions, this can be done by asking for multiple GPUs; an alternative is to use -C partitions.) +(**NOTE**: for thread-parallel runs it is necessary to request multiple CPU cores. In +LUMI-G partitions, this can be done by asking for multiple GPUs; an alternative is to +use -C partitions.) -For later comparison, some benchmarks of the OpenMP thread-parallel implementation are provided below: +For later comparison, some benchmarks of the OpenMP thread-parallel implementation are +provided below: .. list-table:: Run times of OpenMP-enabled executable, s - :widths: 25 25 25 - :header-rows: 1 - - * - Job size - - 1 CPU core - - 32 CPU cores - * - S:2000 T:500 - - 1.402 - - 0.064 - * - S:2000 T:5000 - - 13.895 - - 0.538 - * - S:2000 T:10000 - - 27.753 - - 1.071 - * - S:4000 T:500 - - 5.727 - - 0.633 - * - S:8000 T:500 - - 24.130 - - 16.616 - -A closer look reveals that the computation time scales very nicely with increasing **time steps**: + :widths: 25 25 25 + :header-rows: 1 + + - - Job size + - 1 CPU core + - 32 CPU cores + - - S:2000 T:500 + - 1.402 + - 0.064 + - - S:2000 T:5000 + - 13.895 + - 0.538 + - - S:2000 T:10000 + - 27.753 + - 1.071 + - - S:4000 T:500 + - 5.727 + - 0.633 + - - S:8000 T:500 + - 24.130 + - 16.616 + +A closer look reveals that the computation time scales very nicely with increasing +**time steps**: .. figure:: img/stencil/omp-cpu-scaling-step.png - :align: center - -However, for larger **grid sizes** the parallelization becomes inefficient -- as the individual chunks of the grid get too large to fit into CPU cache, threads become bound by the speed of RAM reads/writes: + :align: center -.. figure:: img/stencil/omp-cpu-scaling-grid.png - :align: center +However, for larger **grid sizes** the parallelization becomes inefficient -- as the +individual chunks of the grid get too large to fit into CPU cache, threads become bound +by the speed of RAM reads/writes: +.. figure:: img/stencil/omp-cpu-scaling-grid.png + :align: center .. challenge:: Discussion: heat flow computation scaling - 1. How is heat flow computation **expected** to scale with respect to the number of time steps? - - a. Linearly - b. Quadratically - c. Exponentially - - 2. How is stencil application (grid update) **expected** to scale with respect to the size of the grid side? 
- - a. Linearly - b. Quadratically - c. Exponentially - - 3. (Optional) Do you expect GPU-accelerated computations to follow the above-mentioned trends? Why/ why not? - - .. solution:: - - 1. The answer is a: since each time-step follows the previous one and involves a similar number of operations, then the update time per step will be more or less constant. - 2. The answer is b: since stencil application is independent for every grid point, the update time will be proportional to the number of points, i.e. side * side. + 1. How is heat flow computation **expected** to scale with respect to the number of time steps? + + a. Linearly + b. Quadratically + c. Exponentially + + 2. How is stencil application (grid update) **expected** to scale with respect to the size of the grid side? + + a. Linearly + b. Quadratically + c. Exponentially + + 3. (Optional) Do you expect GPU-accelerated computations to follow the above-mentioned trends? Why/ why not? + .. solution:: + + 1. The answer is a: since each time-step follows the previous one and involves a similar number of operations, then the update time per step will be more or less constant. + 2. The answer is b: since stencil application is independent for every grid point, the update time will be proportional to the number of points, i.e. side * side. GPU parallelization: first steps -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +-------------------------------- -Let's apply several techniques presented in previous episodes to make stencil update run on GPU. +Let's apply several techniques presented in previous episodes to make stencil update run +on GPU. -OpenMP (or OpenACC) offloading requires to define a region to be executed in parallel as well as data that shall be copied over/ used in GPU memory. -Similarly, SYCL programming model offers convenient ways to define execution kernels, as well as context to run them in (called queue). +OpenMP (or OpenACC) offloading requires to define a region to be executed in parallel as +well as data that shall be copied over/ used in GPU memory. Similarly, SYCL programming +model offers convenient ways to define execution kernels, as well as context to run them +in (called queue). Changes of stencil update code for OpenMP and SYCL are shown in the tabs below: .. tabs:: - .. tab:: OpenMP (naive) + .. tab:: OpenMP (naive) - .. literalinclude:: examples/stencil/base/core-off.cpp - :language: cpp - :emphasize-lines: 25-26 - - .. tab:: SYCL (naive) + .. literalinclude:: examples/stencil/base/core-off.cpp + :language: cpp + :emphasize-lines: 25-26 - .. literalinclude:: examples/stencil/sycl/core-naive.cpp - :language: cpp - :emphasize-lines: 24-27,29,43-45 + .. tab:: SYCL (naive) + .. literalinclude:: examples/stencil/sycl/core-naive.cpp + :language: cpp + :emphasize-lines: 24-27,29,43-45 .. callout:: Loading SYCL modules on LUMI - - As SYCL is placed on top of ROCm/HIP (or CUDA) software stack, running SYCL executables may require respective modules to be loaded. On current nodes, it can be done as follows: - - .. code-block:: console - - # salloc -A project_465001310 -p small-g -N 1 -c 8 -n 1 --gpus-per-node=1 -t 1:00:00 - - module load LUMI/24.03 - module load partition/G - module load rocm/6.0.3 - module use /appl/local/csc/modulefiles - module load acpp/24.06.0 + + As SYCL is placed on top of ROCm/HIP (or CUDA) software stack, running SYCL executables may require respective modules to be loaded. On current nodes, it can be done as follows: + + .. 
code-block:: console + + # salloc -A project_465001310 -p small-g -N 1 -c 8 -n 1 --gpus-per-node=1 -t 1:00:00 + + module load LUMI/24.03 + module load partition/G + module load rocm/6.0.3 + module use /appl/local/csc/modulefiles + module load acpp/24.06.0 .. solution:: Optional: compiling the SYCL executables - As previously, you are welcome to generate your own executables: - - .. code-block:: console - - $ cd ../sycl/ - (give the following lines some time, probably a couple of min) - $ acpp -O2 -o stencil_naive core-naive.cpp io.cpp main-naive.cpp pngwriter.c setup.cpp utilities.cpp - $ acpp -O2 -o stencil core.cpp io.cpp main.cpp pngwriter.c setup.cpp utilities.cpp - - $ srun stencil_naive - $ srun stencil - - If everything works well, the output should look similar to this: - - .. code-block:: console - - $ srun stencil_naive - Average temperature, start: 59.763305 - Average temperature at end: 59.281239 - Control temperature at end: 59.281239 - Iterations took 2.086 seconds. - $ srun stencil - Average temperature, start: 59.763305 - Average temperature at end: 59.281239 - Control temperature at end: 59.281239 - Iterations took 0.052 seconds. + As previously, you are welcome to generate your own executables: + + .. code-block:: console + + $ cd ../sycl/ + (give the following lines some time, probably a couple of min) + $ acpp -O2 -o stencil_naive core-naive.cpp io.cpp main-naive.cpp pngwriter.c setup.cpp utilities.cpp + $ acpp -O2 -o stencil core.cpp io.cpp main.cpp pngwriter.c setup.cpp utilities.cpp + + $ srun stencil_naive + $ srun stencil + + If everything works well, the output should look similar to this: + .. code-block:: console + + $ srun stencil_naive + Average temperature, start: 59.763305 + Average temperature at end: 59.281239 + Control temperature at end: 59.281239 + Iterations took 2.086 seconds. + $ srun stencil + Average temperature, start: 59.763305 + Average temperature at end: 59.281239 + Control temperature at end: 59.281239 + Iterations took 0.052 seconds. .. challenge:: Exercise: naive GPU ports - Test your compiled executables ``base/stencil``, ``base/stencil_off`` and ``sycl/stencil_naive``. Try changing problem size parameters: - - - ``srun stencil_naive 2000 2000 5000`` - - Things to look for: - - - How computation times change? - - Do the results align to your expectations? - - - .. solution:: - - You might notice that the GPU-"ported" versions actually run slower than the single-CPU-core version! In fact, the scaling behavior of all three variants is similar and expected, which is a good sign; only the "computation unit cost" is different. You can compare benchmark summaries in the tabs below: + Test your compiled executables ``base/stencil``, ``base/stencil_off`` and ``sycl/stencil_naive``. Try changing problem size parameters: + + - ``srun stencil_naive 2000 2000 5000`` + + Things to look for: + + - How computation times change? + - Do the results align to your expectations? + - .. tabs:: + .. solution:: - .. tab:: Sequential + You might notice that the GPU-"ported" versions actually run slower than the single-CPU-core version! In fact, the scaling behavior of all three variants is similar and expected, which is a good sign; only the "computation unit cost" is different. You can compare benchmark summaries in the tabs below: - .. figure:: img/stencil/cpu-seq-scaling.png - :align: center + .. tabs:: - .. tab:: OpenMP (naive) + .. tab:: Sequential - .. figure:: img/stencil/omp-gpu-naive-scaling.png - :align: center + .. 
figure:: img/stencil/cpu-seq-scaling.png + :align: center - .. tab:: SYCL (naive) + .. tab:: OpenMP (naive) - .. figure:: img/stencil/omp-sycl-naive-scaling-new.png - :align: center + .. figure:: img/stencil/omp-gpu-naive-scaling.png + :align: center + .. tab:: SYCL (naive) + + .. figure:: img/stencil/omp-sycl-naive-scaling-new.png + :align: center GPU parallelization: data movement -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +---------------------------------- Why the porting approach above seems to be quite inefficient? On each step, we: -- re-allocate GPU memory, -- copy the data from CPU to GPU, -- perform the computation, +- re-allocate GPU memory, +- copy the data from CPU to GPU, +- perform the computation, - then copy the data back. -But overhead can be reduced by taking care to minimize data transfers between *host* and *device* memory: +But overhead can be reduced by taking care to minimize data transfers between *host* and +*device* memory: - allocate GPU memory once at the start of the program, - only copy the data from GPU to CPU when we need it, -- swap the GPU buffers between timesteps, like we do with CPU buffers. (OpenMP does this automatically.) +- swap the GPU buffers between timesteps, like we do with CPU buffers. (OpenMP does this + automatically.) -Changes of stencil update code as well as the main program are shown in tabs below. +Changes of stencil update code as well as the main program are shown in tabs below. .. tabs:: - .. tab:: OpenMP + .. tab:: OpenMP - .. literalinclude:: examples/stencil/base/core-data.cpp - :language: cpp - :emphasize-lines: 25,40-75 - - .. tab:: SYCL + .. literalinclude:: examples/stencil/base/core-data.cpp + :language: cpp + :emphasize-lines: 25,40-75 - .. literalinclude:: examples/stencil/sycl/core.cpp - :language: cpp - :emphasize-lines: 13-14,25,40-50 + .. tab:: SYCL - .. tab:: Python + .. literalinclude:: examples/stencil/sycl/core.cpp + :language: cpp + :emphasize-lines: 13-14,25,40-50 - .. literalinclude:: examples/stencil/python/core_cuda.py - :language: py - :lines: 6-34 - :emphasize-lines: 14-16,18 + .. tab:: Python - .. tab:: main() (SYCL) + .. literalinclude:: examples/stencil/python/core_cuda.py + :language: py + :lines: 6-34 + :emphasize-lines: 14-16,18 - .. literalinclude:: examples/stencil/sycl/main.cpp - :language: cpp - :emphasize-lines: 38-39,44-45,51,56,59,75-77 + .. tab:: main() (SYCL) + .. literalinclude:: examples/stencil/sycl/main.cpp + :language: cpp + :emphasize-lines: 38-39,44-45,51,56,59,75-77 .. challenge:: Exercise: updated GPU ports - Test your compiled executables ``base/stencil_data`` and ``sycl/stencil``. Try changing problem size parameters: - - - ``srun stencil 2000 2000 5000`` - - Things to look for: - - - How computation times change this time around? - - What largest grid and/or longest propagation time can you get in 10 s on your machine? - - - .. solution:: - - .. tabs:: - - .. tab:: OpenMP data mapping - - Using GPU offloading with mapped device data, it is possible to achieve performance gains compared to thread-parallel version for larger grid sizes, due to the fact that the latter version becomes essentially RAM-bound, but the former does not. - - .. figure:: img/stencil/omp-cpu-vs-gpu.png - :align: center - - .. tab:: SYCL device buffers - - Below you can find the summary graphs for step- and grid- scaling of the stencil update task. 
Because of the more explicit programming approach, SYCL GPU port is much faster than OpenMP-offloaded version, comparable with thread-parallel CPU version running on all cores of a single node. - - .. figure:: img/stencil/summary-scaling-step-new.png - :align: center - - .. figure:: img/stencil/summary-scaling-grid-new.png - :align: center + Test your compiled executables ``base/stencil_data`` and ``sycl/stencil``. Try changing problem size parameters: + + - ``srun stencil 2000 2000 5000`` + + Things to look for: + + - How computation times change this time around? + - What largest grid and/or longest propagation time can you get in 10 s on your machine? + + + .. solution:: + + .. tabs:: + + .. tab:: OpenMP data mapping + + Using GPU offloading with mapped device data, it is possible to achieve performance gains compared to thread-parallel version for larger grid sizes, due to the fact that the latter version becomes essentially RAM-bound, but the former does not. + .. figure:: img/stencil/omp-cpu-vs-gpu.png + :align: center + + .. tab:: SYCL device buffers + + Below you can find the summary graphs for step- and grid- scaling of the stencil update task. Because of the more explicit programming approach, SYCL GPU port is much faster than OpenMP-offloaded version, comparable with thread-parallel CPU version running on all cores of a single node. + + .. figure:: img/stencil/summary-scaling-step-new.png + :align: center + + .. figure:: img/stencil/summary-scaling-grid-new.png + :align: center Python: JIT and GPU acceleration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +-------------------------------- -As mentioned `previously `_, Numba package allows developers to just-in-time (JIT) compile Python code to run fast on CPUs, but can also be used for JIT compiling for (NVIDIA) GPUs. JIT seems to work well on loop-based, computationally heavy functions, so trying it out is a nice choice for initial source version: +As mentioned `previously +`_, Numba package +allows developers to just-in-time (JIT) compile Python code to run fast on CPUs, but can +also be used for JIT compiling for (NVIDIA) GPUs. JIT seems to work well on loop-based, +computationally heavy functions, so trying it out is a nice choice for initial source +version: .. tabs:: - .. tab:: Stencil update + .. tab:: Stencil update - .. literalinclude:: examples/stencil/python/core.py - :language: py - :lines: 6-29 - :emphasize-lines: 17 - - .. tab:: Data generation + .. literalinclude:: examples/stencil/python/core.py + :language: py + :lines: 6-29 + :emphasize-lines: 17 - .. literalinclude:: examples/stencil/python/heat.py - :language: py - :lines: 57-78 - :emphasize-lines: 1 + .. tab:: Data generation + .. literalinclude:: examples/stencil/python/heat.py + :language: py + :lines: 57-78 + :emphasize-lines: 1 -The alternative approach would be to rewrite stencil update code in NumPy style, exploiting loop vectorization. +The alternative approach would be to rewrite stencil update code in NumPy style, +exploiting loop vectorization. .. callout:: Trying out Python examples - You can run provided code examples on Google Colab using instructions provided in the `Setup `_, your local machine, or LUMI node (non-GPU variants). On LUMI, you can set up Python distribution as following: - - .. code-block:: console + You can run provided code examples on Google Colab using instructions provided in the `Setup `_, your local machine, or LUMI node (non-GPU variants). 
On LUMI, you can set up Python distribution as following: - $ module load cray-python/3.9.13.1 - (install needed dependencies locally) - $ pip3 install --user numba matplotlib - $ cd ../python/ - (make sure you have active allocation) - $ srun python3 main.py + .. code-block:: console + $ module load cray-python/3.9.13.1 + (install needed dependencies locally) + $ pip3 install --user numba matplotlib + $ cd ../python/ + (make sure you have active allocation) + $ srun python3 main.py Short summary of a typical Colab run is provided below: .. list-table:: Run times of Numba JIT-enabled Python program, s - :widths: 25 25 25 25 25 - :header-rows: 1 - - * - Job size - - JIT (LUMI) - - JIT (Colab) - - Job size - - no JIT (Colab) - * - S:2000 T:500 - - 1.648 - - 8.495 - - S:200 T:50 - - 5.318 - * - S:2000 T:200 - - 0.787 - - 3.524 - - S:200 T:20 - - 1.859 - * - S:1000 T:500 - - 0.547 - - 2.230 - - S:100 T:50 - - 1.156 - -Numba's ``@vectorize`` and ``@guvectorize`` decorators offer an interface to create CPU- (or GPU-) accelerated *Python* functions without explicit implementation details. However, such functions become increasingly complicated to write (and optimize by the compiler) with increasing complexity of the computations within. - -Numba also offers direct CUDA-based kernel programming, which can be the best choice for those already familiar with CUDA. Example for stencil update written in Numba CUDA is shown in the `data movement section `_, tab "Python". In this case, data transfer functions ``devdata = cuda.to_device(data)`` and ``devdata.copy_to_host(data)`` (see ``main_cuda.py``) are already provided by Numba package. - + :widths: 25 25 25 25 25 + :header-rows: 1 + + - - Job size + - JIT (LUMI) + - JIT (Colab) + - Job size + - no JIT (Colab) + - - S:2000 T:500 + - 1.648 + - 8.495 + - S:200 T:50 + - 5.318 + - - S:2000 T:200 + - 0.787 + - 3.524 + - S:200 T:20 + - 1.859 + - - S:1000 T:500 + - 0.547 + - 2.230 + - S:100 T:50 + - 1.156 + +Numba's ``@vectorize`` and ``@guvectorize`` decorators offer an interface to create CPU- +(or GPU-) accelerated *Python* functions without explicit implementation details. +However, such functions become increasingly complicated to write (and optimize by the +compiler) with increasing complexity of the computations within. + +Numba also offers direct CUDA-based kernel programming, which can be the best choice for +those already familiar with CUDA. Example for stencil update written in Numba CUDA is +shown in the `data movement section +`_, +tab "Python". In this case, data transfer functions ``devdata = cuda.to_device(data)`` +and ``devdata.copy_to_host(data)`` (see ``main_cuda.py``) are already provided by Numba +package. .. challenge:: Exercise: CUDA acceleration in Python - Using Google Colab (or your own machine), run provided Numba-CUDA Python program. Try changing problem size parameters: - - - ``args.rows, args.cols, args.nsteps = 2000, 2000, 5000`` for notebooks, - - [``srun``] ``python3 main.py 2000 2000 5000`` for command line. - - Things to look for: - - - How computation times change? - - Do you get better performance than from JIT-compiled CPU version? How far can you push the problem size? - - Are you able to monitor the GPU usage? - - - .. solution:: - - Some numbers from Colab: - - .. 
list-table:: Run times of Numba CUDA Python program, s - :widths: 25 25 25 25 - :header-rows: 1 - - * - Job size - - JIT (LUMI) - - JIT (Colab) - - CUDA (Colab) - * - S:2000 T:500 - - 1.648 - - 8.495 - - 1.079 - * - S:2000 T:2000 - - 6.133 - - 36.61 - - 3.931 - * - S:5000 T:500 - - 9.478 - - 57.19 - - 6.448 + Using Google Colab (or your own machine), run provided Numba-CUDA Python program. Try changing problem size parameters: + + - ``args.rows, args.cols, args.nsteps = 2000, 2000, 5000`` for notebooks, + - [``srun``] ``python3 main.py 2000 2000 5000`` for command line. + + Things to look for: + + - How computation times change? + - Do you get better performance than from JIT-compiled CPU version? How far can you push the problem size? + - Are you able to monitor the GPU usage? + + .. solution:: + + Some numbers from Colab: + + .. list-table:: Run times of Numba CUDA Python program, s + :widths: 25 25 25 25 + :header-rows: 1 + + * - Job size + - JIT (LUMI) + - JIT (Colab) + - CUDA (Colab) + * - S:2000 T:500 + - 1.648 + - 8.495 + - 1.079 + * - S:2000 T:2000 + - 6.133 + - 36.61 + - 3.931 + * - S:5000 T:500 + - 9.478 + - 57.19 + - 6.448 Julia GPU acceleration -~~~~~~~~~~~~~~~~~~~~~~ +---------------------- -A Julia version of the stencil example above can be found below (a simplified version of the HeatEquation module at https://github.com/ENCCS/HeatEquation.jl). -The source files are also available in the `content/examples/stencil/julia `_ directory of this repository. +A Julia version of the stencil example above can be found below (a simplified version of +the HeatEquation module at https://github.com/ENCCS/HeatEquation.jl). The source files +are also available in the `content/examples/stencil/julia +`_ +directory of this repository. To run the example on LUMI CPU partition, type: .. code-block:: console - $ # interactive CPU node - $ srun --account=project_465001310 --partition=standard --nodes=1 --cpus-per-task=32 --ntasks-per-node=1 --time=01:00:00 --pty bash - $ # load Julia env - $ module purge - $ module use /appl/local/csc/modulefiles - $ module load julia/1.9.0 - $ # in directory with Project.toml and source files, instantiate an environment to install packages - $ julia --project -e "using Pkg ; Pkg.instantiate()" - $ # finally run - $ julia --project main.jl + $ # interactive CPU node + $ srun --account=project_465001310 --partition=standard --nodes=1 --cpus-per-task=32 --ntasks-per-node=1 --time=01:00:00 --pty bash + $ # load Julia env + $ module purge + $ module use /appl/local/csc/modulefiles + $ module load julia/1.9.0 + $ # in directory with Project.toml and source files, instantiate an environment to install packages + $ julia --project -e "using Pkg ; Pkg.instantiate()" + $ # finally run + $ julia --project main.jl -To run on the GPU partition, use instead the ``srun`` command +To run on the GPU partition, use instead the ``srun`` command .. code-block:: console - $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash - + $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash .. callout:: Optional dependency - Note that the ``Plots.jl`` dependency is commented out in ``main.jl`` and ``Project.toml``. This saves ~2 minute precompilation time when you first instantiate the Julia environment. 
To generate plots, just uncomment the commented ``Plots.jl`` dependency in ``Project.toml``, instantiate again, and import and use ``Plots`` in ``main.jl``. + Note that the ``Plots.jl`` dependency is commented out in ``main.jl`` and ``Project.toml``. This saves ~2 minute precompilation time when you first instantiate the Julia environment. To generate plots, just uncomment the commented ``Plots.jl`` dependency in ``Project.toml``, instantiate again, and import and use ``Plots`` in ``main.jl``. .. tabs:: - .. tab:: main.jl - - .. literalinclude:: examples/stencil/julia/main.jl - :language: julia + .. tab:: main.jl - .. tab:: core.jl + .. literalinclude:: examples/stencil/julia/main.jl + :language: julia - .. literalinclude:: examples/stencil/julia/core.jl - :language: julia + .. tab:: core.jl - .. tab:: heat.jl + .. literalinclude:: examples/stencil/julia/core.jl + :language: julia - .. literalinclude:: examples/stencil/julia/heat.jl - :language: julia + .. tab:: heat.jl - .. tab:: Project.toml + .. literalinclude:: examples/stencil/julia/heat.jl + :language: julia - .. literalinclude:: examples/stencil/julia/Project.toml - :language: julia + .. tab:: Project.toml + .. literalinclude:: examples/stencil/julia/Project.toml + :language: julia .. challenge:: Exercise: Julia port to GPUs - Carefully inspect all Julia source files and consider the following questions: + Carefully inspect all Julia source files and consider the following questions: - 1. Which functions should be ported to run on GPU? - 2. Look at the :meth:`initialize!` function and how it uses the ``arraytype`` argument. This could be done more compactly and elegantly, but this solution solves scalar indexing errors. What are scalar indexing errors? - 3. Try to start sketching GPU-ported versions of the key functions. - 4. When you have a version running on a GPU (your own or the solution provided below), try benchmarking it by adding ``@btime`` in front of :meth:`simulate!` in ``main.jl``. Benchmark also the CPU version, and compare. + 1. Which functions should be ported to run on GPU? + 2. Look at the :meth:`initialize!` function and how it uses the ``arraytype`` argument. This could be done more compactly and elegantly, but this solution solves scalar indexing errors. What are scalar indexing errors? + 3. Try to start sketching GPU-ported versions of the key functions. + 4. When you have a version running on a GPU (your own or the solution provided below), try benchmarking it by adding ``@btime`` in front of :meth:`simulate!` in ``main.jl``. Benchmark also the CPU version, and compare. - .. solution:: Hints + .. solution:: Hints - - create a new function :meth:`evolve_gpu!` which contains the GPU kernelized version of :meth:`evolve!` - - in the loop over timesteps in :meth:`simulate!`, you will need a conditional like ``if typeof(curr.data) <: ROCArray`` to call your GPU-ported function - - you cannot pass the struct ``Field`` to the kernel. You will instead need to directly pass the array ``Field.data``. This also necessitates passing in other variables like ``curr.dx^2``, etc. + - create a new function :meth:`evolve_gpu!` which contains the GPU kernelized version of :meth:`evolve!` + - in the loop over timesteps in :meth:`simulate!`, you will need a conditional like ``if typeof(curr.data) <: ROCArray`` to call your GPU-ported function + - you cannot pass the struct ``Field`` to the kernel. You will instead need to directly pass the array ``Field.data``. 
This also necessitates passing in other variables like ``curr.dx^2``, etc. - .. solution:: More hints + .. solution:: More hints - - since the data is two-dimensional, you'll need ``i = (blockIdx().x - 1) * blockDim().x + threadIdx().x`` and ``j = (blockIdx().y - 1) * blockDim().y + threadIdx().y`` - - to not overindex the 2D array, you can use a conditional like ``if i > 1 && j > 1 && i < nx+2 && j < ny+2`` - - when calling the kernel, you can set the number of threads and blocks like ``xthreads = ythreads = 16`` and ``xblocks, yblocks = cld(curr.nx, xthreads), cld(curr.ny, ythreads)``, and then call it with, e.g., ``@roc threads=(xthreads, ythreads) blocks = (xblocks, yblocks) evolve_rocm!(curr.data, prev.data, curr.dx^2, curr.dy^2, nx, ny, a, dt)``. + - since the data is two-dimensional, you'll need ``i = (blockIdx().x - 1) * blockDim().x + threadIdx().x`` and ``j = (blockIdx().y - 1) * blockDim().y + threadIdx().y`` + - to not overindex the 2D array, you can use a conditional like ``if i > 1 && j > 1 && i < nx+2 && j < ny+2`` + - when calling the kernel, you can set the number of threads and blocks like ``xthreads = ythreads = 16`` and ``xblocks, yblocks = cld(curr.nx, xthreads), cld(curr.ny, ythreads)``, and then call it with, e.g., ``@roc threads=(xthreads, ythreads) blocks = (xblocks, yblocks) evolve_rocm!(curr.data, prev.data, curr.dx^2, curr.dy^2, nx, ny, a, dt)``. - .. solution:: + .. solution:: - 1. The :meth:`evolve!` and :meth:`simulate!` functions need to be ported. The ``main.jl`` file also needs to be updated to work with GPU arrays. - 2. "Scalar indexing" is where you iterate over a GPU array, which would be excruciatingly slow and is indeed only allowed in interactive REPL sessions. Without the if-statements in the :meth:`initialize!` function, the :meth:`generate_field!` method would be doing disallowed scalar indexing if you were running on a GPU. - 3. The GPU-ported version is found below. Try it out on both CPU and GPU and observe the speedup. Play around with array size to see if the speedup is affected. You can also play around with the ``xthreads`` and ``ythreads`` variables to see if it changes anything. + 1. The :meth:`evolve!` and :meth:`simulate!` functions need to be ported. The ``main.jl`` file also needs to be updated to work with GPU arrays. + 2. "Scalar indexing" is where you iterate over a GPU array, which would be excruciatingly slow and is indeed only allowed in interactive REPL sessions. Without the if-statements in the :meth:`initialize!` function, the :meth:`generate_field!` method would be doing disallowed scalar indexing if you were running on a GPU. + 3. The GPU-ported version is found below. Try it out on both CPU and GPU and observe the speedup. Play around with array size to see if the speedup is affected. You can also play around with the ``xthreads`` and ``ythreads`` variables to see if it changes anything. - .. tabs:: + .. tabs:: - .. tab:: main_gpu.jl + .. tab:: main_gpu.jl - .. literalinclude:: examples/stencil/julia/main_gpu.jl - :language: julia + .. literalinclude:: examples/stencil/julia/main_gpu.jl + :language: julia - .. tab:: core_gpu.jl - - .. literalinclude:: examples/stencil/julia/core_gpu.jl - :language: julia + .. tab:: core_gpu.jl + .. literalinclude:: examples/stencil/julia/core_gpu.jl + :language: julia See also -~~~~~~~~ +-------- -This section leans heavily on source code and material created for several other computing workshops -by `ENCCS `_ and `CSC `_ and adapted for the purposes of this lesson. 
-If you want to know more about specific programming models / framework, definitely check these out! +This section leans heavily on source code and material created for several other +computing workshops by `ENCCS `_ and `CSC `_ and +adapted for the purposes of this lesson. If you want to know more about specific +programming models / framework, definitely check these out! - `OpenMP for GPU offloading `_ - `Heterogeneous programming with SYCL `_ -- `Educational implementation of heat flow example (incl. MPI-aware CUDA) `_ - - - +- `Educational implementation of heat flow example (incl. MPI-aware CUDA) + `_ diff --git a/content/2-gpu-ecosystem.rst b/content/2-gpu-ecosystem.rst index 906e169c..487bd1bf 100644 --- a/content/2-gpu-ecosystem.rst +++ b/content/2-gpu-ecosystem.rst @@ -1,25 +1,22 @@ .. _gpu-ecosystem: - The GPU hardware and software ecosystem ======================================= - .. questions:: - - What are the differences between GPUs and CPUs? - - What GPU software stacks are available? What do they provide? + - What are the differences between GPUs and CPUs? + - What GPU software stacks are available? What do they provide? .. objectives:: - - Understand the fundamental differences between GPUs and CPUs - - Explore the major GPU software suites available, such as CUDA, ROCm, and oneAPI, and gain a basic understanding of them + - Understand the fundamental differences between GPUs and CPUs + - Explore the major GPU software suites available, such as CUDA, ROCm, and oneAPI, and gain a basic understanding of them .. instructor-note:: - - 20 min teaching - - 0 min exercises - + - 20 min teaching + - 0 min exercises Overview of GPU hardware ------------------------ @@ -27,278 +24,371 @@ Overview of GPU hardware .. figure:: img/hardware/CPUAndGPU.png :align: center - A comparison of the CPU and GPU architecture. - CPU (left) has complex core structure and pack several cores on a single chip. - GPU cores are very simple in comparison, they also share data and control between each other. - This allows to pack more cores on a single chip, thus achieving very high compute density. + A comparison of the CPU and GPU architecture. CPU (left) has complex core structure + and pack several cores on a single chip. GPU cores are very simple in comparison, + they also share data and control between each other. This allows to pack more cores + on a single chip, thus achieving very high compute density. .. admonition:: In short - :class: dropdown - - - Accelerators offer high performance due to their scalability and high density of compute elements. - - They have separate circuit boards connected to CPUs via PCIe bus, with their own memory. - - CPUs copy data from their own memory to the GPU memory, execute the program, and copy the results back. - - GPUs run thousands of threads simultaneously, quickly switching between them to hide memory operations. - - Effective data management and access pattern is critical on the GPU to avoid running out of memory. - - -Accelerators are a separate main circuit board with the processor, memory, power management, etc. -It is connected to the motherboard with CPUs via PCIe bus. -Having its own memory means that the data has to be copied to and from it (not neceseraly true anymore). -CPU acts as a main processor, controlling the execution workflow. -It copies the data from its own memory to the GPU memory, executes the program and copies the results back. 
-GPUs runs tens of thousands of threads simultaneously on thousands of cores and does not do much of the data management.
-With many cores trying to access the memory simultaneously and with little cache available, the accelerator can run out of memory very quickly.
-This makes the data management and its access pattern is essential on the GPU.
-Accelerators like to be overloaded with the number of threads, because they can switch between threads very quickly.
-This allows to hide the memory operations: while some threads wait, others can compute.
-
-
-A very important feature of the accelerators is their scalability.
-Computational cores on accelerators are usually grouped into multiprocessors.
-The multiprocessors share the data and logical elements.
-This allows to achieve a very high density of compute elements on a GPU.
-This also allows the scaling: more multiprocessors means more raw performance and this is very easy to achieve with more transistors available.
-
+    - Accelerators offer high performance due to their scalability and high density of
+      compute elements.
+    - They have separate circuit boards connected to CPUs via PCIe bus, with their own
+      memory.
+    - CPUs copy data from their own memory to the GPU memory, execute the program, and
+      copy the results back.
+    - GPUs run thousands of threads simultaneously, quickly switching between them to
+      hide memory operations.
+    - Effective data management and access pattern is critical on the GPU to avoid
+      running out of memory.
+
+An accelerator is a separate main circuit board with the processor, memory, power
+management, etc. It is connected to the motherboard with CPUs via PCIe bus. Having its
+own memory means that the data has to be copied to and from it (not necessarily true
+anymore). The CPU acts as the main processor, controlling the execution workflow. It
+copies the data from its own memory to the GPU memory, executes the program, and copies
+the results back. GPUs run tens of thousands of threads simultaneously on thousands of
+cores and do not do much of the data management. With many cores trying to access the
+memory simultaneously and with little cache available, the accelerator can run out of
+memory very quickly. This makes data management and its access pattern essential on the
+GPU. Accelerators like to be overloaded with threads, because they can switch between
+threads very quickly. This allows hiding the memory operations: while some threads wait,
+others can compute.
+
+A very important feature of the accelerators is their scalability. Computational cores
+on accelerators are usually grouped into multiprocessors. The multiprocessors share the
+data and logical elements. This allows achieving a very high density of compute elements
+on a GPU. This also allows for scaling: more multiprocessors mean more raw performance,
+and this is very easy to achieve with more transistors available.

How do GPUs differ from CPUs?
-----------------------------

-CPUs and GPUs were designed with different goals in mind. While the CPU
-is designed to excel at executing a sequence of operations, called a thread,
-as fast as possible and can execute a few tens of these threads in parallel,
-the GPU is designed to excel at executing many thousands of them in parallel.
-GPUs were initially developed for highly-parallel task of graphic processing
-and therefore designed such that more transistors are devoted to data processing
-rather than data caching and flow control. 
More transistors dedicated to -data processing is beneficial for highly parallel computations; the GPU can -hide memory access latencies with computation, instead of relying on large data caches -and complex flow control to avoid long memory access latencies, -both of which are expensive in terms of transistors. - - -.. list-table:: - :widths: 100 100 - :header-rows: 1 - - * - CPU - - GPU - * - General purpose - - Highly specialized for parallelism - * - Good for serial processing - - Good for parallel processing - * - Great for task parallelism - - Great for data parallelism - * - Low latency per thread - - High-throughput - * - Large area dedicated cache and control - - Hundreds of floating-point execution units - - +CPUs and GPUs were designed with different goals in mind. While the CPU is designed to +excel at executing a sequence of operations, called a thread, as fast as possible and +can execute a few tens of these threads in parallel, the GPU is designed to excel at +executing many thousands of them in parallel. GPUs were initially developed for +highly-parallel task of graphic processing and therefore designed such that more +transistors are devoted to data processing rather than data caching and flow control. +More transistors dedicated to data processing is beneficial for highly parallel +computations; the GPU can hide memory access latencies with computation, instead of +relying on large data caches and complex flow control to avoid long memory access +latencies, both of which are expensive in terms of transistors. + +.. list-table:: + :widths: 100 100 + :header-rows: 1 + + - - CPU + - GPU + - - General purpose + - Highly specialized for parallelism + - - Good for serial processing + - Good for parallel processing + - - Great for task parallelism + - Great for data parallelism + - - Low latency per thread + - High-throughput + - - Large area dedicated cache and control + - Hundreds of floating-point execution units GPU platforms ------------- -GPUs come together with software stacks or APIs that work in conjunction with the hardware and give a standard way for the software to interact with the GPU hardware. They are used by software developers to write code that can take advantage of the parallel processing power of the GPU, and they provide a standard way for software to interact with the GPU hardware. Typically, they provide access to low-level functionality, such as memory management, data transfer between the CPU and the GPU, and the scheduling and execution of parallel processing tasks on the GPU. They may also provide higher level functions and libraries optimized for specific HPC workloads, like linear algebra or fast Fourier transforms. Finally, in order to facilitate the developers to optimize and write correct codes, debugging and profiling tools are also included. - -*NVIDIA*, *AMD*, and *Intel* are the major companies which design and produces GPUs for HPC providing each its own suite **CUDA**, **ROCm**, and respectively **oneAPI**. This way they can offer optimization, differentiation (offering unique features tailored to their devices), vendor lock-in, licensing, and royalty fees, which can result in better performance, profitability, and customer loyalty. -There are also cross-platform APIs such **DirectCompute** (only for Windows operating system), **OpenCL**, and **SYCL**. +GPUs come together with software stacks or APIs that work in conjunction with the +hardware and give a standard way for the software to interact with the GPU hardware. 
+They are used by software developers to write code that can take advantage of the +parallel processing power of the GPU, and they provide a standard way for software to +interact with the GPU hardware. Typically, they provide access to low-level +functionality, such as memory management, data transfer between the CPU and the GPU, and +the scheduling and execution of parallel processing tasks on the GPU. They may also +provide higher level functions and libraries optimized for specific HPC workloads, like +linear algebra or fast Fourier transforms. Finally, in order to facilitate the +developers to optimize and write correct codes, debugging and profiling tools are also +included. + +*NVIDIA*, *AMD*, and *Intel* are the major companies which design and produces GPUs for +HPC providing each its own suite **CUDA**, **ROCm**, and respectively **oneAPI**. This +way they can offer optimization, differentiation (offering unique features tailored to +their devices), vendor lock-in, licensing, and royalty fees, which can result in better +performance, profitability, and customer loyalty. There are also cross-platform APIs +such **DirectCompute** (only for Windows operating system), **OpenCL**, and **SYCL**. .. admonition:: CUDA - In short - :class: dropdown - - - CUDA: NVIDIA's parallel computing platform - - Components: CUDA Toolkit & CUDA driver - - Supports C, C++, and Fortran languages - - CUDA API Libraries: cuBLAS, cuFFT, cuRAND, cuSPARSE - - Accelerate complex computations on GPUs - - Compilers: nvcc, nvc, nvc++, nvfortran - - Support GPU and multicore CPU programming - - Compatible with OpenACC and OpenMP - - Debugging tools: cuda-gdb, compute-sanitizer - - Debug GPU and CPU code simultaneously - - Identify memory access issues - - Performance analysis tools: NVIDIA Nsight Systems, NVIDIA Nsight Compute - - Analyze system-wide and kernel-level performance - - Optimize CPU and GPU usage, memory bandwidth, instruction throughput - - Comprehensive CUDA ecosystem with extensive tools and features + + - CUDA: NVIDIA's parallel computing platform + - Components: CUDA Toolkit & CUDA driver + - Supports C, C++, and Fortran languages + - CUDA API Libraries: cuBLAS, cuFFT, cuRAND, cuSPARSE + - Accelerate complex computations on GPUs + - Compilers: nvcc, nvc, nvc++, nvfortran + - Support GPU and multicore CPU programming + - Compatible with OpenACC and OpenMP + - Debugging tools: cuda-gdb, compute-sanitizer + - Debug GPU and CPU code simultaneously + - Identify memory access issues + - Performance analysis tools: NVIDIA Nsight Systems, NVIDIA Nsight Compute + - Analyze system-wide and kernel-level performance + - Optimize CPU and GPU usage, memory bandwidth, instruction throughput + - Comprehensive CUDA ecosystem with extensive tools and features .. 
admonition:: ROCm - In short - :class: dropdown - - - ROCm: Open software platform for AMD accelerators - - Built for open portability across multiple vendors and architectures - - Offers libraries, compilers, and development tools for AMD GPUs - - Supports C, C++, and Fortran languages - - Support GPU and multicore CPU programming - - Debugging: ``roc-gdb`` command line tool - - Facilitates debugging of GPU programs - - Performance analysis: ``rocprof`` and ``roctracer`` tools - - Analyze and optimize program performance - - Supports various heterogeneous programming models such as **HIP**, **OpenMP**, and **OpenCL** - - Heterogeneous-Computing Interface for Portability (HIP) - - Enables source portability for NVIDIA and AMD platforms, Intel in plan - - Provides ``hipcc`` compiler driver and runtime libraries - - Libraries: Prefixed with ``roc`` for AMD platforms - - Can be called directly from HIP - - ``hip``-prefixed wrappers ensure portability with no performance cost + + - ROCm: Open software platform for AMD accelerators + - Built for open portability across multiple vendors and architectures + - Offers libraries, compilers, and development tools for AMD GPUs + - Supports C, C++, and Fortran languages + - Support GPU and multicore CPU programming + - Debugging: ``roc-gdb`` command line tool + - Facilitates debugging of GPU programs + - Performance analysis: ``rocprof`` and ``roctracer`` tools + - Analyze and optimize program performance + - Supports various heterogeneous programming models such as **HIP**, **OpenMP**, and + **OpenCL** + - Heterogeneous-Computing Interface for Portability (HIP) + - Enables source portability for NVIDIA and AMD platforms, Intel in plan + - Provides ``hipcc`` compiler driver and runtime libraries + - Libraries: Prefixed with ``roc`` for AMD platforms + - Can be called directly from HIP + - ``hip``-prefixed wrappers ensure portability with no performance cost .. 
admonition:: oneAPI - In short - :class: dropdown - - - Intel oneAPI: Unified software toolkit for optimizing and deploying applications across various architectures - - Supports CPUs, GPUs, and FPGAs - - Enables code reusability and performance portability - - Intel oneAPI Base Toolkit: Core set of tools and libraries for high-performance, data-centric applications - - Includes C++ compiler with SYCL support - - Features Collective Communications Library, Data Analytics Library, Deep Neural Networks Library, and more - - Additional toolkits: Intel oneAPI HPC Toolkit - - Contains compilers, debugging tools, MPI library, and performance analysis tool - - Multiple programming models and languages supported: - - OpenMP, Classic Fortran, C++, SYCL - - Unless custom Intel libraries are used, the code is portable to other OpenMP and SYCL frameworks - - DPC++ Compiler: Supports Intel, NVIDIA, and AMD GPUs - - Targets Intel GPUs using oneAPI Level Zero interface - - Added support for NVIDIA GPUs with CUDA and AMD GPUs with ROCm - - Debugging and performance analysis tools: Intel Adviser, Intel Vtune Profiler, Cluster Checker, Inspector, Intel Trace Analyzer and Collector, Intel Distribution for GDB - - Comprehensive and unified approach to heterogeneous computing - - Abstracts complexities and provides consistent programming interface - - Promotes code reusability, productivity, and performance portability + - Intel oneAPI: Unified software toolkit for optimizing and deploying applications across various architectures + - Supports CPUs, GPUs, and FPGAs + - Enables code reusability and performance portability + - Intel oneAPI Base Toolkit: Core set of tools and libraries for high-performance, data-centric applications + - Includes C++ compiler with SYCL support + - Features Collective Communications Library, Data Analytics Library, Deep + Neural Networks Library, and more + - Additional toolkits: Intel oneAPI HPC Toolkit + - Contains compilers, debugging tools, MPI library, and performance analysis + tool + - Multiple programming models and languages supported: + - OpenMP, Classic Fortran, C++, SYCL + - Unless custom Intel libraries are used, the code is portable to other OpenMP + and SYCL frameworks + - DPC++ Compiler: Supports Intel, NVIDIA, and AMD GPUs + - Targets Intel GPUs using oneAPI Level Zero interface + - Added support for NVIDIA GPUs with CUDA and AMD GPUs with ROCm + - Debugging and performance analysis tools: Intel Adviser, Intel Vtune Profiler, + Cluster Checker, Inspector, Intel Trace Analyzer and Collector, Intel Distribution + for GDB + - Comprehensive and unified approach to heterogeneous computing + - Abstracts complexities and provides consistent programming interface + - Promotes code reusability, productivity, and performance portability CUDA -^^^^ - -**Compute Unified Device Architecture** is the parallel computing platform from NVIDIA. The CUDA API provides a comprehensive set of functions and tools for developing high-performance applications that run on NVIDIA GPUs. It consists of two main components: the CUDA Toolkit and the CUDA driver. The toolkit provides a set of libraries, compilers, and development tools for programming and optimizing CUDA applications, while the driver is responsible for communication between the host CPU and the device GPU. CUDA is designed to work with programming languages such as C, C++, and Fortran. 
- -CUDA API provides many highly optimize libraries such as: **cuBLAS** (for linear algebra operations, such a dense matrix multiplication), **cuFFT** (for performing fast Fourier transforms), **cuRAND** (for generating pseudo-random numbers), **cuSPARSE** (for sparse matrices operations). Using these libraries, developers can quickly and easily accelerate complex computations on NVIDIA GPUs without having to write low-level GPU code themselves. - -There are several compilers that can be used for developing and executing code on NVIDIA GPUs: **nvcc**. The latest versions are based on the widely used LLVM (low level virtual machine) open source compiler infrastructure. nvcc produces optimized code for NVIDIA GPUs and drives a supported host compiler for AMD, Intel, OpenPOWER, and Arm CPUs. - -In addition to this are provided **nvc** (C11 compiler), **nvc++** (C++17 compiler), and **nvfortran** (ISO Fortran 2003 compiler). These compilers can as well create code for execution on the NVIDIA GPUs, and also support GPU and multicore CPU programming with parallel language features, OpeanACC and OpenMP. - - -When programming mistakes are inevitable they have to be fixed as soon as possible. The CUDA toolkit includes the command line tool **cuda-gdb** which can be used to find errors in the code. It is an extension to GDB, the GNU Project debugger. The existing GDB debugging features are inherently present for debugging the host code, and additional features have been provided to support debugging CUDA device code, allowing simultaneous debugging of both GPU and CPU code within the same application. The tool provides developers with a mechanism for debugging CUDA applications running on actual hardware. This enables developers to debug applications without the potential variations introduced by simulation and emulation environments. - -In addition to this the command line tool **compute-sanitizer** can be used to look exclusively for memory access problems: unallocated buffers, out of bounds accesses, race conditions, and uninitialized variables. - -Finally, in order to utilize the GPUs at maximum some performance analysis tools. NVIDIA provides NVIDIA Nsight Systems and NVIDIA Nsight Compute tools for helping the developers to optimize their applications. The former, NVIDIA Nsight Systems, is a system-wide performance analysis tool that provides detailed metrics on both CPU and GPU usage, memory bandwidth, and other system-level metrics. The latter, NVIDIA Nsight Compute, is a kernel-level performance analysis tool that allows developers to analyze the performance of individual CUDA kernels. It provides detailed metrics on kernel execution, including memory usage, instruction throughput, and occupancy. These tools have graphical which can be used for all steps of the performance analysis, however on supercomputers it is recommended to use the command line interface for collecting the information needed and then visualize and analyse the results using the graphical interface on personal computers. - -Apart from what was presented above there are many others tools and features provided by NVIDIA. The CUDA ecosystem is very well developed. - +~~~~ + +**Compute Unified Device Architecture** is the parallel computing platform from NVIDIA. +The CUDA API provides a comprehensive set of functions and tools for developing +high-performance applications that run on NVIDIA GPUs. It consists of two main +components: the CUDA Toolkit and the CUDA driver. 
The toolkit provides a set of
+libraries, compilers, and development tools for programming and optimizing CUDA
+applications, while the driver is responsible for communication between the host CPU and
+the device GPU. CUDA is designed to work with programming languages such as C, C++, and
+Fortran.
+
+The CUDA API provides many highly optimized libraries, such as: **cuBLAS** (for linear
+algebra operations, such as dense matrix multiplication), **cuFFT** (for performing fast
+Fourier transforms), **cuRAND** (for generating pseudo-random numbers), and **cuSPARSE**
+(for sparse matrix operations). Using these libraries, developers can quickly and easily
+accelerate complex computations on NVIDIA GPUs without having to write low-level GPU
+code themselves.
+
+Several compilers can be used for developing and executing code on NVIDIA GPUs, the main
+one being **nvcc**. The latest versions are based on the widely used LLVM (low level
+virtual machine) open source compiler infrastructure. nvcc produces optimized code for
+NVIDIA GPUs and drives a supported host compiler for AMD, Intel, OpenPOWER, and Arm CPUs.
+
+In addition, **nvc** (a C11 compiler), **nvc++** (a C++17 compiler), and **nvfortran**
+(an ISO Fortran 2003 compiler) are provided. These compilers can also create code for
+execution on NVIDIA GPUs, and support GPU and multicore CPU programming with parallel
+language features, OpenACC, and OpenMP.
+
+Programming mistakes are inevitable, and they have to be fixed as soon as possible. The
+CUDA toolkit includes the command line tool **cuda-gdb**, which can be used to find
+errors in the code. It is an extension to GDB, the GNU Project debugger. The existing
+GDB debugging features are inherently present for debugging the host code, and
+additional features have been provided to support debugging CUDA device code, allowing
+simultaneous debugging of both GPU and CPU code within the same application. The tool
+provides developers with a mechanism for debugging CUDA applications running on actual
+hardware. This enables developers to debug applications without the potential variations
+introduced by simulation and emulation environments.
+
+In addition, the command line tool **compute-sanitizer** can be used to look exclusively
+for memory access problems: unallocated buffers, out of bounds accesses, race
+conditions, and uninitialized variables.
+
+Finally, in order to utilize the GPUs to their maximum, performance analysis tools are
+needed. NVIDIA provides the NVIDIA Nsight Systems and NVIDIA Nsight Compute tools to
+help developers optimize their applications. The former, NVIDIA Nsight Systems, is a
+system-wide performance analysis tool that provides detailed metrics on both CPU and GPU
+usage, memory bandwidth, and other system-level metrics. The latter, NVIDIA Nsight
+Compute, is a kernel-level performance analysis tool that allows developers to analyze
+the performance of individual CUDA kernels. It provides detailed metrics on kernel
+execution, including memory usage, instruction throughput, and occupancy. These tools
+have graphical interfaces which can be used for all steps of the performance analysis;
+however, on supercomputers it is recommended to use the command line interface for
+collecting the information needed, and then to visualize and analyse the results using
+the graphical interface on a personal computer.
+
+Apart from what was presented above, there are many other tools and features provided by
+NVIDIA. The CUDA ecosystem is very well developed.
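+
+To give a flavour of how these libraries look in practice, below is a minimal host-side
+sketch (not part of the lesson's examples; file and variable names are illustrative)
+that offloads a single-precision AXPY operation, ``y = a*x + y``, to an NVIDIA GPU using
+**cuBLAS**. Error checking is omitted for brevity; it should compile with something like
+``nvcc saxpy_cublas.cpp -lcublas``.
+
+.. code-block:: cpp
+
+    #include <cublas_v2.h>
+    #include <cuda_runtime.h>
+
+    #include <iostream>
+    #include <vector>
+
+    int main()
+    {
+        const int n = 1 << 20;
+        const float a = 2.0f;
+        std::vector<float> x(n, 1.0f), y(n, 3.0f);
+
+        // Allocate device memory and copy the input data from the host
+        float *d_x, *d_y;
+        cudaMalloc((void **)&d_x, n * sizeof(float));
+        cudaMalloc((void **)&d_y, n * sizeof(float));
+        cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
+        cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
+
+        // cuBLAS computes y = a*x + y on the device; no hand-written kernel needed
+        cublasHandle_t handle;
+        cublasCreate(&handle);
+        cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);
+
+        // Copy the result back and release resources
+        cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
+        cublasDestroy(handle);
+        cudaFree(d_x);
+        cudaFree(d_y);
+
+        std::cout << "y[0] = " << y[0] << std::endl; // expected: 5
+        return 0;
+    }
+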
ROCm -^^^^ - - -ROCm is an open software platform allowing researchers to tap the power of AMD accelerators. -The ROCm platform is built on the foundation of open portability, supporting environments across multiple -accelerator vendors and architectures. In some way it is very similar to CUDA API. -It contains libraries, compilers, and development tools for programming and optimizing programs for AMD GPUs. -For debugging, it provides the command line tool ``rocgdb``, while for performance analysis ``rocprof`` and ``roctracer``. -In order to produce code for the AMD GPUs, one can use the Heterogeneous-Computing Interface for Portability (HIP). -HIP is a C++ runtime API and a set of tools that allows developers to write portable GPU-accelerated code for both NVIDIA and AMD platforms. -It provides the ``hipcc`` compiler driver, which will call the appropriate toolchain depending on the desired platform. -On the AMD ROCm platform, HIP provides a header and runtime library built on top of the HIP-Clang (ROCm compiler). -On an NVIDIA platform, HIP provides a header file which translates from the HIP runtime APIs to CUDA runtime APIs. -The header file contains mostly inlined functions and thus has very low overhead. -The code is then compiled with ``nvcc``, the standard C++ compiler provided with CUDA. -On AMD platforms, libraries are prefixed by ``roc``, which can be called directly from HIP. In order to make portable calls, -one can call the libraries using ``hip``-prefixed wrappers. These wrappers can be used at no performance cost and ensure that -HIP code can be used on other platforms with no changes. Libraries included in the ROCm, are almost one-to-one equivalent to the ones supplied with CUDA. - -ROCm also integrates with popular machine learning frameworks such as TensorFlow and PyTorch and provides optimized libraries and drivers to accelerate machine learning workloads on AMD GPUs enabling the researchers to leverage the power of ROCm and AMD accelerators to train and deploy machine learning models efficiently. - +~~~~ + +ROCm is an open software platform allowing researchers to tap the power of AMD +accelerators. The ROCm platform is built on the foundation of open portability, +supporting environments across multiple accelerator vendors and architectures. In some +way it is very similar to CUDA API. It contains libraries, compilers, and development +tools for programming and optimizing programs for AMD GPUs. For debugging, it provides +the command line tool ``rocgdb``, while for performance analysis ``rocprof`` and +``roctracer``. In order to produce code for the AMD GPUs, one can use the +Heterogeneous-Computing Interface for Portability (HIP). HIP is a C++ runtime API and a +set of tools that allows developers to write portable GPU-accelerated code for both +NVIDIA and AMD platforms. It provides the ``hipcc`` compiler driver, which will call the +appropriate toolchain depending on the desired platform. On the AMD ROCm platform, HIP +provides a header and runtime library built on top of the HIP-Clang (ROCm compiler). On +an NVIDIA platform, HIP provides a header file which translates from the HIP runtime +APIs to CUDA runtime APIs. The header file contains mostly inlined functions and thus +has very low overhead. The code is then compiled with ``nvcc``, the standard C++ +compiler provided with CUDA. On AMD platforms, libraries are prefixed by ``roc``, which +can be called directly from HIP. 
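For illustration, a minimal sketch of what this can look like in practice is shown
below: the same ``daxpy`` update as in the cuBLAS example above, written against the
``hip``-prefixed BLAS wrapper so that it dispatches to rocBLAS on AMD hardware and to
cuBLAS on NVIDIA hardware. The header location and build command depend on the ROCm
version, and error checking is omitted for brevity.

.. code-block:: C++

   // Minimal hipBLAS daxpy sketch: y = a*x + y through the hip-prefixed wrapper.
   // Indicative build command: hipcc daxpy_hipblas.cpp -lhipblas
   #include <vector>

   #include <hip/hip_runtime.h>
   #include <hipblas/hipblas.h>   // older ROCm releases install this as <hipblas.h>

   int main()
   {
       const int n = 1 << 20;
       const double a = 2.0;
       std::vector<double> x(n, 1.0), y(n, 2.0);

       // Allocate device buffers and copy the input data to the GPU.
       double *d_x, *d_y;
       hipMalloc((void **)&d_x, n * sizeof(double));
       hipMalloc((void **)&d_y, n * sizeof(double));
       hipMemcpy(d_x, x.data(), n * sizeof(double), hipMemcpyHostToDevice);
       hipMemcpy(d_y, y.data(), n * sizeof(double), hipMemcpyHostToDevice);

       // The hip-prefixed call maps to rocBLAS (AMD) or cuBLAS (NVIDIA).
       hipblasHandle_t handle;
       hipblasCreate(&handle);
       hipblasDaxpy(handle, n, &a, d_x, 1, d_y, 1);
       hipblasDestroy(handle);

       // Copy the result back and release the device memory.
       hipMemcpy(y.data(), d_y, n * sizeof(double), hipMemcpyDeviceToHost);
       hipFree(d_x);
       hipFree(d_y);
       return 0;
   }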
In order to make portable calls, one can call the +libraries using ``hip``-prefixed wrappers. These wrappers can be used at no performance +cost and ensure that HIP code can be used on other platforms with no changes. Libraries +included in the ROCm, are almost one-to-one equivalent to the ones supplied with CUDA. + +ROCm also integrates with popular machine learning frameworks such as TensorFlow and +PyTorch and provides optimized libraries and drivers to accelerate machine learning +workloads on AMD GPUs enabling the researchers to leverage the power of ROCm and AMD +accelerators to train and deploy machine learning models efficiently. oneAPI -^^^^^^ - - -**Intel oneAPI** is a unified software toolkit developed by Intel that allows developers to optimize and deploy applications across a variety of architectures, including CPUs, GPUs, and FPGAs. It provides a comprehensive set of tools, libraries, and frameworks, enabling developers to leverage the full potential of heterogeneous computing environments. With oneAPI, the developers can write code once and deploy it across different hardware targets without the need for significant modifications or rewriting. This approach promotes code reusability, productivity, and performance portability, as it abstracts the complexities of heterogeneous computing and provides a consistent programming interface based on open standards. - -The core of suite is **Intel oneAPI Base Toolkit**, a set of tools and libraries for developing high-performance, data-centric applications across diverse architectures. It features an industry-leading C++ compiler that implements SYCL, an evolution of C++ for heterogeneous computing. It includes the **Collective Communications Library**, the **Data Analytics Library**, the **Deep Neural Networks Library**, the **DPC++/C++ Compiler**, the **DPC++ Library**, the **Math Kernel Library**, the **Threading Building Blocks**, debugging tool **Intel Distribution for GDB**, performance analysis tools **Intel Adviser** and **Intel Vtune Profiler**, the **Video Processing Library**, **Intel Distribution for Python**, the **DPC++ Compatibility Tool**, the **FPGA Add-on for oneAPI Base Toolkit**, the **Integrated Performance Primitives**. -This can be complemented with additional toolkits. The **Intel oneAPI HPC Toolkit** contains **DPC++/C++ Compiler**, **Fortran** and **C++** Compiler Classic, debugging tools **Cluster Checker** and **Inspector**, **Intel MPI Library**, and performance analysis tool **Intel Trace Analyzer and Collector**. - -oneAPI supports multiple programming models and programming languages. It enables developers to write **OpenMP** codes targeting multi-core CPUs and Intel GPUs using the Classic Fortran and C++ compilers and as well **SYCL** programs for GPUs and FPGAs using the **DPC++** compiler. Initially, the **DPC++** compiler only targeted Intel GPUs using the **oneAPI Level Zero** low-level programming interface, but now support for NVIDIA GPUs (using CUDA) and AMD GPUs (using ROCm) has been added. -Overall, Intel oneAPI offers a comprehensive and unified approach to heterogeneous computing, empowering developers to optimize and deploy applications across different architectures with ease. By abstracting the complexities and providing a consistent programming interface, oneAPI promotes code reusability, productivity, and performance portability, making it an invaluable toolkit for developers in the era of diverse computing platforms. 
-
-
+~~~~~~
+
+**Intel oneAPI** is a unified software toolkit developed by Intel that allows developers
+to optimize and deploy applications across a variety of architectures, including CPUs,
+GPUs, and FPGAs. It provides a comprehensive set of tools, libraries, and frameworks,
+enabling developers to leverage the full potential of heterogeneous computing
+environments. With oneAPI, developers can write code once and deploy it across different
+hardware targets without the need for significant modifications or rewriting. This
+approach promotes code reusability, productivity, and performance portability, as it
+abstracts the complexities of heterogeneous computing and provides a consistent
+programming interface based on open standards.
+
+The core of the suite is the **Intel oneAPI Base Toolkit**, a set of tools and libraries
+for developing high-performance, data-centric applications across diverse architectures.
+It features an industry-leading C++ compiler that implements SYCL, an evolution of C++
+for heterogeneous computing. It includes the **Collective Communications Library**, the
+**Data Analytics Library**, the **Deep Neural Networks Library**, the **DPC++/C++
+Compiler**, the **DPC++ Library**, the **Math Kernel Library**, the **Threading Building
+Blocks**, the debugging tool **Intel Distribution for GDB**, the performance analysis
+tools **Intel Advisor** and **Intel VTune Profiler**, the **Video Processing Library**,
+the **Intel Distribution for Python**, the **DPC++ Compatibility Tool**, the **FPGA
+Add-on for oneAPI Base Toolkit**, and the **Integrated Performance Primitives**. This
+can be complemented with additional toolkits. The **Intel oneAPI HPC Toolkit** contains
+the **DPC++/C++ Compiler**, the **Fortran** and **C++** Compiler Classic, the debugging
+tools **Cluster Checker** and **Inspector**, the **Intel MPI Library**, and the
+performance analysis tool **Intel Trace Analyzer and Collector**.
+
+oneAPI supports multiple programming models and programming languages. It enables
+developers to write **OpenMP** code targeting multi-core CPUs and Intel GPUs using the
+Classic Fortran and C++ compilers, as well as **SYCL** programs for GPUs and FPGAs using
+the **DPC++** compiler. Initially, the **DPC++** compiler only targeted Intel GPUs using
+the **oneAPI Level Zero** low-level programming interface, but support for NVIDIA GPUs
+(using CUDA) and AMD GPUs (using ROCm) has since been added. Overall, Intel oneAPI
+offers a comprehensive and unified approach to heterogeneous computing, empowering
+developers to optimize and deploy applications across different architectures with ease.
+By abstracting the complexities and providing a consistent programming interface, oneAPI
+promotes code reusability, productivity, and performance portability, making it an
+invaluable toolkit for developers in the era of diverse computing platforms.

Differences and similarities
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-GPUs in general support different features, even among the same producer. In general newer cards come with extra
-features and sometimes old features are not supported anymore. It is important when compiling to create binaries
-targeting the specific architecture when compiling. A binary built for a newer card will not run on older devices,
-while a binary build for older devices might not run efficiently on newer architectures. In CUDA the compute
-capability which is targeted is specified by the ``-arch=sm_XY``, where ``X`` specifies the major architecture and it is between 1 and 9, and ``Y`` the minor.
When using HIP on NVIDIA platforms one needs to use compiling option ``--gpu-architecture=sm_XY``, while on AMD platforms ``--offload-arch=gfxabc`` ( where ``abc`` is the architecture code such as ``90a`` for the MI200 series or ``908`` for MI100 series). -Note that in the case of portable (single source) programs one would specify ``openmp`` as well as target for -compilation, enabling to run the same code on multicore CPU. - - +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +GPUs in general support different features, even among the same producer. In general +newer cards come with extra features and sometimes old features are not supported +anymore. It is important when compiling to create binaries targeting the specific +architecture when compiling. A binary built for a newer card will not run on older +devices, while a binary build for older devices might not run efficiently on newer +architectures. In CUDA the compute capability which is targeted is specified by the +``-arch=sm_XY``, where ``X`` specifies the major architecture and it is between 1 and 9, +and ``Y`` the minor. When using HIP on NVIDIA platforms one needs to use compiling +option ``--gpu-architecture=sm_XY``, while on AMD platforms ``--offload-arch=gfxabc`` ( +where ``abc`` is the architecture code such as ``90a`` for the MI200 series or ``908`` +for MI100 series). Note that in the case of portable (single source) programs one would +specify ``openmp`` as well as target for compilation, enabling to run the same code on +multicore CPU. Terminology -^^^^^^^^^^^ +~~~~~~~~~~~ .. list-table:: Hardware - :widths: 25 25 50 - :header-rows: 1 - - * - NVIDIA - - AMD - - Intel - * - Streaming processor/streaming core - - SIMD lane - - Processing element - * - SIMT unit - - SIMD unit - - Vector engine (XVE) - * - Streaming Multiprocessor (SM) - - Computing Unit (CU) - - Xe-core / Execution unit (EU) - * - GPU processing clusters (GPC) - - Compute Engine - - Xe-slice - -Please keep in mind, that this table is only a rough approximation. -Each GPU architecture is different, and it's impossible to make a 1-to-1 mapping between terms used by different vendors. - - + :widths: 25 25 50 + :header-rows: 1 + + - - NVIDIA + - AMD + - Intel + - - Streaming processor/streaming core + - SIMD lane + - Processing element + - - SIMT unit + - SIMD unit + - Vector engine (XVE) + - - Streaming Multiprocessor (SM) + - Computing Unit (CU) + - Xe-core / Execution unit (EU) + - - GPU processing clusters (GPC) + - Compute Engine + - Xe-slice + +Please keep in mind, that this table is only a rough approximation. Each GPU +architecture is different, and it's impossible to make a 1-to-1 mapping between terms +used by different vendors. Summary ------- -- GPUs are designed to execute thousands of threads simultaneously, making them highly parallel processors. In contrast, CPUs excel at executing a smaller number of threads in parallel. -- GPUs allocate a larger portion of transistors to data processing rather than data caching and flow control. This prioritization of data processing enables GPUs to effectively handle parallel computations and hide memory access latencies through computation. -- GPU producers provide comprehensive toolkits, libraries, and compilers for developing high-performance applications that leverage the parallel processing power of GPUs. Examples include CUDA (NVIDIA), ROCm (AMD), and oneAPI (Intel). 
-- These platforms offer debugging tools (e.g., ``cuda-gdb``, ``rocgdb``) and performance analysis tools (e.g., NVIDIA Nsight Systems, NVIDIA Nsight Compute, ``rocprof``, ``roctracer``) to facilitate code optimization and ensure efficient utilization of GPU resources. - - +- GPUs are designed to execute thousands of threads simultaneously, making them highly + parallel processors. In contrast, CPUs excel at executing a smaller number of threads + in parallel. +- GPUs allocate a larger portion of transistors to data processing rather than data + caching and flow control. This prioritization of data processing enables GPUs to + effectively handle parallel computations and hide memory access latencies through + computation. +- GPU producers provide comprehensive toolkits, libraries, and compilers for developing + high-performance applications that leverage the parallel processing power of GPUs. + Examples include CUDA (NVIDIA), ROCm (AMD), and oneAPI (Intel). +- These platforms offer debugging tools (e.g., ``cuda-gdb``, ``rocgdb``) and performance + analysis tools (e.g., NVIDIA Nsight Systems, NVIDIA Nsight Compute, ``rocprof``, + ``roctracer``) to facilitate code optimization and ensure efficient utilization of GPU + resources. Exercises --------- .. challenge:: GPUs and memory - Which statement about the relationship between GPUs and memory is true? + Which statement about the relationship between GPUs and memory is true? - - A) GPUs are not affected by memory access latencies. - - B) GPUs can run out of memory quickly with many cores trying to access the memory simultaneously. - - C) GPUs have an unlimited cache size. - - D) GPUs prefer to run with a minimal number of threads to manage memory effectively. + - A) GPUs are not affected by memory access latencies. + - B) GPUs can run out of memory quickly with many cores trying to access the memory simultaneously. + - C) GPUs have an unlimited cache size. + - D) GPUs prefer to run with a minimal number of threads to manage memory effectively. - .. solution:: - - The correct answer is B). This is true because GPUs run many threads simultaneously on thousands of - cores, and with limited cache available, this can lead to the GPU running out of memory quickly if many - cores are trying to access the memory simultaneously. This is why data management and access patterns - are essential in GPU computing. + .. solution:: + The correct answer is B). This is true because GPUs run many threads simultaneously on thousands of + cores, and with limited cache available, this can lead to the GPU running out of memory quickly if many + cores are trying to access the memory simultaneously. This is why data management and access patterns + are essential in GPU computing. .. keypoints:: - - GPUs vs. CPUs, key differences between them - - GPU software suites, support specific GPU features, programming models, compatibility - - Applications of GPUs - + - GPUs vs. CPUs, key differences between them + - GPU software suites, support specific GPU features, programming models, compatibility + - Applications of GPUs diff --git a/content/3-gpu-problems.rst b/content/3-gpu-problems.rst index 405a7bf1..d8b50736 100644 --- a/content/3-gpu-problems.rst +++ b/content/3-gpu-problems.rst @@ -1,257 +1,267 @@ .. _gpu-problems: - What problems fit to GPU? ========================= .. questions:: - - What are the strengths and weaknesses of GPUs? - - What makes a particular problem suitable for GPU-porting? - - Why are GPUs so ubiquitous in machine learning applications? 
+ - What are the strengths and weaknesses of GPUs? + - What makes a particular problem suitable for GPU-porting? + - Why are GPUs so ubiquitous in machine learning applications? .. objectives:: - - Get a feeling for the type of use cases that GPUs excel at. + - Get a feeling for the type of use cases that GPUs excel at. .. instructor-note:: - - 10 min teaching - - 10 min exercises - - + - 10 min teaching + - 10 min exercises What are GPUs good for? ----------------------- +Answer from `Stack Exchange +`__: -Answer from `Stack Exchange `__: - - *From a metaphorical point of view, the GPU can be seen as a person lying on a bed - of nails. The person lying on top is the data and in the base of each nail there - is a processor, so the nail is actually an arrow pointing from processor to memory. - All nails are in a regular pattern, like a grid. If the body is well spread, - it feels good (performance is good), if the body only touches some spots of the - nail bed, then the pain is bad (bad performance).* + *From a metaphorical point of view, the GPU can be seen as a person lying on a bed + of nails. The person lying on top is the data and in the base of each nail there is + a processor, so the nail is actually an arrow pointing from processor to memory. All + nails are in a regular pattern, like a grid. If the body is well spread, it feels + good (performance is good), if the body only touches some spots of the nail bed, + then the pain is bad (bad performance).* - -GPU computing is well-suited to problems that involve large amounts of data parallelism. +GPU computing is well-suited to problems that involve large amounts of data parallelism. Specifically, you can expect good performance on GPUs for: -- **Large-scale matrix and vector operations**: Common in machine learning, scientific computing, and image processing. -- **Fourier transforms**: Also common in machine learning, scientific computing, and image processing. -- **Monte Carlo simulations**: Used across finance, physics, and other fields to simulate complex systems. +- **Large-scale matrix and vector operations**: Common in machine learning, scientific + computing, and image processing. +- **Fourier transforms**: Also common in machine learning, scientific computing, and + image processing. +- **Monte Carlo simulations**: Used across finance, physics, and other fields to + simulate complex systems. - **Molecular dynamics simulations**: Used in chemistry, biochemistry and physics. - **Computational fluid dynamics**: Used in engineering, physics, and other fields. - **Convolutional neural networks** and **computer vision algorithms**. - **Big data analytics**: Clustering, classification, regression, etc. - **Graphics rendering**: Original use-case for GPUs. - What are GPUs not good for? --------------------------- - -Not all programming problems can efficiently leverage the parallelism offered by GPUs. +Not all programming problems can efficiently leverage the parallelism offered by GPUs. Some types of problems that do not fit well on a GPU include: -- **Sequential tasks**: Problems that require a series of dependent steps, - where each step relies on the outcome of the previous step, are not well-suited - for parallel processing. Examples include recursive algorithms, certain dynamic - programming problems, and some graph traversal algorithms. - -- **Fine-grained branching**: GPUs perform best when the code being executed across - different threads follows a similar control flow. 
When there is extensive - branching (i.e., many ``if`` statements) within a kernel or algorithm, performance - may suffer due to the divergence in execution paths among the GPU threads. - -- **Low arithmetic intensity**: GPUs excel at performing a large number of mathematical - operations quickly. If a problem has low arithmetic intensity (i.e., a low ratio of - arithmetic operations to memory accesses), the GPU may not be able to efficiently utilize - its computational power, leading to underperformance. - -- **Small data sets**: If the problem involves a small data set that does not require significant - parallelism, using a GPU may not result in noticeable performance gains. In such cases, - the overhead of transferring data between the CPU and GPU, and the time spent initializing the GPU, - may outweigh any potential benefits. - -- **Limited parallelism**: Some algorithms have inherent limitations on the degree of parallelism that can be - achieved. In these cases, using a GPU may not lead to significant performance improvements. - -- **Memory-bound problems**: GPUs generally have less memory available compared to CPUs, and their memory bandwidth - can be a limiting factor. If a problem requires a large amount of memory or involves memory-intensive operations, - it may not be well-suited for a GPU. - +- **Sequential tasks**: Problems that require a series of dependent steps, where each + step relies on the outcome of the previous step, are not well-suited for parallel + processing. Examples include recursive algorithms, certain dynamic programming + problems, and some graph traversal algorithms. +- **Fine-grained branching**: GPUs perform best when the code being executed across + different threads follows a similar control flow. When there is extensive branching + (i.e., many ``if`` statements) within a kernel or algorithm, performance may suffer + due to the divergence in execution paths among the GPU threads. +- **Low arithmetic intensity**: GPUs excel at performing a large number of mathematical + operations quickly. If a problem has low arithmetic intensity (i.e., a low ratio of + arithmetic operations to memory accesses), the GPU may not be able to efficiently + utilize its computational power, leading to underperformance. +- **Small data sets**: If the problem involves a small data set that does not require + significant parallelism, using a GPU may not result in noticeable performance gains. + In such cases, the overhead of transferring data between the CPU and GPU, and the time + spent initializing the GPU, may outweigh any potential benefits. +- **Limited parallelism**: Some algorithms have inherent limitations on the degree of + parallelism that can be achieved. In these cases, using a GPU may not lead to + significant performance improvements. +- **Memory-bound problems**: GPUs generally have less memory available compared to CPUs, + and their memory bandwidth can be a limiting factor. If a problem requires a large + amount of memory or involves memory-intensive operations, it may not be well-suited + for a GPU. Examples of GPU acceleration ---------------------------- -To give a flavor of what type of performance gains we can achieve by porting a calculations to a GPU -(if we're lucky!), let's look at a few case examples. +To give a flavor of what type of performance gains we can achieve by porting a +calculations to a GPU (if we're lucky!), let's look at a few case examples. .. 
discussion:: Effect of array size - - Consider the case of matrix multiplication in the Julia language: - - .. code-block:: julia - - using AMDGPU - using BenchmarkTools - - N = [9, 10, 11, 12] - - for n in N - A = rand(2^n, 2^n); A_d = ROCArray(A); - - @btime $A * $A; - - @btime begin - $A_d * $A_d; - AMDGPU.synchronize() - end - end - - - - How much faster do you think the GPU version is compared to running on a single CPU core? - - Julia automatically parallelises matrix multiplication over available CPU cores. Will the GPU version be faster than running on 64 cores? - - Does the size of the array affect how much the performance improves? - - .. solution:: - - Example results from running on LUMI (MI250X AMD GPU, 64-core AMD Trento CPUs): - - .. list-table:: GPU acceleration for matrix multiply in Julia - :widths: 25 25 25 25 25 - :header-rows: 1 - - * - Matrix size - - 1 CPU core - - 64 CPU cores - - 1 GPU - - GPU speedup - * - (512, 512) - - 5.472 ms - - 517.722 μs - - 115.805 μs - - ~47x / ~5x - * - (1024, 1024) - - 43.364 ms - - 2.929 ms - - 173.316 μs - - ~250x / ~17x - * - (2048, 2048) - - 344.364 ms - - 30.081 ms - - 866.348 μs - - ~400x / ~35x - * - (4096, 4096) - - 3.221 s - - 159.563 ms - - 5.910 ms - - ~550x / ~27x + Consider the case of matrix multiplication in the Julia language: + + .. code-block:: julia + + using AMDGPU + using BenchmarkTools + + N = [9, 10, 11, 12] + + for n in N + A = rand(2^n, 2^n); A_d = ROCArray(A); + + @btime $A * $A; + + @btime begin + $A_d * $A_d; + AMDGPU.synchronize() + end + end + + + - How much faster do you think the GPU version is compared to running on a single CPU core? + - Julia automatically parallelises matrix multiplication over available CPU cores. Will the GPU version be faster than running on 64 cores? + - Does the size of the array affect how much the performance improves? + + .. solution:: + + Example results from running on LUMI (MI250X AMD GPU, 64-core AMD Trento CPUs): + + .. list-table:: GPU acceleration for matrix multiply in Julia + :widths: 25 25 25 25 25 + :header-rows: 1 + + * - Matrix size + - 1 CPU core + - 64 CPU cores + - 1 GPU + - GPU speedup + * - (512, 512) + - 5.472 ms + - 517.722 μs + - 115.805 μs + - ~47x / ~5x + * - (1024, 1024) + - 43.364 ms + - 2.929 ms + - 173.316 μs + - ~250x / ~17x + * - (2048, 2048) + - 344.364 ms + - 30.081 ms + - 866.348 μs + - ~400x / ~35x + * - (4096, 4096) + - 3.221 s + - 159.563 ms + - 5.910 ms + - ~550x / ~27x Electronic structure calculations -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -VASP is a popular software package used for electronic structure calculations. The figures below show the speedup observed in a recent benchmark study on the Perlmutter and Cori supercomputers, along with an analysis of total energy usage. +VASP is a popular software package used for electronic structure calculations. The +figures below show the speedup observed in a recent benchmark study on the Perlmutter +and Cori supercomputers, along with an analysis of total energy usage. .. figure:: img/problems/vasp_gpu.png - :align: center + :align: center - VASP GPU speedup for benchmark Si128 acfdtr. The horizontal axis shows the number of nodes, and the vertical axis shows the GPU speedup of VASP (Time(CPU)/Time(GPU)). (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs). + VASP GPU speedup for benchmark Si128 acfdtr. The horizontal axis shows the number of + nodes, and the vertical axis shows the GPU speedup of VASP (Time(CPU)/Time(GPU)). 
+ (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs). .. figure:: img/problems/vasp_energy.png - :align: center - - Total energy usage comparison when running VASP on Perlmutter and Cori. The vertical axis shows the energy used by VASP benchmark jobs on Perlmutter GPUs (blue bars), CPUs (red bars), Cori KNL (yellow bars), and Cori Haswell (green bars) in ratio to the Cori Haswell usage. (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs) - + :align: center + Total energy usage comparison when running VASP on Perlmutter and Cori. The vertical + axis shows the energy used by VASP benchmark jobs on Perlmutter GPUs (blue bars), + CPUs (red bars), Cori KNL (yellow bars), and Cori Haswell (green bars) in ratio to + the Cori Haswell usage. (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs) Computational Chemistry -^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~ + +A great deal of computational resources are spent in Quantum Chemical calculations which +involve the solution of the Hartree-Fock eigenvalue problem, which requires the +diagonalization of the Fock matrix whose elements are given by: -A great deal of computational resources are spent in Quantum Chemical calculations which involve -the solution of the Hartree-Fock eigenvalue problem, which requires the diagonalization of the -Fock matrix whose elements are given by: - .. math:: + F_{\alpha \beta} = H^{\textrm{core}}_{\alpha \beta} + \sum_{\gamma \delta}D_{\gamma \delta} \left [ (\alpha \beta|\gamma \delta) - \frac{1}{2} (\alpha \delta|\gamma \beta) \right ], -The first term is related to the one electron contributions and the second term is related to the -electron repulsion integrals (ERIs), in parenthesis, weighted by the by the density matrix -:math:`D_{\gamma \delta}`. One of the most expensive parts in the solution of the Hartree-Fock equations is the -processing (digestion) of the ERIs, one algorithm to do this task is as follows: +The first term is related to the one electron contributions and the second term is +related to the electron repulsion integrals (ERIs), in parenthesis, weighted by the by +the density matrix :math:`D_{\gamma \delta}`. One of the most expensive parts in the +solution of the Hartree-Fock equations is the processing (digestion) of the ERIs, one +algorithm to do this task is as follows: .. figure:: img/concepts/algorithms.svg :width: 200 :align: center - Algorithm for processing ERIs [see `JCTC, 17, 7486, (2021) `__ for details] - -This algorithm is suitable for GPUs as it involves many arithmetic operations. In addition to this, -there are symmetries and properties of the integrals that could be used to rearrange the loops in -an efficient manner that fit GPU architectures. + Algorithm for processing ERIs [see `JCTC, 17, 7486, (2021) + `__ for details] +This algorithm is suitable for GPUs as it involves many arithmetic operations. In +addition to this, there are symmetries and properties of the integrals that could be +used to rearrange the loops in an efficient manner that fit GPU architectures. Humanities -^^^^^^^^^^ +~~~~~~~~~~ -A brief introduction into some of the work that is being done in the humanities that can benefit from utilizing GPUs. +A brief introduction into some of the work that is being done in the humanities that can +benefit from utilizing GPUs. 
**Language models and NLP (natural language processing)** -With the recent popularity of ChatGPT, the use of language models has come into the mainstream, -however such models have been used in the humanities many years already. One of the biggest goals of humanities -researchers is working with textual data which has increased exponentially over recent years due to the rise in -social media. Analyzing such textual data to gain insights into questions of sociology, linguistics and various -other fields have become increasingly reliant on using language models. Along with language models, -the need for GPU access has become essential. - +With the recent popularity of ChatGPT, the use of language models has come into the +mainstream, however such models have been used in the humanities many years already. One +of the biggest goals of humanities researchers is working with textual data which has +increased exponentially over recent years due to the rise in social media. Analyzing +such textual data to gain insights into questions of sociology, linguistics and various +other fields have become increasingly reliant on using language models. Along with +language models, the need for GPU access has become essential. **Archeology** -The field of archeology also makes use of GPUs in their 3D modelling -and rendering work. The biggest problem with archeological sites is that once they are excavated, -they are destroyed, so any researchers who aren't present at the site, would lose valuable insights into how -it looked when it was found. However, with recent developments in technology and accessibility to high-performance -computing, they are able to generate extremely detailed renderings of the excavation sites which act as a way to -preserve the site for future researchers to gain critical insights and contribute to the research. +The field of archeology also makes use of GPUs in their 3D modelling and rendering work. +The biggest problem with archeological sites is that once they are excavated, they are +destroyed, so any researchers who aren't present at the site, would lose valuable +insights into how it looked when it was found. However, with recent developments in +technology and accessibility to high-performance computing, they are able to generate +extremely detailed renderings of the excavation sites which act as a way to preserve the +site for future researchers to gain critical insights and contribute to the research. **Cognitive Science** -Techniques such as Markov Chain Monte Carlo (MCMC) sampling have proven to be invaluable in studies that delve into human behavior or population dynamics. MCMC sampling allows researchers to simulate and analyze complex systems by iteratively sampling from a Markov chain, enabling the exploration of high-dimensional parameter spaces. This method is particularly useful when studying human behavior, as it can capture the inherent randomness and interdependencies that characterize social systems. By leveraging MCMC sampling, researchers can gain insights into various aspects of human behavior, such as decision-making, social interactions, and the spread of information or diseases within populations. - -By offloading the computational workload to GPUs, researchers can experience substantial speedup in the execution of MCMC algorithms. This speedup allows for more extensive exploration of parameter spaces and facilitates the analysis of larger datasets, leading to more accurate and detailed insights into human behavior or population dynamics. 
Examples of studies done using these methods can be found at the `Center for Humanities Computing Aarhus `__ (CHCAA) and `Interacting Minds Centre `__ (IMC) at Aarhus University. - - +Techniques such as Markov Chain Monte Carlo (MCMC) sampling have proven to be invaluable +in studies that delve into human behavior or population dynamics. MCMC sampling allows +researchers to simulate and analyze complex systems by iteratively sampling from a +Markov chain, enabling the exploration of high-dimensional parameter spaces. This method +is particularly useful when studying human behavior, as it can capture the inherent +randomness and interdependencies that characterize social systems. By leveraging MCMC +sampling, researchers can gain insights into various aspects of human behavior, such as +decision-making, social interactions, and the spread of information or diseases within +populations. + +By offloading the computational workload to GPUs, researchers can experience substantial +speedup in the execution of MCMC algorithms. This speedup allows for more extensive +exploration of parameter spaces and facilitates the analysis of larger datasets, leading +to more accurate and detailed insights into human behavior or population dynamics. +Examples of studies done using these methods can be found at the `Center for Humanities +Computing Aarhus `__ (CHCAA) and `Interacting Minds Centre +`__ (IMC) at Aarhus University. Exercises --------- .. challenge:: Discussion - - What type of problems have you used GPUs for? - - How large was the performance boost? - + - What type of problems have you used GPUs for? + - How large was the performance boost? .. challenge:: Good and bad use cases for GPU porting - Which of the following computational tasks is likely to gain the least performance benefit from being ported to a GPU? - - 1. Training a large, deep neural network. - 2. Performing a Monte Carlo simulation with a large number of independent trials. - 3. Executing an algorithm with heavy use of recursion and frequent branching. - 4. Processing a large image with a convolutional filter. - - .. solution:: + Which of the following computational tasks is likely to gain the least performance benefit from being ported to a GPU? - The right answer is option 3. GPUs do not handle recursion and branching as effectively as more - data-heavy algorithms. + 1. Training a large, deep neural network. + 2. Performing a Monte Carlo simulation with a large number of independent trials. + 3. Executing an algorithm with heavy use of recursion and frequent branching. + 4. Processing a large image with a convolutional filter. + .. solution:: + The right answer is option 3. GPUs do not handle recursion and branching as effectively as more + data-heavy algorithms. .. keypoints:: - - GPUs excel in processing tasks with high data parallelism, such as large-scale matrix operations, Fourier transforms, and big data analytics. - - GPUs struggle with sequential tasks, problems with extensive control flow divergence, low arithmetic intensity tasks, small data sets, and memory-bound problems. + - GPUs excel in processing tasks with high data parallelism, such as large-scale matrix operations, Fourier transforms, and big data analytics. + - GPUs struggle with sequential tasks, problems with extensive control flow divergence, low arithmetic intensity tasks, small data sets, and memory-bound problems. 
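To make these keypoints more concrete, here is a small, purely illustrative C++ sketch
(the function names and the recurrence are arbitrary). The first loop has fully
independent iterations, so each element could be handled by a separate GPU thread; the
second carries a dependence from one iteration to the next and cannot be split across
threads in the same straightforward way.

.. code-block:: C++

   #include <cstddef>
   #include <vector>

   // Data parallel: every iteration is independent, so the loop maps naturally
   // onto thousands of GPU threads (one element per thread).
   void scale_all(std::vector<double> &v, double a)
   {
       for (std::size_t i = 0; i < v.size(); ++i)
           v[i] *= a;
   }

   // Sequential: each iteration needs the value produced by the previous one,
   // so the iterations cannot simply be distributed across GPU threads.
   double recurrence(const std::vector<double> &v)
   {
       double s = 0.0;
       for (std::size_t i = 0; i < v.size(); ++i)
           s = 0.5 * s + v[i];
       return s;
   }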
diff --git a/content/4-gpu-concepts.rst b/content/4-gpu-concepts.rst index 89a3ded0..77a81fd9 100644 --- a/content/4-gpu-concepts.rst +++ b/content/4-gpu-concepts.rst @@ -1,240 +1,335 @@ .. _gpu-concepts: - GPU programming concepts ======================== - .. questions:: - - What types of parallel computing is possible? - - How does data parallelism differ from task parallelism, and how are they utilized in parallel computing? - - How is the work parallelized and executed on GPUs? - - What are general considerations for an efficient code running on GPUs? + - What types of parallel computing is possible? + - How does data parallelism differ from task parallelism, and how are they utilized in parallel computing? + - How is the work parallelized and executed on GPUs? + - What are general considerations for an efficient code running on GPUs? .. objectives:: - - Understand parallel computing principles and architectures. - - Differentiate data parallelism from task parallelism. - - Learn the GPU execution model. - - Parallelize and execute work on GPUs. - - Develop efficient GPU code for high performance. + - Understand parallel computing principles and architectures. + - Differentiate data parallelism from task parallelism. + - Learn the GPU execution model. + - Parallelize and execute work on GPUs. + - Develop efficient GPU code for high performance. .. instructor-note:: - - 25 min teaching - - 0 min exercises - + - 25 min teaching + - 0 min exercises Different types of parallelism ------------------------------ - Distributed- vs. Shared-Memory Architecture ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Most of computing problems are not trivially parallelizable, which means that the subtasks -need to have access from time to time to some of the results computed by other subtasks. -The way subtasks exchange needed information depends on the available hardware. +Most of computing problems are not trivially parallelizable, which means that the +subtasks need to have access from time to time to some of the results computed by other +subtasks. The way subtasks exchange needed information depends on the available +hardware. .. figure:: img/history/distributed_vs_shared.png - :align: center - - Distributed- vs shared-memory parallel computing. + :align: center -In a distributed memory environment each processing unit operates independently from the -others. It has its own memory and it **cannot** access the memory in other nodes. -The communication is done via network and each computing unit runs a separate copy of the -operating system. In a shared memory machine all processing units have access to the memory -and can read or modify the variables within. + Distributed- vs shared-memory parallel computing. +In a distributed memory environment each processing unit operates independently from the +others. It has its own memory and it **cannot** access the memory in other nodes. The +communication is done via network and each computing unit runs a separate copy of the +operating system. In a shared memory machine all processing units have access to the +memory and can read or modify the variables within. Processes and Threads ~~~~~~~~~~~~~~~~~~~~~ -The type of environment (distributed- or shared-memory) determines the programming model. -There are two types of parallelism possible, process based and thread based. +The type of environment (distributed- or shared-memory) determines the programming +model. There are two types of parallelism possible, process based and thread based. .. 
figure:: img/history/processes-threads.png - :align: center - -For distributed memory machines, a process-based parallel programming model is employed. -The processes are independent execution units which have their *own memory* address spaces. -They are created when the parallel program is started and they are only terminated at the -end. The communication between them is done explicitly via message passing like MPI. - -On the shared memory architectures it is possible to use a thread based parallelism. -The threads are light execution units and can be created and destroyed at a relatively -small cost. The threads have their own state information, but they *share* the *same memory* -address space. When needed the communication is done though the shared memory. - - -Both approaches have their advantages and disadvantages. Distributed machines are -relatively cheap to build and they have an "infinite " capacity. In principle one could -add more and more computing units. In practice the more computing units are used the more -time consuming is the communication. The shared memory systems can achieve good performance -and the programming model is quite simple. However they are limited by the memory capacity -and by the access speed. In addition in the shared parallel model it is much easier to -create race conditions. + :align: center +For distributed memory machines, a process-based parallel programming model is employed. +The processes are independent execution units which have their *own memory* address +spaces. They are created when the parallel program is started and they are only +terminated at the end. The communication between them is done explicitly via message +passing like MPI. + +On the shared memory architectures it is possible to use a thread based parallelism. The +threads are light execution units and can be created and destroyed at a relatively small +cost. The threads have their own state information, but they *share* the *same memory* +address space. When needed the communication is done though the shared memory. + +Both approaches have their advantages and disadvantages. Distributed machines are +relatively cheap to build and they have an "infinite " capacity. In principle one could +add more and more computing units. In practice the more computing units are used the +more time consuming is the communication. The shared memory systems can achieve good +performance and the programming model is quite simple. However they are limited by the +memory capacity and by the access speed. In addition in the shared parallel model it is +much easier to create race conditions. Exposing parallelism -------------------- -There are two types of parallelism that can be explored. -The data parallelism is when the data can be distributed across computational units that can run in parallel. -The units process the data by applying the same or very similar operation to different data elements. -A common example is applying a blur filter to an image --- the same function is applied to all the pixels on an image. -This parallelism is natural for the GPU, where the same instruction set is executed in multiple :abbr:`threads`. +There are two types of parallelism that can be explored. The data parallelism is when +the data can be distributed across computational units that can run in parallel. The +units process the data by applying the same or very similar operation to different data +elements. 
A common example is applying a blur filter to an image --- the same function +is applied to all the pixels on an image. This parallelism is natural for the GPU, where +the same instruction set is executed in multiple :abbr:`threads`. .. figure:: img/concepts/ENCCS-OpenACC-CUDA_TaskParallelism_Explanation.png :align: center :scale: 40 % - Data parallelism and task parallelism. - The data parallelism is when the same operation applies to multiple data (e.g. multiple elements of an array are transformed). - The task parallelism implies that there are more than one independent task that, in principle, can be executed in parallel. - -Data parallelism can usually be explored by the GPUs quite easily. -The most basic approach would be finding a loop over many data elements and converting it into a GPU kernel. -If the number of elements in the data set is fairly large (tens or hundred of thousands elements), the GPU should perform quite well. Although it would be odd to expect absolute maximum performance from such a naive approach, it is often the one to take. Getting absolute maximum out of the data parallelism requires good understanding of how GPU works. - - -Another type of parallelism is a task parallelism. -This is when an application consists of more than one task that requiring to perform different operations with (the same or) different data. -An example of task parallelism is cooking: slicing vegetables and grilling are very different tasks and can be done at the same time. -Note that the tasks can consume totally different resources, which also can be explored. + Data parallelism and task parallelism. The data parallelism is when the same + operation applies to multiple data (e.g. multiple elements of an array are + transformed). The task parallelism implies that there are more than one independent + task that, in principle, can be executed in parallel. + +Data parallelism can usually be explored by the GPUs quite easily. The most basic +approach would be finding a loop over many data elements and converting it into a GPU +kernel. If the number of elements in the data set is fairly large (tens or hundred of +thousands elements), the GPU should perform quite well. Although it would be odd to +expect absolute maximum performance from such a naive approach, it is often the one to +take. Getting absolute maximum out of the data parallelism requires good understanding +of how GPU works. + +Another type of parallelism is a task parallelism. This is when an application consists +of more than one task that requiring to perform different operations with (the same or) +different data. An example of task parallelism is cooking: slicing vegetables and +grilling are very different tasks and can be done at the same time. Note that the tasks +can consume totally different resources, which also can be explored. .. admonition:: In short - :class: dropdown - - - Computing problems can be parallelized in distributed memory or shared memory architectures. - - In distributed memory, each unit operates independently, with no direct memory access between nodes. - - In shared memory, units have access to the same memory and can communicate through shared variables. - - Parallel programming can be process-based (distributed memory) or thread-based (shared memory). - - Process-based parallelism uses independent processes with separate memory spaces and explicit message passing. - - Thread-based parallelism uses lightweight threads that share the same memory space and communicate through shared memory. 
- - Data parallelism distributes data across computational units, processing them with the same or similar operations. - - Task parallelism involves multiple independent tasks that perform different operations on the same or different data. - - Task parallelism involves executing different tasks concurrently, leveraging different resources. + - Computing problems can be parallelized in distributed memory or shared memory + architectures. + - In distributed memory, each unit operates independently, with no direct memory + access between nodes. + - In shared memory, units have access to the same memory and can communicate through + shared variables. + - Parallel programming can be process-based (distributed memory) or thread-based + (shared memory). + - Process-based parallelism uses independent processes with separate memory spaces + and explicit message passing. + - Thread-based parallelism uses lightweight threads that share the same memory space + and communicate through shared memory. + - Data parallelism distributes data across computational units, processing them with + the same or similar operations. + - Task parallelism involves multiple independent tasks that perform different + operations on the same or different data. + - Task parallelism involves executing different tasks concurrently, leveraging + different resources. GPU Execution Model ------------------- -In order to obtain maximum performance it is important to understand how GPUs execute the programs. As mentioned before a CPU is a flexible device oriented towards general purpose usage. It's fast and versatile, designed to run operating systems and various, very different types of applications. It has lots of features, such as better control logic, caches and cache coherence, that are not related to pure computing. CPUs optimize the execution by trying to achieve low latency via heavy caching and branch prediction. +In order to obtain maximum performance it is important to understand how GPUs execute +the programs. As mentioned before a CPU is a flexible device oriented towards general +purpose usage. It's fast and versatile, designed to run operating systems and various, +very different types of applications. It has lots of features, such as better control +logic, caches and cache coherence, that are not related to pure computing. CPUs optimize +the execution by trying to achieve low latency via heavy caching and branch prediction. .. figure:: img/concepts/cpu-gpu-highway.png :align: center :scale: 40 % - Cars and roads analogy for the CPU and GPU behavior. The compact road is analogous to the CPU - (low latency, low throughput) and the broader road is analogous to the GPU (high latency, high throughput). - -In contrast the GPUs contain a relatively small amount of transistors dedicated to control and caching, and a much larger fraction of transistors dedicated to the mathematical operations. Since the cores in a GPU are designed just for 3D graphics, they can be made much simpler and there can be a very larger number of cores. The current GPUs contain thousands of CUDA cores. Performance in GPUs is obtain by having a very high degree of parallelism. Lots of threads are launched in parallel. For good performance there should be at least several times more than the number of CUDA cores. GPU :abbr:`threads` are much lighter than the usual CPU threads and they have very little penalty for context switching. This way when some threads are performing some memory operations (reading or writing) others execute instructions. 
- - + Cars and roads analogy for the CPU and GPU behavior. The compact road is analogous + to the CPU (low latency, low throughput) and the broader road is analogous to the + GPU (high latency, high throughput). + +In contrast the GPUs contain a relatively small amount of transistors dedicated to +control and caching, and a much larger fraction of transistors dedicated to the +mathematical operations. Since the cores in a GPU are designed just for 3D graphics, +they can be made much simpler and there can be a very larger number of cores. The +current GPUs contain thousands of CUDA cores. Performance in GPUs is obtain by having a +very high degree of parallelism. Lots of threads are launched in parallel. For good +performance there should be at least several times more than the number of CUDA cores. +GPU :abbr:`threads` are much lighter than the usual CPU threads and they have very +little penalty for context switching. This way when some threads are performing some +memory operations (reading or writing) others execute instructions. CUDA Threads, Warps, Blocks --------------------------- -In order to understand the GPU execution model let's look at the so called `axpy` operation. On a single CPU core this operation would be executed in a serial manner in a `for/do` loop going over each element on the array, `id`, and computing `y[id]=y[id]+a*x[id]`. +In order to understand the GPU execution model let's look at the so called `axpy` +operation. On a single CPU core this operation would be executed in a serial manner in a +`for/do` loop going over each element on the array, `id`, and computing +`y[id]=y[id]+a*x[id]`. .. code-block:: C++ - - void axpy_(int n, double a, double *x, double *y) - { - for(int id=0;id