This repository contains a collection of heterogeneous computing benchmarks written with CUDA, HIP, SYCL (DPC++), and OpenMP 4.5 target offloading for studying performance, portability, and productivity.
AMD ROCm
Intel DPC++ compiler or Intel oneAPI toolkit
Nvidia HPC SDK
For Rodinia benchmarks, please download the dataset at http://lava.cs.virginia.edu/Rodinia/download.htm
The programs have not been evaluated on Windows or macOS
The latest Intel SYCL compiler (not the Intel oneAPI toolkit) may be needed for building some SYCL programs successfully
For certain programs, kernel results do not match exactly across these programming languages on a given platform
Not all programs automate the verification of host and device results
Not all CUDA programs have SYCL, HIP or OpenMP equivalents
Not all programs have OpenMP target offloading implementations
Raw performance of any program may be suboptimal
Some programs may take longer to complete on an integrated GPU
Some host programs contain platform-specific intrinsics, so they may cause compile errors on a PowerPC platform
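Because kernel results may not match exactly and not every program automates verification, one common way to compare host and device outputs is a tolerance-based check. The following is a minimal sketch under stated assumptions, not the verification used by any program in this repository: the function name `almost_equal_arrays` and the default tolerances are hypothetical and would need tuning per benchmark.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true when every device value lies within a
// combined absolute/relative tolerance of the host reference value.
bool almost_equal_arrays(const std::vector<float>& host,
                         const std::vector<float>& device,
                         float rel_tol = 1e-5f,
                         float abs_tol = 1e-6f) {
  if (host.size() != device.size()) return false;
  for (std::size_t i = 0; i < host.size(); ++i) {
    const float diff = std::fabs(host[i] - device[i]);
    const float bound = abs_tol + rel_tol * std::fabs(host[i]);
    // The negated comparison also rejects NaN, since NaN <= bound is false.
    if (!(diff <= bound)) return false;
  }
  return true;
}
```

A program would copy the device results back to the host, compute the same quantity with a host reference, and pass both arrays to the helper. The absolute term keeps the check meaningful when reference values are close to zero.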
I appreciate your feedback if any examples don't look right.
Early results are shown here
Phase-field simulation of dendritic solidification (https://github.com/myousefi2016/Allen-Cahn-CUDA)
Advection (https://github.com/Nek5000/nekBench/tree/master/adv)
AES encrypt and decrypt (https://github.com/Multi2Sim/m2s-bench-amdsdk-2.5-src)
Affine transformation (https://github.com/Xilinx/SDAccel_Examples/tree/master/vision/affine)
Accelerated gSLIC for superpixel generation used in object segmentation (https://vgg.fiit.stuba.sk/2015-10/accelerated-gslic-for-superpixel-generation-used-in-object-segmentation/)
Adaptive inverse distance weighting (Mei, G., Xu, N. & Xu, L. Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search. SpringerPlus 5, 1389 (2016))
Alignment specification for variables of structured types (http://docs.nvidia.com/cuda/cuda-samples/index.html)
All-pairs distance calculation (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2910913/)
Associated Legendre polynomials (https://github.com/mstsuite/lsms)
The relax kernel in the AMGmk benchmark (https://asc.llnl.gov/CORAL-benchmarks/Micro/amgmk-v1.0.tar.gz)
Asymmetric numeral systems decoding (https://github.com/weissenberger/multians)
A lightweight ambient occlusion renderer (https://code.google.com/archive/p/aobench)
American options pricing (https://github.com/NVIDIA-developer-blog)
Adaptive smoothing (http://www.hcs.harvard.edu/admiralty/)
Array of structure of tiled array for data layout transposition (https://github.com/chai-benchmarks/chai)
Atomic aggregate (https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/)
Atomic add, subtract, min, max, AND, OR, XOR (http://docs.nvidia.com/cuda/cuda-samples/index.html)
64-bit atomic add, min, and max with compare and swap (https://github.com/treecode/Bonsai/blob/master/runtime/profiling/derived_atomic_functions.h)
Integer sum reduction with atomics (https://github.com/ROCm-Developer-Tools/HIP-Examples/tree/master/reduction)
Ham, T.J., et al., 2020, February. A^3: Accelerating Attention Mechanisms in Neural Networks with Approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 328-341). IEEE.
Helmholtz matrix-vector product (https://github.com/Nek5000/nekBench/tree/master/axhelm)
Measure memory transfer rates for copy, add, mul, triad, dot, and nstream (https://github.com/UoB-HPC/BabelStream)
Backpropagation in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
The Bezier surface (https://github.com/chai-benchmarks/chai)
The breadth-first search in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Simulate the gravitational forces in a star cluster using the Barnes-Hut n-body algorithm (https://userweb.cs.txstate.edu/~burtscher/research/ECL-BH/)
Bilateral filter (https://github.com/jstraub/cudaPcl)
Evaluate fair call price for a given set of European options under binomial model (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Bitonic sorting (https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/)
A bit-level operation that aims to reduce the number of bits required to store each value (https://github.com/NVIDIA/nvcomp)
The Black-Scholes simulation (https://github.com/cavazos-lab/FinanceBench)
Block-matching and 3D filtering method for image denoising (https://github.com/DawyD/bm3d-gpu)
Bayesian network learning (https://github.com/OSU-STARLAB/UVM_benchmark/blob/master/non_UVM_benchmarks)
Fixed-rate bond with flat forward curve (https://github.com/cavazos-lab/FinanceBench)
Box filtering (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Classic and vectorizable binary search algorithms (https://www.sciencedirect.com/science/article/abs/pii/S0743731517302836)
Bspline value gradient hessian (https://github.com/QMCPACK/miniqmc/blob/OMP_offload/src/OpenMP/main.cpp)
GPU accelerated Smith-Waterman for performing batch alignments (https://github.com/mgawan/ADEPT)
2D Burgers' equation (https://github.com/soumyasen1809/OpenMP_C_12_steps_to_Navier_Stokes)
Burrows-Wheeler transform (https://github.com/jedbrooke/cuda_bwt)
B+Tree in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Content adaptive resampling (https://github.com/sunwj/CAR)
Cubic b-spline filtering (https://github.com/DannyRuijters/CubicInterpolationCUDA)
Connected components (https://userweb.cs.txstate.edu/~burtscher/research/ECL-CC/)
Condition-dependent Correlation Subgroups (https://github.com/abhatta3/Condition-dependent-Correlation-Subgroups-CCS)
The CCSD tengy kernel, which was converted from Fortran to C by Jeff Hammond, in NWChem (https://github.com/jeffhammond/nwchem-ccsd-trpdrv)
Canny edge detection (https://github.com/chai-benchmarks/chai)
The CFD solver in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Phase-field simulation of spinodal decomposition using the Cahn-Hilliard equation (https://github.com/myousefi2016/Cahn-Hilliard-CUDA)
Complex Hermitian matrix-vector multiplication (https://repo.or.cz/ppcg.git)
The Chi-square 2-df test (https://web.njit.edu/~usman/courses/cs677_spring19/)
Direct coulomb summation kernel (http://www.ks.uiuc.edu/Training/Workshop/GPU_Aug2010/resources/clenergy.tar.gz)
Compact LSTM inference kernel (http://github.com/UCLA-VAST/CLINK)
Gene expression connectivity mapping (https://pubmed.ncbi.nlm.nih.gov/24112435/)
Seismic processing using the classic common midpoint (CMP) method (https://github.com/hpg-cepetro/IPDPS-CRS-CMP-code)
Simulation of Random Network of Hodgkin and Huxley Neurons with Exponential Synaptic Conductances (https://dl.acm.org/doi/10.1145/3307339.3343460)
Check collision of duplicate values (https://github.com/facebookarchive/fbcuda)
Dimitrov, M. and Esslinger, B., 2021. CUDA Tutorial--Cryptanalysis of Classical Ciphers Using Modern GPUs and CUDA. arXiv preprint arXiv:2103.13937.
Complex number arithmetic (https://github.com/tpn/cuda-samples/blob/master/v8.0/include/cuComplex.h)
Document filtering (https://www.intel.com/content/www/us/en/programmable/support/support-resources/design-examples/design-software/opencl/compute-score.html)
Demonstrate the use of streams for concurrent execution of several kernels with dependency on a device (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/0_Introduction/concurrentKernels)
Second-order tensor aggregation with an adjacency matrix (https://github.com/HyTruongSon/GraphFlow)
Convolution filter of a 2D image with separable kernels (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
64-bit cyclic-redundancy check (https://xgitlab.cels.anl.gov/hfinkel/hpcrc64/-/wikis/home)
Cauchy Reed-Solomon encoding (https://www.comp.hkbu.edu.hk/~chxw/gcrs.html)
A lattice Boltzmann scheme with a 2D grid, 9 velocities, and Bhatnagar-Gross-Krook collision step (https://github.com/WSJHawkins/ExploringSycl)
Lattice Boltzmann simulation framework based on C++ parallel algorithms (https://gitlab.com/unigehpfs/stlbm)
The Darmstadt automotive parallel heterogeneous benchmark suite (https://github.com/esa-tu-darmstadt/daphne-benchmark)
The continuum level damage in a peridynamic body (https://github.com/alan-turing-institute/PeriPy)
Discrete Cosine Transform (DCT) and inverse DCT for 8x8 blocks (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Distance-driven backprojection (https://github.com/LAVI-USP/DBT-Reconstruction)
Convert a Bayer mosaic raw image to RGB (https://github.com/GrokImageCompression/latke)
Radio astronomy degridding (https://github.com/NVIDIA/SKA-gpu-degrid)
Check connectivity and remove crosses in depixelization of pixel art (https://github.com/yzhwang/depixelization)
Precise gene sequence de-redundancy software that supports heterogeneous acceleration (https://github.com/JuZhenCS/gene-sequences-de-redundancy)
Mask sequences kernel in Diamond (https://github.com/bbuchfink/diamond)
CPU and GPU divergence test (https://github.com/E3SM-Project/divergence_cmdvse)
Dot product (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Detail-preserving image downscaling (https://github.com/mergian/dpid)
Randomly zero some elements of the input array with a probability using samples from a uniform distribution (https://github.com/pytorch/)
A Lattice QCD Dslash operator proxy application derived from MILC (https://gitlab.com/NERSC/nersc-proxies/milc-dslash)
DXT1 compression (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Simulation of tsunami generation and propagation in the context of early warning (https://gitext.gfz-potsdam.de/geoperil/easyWave)
Elliptic curve Diffie-Hellman key exchange (https://github.com/jaw566/ECDH)
Calculate the eigenvalues of a tridiagonal symmetric matrix (https://github.com/OpenCL/AMD_APP_samples)
Compute the entropy for each point of a 2D matrix using a 5x5 window (https://lan-jing.github.io/parallel%20computing/system/entropy/)
Epistasis detection (https://github.com/rafatcampos/bio-epistasis-detection)
Modified microkernel in the empirical roofline tool (https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/)
Compute the Bhattacharya cost function (https://github.com/benvanwerkhoven/kernel_tuner)
Smith-Waterman (SW) extension in the Burrows-Wheeler Aligner for short-read alignment (https://github.com/lh3/bwa)
Find local maxima (https://github.com/rapidsai/cusignal/)
Compute the maximum of half-precision floating-point numbers using bit operations (https://x.momo86.net/en?p=113)
Half-precision scalar product (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Face detection using the Viola-Jones algorithm (https://sites.google.com/site/5kk73gpu2012/assignment/viola-jones-face-detection)
FDTD-3D (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Use of Feynman-Kac algorithm to solve Poisson's equation in a 2D ellipse (https://people.sc.fsu.edu/~jburkardt/c_src/feynman_kac_2d/feynman_kac_2d.html)
A case study: advanced magnetic resonance imaging reconstruction (https://ict.senecacollege.ca/~gpu610/pages/content/cudas.html)
Filtering by a predicate (https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/)
FFT in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Fractal flame (http://gpugems.hwu-server2.crhc.illinois.edu/)
Floyd-Warshall Pathfinding sample (https://github.com/ROCm-Developer-Tools/HIP-Examples/blob/master/HIP-Examples-Applications/FloydWarshall/)
2D Fluid Simulation using the Lattice-Boltzmann method (https://github.com/OpenCL/AMD_APP_samples)
Frequent pattern compression (Base-delta-immediate compression: practical data compression for on-chip caches. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques (pp. 377-388). ACM.)
Floating-point data compression and decompression (https://userweb.cs.txstate.edu/~burtscher/research/GFC/)
Fresnel integral (http://www.mymathlib.com/functions/fresnel_sin_cos_integrals.html)
Accelerate the fill step in predicting the lowest free energy structure and a set of suboptimal structures (http://rna.urmc.rochester.edu/Text/Fold-cuda.html)
A GPU-accelerated implementation of a genetic algorithm for finding well-performing finite-state machines for predicting binary sequences (https://userweb.cs.txstate.edu/~burtscher/research/FSM_GA/)
Fast Walsh transformation (http://docs.nvidia.com/cuda/cuda-samples/index.html)
Gene alignment (https://github.com/NUCAR-DEV/Hetero-Mark)
Gamma correction (https://github.com/intel/BaseKit-code-samples)
Gaussian elimination in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Graph coloring via shortcutting (https://userweb.cs.txstate.edu/~burtscher/research/ECL-GC/)
Gradient descent (https://github.com/CGudapati/BinaryClassification)
Geodesic distance (https://www.osti.gov/servlets/purl/1576565)
General-purpose sparse matrix-matrix multiplication on GPUs for graph neural networks (https://github.com/hgyhungry/ge-spmm)
General matrix-matrix multiplication on GPUs (https://godweiyang.com/2021/08/24/gemm/)
Expectation maximization with Gaussian mixture models (https://github.com/Corv/CUDA-GMM-MultiGPU)
Simulate the dynamics of a small part of a cardiac myocyte, specifically the fast sodium m-gate (https://github.com/LLNL/goulash)
General Plasmon Pole self-energy simulation in the BerkeleyGW software package (https://github.com/NERSC/gpu-for-science-day-july-2019)
Regular expression matching (https://github.com/bkase/CUDA-grep)
General-relativistic radiative transfer calculations coupled with the calculation of geodesics in the Kerr spacetime (https://github.com/hungyipu/Odyssey)
The HACC microkernel (https://asc.llnl.gov/CORAL-benchmarks/#haccmk)
Haversine distance (https://github.com/rapidsai/cuspatial)
Hybrid methods for parallel betweenness centrality (https://github.com/Adam27X/hybrid_BC)
Heart Wall in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
The heat equation solver (https://github.com/UoB-HPC/heat_sycl)
Apply the discrete 2D Laplacian operation a number of times on a given vector (https://github.com/gpucw/cuda-lapl)
Hellinger distance (https://github.com/rapidsai/raft)
Henry coefficient (https://github.com/CorySimon/HenryCoefficient)
A Portable and Scalable Solver-Framework for the Hierarchical Equations of Motion (https://github.com/noma/hexciton_benchmark)
Histogram (http://github.com/NVlabs/cub/tree/master/experimental)
Hidden Markov model (http://developer.download.nvidia.com/compute/DevZone/OpenCL/Projects/oclHiddenMarkovModel.tar.gz)
The benchmark implements the kernel of the Hogbom Clean deconvolution algorithm (https://github.com/ATNF/askap-benchmarks/)
Hotspot3D in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Fast block-distributed implementation of the Hungarian algorithm (https://github.com/paclopes/HungarianGPU)
1D Haar wavelet transformation (https://github.com/OpenCL/AMD_APP_samples)
Hybridsort in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Fast integer divide (https://github.com/milakov/int_fastdiv)
Interleaved and non-interleaved global memory accesses (Shane Cook. 2012. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs (1st. ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.)
The inverse kinematics for 2-joint arm (http://axbench.org/)
Integer sort (https://github.com/GMAP/NPB-GPU)
Monte-Carlo simulations of 2D Ising Model (https://github.com/NVIDIA/ising-gpu/)
Isotropic 2-dimensional Finite Difference (https://github.com/intel/HPCKit-code-samples/)
Jaccard index for a sparse matrix (https://github.com/rapidsai/nvgraph/blob/main/cpp/src/jaccard_gpu.cu)
Jacobi relaxation (https://github.com/NVIDIA/multi-gpu-programming-models/blob/master/single_gpu/jacobi.cu)
Bob Jenkins lookup3 hash function (https://android.googlesource.com/platform/external/jenkins-hash/+/75dbeadebd95869dd623a29b720678c5c5c55630/lookup3.c)
Kalman filter (https://github.com/rapidsai/cuml/)
A Keccak tree hash function (http://sites.google.com/site/keccaktreegpu/)
Keogh's lower bound (https://github.com/gravitino/cudadtw)
K-means in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
K-nearest neighbor (https://github.com/OSU-STARLAB/UVM_benchmark/blob/master/non_UVM_benchmarks)
Compute the kurtosis of two variables (https://github.com/d-d-j/ddj_store)
Lanczos tridiagonalization (https://github.com/linhr/15618)
Count planar Langford sequences (https://github.com/boris-dimitrov/z4_planar_langford_multigpu)
A Laplace solver using the red-black Gauss-Seidel method with SOR (https://github.com/kyleniemeyer/laplace_gpu)
Solve Laplace equation on a regular 3D grid (https://github.com/gpgpu-sim/ispass2009-benchmarks)
LavaMD in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
AoS and SoA comparison (https://github.com/OpenCL/AMD_APP_samples)
Landau collisional integral (https://github.com/vskokov/Landau_Collisional_Integral)
Latent Dirichlet allocation (https://github.com/js1010/cusim)
QC-LDPC decoding (https://github.com/robertwgh/cuLDPC)
Estimate the Lebesgue constant (https://people.math.sc.edu/Burkardt/c_src/lebesgue/lebesgue.html)
Leukocyte in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Marsa-LFIB4 pseudorandom number generator (https://bitbucket.org/przemstp/gpu-marsa-lfib4/src/master/)
A LIBOR market model Monte Carlo application (https://people.maths.ox.ac.uk/~gilesm/cuda_old.html)
GPU solver for a 2D lid-driven cavity problem (https://github.com/kyleniemeyer/lid-driven-cavity_gpu)
A leaky integrate-and-fire neuron model (https://github.com/e2crawfo/hrr-scaling)
A simple lock-free hash table (https://github.com/nosferalatu/SimpleGPUHashTable)
Approximate the log2 math function (https://adacenter.org/sites/default/files/milspec/Transcendentals.zip)
Lomb-Scargle periodogram (https://github.com/rapidsai/cusignal/)
Lookback option simulation (https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-37-efficient-random-number-generation-and-application)
Linear scaling quantum transport (https://github.com/brucefan1983/gpuqt)
LU decomposition in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Livermore unstructured Lagrangian explicit shock hydrodynamics (https://github.com/LLNL/LULESH)
The Mandelbrot set in the HPCKit code samples (https://github.com/intel/HPCKit-code-samples/)
A practical isosurfacing algorithm for large data on many-core architectures (https://github.com/LRLVEC/MarchingCubes)
Compute matching scores for two 16K 128D feature points (https://github.com/Celebrandil/CudaSift)
In-place matrix rotation
Matrix transposition (https://docs.nvidia.com/cuda/cuda-samples/index.html)
3D Maxpooling (https://github.com/nachiket/papaa-opencl)
Maximum floating-point operations in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Monte Carlo and Molecular Dynamics Simulation Package (https://github.com/khavernathy/mcmd)
Multi-category probit regression (https://github.com/berkeley-scf/gpu-workshop-2016)
Molecular dynamics function in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Simple multiple Debye-Huckel kernel in fast molecular electrostatics algorithms on GPUs (http://gpugems.hwu-server2.crhc.illinois.edu/)
MD5 hash function in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Two-dimensional 3x3 median filter of RGBA image (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Merkle tree construction using rescue prime hash (https://github.com/itzmeanjan/ff-gpu)
A benchmark for memory copy from a host to a device
Selected memory tests (https://github.com/ComputationalRadiationPhysics/cuda_memtest)
Merge two unsorted arrays into a sorted array (https://github.com/ogreen/MergePathGPU)
Simulation of an ensemble of replicas with Metropolis–Hastings computation in the trial run (https://github.com/crinavar/trueke)
Matrix factorization with stochastic gradient descent (https://github.com/cuMF/cumf_sgd)
MiniFE Mantevo mini-application (https://github.com/Mantevo/miniFE)
The core computation of the Bristol University Docking Engine (BUDE) (https://github.com/UoB-HPC/miniBUDE)
Hardware acceleration of long read pairwise overlapping in genome sequencing (https://github.com/UCLA-VAST/minimap2-acceleration)
A finite difference solver for seismic modeling (https://github.com/rsrice/gpa-minimod-artifacts)
A deterministic Sn radiation transport miniapp (https://github.com/wdj/minisweep)
Minkowski distance (https://github.com/rapidsai/raft)
Maximal independent set (http://www.cs.txstate.edu/~burtscher/research/ECL-MIS/)
A read-only version of mixbench (https://github.com/ekondis/mixbench)
Single-precision floating-point matrix multiply using Intel® Math Kernel Library
MTTKRP kernel using mixed-mode CSF (https://github.com/isratnisa/MM-CSF)
Chapter 4.2: Converting CUDA CNN to HIP (https://developer.amd.com/wp-content/resources)
Morphological operators: Erosion and Dilation (https://github.com/yszheda/CUDA-Morphology)
The Miller-Rabin primality test (https://github.com/wizykowski/miller-rabin)
Computation of a matrix Q used in a 3D magnetic resonance image reconstruction (https://github.com/abduld/Parboil/blob/master/benchmarks/mri-q/DESCRIPTION)
Mersenne Twister (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Move-to-front transform (https://github.com/bzip2-cuda/bzip2-cuda)
Multi-material simulations (https://github.com/reguly/multimaterial)
MurmurHash3 yields a 128-bit hash value (https://github.com/aappleby/smhasher/wiki/MurmurHash3)
Myocyte in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Computing non-bonded pair interactions (https://manual.gromacs.org/current/doxygen/html-full/page_nbnxm.xhtml)
Nbody simulation (https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2B/N-BodyMethods/Nbody)
Normal estimation in 3D (https://github.com/PointCloudLibrary/pcl)
Work-efficient parallel non-maximum suppression kernels (https://github.com/hertasecurity/gpu-nms)
Nearest neighbor in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Compute the Euclidean norm of a vector (https://docs.nvidia.com/cuda/cublas)
N-Queens (https://github.com/tcarneirop/ChOp)
Number-theoretic transform (https://github.com/vernamlab/cuHE)
Needleman-Wunsch in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Multi-threading over a single device (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Overlay grid in the DetectNet (https://github.com/dusty-nv/jetson-inference)
PageRank (https://github.com/Sable/Ostrich/tree/master/map-reduce/page-rank)
Particle diffusion in the HPCKit code samples (https://github.com/intel/HPCKit-code-samples/)
Particle Filter in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Particles collision simulation (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
PathFinder in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Perplexity search (https://github.com/rapidsai/cuml/)
Pair hidden Markov model (https://github.com/lienliang/Pair_HMM_forward_GPU)
Solve the point-in-polygon problem using the crossing number algorithm (https://github.com/benvanwerkhoven/kernel_tuner)
Petri-net simulation (https://github.com/abduld/Parboil/)
Fused point-wise operations (https://developer.nvidia.com/blog/optimizing-recurrent-neural-networks-cudnn-5/)
Pooling layer (https://github.com/PaddlePaddle/Paddle)
Implementations of population count (Jin, Z. and Finkel, H., 2020, May. Population Count on Intel® CPU, GPU and FPGA. In 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 432-439). IEEE.)
Lightweight cryptography (https://github.com/bozhu/PRESENT-C/blob/master/present.h)
Calculate a partition function for a sequence, which can be used to predict base pair probabilities (http://rna.urmc.rochester.edu/Text/partition-cuda.html)
Projectile motion is a program that implements a ballistic equation (https://github.com/intel/BaseKit-code-samples)
Niederreiter quasirandom number generator and Moro's Inverse Cumulative Normal Distribution generator (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Quality threshold clustering (https://github.com/vetter/shoc/)
A parallel radix sort (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Random memory access (https://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/)
3D Gray-Scott reaction diffusion (https://github.com/ifilot/wavefuse)
2-dimensional Gaussian Blur Filter of RGBA image (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
ResNet kernels for inference (https://github.com/xuqiantong/CUDA-Winograd)
Reverse an input array of size 256 using shared memory
Reproducible floating sum (https://github.com/facebookarchive/fbcuda)
Random number generation using the Wallace algorithm (https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-37-efficient-random-number-generation-and-application)
Romberg's method (https://github.com/SwayambhuNathRay/Parallel-Romberg-Integration)
A proxy application for full neutron transport applications like OpenMC that support multipole cross section representations (https://github.com/ANL-CESAR/RSBench/)
A structured-grid application in the oil and gas industry (https://github.com/ROCm-Developer-Tools/HIP-Examples/tree/master/rtm8)
An ODE solver using the Rush-Larsen scheme (https://bitbucket.org/finsberg/gotran/src/master)
Chemical rates computation used in the simulation of combustion (https://github.com/vetter/shoc/)
Naive template matching with SAD (https://github.com/gholomia/CTMC)
Shapley sampling values explanation method (https://github.com/rapidsai/cuml)
Perform the SAXPY operation on host and device (https://github.com/pc2/OMP-Offloading)
A block-level scan (https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda)
Scan a large array (https://github.com/OpenCL/AMD_APP_samples)
Part of BIP39 solver (https://github.com/johncantrell97/bip39-solver-gpu)
Seam carving (https://github.com/pauty/CUDA_seam_carving)
Fast segmented sort on a GPU (https://github.com/Funatiq/bb_segsort)
Plasma sheath simulation with the particle-in-cell method (https://www.particleincell.com/2016/cuda-pic/)
The shared local memory microbenchmark (https://github.com/ekondis/gpumembench)
Shuffle instructions with subgroup sizes of 8, 16, and 32 (https://github.com/cpc/hipcl/tree/master/samples/4_shfl)
Set intersection with matrix multiply (https://github.com/chribell/set_intersection)
The attenuation of neutron fluxes across an individual geometrical segment (https://github.com/ANL-CESAR/SimpleMOC-kernel)
Sparse LU factorization (https://github.com/sheldonucr/GLU_public)
Genome pre-alignment filtering (https://github.com/CMU-SAFARI/SneakySnake)
Sobel filter (https://github.com/OpenCL/AMD_APP_samples)
Sobol quasi-random generator (https://docs.nvidia.com/cuda/cuda-samples/index.html)
The softmax function (https://github.com/pytorch/glow/tree/master/lib/Backends/OpenCL)
Radix sort in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Second-order IIR digital filtering (https://github.com/rapidsai/cusignal/)
A miniapp for the CoMet comparative genomics application (https://github.com/wdj/sparkler)
The simple n^2 SPH simulation (https://github.com/olcf/SPH_Simple)
The split operation in radix sort (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Image registration calculations for the statistical parametric mapping (SPM) system (http://mri.ee.ntust.edu.tw/cuda/)
A thread-level synchronization-free sparse triangular solver (https://github.com/JiyaSu/CapelliniSpTRSV)
SRAD (version 1) in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
String search (https://github.com/OpenCL/AMD_APP_samples)
The single-source shortest path (https://github.com/chai-benchmarks/chai)
Standard deviation (https://github.com/rapidsai/raft)
1D stencil (https://www.olcf.ornl.gov/wp-content/uploads/2019/12/02-CUDA-Shared-Memory.pdf)
3D stencil (https://github.com/LLNL/cardioid)
Streamcluster in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Lattice QCD SU(3) matrix-matrix multiply microbenchmark (https://gitlab.com/NERSC/nersc-proxies/su3_bench)
Surfel rendering (https://github.com/jstraub/cudaPcl)
Compute the singular value decomposition of 3x3 matrices (https://github.com/kuiwuchn/3x3_SVD_CUDA)
SW4 curvilinear kernels are five stencil kernels that account for ~50% of the solution time in SW4 (https://github.com/LLNL/SW4CK)
A proxy for the SNAP force calculation in the LAMMPS molecular dynamics package (https://github.com/FitSNAP/TestSNAP)
Solve tridiagonal systems of equations using the Thomas algorithm (https://pm.bsc.es/gitlab/run-math/cuThomasBatch/tree/master)
Memory fence function (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions)
Accumulate contributions of tissue source strengths and previous solute levels to current tissue solute levels (https://github.com/secomb/GreensTD19_GPU)
Tone mapping (https://github.com/OpenCL/AMD_APP_samples)
The 2-point correlation function (https://users.ncsa.illinois.edu/kindr/projects/hpca/index.html)
Tensor transposition (https://github.com/Jokeren/GPA-Benchmark/tree/master/ExaTENSOR)
Triad in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Matrix solvers for a large number of small independent tridiagonal linear systems (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Trotter-Suzuki approximation (https://bitbucket.org/zzzoom/trottersuzuki/src/master/)
Solving the symmetric traveling salesman problem with iterative hill climbing (https://userweb.cs.txstate.edu/~burtscher/research/TSP_GPU/)
Uniform random noise generator (https://github.com/OpenCL/AMD_APP_samples)
Genuchten conversion of soil moisture and pressure (https://github.com/HydroComplexity/Dhara)
Computes expectation values (6D integrals) associated with the helium atom (https://github.com/wadejong/Summer-School-Materials/tree/master/Examples/vmc)
Demonstrate the usage of the vote intrinsics (https://github.com/NVIDIA/cuda-samples/)
Sort small numbers (https://github.com/facebookarchive/fbcuda)
Winograd convolution (https://github.com/ChenyangZhang-cs/iMLBench)
Compute spring forces in a worm-like chain model with a power function (https://github.com/AnselGitAccount/USERMESO-2.0)
Count the number of words in a text (https://github.com/NVIDIA/thrust/blob/main/examples/)
WRF Single Moment 5-tracer (https://www2.mmm.ucar.edu/wrf/WG2/GPU/WSM5.htm)
List ranking with Wyllie's algorithm (Rehman, M. & Kothapalli, Kishore & Narayanan, P. (2009). Fast and Scalable List Ranking on the GPU. Proceedings of the International Conference on Supercomputing. 235-243. 10.1145/1542275.1542311.)
Hartree-Fock self-consistent-field (SCF) calculation of H2O (https://github.com/recoli/XLQC)
A proxy application for full neutron transport applications like OpenMC (https://github.com/ANL-CESAR/XSBench/)
Kernels may read and write directly to pinned system memory from a user perspective (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/0_Introduction/simpleZeroCopy)
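As an illustration of the host-versus-device pattern behind entries such as the SAXPY benchmark listed above, the sketch below pairs a host reference with an OpenMP target offloading variant. It is a hedged example, not code taken from the referenced repositories: the function names `saxpy_host` and `saxpy_target` are hypothetical, and both functions assume `x` and `y` have the same length. When the code is compiled without OpenMP support, compilers such as GCC and Clang ignore the pragma and the loop simply runs on the host.

```cpp
#include <cstddef>
#include <vector>

// Host reference: y[i] = a * x[i] + y[i].
void saxpy_host(float a, const std::vector<float>& x, std::vector<float>& y) {
  for (std::size_t i = 0; i < x.size(); ++i) y[i] = a * x[i] + y[i];
}

// Offloaded variant (hypothetical name). The map clauses copy x to the
// device and copy y to the device and back after the loop finishes.
void saxpy_target(float a, const std::vector<float>& x, std::vector<float>& y) {
  const float* xp = x.data();
  float* yp = y.data();
  const int n = static_cast<int>(x.size());
  #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
  for (int i = 0; i < n; ++i) yp[i] = a * xp[i] + yp[i];
}
```

Running both functions on copies of the same input and comparing the two output arrays gives a simple host/device verification step.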
Authored and maintained by Zheming Jin (https://github.com/zjin-lcf)
Anton Gorshkov, Beau Johnston, Bernhard Esslinger, Bert de Jong, Chengjian Liu, Chris Knight, David Oro, Douglas Franz, Edson Borin, Gabriell Araujo, Henry Gabb, Ian Karlin, Istvan Reguly, Jason Lau, Jeff Hammond, Jianxin Qiu, Wayne Joubert, Jakub Chlanda, Jiya Su, John Tramm, Ju Zheng, Martin Burtscher, Matthias Noack, Michael Kruse, Michel Migdal, Mike Giles, Mohammed Alser, Muhammad Haseeb, Muaaz Awan, Nevin Liber, Nicholas Miller, Pavel Samolysov, Pedro Valero Lara, Piotr Różański, Rahulkumar Gayatri, Shaoyi Peng, Robert Harrison, Robin Kobus, Rodrigo Vimieiro, Romanov Vlad, Tadej Ciglarič, Thomas Applencourt, Tiago Carneiro, Timmie Smith, Tobias Baumann, Usman Roshan, Ye Luo, Yongbin Gu, Zhe Chen
The project uses resources at the Intel® DevCloud, the Chameleon testbed supported by the National Science Foundation, the Argonne Leadership Computing Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357, and the Experimental Computing Laboratory (ExCL) at Oak Ridge National Laboratory supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.