
These changes get kokkos+rrtmgp performance on frontier to match yakl+rrtmgp #39

Open
wants to merge 27 commits into base: kokkos_conversion_branch

Conversation

jgfouca
Member

@jgfouca jgfouca commented Dec 16, 2024

Performance was measured with this case:
SMS_Ln300.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-perf_test--scream-output-preset-1

With these changes, we appear to be spending less time in kokkos kernels than yakl kernels:

Total time spent in YAKL kernels:   16.262377
Total time spent in Kokkos kernels: 14.596052

Change list:

  • Replaces all multi-dimensional kernel launches with the macro FLATTEN_MD_KERNEL$N. This removes most uses of Kokkos' MDRangePolicy, which does not seem to perform as well in most cases.
  • YAKL's SimpleBounds always makes the rightmost index the fast one, so the code was inverting the dimension order to get layout left, e.g.:
parallel_for( YAKL_AUTO_LABEL() , SimpleBounds<3>(na,nb,nc) ,
              YAKL_LAMBDA (int a, int b, int c) {
  md_array(c, b, a) = ...;
});

^ Notice the index order flip. FLATTEN_MD_KERNEL works for both left and right layouts and will do the right thing in either case, so there's no need for this inversion.
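To make the index flip concrete, here is a small host-only sketch (the names na/nb/nc and these helper functions are illustrative, not from the PR) of the linear offsets the two layouts imply. Layout right (C order) makes the last index fastest in memory; layout left (Fortran order) makes the first index fastest. A flattening macro can pick the unflatten order per layout, so the fast loop index always lands on the fast memory dimension without a manual flip:

```cpp
#include <cassert>
#include <cstddef>

// Layout right (C order): the LAST index (k) varies fastest in memory.
std::size_t offset_right(int i, int j, int k, int nb, int nc) {
  return (static_cast<std::size_t>(i) * nb + j) * nc + k;
}

// Layout left (Fortran order): the FIRST index (i) varies fastest in memory.
std::size_t offset_left(int i, int j, int k, int na, int nb) {
  return i + static_cast<std::size_t>(na) * (j + static_cast<std::size_t>(nb) * k);
}
```

With SimpleBounds<3>(a,b,c) the rightmost loop index is fast, so indexing a layout-left array as md_array(c,b,a) is what keeps neighboring threads on neighboring memory addresses.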

  • Gets rid of all uses of alloc_raw to allocate multiple temporary views at the same time. That pattern produced views that were not cache aligned, which hurt performance.
  • Adds non-allocating versions of some routines. YAKL didn't have to worry about this since it was doing pool allocations for all arrays automatically. The non-allocating routines take a view that has already been created (presumably via the pool allocator).
  • Adds fences to timing macros to ensure accurate times.
  • Tweaks the pool allocator to improve performance: all allocations are rounded up to cache-line size.
  • Prefers type_check_v<type> over type_check<type>::value.
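The fencing point deserves a sketch: device kernel launches are asynchronous, so without a fence a timer measures only launch overhead. This is not the PR's actual macro; device_fence() is a no-op stand-in for Kokkos::fence() so the example runs on the host:

```cpp
#include <chrono>

// Stand-in for Kokkos::fence(), which blocks until queued device work finishes.
inline void device_fence() {}

template <typename F>
double time_kernel(F&& launch) {
  device_fence();  // don't charge earlier in-flight work to this kernel
  const auto t0 = std::chrono::steady_clock::now();
  launch();        // an asynchronous launch returns immediately...
  device_fence();  // ...so wait for completion before reading the clock
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}
```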

@jgfouca jgfouca requested a review from brhillman December 16, 2024 21:59
@jgfouca jgfouca self-assigned this Dec 16, 2024
@jgfouca
Member Author

jgfouca commented Jan 29, 2025

Performance now seems to match YAKL on machines we care about. This is ready for final review.

@jgfouca jgfouca requested a review from ambrad January 29, 2025 17:10
Member

@ambrad ambrad left a comment


You should wait for Ben's approval, but this looks good to me. One thing that would be useful: add a comment somewhere explaining the naming scheme for "init_no_alloc" and "alloc_no_alloc".

j = (idx / dims[2]) % dims[1];
k = idx % dims[2];
}

KOKKOS_INLINE_FUNCTION
Member Author


Some of the interesting changes are here. These are the functions that unflatten an MD idx and the macros that make them usable.
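For reference, a self-contained sketch of the unflatten step the snippet above belongs to (the exact signature in the PR may differ): recovering (i, j, k) from a flat index over a dims[0] x dims[1] x dims[2] layout-right iteration space.

```cpp
#include <cassert>

// Recover (i, j, k) from a flat 1-D index; inverse of
// idx = (i * dims[1] + j) * dims[2] + k.
inline void unflatten3(int idx, const int dims[3], int& i, int& j, int& k) {
  i = idx / (dims[1] * dims[2]);
  j = (idx / dims[2]) % dims[1];
  k = idx % dims[2];
}
```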

int64_t get_num_reals(const int64_t num) noexcept
{
assert(sizeof(T) <= sizeof(RealT));
static constexpr int64_t CACHE_LINE_SIZE = 64;
Member Author

@jgfouca jgfouca Jan 29, 2025


Interesting code here. How temp views are allocated definitely impacts performance.
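A guess at what the rounding in get_num_reals is for (RealT = double here is an assumption): rounding each element count up to a whole number of cache lines keeps every sub-allocation cache-line aligned when views are carved out of a pool back to back.

```cpp
#include <cassert>
#include <cstdint>

using RealT = double;
static constexpr std::int64_t CACHE_LINE_SIZE = 64;  // bytes

// Round a real count up so the allocation spans whole cache lines.
std::int64_t get_num_reals_sketch(std::int64_t num_reals) {
  const std::int64_t bytes = num_reals * static_cast<std::int64_t>(sizeof(RealT));
  const std::int64_t rounded =
      ((bytes + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE) * CACHE_LINE_SIZE;
  return rounded / static_cast<std::int64_t>(sizeof(RealT));
}
```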

@jgfouca
Member Author

jgfouca commented Jan 29, 2025

@ambrad , you beat me to the punch! I left a few annotations for the parts that I think are interesting.

@jgfouca
Member Author

jgfouca commented Jan 29, 2025

Oh, I also added more to the PR description.
