
These changes get kokkos+rrtmgp performance on frontier to match yakl+rrtmgp #39

Open
wants to merge 27 commits into base: kokkos_conversion_branch

Conversation

jgfouca
Member

@jgfouca jgfouca commented Dec 16, 2024

Performance was measured with this case:
SMS_Ln300.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-perf_test--scream-output-preset-1

With these changes, we appear to be spending less time in kokkos kernels than yakl kernels:

Total time spent in YAKL kernels:   16.262377
Total time spent in Kokkos kernels: 14.596052

Change list:

  • Replaces all multi-dimensional kernel launches with the macro FLATTEN_MD_KERNEL$N. This removes most uses of Kokkos' MDRangePolicy, which does not seem to perform as well in most cases.
  • YAKL's SimpleBounds always makes the rightmost index the fast one, so the code was inverting the dimension order to get layout left, e.g.:
parallel_for( YAKL_AUTO_LABEL() , SimpleBounds<3>(na,nb,nc) ,
              YAKL_LAMBDA (int a, int b, int c) {
  md_array(c, b, a) = ...;
});

^ Notice the index order flip. FLATTEN_MD_KERNEL works for both left and right layouts and will do the right thing in either case, so there's no need for this inversion.
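To make the index flip concrete, here is a small host-only sketch (the names na/nb/nc and these helper functions are illustrative, not from the PR) of the linear offsets the two layouts imply. Layout right (C order) makes the last index fastest in memory; layout left (Fortran order) makes the first index fastest. A flattening macro can pick the unflatten order per layout, so the fast loop index always lands on the fast memory dimension without a manual flip:

```cpp
#include <cassert>
#include <cstddef>

// Layout right (C order): the LAST index (k) varies fastest in memory.
std::size_t offset_right(int i, int j, int k, int nb, int nc) {
  return (static_cast<std::size_t>(i) * nb + j) * nc + k;
}

// Layout left (Fortran order): the FIRST index (i) varies fastest in memory.
std::size_t offset_left(int i, int j, int k, int na, int nb) {
  return i + static_cast<std::size_t>(na) * (j + static_cast<std::size_t>(nb) * k);
}
```

With SimpleBounds<3>(a,b,c) the rightmost loop index is fast, so indexing a layout-left array as md_array(c,b,a) is what keeps neighboring threads on neighboring memory addresses.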

  • Gets rid of all uses of alloc_raw to allocate multiple temporary views at the same time. That pattern produced views that were not cache aligned, which hurt performance.
  • Adds non-allocating versions of some routines. YAKL didn't have to worry about this since it was doing pool allocations for all arrays automatically. The non-allocating routines take a view that has already been created (presumably via the pool allocator).
  • Adds fences to timing macros to ensure accurate times.
  • Tweaks the pool allocator to improve performance: all allocations are rounded up to cache-line size.
  • Prefers type_check_v<type> over type_check<type>::value.
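The fencing point deserves a sketch: device kernel launches are asynchronous, so without a fence a timer measures only launch overhead. This is not the PR's actual macro; device_fence() is a no-op stand-in for Kokkos::fence() so the example runs on the host:

```cpp
#include <chrono>

// Stand-in for Kokkos::fence(), which blocks until queued device work finishes.
inline void device_fence() {}

template <typename F>
double time_kernel(F&& launch) {
  device_fence();  // don't charge earlier in-flight work to this kernel
  const auto t0 = std::chrono::steady_clock::now();
  launch();        // an asynchronous launch returns immediately...
  device_fence();  // ...so wait for completion before reading the clock
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}
```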

@jgfouca jgfouca requested a review from brhillman December 16, 2024 21:59
@jgfouca jgfouca self-assigned this Dec 16, 2024
@jgfouca
Member Author

jgfouca commented Jan 29, 2025

Performance now seems to match YAKL on machines we care about. This is ready for final review.

@jgfouca jgfouca requested a review from ambrad January 29, 2025 17:10
Member

@ambrad ambrad left a comment


You should wait for Ben's approval, but this looks good to me. One thing that would be useful: add a comment somewhere explaining the naming scheme for "init_no_alloc" and "alloc_no_alloc".

j = (idx / dims[2]) % dims[1];
k = idx % dims[2];
}

KOKKOS_INLINE_FUNCTION
Member Author


Some of the interesting changes are here. These are the functions that unflatten an MD idx and the macros that make them usable.
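For reference, a self-contained sketch of the unflatten step the snippet above belongs to (the exact signature in the PR may differ): recovering (i, j, k) from a flat index over a dims[0] x dims[1] x dims[2] layout-right iteration space.

```cpp
#include <cassert>

// Recover (i, j, k) from a flat 1-D index; inverse of
// idx = (i * dims[1] + j) * dims[2] + k.
inline void unflatten3(int idx, const int dims[3], int& i, int& j, int& k) {
  i = idx / (dims[1] * dims[2]);
  j = (idx / dims[2]) % dims[1];
  k = idx % dims[2];
}
```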

int64_t get_num_reals(const int64_t num) noexcept
{
assert(sizeof(T) <= sizeof(RealT));
static constexpr int64_t CACHE_LINE_SIZE = 64;
Member Author

@jgfouca jgfouca Jan 29, 2025


Interesting code here. How temp views are allocated definitely impacts performance.
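A guess at what the rounding in get_num_reals is for (RealT = double here is an assumption): rounding each element count up to a whole number of cache lines keeps every sub-allocation cache-line aligned when views are carved out of a pool back to back.

```cpp
#include <cassert>
#include <cstdint>

using RealT = double;
static constexpr std::int64_t CACHE_LINE_SIZE = 64;  // bytes

// Round a real count up so the allocation spans whole cache lines.
std::int64_t get_num_reals_sketch(std::int64_t num_reals) {
  const std::int64_t bytes = num_reals * static_cast<std::int64_t>(sizeof(RealT));
  const std::int64_t rounded =
      ((bytes + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE) * CACHE_LINE_SIZE;
  return rounded / static_cast<std::int64_t>(sizeof(RealT));
}
```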

@jgfouca
Member Author

jgfouca commented Jan 29, 2025

@ambrad , you beat me to the punch! I left a few annotations for the parts that I think are interesting.

@jgfouca
Member Author

jgfouca commented Jan 29, 2025

Oh, I also added more to the PR description.
