-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
These changes get kokkos+rrtmgp performance on frontier to match yakl+rrtmgp #39
base: kokkos_conversion_branch
Are you sure you want to change the base?
Conversation
Allocate 1 item at a time to ensure aligned memory.
f674272
to
34793af
Compare
Performance now seems to match YAKL on machines we care about. This is ready for final review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should wait for Ben's approval, but this looks good to me. One thing that would be useful: Put a comment in somewhere explaining the naming scheme for "init_no_alloc" and "alloc_no_alloc".
j = (idx / dims[2]) % dims[1]; | ||
k = idx % dims[2]; | ||
} | ||
|
||
KOKKOS_INLINE_FUNCTION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the interesting changes are here. These are the functions that unflatten an MD idx and the macros that makes this stuff usable.
int64_t get_num_reals(const int64_t num) noexcept | ||
{ | ||
assert(sizeof(T) <= sizeof(RealT)); | ||
static constexpr int64_t CACHE_LINE_SIZE = 64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting code here. How temp views are allocated definitely impacts performance.
@ambrad , you beat me to the punch! I left a few annotations for the parts that I think are interesting. |
Oh, I also added more to the PR description. |
Performance was measured with this case:
SMS_Ln300.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-perf_test--scream-output-preset-1
With these changes, we appear to be spending less time in kokkos kernels than yakl kernels:
Change list:
^ Notice the index order flip. FLATTEN_MD_KERNEL works for both left and right layouts and will do the right thing in either case, so there's no need to do this inversion..
type_check_v<type>
overtype_check<type>::value