Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make FIELD%DEVPTR LBOUNDS always one #69

Merged
merged 3 commits into from
Jan 8, 2025
Merged

Conversation

wertysas
Copy link
Contributor

@wertysas wertysas commented Dec 4, 2024

This PR is the first in a series of PR's related to the issue Add support for partial offloading of field API fields to device #68.

This PR decouples the bounds of FIELD%DEVPTR from FIELD%PTR member, by always making the FIELDS%DEVPTR lower bound be one. The decoupling of the device and host pointers is the first step towards partial field offloads, as outlined in the issue linked above.

@wertysas wertysas requested a review from awnawab December 4, 2024 14:26
@FussyDuck
Copy link

FussyDuck commented Dec 4, 2024

CLA assistant check
All committers have signed the CLA.

@awnawab awnawab changed the title Make lower bounds field DEVPTR always start from one Make FIELD%DEVPTR LBOUNDS always one Dec 5, 2024
@awnawab awnawab marked this pull request as ready for review December 5, 2024 20:14
Copy link
Collaborator

@awnawab awnawab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many many thanks @wertysas for this very neatly implemented first step of #68 👌 This is good to go from my side, but as this is quite a fundamental change, testing this thoroughly is essential. @mlange05 could you please confirm this branch doesn't break ecPhys offload? @dareg could you also please confirm these changes are safe on your end?

Copy link
Collaborator

@mlange05 mlange05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me and I can confirm that EC-physics regression test works fine with this on GPU.

@pmarguinaud
Copy link
Collaborator

pmarguinaud commented Dec 6, 2024

Hello,

I would like this to be an option, as we already rely on %DEVPTR having the same bounds as %PTR in our physics.

Or would not it be possible to create a local pointer with lower bounds equal to 1 and make the transfer with this pointer ?

@awnawab
Copy link
Collaborator

awnawab commented Dec 6, 2024

Ah, I didn't know we had use cases where external code could access SELF%DEVPTR directly. In that case of course we can make this optional. Thanks!

@pmarguinaud
Copy link
Collaborator

pmarguinaud commented Dec 9, 2024

The best solution would be to use a local pointer with lower bounds equal to one; is not that possible ? Because making DEVPTR with lb =1 optional looks quite complicated.

It is hard for me to understand why a pointer with lower bounds = 1 is required, as long as it is contiguous.

I also understand that the purpose of this is to be able to run calculations with datasets so large that they cannot fit on the device, which is a very questionable purpose, as the recommended approach is to keep data on the device, which we now know is possible (we did it with ARPEGE).
Another thing, is that focussing on a single kernel does not help understanding how to organize the code in the model itself and highlighting where memory bottlenecks are. For instance, keep in mind that usually (also it is not the case for cloudsc), memory passed as arguments to a kernel is much smaller that the temporary memory allocated by the kernel itself, so the approach you are trying to use here would not solve the memory issue.

That being said, if you manage to make it simple and not too complicated, I will approve it.

@awnawab
Copy link
Collaborator

awnawab commented Dec 9, 2024

Hi Phillipe,

There's two features we are hoping to add:

  1. Reduce the memory footprint on device
  2. Break up the data-transfers and kernels into several chunks, and run these asynchronously on device so as to overlap communication and computation

If we are to use the DEVPTR storage pointer in F-API (which we would like to to minimise complexity), then achieving the above requires HST and DEV in ${ftn}$_COPY_DIM${d}$_CONTIGUOUS to have different bounds.

I assumed the storage pointer was not part of the external API (in fact I wondered why we haven't marked it as private) so I advised Johan to go with the simplest solution: assume DEV has lbounds of one and map the bounds of HST to DEV accordingly. Apologies for that Johan, my misunderstanding has created a bit of extra work for you.

If I am understanding your proposal correctly, you are suggesting to place a device storage pointer in user code outside of F-API and use that to do the copy? If so, we could achieve point 1 with that approach, although for fields used in multiple kernels (e.g. YDVARS) it would be very hard to do without commiting device storage pointers to the IFS (which I think we must avoid). The second of our aims, which requires all the machinery for asynchronous copies, I would really like to keep the device storage inside F-API.

With the new knowledge of your physics use-case, @wertysas and I will sit together offline and come up with a clean workaround.

@wertysas wertysas force-pushed the je-devptr-lower-bound branch from 5ddc7a6 to a8353a3 Compare December 11, 2024 14:19
@wertysas wertysas marked this pull request as draft December 12, 2024 08:49
@pmarguinaud
Copy link
Collaborator

Hello Ahmad,

Clearly, PTR and DEVPTR management (allocation/deallocation/synchronization) are field_api purpose, and the user should not interfere with this. That being said, and assuming proper synchronization has taken place, the data pointed to by PTR and DEVPTR can be accessed. This is what we do in our physics/dynamics.

I still do not understand why you need to have lower bounds of DEVPTR equal to 1. Maybe you could try to explain ?

@awnawab
Copy link
Collaborator

awnawab commented Dec 12, 2024

Hi Philipe,

We want to remove the limitation that SELF%DEVPTR and SELF%PTR always have to be of the same size (of course this will be optional, and the default will be the current behaviour). To remove this limitation, the 1D copy methods have to account for potentially different bounds of PTR and DEVPTR and compute the necessary mapping. This mapping is simplest when DEVPTR has lbounds of 1. Of course we can also compute the mapping even if the lbounds of DEVPTR are not 1. Johan is now working to remove the hard-coding of the lbounds to 1, and is instead working to update the 1D copy method so that it simply checks the lbounds of HST and DEV and computes the mapping. The 2D copy method is already written in a way that it does not rely on HST and DEV having the same bounds, and so it should be straightforward to do the same in the 1D copy method. We'll drop you a line once Johan has applied those changes and we'll mark the PR ready for review again.

@wertysas wertysas force-pushed the je-devptr-lower-bound branch 2 times, most recently from 647afa8 to cd763dc Compare December 13, 2024 09:14
@awnawab awnawab marked this pull request as ready for review December 16, 2024 13:31
@awnawab
Copy link
Collaborator

awnawab commented Dec 16, 2024

Thanks a lot @wertysas for addressing the comments 🙏 @pmarguinaud this is now ready for review again, please have a look whenever you can.

@wertysas wertysas force-pushed the je-devptr-lower-bound branch from cd763dc to d151763 Compare December 18, 2024 15:51
@dareg
Copy link
Collaborator

dareg commented Jan 6, 2025

Hello,
Could you make sure all the tests pass please

@mlange05
Copy link
Collaborator

mlange05 commented Jan 6, 2025

Hi @dareg , yes we'll make sure we get tests green first. The current ci-hpc failures are due to some outdated ssh keys on the test runners. We've flagged this with the respective admins but will have to wait for them to update the runners. I'll wait to get the test to pass before merging though.

@dareg
Copy link
Collaborator

dareg commented Jan 6, 2025

Okay, thanks. Sorry to be nit-picking like that. I'm afraid that if we allow a PR with a failing test today, we might have PR with multiple failing tests in a year.

@mlange05
Copy link
Collaborator

mlange05 commented Jan 8, 2025

Hi @dareg, I fully agree. The runners have been fixed now and all tests are passing.

@mlange05 mlange05 merged commit 596bf04 into main Jan 8, 2025
15 checks passed
@mlange05 mlange05 deleted the je-devptr-lower-bound branch January 8, 2025 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants