[NASA:Update] Distributed dace cache (rework) #16
Conversation
We are having a problem with determinism of temporaries. The utest fails sometimes, which sounds bad, but the regression tests are passing fine.
This is now solved.
Reviewed with Oliver
Overall a good PR; minor points about some of the structure of the config/cache checking, and the heat_source and diss_est issue keeps coming back up.
- backend = quantity_factory.empty(
+ backend = quantity_factory.zeros(
Why does it matter whether we use empty or zeros here? It seems like setting the memory to 0 is an unnecessary step. Also, is there a better way to get at the backend than through a Quantity?
This is to fix the "deterministic" test in utest. empty grabs any memory. The test fails because halos have the wrong values, and those get passed down. Arguably I did a blanket replacement of all the empty/zeros calls because none of them are executed at runtime, so the small extra cost of zeroing them out does not matter.
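For illustration, a minimal numpy sketch of why empty breaks reproducibility when halo points are read before being written (the shapes and the interior-only update are invented for the example):

import numpy as np

def interior_only_update(field):
    # Only the interior is computed; the halo ring keeps its initial values.
    field[1:-1, 1:-1] = 1.0
    return field

a = interior_only_update(np.empty((4, 4)))  # halo holds uninitialized garbage
b = interior_only_update(np.zeros((4, 4)))  # halo is a deterministic 0.0

# Two zeros-backed runs always agree; empty-backed runs may differ in the halo.
assert np.array_equal(b, interior_only_update(np.zeros((4, 4))))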
from pace.util import CubedSpherePartitioner

def identify_code_path(
Something about this feels awkward to me, but I'm not sure making this live inside of a more fully-featured FV3CodePath class makes sense either.
We need a proper build class/system that uproots most of this and consolidates it in one place. It should include distributed compiling for non-orchestrated backends, better hashing/caching, and much more.
I went for a functional paradigm in the meantime, since I fully expect this to be reworked.
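As a rough illustration of the functional approach, a self-contained sketch of what a code-path classifier could look like (the enum, names, and layout math here are hypothetical, not the actual pace implementation):

from enum import Enum

class TileCodePath(Enum):
    # Illustrative position classes; compiled caches can be shared by
    # ranks sitting in the same class on a tile.
    CORNER = "corner"
    EDGE = "edge"
    CENTER = "center"

def identify_code_path(rank: int, ranks_per_tile: int) -> TileCodePath:
    side = int(ranks_per_tile ** 0.5)  # assumes a square side x side layout
    x = rank % side
    y = (rank % ranks_per_tile) // side
    on_x_edge = x in (0, side - 1)
    on_y_edge = y in (0, side - 1)
    if on_x_edge and on_y_edge:
        return TileCodePath.CORNER
    if on_x_edge or on_y_edge:
        return TileCodePath.EDGE
    return TileCodePath.CENTER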
Yeah, I don't have a super compelling alternative and it's not really blocking; it just felt awkward when I read it.
    attr1, attr2, err_msg=f"{attr} not equal"
)
except AssertionError:
    assert np.allclose(attr1, attr2, equal_nan=True)
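For context, the changed check reconstructed as a complete, runnable helper (a sketch: the enclosing function name and the exact numpy assertion cut off above the visible lines are assumptions):

import numpy as np

def assert_attrs_match(attr, attr1, attr2):
    # Exact comparison first; fall back to a NaN-tolerant check so NaNs
    # present in both arrays (e.g. in halo points) compare as equal.
    try:
        np.testing.assert_array_equal(attr1, attr2, err_msg=f"{attr} not equal")
    except AssertionError:
        assert np.allclose(attr1, attr2, equal_nan=True)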
Do we have NaNs in our temporaries? Is there another reason for this change?
Yes. Grid/metric terms generation produces NaNs.
I don't recall this happening in the Fortran, as its debug mode checks for everything but underflows, which are flushed to zero. Do you know what causes this in the Python code?
We have NaNs in the geometry under some circumstances at least:
/home/runner/work/pace/pace/util/pace/util/grid/geometry.py:516: RuntimeWarning: invalid value encountered in divide
del6_v = sina_u * dy / dxc
I believe those are in the halo.
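A minimal reproduction of that warning, assuming uninitialized halo points end up as zeros (the array values are invented for the example):

import numpy as np

sina_u = np.array([0.0, 1.0, 1.0, 1.0, 0.0])  # halo entries never filled
dy = np.ones(5)
dxc = np.array([0.0, 1.0, 1.0, 1.0, 0.0])     # zero grid spacing in the halo

# 0/0 in the halo emits "RuntimeWarning: invalid value encountered in divide"
# and yields NaN there, while interior values stay finite.
del6_v = sina_u * dy / dxc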
Co-authored-by: Oliver Elbert <[email protected]>
@FlorianDeconinck - in the orchestration.py source file, the highlighted comments are outdated with the new logic. Please update them.
Looks good. Please see the non-review comment as well as the comments here.
All issues raised are logged.
Awesome, thanks
Purpose
The original work to be able to compile the DaCe orchestrated backend strategy was:
Unfortunately this is unreadable and demands a compile at a 3x3 layout before doing any other runs.
The new strategy cleans up the code and actually generates the correct caches with any layout. E.g.
The same system should be deployed for gt backends, but it is more complex due to the atomic nature of compiling, and is therefore not part of this work. This PR synchronizes the NASA & NOAA forks.
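As a sketch of the cache-sharing idea behind the new strategy (the directory naming and classification below are hypothetical, not the actual pace implementation):

def dace_cache_dir(rank: int, ranks_per_tile: int) -> str:
    # Ranks with the same position class on a tile share one compiled cache:
    # a single representative compiles, and the rest reuse its artifacts.
    side = int(ranks_per_tile ** 0.5)  # assumes a square layout per tile
    x = rank % side
    y = (rank % ranks_per_tile) // side
    horizontal = "left" if x == 0 else ("right" if x == side - 1 else "mid")
    vertical = "bottom" if y == 0 else ("top" if y == side - 1 else "mid")
    return f".gt_cache_{vertical}_{horizontal}"  # e.g. ".gt_cache_bottom_left"

With any square layout this yields at most nine distinct classes per tile, matching the nine caches a 3x3 compile produces under the old strategy.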
Code changes:
Checklist
Before submitting this PR, please make sure:
For pace-util, HISTORY has been updated