[NASA:Update] Distributed dace cache (rework) #16
Conversation
We are having a problem with determinism of temporaries. The utest fails sometimes, which sounds bad, but the regression tests are passing fine.
This is now solved.
Reviewed with Oliver
Overall a good PR; minor points about some of the structure of the config/cache checking, and the heat_source and diss_est issue keeps coming back up.
- backend = quantity_factory.empty(
+ backend = quantity_factory.zeros(
Why does it matter whether we use empty or zeros here? It seems like setting the memory to 0 is an unnecessary step. Also, is there a better way to get at the backend than through a Quantity?
This is to fix the "deterministic" test in utest. empty grabs any memory. The test fails because halos have the wrong values, and those get passed down. Arguably I did a blanket replacement of all the empty/zeros calls because none of them are executed at runtime, so the small extra cost of zeroing them out does not matter.
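For illustration, a minimal numpy sketch of why empty breaks reproducibility when halo points are read before being written (the shapes and the interior-only update are invented for the example):

import numpy as np

def interior_only_update(field):
    # Only the interior is computed; the halo ring keeps its initial values.
    field[1:-1, 1:-1] = 1.0
    return field

a = interior_only_update(np.empty((4, 4)))  # halo holds uninitialized garbage
b = interior_only_update(np.zeros((4, 4)))  # halo is a deterministic 0.0

# Two zeros-backed runs always agree; empty-backed runs may differ in the halo.
assert np.array_equal(b, interior_only_update(np.zeros((4, 4))))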
from pace.util import CubedSpherePartitioner

def identify_code_path(
Something about this feels awkward to me, but I'm not sure making this live inside of a more fully-featured FV3CodePath class makes sense either.
We need a proper build class/system that uproots most of this and consolidates it in one place. It should include distributed compiling for non-orchestrated backends, better hashing/caching, and much more.
I went for a functional paradigm in the meantime, since I fully expect this to be reworked.
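As a rough illustration of the functional approach, a self-contained sketch of what a code-path classifier could look like (the enum, names, and layout math here are hypothetical, not the actual pace implementation):

from enum import Enum

class TileCodePath(Enum):
    # Illustrative position classes; compiled caches can be shared by
    # ranks sitting in the same class on a tile.
    CORNER = "corner"
    EDGE = "edge"
    CENTER = "center"

def identify_code_path(rank: int, ranks_per_tile: int) -> TileCodePath:
    side = int(ranks_per_tile ** 0.5)  # assumes a square side x side layout
    x = rank % side
    y = (rank % ranks_per_tile) // side
    on_x_edge = x in (0, side - 1)
    on_y_edge = y in (0, side - 1)
    if on_x_edge and on_y_edge:
        return TileCodePath.CORNER
    if on_x_edge or on_y_edge:
        return TileCodePath.EDGE
    return TileCodePath.CENTER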
Yeah, I don't have a super compelling alternative and it's not really blocking; it just felt awkward when I read it.
    attr1, attr2, err_msg=f"{attr} not equal"
)
except AssertionError:
    assert np.allclose(attr1, attr2, equal_nan=True)
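For context, the changed check reconstructed as a complete, runnable helper (a sketch: the enclosing function name and the exact numpy assertion cut off above the visible lines are assumptions):

import numpy as np

def assert_attrs_match(attr, attr1, attr2):
    # Exact comparison first; fall back to a NaN-tolerant check so NaNs
    # present in both arrays (e.g. in halo points) compare as equal.
    try:
        np.testing.assert_array_equal(attr1, attr2, err_msg=f"{attr} not equal")
    except AssertionError:
        assert np.allclose(attr1, attr2, equal_nan=True)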
Do we have NaNs in our temporaries? Is there another reason for this change?
Yes. Grid/metric terms generation produces NaNs.
I don't recall this happening in the Fortran, as its debug mode checks for everything but underflows, which are flushed to zero. Do you know what causes this in the Python code?
We have NaNs in the geometry under some circumstances at least:
/home/runner/work/pace/pace/util/pace/util/grid/geometry.py:516: RuntimeWarning: invalid value encountered in divide
del6_v = sina_u * dy / dxc
I believe those are in the halo.
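A minimal reproduction of that warning, assuming uninitialized halo points end up as zeros (the array values are invented for the example):

import numpy as np

sina_u = np.array([0.0, 1.0, 1.0, 1.0, 0.0])  # halo entries never filled
dy = np.ones(5)
dxc = np.array([0.0, 1.0, 1.0, 1.0, 0.0])     # zero grid spacing in the halo

# 0/0 in the halo emits "RuntimeWarning: invalid value encountered in divide"
# and yields NaN there, while interior values stay finite.
del6_v = sina_u * dy / dxc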
Co-authored-by: Oliver Elbert <[email protected]>
@FlorianDeconinck - in the orchestration.py source file, the highlighted comments are outdated with the new logic. Please update them.
Looks good. Please see the non-review comment as well as the comments here.
All issues raised are logged.
Awesome, thanks
Purpose
The original work to be able to compile the DaCe orchestrated backend strategy was:
Unfortunately this is unreadable and demands a compile at a 3x3 layout before doing any other runs.
The new strategy cleans up the code and actually generates the correct caches with any layout. E.g.
The same system should be deployed for gt backends, but it is more complex due to the atomic nature of compiling, and is therefore not part of this work. This PR synchronizes the NASA & NOAA forks.
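As a sketch of the cache-sharing idea behind the new strategy (the directory naming and classification below are hypothetical, not the actual pace implementation):

def dace_cache_dir(rank: int, ranks_per_tile: int) -> str:
    # Ranks with the same position class on a tile share one compiled cache:
    # a single representative compiles, and the rest reuse its artifacts.
    side = int(ranks_per_tile ** 0.5)  # assumes a square layout per tile
    x = rank % side
    y = (rank % ranks_per_tile) // side
    horizontal = "left" if x == 0 else ("right" if x == side - 1 else "mid")
    vertical = "bottom" if y == 0 else ("top" if y == side - 1 else "mid")
    return f".gt_cache_{vertical}_{horizontal}"  # e.g. ".gt_cache_bottom_left"

With any square layout this yields at most nine distinct classes per tile, matching the nine caches a 3x3 compile produces under the old strategy.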
Code changes:
Checklist
Before submitting this PR, please make sure:
For pace-util, HISTORY has been updated