Benchmark with DaCe cpu and gpu backends (#50)
* Benchmark changes

* Added benchmark configurations

* Removed benchmark configs from configs to be tested in unit tests

* Changed dace submodule to point to Florian fix

* Dace submodule points to gcc_dies_on_dacecpu branch

* Set 'rf_fast' to true in baroclinic_c384_cpu/gpu.yaml

* fix mpi4py version (#51)

* [Feature] Better translate test (#39) (#47)

Translate test: small improvements

- Parametrize perturbation upon failure
- Refactor error folder to be `pwd` based
- Fix GPU translate unable to dump error `nc`
- Fix mismatched precision on translate test
- Update README.md

Test fix:
- Orchestrate YPPM for translate purposes

Misc:
- Fix bad logger formatting on DaCeProgress

* [NASA] [Feature] Guarding against unimplemented configuration (#40) (#48)

* [Feature] Guarding against unimplemented configuration (#40)

Guarding against unimplemented namelist options:
- a2b_ord4
- d_sw
- fv_dynamics
- fv_subgridz
- neg_adj3
- divergence damping
- xppm
- yppm

Misc:
- Fix `netcdf_monitor` not creating the directory
- Add `as_dict` to the dycore state to dump the dycore more easily

* Unused assert

* Update fv3core/pace/fv3core/stencils/yppm.py

Co-authored-by: Oliver Elbert <[email protected]>

* Update fv3core/pace/fv3core/stencils/xppm.py

Co-authored-by: Oliver Elbert <[email protected]>

* Change NotImplemented to ValueError for n_sponge<3

* lint

---------

Co-authored-by: Oliver Elbert <[email protected]>

* Re-try of updating dace submodule to track Florian fix branch

* Reverted gt4py submodule to match main checkout

* Added benchmark README file

* Read in ak, bk coefficients  (#36)

* initial changes to read in ak bk

* read ak/bk

* add xfail

* remove input dir

* further changes to unit tests

* finish up test

* add history

* commit uncommitted files

* fix test comment

* add input to top

* read in data

* read in netcdf file in eta mod

* remove txt file

* test

* modify test and fix generate.py

* remove emacs backup file

* driver tests pass

* fix helper.py

* fix fv3core tests

* fix physics test

* fix grid tests

* nullcommconfig

* cleanup input

* remove driver input

* remove top level input

* fix circular import problems

* modify eta_file readin for test_restart_serial

* comment out 91 test

* rm safety checks

* revert diagnostics.py

* restore driver.py

* revert initialization.py

* restore state.py

* restore analytic_init.py

* restore init_utils.py and analytic_init.py

* restore c_sw.py

* d2a2c_vect.py

* restore fv3core/stencils

* restore translate_fvdynamics

* restore physics/stencils

* restore stencils

* remove circular dependency

* use pytest parametrize

* cleanup generation.py

* fstrings

* add eta_file to MetricTerm init

* remove eta_file argument in new_from_metric_terms and geos_wrapper

* use pytest parametrize for the xfail tests

* use pytest parametrize for the xfail tests

* fix geos_wrapper and grid

* fix tests

* fstring

* add test comments

* fix util/HISTORY.md

* fix comments

* remove __init__.py from tests/main/grid

* add jupyter notebooks to generate eta files

* generate ak,bk,ptop on metricterm init

* fix tests

* exploit np.all in eta mod

* remove tests/main/grid/input

* update ci

* test

* remove input

* edit ci yaml

* remove push

---------

Co-authored-by: mlee03 <[email protected]>

* Move Active Physics Schemes to Config (#44)

* initial commit, need to adapt and run tests

* revising scheme name

* tests pass

* update history

* linting

* changing typehints for physics schemes to enum instead of str

* driver now works with physics config enum, tests pass

* fixed tests

* missed one

* D-grid data input method (#42)

* Testing changes reflected across branches

* Undoing changes made in build_gaea_c5.sh

* Testing vscode functionality, by adding a change to external_grid branch

* Testing vscode functionality, by adding a change to external_grid branch

* Addition of from_generated method and calc_flag to util/pace/util/grid/generation.py

* Added get_grid method for external grid data to driver/pace/driver/grid.py

* Preliminary xarray netcdf read in method added to driver/pace/driver/grid.py

* Updating util/pace/util/grid/generation.py from_generated method

* Addition of external grid data read in methods for initialization of grid. Current method uses xarray to interact with netcdf tile files. Values for longitude, latitude, number of points in x and y, grid edge distances read in.

* driver/examples/configs/test_external_C12_1x1.yaml

* Preliminary unit test for external grid data read in

* Current state of unit tests as of 27 Nov 2023

* External grid method and unit tests added

* Re-excluding external grid data yamls from test_example_configs.py

* Update driver/pace/driver/grid.py

Co-authored-by: Florian Deconinck <[email protected]>

* Changed name of grid initializer function to match NetCDF dependency and class descriptor

* Update util/pace/util/grid/generation.py

Moved position of doc string for "from_external" MetricTerms class method

Co-authored-by: Oliver Elbert <[email protected]>

* Fixed indentation error in generation.py from suggestion in PR 42

* Removal of TODO comment in grid.py, changes to method of file accessing in test_analytic_init, test_external_grid_*

* Changed grid data read-in unit tests to compare data directly from file to driver grid data generated from yaml

* Change to reading in lon and lat, other metric terms calculated as needed

* Removed read-in of dx, dy, and area. Changed unit tests to compare calculated area to 'ideal' surface area as given by selected constants type.

* Update tests/mpi_54rank/test_external_grid_1x1.py

Incorrect name of test in test_external_grid_1x1.py changed to match file name

Co-authored-by: Oliver Elbert <[email protected]>

* Added comparisons for read-in vs generated by driver lon, lat, dx, dy, and area data to unit tests

* Added relative error calculations to unit tests for external grid data read-in

* External grid data read in tests changed: relative errors printed by each rank and get_tile_number replacing get_tile_index

* Removing commented out sections in test_external_grid_2x2.py

Co-authored-by: Oliver Elbert <[email protected]>

* Updated external grid data read-in to take configuration and input data locations from command line, updated test description, and added documentation on grid construction to external grid data configuration selection dataclass.

* Updated documentation in grid.py

* Updated external grid data read in unit test to use parametrize functionality of pytest

* Amended files to reference changes to PR 36

---------

Co-authored-by: Frank Malatino <[email protected]>
Co-authored-by: Florian Deconinck <[email protected]>
Co-authored-by: Oliver Elbert <[email protected]>

---------

Co-authored-by: Oliver Elbert <[email protected]>
Co-authored-by: Florian Deconinck <[email protected]>
Co-authored-by: Oliver Elbert <[email protected]>
Co-authored-by: MiKyung Lee <[email protected]>
Co-authored-by: mlee03 <[email protected]>
Co-authored-by: Frank Malatino <[email protected]>
7 people authored Jan 30, 2024
1 parent 77ff59c commit cd1bd06
Showing 68 changed files with 1,512 additions and 968 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/main_unit_tests.yml
@@ -22,6 +22,10 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install -r requirements_dev.txt
- name: Clone datafiles
run: |
mkdir -p tests/main/input && cd tests/main/input
git clone -b store_files https://github.com/mlee03/pace.git tmp && mv tmp/*.nc . && rm -rf tmp
- name: Run all main tests
run: |
pytest -x tests/main
6 changes: 4 additions & 2 deletions .gitmodules
@@ -1,6 +1,8 @@
[submodule "external/gt4py"]
path = external/gt4py
url = https://github.com/gridtools/gt4py.git
-[submodule "external/dace"]
+
+[submodule "dacefix"]
path = external/dace
-url = https://github.com/spcl/dace.git
+url = https://github.com/FlorianDeconinck/dace.git
+branch = fix/gcc_dies_on_dacecpu
2 changes: 2 additions & 0 deletions CONTRIBUTORS.md
@@ -11,7 +11,9 @@ List format (alphabetical order): Surname, Name. Employer/Affiliation
* Fuhrer, Oliver. Allen Institute for AI.
* George, Rhea. Allen Institute for AI.
* Harris, Lucas. GFDL.
* Lee, Mi Kyung. GFDL.
* Kung, Chris. NASA.
* Malatino, Frank. GFDL
* McGibbon, Jeremy. Allen Institute for AI.
* Niedermayr, Yannick. ETH.
* Savarin, Ajda. University of Washington.
96 changes: 96 additions & 0 deletions benchmark_README.md
@@ -0,0 +1,96 @@
# Benchmarking README

The tests contained in this archive are for benchmarking purposes only. Any
distribution beyond the personnel performing the tests needs explicit approval
from NOAA/GFDL (Seth Underwood or Rusty Benson).

## Cloning benchmark repository and generating conda environment

Pace requires GCC > 9.2, MPI, and Python 3.8 on your system, and CUDA is required to run with a GPU backend. You will also need the headers of the boost libraries in your `$PATH` (boost itself does not need to be installed).

```shell
cd BOOST/ROOT
wget https://boostorg.jfrog.io/artifactory/main/release/1.79.0/source/boost_1_79_0.tar.gz
tar -xzf boost_1_79_0.tar.gz
mkdir -p boost_1_79_0/include
mv boost_1_79_0/boost boost_1_79_0/include/
export BOOST_ROOT=BOOST/ROOT/boost_1_79_0
```

To clone the benchmark branch use the command:

```shell
git clone --recursive -b benchmark [email protected]:NOAA-GFDL/pace.git
```

or if you have already cloned the repository:

```shell
git submodule update --init --recursive
```

After cloning, change into the directory containing the clone. To generate the conda environment use the following commands:

```shell
conda create -y --name <desired_name> python=3.8
conda activate <desired_name>
pip3 install --upgrade pip setuptools wheel
pip3 install -r requirements_dev.txt -c constraints.txt
```

## Benchmarking configurations

There are four configurations of the PACE application contained within the branch to be used for benchmarking:

```shell
driver/examples/configs/baroclinic_c384_cpu.yaml
driver/examples/configs/baroclinic_c384_gpu.yaml
driver/examples/configs/baroclinic_c3072_cpu.yaml
driver/examples/configs/baroclinic_c3072_gpu.yaml
```

## Building

To build with the DaCe backends, set the following environment variables:

```shell
FV3_DACEMODE=Build
PACE_FLOAT_PRECISION=64
PACE_LOGLEVEL=INFO
PYTHONOPTIMIZE=1
OMP_NUM_THREADS=1
```

Set the duration in the configuration being built so that the build covers a single timestep (the duration equals `dt_atmos`). For example:

```shell
dt_atmos: 450
seconds: 450
```

## Running

To run with the DaCe backends, set the following environment variables:

```shell
FV3_DACEMODE=Run
PACE_FLOAT_PRECISION=64
PACE_LOGLEVEL=INFO
PYTHONOPTIMIZE=1
OMP_NUM_THREADS=1
```
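
As a rough illustration of the two modes, the following Python sketch assembles the environment for either a build or a run. The helper name `dace_env` is hypothetical (it is not part of Pace), and the sketch uses the standard OpenMP spelling `OMP_NUM_THREADS` for the thread-count variable; the values are the ones listed in this README.

```python
import os

# Variables shared by the build and run steps, as listed in this README.
# Assumption: OMP_NUM_THREADS is the standard OpenMP thread-count variable.
COMMON_ENV = {
    "PACE_FLOAT_PRECISION": "64",
    "PACE_LOGLEVEL": "INFO",
    "PYTHONOPTIMIZE": "1",
    "OMP_NUM_THREADS": "1",
}


def dace_env(mode, base=None):
    """Return a copy of `base` (default: os.environ) configured for a DaCe "Build" or "Run"."""
    if mode not in ("Build", "Run"):
        raise ValueError(f"mode must be 'Build' or 'Run', got {mode!r}")
    env = dict(os.environ if base is None else base)
    env.update(COMMON_ENV)
    env["FV3_DACEMODE"] = mode  # the only variable that differs between the two steps
    return env


if __name__ == "__main__":
    for mode in ("Build", "Run"):
        print(mode, dace_env(mode, base={}))
```

The returned dictionary could be passed as the `env=` argument of `subprocess.run` when launching the `mpirun` command, so that the build and run steps differ only in `FV3_DACEMODE`.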

Set the duration in the configuration to the desired run length, for example:

```shell
dt_atmos: 450
days: 9
```

The time for the build or run can be set with units of seconds, minutes, hours, or days.
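
To see how a configured duration relates to the number of timesteps, here is a small Python sketch. It assumes the driver simply advances `dt_atmos` seconds per step until the configured duration is reached; `n_timesteps` is a hypothetical helper for illustration, not a Pace API.

```python
from datetime import timedelta


def n_timesteps(dt_atmos, **duration):
    """Number of steps of `dt_atmos` seconds in a duration given as days/hours/minutes/seconds."""
    total_seconds = timedelta(**duration).total_seconds()
    steps, remainder = divmod(total_seconds, dt_atmos)
    if remainder:
        raise ValueError("duration is not a whole number of timesteps")
    return int(steps)


print(n_timesteps(450, seconds=450))  # build example: 1 timestep
print(n_timesteps(450, days=9))       # run example: 1728 timesteps
```

With `dt_atmos: 450`, the build example (`seconds: 450`) is a single step, while the 9-day run is 9 × 86400 / 450 = 1728 steps.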

An example command to start the build or run process with MPI using the DaCe CPU backend for the c384 configuration:

```shell
mpirun -n 1536 python3 -m pace.driver.run driver/examples/configs/baroclinic_c384_cpu.yaml
```

The build or run requires 1536 ranks: the layout is 16x16 ranks per tile and there are 6 tiles (16 × 16 × 6 = 1536).
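
The rank count follows directly from the layout. A minimal Python check, assuming the six tiles of the cubed-sphere grid and a hypothetical helper `required_ranks`:

```python
# Assumption: the cubed-sphere grid always has 6 tiles, and `layout`
# gives the number of ranks along each horizontal direction of one tile.
N_TILES = 6


def required_ranks(layout):
    """Total MPI ranks needed for a given (x, y) per-tile layout."""
    ranks_x, ranks_y = layout
    return N_TILES * ranks_x * ranks_y


print(required_ranks((16, 16)))  # 1536, the value passed to `mpirun -n`
print(required_ranks((2, 2)))    # 24
```

The value passed to `mpirun -n` has to match this product for whichever `layout` the chosen configuration declares.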
4 changes: 2 additions & 2 deletions constraints.txt
@@ -81,7 +81,7 @@ coverage==5.5
# pytest-cov
cytoolz==0.12.1
# via gt4py
-dace==0.14.4
+dace==0.15.1
# via
# -r requirements_dev.txt
# pace-dsl
@@ -184,7 +184,7 @@ googleapis-common-protos==1.53.0
# via google-api-core
gprof2dot==2021.2.21
# via pytest-profiling
-gridtools-cpp==2.3.0
+gridtools-cpp==2.3.1
# via gt4py
h5netcdf==0.11.0
# via -r util/requirements.txt
2 changes: 1 addition & 1 deletion docs/physics/state.rst
@@ -38,6 +38,6 @@ You can initialize a zero-filled PhysicsState and MicrophysicsState from other P

>>> quantity_factory = QuantityFactory.from_backend(sizer=sizer, backend="numpy")
>>> physics_state = PhysicsState.init_zeros(
-... quantity_factory=quantity_factory, active_packages=["microphysics"]
+... quantity_factory=quantity_factory, schemes=["GFS_microphysics"]
... )
>>> microphysics_state = physics_state.microphysics
5 changes: 5 additions & 0 deletions driver/examples/configs/baroclinic_c12.yaml
@@ -94,3 +94,8 @@ physics_config:
hydrostatic: false
nwat: 6
do_qa: true

grid_config:
type: generated
config:
eta_file: 'tests/main/input/eta79.nc'
98 changes: 98 additions & 0 deletions driver/examples/configs/baroclinic_c12_explicit_physics.yaml
@@ -0,0 +1,98 @@
stencil_config:
compilation_config:
backend: numpy
rebuild: false
validate_args: true
format_source: false
device_sync: false
initialization:
type: analytic
config:
case: baroclinic
performance_config:
collect_performance: true
experiment_name: c12_baroclinic
nx_tile: 12
nz: 79
dt_atmos: 225
minutes: 15
layout:
- 1
- 1
diagnostics_config:
path: output
output_format: netcdf
names:
- u
- v
- ua
- va
- pt
- delp
- qvapor
- qliquid
- qice
- qrain
- qsnow
- qgraupel
z_select:
- level: 65
names:
- pt
dycore_config:
a_imp: 1.0
beta: 0.
consv_te: 0.
d2_bg: 0.
d2_bg_k1: 0.2
d2_bg_k2: 0.1
d4_bg: 0.15
d_con: 1.0
d_ext: 0.0
dddmp: 0.5
delt_max: 0.002
do_sat_adj: true
do_vort_damp: true
fill: true
hord_dp: 6
hord_mt: 6
hord_tm: 6
hord_tr: 8
hord_vt: 6
hydrostatic: false
k_split: 1
ke_bg: 0.
kord_mt: 9
kord_tm: -9
kord_tr: 9
kord_wz: 9
n_split: 1
nord: 3
nwat: 6
p_fac: 0.05
rf_cutoff: 3000.
rf_fast: true
tau: 10.
vtdm4: 0.06
z_tracer: true
do_qa: true
tau_i2s: 1000.
tau_g2v: 1200.
ql_gen: 0.001
ql_mlt: 0.002
qs_mlt: 0.000001
qi_lim: 1.0
dw_ocean: 0.1
dw_land: 0.15
icloud_f: 0
tau_l2v: 300.
tau_v2l: 90.
fv_sg_adj: 0
n_sponge: 48

physics_config:
hydrostatic: false
nwat: 6
do_qa: true
schemes:
- GFS_microphysics
26 changes: 23 additions & 3 deletions driver/examples/configs/baroclinic_c12_orch_cpu.yaml
@@ -14,10 +14,30 @@ performance_config:
nx_tile: 12
nz: 79
dt_atmos: 225
-minutes: 5
+seconds: 675
layout:
-- 1
-- 1
+- 2
+- 2
+diagnostics_config:
+path: output
+output_format: netcdf
+names:
+- u
+- v
+- ua
+- va
+- pt
+- delp
+- qvapor
+- qliquid
+- qice
+- qrain
+- qsnow
+- qgraupel
+z_select:
+- level: 65
+names:
+- pt
dycore_config:
a_imp: 1.0
beta: 0.
5 changes: 5 additions & 0 deletions driver/examples/configs/baroclinic_c12_write_restart.yaml
@@ -92,3 +92,8 @@ physics_config:
hydrostatic: false
nwat: 6
do_qa: true

grid_config:
type: generated
config:
eta_file: "tests/main/input/eta79.nc"
9 changes: 4 additions & 5 deletions driver/examples/configs/baroclinic_c384_cpu.yaml
@@ -15,11 +15,10 @@ performance_config:
nx_tile: 384
nz: 79
dt_atmos: 450
-minutes: 7
-seconds: 30
+days: 9
layout:
-- 1
-- 1
+- 16
+- 16
diagnostics_config:
path: output
output_format: netcdf
@@ -72,7 +71,7 @@ dycore_config:
nwat: 6
p_fac: 0.1
rf_cutoff: 800.
-rf_fast: false
+rf_fast: true
tau: 5.
vtdm4: 0.06
z_tracer: true
2 changes: 1 addition & 1 deletion driver/examples/configs/baroclinic_c384_gpu.yaml
@@ -71,7 +71,7 @@ dycore_config:
nwat: 6
p_fac: 0.1
rf_cutoff: 800.
-rf_fast: false
+rf_fast: true
tau: 5.
vtdm4: 0.06
z_tracer: true