Add test script to process nwm forcing into ngen input #26

jameshalgren · 2022-12-19T23:01:51Z

Working on this with respect to #24.

This is ready for at least initial review. The main script, process_nwm_forcing_to_ngen.py has four functions which all produce the same thing -- a dataframe with the right information to provide forcing to the ngen CFE model in the form of precipitation inputs for each basin in a geopackage file with basin polygons from the ngen hydrofabric. In the test script, there is a primitive method included that creates a dictionary using an extremely slow technique, but one which is very clear to follow. (The test script, by the way, is not a canonical test script -- just a main function that runs a few example uses.)

Work remaining might include the following:

Include the basin ID in the output dataframe. (Actually, this is critical...)
Label the output with times (it's simply an hourly output now, with no explicit timestamping)
Add more flexibility regarding inputs, CLI interaction, etc.
Incorporate csv/.nc file output as an option
Package and distribute (with whatever re-writing that requires.)
Fix the parallel bug (orphaned objects left behind by parallel execution.)
Work with HDF5 group to make reading netcdf files a truly concurrent activity (see here - suggestion to re-compile with --enable-parallel flag or here - possibly change to access pattern)
Walk through the math and make sure all (or at least one) of the methods produce the expected output
Figure out how to have a more nuanced zonal statistic where the windowing can be partial (or confirm that it doesn't matter.)

Three levels of review are helpful:
A. Is a utility to convert NWM output to NGEN basins useful?
B. Who else should be tagged?
C. Deep review (code, methods, etc.)

How to test:

Download a set of test data files to use with test_process_nwm_forcing_to_ngen.py:

wget -P data -c https://storage.googleapis.com/national-water-model/nwm.20220824/forcing_medium_range/nwm.t12z.medium_range.forcing.f001.conus.nc
wget -P data -c https://storage.googleapis.com/national-water-model/nwm.20220824/forcing_medium_range/nwm.t12z.medium_range.forcing.f002.conus.nc
wget -P data -c https://storage.googleapis.com/national-water-model/nwm.20220824/forcing_medium_range/nwm.t12z.medium_range.forcing.f003.conus.nc
wget -P 03w -c https://nextgen-hydrofabric.s3.amazonaws.com/v1.2/nextgen_03W.gpkg

~~Update~~ See below...
Times are as follows for tests with 3 files and 200 features, then 30 files and 2000 features, respectively:

(pyenv3_jupyter) > python test_process_nwm_forcing_to_ngen.py
Working on the new way
2.9967598915100098
Working on the new way with threading parallel.
2.714801073074341
Working on the new way with process parallel.
10.04658818244934
Working on the new way with loops reversed.
2.522495985031128
Working on the new way with loops reversed with threading parallel.
2.614391803741455
Working on the new way with loops reversed with process parallel.
10.183135032653809

(pyenv3_jupyter) > python test_process_nwm_forcing_to_ngen.py
Working on the new way
201.32517886161804
Working on the new way with threading parallel.
212.48912286758423
Working on the new way with process parallel.
465.59838366508484
Working on the new way with loops reversed.
189.13055515289307
Working on the new way with loops reversed with threading parallel.
213.426913022995
Working on the new way with loops reversed with process parallel.
458.9345338344574

There is a warning returned. Perhaps the processes are closing before the files close...

python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Update
With the latest commits, the fastest method is the "new" way with reversed loops, especially with a large number of files/timestamps. To produce a 30 hour input for 2000 basins takes between 220 and 250 seconds:

Working on the new way
249.70958185195923
Working on the new way with loops reversed.
217.39878916740417

200 basins with 240 files:

Working on the new way
6935.022551059723
Working on the new way with loops reversed.
174.43949794769287

A 10-day hourly input of rainfall (240 hours == one medium_range forecast) for 2000 basins in the Southeast W hydrofabric region took 30 minutes:

Working on the new way with loops reversed.
1870.9580311775208

All execution times reported here from tests performed on a mac M1 Max with 64 Gb RAM.

jameshalgren · 2022-12-20T18:12:37Z

@groutr
Sources for methods, etc. in scripts:
https://rasterio.readthedocs.io/en/latest/api/rasterio.features.html
https://stackoverflow.com/questions/34906652/does-hdf5-support-concurrent-reads-or-writes-to-different-files
https://github.com/h5py/h5py/blob/master/examples/multiprocessing_example.py

groutr · 2022-12-22T18:22:32Z

ngen_forcing/TestConvert_NWMForcing_to_Ngen.py

+        ):
+
+    reng = "rasterio"
+    _xds = xr.open_dataset(folder_prefix.joinpath(file_list[0]), engine=reng)


maybe open the file handles up first, then just use first filehandle to extract _xds. Then everything gets closed at the end.

Maybe. We need the specific raster engine to get the right behavior from the [email protected]() line below, which we don't have in the later call...
See

_xds = xr.open_dataset(folder_prefix.joinpath(file_list[0]), engine=reng)

vs.

filehandles = [xr.open_dataset("data/" + f) for f in file_list]

groutr · 2022-12-22T18:23:40Z

ngen_forcing/TestConvert_NWMForcing_to_Ngen.py

+        for f in filehandles:
+            cropped = xr_read_window(f, winslices, mask=mask)
+            stats.append(cropped.mean())
+    [f.close() for f in filehandles]  # Returns None for each file


I did this for convenience. You could make this a for-loop for clarity

for f in filehandles: f.close()

I think the list comprehension is plenty readable... and more compact. Good enough?

ngen_forcing/TestConvert_NWMForcing_to_Ngen.py

jameshalgren · 2023-01-03T15:39:58Z

Moved todo to main comment.

jameshalgren · 2023-01-17T21:58:25Z

@mgdenno @arpita0911patel -- Let's discuss

jameshalgren · 2023-01-19T17:13:27Z

ping @karnesh

jameshalgren · 2023-01-27T19:36:22Z

ping @hariteja-jajula @RohanSunkarapalli

groutr reviewed Dec 22, 2022

View reviewed changes

ngen_forcing/TestConvert_NWMForcing_to_Ngen.py Outdated Show resolved Hide resolved

jameshalgren force-pushed the ngen_forcing branch 2 times, most recently from 983aa5b to fc67d18 Compare December 22, 2022 18:50

jameshalgren force-pushed the ngen_forcing branch from 1245428 to e8caaf9 Compare January 2, 2023 18:26

jameshalgren marked this pull request as ready for review January 3, 2023 15:33

jameshalgren requested review from hellkite500, shahab122 and ZacharyWills January 3, 2023 15:34

jameshalgren linked an issue Jan 4, 2023 that may be closed by this pull request

Create script to convert NWM forcing inputs into ngen basins and format #24

Open

jameshalgren force-pushed the ngen_forcing branch from 1ab1532 to 3dc7705 Compare January 27, 2023 20:17

jameshalgren mentioned this pull request Feb 8, 2023

Do not merge forcing240hr #35

Draft

jameshalgren mentioned this pull request Feb 22, 2023

Create script to convert NWM forcing inputs into ngen basins and format #24

Open

James Halgren added 9 commits March 8, 2023 12:29

Add test script to process nwm forcing into ngen input

203df69

rewrite to include file-loop

7d1b50d

include gpkg download in example

3cf0421

add optimized loop and port to functions

a7509f5

add parallel options

3d4bf39

explain commented test block

71c8370

better file names for process script

1b8efe8

turn on slow test in committed file

47a2f8f

hold file handles for slow method

2dc42b5

jameshalgren force-pushed the ngen_forcing branch from eb93ee1 to 2dc42b5 Compare March 8, 2023 18:30

fix duplicate variable names in test

889df5d

James Halgren added 3 commits March 27, 2023 16:28

update single-thread implementations

b6a49f7

update parallel implementations

0fd7d8b

add parallel options to inverted parallel function

476ee1f

jameshalgren mentioned this pull request Apr 27, 2023

Hydrofab2ngen tools #46

Open

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add test script to process nwm forcing into ngen input #26

Add test script to process nwm forcing into ngen input #26

jameshalgren commented Dec 19, 2022 •

edited

Loading

jameshalgren commented Dec 20, 2022 •

edited

Loading

groutr Dec 22, 2022 •

edited

Loading

jameshalgren Jan 2, 2023 •

edited

Loading

groutr Dec 22, 2022

jameshalgren Jan 2, 2023 •

edited

Loading

jameshalgren commented Jan 3, 2023 •

edited

Loading

jameshalgren commented Jan 17, 2023

jameshalgren commented Jan 19, 2023

jameshalgren commented Jan 27, 2023 •

edited

Loading

Add test script to process nwm forcing into ngen input #26

Are you sure you want to change the base?

Add test script to process nwm forcing into ngen input #26

Conversation

jameshalgren commented Dec 19, 2022 • edited Loading

How to test:

jameshalgren commented Dec 20, 2022 • edited Loading

groutr Dec 22, 2022 • edited Loading

Choose a reason for hiding this comment

jameshalgren Jan 2, 2023 • edited Loading

Choose a reason for hiding this comment

groutr Dec 22, 2022

Choose a reason for hiding this comment

jameshalgren Jan 2, 2023 • edited Loading

Choose a reason for hiding this comment

jameshalgren commented Jan 3, 2023 • edited Loading

jameshalgren commented Jan 17, 2023

jameshalgren commented Jan 19, 2023

jameshalgren commented Jan 27, 2023 • edited Loading

jameshalgren commented Dec 19, 2022 •

edited

Loading

jameshalgren commented Dec 20, 2022 •

edited

Loading

groutr Dec 22, 2022 •

edited

Loading

jameshalgren Jan 2, 2023 •

edited

Loading

jameshalgren Jan 2, 2023 •

edited

Loading

jameshalgren commented Jan 3, 2023 •

edited

Loading

jameshalgren commented Jan 27, 2023 •

edited

Loading