Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add test script to process nwm forcing into ngen input #26

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

jameshalgren
Copy link
Contributor

@jameshalgren jameshalgren commented Dec 19, 2022

Working on this with respect to #24.

This is ready for at least initial review. The main script, process_nwm_forcing_to_ngen.py has four functions which all produce the same thing -- a dataframe with the right information to provide forcing to the ngen CFE model in the form of precipitation inputs for each basin in a geopackage file with basin polygons from the ngen hydrofabric. In the test script, there is a primitive method included that creates a dictionary using an extremely slow technique, but one which is very clear to follow. (The test script, by the way, is not a canonical test script -- just a main function that runs a few example uses.)

Work remaining might include the following:

  • Include the basin ID in the output dataframe. (Actually, this is critical...)
  • Label the output with times (it's simply an hourly output now, with no explicit timestamping)
  • Add more flexibility regarding inputs, CLI interaction, etc.
  • Incorporate csv/.nc file output as an option
  • Package and distribute (with whatever re-writing that requires.)
  • Fix the parallel bug (orphaned objects left behind by parallel execution.)
  • Work with HDF5 group to make reading netcdf files a truly concurrent activity (see here - suggestion to re-compile with --enable-parallel flag or here - possibly change to access pattern)
  • Walk through the math and make sure all (or at least one) of the methods produce the expected output
  • Figure out how to have a more nuanced zonal statistic where the windowing can be partial (or confirm that it doesn't matter.)

Three levels of review are helpful:
A. Is a utility to convert NWM output to NGEN basins useful?
B. Who else should be tagged?
C. Deep review (code, methods, etc.)

How to test:

Download a set of test data files to use with test_process_nwm_forcing_to_ngen.py:

wget -P data -c https://storage.googleapis.com/national-water-model/nwm.20220824/forcing_medium_range/nwm.t12z.medium_range.forcing.f001.conus.nc
wget -P data -c https://storage.googleapis.com/national-water-model/nwm.20220824/forcing_medium_range/nwm.t12z.medium_range.forcing.f002.conus.nc
wget -P data -c https://storage.googleapis.com/national-water-model/nwm.20220824/forcing_medium_range/nwm.t12z.medium_range.forcing.f003.conus.nc
wget -P 03w -c https://nextgen-hydrofabric.s3.amazonaws.com/v1.2/nextgen_03W.gpkg

Update See below...
Times are as follows for tests with 3 files and 200 features, then 30 files and 2000 features, respectively:

(pyenv3_jupyter) > python test_process_nwm_forcing_to_ngen.py
Working on the new way
2.9967598915100098
Working on the new way with threading parallel.
2.714801073074341
Working on the new way with process parallel.
10.04658818244934
Working on the new way with loops reversed.
2.522495985031128
Working on the new way with loops reversed with threading parallel.
2.614391803741455
Working on the new way with loops reversed with process parallel.
10.183135032653809
(pyenv3_jupyter) > python test_process_nwm_forcing_to_ngen.py
Working on the new way
201.32517886161804
Working on the new way with threading parallel.
212.48912286758423
Working on the new way with process parallel.
465.59838366508484
Working on the new way with loops reversed.
189.13055515289307
Working on the new way with loops reversed with threading parallel.
213.426913022995
Working on the new way with loops reversed with process parallel.
458.9345338344574

There is a warning returned. Perhaps the processes are closing before the files close...

python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Update
With the latest commits, the fastest method is the "new" way with reversed loops, especially with a large number of files/timestamps. To produce a 30 hour input for 2000 basins takes between 220 and 250 seconds:

Working on the new way
249.70958185195923
Working on the new way with loops reversed.
217.39878916740417

200 basins with 240 files:

Working on the new way
6935.022551059723
Working on the new way with loops reversed.
174.43949794769287

A 10-day hourly input of rainfall (240 hours == one medium_range forecast) for 2000 basins in the Southeast W hydrofabric region took 30 minutes:

Working on the new way with loops reversed.
1870.9580311775208

All execution times reported here from tests performed on a mac M1 Max with 64 Gb RAM.

):

reng = "rasterio"
_xds = xr.open_dataset(folder_prefix.joinpath(file_list[0]), engine=reng)
Copy link
Contributor

@groutr groutr Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe open the file handles up first, then just use first filehandle to extract _xds. Then everything gets closed at the end.

Copy link
Contributor Author

@jameshalgren jameshalgren Jan 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. We need the specific raster engine to get the right behavior from the [email protected]() line below, which we don't have in the later call...
See

_xds = xr.open_dataset(folder_prefix.joinpath(file_list[0]), engine=reng)

vs.

filehandles = [xr.open_dataset("data/" + f) for f in file_list]

for f in filehandles:
cropped = xr_read_window(f, winslices, mask=mask)
stats.append(cropped.mean())
[f.close() for f in filehandles] # Returns None for each file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this for convenience. You could make this a for-loop for clarity

for f in filehandles:
    f.close()

Copy link
Contributor Author

@jameshalgren jameshalgren Jan 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the list comprehension is plenty readable... and more compact. Good enough?

@jameshalgren jameshalgren force-pushed the ngen_forcing branch 2 times, most recently from 983aa5b to fc67d18 Compare December 22, 2022 18:50
@jameshalgren jameshalgren marked this pull request as ready for review January 3, 2023 15:33
@jameshalgren
Copy link
Contributor Author

jameshalgren commented Jan 3, 2023

Moved todo to main comment.

@jameshalgren
Copy link
Contributor Author

@mgdenno @arpita0911patel -- Let's discuss

@jameshalgren
Copy link
Contributor Author

ping @karnesh

@jameshalgren
Copy link
Contributor Author

jameshalgren commented Jan 27, 2023

@jameshalgren jameshalgren mentioned this pull request Apr 27, 2023
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create script to convert NWM forcing inputs into ngen basins and format
2 participants