
Create script to convert NWM forcing inputs into ngen basins and format #24

Open
jameshalgren opened this issue Dec 15, 2022 · 7 comments · May be fixed by #26
jameshalgren commented Dec 15, 2022

TL;DR

How do we convert the current NWM forcing data into inputs for the ngen catchments?

jameshalgren commented Dec 15, 2022

Details

We are working on a pipeline to develop forcing inputs for the community NextGen framework.
The code snippets below should take about 5 minutes to execute, if you want to try them out.

The hydrofabric files

Please see this for a geopackage-to-ngen input converter. With that script, the .gpkg files in the AWS catalog of current NWM hydrofabric artifacts (see also the hydrofabric github.io page) can be turned into the catchment_data.geojson and nexus_data.geojson input files for a NextGen execution. The only other main input file is the realization.json, which still needs to be developed (along with the forcing information, of course).

Forcings

And... those forcing files are what this issue is about!

Here's what we have so far:

Get some test data

We can download the forcing data pre-processed for the current NWM from the GCP bucket.

e.g., wget https://storage.googleapis.com/national-water-model/nwm.20220901/forcing_medium_range/nwm.t00z.medium_range.forcing.f001.conus.nc

(Note: there are cooler ways to do this, which we are exploring in other experiments, but for now it is easiest to just download a few files and work with those.)
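If you want more than a single forecast hour, a quick loop over the fNNN filenames works too. This is just a sketch; the URL pattern follows the wget example above, and the number of hours is arbitrary:

# Sketch: fetch the first few medium-range forcing hours from the GCP bucket.
# URL pattern follows the wget example above; adjust date/cycle/hours as needed.
import urllib.request

base = "https://storage.googleapis.com/national-water-model/nwm.20220901/forcing_medium_range"
for fhr in range(1, 4):  # f001..f003
    fname = f"nwm.t00z.medium_range.forcing.f{fhr:03d}.conus.nc"
    urllib.request.urlretrieve(f"{base}/{fname}", fname)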

Also get the .gpkg file (for Alabama, of course!!)
wget https://nextgen-hydrofabric.s3.amazonaws.com/v1.2/nextgen_03W.gpkg

read in the catchment boundaries where we will need to sample the forcing

import geopandas as gpd

# load the divide (catchment) polygons from the hydrofabric geopackage
f_03 = "nextgen_03W.gpkg"
gpkg_03w_divides = gpd.read_file(f_03, layer="divides")

Method #1

We tried this two ways. The first way relies on an intermediate file for the variable. It's fast, and it helps with debugging. Note that the GeoTIFF export turns the values into scaled integers, and we didn't work out how to handle that if it's important (it would be, but we were aiming for Method #2 anyway...).

read the forcing dataset and write to geotiff

import xarray as xr
import rioxarray as rxr  # registers the .rio accessor and provides the "rasterio" engine

fc_nc = "nwm.t00z.medium_range.forcing.f001.conus.nc"

# TODO LATER: Look more into the differences between these input methods
# https://docs.xarray.dev/en/stable/generated/xarray.open_rasterio.html
# https://corteva.github.io/rioxarray/stable/examples/convert_to_raster.html
# https://corteva.github.io/rioxarray/stable/getting_started/getting_started.html

reng = "rasterio"
fc_xds = xr.open_dataset(fc_nc, engine=reng)
# NOTE: This is the same: fc_xds = rxr.open_rasterio(fc_nc)
# NOTE: This way doesn't work (the engine specification is important): fc_xds = xr.open_dataset(fc_nc)

# write the single variable out as a GeoTIFF
fc_xds["LWDOWN"].rio.to_raster("test1.tif")

read the geotiff and do zonal statistics on the catchments

import pandas as pd
import rasterio
from rasterstats import zonal_stats

with rasterio.open("test1.tif") as src:
    affine = src.transform
    array = src.read(1)
    df_zonal_stats = pd.DataFrame(zonal_stats(gpkg_03w_divides, array, affine=affine))

# adding statistics back to the original GeoDataFrame
gdf2 = pd.concat([gpkg_03w_divides, df_zonal_stats], axis=1)
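On the scaled-integer caveat above, one possible handling (a sketch, not something we tested): if the exported GeoTIFF carries per-band scale/offset metadata, rasterio exposes it, and the packed values can be un-scaled before computing the zonal statistics.

# Sketch: un-scale packed integers using the per-band scale/offset, if present
# (rasterio defaults these to 1.0 and 0.0 when the metadata is absent).
with rasterio.open("test1.tif") as src:
    scale, offset = src.scales[0], src.offsets[0]
    array = src.read(1).astype("float64") * scale + offset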

Method #2

This way, we avoid the intermediate file, which is good for streamlining but probably harder to debug.

with xr.open_dataset(fc_nc, engine=reng) as _xds:
    _src = _xds.LWDOWN
    _aff2 = _src.rio.transform()
    _arr2 = _src.values[0]  # first (only) time slice in this file
    _df_zonal_stats = pd.DataFrame(zonal_stats(gpkg_03w_divides, _arr2, affine=_aff2))

# adding statistics back to the original GeoDataFrame
gdf3 = pd.concat([gpkg_03w_divides, _df_zonal_stats], axis=1)
gdf3.to_csv("lwdown_zonal_stats.csv", index=False)  # example output filename

Success!

The output looks like this. We can use the mean value (or perhaps some other statistic... we'll need to think about that) as the value to bring over to the csv for this time-step for each catchment.

(Pdb) gdf3
               id    areasqkm      type        toid                                           geometry      min      max        mean  count
0      cat-107068   14.698357   network  nex-107069  MULTIPOLYGON (((1044735.000 841095.000, 104482...  199.125  199.333  199.233917     12
1      cat-107233    8.058603   network  nex-107234  MULTIPOLYGON (((1077765.001 1061085.003, 10778...  192.494  192.550  192.525000      5
2      cat-107683   17.203060   network  nex-107684  MULTIPOLYGON (((1038915.001 1106985.003, 10387...  193.985  194.136  194.069125     16
3      cat-108167   15.632991   network  nex-108168  MULTIPOLYGON (((1094685.002 1036725.001, 10943...  192.026  192.298  192.182143     14
4      cat-112977   17.297992   network  nex-112978  MULTIPOLYGON (((736335.000 1096815.003, 736184...  187.651  188.667  188.019111     18
...           ...         ...       ...         ...                                                ...      ...      ...         ...    ...
30867  cat-137554    0.022500   coastal  cnx-137554  MULTIPOLYGON (((648015.005 830264.997, 648074....      NaN      NaN         NaN      0
30868  cat-106885  136.690204  internal  inx-106885  MULTIPOLYGON (((986774.996 876974.997, 986835....  197.238  197.934  197.502941    135
30869  cat-137071    0.848702   coastal  cnx-137071  MULTIPOLYGON (((1152974.999 851804.998, 115300...  195.405  195.405  195.405000      1
30870  cat-137068    0.007199   coastal  cnx-137068  MULTIPOLYGON (((1153095.000 851805.003, 115300...      NaN      NaN         NaN      0
30871  cat-137748    5.238898   coastal  cnx-137748  MULTIPOLYGON (((621914.998 811425.005, 621914....  272.608  285.758  279.119333      6

[30872 rows x 9 columns]
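Pulling just the id and mean columns out of that result gives a tidy per-catchment series for one variable at one time step. A sketch; the column names follow the output above, and the filename is only an example:

# Sketch: keep the catchment id and the zonal mean for this variable/time step
lwdown_means = gdf3[["id", "mean"]].rename(columns={"mean": "LWDOWN"})
lwdown_means.to_csv("LWDOWN_f001.csv", index=False)  # example filename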

@jameshalgren

Next Steps

We need to loop through the variables and forecast times (or retrospective times) and write a giant concatenated .csv or .nc with the right information.

Or... figure out how to do streaming. But one step at a time.
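A rough sketch of that loop, reusing the zonal-stats step from Method #2 above. The variable list here is illustrative (taken from the GCP forcing variable names discussed later in this thread), and the filenames are just examples:

# Sketch: zonal means for several variables across several forecast hours,
# concatenated into one long table.
import pandas as pd
import rioxarray  # noqa: F401  (registers the .rio accessor)
import xarray as xr
from rasterstats import zonal_stats

variables = ["LWDOWN", "SWDOWN", "PSFC", "Q2D", "T2D", "U2D", "V2D", "RAINRATE"]
rows = []
for fhr in range(1, 4):  # f001..f003
    fc_nc = f"nwm.t00z.medium_range.forcing.f{fhr:03d}.conus.nc"
    with xr.open_dataset(fc_nc, engine="rasterio") as ds:
        for var in variables:
            src = ds[var]
            stats = zonal_stats(
                gpkg_03w_divides, src.values[0], affine=src.rio.transform(), stats=["mean"]
            )
            df = pd.DataFrame(stats)
            df["id"] = gpkg_03w_divides["id"].values
            df["variable"] = var
            df["forecast_hour"] = fhr
            rows.append(df)

pd.concat(rows).to_csv("forcing_zonal_means.csv", index=False)  # example filename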

@jameshalgren

Ping @ZacharyWills @hellkite500

jameshalgren commented Dec 15, 2022

This is pretty slow... so we'll want to multithread it. We can use the kerchunk tools, perhaps, to do collections of files at a time. Are there ways we can split up the calculation into clusters of inputs? (run 5 days, pause, run 5 more days, etc.?) Building the input files in clusters could make this easier to stream, grabbing the source files in manageable bunches.
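One simple way to split the work, as a sketch: file-level parallelism with the standard library (kerchunk or dask would be the fancier options). Here process_one_file is a placeholder that would wrap the Method #2 steps above and return a DataFrame for one forcing file:

# Sketch: run the per-file zonal-stats step in parallel, one worker per forcing file.
from concurrent.futures import ProcessPoolExecutor

def process_one_file(fc_nc):
    ...  # open fc_nc, compute zonal stats for each variable, return a DataFrame

files = [f"nwm.t00z.medium_range.forcing.f{fhr:03d}.conus.nc" for fhr in range(1, 6)]
with ProcessPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_one_file, files))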

@jameshalgren

We'll need to pay close attention to the transform parameter if we take this into production somehow. This notebook has a detailed discussion:
https://github.com/perrygeo/python-rasterstats/blob/master/docs/notebooks/Integrating%20with%20GeoPandas%20and%20Numpy.ipynb
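A related sanity check worth building in (a sketch, reusing gpkg_03w_divides and the test GeoTIFF from the snippets above) is making sure the divides are in the raster's CRS before computing zonal statistics, so the affine transform actually lines up:

# Sketch: reproject the divides to the raster CRS before the zonal statistics
with rasterio.open("test1.tif") as src:
    divides_proj = gpkg_03w_divides.to_crs(src.crs)
    checked_stats = zonal_stats(divides_proj, src.read(1), affine=src.transform)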

shahab122 commented Dec 19, 2022

We compared the variable names between the forcing variables in the ngen example files (https://github.com/NOAA-OWP/ngen/tree/master/data/forcing) and the forcing variables extracted with the above script. The following table shows that the ngen example forcing includes both precip_rate and APCP_surface, while the forcing extracted by our script (from the GCP bucket) includes only RAINRATE (which seems to be the same as precip_rate).

| NGEN Forcing File | NGEN Forcing File (unit) | NWM (GCP) Forcing Data | NWM (GCP) Forcing Data (Long Name) | NWM (GCP) Forcing Data (Unit) |
|---|---|---|---|---|
| time | hour | time | | hour |
| APCP_surface | | Not Available | | |
| DLWRF_surface | | LWDOWN | Surface downward long-wave radiation flux | W m-2 |
| DSWRF_surface | | SWDOWN | Surface downward short-wave radiation flux | W m-2 |
| PRES_surface | | PSFC | Surface Pressure | Pa |
| SPFH_2maboveground | | Q2D | 2-m Specific Humidity | kg kg-1 |
| TMP_2maboveground | | T2D | 2-m Air Temperature | K |
| UGRD_10maboveground | | U2D | 10-m U-component of wind | m s-1 |
| VGRD_10maboveground | | V2D | 10-m V-component of wind | m s-1 |
| precip_rate | | RAINRATE | Surface Precipitation Rate | mm s^-1 |
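For the renaming step in the script, that correspondence could be captured as a simple dict. This is only a sketch based on the table above; APCP_surface is left out since it has no direct NWM (GCP) counterpart:

# Sketch: map NWM (GCP) forcing variable names to the ngen example forcing names
NWM_TO_NGEN = {
    "LWDOWN": "DLWRF_surface",
    "SWDOWN": "DSWRF_surface",
    "PSFC": "PRES_surface",
    "Q2D": "SPFH_2maboveground",
    "T2D": "TMP_2maboveground",
    "U2D": "UGRD_10maboveground",
    "V2D": "VGRD_10maboveground",
    "RAINRATE": "precip_rate",  # ngen also expects APCP_surface, which has no GCP equivalent
}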

jameshalgren linked a pull request Dec 19, 2022 that will close this issue
jameshalgren linked a pull request Jan 4, 2023 that will close this issue
jameshalgren changed the title from "NWM forcing inputs for ngen" to "Create script to convert NWM forcing inputs into ngen format" Jan 6, 2023
jameshalgren changed the title from "Create script to convert NWM forcing inputs into ngen format" to "Create script to convert NWM forcing inputs into ngen basins and format" Jan 6, 2023
jameshalgren commented Feb 22, 2023

Working on converting the outputs from the "new" way into something a little more structured for output to csv... see #26

jameshalgren mentioned this issue May 5, 2023