Tools for running the NextGen National Water Model

JordanLaserGit/ngen-datastream

 
 

NextGen Water Modeling Framework Datastream

ngen-datastream automates the process of collecting and formatting input data for NextGen, orchestrating the NextGen run through NextGen In a Box (NGIAB), and handling outputs. This software allows users to run NextGen in an efficient, relatively painless, and reproducible fashion.

Getting Started

  • Installation: Follow the step-by-step instructions in the Installation Guide to set up ngen-datastream on your system.
  • Usage: Learn how to use ngen-datastream effectively by referring to the comprehensive Usage Guide.

Run it

ngen-datastream can be executed using CLI arguments or a configuration file. Not all arguments are required.

> cd ngen-datastream && ./scripts/stream.sh --help

Usage: ./scripts/stream.sh [options]
Either provide a datastream configuration file
  -c, --CONF_FILE           <Path to datastream configuration file> 
or run with cli args
  -s, --START_DATE          <YYYYMMDDHHMM or "DAILY"> 
  -e, --END_DATE            <YYYYMMDDHHMM> 
  -C, --FORCING_SOURCE      <Forcing source option> 
  -D, --DOMAIN_NAME         <Name for spatial domain> 
  -g, --GEOPACKAGE          <Path to geopackage file> 
  -I, --SUBSET_ID           <Hydrofabric id to subset>  
  -i, --SUBSET_ID_TYPE      <Hydrofabric id type>  
  -v, --HYDROFABRIC_VERSION <Hydrofabric version> 
  -R, --REALIZATION         <Path to realization file> 
  -d, --DATA_DIR            <Path to write to> 
  -r, --RESOURCE_DIR        <Path to resource directory> 
  -f, --NWM_FORCINGS_DIR    <Path to nwm forcings directory> 
  -F, --NGEN_FORCINGS       <Path to ngen forcings directory, tarball, or netcdf> 
  -N, --NGEN_BMI_CONFS      <Path to ngen BMI config directory> 
  -S, --S3_MOUNT            <Path to mount s3 bucket to>  
  -o, --S3_PREFIX           <File prefix within s3 mount> 
  -n, --NPROCS              <Process limit> 
  -y, --DRYRUN              <True to skip calculations> 

First, obtain a hydrofabric file for the gage you wish to model. For example, for Palisade, Colorado:

hfsubset -w medium_range -s nextgen -v 2.1.1 -l divides,flowlines,network,nexus,forcing-weights,flowpath-attributes,model-attributes -o palisade.gpkg -t hl "Gages-09106150"

Then feed the hydrofabric file to ngen-datastream along with a few CLI arguments that define the time domain and NextGen configuration. The command below executes a 24-hour NextGen simulation over the Palisade domain defined by the geopackage, with a CFE, SLOTH, PET, NOM, and t-route configuration, distributed over 4 processes. See more examples.

./scripts/stream.sh -s 202006200100 -e 202006210000 -C NWM_RETRO_V3 -d $(pwd)/data/datastream_test -g $(pwd)/palisade.gpkg -R $(pwd)/configs/ngen/realization_sloth_nom_cfe_pet_troute.json -n 4
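
When runs are launched from a script, the same invocation can be assembled programmatically. The following is a minimal sketch, not part of ngen-datastream: the helper name `build_stream_command` is hypothetical, and the short flags are copied from the `--help` output above.

```python
import os
import shlex

# Short flags copied from the `stream.sh --help` output.
FLAGS = {
    "CONF_FILE": "-c", "START_DATE": "-s", "END_DATE": "-e",
    "FORCING_SOURCE": "-C", "DOMAIN_NAME": "-D", "GEOPACKAGE": "-g",
    "SUBSET_ID": "-I", "SUBSET_ID_TYPE": "-i", "HYDROFABRIC_VERSION": "-v",
    "REALIZATION": "-R", "DATA_DIR": "-d", "RESOURCE_DIR": "-r",
    "NWM_FORCINGS_DIR": "-f", "NGEN_FORCINGS": "-F", "NGEN_BMI_CONFS": "-N",
    "S3_MOUNT": "-S", "S3_PREFIX": "-o", "NPROCS": "-n", "DRYRUN": "-y",
}


def build_stream_command(options, script="./scripts/stream.sh"):
    """Return an argv list for stream.sh (hypothetical helper).

    Options set to None are skipped, since not all arguments are required.
    An unknown option name raises KeyError.
    """
    argv = [script]
    for name, value in options.items():
        if value is None:
            continue
        argv += [FLAGS[name], str(value)]
    return argv


if __name__ == "__main__":
    cwd = os.getcwd()
    argv = build_stream_command({
        "START_DATE": "202006200100",
        "END_DATE": "202006210000",
        "FORCING_SOURCE": "NWM_RETRO_V3",
        "DATA_DIR": f"{cwd}/data/datastream_test",
        "GEOPACKAGE": f"{cwd}/palisade.gpkg",
        "REALIZATION": f"{cwd}/configs/ngen/realization_sloth_nom_cfe_pet_troute.json",
        "NPROCS": 4,
    })
    # Print a shell-quoted command line; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Building an argv list rather than a single string avoids quoting problems when paths contain spaces.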

To see what's happening in ngen-datastream step-by-step, see the breakdown document.

Explanation of CLI arguments (or variables defined in CONF_FILE)

| Field | Description | Required |
|---|---|---|
| START_DATE | Start simulation time (YYYYMMDDHHMM) or "DAILY" | |
| END_DATE | End simulation time (YYYYMMDDHHMM) | |
| FORCING_SOURCE | Select the forcings data provider. Options include NWM_RETRO_V2, NWM_RETRO_V3, NWM_OPERATIONAL_V3, NOMADS_OPERATIONAL | |
| DOMAIN_NAME | Name for spatial domain in run, stripped from gpkg if not supplied | |
| GEOPACKAGE | Path to hydrofabric, can be s3URI, URL, or local file. Generate file with hfsubset or use SUBSET args. | Required here or file exists in RESOURCE_DIR/config |
| SUBSET_ID_TYPE | id type corresponding to "id". See hfsubset for options | Required here if user is not providing GEOPACKAGE and GEOPACKAGE_ATTR. |
| SUBSET_ID | catchment id to subset. See hfsubset for options | Required here if user is not providing GEOPACKAGE and GEOPACKAGE_ATTR. |
| HYDROFABRIC_VERSION | $\geq$ v20.1. See hfsubset for options | Required here if user is not providing GEOPACKAGE and GEOPACKAGE_ATTR. |
| REALIZATION | Path to NextGen realization file | Required here or file exists in RESOURCE_DIR/config |
| DATA_DIR | Absolute local path to construct the datastream run. | |
| RESOURCE_DIR | Path to directory that contains the datastream resources. More explanation here. | |
| NWM_FORCINGS_DIR | Path to local directory containing nwm files. Alternatively, these files could be stored in RESOURCE_DIR as nwm-forcings. | |
| NGEN_BMI_CONFS | Path to local directory containing NextGen BMI configuration files. Alternatively, these files could be stored in RESOURCE_DIR under config/. See here for directory structure. | |
| NGEN_FORCINGS | Path to local ngen forcings directory holding ngen forcing CSV or Parquet files. Also accepts a tarball or netcdf. Alternatively, the file(s) could be stored in RESOURCE_DIR at ngen-forcings/. | |
| S3_MOUNT | Path to mount S3 bucket to. ngen-datastream will copy outputs here. | |
| S3_PREFIX | Prefix to prepend to all files when copying to s3 | |
| DRYRUN | Set to "True" to skip all compute steps. | |
| NPROCS | Maximum number of processes to use in any step of ngen-datastream. Defaults to nprocs - 2 | |
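
The YYYYMMDDHHMM format used by START_DATE and END_DATE can be checked before a run is launched. The sketch below is an illustration only and is not how ngen-datastream itself parses dates; the function names `parse_datastream_date` and `simulation_span` are hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%Y%m%d%H%M"  # YYYYMMDDHHMM


def parse_datastream_date(value, allow_daily=False):
    """Parse a YYYYMMDDHHMM string; return None for "DAILY" when allowed."""
    if allow_daily and value == "DAILY":
        return None
    # strptime tolerates some short inputs, so check the width explicitly.
    if len(value) != 12 or not value.isdigit():
        raise ValueError(f"expected YYYYMMDDHHMM, got {value!r}")
    return datetime.strptime(value, DATE_FORMAT)


def simulation_span(start, end):
    """Return the timedelta between two YYYYMMDDHHMM timestamps."""
    start_dt = parse_datastream_date(start)
    end_dt = parse_datastream_date(end)
    if end_dt < start_dt:
        raise ValueError("END_DATE precedes START_DATE")
    return end_dt - start_dt


if __name__ == "__main__":
    print(simulation_span("202006200100", "202006210000"))
```

For the dates used in the example command, the first and last timestamps are 23 hours apart, which corresponds to 24 timestamps if both ends are counted at an hourly step (the hourly step is an assumption here).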

ngen-datastream Output Directory Structure

When the datastream is executed, a folder with the structure below is constructed at DATA_DIR:

DATA_DIR/
│
├── datastream-metadata/
│
├── datastream-resources/
│
├── ngen-run/
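
A quick way to confirm that a finished run produced this layout is to look for the three top-level folders. This is a minimal sketch; `missing_output_folders` is a hypothetical helper, not part of ngen-datastream.

```python
from pathlib import Path

# Top-level folders shown in the tree above.
EXPECTED_FOLDERS = ("datastream-metadata", "datastream-resources", "ngen-run")


def missing_output_folders(data_dir):
    """Return the expected top-level folders that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Path taken from the example command; adjust to your own DATA_DIR.
    print(missing_output_folders("data/datastream_test"))
```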

Each folder is explained below.

datastream-metadata/

Holds metadata about the ngen-datastream execution, giving a relatively condensed view of how the execution was performed. Example directory:

datastream-metadata/
│
├── conf_datastream.json
│
├── conf_fp.json
|
├── conf_nwmurl.json
|
├── profile_fp.txt
|
├── profile.txt
|
├── filenamelist.txt
|
├── realization.json

| File Type | Path in DATA_DIR | Description | Naming |
|---|---|---|---|
| DATASTREAM CONFIGURATION | datastream-metadata/conf_datastream.json | Holds metadata about the execution | conf_datastream.json |
| FORCING PROCESSOR CONFIGURATION | datastream-metadata/conf_fp.json | Configuration file for forcingprocessor. See here | conf_fp.json |
| NWM URL CONFIGURATION | datastream-metadata/conf_nwmurl.json | Configuration file for nwmurl. See here | conf_nwmurl.json |
| PROFILE | datastream-metadata/profile_fp.txt | Datetime print statements that allow for profiling each step in forcingprocessor | profile_fp.txt |
| PROFILE | datastream-metadata/profile.txt | Datetime print statements that allow for profiling each step in datastream | profile.txt |
| FILENAME LIST | datastream-metadata/filenamelist.txt | Local file paths or URLs to NWM forcings. Generated by nwmurl. | filenamelist.txt |
| REALIZATION | datastream-metadata/realization.json | NextGen configuration file. See here | realization.json |

RESOURCE_DIR (datastream-resources/)

datastream-resources/ holds all the input data files required for the computations ngen-datastream performs. This folder is not required as input, but supplying it makes repeated runs over a given spatial or time domain faster.

Examples of the application of the resource directory:

  1. Repeated executions. ngen-datastream will retrieve files that are given as arguments remotely; however, this can take time depending on the network connection between the data source and the host. Storing these files locally in RESOURCE_DIR for repeated runs saves time and network bandwidth. It also saves the compute required to build input files from scratch.
  2. Communicating runs. ngen-datastream versions everything in DATA_DIR, so a single hash corresponds to a unique RESOURCE_DIR. This allows users to quickly identify potential differences between ngen-datastream input data.

Guide for building a RESOURCE_DIR

The easiest way to create a reusable resource directory is to execute ngen-datastream and save DATA_DIR/datastream-resources for later use. A user-defined RESOURCE_DIR may take the form below. Only one file of each type is allowed (e.g. it cannot contain two geopackages or two realizations). Not every file is required. By default ngen-datastream generates all required files, but it skips those steps if the corresponding files exist in the resource directory.

RESOURCE_DIR/
|
├── config/
|   │
|   ├── nextgen_09.gpkg
|   |
|   ├── realization.json
|   |
|   ├── ngen.yaml
|   |
|   ├── partitions.json
|   |
|   ├── cat-config/
|   │   |
|   |   ├──PET/
|   │   |
|   |   ├──CFE/
|   │   |
|   |   ├──NOAH-OWP-M/
|
├── nwm-forcings/
|   |
|   ├── nwm.t00z.medium_range.forcing.f001.conus
|   |
|   ├── ...
|
├── ngen-forcings/
|   |
|   ├── forcings.nc
|

| File Type | Path in Resource Directory | Example Link | Description | Naming |
|---|---|---|---|---|
| BMI CONFIGURATION | config/cat-config | | Directory holding BMI module configuration files defined in realization file. | See here |
| REALIZATION | config/realization.json | link | NextGen configuration | *realization*.json |
| GEOPACKAGE | config/nextgen_01.gpkg | link | Hydrofabric file of version $\geq$ v20.1. Ignored if subset hydrofabric options are set in datastream config. See Lynker-Spatial for complete VPU geopackages or hfsubset for generating your own custom domain. hfsubset can be invoked indirectly through ngen-datastream through the subsetting args. | *.gpkg |
| PARTITIONS | config/partitions_$NPROCS.json | | File generated by the NextGen framework to distribute processing by spatial domain. | *partitions*.json |
| FORCINGS | nwm-forcings/*.nc | link | NetCDF National Water Model forcing files. These are not saved to the resource directory by default. | *.nc |
| FORCINGS | ngen-forcings/*.nc | | netcdf holding ngen forcings. | *.nc (by default), *.tar.gz, *.csv, *.parquet |
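
The one-file-per-type rule for RESOURCE_DIR can be checked ahead of a run. The sketch below is an illustration and does not reflect how ngen-datastream performs its own lookup: `find_config_files` is a hypothetical helper, and the glob patterns are taken from the naming conventions listed for geopackages and realizations.

```python
from pathlib import Path

# Naming conventions for the two file types the rule explicitly mentions.
PATTERNS = {
    "GEOPACKAGE": "*.gpkg",
    "REALIZATION": "*realization*.json",
}


def find_config_files(resource_dir):
    """Map each file type to its single match in RESOURCE_DIR/config, or None.

    Raises ValueError when more than one file of a type is present.
    """
    config = Path(resource_dir) / "config"
    found = {}
    for file_type, pattern in PATTERNS.items():
        matches = sorted(config.glob(pattern)) if config.is_dir() else []
        if len(matches) > 1:
            names = ", ".join(p.name for p in matches)
            raise ValueError(f"more than one {file_type} in {config}: {names}")
        found[file_type] = matches[0] if matches else None
    return found


if __name__ == "__main__":
    # "RESOURCE_DIR" is a placeholder path for illustration.
    print(find_config_files("RESOURCE_DIR"))
```

A value of None means that file type is absent from the resource directory.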

ngen-run/

Running NextGen requires building a standard run directory that contains only the necessary files. The datastream constructs this directory automatically, but it can also be built manually. Below is an explanation of the standard. See here for the discussion of the standard.

A NextGen run directory ngen-run is composed of three necessary subfolders (config, forcings, and outputs) and an optional fourth subfolder, metadata.

ngen-run/
│
├── config/
│
├── forcings/
|
├── metadata/
│
├── outputs/

The ngen-run directory contains the following subfolders:

  • config: model configuration files and hydrofabric configuration files. See the deeper explanation here.
  • forcings: catchment-level forcing timeseries files. These can be generated with the forcingprocessor. Forcing files contain variables like wind speed, temperature, precipitation, and solar radiation.
  • metadata: an optional subfolder. It is programmatically generated and used internally by ngen. Do not edit this folder.
  • outputs: This is where ngen will place the output files.
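
For a manually built run directory, the three required subfolders can be created up front. This is a minimal sketch with a hypothetical helper name, `make_ngen_run_skeleton`; it deliberately leaves out metadata, since that folder is generated programmatically.

```python
from pathlib import Path

# The three required subfolders of the run directory standard.
REQUIRED_SUBFOLDERS = ("config", "forcings", "outputs")


def make_ngen_run_skeleton(parent):
    """Create parent/ngen-run with its required subfolders and return its path."""
    run_dir = Path(parent) / "ngen-run"
    for name in REQUIRED_SUBFOLDERS:
        # exist_ok makes the call safe to repeat on an existing directory.
        (run_dir / name).mkdir(parents=True, exist_ok=True)
    return run_dir


if __name__ == "__main__":
    print(make_ngen_run_skeleton("."))
```

The configuration and forcing files still have to be copied into config and forcings before ngen can run.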

Configuration directory ngen-run/config/

This folder contains the NextGen realization file, which serves as the primary model configuration for the ngen framework. This file specifies which models to run and with which parameters, run parameters like date and time, and hydrofabric specifications.

Based on the models defined in the realization file, BMI configuration files may be required. For models that require per-catchment configuration files, a folder for each model holds these files under ngen-run/config/cat-config. See here for the models for which ngen-datastream supports automated BMI configuration file generation. See the directory structure convention below.

ngen-run/
|
├── config/
|   │
|   ├── nextgen_09.gpkg
|   |
|   ├── realization.json
|   |
|   ├── ngen.yaml
|   |
|   ├── cat-config/
|   │   |
|   |   ├──PET/
|   │   |
|   |   ├──CFE/
|   │   |
|   |   ├──NOAH-OWP-M/
...

Versioning

ngen-datastream uses a Merkle tree hashing algorithm to version each execution with merkdir. This means all input and output files in an ngen-datastream execution are hashed in such a way that tracking minute changes among millions of files is trivial.
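
To illustrate the idea of Merkle tree hashing over a directory, the sketch below computes a hash for each file from its contents and a hash for each directory from the names and hashes of its children. It is a generic illustration using SHA-256 and is an assumption, not merkdir's actual algorithm or on-disk format; `merkle_hash` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def merkle_hash(path):
    """Return a hex digest for a file or, recursively, for a directory tree."""
    path = Path(path)
    digest = hashlib.sha256()
    if path.is_file():
        # Leaf node: hash the file contents.
        digest.update(b"file\0")
        digest.update(path.read_bytes())
    else:
        # Interior node: hash the sorted child names and their hashes.
        digest.update(b"dir\0")
        for child in sorted(path.iterdir(), key=lambda p: p.name):
            digest.update(child.name.encode("utf-8") + b"\0")
            digest.update(merkle_hash(child).encode("ascii"))
    return digest.hexdigest()


if __name__ == "__main__":
    print(merkle_hash("."))
```

Because a directory's hash depends on every hash beneath it, a change to any single file changes the root hash, and comparing child hashes level by level locates the changed file without rereading unchanged subtrees.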

License

ngen-datastream is distributed under the GNU General Public License v3.0 or later.
