ngen-datastream automates the process of collecting and formatting input data for NextGen, orchestrating the NextGen run through NextGen In a Box (NGIAB), and handling outputs. This software allows users to run NextGen in an efficient, relatively painless, and reproducible fashion.
- Installation: Follow the step-by-step instructions in the Installation Guide to set up ngen-datastream on your system.
- Usage: Learn how to use ngen-datastream effectively by referring to the comprehensive Usage Guide.
ngen-datastream can be executed using CLI args or a configuration file. Not all arguments are required.
> cd ngen-datastream && ./scripts/stream.sh --help
Usage: ./scripts/stream.sh [options]
Either provide a datastream configuration file
-c, --CONF_FILE <Path to datastream configuration file>
or run with cli args
-s, --START_DATE <YYYYMMDDHHMM or "DAILY">
-e, --END_DATE <YYYYMMDDHHMM>
-C, --FORCING_SOURCE <Forcing source option>
-D, --DOMAIN_NAME <Name for spatial domain>
-g, --GEOPACKAGE <Path to geopackage file>
-I, --SUBSET_ID <Hydrofabric id to subset>
-i, --SUBSET_ID_TYPE <Hydrofabric id type>
-v, --HYDROFABRIC_VERSION <Hydrofabric version>
-R, --REALIZATION <Path to realization file>
-d, --DATA_DIR <Path to write to>
-r, --RESOURCE_DIR <Path to resource directory>
-f, --NWM_FORCINGS_DIR <Path to nwm forcings directory>
-F, --NGEN_FORCINGS <Path to ngen forcings directory, tarball, or netcdf>
-N, --NGEN_BMI_CONFS <Path to ngen BMI config directory>
-S, --S3_MOUNT <Path to mount s3 bucket to>
-o, --S3_PREFIX <File prefix within s3 mount>
-n, --NPROCS <Process limit>
-y, --DRYRUN <True to skip calculations>
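For scripted or repeated runs, it can be convenient to assemble the argument list programmatically. The following is a minimal sketch, assuming the long option names shown in the help text are accepted as `--NAME value` pairs; the helper `build_stream_command` and the `/tmp` paths are hypothetical and only illustrate the idea.

```python
import shlex


def build_stream_command(options, script="./scripts/stream.sh"):
    """Return an argv list that passes each option as '--NAME value'.

    Option names are taken from the help text; this helper is a
    hypothetical convenience, not part of ngen-datastream.
    """
    argv = [script]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


options = {
    "START_DATE": "202006200100",
    "END_DATE": "202006210000",
    "FORCING_SOURCE": "NWM_RETRO_V3",
    "DATA_DIR": "/tmp/datastream_test",        # hypothetical path
    "GEOPACKAGE": "/tmp/palisade.gpkg",        # hypothetical path
    "REALIZATION": "/tmp/realization.json",    # hypothetical path
    "NPROCS": 4,
}

command = build_stream_command(options)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` from the repository root.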
First, obtain a hydrofabric file for the gage you wish to model. For example for Palisade, Colorado:
hfsubset -w medium_range -s nextgen -v 2.1.1 -l divides,flowlines,network,nexus,forcing-weights,flowpath-attributes,model-attributes -o palisade.gpkg -t hl "Gages-09106150"
Then feed the hydrofabric file to ngen-datastream along with a few cli args to define the time domain and NextGen configuration. This command will execute a 24 hour NextGen simulation over VPU 09 with CFE, SLOTH, PET, NOM, and t-route configuration distributed over 4 processes. See more examples.
./scripts/stream.sh -s 202006200100 -e 202006210000 -C NWM_RETRO_V3 -d $(pwd)/data/datastream_test -g $(pwd)/palisade.gpkg -R $(pwd)/configs/ngen/realization_sloth_nom_cfe_pet_troute.json -n 4
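The start and end times use the YYYYMMDDHHMM format. As a quick check of the time window, the sketch below parses both values and counts hourly timesteps, assuming hourly steps with both endpoints included (an assumption made for illustration, not a statement about how ngen-datastream counts internally). The function name `hourly_steps` is hypothetical.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%Y%m%d%H%M"  # YYYYMMDDHHMM, as in the help text


def hourly_steps(start, end):
    """Count hourly timesteps from start to end, endpoints included."""
    first = datetime.strptime(start, DATE_FORMAT)
    last = datetime.strptime(end, DATE_FORMAT)
    if last < first:
        raise ValueError("END_DATE precedes START_DATE")
    # Whole hours between the two instants, plus one for the first step.
    return int((last - first) / timedelta(hours=1)) + 1


steps = hourly_steps("202006200100", "202006210000")
print(steps)  # 24
```

Under these assumptions, the window from 01:00 on 20 June 2020 to 00:00 on 21 June 2020 spans 24 hourly steps.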
To see what's happening in ngen-datastream step-by-step, see the breakdown document.
Field | Description | Required |
---|---|---|
START_DATE | Start simulation time (YYYYMMDDHHMM) or "DAILY" | ✅ |
END_DATE | End simulation time (YYYYMMDDHHMM) | ✅ |
FORCING_SOURCE | Select the forcings data provider. Options include NWM_RETRO_V2, NWM_RETRO_V3, NWM_OPERATIONAL_V3, NOMADS_OPERATIONAL | ✅ |
DOMAIN_NAME | Name for spatial domain in run, stripped from gpkg if not supplied | |
GEOPACKAGE | Path to hydrofabric, can be s3URI, URL, or local file. Generate file with hfsubset or use SUBSET args. | Required here or file exists in RESOURCE_DIR/config |
SUBSET_ID_TYPE | id type corresponding to "id". See hfsubset for options. | Required here if user is not providing GEOPACKAGE and GEOPACKAGE_ATTR. |
SUBSET_ID | catchment id to subset. See hfsubset for options. | Required here if user is not providing GEOPACKAGE and GEOPACKAGE_ATTR. |
HYDROFABRIC_VERSION | Hydrofabric version | Required here if user is not providing GEOPACKAGE and GEOPACKAGE_ATTR. |
REALIZATION | Path to NextGen realization file | Required here or file exists in RESOURCE_DIR/config |
DATA_DIR | Absolute local path to construct the datastream run. | ✅ |
RESOURCE_DIR | Path to directory that contains the datastream resources. More explanation here. | |
NWM_FORCINGS_DIR | Path to local directory containing nwm files. Alternatively, these files could be stored in RESOURCE_DIR as nwm-forcings. | |
NGEN_BMI_CONFS | Path to local directory containing NextGen BMI configuration files. Alternatively, these files could be stored in RESOURCE_DIR under config/. See here for directory structure. | |
NGEN_FORCINGS | Path to local ngen forcings directory holding ngen forcing CSV or parquet files. Also accepts tarball or netcdf. Alternatively, these files could be stored in RESOURCE_DIR at ngen-forcings/. | |
S3_MOUNT | Path to mount S3 bucket to. ngen-datastream will copy outputs here. | |
S3_PREFIX | Prefix to prepend to all files when copying to s3 | |
DRYRUN | Set to "True" to skip all compute steps. | |
NPROCS | Maximum number of processes to use in any step of ngen-datastream. Defaults to nprocs - 2 | |
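The "Required" column above can be read as a small set of rules. The following sketch encodes them as a pre-flight check, under stated assumptions: the function `check_options` is hypothetical, the forcing source list is only what the table names (it says "options include", so the set may be incomplete), and the presence of RESOURCE_DIR is taken as enough to satisfy the "file exists in RESOURCE_DIR/config" condition without inspecting the directory.

```python
# Forcing sources named in the table; possibly incomplete (assumption).
KNOWN_FORCING_SOURCES = {
    "NWM_RETRO_V2",
    "NWM_RETRO_V3",
    "NWM_OPERATIONAL_V3",
    "NOMADS_OPERATIONAL",
}
ALWAYS_REQUIRED = ("START_DATE", "END_DATE", "FORCING_SOURCE", "DATA_DIR")
SUBSET_FIELDS = ("SUBSET_ID_TYPE", "SUBSET_ID", "HYDROFABRIC_VERSION")


def check_options(options):
    """Return a list of problems; an empty list means none were found."""
    problems = [
        f"{name} is required" for name in ALWAYS_REQUIRED if not options.get(name)
    ]
    source = options.get("FORCING_SOURCE")
    if source and source not in KNOWN_FORCING_SOURCES:
        problems.append(f"unrecognized FORCING_SOURCE: {source}")
    # Files may instead live in RESOURCE_DIR/config; this sketch only
    # notes that a resource directory was supplied.
    has_resource_dir = bool(options.get("RESOURCE_DIR"))
    if not options.get("GEOPACKAGE") and not has_resource_dir:
        missing = [name for name in SUBSET_FIELDS if not options.get(name)]
        if missing:
            problems.append(
                "no GEOPACKAGE, and subset fields missing: " + ", ".join(missing)
            )
    if not options.get("REALIZATION") and not has_resource_dir:
        problems.append("REALIZATION is required unless RESOURCE_DIR/config holds one")
    return problems


complete = {
    "START_DATE": "202006200100",
    "END_DATE": "202006210000",
    "FORCING_SOURCE": "NWM_RETRO_V3",
    "DATA_DIR": "/tmp/datastream_test",      # hypothetical path
    "GEOPACKAGE": "/tmp/palisade.gpkg",      # hypothetical path
    "REALIZATION": "/tmp/realization.json",  # hypothetical path
}
incomplete = check_options({"START_DATE": "DAILY"})
print(check_options(complete))
print(incomplete)
```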
When the datastream is executed, a folder with the structure below is constructed at DATA_DIR:
DATA-PATH/
│
├── datastream-metadata/
│
├── datastream-resources/
|
├── ngen-run/
Each folder is explained below.
datastream-metadata/ holds metadata about the ngen-datastream execution that allows for a relatively condensed view of how the execution was performed.
Example directory:
datastream-metadata/
│
├── conf_datastream.json
│
├── conf_fp.json
|
├── conf_nwmurl.json
|
├── profile_fp.txt
|
├── profile.txt
|
├── filenamelist.txt
|
├── realization.json
File Type | Path in Resource Directory | Description | Naming |
---|---|---|---|
DATASTREAM CONFIGURATION | datastream-metadata/conf_datastream.json | Holds metadata about the execution | conf_datastream.json |
FORCING PROCESSOR CONFIGURATION | datastream-metadata/conf_fp.json | Configuration file for forcingprocessor. See here | conf_fp.json |
NWM URL CONFIGURATION | datastream-metadata/conf_nwmurl.json | Configuration file for nwmurl. See here | conf_nwmurl.json |
PROFILE | datastream-metadata/profile_fp.txt | Datetime print statements that allow for profiling each step in forcingprocessor | profile_fp.txt |
PROFILE | datastream-metadata/profile.txt | Datetime print statements that allow for profiling each step in datastream | profile.txt |
FILENAME LIST | datastream-metadata/filenamelist.txt | Local file paths or URLs to NWM forcings. Generated by nwmurl. | filenamelist.txt |
REALIZATION | datastream-metadata/realization.json | NextGen configuration file. See here | realization.json |
datastream-resources/ holds all the input data files required to perform the various computations ngen-datastream performs. This folder is not required as input, but supplying it makes repeated ngen-datastream runs over a given spatial or time domain faster.
Examples of the application of the resource directory:
- Repeated executions. ngen-datastream will retrieve files (that are given as arguments) remotely, however this can take time depending on the networking between the data source and host. Storing these files locally in RESOURCE_DIR for repeated runs will save time and network bandwidth. In addition, this saves on compute required to build input files from scratch.
- Communicating runs. ngen-datastream versions everything in DATA_DIR, which means a single hash corresponds to a unique RESOURCE_DIR, which allows users to quickly identify potential differences between ngen-datastream input data.
The easiest way to create a reusable resource directory is to execute ngen-datastream and save DATA_DIR/datastream-resources for later use. A user defined RESOURCE_DIR may take the form below. Only one file of each type is allowed (e.g. cannot have two geopackages or two realizations). Not every file is required. ngen-datastream will generate all required files by default, but will skip those steps if corresponding files exist in the resource directory.
RESOURCE_DIR/
|
├── config/
| │
| ├── nextgen_09.gpkg
| |
| ├── realization.json
| |
| ├── ngen.yaml
| |
| ├── partitions.json
| |
| ├── cat-config/
| │ |
| | ├──PET/
| │ |
| | ├──CFE/
| │ |
| | ├──NOAH-OWP-M/
|
├── nwm-forcings/
| |
| ├── nwm.t00z.medium_range.forcing.f001.conus
| |
| ├── ...
|
├── ngen-forcings/
| |
| ├── forcings.nc
|
File Type | Path in Resource Directory | Example Link | Description | Naming |
---|---|---|---|---|
BMI CONFIGURATION | config/cat-config | | Directory holding BMI module configuration files defined in realization file. | See here |
REALIZATION | config/realization.json | link | NextGen configuration | *realization*.json |
GEOPACKAGE | config/nextgen_01.gpkg | link | Hydrofabric file. hfsubset can be invoked indirectly through ngen-datastream via the subsetting args. | *.gpkg |
PARTITIONS | config/partitions_$NPROCS.json | | File generated by the NextGen framework to distribute processing by spatial domain. | *partitions*.json |
FORCINGS | nwm-forcings/*.nc | link | NetCDF National Water Model forcing files. These are not saved to the resource directory by default. | *.nc |
FORCINGS | ngen-forcings/*.nc | | NetCDF holding ngen forcings. | *.nc (by default), *.tar.gz, *.csv, *.parquet |
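Because only one geopackage and one realization are allowed in a resource directory, it can help to check a directory before reusing it. The sketch below applies the naming patterns from the table to RESOURCE_DIR/config and raises if a pattern matches more than one file. The function `find_config_files` is hypothetical and is not how ngen-datastream itself performs the check.

```python
import tempfile
from pathlib import Path

# Naming patterns taken from the table above.
PATTERNS = {
    "GEOPACKAGE": "*.gpkg",
    "REALIZATION": "*realization*.json",
}


def find_config_files(resource_dir):
    """Map each file type to its single match in config/, or None."""
    config = Path(resource_dir) / "config"
    found = {}
    for file_type, pattern in PATTERNS.items():
        matches = sorted(p.name for p in config.glob(pattern))
        if len(matches) > 1:
            raise ValueError(f"more than one {file_type} file: {matches}")
        found[file_type] = matches[0] if matches else None
    return found


# Build a throwaway resource directory to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp) / "config"
    config_dir.mkdir()
    (config_dir / "nextgen_09.gpkg").touch()
    (config_dir / "realization.json").touch()
    result = find_config_files(tmp)

print(result)
```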
Running NextGen requires building a standard run directory complete with only the necessary files. The datastream constructs this automatically, but it can be built manually as well. Below is an explanation of the standard. Reference for discussion of the standard here.
A NextGen run directory ngen-run is composed of three necessary subfolders (config, forcings, outputs) and an optional fourth subfolder, metadata.
ngen-run/
│
├── config/
│
├── forcings/
|
├── metadata/
│
├── outputs/
The ngen-run directory contains the following subfolders:
- config: model configuration files and hydrofabric configuration files. A deeper explanation here.
- forcings: catchment-level forcing timeseries files. These can be generated with the forcingprocessor. Forcing files contain variables like wind speed, temperature, precipitation, and solar radiation.
- metadata: an optional subfolder. This is programmatically generated and is used within ngen. Do not edit this folder.
- outputs: where ngen will place the output files.
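When building a run directory by hand, a quick structural check can catch a missing subfolder early. The following is a minimal sketch, assuming only the three necessary subfolder names given above; the helper `missing_subfolders` is hypothetical.

```python
import tempfile
from pathlib import Path

# The three necessary subfolders; metadata is optional and not checked.
REQUIRED_SUBFOLDERS = ("config", "forcings", "outputs")


def missing_subfolders(run_dir):
    """List the necessary subfolders that do not exist under run_dir."""
    run_dir = Path(run_dir)
    return [name for name in REQUIRED_SUBFOLDERS if not (run_dir / name).is_dir()]


# Build a partial ngen-run directory, check it, then complete it.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "ngen-run"
    (run_dir / "config").mkdir(parents=True)
    (run_dir / "forcings").mkdir()
    before = missing_subfolders(run_dir)
    (run_dir / "outputs").mkdir()
    after = missing_subfolders(run_dir)

print(before, after)  # ['outputs'] []
```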
The config folder contains the NextGen realization file, which serves as the primary model configuration for the ngen framework. This file specifies which models to run and with which parameters, run parameters like date and time, and hydrofabric specifications.
Based on the models defined in the realization file, BMI configuration files may be required. For those models that require per-catchment configuration files, a folder will hold these files for each model in ngen-run/config/cat-config. See here for the models for which ngen-datastream supports automated BMI configuration file generation. See the directory structure convention below.
ngen-run/
|
├── config/
| │
| ├── nextgen_09.gpkg
| |
| ├── realization.json
| |
| ├── ngen.yaml
| |
| ├── cat-config/
| │ |
| | ├──PET/
| │ |
| | ├──CFE/
| │ |
| | ├──NOAH-OWP-M/
...
ngen-datastream uses a Merkle tree hashing algorithm to version each execution with merkdir. This means all input and output files in an ngen-datastream execution will be hashed in such a way that tracking minute changes among millions of files is trivial.
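To illustrate the general idea of Merkle tree hashing over a directory, the sketch below hashes each file's contents and derives each directory's hash from the names and hashes of its children, so any change to any file changes the root hash. This is a generic illustration using SHA-256 and is an assumption made for teaching purposes; it is not merkdir's actual algorithm or output format, and `hash_tree` is a hypothetical name.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_tree(path):
    """Return a hex digest for a file, or for a directory from its children."""
    path = Path(path)
    if path.is_file():
        return hashlib.sha256(path.read_bytes()).hexdigest()
    digest = hashlib.sha256()
    # Sort children by name so the result does not depend on listing order.
    for child in sorted(path.iterdir(), key=lambda p: p.name):
        digest.update(child.name.encode())
        digest.update(hash_tree(child).encode())
    return digest.hexdigest()


# Hash a small directory, change one file, and hash it again.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "ngen-run"
    (root / "config").mkdir(parents=True)
    realization = root / "config" / "realization.json"
    realization.write_text("{}")
    original = hash_tree(root)
    realization.write_text('{"changed": true}')
    modified = hash_tree(root)

print(original != modified)  # True
```

Because each directory hash depends only on its children, two trees can be compared from the root down, descending only into subdirectories whose hashes differ.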
ngen-datastream is distributed under the GNU General Public License v3.0 or later.