Commit: use a new case to try prepare_data script

OuyangWenyu committed Mar 25, 2024
1 parent d18c904 commit 0db0480
Showing 8 changed files with 178 additions and 116 deletions.
95 changes: 51 additions & 44 deletions README.md
@@ -12,78 +12,85 @@

**Hydromodel is a python implementation for common hydrological models such as the XinAnJiang (XAJ) model, which is one of the most famous conceptual hydrological models, especially in Southern China.**

An additional feature of hydro-model-xaj is that it provides a differentiable version of XAJ, which means it can be nested in deep-learning algorithms. More information can be found in the "What are the main features" section below.

**Not an official version, just for learning**

## How to run

### Environment
This is a Python console program (no graphical interface yet), and it is **still under development**.

Hydro-model-xaj is a Python console program (no graphic interface now). It is **still developing**, and we have not
provided a pip or conda package for hydro-model-xaj yet, so please set up a Python environment for the code.
## How to run

If you are new to python, please [install miniconda or anaconda on your computer and config the environment](https://conda.io/projects/conda/en/stable/user-guide/install/index.html).
### Install

Since you found hydro-model-xaj on GitHub, you probably already know at least a little about Git and GitHub. Please install Git on your computer and register your own GitHub account.
We provide a pip package, which you can install with pip:

Then, fork hydro-model-xaj to your own GitHub account and clone it to your computer. If you have forked it before, please update it from [upstream](https://github.com/OuyangWenyu/hydro-model-xaj), as our previous version has some errors. Open your terminal and input:

```Shell
# clone hydro-model-xaj, if you have cloned it, ignore this step
$ git clone <address of hydro-model-xaj in your github>
# move to it
$ cd hydro-model-xaj
# if updating from upstream, pull the new version to local
$ git pull
# create python environment
$ conda env create -f environment.yml
# if conda is very slow, mamba can be an alternative:
# $ conda install -c conda-forge mamba
# $ mamba env create -f environment.yml
# activate it
$ conda activate xaj
$ conda create -n hydromodel python=3.10
$ conda activate hydromodel
# install hydromodel
$ pip install hydromodel
```

If you want to run the notebooks, please install a Jupyter kernel in your JupyterLab:

```Shell
$ python -m ipykernel install --user --name xaj --display-name "xaj"
# if you don't have JupyterLab on your PC, please install it first
# $ conda install -c conda-forge jupyterlab
$ conda activate hydromodel
$ conda install -c conda-forge ipykernel
$ python -m ipykernel install --user --name hydromodel --display-name "hydromodel"
```

### Prepare data

To use your own data to run the model, you can prepare the data in the required format:
You can use the CAMELS dataset (see [here](https://github.com/OuyangWenyu/hydrodataset) for how to prepare it) to run the model.

For one basin (we only support one basin now), the data is put in one csv/txt file.
There are four necessary columns: "time", "prcp", "pet", and "flow". "time" is the time series, "prcp" is the precipitation, "pet" is the potential evapotranspiration, and "flow" is the observed streamflow.
The time series should be continuous (NaN values are allowed), and the time step should be the same for all columns. The time format should be "YYYY-MM-DD HH:MM:SS". The data should be sorted by time.
To use your own data to run the model, you need to prepare the data in the required format.

You can run a checker function to see if the data is in the right format:
We provide some transformation functions in the "scripts" directory. You can use them to transform your data to the required format.

```Shell
$ cd hydromodel/scripts
$ python check_data_format.py --data_file <absolute path of the data file>
But you still need to do some manual work before transformation. Here are the steps:

1. Put all data in one directory and check that it is organized in the following format:
```
your_data_directory_for_hydromodel/
# one attribute csv file for all basins
├─ basin_attributes.csv
# one timeseries csv file for one basin, xxx and yyy are the basin ids
├─ basin_xxx.csv
├─ basin_yyy.csv
├─ basin_zzz.csv
├─ ...
```
basin_attributes.csv should have the following columns:
```csv
id,name,area(km^2)
xxx,Basin A,100
yyy,Basin B,200
zzz,Basin C,300
```
basin_xxx.csv should have the following columns:
```csv
time,pet(mm/day),prcp(mm/day),flow(m^3/s),et(mm/day),node1_flow(m^3/s)
2022-01-01 00:00:00,1,10,13,16,19
2022-01-02 00:00:00,2,11,14,17,20
2022-01-03 00:00:00,3,12,15,18,21
```
The order of the columns is not important, but the column names must be the same as above.
No other columns are allowed.
For the time series csv files, "et" and "node1_flow" are optional; if you don't have them, you can omit them.
The units of the variables may differ from those shown, but a unit must always be given in parentheses `()` in each column name.
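As an illustration of the unit-in-parentheses convention, a small parser can split each column name into a variable name and a unit. This is just a sketch: the helper name `split_name_and_unit` is ours, and the package's own `remove_unit_from_name` may be implemented differently.

```python
import re

def split_name_and_unit(column):
    """Split a column name like 'prcp(mm/day)' into ('prcp', 'mm/day').

    Columns without a unit, such as 'time', return (name, None).
    """
    match = re.fullmatch(r"([^()]+)\((.+)\)", column)
    if match:
        return match.group(1), match.group(2)
    return column, None

print(split_name_and_unit("prcp(mm/day)"))   # -> ('prcp', 'mm/day')
print(split_name_and_unit("flow(m^3/s)"))    # -> ('flow', 'm^3/s')
print(split_name_and_unit("time"))           # -> ('time', None)
```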

Then, you can use the data_preprocess module to transform the data to the required format:
2. Download [prepare_data.py](https://github.com/OuyangWenyu/hydro-model-xaj/tree/master/scripts) and run the following command to transform your data into the required format:
```Shell
$ python prepare_data.py --origin_data_dir <your_data_directory_for_hydromodel>
```

3. If the format is wrong, please repeat step 1 carefully. If the format is right, you can run the following command to preprocess the data, for example to split it for cross-validation:
```Shell
$ python datapreprocess4calibrate.py --data <name of the data file> --exp <name of the directory of the prepared data>
```

The data will be transformed by the data interface; here are the conventions:

- All input data for models are three-dimensional NumPy arrays: [time, basin, variable], which means "time" series data
for "variables" in "basins"
- Data files should be .npy files accompanied by a JSON file that describes the data. We provide sample code in
"test/test_data.py" to show how to process a .csv/.txt file into the required format.
- To run the model, the dataset should be split into two parts: the training dataset (used for calibration) and the testing dataset (used for evaluation). In the xxx directory, there must be four files: "basins_lump_p_pe_q_foldx_train.npy", "data_info_foldx_train.json", "basins_lump_p_pe_q_foldx_test.npy", and "data_info_foldx_test.json". (The file names cannot be changed; x is 0 if there is only one fold.)
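The npy-plus-JSON convention above can be sketched as follows. Only the file-name pattern comes from the text; the variable list and the JSON keys here are illustrative assumptions, not necessarily the package's actual schema.

```python
import json

import numpy as np

# Illustrative shapes: 10 time steps, 1 basin, 3 variables (prcp, pet, flow).
data = np.random.rand(10, 1, 3)  # [time, basin, variable]

# Save the training split of fold 0 under the required file name.
np.save("basins_lump_p_pe_q_fold0_train.npy", data)

# The accompanying JSON describes the axes; these keys are our assumption.
info = {
    "time": [f"2022-01-{d:02d}" for d in range(1, 11)],
    "basin": ["xxx"],
    "variable": ["prcp", "pet", "flow"],
}
with open("data_info_fold0_train.json", "w") as f:
    json.dump(info, f)

loaded = np.load("basins_lump_p_pe_q_fold0_train.npy")
print(loaded.shape)  # (10, 1, 3)
```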

To run models in hydro-model-xaj, one needs to prepare data in the required format.

We have provided sample data in the "example/example" directory. You can run the model with this data.

### Run the model

Run the following code:
9 changes: 8 additions & 1 deletion hydromodel/datasets/__init__.py
@@ -5,7 +5,14 @@
NODE_FLOW_NAME = "node1_flow(m^3/s)"
AREA_NAME = "area(km^2)"
TIME_NAME = "time"
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
POSSIBLE_TIME_FORMATS = [
"%Y-%m-%d %H:%M:%S",  # full date and time
"%Y-%m-%d",  # date only
"%d/%m/%Y",  # alternative date format
"%m/%d/%Y %H:%M",  # month/day/year hour:minute
"%d/%m/%Y %H:%M",  # day/month/year hour:minute
# ... more formats can be added as needed ...
]
ID_NAME = "id"
NAME_NAME = "name"
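This list is meant to be tried in order until one format parses the whole series; a minimal standalone sketch of that loop (mirroring the logic used in `data_preprocess.py`):

```python
import pandas as pd

POSSIBLE_TIME_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
    "%m/%d/%Y %H:%M",
    "%d/%m/%Y %H:%M",
]

def parse_times(values):
    """Try each candidate format until one parses the whole series."""
    series = pd.Series(values)
    for time_format in POSSIBLE_TIME_FORMATS:
        try:
            return pd.to_datetime(series, format=time_format)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"No known format matches: {values[:3]}")

parsed = parse_times(["2022-01-01", "2022-01-02"])
print(parsed.dt.year.tolist())  # [2022, 2022]
```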

64 changes: 46 additions & 18 deletions hydromodel/datasets/data_preprocess.py
@@ -1,7 +1,7 @@
"""
Author: Wenyu Ouyang
Date: 2022-10-25 21:16:22
LastEditTime: 2024-03-25 14:50:32
LastEditTime: 2024-03-25 17:06:13
LastEditors: Wenyu Ouyang
Description: preprocess data for models in hydro-model-xaj
FilePath: \hydro-model-xaj\hydromodel\datasets\data_preprocess.py
@@ -39,32 +39,47 @@ def check_tsdata_format(file_path):
"""
# prcp means precipitation, pet means potential evapotranspiration, flow means streamflow
required_columns = [
TIME_NAME,
PRCP_NAME,
PET_NAME,
FLOW_NAME,
remove_unit_from_name(TIME_NAME),
remove_unit_from_name(PRCP_NAME),
remove_unit_from_name(PET_NAME),
remove_unit_from_name(FLOW_NAME),
]
# et means evapotranspiration, node_flow means upstream streamflow
# node1 means the first upstream node, node2 means the second upstream node, etc.
# these nodes are the nearest upstream nodes of the target node;
# node1_flow, node2_flow, and any further upstream nodes are parallel,
# with no serial relationship among them
optional_columns = [ET_NAME, NODE_FLOW_NAME]
optional_columns = [
remove_unit_from_name(ET_NAME),
remove_unit_from_name(NODE_FLOW_NAME),
]

try:
data = pd.read_csv(file_path)

# Check required columns
if any(column not in data.columns for column in required_columns):
print(f"Missing required columns in file: {file_path}")
data_column_names = [remove_unit_from_name(col) for col in data.columns]
missing_required_columns = [
column for column in required_columns if column not in data_column_names
]

if missing_required_columns:
print(
f"Missing required columns in file: {file_path}: {missing_required_columns}"
)
return False

# Check optional columns
for column in optional_columns:
if column not in data.columns:
for column in data.columns:
if (
remove_unit_from_name(column) not in required_columns
and remove_unit_from_name(column) not in optional_columns
):
print(
f"Column '{column}' is neither a required nor an optional column in file: {file_path}, but it's okay."
)

# Check node_flow columns (flexible number of nodes)
node_flow_columns = [
col for col in data.columns if re.match(r"node\d+_flow", col)
@@ -73,19 +88,26 @@ def check_tsdata_format(file_path):
print(f"No 'node_flow' columns found in file: {file_path}, but it's okay.")
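The flexible node-numbering check relies on a simple regular expression; as a standalone illustration with made-up column names:

```python
import re

# Hypothetical column names for illustration only.
columns = ["time", "prcp(mm/day)", "node1_flow(m^3/s)", "node12_flow(m^3/s)"]

# re.match anchors at the start, so any "node<digits>_flow..." column matches.
node_flow_columns = [col for col in columns if re.match(r"node\d+_flow", col)]
print(node_flow_columns)  # ['node1_flow(m^3/s)', 'node12_flow(m^3/s)']
```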

# Check time format and sorting
try:
data["time"] = pd.to_datetime(data["time"], format=TIME_FORMAT)
except ValueError:
time_parsed = False
for time_format in POSSIBLE_TIME_FORMATS:
try:
data[TIME_NAME] = pd.to_datetime(data[TIME_NAME], format=time_format)
time_parsed = True
break
except ValueError:
continue

if not time_parsed:
print(f"Time format is incorrect in file: {file_path}")
return False

if not data["time"].is_monotonic_increasing:
if not data[TIME_NAME].is_monotonic_increasing:
print(f"Data is not sorted by time in file: {file_path}")
return False

# Check for consistent time intervals
time_differences = (
data["time"].diff().dropna()
data[TIME_NAME].diff().dropna()
) # Calculate differences and remove NaN
if not all(time_differences == time_differences.iloc[0]):
print(f"Time series is not at consistent intervals in file: {file_path}")
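The interval-consistency test works by differencing the parsed timestamps and comparing every gap with the first one; a standalone sketch of the same check:

```python
import pandas as pd

def has_consistent_intervals(times):
    """Return True if consecutive timestamps are evenly spaced."""
    t = pd.to_datetime(pd.Series(times))
    diffs = t.diff().dropna()  # first difference is NaT and is dropped
    if len(diffs) == 0:
        return True  # zero or one timestamp is trivially consistent
    return bool((diffs == diffs.iloc[0]).all())

print(has_consistent_intervals(["2022-01-01", "2022-01-02", "2022-01-03"]))  # True
print(has_consistent_intervals(["2022-01-01", "2022-01-02", "2022-01-04"]))  # False
```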
@@ -155,7 +177,9 @@ def check_folder_contents(folder_path, basin_attr_file="basin_attributes.csv"):
return False

# Get the list of basin IDs
basin_ids = pd.read_csv(os.path.join(folder_path, basin_attr_file))["id"].tolist()
basin_ids = pd.read_csv(
os.path.join(folder_path, basin_attr_file), dtype={ID_NAME: str}
)[ID_NAME].tolist()
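The `dtype={ID_NAME: str}` argument matters because a basin ID with leading zeros would otherwise be parsed as an integer and lose them; a quick illustration (the ID value here is made up):

```python
import io

import pandas as pd

csv_text = "id,name,area(km^2)\n01013500,Basin A,100\n"

# Without dtype, pandas infers an integer column and drops the leading zero.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["id"].tolist())  # [1013500]

# Forcing str preserves the ID exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})
print(as_str["id"].tolist())  # ['01013500']
```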

# Check the time series file for each basin
for basin_id in basin_ids:
@@ -222,8 +246,12 @@ def process_and_save_data_as_nc(
file_name = f"basin_{basin_id}.csv"
file_path = os.path.join(folder_path, file_name)
data = pd.read_csv(file_path)
data[TIME_NAME] = pd.to_datetime(data[TIME_NAME])

for time_format in POSSIBLE_TIME_FORMATS:
try:
data[TIME_NAME] = pd.to_datetime(data[TIME_NAME], format=time_format)
break
except ValueError:
continue
# Build the unit dictionary when processing the first basin
if i == 0:
for col in data.columns:
27 changes: 0 additions & 27 deletions scripts/check_data_format.py

This file was deleted.

41 changes: 41 additions & 0 deletions scripts/prepare_data.py
@@ -0,0 +1,41 @@
"""
Author: Wenyu Ouyang
Date: 2024-03-25 09:21:56
LastEditTime: 2024-03-25 17:08:08
LastEditors: Wenyu Ouyang
Description: Script for preparing data
FilePath: \hydro-model-xaj\scripts\prepare_data.py
Copyright (c) 2023-2024 Wenyu Ouyang. All rights reserved.
"""

from pathlib import Path
import sys
import os
import argparse

current_script_path = Path(os.path.realpath(__file__))
repo_root_dir = current_script_path.parent.parent
sys.path.append(str(repo_root_dir))
from hydromodel.datasets.data_preprocess import process_and_save_data_as_nc


def main(args):
data_path = args.origin_data_dir

if process_and_save_data_as_nc(data_path, save_folder=data_path):
print("Data is ready!")
else:
print("Data format is incorrect! Please check the data.")


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Prepare data.")
parser.add_argument(
"--origin_data_dir",
type=str,
help="Path to your hydrological data",
default="C:\\Users\\wenyu\\Downloads\\biliuhe",
)

args = parser.parse_args()
main(args)
24 changes: 0 additions & 24 deletions test/test_data.py

This file was deleted.

9 changes: 9 additions & 0 deletions test/test_data_postprocess.py
@@ -0,0 +1,9 @@
"""
Author: Wenyu Ouyang
Date: 2022-10-25 21:16:22
LastEditTime: 2024-03-25 14:59:43
LastEditors: Wenyu Ouyang
Description: Test for data preprocess
FilePath: \hydro-model-xaj\test\test_data_postprocess.py
Copyright (c) 2021-2022 Wenyu Ouyang. All rights reserved.
"""