Commit: use a new case to try prepare_data script

OuyangWenyu committed Mar 25, 2024
1 parent d18c904 commit 0db0480
Showing 8 changed files with 178 additions and 116 deletions.
95 changes: 51 additions & 44 deletions README.md
@@ -12,78 +12,85 @@

**Hydromodel is a python implementation for common hydrological models such as the XinAnJiang (XAJ) model, which is one of the most famous conceptual hydrological models, especially in Southern China.**

An additional feature of hydro-model-xaj is that it provides a differentiable version of XAJ, which means it can be nested in deep-learning algorithms. More information can be found in the "What are the main features" section below.

**Not an official version, just for learning**

## How to run

### Environment
This is a Python console program (no graphical interface yet), and it is **still under development**.

Hydro-model-xaj is a Python console program (no graphic interface now). It is **still developing**, and we have not
provided a pip or conda package for hydro-model-xaj yet, so please set up a Python environment for the code.
## How to run

If you are new to python, please [install miniconda or anaconda on your computer and config the environment](https://conda.io/projects/conda/en/stable/user-guide/install/index.html).
### Install

Since you found hydro-model-xaj on GitHub, you probably already know at least a little about Git and GitHub. Please install Git on your computer and register your own GitHub account.
We provide a pip package, which you can install with pip:

Then, fork hydro-model-xaj to your own GitHub account and clone it to your computer. If you have forked it before, please update it from [upstream](https://github.com/OuyangWenyu/hydro-model-xaj), as our previous version has some errors. Open your terminal and input:

```Shell
# clone hydro-model-xaj, if you have cloned it, ignore this step
$ git clone <address of hydro-model-xaj in your github>
# move to it
$ cd hydro-model-xaj
# if updating from upstream, pull the new version to local
$ git pull
# create python environment
$ conda env create -f environment.yml
# if conda is very slow, mamba can be an alternative:
# $ conda install -c conda-forge mamba
# $ mamba env create -f environment.yml
# activate it
$ conda activate xaj
$ conda create -n hydromodel python=3.10
$ conda activate hydromodel
# install hydromodel
$ pip install hydromodel
```

If you want to run the notebooks, please install a Jupyter kernel in your JupyterLab:

```Shell
$ python -m ipykernel install --user --name xaj --display-name "xaj"
# if you don't have JupyterLab on your PC, please install it first
# $ conda install -c conda-forge jupyterlab
$ conda activate hydromodel
$ conda install -c conda-forge ipykernel
$ python -m ipykernel install --user --name hydromodel --display-name "hydromodel"
```

### Prepare data

To use your own data to run the model, you can prepare the data in the required format:
You can use the CAMELS dataset (see [here](https://github.com/OuyangWenyu/hydrodataset) for how to prepare it) to run the model.

For one basin (we only support one basin now), the data is put in one csv/txt file.
There are four necessary columns: "time", "prcp", "pet", and "flow". "time" is the time series, "prcp" is the precipitation, "pet" is the potential evapotranspiration, and "flow" is the observed streamflow.
The time series should be continuous (NaN values are allowed), and the time step should be the same for all columns. The time format should be "YYYY-MM-DD HH:MM:SS". The data should be sorted by time.
To use your own data to run the model, you need to prepare the data in the required format.

You can run a checker function to see if the data is in the right format:
We provide some transformation functions in the "scripts" directory. You can use them to transform your data to the required format.

```Shell
$ cd hydromodel/scripts
$ python check_data_format.py --data_file <absolute path of the data file>
But you still need to do some manual work before transformation. Here are the steps:

1. Put all data in one directory and check that it is organized in the following format:
```
your_data_directory_for_hydromodel/
# one attribute csv file for all basins
├─ basin_attributes.csv
# one timeseries csv file for one basin, xxx and yyy are the basin ids
├─ basin_xxx.csv
├─ basin_yyy.csv
├─ basin_zzz.csv
├─ ...
```
basin_attributes.csv should have the following columns:
```csv
id,name,area(km^2)
xxx,Basin A,100
yyy,Basin B,200
zzz,Basin C,300
```
basin_xxx.csv should have the following columns:
```csv
time,pet(mm/day),prcp(mm/day),flow(m^3/s),et(mm/day),node1_flow(m^3/s)
2022-01-01 00:00:00,1,10,13,16,19
2022-01-02 00:00:00,2,11,14,17,20
2022-01-03 00:00:00,3,12,15,18,21
```
The order of the columns is not important, but the column names must be the same as above.
No other columns are allowed.
For the time series csv files, "et" and "node1_flow" are optional; if you don't have them, you can omit them.
The units of the variables may differ from those shown, but a unit must always be given in parentheses `()` in each column name.
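As an illustration of the unit-in-parentheses convention, a small parser can split each column name into a variable name and a unit. This is just a sketch: the helper name `split_name_and_unit` is ours, and the package's own `remove_unit_from_name` may be implemented differently.

```python
import re

def split_name_and_unit(column):
    """Split a column name like 'prcp(mm/day)' into ('prcp', 'mm/day').

    Columns without a unit, such as 'time', return (name, None).
    """
    match = re.fullmatch(r"([^()]+)\((.+)\)", column)
    if match:
        return match.group(1), match.group(2)
    return column, None

print(split_name_and_unit("prcp(mm/day)"))   # -> ('prcp', 'mm/day')
print(split_name_and_unit("flow(m^3/s)"))    # -> ('flow', 'm^3/s')
print(split_name_and_unit("time"))           # -> ('time', None)
```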

Then, you can use the data_preprocess module to transform the data to the required format:
2. Download [prepare_data.py](https://github.com/OuyangWenyu/hydro-model-xaj/tree/master/scripts) and run the following command to transform your data into the required format:
```Shell
$ python prepare_data.py --origin_data_dir <your_data_directory_for_hydromodel>
```

3. If the format is wrong, please repeat step 1 carefully. If the format is right, you can run the following command to preprocess the data, for example to split it for cross-validation:
```Shell
$ python datapreprocess4calibrate.py --data <name of the data file> --exp <name of the directory of the prepared data>
```

The data will be transformed by the data interface; here are the conventions:

- All input data for models are three-dimensional NumPy arrays: [time, basin, variable], which means "time" series data
for "variables" in "basins"
- Data files should be .npy files accompanied by a JSON file that describes the data. We provide sample code in
"test/test_data.py" to show how to process a .csv/.txt file into the required format.
- To run the model, the dataset should be split into two parts: the training dataset (used for calibration) and the testing dataset (used for evaluation). In the xxx directory, there must be four files: "basins_lump_p_pe_q_foldx_train.npy", "data_info_foldx_train.json", "basins_lump_p_pe_q_foldx_test.npy", and "data_info_foldx_test.json". (The file names cannot be changed; x is 0 if there is only one fold.)
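The npy-plus-JSON convention above can be sketched as follows. Only the file-name pattern comes from the text; the variable list and the JSON keys here are illustrative assumptions, not necessarily the package's actual schema.

```python
import json

import numpy as np

# Illustrative shapes: 10 time steps, 1 basin, 3 variables (prcp, pet, flow).
data = np.random.rand(10, 1, 3)  # [time, basin, variable]

# Save the training split of fold 0 under the required file name.
np.save("basins_lump_p_pe_q_fold0_train.npy", data)

# The accompanying JSON describes the axes; these keys are our assumption.
info = {
    "time": [f"2022-01-{d:02d}" for d in range(1, 11)],
    "basin": ["xxx"],
    "variable": ["prcp", "pet", "flow"],
}
with open("data_info_fold0_train.json", "w") as f:
    json.dump(info, f)

loaded = np.load("basins_lump_p_pe_q_fold0_train.npy")
print(loaded.shape)  # (10, 1, 3)
```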

To run models in hydro-model-xaj, one needs to prepare data in the required format.

We have provided sample data in the "example/example" directory. You can run the model with this data.

### Run the model

Run the following code:
9 changes: 8 additions & 1 deletion hydromodel/datasets/__init__.py
@@ -5,7 +5,14 @@
NODE_FLOW_NAME = "node1_flow(m^3/s)"
AREA_NAME = "area(km^2)"
TIME_NAME = "time"
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
POSSIBLE_TIME_FORMATS = [
"%Y-%m-%d %H:%M:%S",  # full date and time
"%Y-%m-%d",  # date only
"%d/%m/%Y",  # alternative date format
"%m/%d/%Y %H:%M",  # month/day/year hour:minute
"%d/%m/%Y %H:%M",  # day/month/year hour:minute
# ... more formats can be added as needed ...
]
ID_NAME = "id"
NAME_NAME = "name"
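This list is meant to be tried in order until one format parses the whole series; a minimal standalone sketch of that loop (mirroring the logic used in `data_preprocess.py`):

```python
import pandas as pd

POSSIBLE_TIME_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
    "%m/%d/%Y %H:%M",
    "%d/%m/%Y %H:%M",
]

def parse_times(values):
    """Try each candidate format until one parses the whole series."""
    series = pd.Series(values)
    for time_format in POSSIBLE_TIME_FORMATS:
        try:
            return pd.to_datetime(series, format=time_format)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"No known format matches: {values[:3]}")

parsed = parse_times(["2022-01-01", "2022-01-02"])
print(parsed.dt.year.tolist())  # [2022, 2022]
```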

64 changes: 46 additions & 18 deletions hydromodel/datasets/data_preprocess.py
@@ -1,7 +1,7 @@
"""
Author: Wenyu Ouyang
Date: 2022-10-25 21:16:22
LastEditTime: 2024-03-25 14:50:32
LastEditTime: 2024-03-25 17:06:13
LastEditors: Wenyu Ouyang
Description: preprocess data for models in hydro-model-xaj
FilePath: \hydro-model-xaj\hydromodel\datasets\data_preprocess.py
@@ -39,32 +39,47 @@ def check_tsdata_format(file_path):
"""
# prcp means precipitation, pet means potential evapotranspiration, flow means streamflow
required_columns = [
TIME_NAME,
PRCP_NAME,
PET_NAME,
FLOW_NAME,
remove_unit_from_name(TIME_NAME),
remove_unit_from_name(PRCP_NAME),
remove_unit_from_name(PET_NAME),
remove_unit_from_name(FLOW_NAME),
]
# et means evapotranspiration, node_flow means upstream streamflow
# node1 means the first upstream node, node2 means the second upstream node, etc.
# these nodes are the nearest upstream nodes of the target node;
# node1_flow, node2_flow, and any further upstream nodes are parallel,
# with no serial relationship among them
optional_columns = [ET_NAME, NODE_FLOW_NAME]
optional_columns = [
remove_unit_from_name(ET_NAME),
remove_unit_from_name(NODE_FLOW_NAME),
]

try:
data = pd.read_csv(file_path)

# Check required columns
if any(column not in data.columns for column in required_columns):
print(f"Missing required columns in file: {file_path}")
data_column_names = [remove_unit_from_name(col) for col in data.columns]
missing_required_columns = [
column for column in required_columns if column not in data_column_names
]

if missing_required_columns:
print(
f"Missing required columns in file: {file_path}: {missing_required_columns}"
)
return False

# Check optional columns
for column in optional_columns:
if column not in data.columns:
for column in data.columns:
if (
remove_unit_from_name(column) not in required_columns
and remove_unit_from_name(column) not in optional_columns
):
print(
f"Column '{column}' is neither a required nor an optional column in file: {file_path}, but it's okay."
)

# Check node_flow columns (flexible number of nodes)
node_flow_columns = [
col for col in data.columns if re.match(r"node\d+_flow", col)
@@ -73,19 +88,26 @@ def check_tsdata_format(file_path):
print(f"No 'node_flow' columns found in file: {file_path}, but it's okay.")
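The flexible node-numbering check relies on a simple regular expression; as a standalone illustration with made-up column names:

```python
import re

# Hypothetical column names for illustration only.
columns = ["time", "prcp(mm/day)", "node1_flow(m^3/s)", "node12_flow(m^3/s)"]

# re.match anchors at the start, so any "node<digits>_flow..." column matches.
node_flow_columns = [col for col in columns if re.match(r"node\d+_flow", col)]
print(node_flow_columns)  # ['node1_flow(m^3/s)', 'node12_flow(m^3/s)']
```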

# Check time format and sorting
try:
data["time"] = pd.to_datetime(data["time"], format=TIME_FORMAT)
except ValueError:
time_parsed = False
for time_format in POSSIBLE_TIME_FORMATS:
try:
data[TIME_NAME] = pd.to_datetime(data[TIME_NAME], format=time_format)
time_parsed = True
break
except ValueError:
continue

if not time_parsed:
print(f"Time format is incorrect in file: {file_path}")
return False

if not data["time"].is_monotonic_increasing:
if not data[TIME_NAME].is_monotonic_increasing:
print(f"Data is not sorted by time in file: {file_path}")
return False

# Check for consistent time intervals
time_differences = (
data["time"].diff().dropna()
data[TIME_NAME].diff().dropna()
) # Calculate differences and remove NaN
if not all(time_differences == time_differences.iloc[0]):
print(f"Time series is not at consistent intervals in file: {file_path}")
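The interval-consistency test works by differencing the parsed timestamps and comparing every gap with the first one; a standalone sketch of the same check:

```python
import pandas as pd

def has_consistent_intervals(times):
    """Return True if consecutive timestamps are evenly spaced."""
    t = pd.to_datetime(pd.Series(times))
    diffs = t.diff().dropna()  # first difference is NaT and is dropped
    if len(diffs) == 0:
        return True  # zero or one timestamp is trivially consistent
    return bool((diffs == diffs.iloc[0]).all())

print(has_consistent_intervals(["2022-01-01", "2022-01-02", "2022-01-03"]))  # True
print(has_consistent_intervals(["2022-01-01", "2022-01-02", "2022-01-04"]))  # False
```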
@@ -155,7 +177,9 @@ def check_folder_contents(folder_path, basin_attr_file="basin_attributes.csv"):
return False

# Get the list of basin IDs
basin_ids = pd.read_csv(os.path.join(folder_path, basin_attr_file))["id"].tolist()
basin_ids = pd.read_csv(
os.path.join(folder_path, basin_attr_file), dtype={ID_NAME: str}
)[ID_NAME].tolist()
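The `dtype={ID_NAME: str}` argument matters because a basin ID with leading zeros would otherwise be parsed as an integer and lose them; a quick illustration (the ID value here is made up):

```python
import io

import pandas as pd

csv_text = "id,name,area(km^2)\n01013500,Basin A,100\n"

# Without dtype, pandas infers an integer column and drops the leading zero.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["id"].tolist())  # [1013500]

# Forcing str preserves the ID exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})
print(as_str["id"].tolist())  # ['01013500']
```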

# Check the time series file for each basin
for basin_id in basin_ids:
@@ -222,8 +246,12 @@ def process_and_save_data_as_nc(
file_name = f"basin_{basin_id}.csv"
file_path = os.path.join(folder_path, file_name)
data = pd.read_csv(file_path)
data[TIME_NAME] = pd.to_datetime(data[TIME_NAME])

for time_format in POSSIBLE_TIME_FORMATS:
try:
data[TIME_NAME] = pd.to_datetime(data[TIME_NAME], format=time_format)
break
except ValueError:
continue
# Build the unit dictionary when processing the first basin
if i == 0:
for col in data.columns:
27 changes: 0 additions & 27 deletions scripts/check_data_format.py

This file was deleted.

41 changes: 41 additions & 0 deletions scripts/prepare_data.py
@@ -0,0 +1,41 @@
"""
Author: Wenyu Ouyang
Date: 2024-03-25 09:21:56
LastEditTime: 2024-03-25 17:08:08
LastEditors: Wenyu Ouyang
Description: Script for preparing data
FilePath: \hydro-model-xaj\scripts\prepare_data.py
Copyright (c) 2023-2024 Wenyu Ouyang. All rights reserved.
"""

from pathlib import Path
import sys
import os
import argparse

current_script_path = Path(os.path.realpath(__file__))
repo_root_dir = current_script_path.parent.parent
sys.path.append(str(repo_root_dir))
from hydromodel.datasets.data_preprocess import process_and_save_data_as_nc


def main(args):
data_path = args.origin_data_dir

if process_and_save_data_as_nc(data_path, save_folder=data_path):
print("Data is ready!")
else:
print("Data format is incorrect! Please check the data.")


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Prepare data.")
parser.add_argument(
"--origin_data_dir",
type=str,
help="Path to your hydrological data",
default="C:\\Users\\wenyu\\Downloads\\biliuhe",
)

args = parser.parse_args()
main(args)
24 changes: 0 additions & 24 deletions test/test_data.py

This file was deleted.

9 changes: 9 additions & 0 deletions test/test_data_postprocess.py
@@ -0,0 +1,9 @@
"""
Author: Wenyu Ouyang
Date: 2022-10-25 21:16:22
LastEditTime: 2024-03-25 14:59:43
LastEditors: Wenyu Ouyang
Description: Test for data preprocess
FilePath: \hydro-model-xaj\test\test_data_postprocess.py
Copyright (c) 2021-2022 Wenyu Ouyang. All rights reserved.
"""