diff --git a/book/tutorials/decision_trees/01.script/00.tutorial_post_processing_xgboost_all.ipynb b/book/tutorials/decision_trees/01.script/00.tutorial_post_processing_xgboost_all.ipynb deleted file mode 100644 index 6d119fe..0000000 --- a/book/tutorials/decision_trees/01.script/00.tutorial_post_processing_xgboost_all.ipynb +++ /dev/null @@ -1,1310 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "# Machine Learning for Post-Processing NWM Data \n", - "**Authors: Savalan Naser Neisary (PhD Student, CIROH & The University of Alabama)**\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "### 1. Introduction \n", - "#### 1.1. Overview of the Workshop’s Goals and Structure.\n", - "\n", - "This is going to be 60 minutes workshop in which we will:\n", - "- Understand the basics of machine learning and decision-tree algorithms.\n", - "- Learn how to apply and train an XGBoost model for hydrological modeling.\n", - "- Learn how to implement feature selection using the XGBoost algorithm.\n", - "\n", - "We will first review the theoretical background behind decision trees and the pros and cons of the most powerful decision-tree algorithms. Then, we will start the hands-on part of the workshop on setting up our environments and getting codes and data from GitHub repositories. Next, we plan to get the data preprocessed and start model development using the XGBoost algorithm. After that, we will discuss the feature selection and hyperparameter tuning (i.e., manually and automatically). Finally, we will evaluate the performance of XGBoost in different stations. \n", - "#### 1.2. Post-processing Hydrological Predictions\n", - "\n", - "Effective and sustainable management of water resources is crucial to provide adequate water supply for human societies, regardless of their geographical location. Having an accurate and precise prediction of future hydrological variables, including streamflow is a critical component for an effective water systems management, and various studies presented different methods, such as post-processing to increase the accuracy of the hydrological predictions. Post-processing methods seek to quantify the uncertainties of hydrological model outcomes and correct their biases by using a statistical model to transform model outputs based the relationship(s) between observations and model. According to the literature Machine Learning (ML) models proved to be useful in post-processing the results of other ML or physical-based hydrological models. Therefore, in this workshop we will use decision-tree algorithms, an ensemble subgroup of ML models, to post-process streamflow outputs of a physical-baed model. \n", - "#### 1.3. Post-processing Retrospective National Water Model (NWM) Streamflow Data\n", - "\n", - "NOAA introduced the NWM to address the need for an operational large-scale hydrological forecasting model to provide streamflow predictions in CONUS. While it has the capability of predicting streamflow in 2.7 billion water reaches, according to the literature, NWM has a low accuracy in regions west of the 95th meridian with drought and low-flow problem and in controlled basins with extensive water infrastructure. This low performance in western US watersheds is due to the lack of water operation consideration and a comprehensive groundwater and snow model beside calibrating NWM mostly with watersheds in eastern US. 
To compensate for NWM shortcomings in this workshop we will demonstrate how we can use decision-trees to increase its accuracy by post-processing the NWM outputs and adding the human activity impact to it. \n", - "\n", - "###### Recommended Resources:\n", - "- Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow.\n", - "- C4.5: Programs for Machine Learning.\n", - "\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. Theoretical Background \n", - "\n", - "#### 2.1. Decision-Trees Algorithm\n", - "\n", - "A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.Decision trees are recursively constructed multidimensional histograms. Decision tree learning employs a divide and conquer strategy by conducting a greedy search to identify the optimal split points within a tree. This process of splitting is then repeated in a top-down, recursive manner until all, or the majority of records have been classified under specific class labels. Hunt’s algorithm, which was developed in the 1960s to model human learning in Psychology, forms the foundation of many popular decision tree algorithms, such as ID3, C4.5, and CART. \n", - "\n", - "PIC\n", - "\n", - "\n", - "Advantages\n", - "\n", - "- Easy to interpret: The Boolean logic and visual representations of decision trees make them easier to understand and consume. The hierarchical nature of a decision tree also makes it easy to see which attributes are most important, which isn’t always clear with other algorithms, like neural networks.\n", - "- Little to no data preparation required: Decision trees have a number of characteristics, which make it more flexible than other classifiers. It can handle various data types—i.e. discrete or continuous values, and continuous values can be converted into categorical values through the use of thresholds. Additionally, it can also handle values with missing values, which can be problematic for other classifiers, like Naïve Bayes. \n", - "- More flexible: Decision trees can be leveraged for both classification and regression tasks, making it more flexible than some other algorithms. It’s also insensitive to underlying relationships between attributes; this means that if two variables are highly correlated, the algorithm will only choose one of the features to split on. \n", - "\n", - "Disadvantages\n", - "\n", - "- Prone to overfitting: Complex decision trees tend to overfit and do not generalize well to new data. This scenario can be avoided through the processes of pre-pruning or post-pruning. Pre-pruning halts tree growth when there is insufficient data while post-pruning removes subtrees with inadequate data after tree construction. \n", - "- High variance estimators: Small variations within data can produce a very different decision tree. Bagging, or the averaging of estimates, can be a method of reducing variance of decision trees. However, this approach is limited as it can lead to highly correlated predictors. \n", - "- More costly: Given that decision trees take a greedy search approach during construction, they can be more expensive to train compared to other algorithms. \n", - "\n", - "#### 2.2. Random Forest (RF) Algorithm\n", - "\n", - "RF is a widely used machine learning algorithm developed by Leo Breiman and Adele Cutler. 
RF is based on decision-trees, but it is based on the *Wisdom of the Crowd*, which means it aggregate the results of a group of DTs. Using the results of more than one models is called *ensemble*, so we can say that RF is an ensemble algorithm since it aggregates the results of several number of DTs. Using an ensemble of decision trees can largely reduce the overfitting and prediction variance, providing more accurate results. RF is an extension of the bagging approach, which generates a random subset of both samples and features for each model training. While a DT is based on all features to make decisions, the RF algorithm only uses a subset of features, which can reduce the influence of highly correlated features in model prediction.\n", - "\n", - "Advantages\n", - "\n", - "- Reduced risk of overfitting: Decision trees run the risk of overfitting as they tend to tightly fit all the samples within training data. However, when there’s a robust number of decision trees in a random forest, the classifier won’t overfit the model since the averaging of uncorrelated trees lowers the overall variance and prediction error.\n", - "- Provides flexibility: Since random forest can handle both regression and classification tasks with a high degree of accuracy, it is a popular method among data scientists. Feature bagging also makes the random forest classifier an effective tool for estimating missing values as it maintains accuracy when a portion of the data is missing.\n", - "- Easy to determine feature importance: Random forest makes it easy to evaluate variable importance, or contribution, to the model. There are a few ways to evaluate feature importance. Gini importance and mean decrease in impurity (MDI) are usually used to measure how much the model’s accuracy decreases when a given variable is excluded. However, permutation importance, also known as mean decrease accuracy (MDA), is another importance measure. MDA identifies the average decrease in accuracy by randomly permutating the feature values in oob samples.\n", - "\n", - "Disadvantages\n", - "\n", - "- Time-consuming process: Since random forest algorithms can handle large data sets, they can provide more accurate predictions, but can be slow to process data as they are computing data for each individual decision tree.\n", - "- Requires more resources: Since random forests process larger data sets, they’ll require more resources to store that data.\n", - "- More complex: The prediction of a single decision tree is easier to interpret when compared to a forest of them.\n", - "\n", - "#### 2.3. Extreme Gradient Boosting (XGBoost) Algorithm\n", - "XGBoost is one of the algorithms based on Boosting ensemble method, and the idea behind it is to train the predictors sequentially, each trying to correct its predecessor. XGBoost method tries to fit the new predictor to the residual errors made by the previous predictor. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. XGBoost gained significant favor in the last few years as a result of helping individuals and teams win virtually every Kaggle structured data competition. 
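The sequential, residual-fitting idea behind boosting can be made concrete with a small sketch. This is not XGBoost itself (which adds regularization and second-order gradient information on top of this loop); it is a minimal illustration using shallow scikit-learn trees, with the learning rate playing the role of XGBoost's *eta* and the number of rounds playing the role of *n_estimators*:

```python
# Minimal gradient-boosting sketch for regression: each new tree is fit to the
# residuals of the current ensemble, and its shrunken prediction is added on.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(seed=1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1            # plays the role of eta
prediction = np.zeros_like(y)
trees = []
for _ in range(100):           # plays the role of n_estimators
    residuals = y - prediction                          # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)       # correct the predecessor
    trees.append(tree)

print('training RMSE:', np.sqrt(np.mean((y - prediction) ** 2)))
```

Each round nudges the ensemble toward the remaining error, which is the behavior the advantages and disadvantages listed next refer to.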
\n", - "\n", - "Advantages\n", - "\n", - "- Gradient Boosting comes with an easy to read and interpret algorithm, making most of its predictions easy to handle.\n", - "- Boosting is a resilient and robust method that prevents and cubs over-fitting quite easily\n", - "- XGBoost performs very well on medium, small, data with subgroups and structured datasets with not too many features. \n", - "- It is a great approach to go for because the large majority of real-world problems involve classification and regression, two tasks where XGBoost is the reigning king. \n", - "\n", - "Disadvantages \n", - "\n", - "- XGBoost does not perform so well on sparse and unstructured data.\n", - "- A common thing often forgotten is that Gradient Boosting is very sensitive to outliers since every classifier is forced to fix the errors in the predecessor learners. \n", - "- The overall method is hardly scalable. This is because the estimators base their correctness on previous predictors, hence the procedure involves a lot of struggle to streamline. \n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "## 3. Setting Up the Codes and Notebook\n", - "\n", - "#### 3.1. Access the GitHub Codes\n", - "First we download the code files from the GitHub. In the terminal you should use the following command:" - ] - }, - { - "cell_type": "raw", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "git clone https://github.com/savalann/hydromachine-tutorials" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 3.2. Import the Python Libraries\n", - "First we will install two libraries, then we will import the libraries that we need. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install hydroeval xgboost" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# system packages\n", - "from datetime import datetime, date, timedelta\n", - "import pickle\n", - "import warnings\n", - "warnings.filterwarnings(\"ignore\")\n", - "import platform\n", - "import time\n", - "from tqdm import tqdm\n", - "import os\n", - "import boto3\n", - "from botocore.client import Config\n", - "from botocore import UNSIGNED\n", - "\n", - "# basic packages\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - "import matplotlib.dates as mdates\n", - "from matplotlib.patches import Patch\n", - "import math\n", - "from evaluation_table import EvalTable\n", - "\n", - "# model packages\n", - "import xgboost as xgb\n", - "from sklearn.model_selection import GridSearchCV, train_test_split, RepeatedKFold, cross_val_score\n", - "from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer\n", - "from sklearn.preprocessing import MinMaxScaler\n", - "import joblib\n", - "from shapely.geometry import Point\n", - "import geopandas as gpd\n", - "import pyproj\n", - "\n", - "# Identify the path\n", - "home = os.getcwd()\n", - "parent_path = os.path.dirname(home)\n", - "input_path = f'{parent_path}/02.input/'\n", - "output_path = f'{parent_path}/03.output/'\n", - "main_path = home" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "## 4. 
Data Preprocessing\n", - "#### 4.1. Overview of the USGS Stream Station\n", - "- The dataset that we will use provides the data for seven GSL watershed stations. \n", - "- The dataset contains climate variables, such as precipitation and temperature, water infrastructure, storage percentage, and watershed characteristics, such as average area and elevation. \n", - "You can see the location of the station and its watershed in the Figure below. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "source": [ - "#### 4.2. Load Dataset\n", - "- Using the boto3 library we get the input dataset from the CIROH S3 bucket." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# List of station IDs that are of interest.\n", - "stations = ['10126000', '10130500', '10134500', '10136500', '10137500', '10141000', '10155000', '10164500', '10171000']\n", - "\n", - "# Read a CSV file into a DataFrame and set the first column as the index.\n", - "df = df = pd.read_parquet(f'{input_path}final_input.parquet')\n", - "\n", - "# Convert the station_id column to string data type.\n", - "df.station_id = df.station_id.astype(str)\n", - "\n", - "# Convert the 'datetime' column to datetime objects.\n", - "df.datetime = pd.to_datetime(df.datetime)\n", - "\n", - "# Filter the DataFrame to include only the rows where 'station_id' is in the 'stations' list.\n", - "df_modified = df[df['station_id'].isin(stations)]\n", - "\n", - "# Select specific columns to create a new DataFrame.\n", - "dataset = df_modified[['station_id', 'datetime', 'Lat', 'Long', 'Drainage_area_mi2', 'Mean_Basin_Elev_ft',\n", - " 'Perc_Forest', 'Perc_Develop', 'Perc_Imperv', 'Perc_Herbace',\n", - " 'Perc_Slop_30', 'Mean_Ann_Precip_in', 's1',\n", - " 's2', 'storage', 'swe', 'NWM_flow', 'DOY','tempe(F)', 'precip(mm)', 'flow_cfs']]\n", - "\n", - "# Extract a list of unique station IDs from the modified dataset.\n", - "station_list = dataset.station_id.unique()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 4.3. Visualizing the Data\n", - "- The takeaway from data visualization is to gather information about data distribution, outliers, missing values, correlation between different variables, and time dependencies between variables.\n", - "- Here, we will use boxplots, histograms, and combo bar and line plots to show outliers, distribution, and time dependencies in streamflow, precipitation, temperature, and SWE." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First, we will plot the time dependencies using bar and line plots for streamflow vs SWE. 
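Before the combo plots, it can help to tabulate the same points listed above (record span, missing values, basic distribution) for each station. A minimal sketch, assuming the `dataset` DataFrame and `station_list` created in Section 4.2:

```python
# Per-station summary: record span, missing streamflow values, and basic statistics.
# Assumes `dataset` and `station_list` exist as built in Section 4.2.
summary = []
for station_name in station_list:
    temp_df = dataset[dataset.station_id == station_name]
    summary.append({
        'station_id': station_name,
        'start': temp_df.datetime.min().date(),
        'end': temp_df.datetime.max().date(),
        'n_rows': len(temp_df),
        'missing_flow': int(temp_df['flow_cfs'].isna().sum()),
        'mean_flow_cfs': round(temp_df['flow_cfs'].mean(), 1),
        'max_flow_cfs': round(temp_df['flow_cfs'].max(), 1),
    })
print(pd.DataFrame(summary).to_string(index=False))
```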
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "\n", - "figsize = (9, 6) # Set the figure size for the plot.\n", - "fig, ax = plt.subplots(figsize=figsize)\n", - "\n", - "\n", - "# Extract data for the current station.\n", - "temp_df_1 = dataset[dataset.station_id == station_list[0]]\n", - "# Set 'datetime' as the index for the DataFrame for plotting.\n", - "temp_df_2 = temp_df_1.set_index('datetime')\n", - "# Plot 'flow_cfs' on the primary y-axis.\n", - "ax.plot(temp_df_2.index, temp_df_2['flow_cfs'])\n", - "# Set x-axis limits from the minimum to maximum year of data.\n", - "start_year = pd.to_datetime(f'{temp_df_1.datetime.dt.year.min()}-01-01')\n", - "end_year = pd.to_datetime(f'{temp_df_1.datetime.dt.year.max()}-12-31')\n", - "ax.set_xlim(start_year, end_year)\n", - "# Get current x-tick labels and set their rotation for better visibility.\n", - "labels = ax.get_xticklabels()\n", - "ax.set_xticklabels(labels, rotation=45)\n", - "\n", - "# Create a secondary y-axis for Snow Water Equivalent (SWE).\n", - "ax2 = ax.twinx()\n", - "# Plot SWE as a bar graph on the secondary y-axis.\n", - "ax2.bar(temp_df_2.index, temp_df_2['swe'], label='Inverted', color='red')\n", - "# Set the y-axis limits for SWE, flipping the axis to make bars grow downward.\n", - "ax2.set_ylim(max(temp_df_2['swe']) + 40, 0)\n", - "# Set label for the secondary y-axis.\n", - "ax2.set_ylabel('SWE')\n", - "# Define custom ticks for the secondary y-axis.\n", - "ax2.set_yticks(np.arange(0, max(temp_df_2['swe']), 5))\n", - "\n", - "# Set the title of the subplot to the station ID.\n", - "ax.set_title(f'{station_list[0]}')\n", - "# Set the x-axis label for subplots in the last row.\n", - "\n", - "ax.set_xlabel('Datetime (day)')\n", - "\n", - "# Set the y-axis label for subplots in the first column.\n", - "\n", - "ax.set_ylabel('Streamflow (cfs)')\n", - "\n", - "\n", - "\n", - "# Adjust the layout to prevent overlapping elements.\n", - "plt.tight_layout()\n", - "# Uncomment the line below to save the figure to a file.\n", - "# plt.savefig(f'{save_path}scatter_annual_drought_number.png')\n", - "# Display the plot.\n", - "plt.show()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The next plot shows precipitation vs streamflow. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "\n", - "figsize = (9, 6) # Set the figure size for the plot.\n", - "fig, ax = plt.subplots(figsize=figsize)\n", - "\n", - "\n", - "# Extract the data for the current station from the dataset.\n", - "temp_df_1 = dataset[dataset.station_id == station_list[0]]\n", - "# Set 'datetime' as the index for plotting.\n", - "temp_df_2 = temp_df_1.set_index('datetime')\n", - "# Plot the 'flow_cfs' data on the primary y-axis.\n", - "ax.plot(temp_df_2.index, temp_df_2['flow_cfs'])\n", - "# Set the x-axis limits from the first to the last year of data.\n", - "start_year = pd.to_datetime(f'{temp_df_1.datetime.dt.year.min()}-01-01')\n", - "end_year = pd.to_datetime(f'{temp_df_1.datetime.dt.year.max()}-12-31')\n", - "ax.set_xlim(start_year, end_year)\n", - "# Rotate x-axis labels for better readability.\n", - "labels = ax.get_xticklabels()\n", - "ax.set_xticklabels(labels, rotation=45)\n", - "\n", - "# Create a second y-axis for the precipitation data.\n", - "ax2 = ax.twinx()\n", - "# Plot the 'precip(mm)' data as a bar graph on the secondary y-axis.\n", - "ax2.bar(temp_df_2.index, temp_df_2['precip(mm)'], label='Inverted', color='red', width=25)\n", - "# Set the y-axis limits for precipitation, flipping the axis to make bars grow downward.\n", - "ax2.set_ylim(max(temp_df_2['precip(mm)']) + 1000, 0)\n", - "# Set the label for the secondary y-axis.\n", - "ax2.set_ylabel('Precipitation (mm)')\n", - "# Define custom ticks for the secondary y-axis.\n", - "ax2.set_yticks(np.arange(0, max(temp_df_2['precip(mm)']), 250))\n", - "\n", - "# Set the title of the subplot to the station ID.\n", - "ax.set_title(f'{station_list[0]}')\n", - "# Set the x-axis label for subplots in the last row.\n", - "\n", - "ax.set_xlabel('Datetime (day)')\n", - "\n", - "# Set the y-axis label for subplots in the first column.\n", - "\n", - "ax.set_ylabel('Streamflow (cfs)')\n", - "\n", - "\n", - "\n", - "# Adjust the layout to prevent overlapping elements.\n", - "plt.tight_layout()\n", - "# Uncomment the line below to save the figure to a file.\n", - "# plt.savefig(f'{save_path}scatter_annual_drought_number.png')\n", - "# Display the plot.\n", - "plt.show()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we will plot all the stations!" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "# Initialize variables for the number of plots, columns, and rows based on the number of unique stations.\n", - "n_subplots = len(station_list)\n", - "n_cols = int(math.ceil(math.sqrt(n_subplots))) # Calculate columns as the ceiling of the square root of number of subplots.\n", - "n_rows = int(math.ceil(n_subplots / n_cols)) # Calculate rows as the ceiling of the ratio of subplots to columns.\n", - "figsize = (18, 12) # Set the figure size for the plot.\n", - "# Create a figure and a grid of subplots with the specified number of rows and columns.\n", - "fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize, dpi=300)\n", - "axes = axes.flatten() # Flatten the axes array for easier iteration.\n", - "\n", - "# Iterate over each axis to plot data for each station.\n", - "for i, ax in enumerate(axes):\n", - " if i < n_subplots: # Check if the current index is less than the number of subplots to populate.\n", - " # Extract data for the current station.\n", - " temp_df_1 = dataset[dataset.station_id == station_list[i]]\n", - " # Set 'datetime' as the index for the DataFrame for plotting.\n", - " temp_df_2 = temp_df_1.set_index('datetime')\n", - " # Plot 'flow_cfs' on the primary y-axis.\n", - " ax.plot(temp_df_2.index, temp_df_2['flow_cfs'])\n", - " # Set x-axis limits from the minimum to maximum year of data.\n", - " start_year = pd.to_datetime(f'{temp_df_1.datetime.dt.year.min()}-01-01')\n", - " end_year = pd.to_datetime(f'{temp_df_1.datetime.dt.year.max()}-12-31')\n", - " ax.set_xlim(start_year, end_year)\n", - " # Get current x-tick labels and set their rotation for better visibility.\n", - " labels = ax.get_xticklabels()\n", - " ax.set_xticklabels(labels, rotation=45)\n", - "\n", - " # Create a secondary y-axis for Snow Water Equivalent (SWE).\n", - " ax2 = ax.twinx()\n", - " # Plot SWE as a bar graph on the secondary y-axis.\n", - " ax2.bar(temp_df_2.index, temp_df_2['swe'], label='Inverted', color='red')\n", - " # Set the y-axis limits for SWE, flipping the axis to make bars grow downward.\n", - " ax2.set_ylim(max(temp_df_2['swe']) + 40, 0)\n", - " # Set label for the secondary y-axis.\n", - " ax2.set_ylabel('SWE')\n", - " # Define custom ticks for the secondary y-axis.\n", - " ax2.set_yticks(np.arange(0, max(temp_df_2['swe']), 5))\n", - "\n", - " # Set the title of the subplot to the station ID.\n", - " ax.set_title(f'{station_list[i]}')\n", - " # Set the x-axis label for subplots in the last row.\n", - " if i // n_cols == n_rows - 1:\n", - " ax.set_xlabel('Datetime (day)')\n", - "\n", - " # Set the y-axis label for subplots in the first column.\n", - " if i % n_cols == 0:\n", - " ax.set_ylabel('Streamflow (cfs)')\n", - " else:\n", - " # Hide any unused axes.\n", - " ax.axis('off')\n", - "\n", - "# Adjust the layout to prevent overlapping elements.\n", - "plt.tight_layout()\n", - "# Uncomment the line below to save the figure to a file.\n", - "# plt.savefig(f'{save_path}scatter_annual_drought_number.png')\n", - "# Display the plot.\n", - "plt.show()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "# Calculate the number of subplots needed based on the number of unique stations.\n", - "n_subplots = len(station_list)\n", - "# Determine the number of columns in the subplot grid by taking the ceiling of the square root of 'n_subplots'.\n", - "n_cols = 
int(math.ceil(math.sqrt(n_subplots)))\n", - "# Determine the number of rows in the subplot grid by dividing 'n_subplots' by 'n_cols' and taking the ceiling of that.\n", - "n_rows = int(math.ceil(n_subplots / n_cols))\n", - "# Set the figure size for the subplots.\n", - "figsize = (18, 12)\n", - "# Create a grid of subplots with specified number of rows and columns and figure size.\n", - "fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize, dpi=300)\n", - "# Flatten the axes array for easier iteration.\n", - "axes = axes.flatten()\n", - "\n", - "# Iterate over the axes to plot the data for each station.\n", - "for i, ax in enumerate(axes):\n", - " if i < n_subplots:\n", - " # Extract the data for the current station from the dataset.\n", - " temp_df_1 = dataset[dataset.station_id == station_list[i]]\n", - " # Set 'datetime' as the index for plotting.\n", - " temp_df_2 = temp_df_1.set_index('datetime')\n", - " # Plot the 'flow_cfs' data on the primary y-axis.\n", - " ax.plot(temp_df_2.index, temp_df_2['flow_cfs'])\n", - " # Set the x-axis limits from the first to the last year of data.\n", - " start_year = pd.to_datetime(f'{temp_df_1.datetime.dt.year.min()}-01-01')\n", - " end_year = pd.to_datetime(f'{temp_df_1.datetime.dt.year.max()}-12-31')\n", - " ax.set_xlim(start_year, end_year)\n", - " # Rotate x-axis labels for better readability.\n", - " labels = ax.get_xticklabels()\n", - " ax.set_xticklabels(labels, rotation=45)\n", - "\n", - " # Create a second y-axis for the precipitation data.\n", - " ax2 = ax.twinx()\n", - " # Plot the 'precip(mm)' data as a bar graph on the secondary y-axis.\n", - " ax2.bar(temp_df_2.index, temp_df_2['precip(mm)'], label='Inverted', color='red', width=25)\n", - " # Set the y-axis limits for precipitation, flipping the axis to make bars grow downward.\n", - " ax2.set_ylim(max(temp_df_2['precip(mm)']) + 1000, 0)\n", - " # Set the label for the secondary y-axis.\n", - " ax2.set_ylabel('Precipitation (mm)')\n", - " # Define custom ticks for the secondary y-axis.\n", - " ax2.set_yticks(np.arange(0, max(temp_df_2['precip(mm)']), 250))\n", - "\n", - " # Set the title of the subplot to the station ID.\n", - " ax.set_title(f'{station_list[i]}')\n", - " # Set the x-axis label for subplots in the last row.\n", - " if i // n_cols == n_rows - 1:\n", - " ax.set_xlabel('Datetime (day)')\n", - "\n", - " # Set the y-axis label for subplots in the first column.\n", - " if i % n_cols == 0:\n", - " ax.set_ylabel('Streamflow (cfs)')\n", - " else:\n", - " # Hide any unused axes.\n", - " ax.axis('off')\n", - "\n", - "# Adjust layout to prevent overlapping elements.\n", - "plt.tight_layout()\n", - "# Uncomment the line below to save the figure to a file.\n", - "# plt.savefig(f'{save_path}scatter_annual_drought_number.png')\n", - "# Display the plot.\n", - "plt.show()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 4.4. Splitting the Data \n", - "We split 80 percent of the data for training and the rest for testing the model." 
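The split is chronological for each station, so the model is tested only on years it never saw during training. A small worked sketch of the cutoff arithmetic used in the next cell, with hypothetical record years:

```python
# Worked example of the 80/20 chronological cutoff (hypothetical record years).
start_year, end_year = 1980, 2020            # hypothetical station record
duration = end_year - start_year             # 40 years of record
division_year = start_year + int(duration * 0.8)
print(f'training : {start_year}-01-01 up to (not including) {division_year}-01-01')
print(f'testing  : {division_year}-01-01 onward')   # here, 2012 onward
```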
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create empty DataFrames for training and testing datasets.\n", - "data_train = pd.DataFrame()\n", - "data_test = pd.DataFrame()\n", - "\n", - "# Loop through each station name in the list of station IDs.\n", - "for station_name in station_list:\n", - " # Extract data for the current station and reset the index.\n", - " temp_df_1 = dataset[dataset.station_id == station_name].reset_index(drop=True)\n", - " \n", - " # Determine the maximum and minimum years in the dataset for the current station.\n", - " end_year = temp_df_1.datetime.dt.year.max()\n", - " start_year = temp_df_1.datetime.dt.year.min()\n", - " \n", - " # Calculate the duration in years between the earliest and latest data points.\n", - " duration = end_year - start_year\n", - " \n", - " # Calculate the division year to split training and testing data (80% for training).\n", - " division_year = start_year + int(duration * 0.8)\n", - " \n", - " # Select data from the start year up to the division year for training, reset the index, and append to the training DataFrame.\n", - " data_train = pd.concat((data_train.reset_index(drop=True), temp_df_1[temp_df_1.datetime < f'{division_year}-01-01'].reset_index(drop=True)), axis=0).reset_index(drop=True)\n", - " \n", - " # Select data from the division year onward for testing, reset the index, and append to the testing DataFrame.\n", - " data_test = pd.concat((data_test.reset_index(drop=True), temp_df_1[temp_df_1.datetime >= f'{division_year}-01-01'].reset_index(drop=True)), axis=0).reset_index(drop=True)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. Model Development \n", - "#### 5.1. Defining the XGBoost Model \n", - "As mentioned, we will use XGBoost in our tutorial, and we will use the [dmlc XGBoost package](https://xgboost.readthedocs.io/en/stable/). Understanding and tuning the model parameters is critical in any ML model development since it will affect the final model performance. The XGBoost model has different parameters, and here, we will work on the three most important parameters of XGBoost:\n", - " \n", - "* **`max_depth`** This parameter determines the maximum depth of each tree. This setting controls the complexity of the tree by restricting the number of levels or splits within it. Increasing the *max_depth* can help capture more complex patterns in the data, but it can also lead to overfitting, where the model becomes too specialized in the training data and performs poorly on new data. On the other hand, reducing the *max_depth* can help prevent overfitting, but it can also result in underfitting, where the model misses relevant patterns.\n", - "\n", - "* **`n_estimators`** It determines the number of trees in the ensemble and controls how many boosting rounds the algorithm should perform. Boosting rounds add new trees that attempt to correct errors from previous rounds, leading to improved performance up to a point. However, too many trees can result in overfitting, where the model performs well on training data but poorly on unseen data. It also increases the running time of the model.\n", - "\n", - "* **`eta`** It controls the step size used to update the weights of the trees during training and determines how much each new tree contributes to the overall model. A smaller *eta* value can make the model more robust to overfitting by slowing the learning process. 
However, more boosting rounds (n_estimators) may be required to achieve good performance. Conversely, a larger *eta* value allows the model to learn faster but increases the risk of overfitting." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will call the *XGBRegressor()* model and specify its parameters.\n", - "\n", - "**To make the process faster for this part we only use one station.**\n", - "\n", - "Let's start investigating different parameters and find out the best possible values!!!!!!!!!!!!!!!!!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def evaluate_model(params, X_train, y_train):\n", - " # Create an XGBoost regressor model with the provided parameters.\n", - " model = xgb.XGBRegressor(**params)\n", - " \n", - " # Set up cross-validation configuration with 10 splits and 3 repeats, and a fixed random state for reproducibility.\n", - " cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\n", - " \n", - " # Perform cross-validation to evaluate the model using the negative mean absolute error as the scoring method.\n", - " # 'n_jobs=-1' enables using all CPU cores for parallel computation.\n", - " scores = cross_val_score(model, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)\n", - " \n", - " # Return the scores from the cross-validation.\n", - " return scores\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define the station ID to be used.\n", - "station_name = '10126000'\n", - "# Select and reset the index of feature columns from the training data where station ID matches.\n", - "x_train = data_train[data_train.station_id == station_name].iloc[:, 2:-1].reset_index(drop=True)\n", - "# Select and reset the index of the target column from the training data where station ID matches.\n", - "y_train = data_train[data_train.station_id == station_name].iloc[:, -1].reset_index(drop=True)\n", - "\n", - "# Define the initial parameters for the XGBoost model.\n", - "params = {\n", - " 'n_estimators': 200,\n", - " 'max_depth': 5,\n", - " 'eta': 0.1,\n", - "}\n", - "\n", - "# Evaluate the model with initial parameters and calculate the mean of absolute scores.\n", - "current_score = abs(evaluate_model(params, x_train, y_train).mean())\n", - "print(f\"Initial score (cfs): {current_score} with params: {params}\")\n", - "\n", - "# Initialize the interactive tuning loop.\n", - "continue_tuning = True\n", - "while continue_tuning:\n", - " # Prompt the user if they want to continue tuning.\n", - " print('=====================================================================================')\n", - " change = input(\"Do you want to change any variable? (y/n): \")\n", - " if change.lower() == 'y':\n", - " # Ask which parameter to change.\n", - " variable = input(\"Which variable number? 
(n_estimators(1)/max_depth(2)/eta(3)):\")\n", - " # Map user input to the corresponding parameter.\n", - " if variable == '1':\n", - " variable = 'n_estimators'\n", - " elif variable == '2':\n", - " variable = 'max_depth'\n", - " elif variable == '3':\n", - " variable = 'eta'\n", - " else:\n", - " print('Error: Wrong Number')\n", - " break\n", - "\n", - " # Prompt for the new value and validate the type.\n", - " value = input(f\"Enter the new value for {variable} (previous value {params[variable]}): \")\n", - " if variable == 'n_estimators' or variable == 'max_depth':\n", - " value = int(value)\n", - " else:\n", - " value = float(value)\n", - "\n", - " # Update parameter and re-evaluate the model.\n", - " old_param = params[variable]\n", - " params[variable] = value\n", - " new_score = evaluate_model(params, x_train, y_train)\n", - " print('**********************************************')\n", - " print('Previous Mean Score (cfs): %.3f (Previous Score SD: %.3f)' % (abs(current_score.mean()), current_score.std()))\n", - " print('New Mean Score (cfs): %.3f (New Score SD: %.3f)' % (abs(new_score.mean()), new_score.std()))\n", - " print('**********************************************')\n", - " current_score = new_score\n", - "\n", - " # Prompt if the new parameter setting should be kept.\n", - " keep_answer = input(f\"Do you want to keep the new variable?(y/n): \")\n", - " if keep_answer == 'n':\n", - " params[variable] = old_param\n", - " else:\n", - " # Exit tuning loop.\n", - " continue_tuning = False\n", - " print(f\"Finished tuning ====================> Final parameters: {params}.\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### !!!! Don't forget to train and save your model after tuning the hyperparameters as a Pickle file.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Instantiate an XGBoost regressor model with the specified parameters.\n", - "xgboost_model = xgb.XGBRegressor(**params)\n", - "\n", - "# Fit the model using the training dataset.\n", - "xgboost_model.fit(x_train, y_train)\n", - "\n", - "# Save the trained model to a file using the pickle library for later use.\n", - "# 'save_path' should be defined earlier in your script and point to a directory where you have write permissions.\n", - "pickle.dump(xgboost_model, open(f'{output_path}best_manuall_model.pkl', \"wb\"))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 5.2. Scaling the Data\n", - "Generally, scaling the inputs is not required in decision-tree ensemble models. However, some studies suggest scaling the inputs since XGBoost uses the Gradient Decent algorithm in its core optimization. So here we will try both \n", - "scaled and unscaled inputs to see the difference.\n", - "We will scale the data by using the *MinMaxScaler()* function from the Scikit-Learn library. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Function for scaling the data. 
\n", - "def input_scale(x_train, y_train):\n", - "\n", - " scaler_x = MinMaxScaler()\n", - " scaler_y = MinMaxScaler()\n", - " x_train_scaled, y_train_scaled = \\\n", - " scaler_x.fit_transform(x_train), scaler_y.fit_transform(y_train.values.reshape(-1, 1)).reshape(-1)\n", - " joblib.dump(scaler_x, f'{output_path}scaler_x.joblib')\n", - " joblib.dump(scaler_y, f'{output_path}scaler_y.joblib')\n", - " return x_train_scaled, y_train_scaled, scaler_x, scaler_y\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 5.3. Automatic Hypermeter Tuning\n", - "We investigated different values for each parameter, trying to tune it manually, which shows how difficult and time-consuming this process is. So, the next step is to use a simple automatic tuning method, Grird Search, to find the optimal hyperparameter values. To do so, we will use the *GirdSearchCV()* function of the Scikit-Learn library. \n", - "This method gets possible values for each parameter and then tests all possible combinations one by one. It finds the optimal value, but it is very slow. It also uses cross-validation to evaluate each combination. \n", - "\n", - "First, we divide and scale our whole dataset for the automatic tuning. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize dictionaries for storing scaled test datasets.\n", - "x_test_scaled = {}\n", - "y_test_scaled = {}\n", - "x_test = {} # Missing declaration in your provided code.\n", - "y_test = {} # Missing declaration in your provided code.\n", - "\n", - "\n", - "# Assigning features by selecting all but the last column from the data_train DataFrame and resetting the index.\n", - "x_train = data_train.iloc[:, 2:-1].reset_index(drop=True)\n", - "# Assigning the target by selecting the last column from the data_train DataFrame and resetting the index.\n", - "y_train = data_train.iloc[:, -1].reset_index(drop=True)\n", - "\n", - "# Scale the training data and retrieve the scalers for later use on the test data.\n", - "x_train_scaled, y_train_scaled, scaler_x, scaler_y = input_scale(x_train, y_train)\n", - "\n", - "# Loop over each station name from the list of station IDs.\n", - "for station_name in station_list:\n", - " # Extract and store the features for the test data for each station.\n", - " x_test[station_name] = data_test[data_test.station_id == station_name].iloc[:, 2:-1]\n", - " # Extract and store the target variable for the test data for each station.\n", - " y_test[station_name] = data_test[data_test.station_id == station_name].iloc[:, -1]\n", - " # Scale the extracted test features and targets using the previously fitted scalers.\n", - " x_test_scaled[station_name] = scaler_x.transform(x_test[station_name])\n", - " y_test_scaled[station_name] = scaler_y.transform(y_test[station_name].values.reshape(-1, 1)).reshape(-1)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Then, we select the possible values or range of values. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define the range of hyperparameters for XGBoost tuning.\n", - "# Note that 'range(100, 300, 200)' implies a single value because the step size leads directly to the limit.\n", - "# If you intend multiple steps, adjust the range appropriately.\n", - "hyperparameters_xgboost = {\n", - " 'max_depth': range(2, 4), # Generates [2, 3] because 'range' is exclusive of the stop value.\n", - " 'n_estimators': range(100, 301, 200), # To include both 100 and 300 if that was your intent.\n", - " 'eta': [0.1] # Learning rate is a fixed value in this setup.\n", - "}\n", - "\n", - "# Paths for saving the tuned hyperparameters and the trained model.\n", - "path_model_save_hyperparameters = f\"{output_path}best_model_hyperparameters_xgboost.pkl\"\n", - "path_model_save_model = f\"{output_path}best_model_xgboost.pkl\"\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, we create a new XGBoost model and the grid search function. Then, we will run the function and compare the results with those in the previous section. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize an XGBoost regressor model.\n", - "xgboost_model_automatic = xgb.XGBRegressor()\n", - "\n", - "# Setup GridSearchCV with the XGBoost model and hyperparameter grid.\n", - "grid_search_3 = GridSearchCV(estimator=xgboost_model_automatic, # Corrected to use the initialized model\n", - " param_grid=hyperparameters_xgboost, # Dictionary of parameters to try\n", - " scoring='neg_mean_absolute_error', # Scoring method MAE, reported as negative for maximization\n", - " cv=3, # Number of cross-validation folds\n", - " n_jobs=-1, # Use all available CPU cores\n", - " verbose=0) # Show detailed progress (level 3)\n", - "\n", - "# Fit the GridSearchCV to the scaled training data.\n", - "grid_search_3.fit(x_train_scaled, y_train_scaled)\n", - "\n", - "# Retrieve the best estimator from the grid search.\n", - "optimized_xgboost_model = grid_search_3.best_estimator_\n", - "\n", - "# Output the best parameters and the corresponding score for those parameters.\n", - "print(f\"Best parameters found: {grid_search_3.best_params_}\")\n", - "print(f\"Best RMSE: {abs(grid_search_3.best_score_)}\") # Print the absolute value of the RMSE\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Remember to save the best parameters after finding them!!!!!!**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "joblib.dump(grid_search_3, path_model_save_hyperparameters)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "\n", - "