Modified the scripts, removed the output folder, and added geopackage

geo-smart · Aug 15, 2024 · 75c97b0
1 parent 884c4f2
Showing 94 changed files with 183 additions and 7,767 deletions.
diff --git a/book/tutorials/decision_trees/01.script/01.tutorial_post_processing_xgboost_tuning.ipynb b/book/tutorials/decision_trees/01.script/01.tutorial_post_processing_xgboost_tuning.ipynb
@@ -100,6 +100,8 @@
     "home = os.getcwd()\n",
     "parent_path = os.path.dirname(home)\n",
     "input_path = f'{parent_path}/02.input/'\n",
+    "new_directory = os.path.join(parent_path, '03.output')\n",
+    "os.makedirs(new_directory, exist_ok=True)\n",
     "output_path = f'{parent_path}/03.output/'\n",
     "main_path = home"
    ]
@@ -113,7 +115,7 @@
     }
    },
    "source": [
-    "### 4. Data Preprocessing\n",
+    "## 4. Data Preprocessing\n",
     "#### 4.1. Overview of the USGS Stream Station\n",
     "- The dataset that we will use provides the data for seven GSL watershed stations. \n",
     "- The dataset contains climate variables, such as precipitation and temperature, water infrastructure, storage percentage, and watershed characteristics, such as average area and elevation. \n",
@@ -130,7 +132,7 @@
    },
    "source": [
     "#### 4.2. Load Dataset\n",
-    "- Using the boto3 library we get the input dataset from the CIROH S3 bucket."
+    "- We will load the data from a parquet file and choose the required stations. "
    ]
   },
   {
@@ -213,22 +215,83 @@
     "# Set the y-axis limits for SWE, flipping the axis to make bars grow downward.\n",
     "ax2.set_ylim(max(temp_df_2['swe']) + 40, 0)\n",
     "# Set label for the secondary y-axis.\n",
-    "ax2.set_ylabel('SWE')\n",
+    "ax2.set_ylabel('SWE (in)')\n",
     "# Define custom ticks for the secondary y-axis.\n",
     "ax2.set_yticks(np.arange(0, max(temp_df_2['swe']), 5))\n",
     "\n",
     "# Set the title of the subplot to the station ID.\n",
     "ax.set_title(f'{station_list[0]}')\n",
     "# Set the x-axis label for subplots in the last row.\n",
-    "if i // n_cols == n_rows - 1:\n",
-    "    ax.set_xlabel('Datetime (day)')\n",
+    "ax.set_xlabel('Datetime (day)')\n",
     "\n",
     "# Set the y-axis label for subplots in the first column.\n",
-    "if i % n_cols == 0:\n",
-    "    ax.set_ylabel('Streamflow (cfs)')\n",
-    "else:\n",
-    "    # Hide any unused axes.\n",
-    "    ax.axis('off')\n",
+    "\n",
+    "ax.set_ylabel('Streamflow (cfs)')\n",
+    "\n",
+    "\n",
+    "# Adjust the layout to prevent overlapping elements.\n",
+    "plt.tight_layout()\n",
+    "# Uncomment the line below to save the figure to a file.\n",
+    "# plt.savefig(f'{save_path}scatter_annual_drought_number.png')\n",
+    "# Display the plot.\n",
+    "plt.show()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The next plot shows precipitation vs streamflow. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "\n",
+    "figsize = (9, 6)  # Set the figure size for the plot.\n",
+    "fig, ax = plt.subplots(figsize=figsize)\n",
+    "\n",
+    "\n",
+    "# Extract the data for the current station from the dataset.\n",
+    "temp_df_1 = dataset[dataset.station_id == station_list[0]]\n",
+    "# Set 'datetime' as the index for plotting.\n",
+    "temp_df_2 = temp_df_1.set_index('datetime')\n",
+    "# Plot the 'flow_cfs' data on the primary y-axis.\n",
+    "ax.plot(temp_df_2.index, temp_df_2['flow_cfs'])\n",
+    "# Set the x-axis limits from the first to the last year of data.\n",
+    "start_year = pd.to_datetime(f'{temp_df_1.datetime.dt.year.min()}-01-01')\n",
+    "end_year = pd.to_datetime(f'{temp_df_1.datetime.dt.year.max()}-12-31')\n",
+    "ax.set_xlim(start_year, end_year)\n",
+    "# Rotate x-axis labels for better readability.\n",
+    "labels = ax.get_xticklabels()\n",
+    "ax.set_xticklabels(labels, rotation=45)\n",
+    "\n",
+    "# Create a second y-axis for the precipitation data.\n",
+    "ax2 = ax.twinx()\n",
+    "# Plot the 'precip(mm)' data as a bar graph on the secondary y-axis.\n",
+    "ax2.bar(temp_df_2.index, temp_df_2['precip(mm)'], label='Inverted', color='red', width=25)\n",
+    "# Set the y-axis limits for precipitation, flipping the axis to make bars grow downward.\n",
+    "ax2.set_ylim(max(temp_df_2['precip(mm)']) + 1000, 0)\n",
+    "# Set the label for the secondary y-axis.\n",
+    "ax2.set_ylabel('Precipitation (mm)')\n",
+    "# Define custom ticks for the secondary y-axis.\n",
+    "ax2.set_yticks(np.arange(0, max(temp_df_2['precip(mm)']), 250))\n",
+    "\n",
+    "# Set the title of the subplot to the station ID.\n",
+    "ax.set_title(f'{station_list[0]}')\n",
+    "# Set the x-axis label for subplots in the last row.\n",
+    "\n",
+    "ax.set_xlabel('Datetime (day)')\n",
+    "\n",
+    "# Set the y-axis label for subplots in the first column.\n",
+    "\n",
+    "ax.set_ylabel('Streamflow (cfs)')\n",
+    "\n",
+    "\n",
     "\n",
     "# Adjust the layout to prevent overlapping elements.\n",
     "plt.tight_layout()\n",
@@ -238,6 +301,13 @@
     "plt.show()\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we will plot all the stations."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -303,13 +373,6 @@
     "plt.show()\n"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The next plot shows precipitation vs streamflow. "
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -423,7 +486,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 5. Model Development \n",
+    "## 5. Model Development \n",
     "#### 5.1. Defining the XGBoost Model \n",
     "As mentioned, we will use XGBoost in our tutorial, and we will use the  [dmlc XGBoost package](https://xgboost.readthedocs.io/en/stable/). Understanding and tuning the model parameters is critical in any ML model development since it will affect the final model performance. The XGBoost model has different parameters, and here, we will work on the three most important parameters of XGBoost:\n",
     "  \n",
@@ -546,7 +609,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**!!!! Don't forget to train and save your model after tuning the hyperparameters as a Pickle file.**\n"
+    "#### !!!! Don't forget to train and save your model after tuning the hyperparameters as a Pickle file.\n"
    ]
   },
   {
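
Note on the first hunk of this file: the two added lines create `03.output` before `output_path` is used, which matters because this commit removes the output folder from the repository. A minimal sketch of the same setup follows, assuming the `01.script` / `02.input` / `03.output` layout seen in the diff. The `build_paths` helper name is hypothetical and not part of the commit.

```python
import os
import tempfile


def build_paths(script_dir):
    """Return (input_path, output_path) and create the output folder if missing.

    Hypothetical helper that mirrors the path logic in the notebook's setup cell.
    """
    parent_path = os.path.dirname(script_dir)
    input_path = os.path.join(parent_path, '02.input')
    output_path = os.path.join(parent_path, '03.output')
    # exist_ok=True makes the call safe to re-run when the folder already exists.
    os.makedirs(output_path, exist_ok=True)
    return input_path, output_path


if __name__ == '__main__':
    # Demonstrate on a throwaway directory rather than the real repository.
    with tempfile.TemporaryDirectory() as root:
        script_dir = os.path.join(root, '01.script')
        os.makedirs(script_dir)
        input_path, output_path = build_paths(script_dir)
        print(os.path.isdir(output_path))  # True
```

Because `exist_ok=True` is passed, re-running the setup cell does not raise `FileExistsError`, so the notebook stays re-runnable after the first execution.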

diff --git a/...rials/decision_trees/01.script/02.tutorial_post_processing_xgboost_automatic_tuning.ipynb b/...rials/decision_trees/01.script/02.tutorial_post_processing_xgboost_automatic_tuning.ipynb
@@ -72,7 +72,6 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 5. Model Development Continued\n",
     "#### 5.2. Scaling the Data\n",
     "Generally, scaling the inputs is not required in decision-tree ensemble models. However, some studies suggest scaling the inputs since XGBoost uses the Gradient Descent algorithm in its core optimization. So here we will try both \n",
     "scaled and unscaled inputs to see the difference.\n",