From 66cc39a72cd315e5580d9f4aa794cb096110deb8 Mon Sep 17 00:00:00 2001 From: Sai Vivek Vangaveti Date: Fri, 9 Aug 2024 12:02:37 -0700 Subject: [PATCH 1/5] chapter 4 restructure --- book/_toc.yml | 4 +- ...selection.ipynb => data_integration.ipynb} | 60 +---------------- book/chapters/feature_selection.ipynb | 66 +++++++++++++++++++ 3 files changed, 72 insertions(+), 58 deletions(-) rename book/chapters/{feat_selection.ipynb => data_integration.ipynb} (89%) create mode 100644 book/chapters/feature_selection.ipynb diff --git a/book/_toc.yml b/book/_toc.yml index eacc7b8..54873a9 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -37,7 +37,9 @@ parts: - caption: Chapter Four chapters: - - file: chapters/feat_selection + - file: chapters/data_integration + title: Data Integration + - file: chapters/feature_selection title: Feature Selection - caption: Chapter Five diff --git a/book/chapters/feat_selection.ipynb b/book/chapters/data_integration.ipynb similarity index 89% rename from book/chapters/feat_selection.ipynb rename to book/chapters/data_integration.ipynb index fe585de..6da7f5e 100644 --- a/book/chapters/feat_selection.ipynb +++ b/book/chapters/data_integration.ipynb @@ -5,8 +5,7 @@ "id": "7387d7e6", "metadata": {}, "source": [ - "# Feature Selection for SWE Prediction Models\n", - "Criteria for selecting features in SWE prediction models, Techniques and tools used for feature selection\n", + "# Bringing Data Together\n", "\n", "## Introduction to the Data\n", "\n", @@ -109,65 +108,12 @@ "merged_df2.to_csv('../data/model_training_data.csv', index=False, single_file=True)" ] }, - { - "cell_type": "markdown", - "id": "51a6ad77", - "metadata": {}, - "source": [ - "## Step 2: Preprocessing for Feature Selection\n", - "\n", - "Preprocessing steps are crucial for fine-tuning the data to ensure it's model-ready. This includes:\n", - "- **Date-Range Data Clipping:** This step focuses on trimming the data to fit a specified date range, which is necessary for the analysis. After this trimming process, we save the refined data back into a CSV file.\n", - "- **Filtering:** Select only the relevant columns needed for SWE prediction, such as weather conditions, geographic features, and snowpack measurements.\n", - "- **Renaming:** Streamline column names for consistency and clarity (e.g., changing \"Snow Water Equivalent (in) Start of Day Values\" to \"swe_value\").\n", - "\n", - "save the final data into csv file." 
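The preprocessing notes above list date-range clipping alongside filtering and renaming, but the cell that follows only filters and renames columns. A minimal sketch of what the clipping step could look like with Dask, assuming the merged CSV has a `date` column; the paths and the date window below are placeholders rather than the notebook's actual values.

```python
import dask.dataframe as dd

# Placeholder paths and window -- adjust to the real training period.
input_csv = '../data/model_training_data.csv'
clipped_csv = '../data/model_training_clipped.csv'

df = dd.read_csv(input_csv)

# Parse the 'date' column, then keep only rows inside the chosen window.
df['date'] = dd.to_datetime(df['date'])
clipped = df[(df['date'] >= '2019-01-01') & (df['date'] <= '2021-12-31')]

# Save the trimmed data back to a single CSV, mirroring the cells above.
clipped.to_csv(clipped_csv, index=False, single_file=True)
```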
- ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "8df5d523", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['/Users/vangavetisaivivek/research/swe-workflow-book/book/data/model_training_cleaned.csv']" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "input_csv = '../data/model_training_data.csv'\n", - "\n", - "# List of columns you want to extract\n", - "selected_columns = ['date', 'lat', 'lon', 'etr', 'pr', 'rmax',\n", - " 'rmin', 'tmmn', 'tmmx', 'vpd', 'vs', \n", - " 'elevation',\n", - " 'slope', 'curvature', 'aspect', 'eastness',\n", - " 'northness', 'Snow Water Equivalent (in) Start of Day Values']\n", - "# Read the CSV file into a Dask DataFrame\n", - "df = dd.read_csv(input_csv, usecols=selected_columns)\n", - "\n", - "df = df.rename(columns={\"Snow Water Equivalent (in) Start of Day Values\": \"swe_value\"})\n", - "\n", - "# Replace 'output.csv' with the desired output file name\n", - "output_csv = '../data/model_training_cleaned.csv'\n", - "\n", - "# Write the selected columns to a new CSV file\n", - "df.to_csv(output_csv, index=False, single_file=True)" - ] - }, { "cell_type": "markdown", "id": "8803de8a", "metadata": {}, "source": [ - "## Step 3: Advanced Merging and Cleaning\n", + "## Step 2: Advanced Merging and Cleaning\n", "\n", "For a deeper dive, additional scripts provide a more intricate merging process involving multiple data sources and filters based on time ranges. The aim here is to:\n", "\n", @@ -574,7 +520,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.5" + "version": "3.11.8" } }, "nbformat": 4, diff --git a/book/chapters/feature_selection.ipynb b/book/chapters/feature_selection.ipynb new file mode 100644 index 0000000..93304ea --- /dev/null +++ b/book/chapters/feature_selection.ipynb @@ -0,0 +1,66 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Feature Selection for SWE Prediction Models" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Criteria for selecting features in SWE prediction models, Techniques and tools used for feature selection" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "- **Filtering:** Select only the relevant columns needed for SWE prediction, such as weather conditions, geographic features, and snowpack measurements.\n", + "- **Renaming:** Streamline column names for consistency and clarity (e.g., changing \"Snow Water Equivalent (in) Start of Day Values\" to \"swe_value\").\n", + "\n", + "save the final data into csv file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "input_csv = '../data/model_training_data.csv'\n", + "\n", + "# List of columns you want to extract\n", + "selected_columns = ['date', 'lat', 'lon', 'etr', 'pr', 'rmax',\n", + " 'rmin', 'tmmn', 'tmmx', 'vpd', 'vs', \n", + " 'elevation',\n", + " 'slope', 'curvature', 'aspect', 'eastness',\n", + " 'northness', 'Snow Water Equivalent (in) Start of Day Values']\n", + "# Read the CSV file into a Dask DataFrame\n", + "df = dd.read_csv(input_csv, usecols=selected_columns)\n", + "\n", + "df = df.rename(columns={\"Snow Water Equivalent (in) Start of Day Values\": \"swe_value\"})\n", + "\n", + "# Replace 'output.csv' with the desired output file name\n", + "output_csv = '../data/model_training_cleaned.csv'\n", + "\n", + "# Write the selected columns to a new CSV file\n", + "df.to_csv(output_csv, index=False, single_file=True)" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From dd0661cabf2e560655b70de8f5f064ce0cf7c2a8 Mon Sep 17 00:00:00 2001 From: Sai Vivek Vangaveti Date: Tue, 13 Aug 2024 20:22:26 -0700 Subject: [PATCH 2/5] seperated chapters and added heading, refactored content --- book/chapters/data_integration.ipynb | 412 ++------------------------ book/chapters/feature_selection.ipynb | 9 +- book/chapters/fsCA.ipynb | 30 +- 3 files changed, 55 insertions(+), 396 deletions(-) diff --git a/book/chapters/data_integration.ipynb b/book/chapters/data_integration.ipynb index 6da7f5e..c1fc3a1 100644 --- a/book/chapters/data_integration.ipynb +++ b/book/chapters/data_integration.ipynb @@ -5,9 +5,9 @@ "id": "7387d7e6", "metadata": {}, "source": [ - "# Bringing Data Together\n", + "# 4.1 Bringing Data Together\n", "\n", - "## Introduction to the Data\n", + "## 4.1.1 Introduction to the Data\n", "\n", "The three key datasets:\n", "\n", @@ -17,14 +17,14 @@ "\n", "Each dataset comes packed with essential features like latitude, longitude, and date, ready to enrich our SWE prediction model.\n", "\n", - "## Step 1: Integrating the Datasets with Dask\n", + "## 4.1.2 Integrating the Datasets\n", "\n", "We are combining these large datasets into one DataFrame using Dask. Dask allows us to work with big data efficiently, so we can merge the datasets quickly and easily, no matter how large they are.\n", "\n", "And also if the size of the data is larger then reading large CSV files in chunks helps manage big data more efficiently by reducing memory use, speeding up processing, and improving error handling. This approach makes it easier to work on large datasets with limited resources, ensuring flexibility and scalability in data analysis.\n", "\n", - "#### Read and Convert\n", - "- Each CSV file is read into a Dask DataFrame, with latitude and longitude data types converted to floats for uniformity. And also if the size of the data is larger then reading large CSV files in chunks helps manage big data more efficiently by reducing memory use, speeding up processing, and improving error handling. This approach makes it easier to work on large datasets with limited resources, ensuring flexibility and scalability in data analysis." + "### Read and Convert\n", + "- Each CSV file is read into a Dask DataFrame, with latitude and longitude data types converted to floats for uniformity." 
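The "Read and Convert" bullet summarizes a cell that sits outside this hunk. A rough sketch of that step, under the assumption that it uses the training-ready CSVs introduced later in this patch series and a 10MB blocksize; the notebook's own cells remain the authoritative version.

```python
import dask.dataframe as dd

# File names follow the training-ready CSVs shown later in this patch series.
climatology = dd.read_csv('../data/training_ready_climatology_data.csv', blocksize='10MB')
snotel = dd.read_csv('../data/training_ready_snotel_data.csv', blocksize='10MB')
terrain = dd.read_csv('../data/training_ready_terrain_data.csv', blocksize='10MB')

# Cast the join keys to a uniform float type so the merges line up exactly.
climatology = climatology.astype({'lat': float, 'lon': float})
snotel = snotel.astype({'lat': float, 'lon': float})
terrain = terrain.astype({'lat': float, 'lon': float})

# One plausible shape of the integration: climatology and SNOTEL join on
# date/lat/lon, terrain joins on lat/lon only ('Date' -> 'date' rename assumed).
snotel = snotel.rename(columns={'Date': 'date'})
merged = dd.merge(climatology, snotel, on=['date', 'lat', 'lon'], how='outer')
merged_df2 = dd.merge(merged, terrain, on=['lat', 'lon'], how='outer')
```

Reading with an explicit `blocksize` keeps each partition small, which is what the remark above about reading large CSV files in chunks is getting at.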
] }, { @@ -108,387 +108,35 @@ "merged_df2.to_csv('../data/model_training_data.csv', index=False, single_file=True)" ] }, - { - "cell_type": "markdown", - "id": "8803de8a", - "metadata": {}, - "source": [ - "## Step 2: Advanced Merging and Cleaning\n", - "\n", - "For a deeper dive, additional scripts provide a more intricate merging process involving multiple data sources and filters based on time ranges. The aim here is to:\n", - "\n", - "- **Integrate Further Data:** Additional sources like AMSR data are introduced, expanding the dataset with more variables relevant to SWE prediction.\n", - "- **Optimize and Clean:** Repartitioning and dropping duplicates are applied post-merge to ensure the dataset is optimized for processing and free of redundancy.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "cdf07ce8", - "metadata": {}, - "outputs": [], - "source": [ - "import dask.dataframe as dd\n", - "import os\n", - "import pandas as pd\n", - "\n", - "homedir = os.path.expanduser('~')\n", - "working_dir = f\"../data\"\n", - "work_dir = working_dir\n", - "final_output_name = \"final_merged_data_3yrs_all_active_stations_v1.csv\"\n", - "chunk_size = '10MB' # You can adjust this chunk size based on your hardware and data size" - ] - }, - { - "cell_type": "markdown", - "id": "67230adc", - "metadata": {}, - "source": [ - " It begins by importing necessary libraries like dask.dataframe, os, pandas.\n", - "\n", - "- **dask.dataframe:** This is for handling large datasets efficiently. Dask is a Python library that allows for parallel computing and works well with datasets too large for the memory of a single computer.\n", - "\n", - "- **os:** This module provides a way of using operating system-dependent functionality like reading or writing to a file system.\n", - "\n", - "- **pandas**: This module is great for data manipulation and analysis. It's particularly used for working with tabular data (like spreadsheets and SQL database outputs).\n", - "\n", - "Initially, it identifies the user's home directory to establish a base location. Subsequently, within this base location, we are giving a specific path which is directory named 'data' where we have provided all the data that is needed for analysis. Then we define the name of the output file,\n", - "final_merged_data_3yrs_all_active_stations_v1.csv. To efficiently manage computer memory during this operation, the data will be processed in segments, each limited to 10MB." - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "0f8668c6", - "metadata": {}, - "outputs": [], - "source": [ - "amsr_file = f'{working_dir}/all_snotel_cdec_stations_active_in_westus.csv_amsr_dask.csv'\n", - "snotel_file = f'{working_dir}/all_snotel_cdec_stations_active_in_westus.csv_swe_restored_dask_all_vars.csv'\n", - "gridmet_file = f'{working_dir}/training_all_active_snotel_station_list_elevation.csv_gridmet.csv'\n", - "terrain_file = f'{working_dir}/training_all_active_snotel_station_list_elevation.csv_terrain_4km_grid_shift.csv'\n", - "fsca_file = f'{working_dir}/fsca_final_training_all.csv'\n", - "final_final_output_file = f'{work_dir}/{final_output_name}'\n", - "\n", - "if os.path.exists(final_final_output_file):\n", - " print(f\"The file '{final_final_output_file}' exists. Skipping\")" - ] - }, - { - "cell_type": "markdown", - "id": "d81b4ac6", - "metadata": {}, - "source": [ - "Here we are defining the input and output files, checks if the final output file already exists. 
If it does, it prints a message and skips further processing." - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "a599b0ce", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "amsr.columns = Index(['date', 'lat', 'lon', 'AMSR_SWE'], dtype='object')\n" - ] - } - ], - "source": [ - "# Read the CSV files with a smaller chunk size and compression\n", - "amsr = dd.read_csv(amsr_file, blocksize=chunk_size)\n", - "print(\"amsr.columns = \", amsr.columns)" - ] - }, - { - "cell_type": "markdown", - "id": "d11a2a17", - "metadata": {}, - "source": [ - "It reads data from CSV file into Dask DataFrames by blocks which provides a flexible and efficient approach for handling large datasets, enabling better scalability and performance in data processing tasks." - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "5f3cdcaf", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "snotel.columns = Index(['station_name', 'date', 'lat', 'lon', 'swe_value', 'change_in_swe_inch',\n", - " 'snow_depth', 'air_temperature_observed_f'],\n", - " dtype='object')\n" - ] - } - ], - "source": [ - "snotel = dd.read_csv(snotel_file, blocksize=chunk_size)\n", - "print(\"snotel.columns = \", snotel.columns)" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "a4f57204", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "gridmet.columns = Index(['day', 'lat', 'lon', 'air_temperature_tmmn',\n", - " 'potential_evapotranspiration', 'mean_vapor_pressure_deficit',\n", - " 'relative_humidity_rmax', 'relative_humidity_rmin',\n", - " 'precipitation_amount', 'air_temperature_tmmx', 'wind_speed'],\n", - " dtype='object')\n" - ] - } - ], - "source": [ - "gridmet = dd.read_csv(gridmet_file, blocksize=chunk_size)\n", - "# Drop the 'Unnamed: 0' column\n", - "gridmet = gridmet.drop(columns=[\"Unnamed: 0\"])\n", - "print(\"gridmet.columns = \", gridmet.columns)" - ] - }, { "cell_type": "code", - "execution_count": 29, - "id": "cb830353", + "execution_count": 2, + "id": "414a3ebf", "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "terrain.columns = Index(['stationTriplet', 'elevation', 'lat', 'lon', 'Elevation', 'Slope',\n", - " 'Aspect', 'Curvature', 'Northness', 'Eastness'],\n", - " dtype='object')\n" - ] - } - ], - "source": [ - "terrain = dd.read_csv(terrain_file, blocksize=chunk_size)\n", - "# rename columns to match the other dataframes\n", - "terrain = terrain.rename(columns={\n", - " \"latitude\": \"lat\", \n", - " \"longitude\": \"lon\"\n", - "})\n", - "# select only the columns we need for the final output\n", - "terrain = terrain[[\"stationTriplet\", \"elevation\", \"lat\", \"lon\", 'Elevation', 'Slope', 'Aspect', 'Curvature', 'Northness', 'Eastness']]\n", - "print(\"terrain.columns = \", terrain.columns)" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "id": "fcd4dae6", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "snowcover.columns = Index(['date', 'lat', 'lon', 'fSCA'], dtype='object')\n" - ] - } - ], - "source": [ - "snowcover = dd.read_csv(fsca_file, blocksize=chunk_size)\n", - "# rename columns to match the other dataframes\n", - "snowcover = snowcover.rename(columns={\n", - " \"latitude\": \"lat\", \n", - " \"longitude\": \"lon\"\n", - "})\n", - "print(\"snowcover.columns = \", snowcover.columns)\n" - ] - }, - { - "cell_type": 
"code", - "execution_count": 31, - "id": "e79ad4d1", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "all the dataframes are partitioned\n" - ] - } - ], - "source": [ - "# Repartition DataFrames for optimized processing\n", - "amsr = amsr.repartition(partition_size=chunk_size)\n", - "snotel = snotel.repartition(partition_size=chunk_size)\n", - "gridmet = gridmet.repartition(partition_size=chunk_size)\n", - "gridmet = gridmet.rename(columns={'day': 'date'})\n", - "terrain = terrain.repartition(partition_size=chunk_size)\n", - "snow_cover = snowcover.repartition(partition_size=chunk_size)\n", - "print(\"all the dataframes are partitioned\")" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "id": "50240e52", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "start to merge amsr and snotel\n", - "intermediate file saved to ../data/final_merged_data_3yrs_all_active_stations_v1.csv_snotel.csv\n", - "start to merge gridmet\n", - "intermediate file saved to ../data/final_merged_data_3yrs_all_active_stations_v1.csv_gridmet.csv\n", - "start to merge terrain\n", - "intermediate file saved to ../data/final_merged_data_3yrs_all_active_stations_v1.csv_terrain.csv\n", - "start to merge snowcover\n", - "intermediate file saved to ../data/final_merged_data_3yrs_all_active_stations_v1.csv_snow_cover.csv\n", - "Merge completed. ../data/final_merged_data_3yrs_all_active_stations_v1.csv\n" - ] - } - ], - "source": [ - "# Merge DataFrames based on specified columns\n", - "print(\"start to merge amsr and snotel\")\n", - "merged_df = dd.merge(amsr, snotel, on=['lat', 'lon', 'date'], how='outer')\n", - "merged_df = merged_df.drop_duplicates(keep='first')\n", - "output_file = os.path.join(working_dir, f\"{final_output_name}_snotel.csv\")\n", - "merged_df.to_csv(output_file, single_file=True, index=False)\n", - "print(f\"intermediate file saved to {output_file}\")\n", - "\n", - "print(\"start to merge gridmet\")\n", - "merged_df = dd.merge(merged_df, gridmet, on=['lat', 'lon', 'date'], how='outer')\n", - "merged_df = merged_df.drop_duplicates(keep='first')\n", - "output_file = os.path.join(working_dir, f\"{final_output_name}_gridmet.csv\")\n", - "merged_df.to_csv(output_file, single_file=True, index=False)\n", - "print(f\"intermediate file saved to {output_file}\")\n", - "\n", - "print(\"start to merge terrain\")\n", - "merged_df = dd.merge(merged_df, terrain, on=['lat', 'lon'], how='outer')\n", - "merged_df = merged_df.drop_duplicates(keep='first')\n", - "output_file = os.path.join(working_dir, f\"{final_output_name}_terrain.csv\")\n", - "merged_df.to_csv(output_file, single_file=True, index=False)\n", - "print(f\"intermediate file saved to {output_file}\")\n", - "\n", - "print(\"start to merge snowcover\")\n", - "merged_df = dd.merge(merged_df, snow_cover, on=['lat', 'lon', 'date'], how='outer')\n", - "merged_df = merged_df.drop_duplicates(keep='first')\n", - "output_file = os.path.join(working_dir, f\"{final_output_name}_snow_cover.csv\")\n", - "merged_df.to_csv(output_file, single_file=True, index=False)\n", - "print(f\"intermediate file saved to {output_file}\")\n", - "\n", - "# Save the merged DataFrame to a CSV file in chunks\n", - "output_file = os.path.join(working_dir, final_output_name)\n", - "merged_df.to_csv(output_file, single_file=True, index=False)\n", - "print(f'Merge completed. 
{output_file}')" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "id": "c63a4000", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Data cleaning completed.\n" - ] - } - ], - "source": [ - "\n", - "# Read the merged DataFrame, remove duplicate rows, and save the cleaned DataFrame to a new CSV file\n", - "df = dd.read_csv(f'{work_dir}/{final_output_name}', dtype={'stationTriplet': 'object',\n", - " 'station_name': 'object'})\n", - "df = df.drop_duplicates(keep='first')\n", - "df.to_csv(f'{work_dir}/{final_output_name}', single_file=True, index=False)\n", - "print('Data cleaning completed.')" - ] - }, - { - "cell_type": "markdown", - "id": "a5408637", - "metadata": {}, - "source": [ - "- **Merge and Save AMSR and SNOTEL Data:**\n", - " - It merges AMSR and SNOTEL data on latitude, longitude, and date using an outer join. on=['lat', 'lon', 'date'] specifies the columns to merge on and how='outer' performs an outer join, retaining all rows from both Dataframes.\n", - " - Removes duplicate rows.\n", - " - Saves the merged DataFrame to a CSV file named {final_output_name}_snotel.csv.\n", - "- **Merge and Save Gridmet Data:**\n", - " - It merges the previously merged DataFrame with Gridmet data on latitude, longitude, and date using an outer join.\n", - " - Removes duplicate rows.\n", - " - Saves the updated merged DataFrame to a CSV file named {final_output_name}_gridmet.csv.\n", - "- **Merge and Save Terrain Data:**\n", - " - It merges the DataFrame again with terrain data on latitude and longitude using an outer join.\n", - " - Removes duplicate rows.\n", - " - Saves the updated merged DataFrame to a CSV file named {final_output_name}_terrain.csv\n", - "- **Merge and Save Snow Cover Data:**\n", - " - It merges the DataFrame once more with snow cover data on latitude, longitude, and date using an outer join.\n", - " - Removes duplicate rows.\n", - " - Saves the updated merged DataFrame to a CSV file named {final_output_name}_snow_cover.csv\n", - "- **Save Final Merged Data:**\n", - " - It saves the final merged DataFrame to a single CSV file named {final_output_name} in the specified working directory.\n", - "- **Data Cleaning:**\n", - " - It reads the final merged DataFrame again.\n", - " - Removes duplicate rows.\n", - " - Saves the cleaned DataFrame to a new CSV file with the same name {final_output_name}.\n", - " \n", - " single_file=True: Saves data to a single file.\n", - " \n", - " index=False: Omits DataFrame index from the CSV.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "id": "beb02f6d", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "sorted training data is saved to ../data/final_merged_data_3yrs_all_active_stations_v1.csv_sorted.csv\n" - ] + "data": { + "text/plain": [ + "Index(['date', 'lat', 'lon', 'cell_id', 'station_id', 'etr', 'pr', 'rmax',\n", + " 'rmin', 'tmmn', 'tmmx', 'vpd', 'vs',\n", + " 'Snow Water Equivalent (in) Start of Day Values',\n", + " 'Change In Snow Water Equivalent (in)',\n", + " 'Snow Depth (in) Start of Day Values', 'Change In Snow Depth (in)',\n", + " 'Air Temperature Observed (degF) Start of Day Values', 'station_name',\n", + " 'station_triplet', 'station_elevation', 'station_lat', 'station_long',\n", + " 'mapping_station_id', 'mapping_cell_id', 'elevation', 'slope',\n", + " 'curvature', 'aspect', 'eastness', 'northness'],\n", + " dtype='object')" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": 
"execute_result" } ], "source": [ - "def sort_training_data(input_training_csv, sorted_training_csv):\n", - " # Read Dask DataFrame from CSV with increased blocksize and assuming missing data\n", - " ddf = dd.read_csv(input_training_csv, assume_missing=True, blocksize='10MB', dtype={'stationTriplet': 'object',\n", - " 'station_name': 'object'})\n", - "\n", - " # Persist the Dask DataFrame in memory\n", - " ddf = ddf.persist()\n", - "\n", - " # Sort Dask DataFrame by three columns: date, lat, and Lon\n", - " sorted_ddf = ddf.sort_values(by=['date', 'lat', 'lon'])\n", - "\n", - " # Save the sorted Dask DataFrame to a new CSV file\n", - " sorted_ddf.to_csv(sorted_training_csv, index=False, single_file=True)\n", - " print(f\"sorted training data is saved to {sorted_training_csv}\")\n", - "\n", - "final_final_output_file = f'{work_dir}/{final_output_name}'\n", - "sort_training_data(final_final_output_file, f'{work_dir}/{final_output_name}_sorted.csv')" - ] - }, - { - "cell_type": "markdown", - "id": "84dd7545", - "metadata": {}, - "source": [ - "Here we first read the CSV file into a Dask DataFrame, persists the Dask DataFrame in memory, which improves performance by keeping the data cached and readily accessible for further processing. Sorts the Dask DataFrame based on the specified columns (date, lat, lon). Saves the sorted Dask DataFrame to a new CSV file specified by sorted_training_csv. The index=False argument ensures that the index column is not included in the output CSV." + "df = dd.read_csv('../data/model_training_data.csv')\n", + "df.columns" ] }, { @@ -498,17 +146,15 @@ "source": [ "## Conclusion: The Ready-to-Train Dataset\n", "\n", - "The outcome of this journey is a rich, comprehensive dataset that stands ready for training SWE prediction models. Through meticulous merging, preprocessing, and cleaning, we’ve prepared a dataset that encapsulates the complexity of the environment and the specificity of snowpack conditions, laying a solid foundation for accurate and reliable SWE predictions.\n", - "\n", - "This streamlined dataset not only facilitates more accurate models but also illustrates the importance of a thorough feature selection process in predictive modeling. " + "The outcome of this journey is a rich, comprehensive dataset that stands ready for training SWE prediction models. Through meticulous merging, preprocessing, and cleaning, we’ve prepared a dataset that encapsulates the complexity of the environment and the specificity of snowpack conditions, laying a solid foundation for accurate and reliable SWE predictions." 
] } ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python (base)", "language": "python", - "name": "python3" + "name": "base" }, "language_info": { "codemirror_mode": { diff --git a/book/chapters/feature_selection.ipynb b/book/chapters/feature_selection.ipynb index 93304ea..dfcd488 100644 --- a/book/chapters/feature_selection.ipynb +++ b/book/chapters/feature_selection.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Feature Selection for SWE Prediction Models" + "# 4.2 Feature Selection for SWE Prediction Models" ] }, { @@ -54,6 +54,13 @@ "# Write the selected columns to a new CSV file\n", "df.to_csv(output_csv, index=False, single_file=True)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Major environmental factors that affect the swe and required to train the models are date, location which includes latitude and longitude, etr (`Evapotranspiration`), pr (`Precipitation`), rmax (`Maximum Relative Humidity`), rmin (`Minimum Relative Humidity`), tmmn (`Minimum Temperature`), tmmx (`Maximum Temperature`), vpd (`Vapor Pressure Deficit`), vs (`Wind Speed`), `elevation`, `slope`, `curvature`, `aspect`, `eastness`, `northness`.\n" + ] } ], "metadata": { diff --git a/book/chapters/fsCA.ipynb b/book/chapters/fsCA.ipynb index c019711..14a2911 100644 --- a/book/chapters/fsCA.ipynb +++ b/book/chapters/fsCA.ipynb @@ -7,7 +7,7 @@ "collapsed": false }, "source": [ - "# MODIS for fsCA" + "# 3.4 MODIS for fsCA" ] }, { @@ -52,6 +52,14 @@ "create an account in urs.earthdata.nasa.gov for earth access" ] }, + { + "cell_type": "markdown", + "id": "846abe3a", + "metadata": {}, + "source": [ + "## 3.4.1 Converting date to Julian format " + ] + }, { "cell_type": "code", "execution_count": 2, @@ -106,7 +114,7 @@ "id": "392d47a0", "metadata": {}, "source": [ - "### Convert HDF to GeoTIFF" + "## 3.4.2 Convert HDF to GeoTIFF" ] }, { @@ -158,7 +166,7 @@ "id": "402ae4ac", "metadata": {}, "source": [ - "### Convert All HDF Files in Folder" + "## 3.4.3 Convert All HDF Files in Folder" ] }, { @@ -192,7 +200,7 @@ "id": "6222844e", "metadata": {}, "source": [ - "### Merge GeoTIFF Files" + "## 3.4.4 Merge GeoTIFF Files" ] }, { @@ -231,6 +239,12 @@ "- If files are found, it merges them using gdalwarp and reprojects them to EPSG:4326." 
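The bullets above describe the merge at a high level, but the gdalwarp call itself is outside the lines shown here. A minimal sketch of the merge-and-reproject step; the folder layout, tile-name pattern, and paths are assumptions, not the notebook's actual values.

```python
import glob
import subprocess

def merge_tiles_for_date(tile_folder, date_str, output_tif):
    # Assumed naming pattern: tiles for one day carry that day's date string.
    tiles = sorted(glob.glob(f"{tile_folder}/*{date_str}*.tif"))
    if not tiles:
        print(f"No GeoTIFF tiles found for {date_str}, skipping")
        return
    # gdalwarp mosaics all input tiles into one raster and reprojects to EPSG:4326.
    subprocess.run(["gdalwarp", "-overwrite", "-t_srs", "EPSG:4326", *tiles, output_tif],
                   check=True)

# Example call with made-up paths:
# merge_tiles_for_date("../data/fsca_tifs", "2022-10-01", "../data/fsca_2022-10-01_merged.tif")
```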
] }, + { + "cell_type": "markdown", + "id": "075be3f7", + "metadata": {}, + "source": [] + }, { "cell_type": "code", "execution_count": 12, @@ -452,14 +466,6 @@ " #delete_files_in_folder(output_folder) # cleanup\n", "download_tiles_and_merge(start_date, end_date)" ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "af10ceaa", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { From 14a9e37d422890d0cfb2c3f5500079d11c243168 Mon Sep 17 00:00:00 2001 From: Sai Vivek Vangaveti Date: Thu, 15 Aug 2024 17:47:08 -0700 Subject: [PATCH 3/5] minor changes --- book/chapters/data_integration.ipynb | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/book/chapters/data_integration.ipynb b/book/chapters/data_integration.ipynb index c1fc3a1..5dc431f 100644 --- a/book/chapters/data_integration.ipynb +++ b/book/chapters/data_integration.ipynb @@ -138,16 +138,6 @@ "df = dd.read_csv('../data/model_training_data.csv')\n", "df.columns" ] - }, - { - "cell_type": "markdown", - "id": "48810821", - "metadata": {}, - "source": [ - "## Conclusion: The Ready-to-Train Dataset\n", - "\n", - "The outcome of this journey is a rich, comprehensive dataset that stands ready for training SWE prediction models. Through meticulous merging, preprocessing, and cleaning, we’ve prepared a dataset that encapsulates the complexity of the environment and the specificity of snowpack conditions, laying a solid foundation for accurate and reliable SWE predictions." - ] } ], "metadata": { From a18e5646f081e0e83850d453d3a9b95f6a8ff504 Mon Sep 17 00:00:00 2001 From: Sai Vivek Vangaveti Date: Thu, 15 Aug 2024 18:04:06 -0700 Subject: [PATCH 4/5] added code --- book/chapters/data_integration.ipynb | 512 +++++++++++++++++++++++++++ 1 file changed, 512 insertions(+) diff --git a/book/chapters/data_integration.ipynb b/book/chapters/data_integration.ipynb index 5dc431f..44037f8 100644 --- a/book/chapters/data_integration.ipynb +++ b/book/chapters/data_integration.ipynb @@ -15,6 +15,518 @@ "- **SNOTEL Data:** Provides specific insights into snowpack conditions.\n", "- **Terrain Data:** Brings in the geographical and physical characteristics of the landscape.\n", "\n", + "Lets look at the data" + ] + }, + { + "cell_type": "markdown", + "id": "c59df521", + "metadata": {}, + "source": [ + "### Climatology data" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "4d270d3e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
datelatloncell_idstation_idetrprrmaxrmintmmntmmxvpdvs
02001-01-0141.993149-120.17871576b55900-eb3d-4d25-a538-f74302ffe72dCDEC:ADM1.50.061.430.6265.7274.20.282.7
12001-01-0241.993149-120.17871576b55900-eb3d-4d25-a538-f74302ffe72dCDEC:ADM2.80.058.123.2266.6277.30.444.1
22001-01-0341.993149-120.17871576b55900-eb3d-4d25-a538-f74302ffe72dCDEC:ADM3.20.038.021.4268.4280.60.544.2
32001-01-0441.993149-120.17871576b55900-eb3d-4d25-a538-f74302ffe72dCDEC:ADM2.00.053.123.0269.7278.20.403.0
42001-01-0541.993149-120.17871576b55900-eb3d-4d25-a538-f74302ffe72dCDEC:ADM2.10.058.722.2270.6279.60.492.5
\n", + "
" + ], + "text/plain": [ + " date lat lon cell_id \\\n", + "0 2001-01-01 41.993149 -120.178715 76b55900-eb3d-4d25-a538-f74302ffe72d \n", + "1 2001-01-02 41.993149 -120.178715 76b55900-eb3d-4d25-a538-f74302ffe72d \n", + "2 2001-01-03 41.993149 -120.178715 76b55900-eb3d-4d25-a538-f74302ffe72d \n", + "3 2001-01-04 41.993149 -120.178715 76b55900-eb3d-4d25-a538-f74302ffe72d \n", + "4 2001-01-05 41.993149 -120.178715 76b55900-eb3d-4d25-a538-f74302ffe72d \n", + "\n", + " station_id etr pr rmax rmin tmmn tmmx vpd vs \n", + "0 CDEC:ADM 1.5 0.0 61.4 30.6 265.7 274.2 0.28 2.7 \n", + "1 CDEC:ADM 2.8 0.0 58.1 23.2 266.6 277.3 0.44 4.1 \n", + "2 CDEC:ADM 3.2 0.0 38.0 21.4 268.4 280.6 0.54 4.2 \n", + "3 CDEC:ADM 2.0 0.0 53.1 23.0 269.7 278.2 0.40 3.0 \n", + "4 CDEC:ADM 2.1 0.0 58.7 22.2 270.6 279.6 0.49 2.5 " + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import dask.dataframe as dd\n", + "climatology_data_path = '../data/training_ready_climatology_data.csv'\n", + "climatology_data = dd.read_csv(climatology_data_path)\n", + "climatology_data.head()" + ] + }, + { + "cell_type": "markdown", + "id": "80ae76d5", + "metadata": {}, + "source": [ + "### SNOTEL Data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c940b2fb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
DateSnow Water Equivalent (in) Start of Day ValuesChange In Snow Water Equivalent (in)Snow Depth (in) Start of Day ValuesChange In Snow Depth (in)Air Temperature Observed (degF) Start of Day Valuesstation_namestation_tripletstation_elevationstation_latstation_longmapping_station_idmapping_cell_idlatlon
02002-01-0320.30.3NaNNaN25.9Dismal Swamp446:CA:SNTL736041.99127-120.18033CDEC:ADM76b55900-eb3d-4d25-a538-f74302ffe72d41.993149-120.178715
12002-01-0420.40.1NaNNaN12.0Dismal Swamp446:CA:SNTL736041.99127-120.18033CDEC:ADM76b55900-eb3d-4d25-a538-f74302ffe72d41.993149-120.178715
22002-01-0520.40.0NaNNaN22.8Dismal Swamp446:CA:SNTL736041.99127-120.18033CDEC:ADM76b55900-eb3d-4d25-a538-f74302ffe72d41.993149-120.178715
32002-01-0620.50.1NaNNaN33.3Dismal Swamp446:CA:SNTL736041.99127-120.18033CDEC:ADM76b55900-eb3d-4d25-a538-f74302ffe72d41.993149-120.178715
42002-01-0721.20.7NaNNaN34.7Dismal Swamp446:CA:SNTL736041.99127-120.18033CDEC:ADM76b55900-eb3d-4d25-a538-f74302ffe72d41.993149-120.178715
\n", + "
" + ], + "text/plain": [ + " Date Snow Water Equivalent (in) Start of Day Values \\\n", + "0 2002-01-03 20.3 \n", + "1 2002-01-04 20.4 \n", + "2 2002-01-05 20.4 \n", + "3 2002-01-06 20.5 \n", + "4 2002-01-07 21.2 \n", + "\n", + " Change In Snow Water Equivalent (in) Snow Depth (in) Start of Day Values \\\n", + "0 0.3 NaN \n", + "1 0.1 NaN \n", + "2 0.0 NaN \n", + "3 0.1 NaN \n", + "4 0.7 NaN \n", + "\n", + " Change In Snow Depth (in) \\\n", + "0 NaN \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Air Temperature Observed (degF) Start of Day Values station_name \\\n", + "0 25.9 Dismal Swamp \n", + "1 12.0 Dismal Swamp \n", + "2 22.8 Dismal Swamp \n", + "3 33.3 Dismal Swamp \n", + "4 34.7 Dismal Swamp \n", + "\n", + " station_triplet station_elevation station_lat station_long \\\n", + "0 446:CA:SNTL 7360 41.99127 -120.18033 \n", + "1 446:CA:SNTL 7360 41.99127 -120.18033 \n", + "2 446:CA:SNTL 7360 41.99127 -120.18033 \n", + "3 446:CA:SNTL 7360 41.99127 -120.18033 \n", + "4 446:CA:SNTL 7360 41.99127 -120.18033 \n", + "\n", + " mapping_station_id mapping_cell_id lat \\\n", + "0 CDEC:ADM 76b55900-eb3d-4d25-a538-f74302ffe72d 41.993149 \n", + "1 CDEC:ADM 76b55900-eb3d-4d25-a538-f74302ffe72d 41.993149 \n", + "2 CDEC:ADM 76b55900-eb3d-4d25-a538-f74302ffe72d 41.993149 \n", + "3 CDEC:ADM 76b55900-eb3d-4d25-a538-f74302ffe72d 41.993149 \n", + "4 CDEC:ADM 76b55900-eb3d-4d25-a538-f74302ffe72d 41.993149 \n", + "\n", + " lon \n", + "0 -120.178715 \n", + "1 -120.178715 \n", + "2 -120.178715 \n", + "3 -120.178715 \n", + "4 -120.178715 " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "snotel_data_path = '../data/training_ready_snotel_data.csv'\n", + "snotel_data = dd.read_csv(snotel_data_path)\n", + "snotel_data.head()" + ] + }, + { + "cell_type": "markdown", + "id": "951a2c8c", + "metadata": {}, + "source": [ + "### Terrain Data" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "482017bf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
latlonelevationslopecurvatureaspecteastnessnorthness
041.993149-120.1787152290.836489.988850-9401.770540.6297300.5771960.649194
137.727154-119.1366692955.237089.991880-5259.116038.8858380.5605890.661430
238.918144-120.2056652481.005989.992966-9113.715040.5798570.5767320.649554
337.070608-118.7683613329.713689.975500-7727.095778.6985200.7756080.193519
436.364939-118.2922542851.131889.9755402352.6350123.9590000.692435-0.509422
\n", + "
" + ], + "text/plain": [ + " lat lon elevation slope curvature aspect \\\n", + "0 41.993149 -120.178715 2290.8364 89.988850 -9401.7705 40.629730 \n", + "1 37.727154 -119.136669 2955.2370 89.991880 -5259.1160 38.885838 \n", + "2 38.918144 -120.205665 2481.0059 89.992966 -9113.7150 40.579857 \n", + "3 37.070608 -118.768361 3329.7136 89.975500 -7727.0957 78.698520 \n", + "4 36.364939 -118.292254 2851.1318 89.975540 2352.6350 123.959000 \n", + "\n", + " eastness northness \n", + "0 0.577196 0.649194 \n", + "1 0.560589 0.661430 \n", + "2 0.576732 0.649554 \n", + "3 0.775608 0.193519 \n", + "4 0.692435 -0.509422 " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "terrain_data_path = '../data/training_ready_terrain_data.csv'\n", + "terrain_data = dd.read_csv(terrain_data_path)\n", + "terrain_data.head()" + ] + }, + { + "cell_type": "markdown", + "id": "3c8a584b", + "metadata": {}, + "source": [ "Each dataset comes packed with essential features like latitude, longitude, and date, ready to enrich our SWE prediction model.\n", "\n", "## 4.1.2 Integrating the Datasets\n", From 927ca4a1dd6180b365ebe11005d8e33800be74fa Mon Sep 17 00:00:00 2001 From: Sai Vivek Vangaveti Date: Fri, 16 Aug 2024 13:50:36 -0700 Subject: [PATCH 5/5] fixed import issue --- book/chapters/data_integration.ipynb | 16 ++++++------ book/chapters/feature_selection.ipynb | 37 +++++++++++++++++++++------ 2 files changed, 37 insertions(+), 16 deletions(-) diff --git a/book/chapters/data_integration.ipynb b/book/chapters/data_integration.ipynb index 44037f8..66e9e16 100644 --- a/book/chapters/data_integration.ipynb +++ b/book/chapters/data_integration.ipynb @@ -400,7 +400,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "id": "482017bf", "metadata": {}, "outputs": [ @@ -511,7 +511,7 @@ "4 0.692435 -0.509422 " ] }, - "execution_count": 4, + "execution_count": 3, "metadata": {}, "output_type": "execute_result" } @@ -541,7 +541,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 4, "id": "25039fe3", "metadata": {}, "outputs": [], @@ -578,7 +578,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 5, "id": "43de425b", "metadata": {}, "outputs": [], @@ -601,7 +601,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 6, "id": "c200ffde", "metadata": {}, "outputs": [ @@ -611,7 +611,7 @@ "['/Users/vangavetisaivivek/research/swe-workflow-book/book/data/model_training_data.csv']" ] }, - "execution_count": 22, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -622,7 +622,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 7, "id": "414a3ebf", "metadata": {}, "outputs": [ @@ -641,7 +641,7 @@ " dtype='object')" ] }, - "execution_count": 2, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } diff --git a/book/chapters/feature_selection.ipynb b/book/chapters/feature_selection.ipynb index dfcd488..4bba25f 100644 --- a/book/chapters/feature_selection.ipynb +++ b/book/chapters/feature_selection.ipynb @@ -27,16 +27,23 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "plaintext" + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['/Users/vangavetisaivivek/research/swe-workflow-book/book/data/model_training_cleaned.csv']" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" } - }, - "outputs": [], 
+ ], "source": [ + "import dask.dataframe as dd\n", "input_csv = '../data/model_training_data.csv'\n", - "\n", "# List of columns you want to extract\n", "selected_columns = ['date', 'lat', 'lon', 'etr', 'pr', 'rmax',\n", " 'rmin', 'tmmn', 'tmmx', 'vpd', 'vs', \n", @@ -64,8 +71,22 @@ } ], "metadata": { + "kernelspec": { + "display_name": "Python (base)", + "language": "python", + "name": "base" + }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" } }, "nbformat": 4,