GWC-DCMB · echou89 · Jun 16, 2023 · Jun 16, 2023 · Jan 30, 2024 · smith-kyle
diff --git a/Lessons/Lesson22_Basic_Stats_II_Percents.ipynb b/Lessons/Lesson22_Basic_Stats_II_Percents.ipynb
@@ -83,7 +83,7 @@
     "id": "5ADm2TV-s7VG"
    },
    "source": [
-    "**Example 2:**  Let's learn to calculate percentages by using real world data. We will work with a dataset of Boston housing prices."
+    "**Example 2:**  Let's learn to calculate percentages by using real world data. We will work with a dataset of Ames, Iowa housing prices."
    ]
   },
   {
@@ -96,24 +96,25 @@
    },
    "outputs": [],
    "source": [
-    "# Import the load_boston method \n",
-    "from sklearn.datasets import load_boston"
+    "# Import the fetch_openml function and load the Ames housing dataset\n",
+    "from sklearn.datasets import fetch_openml\n",
+    "housing = fetch_openml(name='house_prices', as_frame=True, parser='auto')"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "colab": {},
-    "colab_type": "code",
-    "id": "9Q6sI8C0s7VL"
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {
+     "colab": {},
+     "colab_type": "code",
+     "id": "9Q6sI8C0s7VL"
+    },
+    "outputs": [],
+    "source": [
+     "# Import pandas, so that we can work with the data frame version of the Ames housing data\n",
+     "import pandas as pd"
+    ]
    },
-   "outputs": [],
-   "source": [
-    "# Import pandas, so that we can work with the data frame version of the Boston housing data\n",
-    "import pandas as pd"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -125,12 +126,10 @@
    },
    "outputs": [],
    "source": [
-    "# Load the dataset of housing prices in Boston, and convert to\n",
+    "# Load the dataset of housing prices in Ames, and convert to\n",
     "# a data frame format so it's easier to view and process\n",
-    "boston = load_boston()\n",
-    "boston_df = pd.DataFrame(boston['data'], columns = boston['feature_names'])\n",
-    "boston_df['PRICE'] = boston.target\n",
-    "boston_df"
+    "ames_df = pd.DataFrame(housing['data'])\n",
+    "ames_df"
    ]
   },
   {
@@ -140,7 +139,13 @@
     "id": "eyMUHGews7VZ"
    },
    "source": [
-    "CHAS is the indicator variable we used last week, where 1 indicates that the property (tract) is on the Charles River and 0 means otherwise."
+    "The `SaleCondition` column lists the condition of the house sale:\n",
+    "* `Normal`: Normal Sale\n",
+    "* `Abnorml`: Abnormal Sale - trade, foreclosure, short sale\n",
+    "* `AdjLand`: Adjoining Land Purchase\n",
+    "* `Alloca`: Allocation - two linked properties with separate deeds, typically condo with a garage unit\n",
+    "* `Family`: Sale between family members\n",
+    "* `Partial`: Home was not completed when last assessed (associated with New Homes)"
    ]
   },
   {
@@ -150,7 +155,7 @@
     "id": "IMpeHBEzs7VZ"
    },
    "source": [
-    "What percentage of the tracts bound the Charles River? We'll see how to do this using the query method AND using boolean indexing."
+    "What percentage of the houses were sold normally? We'll see how to do this using the query method AND using boolean indexing."
    ]
   },
   {
@@ -163,7 +168,7 @@
    },
    "outputs": [],
    "source": [
-    "# Determine number of tracts that bound the Charles River two ways:\n",
+    "# Determine number of houses that were sold normally in two ways:\n",
     "# (1) with the query function\n"
    ]
   },
@@ -200,10 +205,10 @@
    },
    "outputs": [],
    "source": [
-    "# Determine the total number of tracts in the dataset\n",
+    "# Determine the total number of houses in the dataset\n",
     "\n",
     "\n",
-    "# Now calculate the percentage of tracts that bounds the Charles River.\n"
+    "# Now calculate the percentage of houses sold normally in Ames.\n"
    ]
   },
   {
@@ -226,7 +231,7 @@
     "id": "kFGToww_s7Vg"
    },
    "source": [
-    "What percentage of tracts have a median price less than $10,000?"
+    "What percentage of houses have a price less than \\$200,000?"
    ]
   },
   {
@@ -239,10 +244,10 @@
    },
    "outputs": [],
    "source": [
-    "# Determine number of tracts that cost less than $10,000\n",
+    "# Determine number of houses that cost less than $200,000\n",
     "\n",
     "\n",
-    "# Calculate the percentage of tracts that cost less than $10k.\n"
+    "# Calculate the percentage of houses that cost less than $200k.\n"
    ]
   },
   {
@@ -252,7 +257,7 @@
     "id": "RLZ-k3L7s7Vq"
    },
    "source": [
-    "What percentage of tracts have a median price **between** \\$10,000 and \\$30,000?"
+    "What percentage of houses have a price **between** \\$200,000 and \\$500,000?"
    ]
   },
   {
@@ -265,13 +270,13 @@
    },
    "outputs": [],
    "source": [
-    "# Make an array of booleans with cost greater than $10,000 AND less than $30,000\n",
+    "# Make an array of booleans with cost greater than $200,000 AND less than $500,000\n",
     "\n",
     "\n",
-    "# Determine number of tracts that cost between $10,000 and $30,000\n",
+    "# Determine number of houses that cost between $200,000 and $500,000\n",
     "\n",
     "\n",
-    "# Calculate the percentage of tracts between $10,000 and $30,000\n"
+    "# Calculate the percentage of houses between $200,000 and $500,000\n"
    ]
   },
   {
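
For reference, here is a minimal sketch of the kind of answers the blank cells in Lesson22 appear to be asking for. It runs on a small made-up data frame (`toy_df`, hypothetical) rather than the real `ames_df`, and it assumes the price lives in a column named `SalePrice`. The diff builds `ames_df` from `housing['data']` only, so whether that column is present there or has to be added from `housing['target']` is not shown here and should be checked.

```python
import pandas as pd

# Hypothetical stand-in for ames_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "SaleCondition": ["Normal", "Abnorml", "Normal", "Partial", "Normal"],
    "SalePrice": [150000, 90000, 250000, 520000, 199000],  # assumed column name
})

# (1) Count normal sales with the query method
n_normal_query = len(toy_df.query("SaleCondition == 'Normal'"))

# (2) Count normal sales with boolean indexing
n_normal_bool = len(toy_df[toy_df["SaleCondition"] == "Normal"])

# Total number of houses, then the percentage sold normally
n_total = len(toy_df)
pct_normal = 100 * n_normal_query / n_total

# Percentage of houses priced below $200,000
pct_under_200k = 100 * (toy_df["SalePrice"] < 200000).sum() / n_total

# Percentage priced strictly between $200,000 and $500,000:
# combine two boolean arrays with & (each side needs its own parentheses)
between = (toy_df["SalePrice"] > 200000) & (toy_df["SalePrice"] < 500000)
pct_between = 100 * between.sum() / n_total

print(pct_normal, pct_under_200k, pct_between)
```

Both counting approaches give the same number; summing a boolean array counts its `True` entries, which is why `.sum()` works for the price questions.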

diff --git a/Lessons/Lesson24_Basic_Stats_IV_Significance.ipynb b/Lessons/Lesson24_Basic_Stats_IV_Significance.ipynb
@@ -43,7 +43,7 @@
    "metadata": {},
    "source": [
     "What is **hypothesis testing**? \n",
-    "- **Definition:  hypothesis testing** is the use of statistic to determine the probability that a given hypothesis is true. \n",
+    "- **Definition:  hypothesis testing** is the use of statistics to determine the probability that a given hypothesis is true. \n",
     "- There are two types of statistical hypotheses.\n",
     "    - **Definition**:  The **null hypothesis**, denoted by $H_o$, is usually the hypothesis that sample observations result purely from chance. The most common null hypothesis is that the variable in question is equal to 0, i.e. this indicates that the variable has zero effect on the outcome of interest.\n",
     "    - **Definition**:  The **alternative hypothesis**, denoted by $H_1$ or $H_a$, is the hypothesis that sample observations are influenced by some non-random cause. A common alternative hypothesis is that the variable in question has a non-zero effect on the outcome. "
@@ -67,7 +67,7 @@
    "metadata": {},
    "source": [
     "There are so many statistical tests that are appropriate for use if certain assumptions are met. But for this lesson, we will focus on the Student’s t-test.\n",
-    "- Here a link to all statistical tests:  https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/."
+    "- Here's a link to all statistical tests:  https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/."
    ]
   },
   {
@@ -314,7 +314,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**4. Compare the p-value to an acceptable significance value, $\\alpha$ and compare the test statistic to acceptable critical value(s)**. If p-value $< \\alpha$ and the test-statistic $\\geq$ +critical value or test-statistic $\\leq$ -critical value, that the observed effect is statistically significant, the null hypothesis is rejected, and the alternative hypothesis is valid.**\n",
+    "**4. Compare the p-value to an acceptable significance value, $\\alpha$, and compare the test statistic to acceptable critical value(s)**. If the p-value $< \\alpha$ and |test statistic| $\\geq$ critical value, the observed effect is statistically significant and the null hypothesis is rejected in favor of the alternative hypothesis.\n",
     "- p-value \n",
     "- t-statistic \n",
     "- Interpretation: "
@@ -332,7 +332,7 @@
    "metadata": {},
    "source": [
     "**Misconceptions** about statistical significance: \n",
-    "1. A low p-values implies a large effect.\n",
+    "1. A low p-value implies a large effect.\n",
     "    - **Proper interpretation**: A low p-value indicates that the outcome would be highly unlikely if the null hypothesis were true. A lower p-value does not usually equate to a large effect. There are cases when a low p-value can occur with a small effect. \n",
     "2. A non-significant outcome (AKA high p-value) means that the null hypothesis is probably true.\n",
     "    - **Proper interpretation**: A non-significant outcome (AKA high p-value) means that the data do not conclusively demonstrate that the null hypothesis is false. This is why we should say, \"When the p-value > 0.05, we fail to reject the null hypothesis.\" We should not say that we accept the null hypothesis when the p-value > 0.05."