Update Lesson24_Basic_Stats_IV_Significance.ipynb #51

Closed
wants to merge 3 commits into from
69 changes: 37 additions & 32 deletions Lessons/Lesson22_Basic_Stats_II_Percents.ipynb
@@ -83,7 +83,7 @@
"id": "5ADm2TV-s7VG"
},
"source": [
"**Example 2:** Let's learn to calculate percentages by using real world data. We will work with a dataset of Boston housing prices."
"**Example 2:** Let's learn to calculate percentages by using real world data. We will work with a dataset of Ames, Iowa housing prices."
]
},
{
@@ -96,24 +96,25 @@
},
"outputs": [],
"source": [
"# Import the load_boston method \n",
"from sklearn.datasets import load_boston"
"# Import the fetch_openml method \n",
"from sklearn.datasets import fetch_openml\n",
"housing = fetch_openml(name='house_prices', as_frame=True, parser='auto')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "9Q6sI8C0s7VL"
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "9Q6sI8C0s7VL"
},
"outputs": [],
"source": [
"# Import pandas, so that we can work with the data frame version of the Boston housing data\n",
"import pandas as pd"
]
},
"outputs": [],
"source": [
"# Import pandas, so that we can work with the data frame version of the Boston housing data\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -125,12 +126,10 @@
},
"outputs": [],
"source": [
"# Load the dataset of housing prices in Boston, and convert to\n",
"# Load the dataset of housing prices in Ames, and convert to\n",
"# a data frame format so it's easier to view and process\n",
"boston = load_boston()\n",
"boston_df = pd.DataFrame(boston['data'], columns = boston['feature_names'])\n",
"boston_df['PRICE'] = boston.target\n",
"boston_df"
"ames_df = pd.DataFrame(housing['data'])\n",
"ames_df"
]
},
{
@@ -140,7 +139,13 @@
"id": "eyMUHGews7VZ"
},
"source": [
"CHAS is the indicator variable we used last week, where 1 indicates that the property (tract) is on the Charles River and 0 means otherwise."
"The `SaleCondition` column lists the condition of the house sale:\n",
"* `Normal`: Normal Sale\n",
"* `Abnorml`: Abnormal Sale - trade, foreclosure, short sale\n",
"* `AdjLand`: Adjoining Land Purchase\n",
"* `Alloca`: Allocation - two linked properties with separate deeds, typically condo with a garage unit\n",
"* `Family`: Sale between family members\n",
"* `Partial`: Home was not completed when last assessed (associated with New Homes)",

Suggested change
"* `Partial`: Home was not completed when last assessed (associated with New Homes)",
"* `Partial`: Home was not completed when last assessed (associated with New Homes)"

This isn't a valid notebook because of the trailing `,` after the last list item. That's why the build is failing.
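
One way to check this locally, as a rough sketch: Python's standard `json` module rejects a trailing comma inside an array, and a notebook file is JSON, so the same strictness presumably applies when the build loads it. The two strings below are hypothetical, shortened stand-ins for the cell in this hunk, not the actual notebook contents, and `is_valid_json` is a helper name made up for illustration.

```python
import json

# Hypothetical, shortened stand-ins for the markdown cell in this hunk.
bad_cell = '{"source": ["* `Partial`: Home was not completed",]}'
good_cell = '{"source": ["* `Partial`: Home was not completed"]}'


def is_valid_json(text):
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# The version with the trailing comma fails to parse; the fixed one parses.
print("with trailing comma:", is_valid_json(bad_cell))
print("without trailing comma:", is_valid_json(good_cell))
```

Running the same kind of check on the whole `.ipynb` file before pushing would catch this class of error early.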

]
},
{
@@ -150,7 +155,7 @@
"id": "IMpeHBEzs7VZ"
},
"source": [
"What percentage of the tracts bound the Charles River? We'll see how to do this using the query method AND using boolean indexing."
"What percentage of the houses were sold normally? We'll see how to do this using the query method AND using boolean indexing."
]
},
{
@@ -163,7 +168,7 @@
},
"outputs": [],
"source": [
"# Determine number of tracts that bound the Charles River two ways:\n",
"# Determine number of houses that were sold normally in two ways:\n",
"# (1) with the query function\n"
]
},
@@ -200,10 +205,10 @@
},
"outputs": [],
"source": [
"# Determine the total number of tracts in the dataset\n",
"# Determine the total number of houses in the dataset\n",
"\n",
"\n",
"# Now calculate the percentage of tracts that bounds the Charles River.\n"
"# Now calculate the percentage of houses sold normally in Ames.\n"
]
},
{
@@ -226,7 +231,7 @@
"id": "kFGToww_s7Vg"
},
"source": [
"What percentage of tracts have a median price less than $10,000?"
"What percentage of houses have a price less than $200,000?"
]
},
{
@@ -239,10 +244,10 @@
},
"outputs": [],
"source": [
"# Determine number of tracts that cost less than $10,000\n",
"# Determine number of houses that cost less than $200,000\n",
"\n",
"\n",
"# Calculate the percentage of tracts that cost less than $10k.\n"
"# Calculate the percentage of houses that cost less than $200k.\n"
]
},
{
@@ -252,7 +257,7 @@
"id": "RLZ-k3L7s7Vq"
},
"source": [
"What percentage of tracts have a median price **between** \\$10,000 and \\$30,000?"
"What percentage of houses have a price **between** \\$200,000 and \\$500,000?"
]
},
{
@@ -265,13 +270,13 @@
},
"outputs": [],
"source": [
"# Make an array of booleans with cost greater than $10,000 AND less than $30,000\n",
"# Make an array of booleans with cost greater than $200,000 AND less than $500,000\n",
"\n",
"\n",
"# Determine number of tracts that cost between $10,000 and $30,000\n",
"# Determine number of houses that cost between $200,000 and $500,000\n",
"\n",
"\n",
"# Calculate the percentage of tracts between $10,000 and $30,000\n"
"# Calculate the percentage of houses between $200,000 and $500,000\n"
]
},
{
8 changes: 4 additions & 4 deletions Lessons/Lesson24_Basic_Stats_IV_Significance.ipynb
@@ -43,7 +43,7 @@
"metadata": {},
"source": [
"What is **hypothesis testing**? \n",
"- **Definition: hypothesis testing** is the use of statistic to determine the probability that a given hypothesis is true. \n",
"- **Definition: hypothesis testing** is the use of statistics to determine the probability that a given hypothesis is true. \n",
"- There are two types of statistical hypotheses.\n",
" - **Definition**: The **null hypothesis**, denoted by $H_o$, is usually the hypothesis that sample observations result purely from chance. The most common null hypothesis is that the variable in question is equal to 0, i.e. this indicates that the variable has zero effect on the outcome of interest.\n",
" - **Definition**: The **alternative hypothesis**, denoted by $H_1$ or $H_a$, is the hypothesis that sample observations are influenced by some non-random cause. A common alternative hypothesis is that the variable in question has a non-zero effect on the outcome. "
@@ -67,7 +67,7 @@
"metadata": {},
"source": [
"There are so many statistical tests that are appropriate for use if certain assumptions are met. But for this lesson, we will focus on the Student’s t-test.\n",
"- Here a link to all statistical tests: https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/."
"- Here's a link to all statistical tests: https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/."
]
},
{
@@ -314,7 +314,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**4. Compare the p-value to an acceptable significance value, $\\alpha$ and compare the test statistic to acceptable critical value(s)**. If p-value $< \\alpha$ and the test-statistic $\\geq$ +critical value or test-statistic $\\leq$ -critical value, that the observed effect is statistically significant, the null hypothesis is rejected, and the alternative hypothesis is valid.**\n",
"**4. Compare the p-value to an acceptable significance value, $\\alpha$, and compare the test statistic to acceptable critical value(s)**. If p-value $< \\alpha$, and the |test-statistic| $\\geq$ critical value, the observed effect is statistically significant, the null hypothesis is rejected, and the alternative hypothesis is valid.\n",
"- p-value \n",
"- t-statistic \n",
"- Interpretation: "
@@ -332,7 +332,7 @@
"metadata": {},
"source": [
"**Misconceptions** about statistical significance: \n",
"1. A low p-values implies a large effect.\n",
"1. A low p-value implies a large effect.\n",
" - **Proper interpretation**: A low p-value indicates that the outcome would be highly unlikely if the null hypothesis were true. A lower p-value does not usually equate to a large effect. There are cases when a low p-value can occur with a small effect. \n",
"2. A non-significant outcome (AKA high p-value) means that the null hypothesis is probably true.\n",
" - **Proper interpretation**: A non-significant outcome (AKA high p-value) means that the data do not conclusively demonstrate that the null hypothesis is false. This is why we should say, \"When the p-value > 0.05, we fail to reject the null hypothesis.\" We should not say that we accept the null hypothesis when the p-value > 0.05."