Commit b2e0a03

Update notebooks

ArturoAmorQ committed Nov 18, 2024
1 parent 8982b3f commit b2e0a03
Showing 29 changed files with 328 additions and 258 deletions.
19 changes: 19 additions & 0 deletions notebooks/01_tabular_data_exploration.ipynb
@@ -98,6 +98,25 @@
"adult_census.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternative is to omit the `head` method. This would output the intial and\n",
"final rows and columns, but everything in between is not shown by default. It\n",
"also provides the dataframe's dimensions at the bottom in the format `n_rows`\n",
"x `n_columns`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"adult_census"
]
},
{
"cell_type": "markdown",
"metadata": {},
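The new markdown cell above notes that displaying the bare `adult_census` also reports the dataframe's dimensions. As an aside, the same information can be queried programmatically; a minimal sketch, assuming the `../datasets/adult-census.csv` path used throughout these notebooks:

```python
import pandas as pd

# Load the census dataset (same relative path as in the notebooks).
adult_census = pd.read_csv("../datasets/adult-census.csv")

# Displaying `adult_census` truncates the middle rows and prints the
# dimensions at the bottom; `.shape` returns them directly as a tuple.
n_rows, n_columns = adult_census.shape
print(f"{n_rows} rows x {n_columns} columns")
```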
8 changes: 4 additions & 4 deletions notebooks/02_numerical_pipeline_hands_on.ipynb
@@ -38,7 +38,7 @@
"adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
"# drop the duplicated column `\"education-num\"` as stated in the first notebook\n",
"adult_census = adult_census.drop(columns=\"education-num\")\n",
"adult_census.head()"
"adult_census"
]
},
{
@@ -64,7 +64,7 @@
"metadata": {},
"outputs": [],
"source": [
"data.head()"
"data"
]
},
{
@@ -157,7 +157,7 @@
"metadata": {},
"outputs": [],
"source": [
"data.head()"
"data"
]
},
{
@@ -177,7 +177,7 @@
"outputs": [],
"source": [
"numerical_columns = [\"age\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]\n",
"data[numerical_columns].head()"
"data[numerical_columns]"
]
},
{
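These hunks drop `.head()` so the full selections are displayed, including the hand-written `numerical_columns` list. As a sketch only (assuming the target column is named `"class"`, as elsewhere in this course), the same list could be derived from the dtypes:

```python
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
# Drop the duplicated column as the notebook does, then split off the target.
adult_census = adult_census.drop(columns="education-num")
data = adult_census.drop(columns="class")  # assumes the target is "class"

# Select numerical features by dtype instead of listing them by hand.
numerical_columns = data.select_dtypes("number").columns.tolist()
print(numerical_columns)
```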
6 changes: 3 additions & 3 deletions notebooks/02_numerical_pipeline_introduction.ipynb
@@ -53,7 +53,7 @@
"metadata": {},
"outputs": [],
"source": [
"adult_census.head()"
"adult_census"
]
},
{
@@ -86,14 +86,14 @@
"outputs": [],
"source": [
"data = adult_census.drop(columns=[target_name])\n",
"data.head()"
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now linger on the variables, also denominated features, that we later\n",
"We can now focus on the variables, also denominated features, that we later\n",
"use to build our predictive model. In addition, we can also check how many\n",
"samples are available in our dataset."
]
13 changes: 7 additions & 6 deletions notebooks/03_categorical_pipeline.ipynb
@@ -129,7 +129,7 @@
"outputs": [],
"source": [
"data_categorical = data[categorical_columns]\n",
"data_categorical.head()"
"data_categorical"
]
},
{
@@ -312,7 +312,7 @@
"outputs": [],
"source": [
"print(f\"The dataset is composed of {data_categorical.shape[1]} features\")\n",
"data_categorical.head()"
"data_categorical"
]
},
{
@@ -404,7 +404,7 @@
"and check the generalization performance of this machine learning pipeline using\n",
"cross-validation.\n",
"\n",
"Before we create the pipeline, we have to linger on the `native-country`.\n",
"Before we create the pipeline, we have to focus on the `native-country`.\n",
"Let's recall some statistics regarding this column."
]
},
@@ -529,9 +529,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, this representation of the categorical variables is\n",
"slightly more predictive of the revenue than the numerical variables\n",
"that we used previously."
"As you can see, this representation of the categorical variables is slightly\n",
"more predictive of the revenue than the numerical variables that we used\n",
"previously. The reason being that we have more (predictive) categorical\n",
"features than numerical ones."
]
},
{
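The reworded conclusion above rests on comparing cross-validated scores of the categorical and numerical representations. A minimal sketch of the categorical side of such a comparison, under the same assumptions as before (target column `"class"`; the notebook's actual pipeline may differ in its hyperparameters):

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.drop(columns=["class", "education-num"])

# Keep only the categorical columns, as in this notebook.
categorical_columns = selector(dtype_include=object)(data)
data_categorical = data[categorical_columns]

# One-hot encode the categories and cross-validate a linear model.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=500),
)
cv_results = cross_validate(model, data_categorical, target)
scores = cv_results["test_score"]
print(f"Mean cross-validated accuracy: {scores.mean():.3f}")
```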
2 changes: 1 addition & 1 deletion notebooks/03_categorical_pipeline_column_transformer.ipynb
@@ -244,7 +244,7 @@
"metadata": {},
"outputs": [],
"source": [
"data_test.head()"
"data_test"
]
},
{
4 changes: 2 additions & 2 deletions notebooks/cross_validation_ex_01.ipynb
@@ -52,7 +52,7 @@
"exercise.\n",
"\n",
"Also, this classifier can become more flexible/expressive by using a so-called\n",
"kernel that makes the model become non-linear. Again, no understanding regarding\n",
"kernel that makes the model become non-linear. Again, no undestanding regarding\n",
"the mathematics is required to accomplish this exercise.\n",
"\n",
"We will use an RBF kernel where a parameter `gamma` allows to tune the\n",
@@ -160,4 +160,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
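For readers who want to see the pieces named in this exercise in code, here is a hedged sketch of an RBF-kernel support vector classifier whose `gamma` parameter tunes the model's flexibility (the exercise's actual data and parameter values are not shown in this diff):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the exercise uses its own dataset.
X, y = make_classification(n_samples=200, random_state=0)

# `gamma` controls how far the influence of a single training sample
# reaches: larger values yield a more flexible (non-linear) model.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=0.01))
model.fit(X, y)
print(f"train accuracy: {model.score(X, y):.3f}")
```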
77 changes: 48 additions & 29 deletions notebooks/cross_validation_grouping.ipynb
@@ -5,9 +5,8 @@
"metadata": {},
"source": [
"# Sample grouping\n",
"We are going to linger into the concept of sample groups. As in the previous\n",
"section, we will give an example to highlight some surprising results. This\n",
"time, we will use the handwritten digits dataset."
"In this notebook we present the concept of **sample groups**. We use the\n",
"handwritten digits dataset to highlight some surprising results."
]
},
{
@@ -26,8 +25,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We will recreate the same model used in the previous notebook: a logistic\n",
"regression classifier with a preprocessor to scale the data."
"We create a model consisting of a logistic regression classifier with a\n",
"preprocessor to scale the data.\n",
"\n",
"<div class=\"admonition note alert alert-info\">\n",
"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
"<p class=\"last\">Here we use a <tt class=\"docutils literal\">MinMaxScaler</tt> as we know that each pixel's gray-scale is\n",
"strictly bounded between 0 (white) and 16 (black). This makes <tt class=\"docutils literal\">MinMaxScaler</tt>\n",
"more suited in this case than <tt class=\"docutils literal\">StandardScaler</tt>, as some pixels consistently\n",
"have low variance (pixels at the borders might almost always be zero if most\n",
"digits are centered in the image). Then, using <tt class=\"docutils literal\">StandardScaler</tt> can result in\n",
"a very high scaled value due to division by a small number.</p>\n",
"</div>"
]
},
{
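A sketch of the model that the note above motivates: `MinMaxScaler` followed by a logistic regression on the digits data (the notebook's exact hyperparameters are hidden in this collapsed hunk):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

data, target = load_digits(return_X_y=True)

# Pixel intensities are bounded in [0, 16], so min-max scaling maps them to
# [0, 1] without dividing by a per-pixel standard deviation that may be tiny.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
```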
@@ -47,8 +56,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the same baseline model. We will use a `KFold` cross-validation\n",
"without shuffling the data at first."
"The idea is to compare the estimated generalization performance using\n",
"different cross-validation techniques and see how such estimations are\n",
"impacted by underlying data structures. We first use a `KFold`\n",
"cross-validation without shuffling the data."
]
},
{
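Continuing the sketch above, the comparison this cell describes could look as follows (the fold count is an assumption, not necessarily the notebook's value):

```python
from sklearn.model_selection import KFold, cross_val_score

cv_no_shuffle = KFold(n_splits=3, shuffle=False)
cv_shuffle = KFold(n_splits=3, shuffle=True, random_state=0)

# Same model and data as in the previous sketch; only the CV changes.
scores_no_shuffle = cross_val_score(model, data, target, cv=cv_no_shuffle)
scores_shuffle = cross_val_score(model, data, target, cv=cv_shuffle)
print(f"no shuffling:   {scores_no_shuffle.mean():.3f}")
print(f"with shuffling: {scores_shuffle.mean():.3f}")
```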
@@ -97,9 +108,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We observe that shuffling the data improves the mean accuracy. We could go a\n",
"little further and plot the distribution of the testing score. We can first\n",
"concatenate the test scores."
"We observe that shuffling the data improves the mean accuracy. We can go a\n",
"little further and plot the distribution of the testing score. For such\n",
"purpose we concatenate the test scores."
]
},
{
@@ -120,7 +131,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's plot the distribution now."
"Let's now plot the score distributions."
]
},
{
@@ -131,7 +142,7 @@
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"all_scores.plot.hist(bins=10, edgecolor=\"black\", alpha=0.7)\n",
"all_scores.plot.hist(bins=16, edgecolor=\"black\", alpha=0.7)\n",
"plt.xlim([0.8, 1.0])\n",
"plt.xlabel(\"Accuracy score\")\n",
"plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
@@ -142,9 +153,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The cross-validation testing error that uses the shuffling has less variance\n",
"than the one that does not impose any shuffling. It means that some specific\n",
"fold leads to a low score in this case."
"Shuffling the data results in a higher cross-validated test accuracy with less\n",
"variance compared to when the data is not shuffled. It means that some\n",
"specific fold leads to a low score in this case."
]
},
{
@@ -160,9 +171,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Thus, there is an underlying structure in the data that shuffling will break\n",
"and get better results. To get a better understanding, we should read the\n",
"documentation shipped with the dataset."
"Thus, shuffling the data breaks the underlying structure and thus makes the\n",
"classification task easier to our model. To get a better understanding, we can\n",
"read the dataset description in more detail:"
]
},
{
@@ -263,7 +274,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check the grouping by plotting the indices linked to writer ids."
"We can check the grouping by plotting the indices linked to writers' ids."
]
},
{
@@ -284,8 +295,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we group the digits by writer, we can use cross-validation to take this\n",
"information into account: the class containing `Group` should be used."
"Once we group the digits by writer, we can incorporate this information into\n",
"the cross-validation process by using group-aware variations of the strategies\n",
"we have explored in this course, for example, the `GroupKFold` strategy."
]
},
{
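A sketch of the group-aware evaluation this cell introduces; here `groups` is a hypothetical array holding one writer id per sample (the notebook derives these ids from the dataset description):

```python
from sklearn.model_selection import GroupKFold, cross_val_score

# Samples sharing a group label (writer id) never end up in both the
# train and the test side of a split.
cv = GroupKFold(n_splits=5)  # the number of splits is an assumption
scores = cross_val_score(model, data, target, groups=groups, cv=cv)
print(f"GroupKFold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```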
@@ -309,10 +321,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that this strategy is less optimistic regarding the model\n",
"generalization performance. However, this is the most reliable if our goal is\n",
"to make handwritten digits recognition writers independent. Besides, we can as\n",
"well see that the standard deviation was reduced."
"We see that this strategy leads to a lower generalization performance than the\n",
"other two techniques. However, this is the most reliable estimate if our goal\n",
"is to evaluate the capabilities of the model to generalize to new unseen\n",
"writers. In this sense, shuffling the dataset (or alternatively using the\n",
"writers' ids as a new feature) would lead the model to memorize the different\n",
"writer's particular handwriting."
]
},
{
@@ -337,7 +351,7 @@
"metadata": {},
"outputs": [],
"source": [
"all_scores.plot.hist(bins=10, edgecolor=\"black\", alpha=0.7)\n",
"all_scores.plot.hist(bins=16, edgecolor=\"black\", alpha=0.7)\n",
"plt.xlim([0.8, 1.0])\n",
"plt.xlabel(\"Accuracy score\")\n",
"plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
Expand All @@ -348,9 +362,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As a conclusion, it is really important to take any sample grouping pattern\n",
"into account when evaluating a model. Otherwise, the results obtained will be\n",
"over-optimistic in regards with reality."
"In conclusion, accounting for any sample grouping patterns is crucial when\n",
"assessing a model\u2019s ability to generalize to new groups. Without this\n",
"consideration, the results may appear overly optimistic compared to the actual\n",
"performance.\n",
"\n",
"The interested reader can learn about other group-aware cross-validation\n",
"techniques in the [scikit-learn user\n",
"guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data)."
]
}
],
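For reference, the other group-aware iterators covered by the linked user guide can be swapped in the same way; a hedged sketch of the candidates (all accept the same `groups` argument as `GroupKFold`):

```python
from sklearn.model_selection import (
    GroupShuffleSplit,
    LeaveOneGroupOut,
    StratifiedGroupKFold,
)

# Any of these can replace `cv` in cross_val_score(..., groups=groups, cv=cv).
cv_candidates = [
    GroupShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
    LeaveOneGroupOut(),
    StratifiedGroupKFold(n_splits=5),
]
```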
2 changes: 1 addition & 1 deletion notebooks/cross_validation_learning_curve.ipynb
@@ -11,7 +11,7 @@
"generalizing. Besides these aspects, it is also important to understand how\n",
"the different errors are influenced by the number of samples available.\n",
"\n",
"In this notebook, we will show this aspect by looking a the variability of\n",
"In this notebook, we will show this aspect by looking at the variability of\n",
"the different errors.\n",
"\n",
"Let's first load the data and create the same model as in the previous\n",
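The corrected sentence above refers to studying errors as a function of the number of available samples. One way to produce such numbers in scikit-learn is `learning_curve`; a sketch on synthetic stand-in data (the notebook's own model and dataset are not shown in this diff):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-ins for the notebook's data and model.
data, target = make_regression(n_samples=1_000, noise=10.0, random_state=0)
model = DecisionTreeRegressor()

# Record train/test scores while growing the training set size.
train_sizes = np.linspace(0.1, 1.0, num=5)
cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)
_, train_scores, test_scores = learning_curve(
    model, data, target, train_sizes=train_sizes, cv=cv,
    scoring="neg_mean_absolute_error",
)
# The spread across CV splits shows how the error and its variability
# evolve as more samples become available.
print(test_scores.mean(axis=1))
```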
2 changes: 1 addition & 1 deletion notebooks/cross_validation_sol_01.ipynb
@@ -52,7 +52,7 @@
"exercise.\n",
"\n",
"Also, this classifier can become more flexible/expressive by using a so-called\n",
"kernel that makes the model become non-linear. Again, no requirement regarding\n",
"kernel that makes the model become non-linear. Again, no understanding regarding\n",
"the mathematics is required to accomplish this exercise.\n",
"\n",
"We will use an RBF kernel where a parameter `gamma` allows to tune the\n",