Commit b2e0a03

Update notebooks

ArturoAmorQ committed Nov 18, 2024
1 parent 8982b3f commit b2e0a03
Showing 29 changed files with 328 additions and 258 deletions.
19 changes: 19 additions & 0 deletions notebooks/01_tabular_data_exploration.ipynb
@@ -98,6 +98,25 @@
"adult_census.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternative is to omit the `head` method. This would output the intial and\n",
"final rows and columns, but everything in between is not shown by default. It\n",
"also provides the dataframe's dimensions at the bottom in the format `n_rows`\n",
"x `n_columns`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"adult_census"
]
},
{
"cell_type": "markdown",
"metadata": {},
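The new markdown cell above notes that displaying the bare `adult_census` also reports the dataframe's dimensions. As an aside, the same information can be queried programmatically; a minimal sketch, assuming the `../datasets/adult-census.csv` path used throughout these notebooks:

```python
import pandas as pd

# Load the census dataset (same relative path as in the notebooks).
adult_census = pd.read_csv("../datasets/adult-census.csv")

# Displaying `adult_census` truncates the middle rows and prints the
# dimensions at the bottom; `.shape` returns them directly as a tuple.
n_rows, n_columns = adult_census.shape
print(f"{n_rows} rows x {n_columns} columns")
```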
8 changes: 4 additions & 4 deletions notebooks/02_numerical_pipeline_hands_on.ipynb
@@ -38,7 +38,7 @@
"adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
"# drop the duplicated column `\"education-num\"` as stated in the first notebook\n",
"adult_census = adult_census.drop(columns=\"education-num\")\n",
"adult_census.head()"
"adult_census"
]
},
{
@@ -64,7 +64,7 @@
"metadata": {},
"outputs": [],
"source": [
"data.head()"
"data"
]
},
{
@@ -157,7 +157,7 @@
"metadata": {},
"outputs": [],
"source": [
"data.head()"
"data"
]
},
{
@@ -177,7 +177,7 @@
"outputs": [],
"source": [
"numerical_columns = [\"age\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]\n",
"data[numerical_columns].head()"
"data[numerical_columns]"
]
},
{
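These hunks drop `.head()` so the full selections are displayed, including the hand-written `numerical_columns` list. As a sketch only (assuming the target column is named `"class"`, as elsewhere in this course), the same list could be derived from the dtypes:

```python
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
# Drop the duplicated column as the notebook does, then split off the target.
adult_census = adult_census.drop(columns="education-num")
data = adult_census.drop(columns="class")  # assumes the target is "class"

# Select numerical features by dtype instead of listing them by hand.
numerical_columns = data.select_dtypes("number").columns.tolist()
print(numerical_columns)
```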
6 changes: 3 additions & 3 deletions notebooks/02_numerical_pipeline_introduction.ipynb
@@ -53,7 +53,7 @@
"metadata": {},
"outputs": [],
"source": [
"adult_census.head()"
"adult_census"
]
},
{
@@ -86,14 +86,14 @@
"outputs": [],
"source": [
"data = adult_census.drop(columns=[target_name])\n",
"data.head()"
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now linger on the variables, also denominated features, that we later\n",
"We can now focus on the variables, also denominated features, that we later\n",
"use to build our predictive model. In addition, we can also check how many\n",
"samples are available in our dataset."
]
13 changes: 7 additions & 6 deletions notebooks/03_categorical_pipeline.ipynb
@@ -129,7 +129,7 @@
"outputs": [],
"source": [
"data_categorical = data[categorical_columns]\n",
"data_categorical.head()"
"data_categorical"
]
},
{
@@ -312,7 +312,7 @@
"outputs": [],
"source": [
"print(f\"The dataset is composed of {data_categorical.shape[1]} features\")\n",
"data_categorical.head()"
"data_categorical"
]
},
{
@@ -404,7 +404,7 @@
"and check the generalization performance of this machine learning pipeline using\n",
"cross-validation.\n",
"\n",
"Before we create the pipeline, we have to linger on the `native-country`.\n",
"Before we create the pipeline, we have to focus on the `native-country`.\n",
"Let's recall some statistics regarding this column."
]
},
@@ -529,9 +529,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, this representation of the categorical variables is\n",
"slightly more predictive of the revenue than the numerical variables\n",
"that we used previously."
"As you can see, this representation of the categorical variables is slightly\n",
"more predictive of the revenue than the numerical variables that we used\n",
"previously. The reason being that we have more (predictive) categorical\n",
"features than numerical ones."
]
},
{
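The reworded conclusion above rests on comparing cross-validated scores of the categorical and numerical representations. A minimal sketch of the categorical side of such a comparison, under the same assumptions as before (target column `"class"`; the notebook's actual pipeline may differ in its hyperparameters):

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.drop(columns=["class", "education-num"])

# Keep only the categorical columns, as in this notebook.
categorical_columns = selector(dtype_include=object)(data)
data_categorical = data[categorical_columns]

# One-hot encode the categories and cross-validate a linear model.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=500),
)
cv_results = cross_validate(model, data_categorical, target)
scores = cv_results["test_score"]
print(f"Mean cross-validated accuracy: {scores.mean():.3f}")
```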
2 changes: 1 addition & 1 deletion notebooks/03_categorical_pipeline_column_transformer.ipynb
@@ -244,7 +244,7 @@
"metadata": {},
"outputs": [],
"source": [
"data_test.head()"
"data_test"
]
},
{
4 changes: 2 additions & 2 deletions notebooks/cross_validation_ex_01.ipynb
@@ -52,7 +52,7 @@
"exercise.\n",
"\n",
"Also, this classifier can become more flexible/expressive by using a so-called\n",
"kernel that makes the model become non-linear. Again, no understanding regarding\n",
"kernel that makes the model become non-linear. Again, no undestanding regarding\n",
"the mathematics is required to accomplish this exercise.\n",
"\n",
"We will use an RBF kernel where a parameter `gamma` allows to tune the\n",
@@ -160,4 +160,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
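For readers who want to see the pieces named in this exercise in code, here is a hedged sketch of an RBF-kernel support vector classifier whose `gamma` parameter tunes the model's flexibility (the exercise's actual data and parameter values are not shown in this diff):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the exercise uses its own dataset.
X, y = make_classification(n_samples=200, random_state=0)

# `gamma` controls how far the influence of a single training sample
# reaches: larger values yield a more flexible (non-linear) model.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=0.01))
model.fit(X, y)
print(f"train accuracy: {model.score(X, y):.3f}")
```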
77 changes: 48 additions & 29 deletions notebooks/cross_validation_grouping.ipynb
@@ -5,9 +5,8 @@
"metadata": {},
"source": [
"# Sample grouping\n",
"We are going to linger into the concept of sample groups. As in the previous\n",
"section, we will give an example to highlight some surprising results. This\n",
"time, we will use the handwritten digits dataset."
"In this notebook we present the concept of **sample groups**. We use the\n",
"handwritten digits dataset to highlight some surprising results."
]
},
{
@@ -26,8 +25,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We will recreate the same model used in the previous notebook: a logistic\n",
"regression classifier with a preprocessor to scale the data."
"We create a model consisting of a logistic regression classifier with a\n",
"preprocessor to scale the data.\n",
"\n",
"<div class=\"admonition note alert alert-info\">\n",
"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
"<p class=\"last\">Here we use a <tt class=\"docutils literal\">MinMaxScaler</tt> as we know that each pixel's gray-scale is\n",
"strictly bounded between 0 (white) and 16 (black). This makes <tt class=\"docutils literal\">MinMaxScaler</tt>\n",
"more suited in this case than <tt class=\"docutils literal\">StandardScaler</tt>, as some pixels consistently\n",
"have low variance (pixels at the borders might almost always be zero if most\n",
"digits are centered in the image). Then, using <tt class=\"docutils literal\">StandardScaler</tt> can result in\n",
"a very high scaled value due to division by a small number.</p>\n",
"</div>"
]
},
{
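A sketch of the model that the note above motivates: `MinMaxScaler` followed by a logistic regression on the digits data (the notebook's exact hyperparameters are hidden in this collapsed hunk):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

data, target = load_digits(return_X_y=True)

# Pixel intensities are bounded in [0, 16], so min-max scaling maps them to
# [0, 1] without dividing by a per-pixel standard deviation that may be tiny.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
```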
@@ -47,8 +56,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the same baseline model. We will use a `KFold` cross-validation\n",
"without shuffling the data at first."
"The idea is to compare the estimated generalization performance using\n",
"different cross-validation techniques and see how such estimations are\n",
"impacted by underlying data structures. We first use a `KFold`\n",
"cross-validation without shuffling the data."
]
},
{
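Continuing the sketch above, the comparison this cell describes could look as follows (the fold count is an assumption, not necessarily the notebook's value):

```python
from sklearn.model_selection import KFold, cross_val_score

cv_no_shuffle = KFold(n_splits=3, shuffle=False)
cv_shuffle = KFold(n_splits=3, shuffle=True, random_state=0)

# Same model and data as in the previous sketch; only the CV changes.
scores_no_shuffle = cross_val_score(model, data, target, cv=cv_no_shuffle)
scores_shuffle = cross_val_score(model, data, target, cv=cv_shuffle)
print(f"no shuffling:   {scores_no_shuffle.mean():.3f}")
print(f"with shuffling: {scores_shuffle.mean():.3f}")
```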
@@ -97,9 +108,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We observe that shuffling the data improves the mean accuracy. We could go a\n",
"little further and plot the distribution of the testing score. We can first\n",
"concatenate the test scores."
"We observe that shuffling the data improves the mean accuracy. We can go a\n",
"little further and plot the distribution of the testing score. For such\n",
"purpose we concatenate the test scores."
]
},
{
@@ -120,7 +131,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's plot the distribution now."
"Let's now plot the score distributions."
]
},
{
@@ -131,7 +142,7 @@
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"all_scores.plot.hist(bins=10, edgecolor=\"black\", alpha=0.7)\n",
"all_scores.plot.hist(bins=16, edgecolor=\"black\", alpha=0.7)\n",
"plt.xlim([0.8, 1.0])\n",
"plt.xlabel(\"Accuracy score\")\n",
"plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
@@ -142,9 +153,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The cross-validation testing error that uses the shuffling has less variance\n",
"than the one that does not impose any shuffling. It means that some specific\n",
"fold leads to a low score in this case."
"Shuffling the data results in a higher cross-validated test accuracy with less\n",
"variance compared to when the data is not shuffled. It means that some\n",
"specific fold leads to a low score in this case."
]
},
{
@@ -160,9 +171,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Thus, there is an underlying structure in the data that shuffling will break\n",
"and get better results. To get a better understanding, we should read the\n",
"documentation shipped with the dataset."
"Thus, shuffling the data breaks the underlying structure and thus makes the\n",
"classification task easier to our model. To get a better understanding, we can\n",
"read the dataset description in more detail:"
]
},
{
@@ -263,7 +274,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check the grouping by plotting the indices linked to writer ids."
"We can check the grouping by plotting the indices linked to writers' ids."
]
},
{
@@ -284,8 +295,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we group the digits by writer, we can use cross-validation to take this\n",
"information into account: the class containing `Group` should be used."
"Once we group the digits by writer, we can incorporate this information into\n",
"the cross-validation process by using group-aware variations of the strategies\n",
"we have explored in this course, for example, the `GroupKFold` strategy."
]
},
{
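A sketch of the group-aware evaluation this cell introduces; here `groups` is a hypothetical array holding one writer id per sample (the notebook derives these ids from the dataset description):

```python
from sklearn.model_selection import GroupKFold, cross_val_score

# Samples sharing a group label (writer id) never end up in both the
# train and the test side of a split.
cv = GroupKFold(n_splits=5)  # the number of splits is an assumption
scores = cross_val_score(model, data, target, groups=groups, cv=cv)
print(f"GroupKFold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```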
@@ -309,10 +321,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that this strategy is less optimistic regarding the model\n",
"generalization performance. However, this is the most reliable if our goal is\n",
"to make handwritten digits recognition writers independent. Besides, we can as\n",
"well see that the standard deviation was reduced."
"We see that this strategy leads to a lower generalization performance than the\n",
"other two techniques. However, this is the most reliable estimate if our goal\n",
"is to evaluate the capabilities of the model to generalize to new unseen\n",
"writers. In this sense, shuffling the dataset (or alternatively using the\n",
"writers' ids as a new feature) would lead the model to memorize the different\n",
"writer's particular handwriting."
]
},
{
@@ -337,7 +351,7 @@
"metadata": {},
"outputs": [],
"source": [
"all_scores.plot.hist(bins=10, edgecolor=\"black\", alpha=0.7)\n",
"all_scores.plot.hist(bins=16, edgecolor=\"black\", alpha=0.7)\n",
"plt.xlim([0.8, 1.0])\n",
"plt.xlabel(\"Accuracy score\")\n",
"plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
Expand All @@ -348,9 +362,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As a conclusion, it is really important to take any sample grouping pattern\n",
"into account when evaluating a model. Otherwise, the results obtained will be\n",
"over-optimistic in regards with reality."
"In conclusion, accounting for any sample grouping patterns is crucial when\n",
"assessing a model\u2019s ability to generalize to new groups. Without this\n",
"consideration, the results may appear overly optimistic compared to the actual\n",
"performance.\n",
"\n",
"The interested reader can learn about other group-aware cross-validation\n",
"techniques in the [scikit-learn user\n",
"guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data)."
]
}
],
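For reference, the other group-aware iterators covered by the linked user guide can be swapped in the same way; a hedged sketch of the candidates (all accept the same `groups` argument as `GroupKFold`):

```python
from sklearn.model_selection import (
    GroupShuffleSplit,
    LeaveOneGroupOut,
    StratifiedGroupKFold,
)

# Any of these can replace `cv` in cross_val_score(..., groups=groups, cv=cv).
cv_candidates = [
    GroupShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
    LeaveOneGroupOut(),
    StratifiedGroupKFold(n_splits=5),
]
```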
2 changes: 1 addition & 1 deletion notebooks/cross_validation_learning_curve.ipynb
@@ -11,7 +11,7 @@
"generalizing. Besides these aspects, it is also important to understand how\n",
"the different errors are influenced by the number of samples available.\n",
"\n",
"In this notebook, we will show this aspect by looking a the variability of\n",
"In this notebook, we will show this aspect by looking at the variability of\n",
"the different errors.\n",
"\n",
"Let's first load the data and create the same model as in the previous\n",
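The corrected sentence above refers to studying errors as a function of the number of available samples. One way to produce such numbers in scikit-learn is `learning_curve`; a sketch on synthetic stand-in data (the notebook's own model and dataset are not shown in this diff):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-ins for the notebook's data and model.
data, target = make_regression(n_samples=1_000, noise=10.0, random_state=0)
model = DecisionTreeRegressor()

# Record train/test scores while growing the training set size.
train_sizes = np.linspace(0.1, 1.0, num=5)
cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)
_, train_scores, test_scores = learning_curve(
    model, data, target, train_sizes=train_sizes, cv=cv,
    scoring="neg_mean_absolute_error",
)
# The spread across CV splits shows how the error and its variability
# evolve as more samples become available.
print(test_scores.mean(axis=1))
```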
2 changes: 1 addition & 1 deletion notebooks/cross_validation_sol_01.ipynb
@@ -52,7 +52,7 @@
"exercise.\n",
"\n",
"Also, this classifier can become more flexible/expressive by using a so-called\n",
"kernel that makes the model become non-linear. Again, no requirement regarding\n",
"kernel that makes the model become non-linear. Again, no understanding regarding\n",
"the mathematics is required to accomplish this exercise.\n",
"\n",
"We will use an RBF kernel where a parameter `gamma` allows to tune the\n",