(chapters:appendix)=
# Appendix

On this page we further develop some of the concepts used in the previous chapters.

(appendix:violinplot)=
## Violin Plot

Violin plots are a powerful tool that puts together the information of box plots and KDE plots.
```{code-cell} ipython3
:tags: [hide-input]
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=200)
fig, ax = plt.subplots(1, 1)
plt.tight_layout()
sns.violinplot(y=x, ax=ax)
plt.show()
```

As we already said, a violin plot is the combination of a box plot and a KDE plot. Let's have a look at each of them first.
Box plots are used to depict numerical data through their quartiles.
```{code-cell} ipython3
:tags: [hide-input]
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=200)
fig, ax = plt.subplots(1, 1)
plt.tight_layout()
sns.boxplot(y=x, ax=ax, whis=1.5)
# Median and quartiles
med = np.percentile(x, 50)
fq = np.percentile(x, 25)
tq = np.percentile(x, 75)
# Inter-quartile range
iqr = tq - fq
# Whiskers ("minimum" and "maximum"): most extreme observations within 1.5*IQR of the quartiles
xs = np.sort(x)
mmin = xs[xs >= fq - 1.5 * iqr][0]
mmax = xs[xs <= tq + 1.5 * iqr][-1]
# Lines with text
ax.axhline(fq, ls='--', color='r', label='First quartile')
ax.text(-0.45, fq - 0.3, "First quartile", fontsize=12)
ax.axhline(tq, ls='-.', color='r', label='Third quartile')
ax.text(-0.45, tq + 0.1, "Third quartile", fontsize=12)
ax.axhline(med, ls='--', color='g', label='Median')
ax.axhline(mmin, ls='--', color='b', label='Minimum')
ax.text(-0.45, mmin - 0.3, "Minimum", fontsize=12)
ax.axhline(mmax, ls='-.', color='b', label='Maximum')
ax.text(-0.45, mmax + 0.1, "Maximum", fontsize=12)
plt.show()
```
The main concept behind this representation is the percentile: a percentile is a score below which a given percentage of the data falls. The ordinal rank of the $n$-th percentile is

$$ P_n = \frac{n}{100}N $$

where:

- $n$ is the desired percentile
- $N$ is the total number of observations
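As a quick sanity check, here is a minimal sketch of the formula above under the nearest-rank convention (an assumption of this example); the helper function is made up for illustration and is compared against NumPy's built-in percentile:

```python
import numpy as np

np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=200)

def nearest_rank_percentile(data, n):
    """n-th percentile via the nearest-rank rule: rank = ceil(P_n), with P_n = n/100 * N."""
    xs = np.sort(data)
    rank = int(np.ceil(n / 100 * len(xs)))  # ordinal rank from the formula above
    return xs[max(rank - 1, 0)]             # ranks are 1-based, array indices 0-based

print(nearest_rank_percentile(x, 25))        # first quartile under the nearest-rank rule
print(np.percentile(x, 25, method="lower"))  # NumPy's closest convention (may differ slightly)
```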
The components needed to build this kind of plot are:

- First Quartile ($P_{25}$): median of the lower half of the dataset
- Second Quartile ($P_{50}$): median of the dataset
- Third Quartile ($P_{75}$): median of the upper half of the dataset
- Inter-Quartile Range ($\text{IQR} = P_{75} - P_{25}$): distance between the third and the first quartile
- Minimum ($P_{25} - 1.5\,\text{IQR}$): 1.5 times the IQR below the first quartile
- Maximum ($P_{75} + 1.5\,\text{IQR}$): 1.5 times the IQR above the third quartile
- Outliers: any point falling above the maximum or below the minimum

The best feature of box plots is that they summarize the data distribution with only five values (first, second and third quartile, minimum and maximum). They give a clear, although rough, understanding of the distribution of our data.
Kernel Density Estimation Plots represent an estimated probability density function based on a finite data sample.
```{code-cell} ipython3
:tags: [hide-input]
from scipy.stats import norm
from scipy.integrate import simpson
sns.set_theme()
# generate sample data
np.random.seed(42)
data = []
underlying_distrib = np.random.normal(-10, 7, 300)
data.extend(np.random.choice(underlying_distrib[:150], 8))
data.extend(np.random.choice(underlying_distrib[-150:], 8))
samples = 200
# bandwidth parameter (standard deviation of each kernel)
bandwidth=1
# grid points
xmin = min(data) - 3 * bandwidth
xmax = max(data) + 3 * bandwidth
x = np.linspace(xmin, xmax, samples)
# compute KDE by hand
kde = np.zeros(samples)
kdes = []
for val in data:
kdes.append(norm.pdf(x, loc=val, scale=bandwidth))
kde = np.sum(kdes, axis=0)
# normalize
norm_fact = simpson(kde, x=x)  # scipy.integrate.simps was renamed to simpson in recent SciPy
kde /= norm_fact
# plot computed kde
# in scipy, the bandwidth is scaled by the standard deviation.
# In order to be able to compare our implementation with scipy's, we must
# divide the bandwidth by the standard deviation of the sample.
bandwidth=1/np.std(data)
plt.figure(figsize=(12, 9))
ax = sns.kdeplot(data, label="seaborn", bw_method=bandwidth)
sns.lineplot(x=x, y=kde, label="KDE", ax=ax)
# plot single kernels
for kd in kdes:
# https://gsalvatovallverdu.gitlab.io/python/kernel_density_estimation/
sns.lineplot(x=x, y=kd/norm_fact, ax=ax, palette="Paired")
for line in ax.lines[1:]:
line.set_linestyle("--")
# plot sample data
sns.rugplot(data, ax=ax, palette="Paired")
plt.legend();
plt.show()
```
Kernel Density Estimation is a procedure used to estimate a density function starting from a sample drawn from an unknown underlying distribution. In order to build such an estimated density function, we must choose an appropriate kernel. Many kernels are available; in our explanation we will use the normal (Gaussian) kernel. The following equation shows how we build the KDE:

$$ \widehat{f}_h\left(x\right) = \frac{1}{m}\sum_{i=1}^{n} K\Big(\frac{x - x_i}{h}\Big) $$

where:

- $K$ is the Gaussian kernel centered at a given point $x_i$
- $h$ is the bandwidth of the kernel (the standard deviation). This parameter controls the smoothness of the estimated function.
- $m$ is a scaling factor. Seaborn's implementation of KDE uses the integral of the sum of the kernels as the scaling factor.
From an intuitive point of view, we are centering a Gaussian distribution at each of our sample points, and calling the sum of these distributions our KDE.

KDE plots give a powerful representation of what we expect the underlying distribution of our sample data to be. KDE is a non-parametric method, meaning that we do not need strong assumptions on the unknown distribution, but it is very sensitive to the choice of bandwidth and to the amount of data.
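To see this sensitivity in practice, here is a small sketch (assuming the same `data` sample defined in the cell above) that overlays KDEs computed with different `bw_adjust` values in seaborn:

```python
# A minimal bandwidth-sensitivity sketch, assuming `data` from the KDE cell above.
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 5))
for bw in (0.2, 1.0, 3.0):
    # bw_adjust multiplies seaborn's automatically selected bandwidth
    sns.kdeplot(data, bw_adjust=bw, ax=ax, label=f"bw_adjust={bw}")
sns.rugplot(data, ax=ax)
ax.legend()
plt.show()
```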
The violin plot puts together the information of the box plot and the KDE, making it even more powerful information-wise. The side effect is that it requires a deeper understanding in order to be read correctly, so it can be intimidating, which makes it less common.
All these plots are commonly used to show differences among populations.
(appendix:chi-squared)=
## Chi-Squared Test

The chi-squared test is a statistical hypothesis test that can be carried out when the test statistic is chi-squared distributed under the null hypothesis. The way we use Pearson's chi-squared test to check for statistical independence is the following.

Suppose we have data from 1000 athletes, recording the brand of their shoes and their final placing. Both features are categorical; suppose they are encoded ordinally.
```{code-cell} ipython3
:tags: [hide-input]
import pandas as pd
np.random.seed(42)
samples = 1000
# There are 3 brands of shoes
shoes = [1, 2, 3]
# There are 4 possible final positions: 1st, 2nd, 3rd and 0 for not-on-the-podium
podium = [1, 2, 3, 0]
# Let's insert some dependence between brand of shoes and final place
podium_rigged_good = [1, 2, 3, 0] * 3 + [1]
podium_rigged_bad = [1, 2, 3, 0] * 3 + [0]
rows = []
for i in range(samples):
    shoe = np.random.choice(shoes)
    # brand 1 will be better than all the others
    if shoe == 1:
        pos = np.random.choice(podium_rigged_good)
    # brand 2 will be worse
    elif shoe == 2:
        pos = np.random.choice(podium_rigged_bad)
    # and brand 3 will give equal chance to all positions
    else:
        pos = np.random.choice(podium)
    rows.append({'shoes': shoe, 'podium': pos})
# DataFrame.append was removed in pandas 2.0, so we build the frame from a list of rows
df_fake = pd.DataFrame(rows, columns=["shoes", "podium"])
df_fake
```
Our observation is made up of two values (one for shoes and one for podium), and we want our $H_0$ to say that the two random variables are statistically independent. To carry out the test we must compute the contingency table, counting the occurrences of each pair of values.
```{code-cell} ipython3
:tags: [hide-input]
data_crosstab = pd.crosstab(df_fake['shoes'], df_fake['podium'])
data_crosstab
```
In a standard chi-squared test, we would formulate the null hypothesis by choosing some expected frequencies for each variable, which would give us theoretical expected values for our observations. The resulting test statistic assumes the form:

$$\chi^2 = \sum_{i=1}^{n} \frac{\left(O_i - E_i\right)^2}{E_i} = N \sum_{i=1}^{n} \frac{\left(O_i/N - p_i\right)^2}{p_i}$$

Since in this case our null hypothesis is that the two variables are independent, the expected counts are derived from the marginal frequencies of the contingency table:

$$E_{i,j} = N \, p_{i\cdot} \, p_{\cdot j}$$

where:

- $p_{i\cdot} = \frac{O_{i\cdot}}{N} = \sum_{j=1}^{c} \frac{O_{i,j}}{N}$ is the fraction of observations ignoring the column attribute
- $p_{\cdot j} = \frac{O_{\cdot j}}{N} = \sum_{i=1}^{r} \frac{O_{i,j}}{N}$ is the fraction of observations ignoring the row attribute

The basic idea of the test is to compare the actual counts with the expected counts: we compute the squared difference between observed and expected counts, divide by the expected count, and sum over all the possible pairs of values. This gives us our test statistic:
```{code-cell} ipython3
:tags: [hide-input]
from scipy.stats import chi2_contingency
chi_stat, p, dof, expected = chi2_contingency(data_crosstab)
print(f"chi-squared statistic:\t{chi_stat:.4f}")
print(f"p-value:\t\t{p:.4f}")
```
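As a cross-check (not part of the original analysis), the expected counts and the statistic can also be recomputed by hand from the marginals of `data_crosstab`, reusing the imports from the cells above:

```python
# A minimal by-hand check, assuming `data_crosstab` and numpy from the cells above.
observed = data_crosstab.to_numpy().astype(float)
N = observed.sum()
# Expected counts under independence: E_ij = N * p_i. * p_.j
p_row = observed.sum(axis=1) / N
p_col = observed.sum(axis=0) / N
expected_by_hand = N * np.outer(p_row, p_col)
chi_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()
print(f"chi-squared by hand:\t{chi_by_hand:.4f}")  # should match chi2_contingency's statistic
```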
We then choose a significance level $\alpha$ for our test and find the corresponding critical value:
```{code-cell} ipython3
:tags: [hide-input]
from scipy.stats import chi2
# Choose the significance level and find the critical value
prob = 0.95
critical = chi2.ppf(prob, dof)
print(f'critical value: {critical:.4f} (alpha={1-prob:.2f})')
```
We now need to compare our test statistic with the critical value. We reject the null hypothesis when the test statistic is more extreme than the critical value (a right-tailed test: to reject $H_0$, the statistic must fall to the right of the critical value):
```{code-cell} ipython3
:tags: [hide-input]
from scipy.stats import chi2
plt.figure(figsize=(12, 9))
# Choose the significance level and find the critical value
prob = 0.95
critical = chi2.ppf(prob, dof)
# Build X axis
x = np.arange(0, 30, .05)
# Draw the chi-squared pdf with the appropriate dof, i.e. (rows - 1) * (cols - 1)
ax = sns.lineplot(x=x, y=chi2.pdf(x, df=dof), color='r', lw=2)
ax.text(2.7, 0.03, f'Chi-Squared\n (dof={dof})', fontsize=18, color='r')
# Draw the test statistics
ax.axvline(chi_stat, ls='-.', color='b', label='test-statistic')
ax.text(chi_stat-1.5, 0.05, "test statistic", fontsize=18, rotation='vertical', color='b')
ax.text(24, 0.02, f"p = {p:.4f}", fontsize=18, color='b')
# Draw the line leaving behind 95% of the area under the curve
ax.axvline(critical, ls='-.', color='g', label='critical-value')
ax.text(critical-1.5, 0.05, "critical value", fontsize=18, rotation='vertical', color='g')
ax.text(14, 0.02, f"alpha = {1-prob:.4f}", fontsize=18, color='g')
plt.show()
```
In the picture above we see both the critical value and the p-value. The p-value measures how likely we are to observe a test statistic at least as extreme as the one we observed, assuming $H_0$ is true. Since the p-value is smaller than the significance level $\alpha$ we previously set (equivalently, the test statistic falls beyond the critical value), we reject the null hypothesis and conclude that the two variables are not statistically independent.
+++
(appendix:oversampling)=
## Oversampling

The two most common oversampling techniques are SMOTE and ADASYN. Both techniques are based on the same oversampling rule, which generates new samples by interpolation:

$$x_{new} = x_i + \lambda \left( x_{zi} - x_i \right), \qquad \lambda \in \left[0, 1\right]$$

where:

- $x_{new}$ is the newly generated point
- $x_i$ is a point from the original dataset
- $x_{zi}$ is the $z$-th nearest neighbor of $x_i$

We are basically creating a new point as an interpolation between each point and one of its nearest neighbors. The main issue is that this approach only works with continuous features. Since we are required to deal with continuous as well as categorical features, we have to choose SMOTE, since ADASYN does not offer any solution to this problem.

As we just said, SMOTE for continuous values is pretty straightforward:

- Determine the oversampling ratio for each point in the minority class
- For each minority point:
  - Find its $k$ nearest neighbors
  - Choose $n$ of the $k$ neighbors and compute the new point coordinates
  - Add the new points to the dataset
```{code-cell} ipython3
:tags: [hide-input]
# Create fake data
x = [0.2, 0.4, 0.41, 0.42, 0.6, 0.6, 0.35, 0.58]
y = [3.2, 3.21, 3.65, 3.52, 3.51, 3.51, 3.6, 3.3]
c = ['Minority', 'Minority', 'Minority', 'Minority', 'Minority', 'Minority', 'Majority', 'Majority']
a = [1, 1, 1, 1, 1, 0.4, 1, 1]
l = 0.7
df = pd.DataFrame({'x': x, 'y': y, 'c': c, 'a': a})
# Build the new interpolated point
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
new_point = {'x': df.loc[5, 'x'] + l * (df.loc[1, 'x'] - df.loc[5, 'x']),
             'y': df.loc[5, 'y'] + l * (df.loc[1, 'y'] - df.loc[5, 'y']),
             'c': 'New',
             'a': 1}
df = pd.concat([df, pd.DataFrame([new_point])], ignore_index=True)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
size = 500000
# Draw the neighbourhood
sns.scatterplot(x='x', y='y', data=df[df['a']==0.4], hue='c', s=size, alpha=0.4, ax=ax, linewidth=0, legend=False)
# Draw the interpolation line
sns.lineplot(x=[x[1], x[5]], y=[y[1], y[5]], linestyle='--', color='green', ax=ax, legend=False)
# Draw points
sns.scatterplot(data=df.drop(5), x='x', y='y', hue='c', s=100, ax=ax, palette="tab10")
# Some text
ax.text(x[5]-0.03, y[5], "$x_i$", fontsize=20)
ax.text(x[1]-0.035, y[1], "$x_{zi}$", fontsize=20)
ax.text(df.iloc[-1]['x']-0.05, df.iloc[-1]['y'], "$x_{new}$", fontsize=20)
plt.legend()
plt.show()
```
Since it is not possible to interpolate categorical features, SMOTE has to be extended to deal with this kind of data. The way the imblearn library does it (in its SMOTENC variant) is by taking a majority vote among the $k$ nearest neighbors of a given point to decide the categorical values of the new sample.
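For reference, here is a minimal sketch of how this could look with imblearn's `SMOTENC`; the synthetic feature matrix, column indices and parameters below are made up purely for illustration:

```python
# A minimal SMOTENC sketch on made-up data (illustrative only).
import numpy as np
from imblearn.over_sampling import SMOTENC

rng = np.random.RandomState(42)
n = 60
# One continuous feature (column 0) and one categorical feature (column 1)
X = np.column_stack([rng.normal(size=n), rng.randint(0, 3, size=n)])
y = np.array([0] * 50 + [1] * 10)  # imbalanced labels

smote_nc = SMOTENC(categorical_features=[1], random_state=42, k_neighbors=5)
X_res, y_res = smote_nc.fit_resample(X, y)
print(X.shape, X_res.shape)  # the minority class is oversampled to balance the dataset
```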
+++

## Learning Theory

In order to understand the meaning behind procedures like the training-test data split or cross-validation, we must give a brief introduction to some basic concepts of learning theory.
+++
Let's start by defining some important terms that we will need in the following:

- Learner: the machine learning algorithm
- Domain set: the set $\mathcal{X}$ of observations that we want to label (in our case, the set of recorded sessions)
- Label set: the set $\mathcal{Y}$ of possible labels (in our case, $\{0, 1\}$)
- Training data: a finite sequence of labeled domain points $S = \left( \left( \mathcal{x}_1, \mathcal{y}_1 \right), \dots, \left( \mathcal{x}_m, \mathcal{y}_m \right) \right)$
- Learner's output: the prediction rule $\mathcal{h} : \mathcal{X} \rightarrow \mathcal{Y}$ obtained by the learner, trained on the training data
- Data-generating model: the probability distribution $\mathcal{D}$ underlying $\mathcal{X}$ and the correct labelling function $\mathcal{f}\ \text{s.t.}\ \mathcal{f}\left(\mathcal{x}_i\right)=\mathcal{y}_i$
- Measure of success: the error of a prediction rule $\mathcal{h} : \mathcal{X} \rightarrow \mathcal{Y}$
The error of a prediction rule is defined as

$$L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h\right) \overset{\operatorname{def}}{=} \mathcal{D}\left(\left\{x : h\left(x\right) \neq \mathcal{f}\left(x\right)\right\}\right)$$

meaning that the error of $h$ is the probability of drawing a point $x$ according to $\mathcal{D}$ for which $h\left(x\right) \neq \mathcal{f}\left(x\right)$. This quantity is also called the True Error, or Risk, of $h$.

The goal of our learner (algorithm) is to find the predictor that minimizes this error. However, the true error cannot be computed, because both $\mathcal{D}$ and $\mathcal{f}$ are unknown. All that our learner sees is the training set $S$, which works like a small window on the real world, so it can only measure the Empirical Risk

$$L_{S}\left(h\right) \overset{\operatorname{def}}{=} \frac{\left|\left\{i \in [m] : h\left(x_{i}\right) \neq y_{i}\right\}\right|}{m}$$

that is, the fraction of training examples on which our rule does not agree with the labelling function. Consequently, we train our learner by minimizing the empirical risk, a process called Empirical Risk Minimization (ERM).
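As a toy illustration (not part of the original text), here is a sketch of the ERM rule over a small finite class of threshold classifiers, chosen by minimizing the empirical risk on a synthetic sample:

```python
# A toy ERM sketch over a finite class of threshold classifiers (illustrative only).
import numpy as np

rng = np.random.RandomState(0)
m = 100
x = rng.uniform(-1, 1, size=m)
y = (x > 0.25).astype(int)  # the unknown labelling function f

# Finite hypothesis class: h_a(x) = 1[x > a] for a few candidate thresholds a
thresholds = np.linspace(-1, 1, 21)

def empirical_risk(a):
    return np.mean((x > a).astype(int) != y)

best_a = min(thresholds, key=empirical_risk)  # the ERM output h_S
print(best_a, empirical_risk(best_a))
```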
The ERM technique, although very intuitive, may lead to overfitting.
```{code-cell} ipython3
:tags: [hide-input]
n_inner_points = 20
n_outer_points = 40
np.random.seed(42)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.set_xlim((-2, 2))
ax.set_ylim((-2, 2))
# Outer boundary
outer_circle = plt.Circle((0, 0), np.sqrt(2), color='b', fill=False, lw=2)
ax.add_patch(outer_circle)
# Inner boundary
inner_circle = plt.Circle((0, 0), 1, color='r', fill=False, lw=2)
ax.add_patch(inner_circle)
# Plot outer samples
angle_outer = np.random.rand(1, n_outer_points) * 2 * np.pi
radius_outer = np.random.rand(1, n_outer_points) * (np.sqrt(2) - 1) + 1
ax.scatter(radius_outer * np.cos(angle_outer), radius_outer * np.sin(angle_outer), c='b')
# Plot inner samples
angle_inner = np.random.rand(1, n_inner_points) * 2 * np.pi
radius_inner = np.random.rand(1, n_inner_points)
ax.scatter(radius_inner * np.cos(angle_inner), radius_inner * np.sin(angle_inner), marker='D', c='r')
plt.axis('off')
plt.show()
```
Let's imagine a bivariate dataset like the one in the figure above, where the distribution $\mathcal{D}$ is spread over the outer disc and the labelling function $\mathcal{f}$ assigns label 1 to the points falling inside the inner circle (red diamonds) and label 0 to the points falling outside it (blue dots); since the inner circle has half the area of the outer disc, assume that half of the probability mass carries label 1.

Now consider the predictor:

$$h_{S}\left(x\right)=\begin{cases}y_{i} & \text{if } \exists i \in [m] \text{ s.t. } x_{i}=x \\ 0 & \text{otherwise}\end{cases}$$

which essentially memorizes the training samples by heart and guesses label 0 for every unseen sample. This predictor has empirical risk $L_S\left(h_S\right)=0$, yet its true error is $1/2$, because it misclassifies every unseen point falling inside the inner circle.

We have found a predictor that performs perfectly on the training set and very poorly in the "true world".
+++
We have just shown that the unrestricted $\operatorname{ERM}$ rule may lead to overfitting. A common solution to overfitting is to limit the search space of the learner: we apply the $\operatorname{ERM}$ rule over a restricted hypothesis class $\mathcal{H}$, chosen before seeing the data:

$$h_S = \operatorname{ERM}_{\mathcal{H}}(S) \in \underset{h \in \mathcal{H}}{\operatorname{argmin}}\ L_{S}(h)$$

We call the choice of such a restricted hypothesis class an inductive bias. In what follows we assume that $\mathcal{H}$ is a finite class.
A further assumption that we have to make is the Realizability Assumption: there exists $h^* \in \mathcal{H}$ such that $L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h^*\right)=0$.

This means that there always exists a hypothesis in our finite hypothesis class for which the True Error is zero. This implies:

- $\mathcal{D}^{m}\left(\left\{S : L_S\left(h^*\right)=0\right\}\right)=1$: with probability 1 over the random samples $S$, sampled according to $\mathcal{D}$ and labelled by $\mathcal{f}$, the empirical error of $h^*$ equals 0
- $L_S\left(h_S\right)=0$: for every $\operatorname{ERM}$ hypothesis, the empirical risk equals 0

The last assumption is that the samples in the training set are independently and identically distributed (i.i.d.) according to $\mathcal{D}$; we denote this by $S \sim \mathcal{D}^m$.
Since the training set is random, there is always a (small) probability that our sample is unrepresentative of the underlying distribution $\mathcal{D}$.

We are now interested in finding an upper bound to the probability of sampling an $m$-tuple of instances that will lead to failure in learning a good predictor. The quantity we want to upper bound is

$$\mathcal{D}^m\left(\left\{ S|_x : L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h_S\right)>\epsilon\right\}\right)\tag{1}$$

where $\epsilon$ is the accuracy parameter (we interpret a true error larger than $\epsilon$ as a failure of the learner) and $S|_x = \left(x_1, \dots, x_m\right)$ denotes the instances of the training set.
Let:

- $\mathcal{H_B}$ be the set of "bad" hypotheses: $\mathcal{H_B}=\left\{h \in \mathcal{H} : L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h\right)>\epsilon\right\}$. $\mathcal{H_B}$ is the set of hypotheses that lead to the learner's failure (N.B. defined with respect to the True Error).
- $M$ be the set of "misleading" samples: $M=\left\{S|_x : \exists h \in \mathcal{H}_B,\ L_S\left(h\right)=0\right\}$. $M$ is the set of samples for which some bad hypothesis (one with high true error) has zero empirical error.

Recall:

- We are trying to bound $\mathcal{D}^m\left(\left\{ S|_x : L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h_S\right)>\epsilon\right\}\right)$
Consider:

- The realizability assumption implies that $L_S\left(h_S\right)=0$ with probability 1.

This means that the event $L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h_S\right)>\epsilon$ can only happen when some bad hypothesis $h \in \mathcal{H_B}$ has $L_S\left(h\right)=0$: being the $\operatorname{ERM}$ output a hypothesis with zero empirical risk, $h_S$ can be a bad hypothesis only if that bad hypothesis also has zero empirical error on the sample. We can now say that the event above will only happen if our sample is in the set of misleading samples $M$.

We have shown that:

$$\left\{S|_x : L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h_S\right)>\epsilon \right\} \subseteq M$$

(meaning that the set of samples on which the learner fails is contained in the set of misleading samples)

Hence:

$$\mathcal{D}^m\left(\left\{S|_x : L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h_S\right)>\epsilon \right\}\right) \leq \mathcal{D}^m\left(M\right)\tag{2}$$

(the probability of picking a sample that leads to the learner's failure is bounded by the probability of picking a misleading sample)
Consider:

- The equality $M=\left\{S|_x : \exists h \in \mathcal{H}_B,\ L_S\left(h\right)=0\right\}=\underset{h\in\mathcal{H}_B}{\bigcup}\left\{S|_x : L_S\left(h\right)=0\right\}$

(the set of misleading samples can be written as a union, over the bad hypotheses, of the sets of samples on which that hypothesis has zero empirical error)

Then

$$\mathcal{D}^m\left(\left\{S|_x : L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h_S\right)>\epsilon \right\}\right)\leq\mathcal{D}^m\left(\underset{h\in\mathcal{H}_B}{\bigcup}\left\{S|_x : L_S\left(h\right)=0\right\}\right)\tag{3}$$

(the probability of failure is bounded by the probability of sampling such a misleading sample)
Consider:

- The union bound: $\mathcal{D}\left(A \cup B\right) \leq \mathcal{D}\left(A\right) +\mathcal{D}\left(B\right)$

Then

$$\mathcal{D}^m\left(\left\{S|_x : L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h_S\right)>\epsilon \right\}\right) \leq \underset{h\in\mathcal{H}_B}{\sum}\mathcal{D}^m\left(\left\{S|_x : L_S\left(h\right)=0\right\}\right)\tag{4}$$
Now, let's bound each single summand on the right side.
Fix $h \in \mathcal{H_B}$.

Consider:

- The event $L_S\left(h\right)=0$ indicates the case where our predictor $h$ agrees with the true labelling $\mathcal{f}$ on every sample point; formally, $\forall{i},\ h\left(x_i\right)=\mathcal{f}\left(x_i\right)$

Then

$$\mathcal{D}^m\left(\left\{S|_x : L_S\left(h\right)=0\right\}\right) = \mathcal{D}^m\left(\left\{S|_x : \forall{i},\ h\left(x_i\right)=\mathcal{f}\left(x_i\right)\right\}\right)$$
Consider:

- The training samples are i.i.d.

Then

$$\mathcal{D}^m\left(\left\{S|_x : \forall{i},\ h\left(x_i\right)=\mathcal{f}\left(x_i\right)\right\}\right) = \prod_{i=1}^{m}\mathcal{D}\left(\left\{x_i : h\left(x_i\right)=\mathcal{f}\left(x_i\right)\right\}\right)$$

Remember:

- The definition of true risk: $L_{\mathcal{D}, f}(h) \stackrel{\text { def }}{=} \mathcal{D}\left(\left\{x : h(x) \neq f(x)\right\}\right)$

Then, for each individual sampling of an element of the training set, we have

$$\mathcal{D}\left(\left\{x_i : h\left(x_i\right)=\mathcal{f}\left(x_i\right)\right\}\right) = 1 - L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h\right) \leq 1-\epsilon$$

where the last inequality follows from the fact that we are considering only bad hypotheses, for which $L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h\right)>\epsilon$.
Consider:

- The inequality $\left(1 - \epsilon\right)^m \leq e^{-\epsilon m}$

We have now proved that

$$\mathcal{D}^m\left(\left\{S|_x : L_S\left(h\right)=0\right\}\right) \leq \left(1-\epsilon\right)^m \leq e^{-\epsilon m}$$

Now, substituting back into equation $(4)$, we get

$$\mathcal{D}^m\left(\left\{S|_x : L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h_S\right)>\epsilon \right\}\right) \leq \left|\mathcal{H_B}\right|e^{-\epsilon m}\tag{5}$$
Consider:
$\left|\mathcal{H_B}\right| \leq \left|\mathcal{H}\right|$
We conclude that
$$\mathcal{D}^m\left(\left\{S|_x : L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h_S\right)>\epsilon \right\}\right) \leq \left|\mathcal{H}\right|e^{-\epsilon m}\tag{6}$$
---
Let

- $\mathcal{H}$ be a finite hypothesis class
- $\delta \in \left(0, 1\right)$ and $\epsilon > 0$
- $m$ be an integer that satisfies $m \geq \frac{\log (|\mathcal{H}| / \delta)}{\epsilon}$
Then, for any labelling function $\mathcal{f}$ and any distribution $\mathcal{D}$ for which the realizability assumption holds, with probability of at least $1-\delta$ over the choice of an i.i.d. sample $S$ of size $m$, every $\operatorname{ERM}$ hypothesis $h_S$ satisfies $L_{\left(\mathcal{D},\mathcal{f}\right)}\left(h_S\right)\leq\epsilon$.

The above statement means that, for a training set of sufficiently large size $m$, the $\operatorname{ERM}_\mathcal{H}$ rule over a finite hypothesis class will be probably (with confidence $1-\delta$) approximately (up to error $\epsilon$) correct.
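To get a feeling for the numbers, here is a tiny sketch that plugs some illustrative (made-up) values of $|\mathcal{H}|$, $\epsilon$ and $\delta$ into the bound above:

```python
# Sample-size bound m >= log(|H| / delta) / epsilon for a finite class (illustrative values).
import numpy as np

H_size, epsilon, delta = 1_000_000, 0.01, 0.05
m_bound = np.log(H_size / delta) / epsilon
print(int(np.ceil(m_bound)))  # a sample of this size suffices for the (epsilon, delta) guarantee
```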
We have thus found that, for a finite hypothesis class, if the sample size is large enough, then the $\operatorname{ERM}$ rule outputs, with high probability, a hypothesis with low true error. This is exactly the idea formalized by the notion of PAC (Probably Approximately Correct) learnability.
We can now define PAC Learnability as follows:
```{admonition} Definition (PAC Learnability)
:class: important
A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a function $m_{\mathcal{H}}:(0,1)^{2} \rightarrow \mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon,\delta\in\left(0, 1\right)$, for every distribution $\mathcal{D}$ over $\mathcal{X}$ and for every labelling function $\mathcal{f}:\mathcal{X}\rightarrow\{0, 1\}$, if the realizability assumption holds w.r.t. $\mathcal{H,D,f}$, then when running the learning algorithm on $m \geq m_\mathcal{H}\left(\epsilon,\delta\right)$ i.i.d. samples generated by $\mathcal{D}$ and labelled by $\mathcal{f}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1-\delta$ (probably), $L_{\left(\mathcal{D}, \mathcal{f}\right)}\left(h\right)\leq\epsilon$ (approximately correct).
```

It is important to note that if $\mathcal{H}$ is a finite hypothesis class, then $\mathcal{H}$ is PAC learnable with sample complexity $m_{\mathcal{H}}\left(\epsilon, \delta\right) \leq \left\lceil \frac{\log (|\mathcal{H}| / \delta)}{\epsilon} \right\rceil$, which is exactly the result we derived above.

The first step towards the generalization of the model is to remove the realizability assumption. Remember, the realizability assumption says that $\exists h^*\in\mathcal{H} : \mathcal{D}\left(\left\{x : h^*\left(x\right)=\mathcal{f}\left(x\right)\right\}\right)=1$. In many real problems this is not the case. Furthermore, it is also unrealistic to suppose that the labels are fully determined by a deterministic labelling function applied to the domain points.
For this reason, from now on, we let $\mathcal{D}$ denote a probability distribution over $\mathcal{X}\times\mathcal{Y}$, that is, a joint distribution over domain points and labels. Such a distribution can be seen as composed of two parts:

- $\mathcal{D_x}$, the marginal distribution over the unlabeled domain points of $\mathcal{X}$
- $\mathcal{D}\left(\left(x, y\right)|x\right)$, the conditional distribution over the labels of $\mathcal{Y}$ for each domain point

This makes the model more general: the label of a domain point is no longer assumed to be fully determined by a deterministic labelling function.
We can now rewrite some of the concepts that we defined previously:

- True Error: $L_\mathcal{D}\left(h\right) \overset{\operatorname{def}}{=} \mathcal{D}\left(\left\{\left(\mathcal{x, y}\right) : h\left(\mathcal{x}\right)\neq\mathcal{y}\right\}\right)$
- Empirical Error: $L_{S}(h) \stackrel{\text{def}}{=} \frac{\left|\left\{i \in[m] : h\left(x_{i}\right) \neq y_{i}\right\}\right|}{m}$
We can now replace the realizability assumption by considering the following example. Given a probability distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$, the best label predictor is

$$f_{\mathcal{D}}\left(x\right)=\begin{cases}1 & \text{if } \mathbb{P}\left[y=1 \mid x\right] \geq 1/2 \\ 0 & \text{otherwise}\end{cases}$$

The above classifier is called the Bayes classifier, and it is the optimal predictor: no other classifier has a lower error for the given $\mathcal{D}$. However, once more the Bayes risk cannot be computed, as it depends on the unknown distribution $\mathcal{D}$. Therefore, instead of requiring an absolute error bound, we require the learner to be competitive with the best hypothesis in its class:
```{admonition} Definition (Agnostic PAC Learnability)
:class: important
A hypothesis class $\mathcal{H}$ is **agnostic** PAC learnable if there exist a function $m_{\mathcal{H}}:(0,1)^{2} \rightarrow \mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon,\delta\in\left(0, 1\right)$ and for every distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$, when running the learning algorithm on $m \geq m_\mathcal{H}\left(\epsilon,\delta\right)$ i.i.d. samples generated by $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1-\delta$,

$$L_\mathcal{D}\left(h\right)\leq \underset{h^\prime\in\mathcal{H}}{\operatorname{min}} L_\mathcal{D}\left(h^\prime\right) +\epsilon$$
```

With the above definition we have relaxed the PAC learnability requirement: instead of asking the learner to achieve a small error in absolute terms, we compare its error to that of the best predictor in the hypothesis class we chose.
The next relaxation we are going to perform on our model concerns the scope of the learning problem. To give an idea of what this means, let's consider some examples of different learning tasks:

- Multiclass Classification: we do not require the classification to be binary. We could think of document classification, where $\mathcal{X}$ is the set of all documents and $\mathcal{Y}$ is a large but finite set (e.g. Sport, Politics, Kitchen, News). Our training sample still looks exactly the same as previously discussed, a finite sequence of pairs $\mathcal{(x_i, y_i)}$, and the learner's output will still be a function $h_S:\mathcal{X}\rightarrow\mathcal{Y}$. As a metric, we can use the probability, over pairs, that our predictor suggests a wrong label.
- Regression: in regression, the goal is to find a functional relationship between $\mathcal{X}$ and $\mathcal{Y}$. For example, one could want to find a linear function that predicts a child's weight at birth given some ultrasound measurements (e.g. head circumference, abdominal circumference and femur length). Here $\mathcal{X} \subseteq \mathbb{R}^3$ and the label set is the set of real numbers. Our training data still looks the same even in this task (a sequence of pairs $\mathcal{\left(x_i, y_i\right)}$). However, the measure of success differs: we may evaluate the quality of a hypothesis $h_S:\mathcal{X}\rightarrow\mathcal{Y}$ using the expected square difference between the true labels and the predictor's outputs, $L_\mathcal{D}\left(h_S\right) = \underset{\left(x,y\right)\sim\mathcal{D}}{\mathbb{E}}\left(h_S\left(x\right) - y\right)^2$.
For this reason, we generalize our measure of success as follows:
```{admonition} Definition (Loss Function)
:class: important
Given any set $\mathcal{H}$ and some domain $Z$, let $\ell$ be any function $\ell:\mathcal{H}\times Z \rightarrow \mathbb{R}_+$; we call such functions *loss functions* (for classification problems, $Z=\mathcal{X\times Y}$).
```

We can now rewrite the definitions of true and empirical risk in terms of a generic loss function:
- True Risk: $L_\mathcal{D}\left(h\right) \overset{\operatorname{def}}{=} \underset{\mathcal{z\sim D}}{\mathbb{E}}\left[\ell\left(\mathcal{h, z}\right)\right]$ (the risk is the expected value of the loss)
- Empirical Risk: $L_{S}(h) \overset{\operatorname{def}}{=} \frac{1}{m} \overset{m}{\underset{i=1}{\sum}}\ell\mathcal{\left(h,z_i\right)}$, where $S=\mathcal{\left(z_1,\dots,z_m\right)\in Z^m}$

Let's now write the loss for classification and regression tasks:

- 0-1 Loss: $\ell_{0\text{-}1}\left(h, \left(x, y\right)\right) \overset{\operatorname{def}}{=} \begin{cases}0 & \text{if } h(x)=y \\ 1 & \text{if } h(x) \neq y\end{cases}$
- Square Loss: $\ell_{\operatorname{sq}}\left(h, \left(x, y\right)\right) \overset{\operatorname{def}}{=} \left(h\left(x\right) - y\right)^2$
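As a quick illustration (with made-up labels and predictions), the empirical risk under both losses is just the average per-example loss:

```python
# Empirical risk under the 0-1 loss and the square loss, on made-up predictions.
import numpy as np

y_true = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
y_pred = np.array([0.0, 1.0, 0.0, 0.0, 0.8])

zero_one = (np.round(y_pred) != y_true).astype(float)  # 0-1 loss per example
square = (y_pred - y_true) ** 2                        # square loss per example
print(zero_one.mean(), square.mean())                  # L_S(h): average loss over the sample
```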
To summarize:
```{admonition} Definition (Agnostic PAC Learnability for General Loss Functions)
:class: important
A hypothesis class $\mathcal{H}$ is **agnostic** PAC learnable with respect to a set $Z$ and a loss function $\ell:\mathcal{H}\times Z\rightarrow \mathbb{R}_+$ if there exist a function $m_{\mathcal{H}}:(0,1)^{2} \rightarrow \mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon,\delta\in\left(0, 1\right)$ and for every distribution $\mathcal{D}$ over $Z$, when running the learning algorithm on $m \geq m_\mathcal{H}\left(\epsilon,\delta\right)$ i.i.d. samples generated by $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1-\delta$ (over the choice of the $m$ training examples),

$$L_\mathcal{D}\left(h\right)\leq \underset{h^\prime\in\mathcal{H}}{\operatorname{min}} L_\mathcal{D}\left(h^\prime\right) +\epsilon$$

where $L_\mathcal{D}\left(h\right) = \mathbb{E}_{\mathcal{z\sim D}}\left[\ell\left(\mathcal{h, z}\right)\right]$.
```
+++
Recall the $\operatorname{ERM}$ learning rule:

- The $\operatorname{ERM}$ learner receives a training sample $S$
- The learner evaluates the empirical risk of each $h \in\mathcal{H}$ over the training sample $S$
- The learner outputs an $h_S$ that minimizes the empirical risk

We then hope that this $h_S$ also achieves a low true risk $L_\mathcal{D}\left(h_S\right)$, and not just a low empirical risk. For this to happen, it is enough that the empirical risk of every hypothesis in $\mathcal{H}$ is a good approximation of its true risk. This requirement is captured by the notion of an $\epsilon$-representative sample:
```{admonition} Definition (Epsilon-Representative Sample)
:class: important
A training set $S$ is called $\epsilon$-representative (w.r.t. domain $Z$, hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$) if

$$\forall{h}\in\mathcal{H},\ \left|L_S\left(h\right)-L_\mathcal{D}\left(h\right)\right|\leq{\epsilon}$$
```

Now assume that a training set $S$ is $\epsilon$-representative. Then any output of the $\operatorname{ERM}$ rule, that is any $h_S \in \underset{h\in\mathcal{H}}{\operatorname{argmin}}\ L_S\left(h\right)$, satisfies $L_\mathcal{D}\left(h_S\right) \leq \underset{h\in\mathcal{H}}{\operatorname{min}}\ L_\mathcal{D}\left(h\right) + 2\epsilon$.
Proof:

$$L_{\mathcal{D}}\left(h_{S}\right) \overset{1}{\leq} L_{S}\left(h_{S}\right)+\epsilon \overset{2}{\leq} L_{S}\left(h\right)+\epsilon \overset{3}{\leq} L_{\mathcal{D}}\left(h\right)+2\epsilon \qquad \forall h \in \mathcal{H}$$

where:

- $1$ and $3$ hold because $S$ is $\epsilon$-representative: $\left|L_S\left(h\right)-L_\mathcal{D}\left(h\right)\right|\leq{\epsilon}$
- $2$ holds because $h_S$ is an $\operatorname{ERM}$ predictor (the hypothesis which yields the minimum empirical risk)

We can now say that, to ensure that the $\operatorname{ERM}$ rule is an agnostic PAC learner, it suffices to show that, with probability of at least $1-\delta$ over the random choice of the training set, $S$ is $\epsilon$-representative. This requirement is formalized by the *uniform convergence* property:
```{admonition} Definition (Uniform Convergence)
:class: important
We say that a hypothesis class $\mathcal{H}$ has the *uniform convergence* property (w.r.t. a domain $Z$ and a loss function $\ell$) if there exists a function $m^{UC}_\mathcal{H}:\left(0, 1\right)^2\rightarrow\mathbb{N}$ such that for every $\epsilon,\delta\in\left(0, 1\right)$ and for every probability distribution $\mathcal{D}$ over $Z$, if $S$ is a sample of $m \geq m^{UC}_\mathcal{H}\left(\epsilon,\delta\right)$ examples drawn i.i.d. according to $\mathcal{D}$, then, with probability of at least $1-\delta$, $S$ is $\epsilon$-representative (which in turn means that the $\operatorname{ERM}$ rule is an agnostic PAC learner).
```

It follows that if a class $\mathcal{H}$ has the uniform convergence property with a function $m^{UC}_\mathcal{H}$, then the class is agnostically PAC learnable, and the $\operatorname{ERM}$ paradigm is a successful agnostic PAC learner for $\mathcal{H}$.
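The simulation below (with an assumed toy error probability and a single fixed hypothesis) illustrates the intuition behind this result: the gap between empirical and true risk typically shrinks as the sample size grows:

```python
# Gap between empirical and true risk for one fixed hypothesis, on a toy distribution.
import numpy as np

rng = np.random.RandomState(0)
p_error = 0.3  # assumed true risk of the fixed hypothesis

for m in (10, 100, 1000, 10000):
    errors = rng.binomial(1, p_error, size=m)  # each example is an error with prob. p_error
    L_S = errors.mean()                        # empirical risk on a sample of size m
    print(m, abs(L_S - p_error))               # the gap |L_S(h) - L_D(h)| typically shrinks
```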
Below is a recap of the most important results achieved during this brief introduction
The Vapnik-Chervonenkis (VC) dimension is a property of a hypothesis class that gives the correct characterization of its learnability. In order to define the VC-dimension, we first need to define a preliminary notion called shattering.

```{admonition} Definition (Shattering)
:class: important
Let $\mathcal{H}$ be a class of $\{0, 1\}$-valued functions over some domain $\mathcal{X}$ and let $A\subseteq\mathcal{X}$. We say that $\mathcal{H}$ shatters $A$ if $\forall \mathcal{g}:A\rightarrow\{0, 1\}\ \exists h\in\mathcal{H}\ \operatorname{s.t.}\ \forall \mathcal{x} \in A,\ h\left(\mathcal{x}\right) = g\left(\mathcal{x}\right)$
```
```{code-cell} ipython3
:tags: [hide-input]
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as mpath
sns.set_style("darkgrid")
# fig = plt.figure()
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
# Domain X
ax.add_patch(patches.Rectangle((2, 2), 6, 6, fc='none', ec='blue'))
ax.text(8.5, 4.5, r'$\mathcal{X}$', fontsize=40, color='blue')
# Subset A
ax.add_patch(patches.Circle((5, 5), 1, fc='none', ec='red'))
ax.text(4.8, 6.4, '$A$', fontsize=35, color='red')
# 0-1 behaviour over A
x_a = np.arange(4, 6, 0.01);
y_a = np.sin(x_a*(4*np.pi))*0.2 + 5
ax.plot(x_a, y_a, color='green', linewidth=5)
ax.text(4.9, 5.6, r'$\mathcal{g}$', fontsize=20, color='green')
# 0-1 behaviour over X
x_x_left = np.arange(2, 4, 0.01)
y_x_left = np.sin(x_x_left*(4*np.pi))*0.5 + 5
ax.plot(x_x_left, y_x_left, color='orange', linestyle='--', linewidth=3)
x_a = np.arange(4, 6, 0.01);
y_a = np.sin(x_a*(4*np.pi))*0.2 + 5
ax.plot(x_a, y_a, color='orange', linestyle='--', linewidth=3)
x_x_right = np.arange(6, 8, 0.01)
y_x_right = np.sin(x_x_right*(4*np.pi))*0.5 + 5
ax.plot(x_x_right, y_x_right, color='orange', linestyle='--', linewidth=3)
ax.text(2.8, 6, r'$\mathcal{h}$', fontsize=30, color='orange')
# Amplitude of the sine wave is sine of a variable like time
plt.ylim((0, 10))
plt.xlim((0, 10))
ax.set_yticks([])
ax.set_xticks([])
plt.show()
```
In the figure above we can see a (green) function $\mathcal{g}$ defined over the subset $A$, and an (orange) hypothesis $\mathcal{h}$, defined over the whole domain $\mathcal{X}$, that agrees with $\mathcal{g}$ on every point of $A$. If such an $\mathcal{h}\in\mathcal{H}$ exists for every possible $\{0, 1\}$ behaviour $\mathcal{g}$ over $A$, then $\mathcal{H}$ shatters $A$.

```{admonition} Note
:class: note
$\{0,1\}$ behaviour in this setting means "*a way of splitting the set $A$ in two subsets*"
```

```{admonition} Note
:class: note
There is equivalence between functions $\mathcal{h}:\mathcal{X}\rightarrow\{0, 1\}$ and subsets $A\subseteq\mathcal{X}$. Given any $\mathcal{h}$ I can define a set

$$\forall{h}:\mathcal{X}\rightarrow\{0, 1\}\ \exists A_{\mathcal{h}}=\{\mathcal{x}\in\mathcal{X}:\mathcal{h}\left(\mathcal{x}\right)=1\}$$

The relationship also works in the opposite direction:

$$\forall{A}\subseteq\mathcal{X}\ \exists h_A\left(\mathcal{x}\right)=\begin{cases}1 & \text { if } x\in A \\ 0 & \text { if } x \notin A \end{cases}$$
```
We can now define the VC-Dimension
```{admonition} Definition (VC-Dimension)
:class: important
The VC-dimension of a class $\mathcal{H}$ is defined as the size of the maximal shattered set:

$$ \operatorname{VCdim}\left(\mathcal{H}\right)\overset{\operatorname{def}}{=}\underset{A \text{ shattered by } \mathcal{H}}{\operatorname{max}} \left\lvert A\right\rvert $$
```

```{admonition} Note
:class: note
* If I claim that $\operatorname{VCdim}\left(\mathcal{H}\right)\geq n$, then I have to find at least one set of size $n$ which is shattered by $\mathcal{H}$
* If I claim that $\operatorname{VCdim}\left(\mathcal{H}\right)< n$, then I have to show that, no matter how I pick $n$ points, they will never be shattered by $\mathcal{H}$
```
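As a concrete illustration (not from the original text), here is a small brute-force sketch that checks shattering for the class of threshold classifiers $h_a(x) = \mathbb{1}\left[x \geq a\right]$ on the real line, whose VC-dimension is 1:

```python
# Brute-force shattering check for threshold classifiers h_a(x) = 1[x >= a] (illustrative only).
import itertools
import numpy as np

def is_shattered(points, thresholds):
    """Check whether every {0,1} labelling of `points` is realized by some threshold."""
    points = np.asarray(points)
    achievable = {tuple((points >= a).astype(int)) for a in thresholds}
    return all(labels in achievable
               for labels in itertools.product([0, 1], repeat=len(points)))

# Candidate thresholds, dense enough to realize every achievable behaviour on these points
thresholds = np.linspace(-5, 5, 1001)
print(is_shattered([0.0], thresholds))       # True: a single point can be shattered
print(is_shattered([0.0, 1.0], thresholds))  # False: no threshold labels the left point 1 and the right point 0
```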
+++
The k-fold cross-validation technique is designed to provide an accurate estimate of the true error without wasting too much data (as happens in plain hold-out validation). The method consists in dividing the original training dataset of size $m$ into $k$ folds of size $m/k$. For each fold, the algorithm is trained on the union of the other folds, and the error of its output is estimated on the held-out fold; the $k$ resulting error estimates are then averaged.
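For completeness, here is a minimal sketch (using scikit-learn on a made-up dataset, as an illustration only) of how the k-fold estimate is typically computed:

```python
# A minimal k-fold cross-validation sketch on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores)         # one accuracy estimate per held-out fold
print(scores.mean())  # the k-fold estimate is the average of the fold estimates
```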