Sno | Topics |
---|---|
1 | The Learning Problem |
2 | Is Learning Feasible? |
3 | The Linear Model I |
4 | Error and Noise |
5 | Training versus Testing |
6 | Theory of Generalization |
7 | The VC Dimension |
8 | Bias-Variance Tradeoff |
9 | The Linear Model II |
10 | Neural Networks |
11 | Overfitting |
12 | Regularization |
13 | Validation |
14 | Support Vector Machines |
15 | Kernel Methods |
16 | Radial Basis Functions |
17 | Three Learning Principles |
18 | Epilogue |
The storyline:
- What is learning? - 1-4
- Can we learn? - 5-8
- How to do it? - 9-10
- How to do it well? - 11-16
- The philosophy - 17-18
Predicting how a viewer will rate a movie
Netflix offered a $1 million prize for a 10% improvement in its rating predictions.
3 conditions that make ML the right tool:
- a pattern exists
- if no pattern, ML cannot help
- we cannot pin it down mathematically
- we have data on it
We can describe each viewer with a vector of features. We can also create a same vector for the movie. When we take their inner product, we get a measure of how likely the user is to like the movie.
This approach is a problem because you have to construct those vectors by hand - watch the movie, interview the user, etc.
ML reverses the process. It starts from the ratings and then comes up with the vectors for the movies and users (starting from arbitrary initial vectors for both).
Metaphor: credit approval. Banks have historical data about customers - age, gender, annual salary, etc.
Formalization:
- Input: Χ
- customer vector
- Output: y
- to extend or deny credit
- Target function: ƒ
- ƒ: Χ → y
- this function has a binary co-domain (y can only be 1 or 0)
- ƒ is the target function, that is what we have to find
- Data: (x1, y1), (x2, y2), …, (xN, yN)
- this is the historical data
We use the “Data” to get an approximation of the “target function” called “hypothesis”
- Hypothesis: g: Χ → y
So, we use the training examples to find a hypothesis g that approximates the target function; the learning algorithm selects g from the hypothesis set H. The hypothesis set can be discrete or continuous, finite or infinite. In general, the hypothesis set is continuous and infinite (very infinite!) - but we will still be able to learn.
With theory, we will be able to put a number on the sophistication of the hypothesis set.
In the learning diagram, the red components (the hypothesis set and the learning algorithm) are the ones we get to choose.
Why do we need the hypothesis set? Why not let the learning algorithm select from universal set?
- this will help us formalize learning, we will see this later
So, the components of learning are:
- The hypothesis set:
- H = {h} (the set of candidate hypotheses), g ∈ H
- The learning algorithm
- Together, they form the learning model
Input is: x = (x1, …, xd)
Approve credit if:
Σ wixi > threshold, i → 1 to d
else deny
This linear formula h ∈ H can be written as:
h(x) = sign ( Σ wixi - threshold ), i → 1 to d
If h is +ve, we approve credit else we deny credit
We see that h is a function of w and threshold
If the data is linearly separable, the perceptron learning algorithm can start from any line and converge to a separating one.
We can rename -threshold to w0 and call it the bias - we add an artificial coordinate x0 whose value is always 1. Note: w0 is not 1; it is a free parameter that we learn. x0 is the one fixed at 1. Now the formula becomes:
h(x) = sign ( Σ wixi ), i → 0 to d
Vector form:
h(x) = sign (wTx)
w is a column vector, or [w0, …, wd]T and vector x is [x0, …, xd]
The perceptron learning algorithm takes a misclassified point and updates the weights such that they behave better on that point i.e. w ← w + ynxn
Consider these 2 cases of a point being misclassified. In the first case, the inner product wTx is negative but y is positive; the update rule moves w towards x (it adds x to w), so the inner product increases and the prediction moves towards positive.
In the second case, the inner product is positive but y is negative; the update rule moves w away from x (it adds -x to w), so the inner product decreases towards negative.
The catch is that fixing one point can break points that were previously classified correctly. Still, if you keep repeating this process and the data is linearly separable, the algorithm is guaranteed to end up classifying all points correctly (a sketch of the algorithm follows below).
(If it is not linearly separable, you can map it to a space where they are linearly separable)
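A minimal sketch of PLA in Python (the toy dataset and the zero initialization are illustrative, not from the lecture):

```python
import numpy as np

def pla(X, y, max_iters=10_000):
    """Perceptron learning algorithm.
    X: (N, d) inputs; y: (N,) labels in {-1, +1}."""
    Xb = np.hstack([np.ones((len(X), 1)), X])    # prepend the artificial coordinate x0 = 1
    w = np.zeros(Xb.shape[1])                    # start from any w (here: all zeros)
    for _ in range(max_iters):
        misclassified = np.where(np.sign(Xb @ w) != y)[0]
        if len(misclassified) == 0:              # every point classified correctly -> done
            break
        n = np.random.choice(misclassified)      # pick any misclassified point
        w = w + y[n] * Xb[n]                     # the update rule: w <- w + y_n x_n
    return w

# toy linearly separable data: true boundary is x1 + x2 = 0
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.sign(X[:, 0] + X[:, 1])
print(pla(X, y))
```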
The basic premise of learning
“using a set of observations to uncover an underlying process”
- Supervised Learning
- Unsupervised learning
- we only get the inputs (no labels), and can still learn a higher-level representation of the input data
- Reinforcement learning
Consider this problem, you are given 6 training examples which are labeled with the correct output For the new one, what is the label?
It is impossible to predict correctly - this is because the target function can be anything!
- It can be -1
- if you take the function to be -1 if top left square is black
- It can be +1
- if you take the function to be +1 if pattern is symmetric
We have this problem in machine learning as well. But it does not mean that learning is impossible - this is addressed in the next lecture.
- Learning is used when
- there is a pattern, we cannot write mathematical formula for it, we have data
- Notation:
- we don’t know the target function y = ƒ(x)
- we have the data set: (x1, y1), (x2, y2), …, (xN, yN)
- we have a learning algorithm that finds g from the hypothesis set such that g ≈ ƒ
- we were stuck at the puzzle, where the random function could be anything, how are we to learn?
This lecture will address this question
Consider an experiment: A bin has red and green marbles.
We need to find the probability of picking a red marble, let’s call it μ
P[red] = μ
So, P[green] = 1 - μ
We pick N marbles independently (recall N is for # of data points) so, sample: GRGGGGRRGG
Let the fraction of R in sample: ν
Does ν (sample frequency) say anything about μ (the actual frequency in bin)? The short answer is not on the face of it, but it does give us some bounds.
ν is likely to be close to μ
In a big sample, large N, ν is close to μ (within ε)
Formally:
P[|ν - μ| > ε] ≤ 2e^(-2ε²N)
This is Hoeffding’s inequality. This line of reasoning will eventually lead us to the VC dimension.
Hoeffding’s inequality in words:
- ν ≈ μ is “probably approximately correct” (PAC)
- to tighten the tolerance ε, the required N grows like 1/ε²; for a fixed ε the bound falls exponentially in N
This is the inequality you use when you want to know how closely a sample represents the actual truth. E.g., if you are checking for nulls in a database attribute, how many rows should you sample to be 90% confident in your estimate?
This formula is valid for any N and ε, and doesn’t depend on μ. There is a tradeoff though: smaller ε requires larger N.
Also note the logical direction: what the inequality states is ν ≈ μ (the sample follows the bin), but we use it as μ ≈ ν (inferring the bin from the sample).
How is the bin related to learning?
- Bin:
- the unknown is a number μ
- Learning:
- the unknown is a function ƒ: X → y
We can think of the bin as the input space. Each marble is a point x ∈ X. So all possible credit applicants are represented in the bin.
Now, let’s say we have a hypothesis. For every applicant our hypothesis gets right, we mark a green marble in the bin. What we want is for the fraction we observe in our sample (ν) to tell us something about the fraction over the entire input space (μ).
Reiterating:
- green if
- h(x) = f(x)
- red if:
- h(x) ≠ f(x)
Here, we don’t have all the points in the input space to check; we only have a sample. The sample comes from a probability distribution P on X, which favours some points of X over others - this distribution generates the dataset. We are not restricting P; we don’t even need to know it, because Hoeffding’s inequality does not depend on the probability distribution (or on μ).
We aren’t done yet, what we have discussed is, for this h, ν ≈ μ within some ε
However, this is just verification of h, not learning. Currently we chose some h and verified that it makes sense. We don’t want to pick the h ourselves, we want the algorithm to do it for us - we need to choose h from H
This is simple, we already have a probability distribution that gives us some data points from the input space(bin). We can test our h on each and choose the h that gives us the maximum right results from them and invoke Hoeffding’s to get our bound.
Notation:
- both μ and ν depend on h
- ν is “in sample”, called Ein
- μ is “out of sample”, called Eout
- Ein and Eout are actually Ein(h) and Eout(h) (functions of h) because they depend on h
Hoeffding’s becomes
P[|Ein(h) - Eout(h)| > ε] ≤ 2e^(-2ε²N)
We still have a problem!
We cannot just have multiple h and apply Hoeffding’s to them. Why? Consider this:
Take a coin and flip it 10 times. If we get 10 heads and then choose the hypothesis “always heads”, ν is 1, but that does not mean μ is also 1.
The probability of 10 heads in 10 tosses is 1/2¹⁰.
If we toss 1000 fair coins 10 times each, the probability that some coin gets 10 heads is:
1 - P[no coin gets 10 heads]
= 1 - P[a particular coin doesn’t get 10 heads]^1000
= 1 - (1 - 1/2¹⁰)^1000
≈ 0.63
So it is more likely than not that some coin shows 10 straight heads - and those 10 heads say nothing about the real probability of getting heads. We cannot scan for the h (or the sample) that happens to look perfect and then quote Hoeffding’s for it. Hoeffding’s applies to each hypothesis individually; for each one there is a small probability of its sample being badly off in some way, and if those “off” events hardly overlap, their probabilities add up and we end up with a bad system (a simulation of the coin experiment follows below).
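A quick simulation of the 1000-coin experiment (the run count is arbitrary, just an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
runs, hits = 2000, 0
for _ in range(runs):
    flips = rng.integers(0, 2, size=(1000, 10))   # 1000 fair coins, 10 flips each
    hits += (flips.sum(axis=1) == 10).any()       # did some coin come up all heads?
print("simulated P[some coin gets 10 heads]:", hits / runs)
print("analytic value:", 1 - (1 - 2**-10) ** 1000)   # ~0.62
```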
We need to find a way to deal with multiple h/bins.
An easy solution:
- recall we had: P[|Ein(h) - Eout(h)| > ε] ≤ 2e^(-2ε²N)
- we wanted to make a statement about Eout based on Ein
What we want now is:
- P[|Ein(g) - Eout(g)| > ε] ≤ ??
- so, we want to choose the best hypothesis g and want a bound for that choice
- by plain logic (not using Hoeffding’s or anything), since g ∈ H, we have:
- P[|Ein(g) - Eout(g)| > ε] ≤ P[ |Ein(h1) - Eout(h1)| > ε or |Ein(h2) - Eout(h2)| > ε or … or |Ein(hM) - Eout(hM)| > ε ]
- where M is the number of h ∈ H or the cardinality of H
- this is valid because g ∈ H, g is one of the h-s
- This is the union bound, which assumes no overlap, this is the worst case, it cannot get worse than this
- using Hoeffding’s we get:
- P[|Ein(g) - Eout(g)| > ε] ≤ Σ P[|Ein(hm) - Eout(hm)| > ε], m → 1 to M
- P[|Ein(g) - Eout(g)| > ε] ≤ Σ 2e^(-2ε²N), m → 1 to M
But the problem is:
- P[|Ein(g) - Eout(g)| > ε] ≤ 2Me^(-2ε²N)
And M is the cardinality of H, which is generally ∞, so all we get is P[something] ≤ ∞ - a vacuous bound. So the more sophisticated the model you use, the more loosely Ein tracks Eout.
- We ended with a problem, the loose tracking of Eout(g) by Ein(g)
- Since g has to be one of h1, …, hM we conclude that: (union bound)
- if | Ein(g) - Eout(g) | > ε then:
- | Ein(h1) - Eout(h1) | > ε or
- …
- | Ein(hM) - Eout(hM) | > ε
- This gives us an added M factor
- The union bound generally works for other things because the events are disjoint. E.g., on a die roll, P[5 or 4] = P[5] + P[4].
We need a tighter bound on the tracking of Eout(g) by Ein(g). The union bound ignores any overlap between the bad events; it is the worst case, appropriate when the events are essentially disjoint (as with the independent coins). When the hypotheses are correlated and the events overlap, the true probability is much smaller.
We have established the principle that through learning you can generalize. We will later establish that even when the cardinality of H is infinite, we can still generalize - that is the theory of generalization.
- We’ll work with the MNIST like dataset. 16x16 grey level pixels.
This is the raw input. 16x16 - 256 pixels
- ‘raw’ input x = (x0, x1, …, x256)
Recall, we added the mandatory x0 to make the formula better
Our parameters: (w0, w1, …, w256) - 257 parameters, one per pixel plus the bias.
We can reduce the parameters, i.e. we can extract features:
- intensity and symmetry x = (x0, x1, x2)
- we have thrown away mostly irrelevant information, but we may also have lost some useful information
Plotting just 5 and 1:
- we’ll generalize perceptron to linearly non separable case
- What does PLA do?
- at each iteration, it picks one misclassified point and corrects the weights for it
- the data is not linearly separable, so it will not terminate
- we see that the error jumps around a lot
- also note that Eout is being tracked nicely by Ein
Final perceptron boundary (after stopping at 1000 iterations):
- we can make a modification - the “pocket algorithm”
- run PLA for a fixed number of iterations and keep (in your pocket) the weights that had the lowest Ein so far (a sketch follows below)
- The errors now look like this:
We get this boundary:
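A sketch of the pocket modification (it assumes X already contains the x0 = 1 column; the helper names are mine):

```python
import numpy as np

def pocket(X, y, max_iters=1000, rng=np.random.default_rng(0)):
    """PLA with a pocket: keep the weights with the lowest Ein seen so far.
    X: (N, d+1) including the x0 = 1 column; y: (N,) in {-1, +1}."""
    def e_in(w):
        return np.mean(np.sign(X @ w) != y)       # fraction of misclassified points

    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), e_in(w)
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w                              # separable after all
        n = rng.choice(misclassified)
        w = w + y[n] * X[n]                       # ordinary PLA update
        if e_in(w) < best_err:                    # pocket the weights only if Ein improved
            best_w, best_err = w.copy(), e_in(w)
    return best_w
```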
- we’ll generalize the target function from being a binary function (a classifier) to a real valued function
Let’s discuss the credit problem again
Linear regression input:
- let’s say we have d input features:
- annual salary, years in residence, years in job, current debt etc
Linear regression output: h(x) = Σ wixi = wTx
- summation over i → 0 to d
Linear regression dataset:
- (x1, y1), …, (xN, yN)
- yn ∈ R, is the credit line for customer xn
Error:
- we can use the squared error: (h(x) - f(x))²
- this squared error has good analytical properties - helps with differentiation etc
When we plot linear regression, we never plot x0 because that is always 1.
The “line” or “hyperplane” is one dimension short of the space you are working with. E.g., with the mandatory x0, one feature, and the output, we have 2 dimensions, so the fit is a 1D line.
The red is the in sample error Ein
We had: Ein(w) = (1/N) Σ (wTxn - yn)², n → 1 to N
Which can be re-written in matrix form as: Ein(w) = (1/N) ‖Xw - y‖²
Minimizing Ein:
We have only w as the variable. To find the minimum of Ein(w) we take its gradient and set it equal to 0 (a vector of 0s); this gives w = (XTX)⁻¹XTy = X†y.
X† (“X-dagger”) is called the pseudo-inverse because X is not a square matrix - it is a very tall matrix - but X† still acts as an inverse: if you multiply X† with X, you get the identity matrix.
*actually d+1 and not d, because we also have constant x0 ⬇️
X is N×d and XT is d×N, so XTX is (d×N)·(N×d) = d×d. Inverting a d×d matrix is cheap (the number of features is usually small), so we have: (XTX)⁻¹XT = (d×d)·(d×N) = d×N → X†. Then X†y = (d×N)·(N×1) = d×1 → our weight vector w.
y is the target vector X is the input data matrix
This is “one step learning”
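The one-step solution in code (a sketch; the synthetic data is just a sanity check of mine):

```python
import numpy as np

def linear_regression(X, y):
    """One-step learning: w = X_dagger @ y, where X_dagger = (X^T X)^{-1} X^T.
    X: (N, d+1) data matrix including the x0 = 1 column; y: (N,) targets."""
    X_dagger = np.linalg.inv(X.T @ X) @ X.T    # pseudo-inverse (np.linalg.pinv is the robust choice)
    return X_dagger @ y                        # learned weights, shape (d+1,)

# sanity check on synthetic data
rng = np.random.default_rng(2)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=100)
print(linear_regression(X, y))                 # close to w_true
```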
You can use linear regression for classification: set y to ±1 and the algorithm still works, treating the labels as real values. You then use sign(wTx) as the output - effectively a threshold at 0.
Linear regression applied like so 🔝 is called linear classification There is a problem here: 🔝
The value for the red region is highly negative for the lower points. But their target value is just -1. So, the linear regression reports high error due to this and the boundary is pushed towards the center
So, one can use Linear Regression to give a jumpstart to the perceptron - a good initial weights to start with.
- Sometimes we need to transform data to make them linearly separable
- Sometimes we want non-linear features; in the credit example, we may not want “years in residence” to enter linearly - we may want something like 0 if it is under 1 year and 1 if it is over 5 years
- Non linear transformations remain within the realm of linear models.
This is because the variables of the function are the weights, not the features. So, we can do anything with the features, we are still in the linear realm.
So, we can do this transformation:
(x1, x2) → (x1², x2²)
We can now use the perceptron in the transformed space (a sketch follows below). We cannot use arbitrarily complex transformations; there is a catch about which we’ll see later.
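A sketch of the idea, using linear regression for classification (introduced just above) in the transformed space; the circular toy data is my own illustration:

```python
import numpy as np

def phi(X):
    """(x1, x2) -> (x1^2, x2^2): a circular boundary in X space becomes linear in Z space."""
    return X ** 2

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(0.5 - X[:, 0] ** 2 - X[:, 1] ** 2)   # +1 inside the circle x1^2 + x2^2 = 0.5

Z = np.hstack([np.ones((len(X), 1)), phi(X)])    # (1, x1^2, x2^2): add z0 = 1
w = np.linalg.pinv(Z) @ y                        # one-step learning in Z space
print("Ein in Z space:", np.mean(np.sign(Z @ w) != y))   # small: the data is separable here
```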
- Linear models use the “signal” wTx (which is vector form for Σ wixi)
- One step learning: w = (XTX)⁻¹XT · y
- Non linear transformation:
- wTx is linear in w
- any transformation x → z preserves this linearity and so can be used
We had Φ transformation last time:
Note, Φ is a non-linear transformation of the input space: straight, parallel lines drawn in the input space won’t remain straight and parallel in the transformed space. Different input points may map to the same point of Z space, and some points of Z space may have no pre-image in X.
Note, we can transform from the d-dimensional space to a d̃-dimensional space; there is no limit in principle.
We learn the weights in the transformed space, the targets remain in the original space.
The learned hypothesis is still g,
Note, Φ(x) gives us z
Error measures try to answer this:
- what does it mean for “h ≈ ƒ”?
- Error measure: E(h, ƒ)
- If error is 0, h perfectly represents ƒ and we are golden.
- It is almost always a “pointwise definition”:
- e(h(x), ƒ(x))
- example: squared error: e(h(x), ƒ(x)) = (h(x) - ƒ(x))²
- another example (binary error): e(h(x), ƒ(x)) = [[ h(x) ≠ ƒ(x) ]]
- [[ h(x) ≠ ƒ(x) ]] is 1 if h(x) ≠ ƒ(x), else 0
How do we go from the pointwise to overall? We take pointwise and average it to get to overall
- Thus, in-sample error: Ein(h) = (1/N) Σ e(h(xn), ƒ(xn)), n → 1 to N
- Also, out of sample error: Eout(h) = Ex[e(h(x), ƒ(x))]
- this is just the logical definition of Eout
It is the expectation of pointwise error e on data points x
So, the learning diagram becomes:
It depends on the system.
- For some systems, false positive would be expensive. For other systems, false negatives would be expensive
For classification, we can create a table:
For supermarket:
False negatives are expensive: if you are given a discount coupon, go to the store, and they say you can’t use it, that is horrible. A false positive is cheap: someone who shouldn’t have gotten the discount gets it anyway.
For the CIA, false positives (letting the wrong person in) are disastrous.
Sometimes we don’t have the ideal error measure, so we use alternative error measures
- if the noise is gaussian, we can use squared error
- in binary, we can use cross entropy error measure (this is what we used in logistic regression, the Bernoulli thingy)
Sometimes, we use friendly measures:
- squared error for linear regression is friendly because it has a closed-form solution. The cross-entropy error measure in classification turns out to be convex, so we can find the global minimum, etc.
Very important, because the target function is not always a “function” in true mathematical sense, it is noisy. This is because the target function cannot capture ALL the information that plays a part in determining the outcome
So, we use target distribution:
- Instead of y = ƒ(x) we get:
- P(y|x)
Earlier, (x, y) was generated by a probability distribution (the dataset was generated by PD, say P) but now, y is non-deterministic and is generated by a PD too. So, (x, y) is generated by a joint PD:
- P(x, y) = P(x)P(y|x)
- we can model noisy target as:
- Noisy target = deterministic target “ƒ(x) = E(y|x)” + noise “y - f(x)”
There is no loss of generality in assuming a probability distribution P(y|x) over y: even if the target is indeed a deterministic function, we can model it as
P(y|x) = 0 everywhere except at y = f(x)
So, discrete case:
- Probability is 1 for y=f(x), 0 elsewhere
So, continuous case:
- delta function at y=f(x), 0 elsewhere
Noisy target function?
- One view: the target is noisy because we aren’t able to model it completely. We model it with limited features and are not capturing all the factors it depends on, so the target we are trying to learn looks noisy. The ground-truth process generating the data is not noisy; our limited view of it is.
- Or is it inherently noisy? The function that generated the data in the input space, that true function ƒ is noisy itself?
Our updated learning diagram:
Now, the target function is replaced by a distribution. And we also have an error measure lying between learning algorithm and final hypothesis. It is the prize we pay for not being perfect in our approximation of the target function.
Note:
- P(x)
- This is the probability of generating the dataset that we train on
- We introduced this to accommodate Hoeffding’s
- it quantifies the relative importance of x - it is not what we are interested in
- this also dictates what region of the input space you get samples from. The probability distribution could be such that it never gives you points from particular regions of input space. Currently, our model will be agnostic of that, but in Bayesian learning, the confidence of the model will be lower in those regions where it hasn’t seen examples from
- also, once you have P(x), you draw samples from it independently, i.e. i.i.d - independent identically distributed
- P(x) is not a problem if it has a long tail or if it is a delta function or whatever; as long as it is used in both training and testing.
- P(y|x)
- This is the probability the target function gives y for an input x
- This is because the target function is inherently noisy
- this is what we are trying to learn
Both these PDs together generate our dataset: P(x, y). We must always remember that P(x, y) is really a mixture of 2 PDs that are conceptually very different.
What we know so far:
- we know that learning is feasible
- Eout(g) ≈ Ein(g)
- However, this alone is not really learning. It only says that the performance of our “selected” hypothesis on the sample is a good indicator of how it will perform out of sample. This is just good generalization.
- This guarantee that Ein is a good proxy for Eout is important for us to learn. It is a building block for learning.
- Real learning is actually:
- g ≈ f
- i.e. Eout(g) ≈ 0
- this measure how far are you from the target function
- g ≈ f
- We need Eout(g) ≈ 0
- This is achieved thru:
- Eout(g) ≈ Ein(g) and Ein(g) ≈ 0
- It’s like saying: I do well on the sample dataset, and I know this means I will do well outside the sample too
We know the answer to 1 is “Yes”, but we were left with the M factor on the right-hand side to deal with. The answer to 2 is what ML algorithms are supposed to do: give us a good fit on the sample data.
The theory will achieve this for us:
- characterize feasibility of learning for infinite M
- we will have a single parameter that tells us the sophistication of the hypothesis set
- characterize the tradeoff b/w model complexity and how well our Ein(g) tracks Eout(g)
Ein is averaged over the N sample points; Eout is, by definition, the expected pointwise error over the whole input space.
Also, last time we discussed the noisiness of the target: the target is inherently noisy, it is not a function in the strict mathematical sense. There is a probability distribution over the outcome y given the input, so y = ƒ(x) + noise, or y ~ P(y|x).
Earlier, when we considered y to be deterministic, we generated the sample by P(x) But now, y also comes from the PD, so we have the dataset generated from 2 PDs P(x, y) = P(x)P(y|x)
Eout(h) is now Ex,y[e(h(x), y)]
Say you have a final exam. You get some practice problems. This is training for final exam. If you directly do the final exam questions, you might not have learned. The end goal is low Eout which only happens if you study and learn the material.
We have:
- Testing
- How well you do in the final exam (Ein) tracks how well you do in the wild (Eout)
- Training
- How well you do in the practice does not track very well how you do in the wild
- There is a factor of M that comes in here
- M represents “how much you explore”, how many cases are possible
We need to replace M with a friendlier quantity, if we are able to do it, we are in business.
Recall, M is cardinality of hypothesis set. Our learning algorithm is free to choose any hypothesis it wants from the set and so to invoke Hoeffding’s inequality, we had to add Ps of all bad events i.e. we had:
P[B1 or … or BM] where B is the bad event: | Ein(hm) - Eout(hm) | > ε
By the union bound: P[B1 or … or BM] ≤ P[B1] + … + P[BM]
We treated the bad events as if they were disjoint, but in reality they are related and overlap a lot
We can solve this exactly for the perceptron for eg, but it would be a nightmare. We want to extract from a given hypothesis set H, a number that would characterize this overlap and give us a good bound
- Consider a perceptron
Also, we have Ein,
Here, we apply Hoeffding’s inequality and we know that Ein tracks Eout etc Now, consider another perceptron, slightly different:
The green line is another perceptron. In both the cases, Eout will be slightly different. Also, Ein will be different if there is any point that falls in the yellow region.
So | Ein(h1) - Eout(h1) | ≈ | Ein(h2) - Eout(h2) |: if one exceeds ε, the other (almost certainly) does as well. But the union bound treats these strongly correlated events as if they were unrelated. So just counting the number of hypotheses does not give a tight bound; we can do better.
Instead of the whole input space, we consider only a finite set of input points, and we distinguish hypotheses only if they classify those points differently. Given a set of points, we count all the classification patterns the hypothesis set can produce on them; each such pattern is called a dichotomy. This count is a good proxy for the complexity of the hypothesis set: a powerful hypothesis set can produce every possible pattern, while a weaker one cannot achieve some of them.
The number of dichotomies is a proxy for the number of hypotheses. It is based only on the sample points, not on the whole input space. Dichotomies are “mini hypotheses”.
A hypothesis is a function h: X → {-1, +1} which takes in the full input space and gives the classification
A dichotomy is also a function h: {x1, x2, …, xN} → {-1, +1} which takes in only the input points and gives the classification (each x, say x1 is a vector of all input features for first data point)
Number of hypotheses: |H| can be ∞. Number of dichotomies: |H(x1, x2, …, xN)| ≤ 2^N.
The number of dichotomies is a candidate for replacing M. We define the growth function mH(N) as the maximum number of dichotomies H can produce on any N points.
The subscript H is because the growth function is defined for a given hypothesis set.
We also know that mH(N) ≤ 2^N
Growth function for the perceptron:
- N = 3, i.e. mH(3) = 8
- recall this is the maximum dichotomies possible
- N = 4, i.e. mH(4) = 14
- this is because we cannot get this combination with our perceptron:
- It is defined on the real line
- it has a point, everything on right is +1, on left is -1
H is the set of h: R → {-1, +1}. For N points we get N + 1 dichotomies (all blue, all red, or the threshold in one of the N - 1 gaps between points), so mH(N) = N + 1
- here, we have an interval
- everything within is +1, everything else is -1
H is the set of h: R → {-1, +1}. We get a different dichotomy by choosing which 2 of the N + 1 gaps between points the interval’s ends fall into:
mH(N) = (N+1)C2, plus 1 for the case where both ends fall in the same gap (the interval contains no points, everything is -1)
So, mH(N) = (N+1)C2 + 1 = N²/2 + N/2 + 1
More powerful than the last one!
- we define a convex region as our +1 area
- a convex region is one where the line segment connecting any 2 points of the region lies entirely inside the region. H is the set of h: R² → {-1, +1} for which the region {x : h(x) = +1} is convex
What is the growth function? If we put our N points on the perimeter of a circle, we can realize ANY of the 2^N classifications, e.g.:
So the growth function is mH(N) = 2^N. Since the hypothesis set achieves all 2^N dichotomies, we say that H shatters the N points.
We have 3 growth functions: N + 1 (positive rays), N²/2 + N/2 + 1 (positive intervals), and 2^N (convex sets).
Note that dichotomy counting isn’t perfectly tight either: convex sets are a complex hypothesis set but not the most complex imaginable, yet they already hit the 2^N growth function (a dichotomy-counting sketch follows below).
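A brute-force sketch that counts the dichotomies positive intervals can realize on N points and checks against N²/2 + N/2 + 1 (the realizability test - the +1 points must form one contiguous block - is my own restatement):

```python
from itertools import product

def dichotomies_positive_intervals(N):
    """Count labellings of N sorted points realizable by 'inside an interval -> +1'."""
    count = 0
    for labels in product([-1, +1], repeat=N):
        ones = [i for i, v in enumerate(labels) if v == +1]
        # realizable iff the +1 points are contiguous (or there are none)
        count += (not ones) or (ones == list(range(ones[0], ones[-1] + 1)))
    return count

for N in range(1, 8):
    print(N, dichotomies_positive_intervals(N), N ** 2 / 2 + N / 2 + 1)   # the two counts agree
```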
So, talking about replacing M with mH(N)… We had:
We can replace M with mH(N) if mH(N) is a polynomial. This is because then we can increase N and get the RHS to be very small, so small that the bound makes sense. The only criterion is that mH(N) should be a polynomial so that we can defeat it (we cannot defeat M right now, it is infinite)
So, once we are able to prove that our hypothesis set’s growth is polynomial, we will be able to learn using that hypothesis set. We may need a lot of data points, but we will be able to generalize to the entire input space given the finite (albeit large) input points
Defined as a k at which the growth function falls below 2^k: mH(k) < 2^k. “If no data set of size k can be shattered by H, we call k a break point for H”
- For perceptrons it is 4.
So, break point for
- positive rays:
- k = 2
- For 2 points, we cannot get this: “<red> <blue>”
- we can only get: “<red> <red>” or “<blue> <blue>”
- positive intervals
- using the condition mH(k) < 2^k we have: k = 3
- we cannot get: <blue> <red> <blue>
- convex sets
- never fails, so, k = ∞
So, we have this result:
- No break point ⇒ mH(N) = 2^N
- Any break point ⇒ mH(N) is polynomial in N
So, to be able to learn with a given hypothesis set, we just need to prove that it has a break point
Example: Given 3 points, and given that the break point is 2, we can have only 4 dichotomies
Any additional dichotomy (out of the 2³ = 8) is not allowed, because it would force some pair of the points to be shattered (and that is not allowed, since the break point is 2)
The existence of a break point restricts the number of dichotomies drastically.
We have 2 things to do now:
- Prove that a growth function mH(N) with a break point is polynomial
- Prove that we can replace M with mH(N)
Recall that to show mH(N) is polynomial, we need to show it is bounded above by a polynomial in N in the general case. It is! mH(N) is bounded above by a polynomial of degree k - 1 in N, where k is the break point for the hypothesis set.
- examples:
- H is positive rays
- k is 2, so mH(N) should be bounded by a polynomial of degree 1. True, since mH(N) = N + 1
- H is positive intervals
- k is 3, so mH(N) should be bounded by a polynomial of degree 2. True, since mH(N) = (N+1)C2 + 1
- H is 2D perceptrons
- k is 4, so mH(N) should be bounded by a polynomial of degree 3. We did not compute mH(N) for it, but we know it is bounded above by such a polynomial
So, as long as there is a break point, we will be able to get a polynomial and thus learn.
The result that we can replace M with mH(N) - the Vapnik-Chervonenkis inequality, P[|Ein(g) - Eout(g)| > ε] ≤ 4 mH(2N) e^(-ε²N/8) - is the most important theoretical result in machine learning.
- We saw that mH(N) is bounded above by a polynomial of degree k - 1 in N, where k is the break point for the hypothesis set
- Also, that we can replace M with mH(N) and thus learn according to VC inequality
Thus, we can learn for any hypothesis set which has a break point. This lecture will give us the notion of the “VC dimension”, which characterizes the complexity of the hypothesis set
- It is a quantity defined for a hypothesis set H, denoted by dvc(H)
- It is the most points H can shatter
- i.e. it is the “largest” value of N for which mH(N) = 2^N
- it is k - 1, where k is the (smallest) break point
- N ≤ dvc(H) ⇒ H can shatter N points
- N > dvc(H) ⇒ N is a break point for H (anything above dvc(H) is a break point)
- in terms of break point k:
- for any hypothesis set with VC dimension dvc(H), the growth function is bounded by a polynomial of degree dvc(H)
- examples:
- H is positive rays
- dvc(H) is 1
- H is 2D perceptrons
- k is 4, so dvc(H) is 3
- H is convex sets
- k is ∞, so dvc(H) is infinite as well
- this is the most pessimistic reading: shattering only requires some arrangement of the points (the neat circle), and in practice our points won’t be arranged like that
- so in practice we may still be able to learn, even though the VC guarantee itself is vacuous
Reiterating: we have established that if dvc(H) is finite (i.e. a break point exists), g ∈ H will generalize - independently of the learning algorithm, the input distribution, and the target function.
- for 2D, dvc(H) was 3
- for 3D, dvc(H) is 4
- in general, dvc(H) = d + 1 where d is the dimension
- I.e., the perceptron in d dimensions can shatter d+1 points (and no more)
- Also, d + 1 is the number of parameters in the model - (w0, w1, …, wd)
- the #parameters are analog degrees of freedom
- the VC dimension makes them binary degrees of freedom - it is based on the #dichotomies possible, #of points you can shatter
- examples:
- H is positive rays
- dvc(H) is 1, so 1 parameter, 1 degree of freedom - which is what we have!
- H is positive intervals
- dvc(H) is 2, so 2 parameters, 2 degrees of freedom - which is what we have!
Actually, it’s not just parameters, it is degrees of freedom. A parameter may not contribute to the degree of freedom, then it won’t count.
Consider a 1D perceptron,
It has 2 parameters (variables): 1 weight and 1 bias/threshold. Now take its output and feed it into another perceptron, and so on, 3 more times; that final output is the answer. Thus, we have this:
Here we have 4×2 = 8 parameters, but still effectively 2 degrees of freedom, so dvc(H) is still 2. You don’t care about the internal structure; you ask yourself: what is the largest number of points I can shatter? That number is dvc(H). That’s all.
So, dvc(H) measures the effective number of parameters
- Number of data points needed:
- we had this statement according to the VC inequality:
Here the RHS behaves like N^dvc · e^(-N) (up to constants): a polynomial in N, where d is dvc(H) and N is the number of samples, times a decaying exponential. Plotting LHS (probability bound) vs N:
The polynomial dominates initially (and the bound is absurd, like LHS > 10), but eventually the exponential kills it and we get tighter and tighter bounds. This is why, if we train a linear model on a large number of sample points and get some error on them, we’ll get about the same error out of sample, with high probability. That is the case of a very small δ (the RHS), i.e. a very tight statement about |Ein - Eout| holding; but the ε itself may be large. That is to say: we are 99% sure that the in-sample error is within 60% of the out-of-sample error - “very confident that it is a bad system”.
Rule of thumb:
- N ≥ 10dvc(H)
- You need 10 times the dvc(H) number of examples
- We can rearrange things:
- We had this from the VC inequality:
From the RHS we can solve for ε in terms of δ: ε = sqrt( (8/N) ln(4 mH(2N)/δ) ). δ is just how much failure probability you allow in the statement you make.
So, we have:
- ε depends on N, δ, and mH (the growth function of the hypothesis set) - we call the RHS Ω (a numerical sketch follows after this list)
- if mH is large, dvc(H) is large, the guarantee on generalization(ε) is worse
- if δ is small, I.e. You want to be very sure of the statement you make, worse ε
- if N is large, ε becomes smaller
- eventually, the log is killed by linear N (just like polynomial is killed by exponential)
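A small numerical sketch of that bound, using the simple polynomial bound mH(N) ≤ N^dvc + 1 (the dvc = 10 example is arbitrary):

```python
import numpy as np

def vc_epsilon(N, dvc, delta):
    """Generalization bound: with probability >= 1 - delta, |Eout - Ein| <= epsilon,
    with epsilon = sqrt((8/N) * ln(4 * m_H(2N) / delta))."""
    m_H = (2 * N) ** dvc + 1                     # crude polynomial bound on m_H(2N)
    return np.sqrt(8.0 / N * np.log(4 * m_H / delta))

for N in (1_000, 10_000, 100_000, 1_000_000):
    print(N, round(vc_epsilon(N, dvc=10, delta=0.05), 3))
# the bound is very loose for small N and only slowly tightens:
# the log term grows, but it is eventually killed by the linear N
```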
We can re-write as:
With probability ≥ 1 - δ
Ω is a function of
- N → goes down with it
- the more examples, the tighter bound on generalization
- H → goes up with it
- more complex H, higher the VC dimension dvc(H) and so less generalization
- δ → goes up with smaller delta
- as you want to be more sure about the confidence in your statement, less generalization
We can simplify further as:
- remove the modulus; the direction we actually care about is Eout exceeding Ein
- so, Eout - Ein ≤ Ω
- Eout - Ein is called the generalization error
- moving Ein to RHS,
- Eout ≤ Ein + Ω(N, H, δ)
- So, the RHS bounds the Eout
- We have control on the RHS, Ein we are controlling and Ω is dependent on H, N, δ etc
- If we choose a more complex H, Ein goes down, but Ω goes up.
- Hence, there is a tradeoff involved here
This form will also give us the groundwork for regularization. Earlier, we used just Ein as a proxy for Eout. It seems here, that we can use Ein + Ω as the proxy and that may give us a better handle on Eout.
It is not always possible to find the dvc(H) of a hypothesis set, e.g. neural nets. We can say it is bounded above by the number of parameters, but it can be much, much lower, as we saw with the chained perceptrons.
- We saw that dvc(H) is just the max number of points a hypothesis set can shatter
- It is k-1, where k is the break point
- We have a rule of thumb: N ≥ 10·dvc(H)
- Also, we rearranged the generalization bound to:
- Eout ≤ Ein + Ω
- Ω is a function of:
- δ - what is the probability of error in the statement we make, I.e., what is the probability that Ein tracks Eout within ε
- ε - how loosely does Ein track Eout
- It is a stand-alone theory which gives us a different angle on generalization (as contrasted with the VC analysis)
- The tradeoff is b/w approximation and generalization
- Small Eout is what we want; good approximation of ƒ out of sample
- More complex H → better chance of approximating ƒ - we have a lot of options
- Less complex H → better chance of generalizing - with fewer options it is easier to zoom in on the right one
- Ideal is H = {ƒ}: the hypothesis set with only the target function in it (but we would need to know ƒ to begin with)
- The quantification of this tradeoff in VC dimension analysis is:
- Eout ≤ Ein + Ω
- Ein is approximation, because we are trying to fit a target function to input dataset
- Ω is generalization
Bias-Variance provides an alternate way of looking at this same tradeoff
- It decomposes Eout into:
- How well H can approximate ƒ - how flexible is your H
- How well we can zoom in on a good h ∈ H
- This applies to real valued targets. We’ll use squared-error because it helps with derivation
- We start with Eout
- We keep D in the notation because the Es depend on the dataset. Earlier, in VC analysis also it was present, but we never wrote it because it wasn’t used
- Eout(g(D)) = Ex[(g(D)(x) - ƒ(x))²]
- Expected error over entire space 🔝
- To remove dependency on a particular D, we take ED on both sides
- ED[Eout(g(D))] = ED[Ex[(g(D)(x) - ƒ(x))²]]
- we now manipulate this equation 🔝
- Let’s focus on just ED[(g(D)(x) - ƒ(x))²] for a fixed x (we can swap the order of the two expectations)
- we define an “average” hypothesis ḡ: ḡ(x) = ED[g(D)(x)]
- what is gbar?
- imagine you have many datasets, so gbar is the average of the hypothesis that you choose from them.
So, we have this derivation (add and subtract ḡ(x)):
ED[(g(D)(x) - ƒ(x))²] = ED[(g(D)(x) - ḡ(x) + ḡ(x) - ƒ(x))²]
= ED[(g(D)(x) - ḡ(x))²] + (ḡ(x) - ƒ(x))² + 2 (ḡ(x) - ƒ(x)) ED[g(D)(x) - ḡ(x)]
We get this 🔝 on expanding the square. The second term is independent of D, so ED of it is itself. In the third (cross) term, ED[g(D)(x) - ḡ(x)] = ḡ(x) - ḡ(x) = 0, so it vanishes.
So, we get:
ED[(g(D)(x) - ƒ(x))²] = ED[(g(D)(x) - ḡ(x))²] + (ḡ(x) - ƒ(x))²
This 🔝 ⬆️ is a very important equation. The first term tells you how much the hypothesis you got from your particular D differs from the average hypothesis ḡ; ḡ is roughly the best you can hope for, since averaging over datasets cancels out the dataset-to-dataset fluctuation.
There are 2 hops; the hop from your hypothesis to the best you can get, and the second hop is from the best you can get to the actual target function The second term is how far is the best possible you can get from the target function itself?
- The 1st term is variance
- This is how far you are from the best you could do within your H. You cannot actually obtain ḡ because you don’t have all possible datasets, only the one you were given
- The 2nd term is bias
- Because even the best your hypothesis set can offer (ḡ) may be far from the target function - the set is “biased” away from ƒ
Thus, recall we started with:
So, we get: ED[Eout(g(D))] = Ex[bias(x) + var(x)] = bias + var
So, we have broken down the out-of-sample error into its bias and variance components. E.g., if Eout is 0.3, we might find that 0.05 of it is bias and 0.25 is variance.
We have:
Pictorial representation:
If we have a H with only 1 function, we have no variance and we always choose that. But it may not be the best, our g ( = gbar) will be biased away from the target function
If we have a very complicated H, we won’t be able to choose the best - the target function because we have a lot of variance. We get a different D each time to learn etc. And depending on that D, we might not choose ƒ. The centroid of the red region is gbar. It would be quite close to ƒ. (The difference b/w them would be the bias, which is close to 0 here)
As the complexity of H goes up, bias goes down and variance goes up
- Let’s take our target function to be a sine curve.
- ƒ(x) = sin(π x), ƒ:[-1, 1] → ℜ
- Our dataset is only 2 datapoints.
- We have 2 hypothesis sets, H0 and H1
- H0 is the constant model - h(x) = b
- H1 is the linear model - h(x) = ax + b
- Which one is better?
- Better for what?
The line is better: since we know the whole target function, a line fits it better than a constant (we are using mean squared error).
The errors are in yellow:
Eout in this case is 0.2
For H0, we have the constant only. So, the hypothesis will be g(x) = 0 and the error will be huge:
Eout in this case is 0.5
So, the linear model wins. We can do better with a 3rd order polynomial, even better with a 17th etc. This is because we have no question of zooming in, we have all the information
We get 2 examples now. We don’t know the target function, we just have independently picked examples. So, we have:
So, H0 and H1 will be:
We can compute the error (the yellow regions) etc, but it depends on which 2 points we get, on our dataset. This is why we had the bias-variance tradeoff. Now, we can do a bias-variance decomposition on these two hypothesis sets.
So, we do a simulation in which we take 2 points on random from the target function and fit our constant hypothesis to them We get this:
The gray lines are all the various g-s we choose depending on the input dataset D We can take the average of it, to get g-bar
ḡ is just what we had in the approximation case. The average hypothesis is inherently better because it “cancels out the variance”; but the output of our learning process is one of the gray lines, not ḡ. The gray shaded region is the variance around ḡ - the standard deviation around ḡ.
The error b/w the green line and the target function will give you the bias. The gray region will be the variance
If we have the H1, the straight line, we get this (on the same dataset as in case H0):
Note we have a lot of variance.
On average we get:
The variance depends on x; the gray region shows the standard deviation. When we want a single number, we take the expected value of the squared width of the gray region. So we have better approximation but very bad (high) variance. Given only 2 points, we are better off with a constant.
So, if asked what is better at approximating a sine curve, a line or a constant? - A line of course! But, if asked what is better at approximating 2 points coming from some unknown curve? - A constant is better
Let’s quantify this:
Here, the bias is exactly what we had for the approximating case. The variance is 0.25
Whereas for H1,
The bias is 0.21. This is slightly higher than in the approximating case where it was 0.20. This is to be expected because here, we get only 2 points at a time and not the whole sine curve. The variance is huge.
So, H0 is better we see.
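A simulation sketch of this experiment (grid size, number of runs, and the least-squares fits are my choices):

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(np.pi * x)
x_grid = np.linspace(-1, 1, 201)
runs = 10_000

b0 = np.empty(runs)                  # learned constant for H0: h(x) = b
ab = np.empty((runs, 2))             # learned (a, b) for H1: h(x) = a x + b
for r in range(runs):
    x = rng.uniform(-1, 1, 2)        # dataset: 2 points on the (noiseless) sine curve
    y = f(x)
    b0[r] = y.mean()                                   # least-squares constant
    a = (y[1] - y[0]) / (x[1] - x[0])                  # the line through the 2 points
    ab[r] = a, y[0] - a * x[0]

def bias_var(preds):                 # preds: (runs, grid) predictions of each learned g
    g_bar = preds.mean(axis=0)
    return np.mean((g_bar - f(x_grid)) ** 2), np.mean(preds.var(axis=0))

print("H0 bias, var:", bias_var(np.tile(b0[:, None], (1, x_grid.size))))   # ~0.50, ~0.25
print("H1 bias, var:", bias_var(ab[:, :1] * x_grid + ab[:, 1:]))           # ~0.21, ~1.7
```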
In any ML scenario, we are matching the model complexity to the data resources, not the target complexity; we don’t know what the target function is.
Let’s say you have 100 datapoints. Is there any utility in taking 10 points at random and trying to learn a hypothesis? Doing this for many random samplings of 10 points and then averaging the hypothesis would give us something close to gbar. This is a way of dealing with variance, no?
Plot of Eout and Ein as a function of N, for a data set D of size N. Expected out-of-sample error: ED[Eout(g(D))]
- it depends on the dataset. If you want it to be agnostic of any particular dataset, we “integrate” D out and get expected value wrt D
Expected in-sample error: ED[Ein(g(D))]
- this measures how well you can fit the data you are given
How do these vary with N?
This is for the simple model.
The same thing holds for the complex model; the graph is just shifted to the right. The region where Ein is zero extends roughly up to N ≈ dvc: up to there the hypothesis set can fit everything perfectly, since the points can be shattered.
In VC analysis, we had Eout which comprised of in-sample error and generalization error (Ω) So, the figure looks like this:
One caveat: in the VC analysis the generalization error is not this small, it is much, much larger; it just follows this shape. The bound is not that tight.
If we plot the same thing for the BV analysis, we get:
Here, the bias is the error in g-bar, that is the best you can do. As N goes up, the variance decreases, and we get closer and closer to g-bar.
Let’s say we have a noisy target y = w*Tx + noise. D is {(x1, y1), (x2, y2), …, (xN, yN)}, with the input points picked independently.
Learning solution: w = (XTX)⁻¹XTy
In-sample error vector: Xw - y
Out-of-sample error: same input points but fresh noise, Xw - y′
On analysis we get:
σ² is the variance of the noise, and that is the best Eout you can hope for, since you can’t capture the noise itself. Also, for N up to d+1 you can fit the points perfectly (Ein = 0). As N grows, the variance gets averaged out and Eout comes down towards σ².
Best approximation error = σ²
Here we have an exact formula for the expected generalization error, 2σ²(d+1)/N, and we see that the N needed scales with dvc.
- We took the expected value of the Eout wrt the identity of dataset D, size fixed at N to get rid of variation due to datasets. And we got a clean decomposition of Eout into bias + variance terms.
- Bias is how far is your best hypothesis in your hypothesis set away from ƒ
- Variance is how close you can get to that best hypothesis, given that you only have one finite dataset.
- g(D)(x) → gbar(x) → ƒ(x)
- N ∝ dVC
We used non-linear transforms earlier: (x1, x2) → (z1, z2, …, zd̃), i.e. z = Φ(x). Example: z = (1, x1, x2, x1², x2², x1x2) - the quadratic transformation. Our final hypothesis g(x) always lives in X space.
For classification we take sign(w̃Tz); for regression we use w̃Tz as is.
In terms of generalization 🔝:
Earlier, in X space, we had d + 1 free parameters in w. In the transformed space we have d̃ + 1 free parameters in w̃, so dvc goes up and we pay a price in generalization.
However, we can do this to reduce the parameters:
Here, we are reducing the parameters to just 1. This won’t rescue generalization, because we looked at the data and did part of the “learning” ourselves: we decided that we need a circle and don’t need the cross terms, etc. We have to pay the dvc price for what we started with in our head, which is the full quadratic form.
Looking at the data before choosing the model is hazardous to your Eout: you effectively explore a large hypothesis space but charge yourself for only a small one. This is data snooping.
This is why you cannot arbitrarily go to very high dimensions during transformations, because the dVC gets high and generalization suffers.
We have seen so far:
- Linear perceptron
- Linear regression
Now we will see Logistic regression.
This will be our third linear model. Being linear means that we compute the signal as being a linear combination of the weights. And then we use the signal to do classification, estimation etc
The perceptron is this:
The staircase pattern is there because the perceptron applies a step function: we look at the sign of the signal and return +1/-1.
All the linear models differ in what they do to the signal. Perceptron looks at the sign 🔝
Regression uses s as is. We can say we apply the identity function.
Logistic regression squashes it to be b/w 0 and 1 by applying a nonlinearity to it (logistic function θ).
This is somewhat b/w perceptron and linear regression
The logistic function θ looks like this:
This is a soft threshold. There are many formulas that give this shape; we will use θ(s) = e^s / (1 + e^s).
So, we have: h(x) = θ(s), s = wTx. The signal s, can be thought of as a “risk score”
During training we only see a binary label on each datapoint in D, not a probability. The target function itself is noisy: it gives y = +1 with probability ƒ(x) and y = -1 with probability 1 - ƒ(x), and that underlying probability is what we are trying to learn.
We can either do a cost-benefit analysis and come up with a very plausible error measure. Or we can use something that is friendly to the optimizer.
For each (x, y), y = +1 is generated with probability ƒ(x). We can build a plausible error measure based on likelihood:
We will grade different hypothesis according to the likelihood that they generated the data. This gives us a way of grading the different hypotheses.
Let’s assume that the datapoints in D were generated by h (i.e. h = ƒ). This is the reverse of what we were doing earlier: before, we asked for the most probable hypothesis given the data - that was clean. Now we ask for the probability of the data given the hypothesis, which raises some concerns. Bayesians get around this by starting with a prior and then using the likelihood to get a posterior.
What we need to do is, we need to compute the LHS in 🔝 Then we select the h which has the greatest LHS
So, we do this: We replace h(x) in that formula 🔝 with:
h(x) = θ(wTx)
Also, we see:
θ(-s) = 1 - θ(s)
So, we get:
P(y | x) = θ(y · wTx)
This is for one point. For all the points in the dataset, the likelihood is the product Π θ(yn · wTxn), n → 1 to N:
So, since we find the w that maximizes the likelihood of getting this dataset, we learn something about the underlying probability distribution that generated this dataset.
How do we maximize the likelihood?
We have:
recall, we are maximizing wrt w
Instead, we can maximize the log-likelihood.
Equivalently, we can minimize the average negative log-likelihood: (1/N) Σ ln( 1/θ(yn wTxn) ), n → 1 to N
We can substitute θ(s) = 1/(1 + e^(-s)) and get this simplification:
Ein(w) = (1/N) Σ ln(1 + e^(-yn wTxn))
So, we call the term inside the Σ the pointwise error e(h(xn), yn) = ln(1 + e^(-yn wTxn))
The exponent -ynwTxn is intuitive: if the risk score wTxn is large and positive but yn is -1, the exponent is large and positive, so the error is large; if the score agrees in sign with yn, the exponent is very negative and the error is close to 0.
This is called the cross-entropy error. (b/w h and ƒ)
Since we know the error measure, we can minimize it
In linear regression we had a closed-form solution: the pseudo-inverse, one-step learning.
We cannot find a closed form solution for the logistic regression case. We need an iterative solution - gradient descent We have a convex error surface in logistic regression, so we can optimize it perfectly. We won’t have this 🔝 when we have neural networks etc.
We have:
We need the direction v̂. Earlier we just said that the direction of steepest descent is along the negative gradient of the surface at that point. An alternative way to arrive at the same idea:
We take a step of size η along a unit vector v̂ and look at the change in error: ΔEin = Ein(w(0) + ηv̂) - Ein(w(0)).
Using Taylor series expansion of the first term, we get:
If the surface were linear, we wouldn’t have the O(η²) term; the first-order approximation would be exact.
We just need the direction v̂ that makes this as negative as possible, which is v̂ = -∇Ein(w(0)) / ‖∇Ein(w(0))‖.
Size of η?
η should increase with the slope. So instead of a fixed step of size η along v̂, we take Δw = -η∇Ein(w): the step size now scales with the gradient.
So, we moved from a fixed step size to a fixed learning rate. We can manipulate that as well, with momentum etc.
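A sketch of logistic regression trained with batch gradient descent on the cross-entropy error (the synthetic data and step counts are illustrative):

```python
import numpy as np

def cross_entropy(w, X, y):
    """Ein(w) = (1/N) sum ln(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def gradient(w, X, y):
    """Gradient of the cross-entropy error: -(1/N) sum y_n x_n / (1 + exp(y_n w^T x_n))."""
    s = y * (X @ w)
    return -np.mean((y / (1 + np.exp(s)))[:, None] * X, axis=0)

def logistic_regression(X, y, eta=0.1, steps=2000):
    """Batch GD on the convex cross-entropy surface.
    X: (N, d+1) with the x0 = 1 column; y: (N,) in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * gradient(w, X, y)   # fixed learning rate; step size scales with the slope
    return w

rng = np.random.default_rng(5)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0]) + 0.3 * rng.normal(size=300))   # noisy linear target
w = logistic_regression(X, y)
print("final Ein:", cross_entropy(w, X, y))
```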
We have seen 3 linear models now:
In logistic regression we have a convex error surface, so we could use gradient descent and minimize the error. This is not the case with Neural Networks.
We had GD, which minimizes the in sample error
We need to evaluate the hypothesis at every point in D, so this is expensive. Each step is one epoch (epoch is when you have considered the full dataset at once)
This was “batch GD” 🔝
We can also have stochastic GD. Pick 1 example (xn, yn) at a time. Apply GD to e(h(xn), yn)
This works because the “average” direction in which we move is: En[-∇e(h(xn), yn)] = -∇Ein(w), the same as the batch gradient.
So at every step we move in this expected direction plus noise. This is randomized GD, aka stochastic GD.
- cheaper computation
- with the same expected movement
- randomization
- so you don’t get stuck in local minima always
- simple
- simplest possible optimization to GD
- Rule of thumb: η = 0.1 works
In the Netflix competition, we had user factor vectors uik for user i and movie factor vectors vjk for movie j. We also had a rating rij for user i and movie j.
We can define an error measure per observed rating as: e = (rij - Σk uik vjk)²
So, for each rating, SGD nudges the 2k parameters involved (the k user factors and the k movie factors) a little, moving the predicted rating towards the observed one (a sketch follows below).
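A sketch of that SGD update (the factor dimension, learning rate, and toy ratings are assumptions of mine):

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_movies, k=10, eta=0.05, epochs=50, seed=0):
    """SGD on the squared error (r_ij - u_i . v_j)^2 over the observed ratings.
    ratings: list of (user i, movie j, rating r) triples."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(n_users, k))      # start from small random factors
    V = 0.1 * rng.normal(size=(n_movies, k))
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):   # one rating at a time, in random order
            i, j, r = ratings[idx]
            ui, vj = U[i].copy(), V[j].copy()
            err = r - ui @ vj
            U[i] += eta * err * vj               # nudge the k user factors involved
            V[j] += eta * err * ui               # nudge the k movie factors involved
    return U, V

# toy usage: 3 users, 4 movies, a handful of observed ratings
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 3, 1.0)]
U, V = sgd_factorize(ratings, n_users=3, n_movies=4, k=2)
print("predicted rating (user 0, movie 0):", U[0] @ V[0])   # approaches 5.0 after training
```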
The building blocks are perceptrons; neural networks are combinations of them. The case where a single perceptron failed was the 4 points arranged in the XOR pattern.
But this can be solved by 2 perceptrons
We can combine them using OR/AND gates which are manifested by the weights on them
Note the step function just takes the sign to give +1/-1
Now, once we have OR and AND, we can create layers of these to get more complex function approximators, or logic circuits, etc.
We can get XOR with:
So, we have the 2 perceptrons h1 and h2; we form h1 AND (NOT h2) and (NOT h1) AND h2 with perceptron gates, and then OR the two (a sketch follows below)
So, the full multilayer perceptron that implemented the function that we wanted(where the single perceptron failed) is:
3 layers, feedforward architecture.
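A sketch of that construction with hard-threshold units on +/-1 signals (the first-layer weights are illustrative; the AND/OR thresholds follow the usual +/-1 logic-gate trick):

```python
import numpy as np

sign = lambda s: np.where(s >= 0, 1, -1)

def perceptron(w):
    """h(x) = sign(w0 + w . x) for a fixed weight vector (w[0] is the bias)."""
    return lambda x: sign(w[0] + np.dot(w[1:], x))

h1 = perceptron(np.array([-0.5, 1.0, 1.0]))   # two illustrative first-layer perceptrons
h2 = perceptron(np.array([ 0.5, 1.0, 1.0]))

def xor_unit(a, b):
    """XOR on +/-1 signals built from perceptron-style gates:
    (a AND NOT b) OR (NOT a AND b)."""
    c1 = sign(a - b - 1.5)      # a AND (NOT b)
    c2 = sign(-a + b - 1.5)     # (NOT a) AND b
    return sign(c1 + c2 + 1.5)  # OR

def f(x):                        # the full 3-layer feedforward network
    return xor_unit(h1(x), h2(x))

for a in (+1, -1):
    for b in (+1, -1):
        print(a, b, "->", xor_unit(a, b))   # +1 exactly when a and b disagree
```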
Now, you can get whatever surface you want to approximate:
2 red flags: generalization and optimization. For generalization, we know exactly what we are dealing with and can reason about it rationally. For optimization, what we can do is use soft thresholds throughout and only hard-threshold the final answer. Soft thresholds are better because they are smooth (differentiable), so we can use gradient descent readily.
We have this:
θ is any non-linearity that you want; we will use tanh, which goes from -1 to +1 (the logistic function is similar but goes from 0 to 1). Each of these units could even use a different non-linearity.
The input layer holds x; the layers are 1 ≤ l ≤ L, with l = L being the output layer.
The non-linearity θ can be: tanh (hyperbolic tangent)
This function behaves linearly for small signals and like a hard threshold for large ones.
The parameters of the neural network, the weights have elaborate indices.
d(l) is the dimension of the layer l inputs start with index 0 because each neuron has the mandatory input x0 = 1
So, 🔝 the value in the next layer is just the signal (computed from the weights+values of previous layer) and passed thru the nonlinear function θ.
So, we apply this from the input layer all the way to the output layer; the final value x1(L) is h(x).
All the weights w = {wij(l)} together determine h(x). The error on example (xn, yn), e(h(xn), yn), is therefore a function of the weights: e(w).
We need the gradient of the e(h(xn), yn)
To find ∂e(w)/∂wij(l) we could apply the chain rule directly, but doing that separately for every single weight is wasteful; we need a faster way of computing all of them.
We can use backpropagation algo to get the entire gradient in 1 shot
We can write, using the chain rule: ∂e(w)/∂wij(l) = ( ∂e(w)/∂sj(l) ) × ( ∂sj(l)/∂wij(l) )
The 2nd term is simple: it is just xi(l-1). The 1st term needs thought; call it δj(l). So the derivative is a product of the x from the layer below and the δ from the layer the weight feeds into - an attractive property.
We get δ at the output layer and then use that to compute the inner layer etc.
This is just a normal application of chain rules.
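A sketch of the forward and backward passes for one hidden layer with tanh units and squared error (the array shapes and the single-output assumption are mine):

```python
import numpy as np

def forward(x, W1, W2):
    """x includes x0 = 1; W1: (d0+1, d1), W2: (d1+1, 1)."""
    s1 = W1.T @ x                                   # signals entering the hidden layer
    x1 = np.concatenate(([1.0], np.tanh(s1)))       # hidden outputs, bias unit prepended
    s2 = W2.T @ x1                                  # signal entering the output layer
    return s1, x1, s2, np.tanh(s2)                  # the last item is h(x)

def backprop(x, y, W1, W2):
    """Gradients of e = (h(x) - y)^2 with respect to W1 and W2."""
    s1, x1, s2, out = forward(x, W1, W2)
    delta2 = 2 * (out - y) * (1 - np.tanh(s2) ** 2)            # delta at the output layer
    delta1 = (1 - np.tanh(s1) ** 2) * (W2[1:, 0] * delta2[0])  # propagate back through W2
    grad_W2 = np.outer(x1, delta2)    # de/dw_ij(2) = x_i(1) * delta_j(2)
    grad_W1 = np.outer(x, delta1)     # de/dw_ij(1) = x_i(0) * delta_j(1)
    return grad_W1, grad_W2

# one SGD step on a single example
rng = np.random.default_rng(6)
W1, W2 = 0.1 * rng.normal(size=(3, 4)), 0.1 * rng.normal(size=(5, 1))
x, y = np.array([1.0, 0.5, -0.2]), 1.0
g1, g2 = backprop(x, y, W1, W2)
W1, W2 = W1 - 0.1 * g1, W2 - 0.1 * g2     # the error on this example decreases
```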
The hidden layers are performing nonlinear transformation. But the transformations are learned, and we are already charging the generalization for that. The VC dimension is ≈ #weights
- We can combine perceptrons to get more complex boundaries
- The optimization is difficult, so we introduced neural networks that softened thresholds of the perceptrons.
- We used chain rule to get derivative of error wrt weights and so we can use SGD to minimize the error.
Let’s say we have a simple target function. We generate 5 data points with random noise in them.
Overfitting is when you go farther than you should have. If we try to fit the above ƒ (from those 5 noisy points) with a 4th order polynomial, we will overfit.
When training a neural network, if you plot Ein and Eout at each step and you get something like so:
Overfitting is when Ein goes down and at the same time, Eout is going up. So, we should stop at that moment and report that. This is called early stopping
Overfitting is “fitting the data more than warranted” The culprit is when we are fitting the noise.
Let’s have a 10th order target function. We generate 15 datapoints + noise
We also have a second target function which is a 50th order polynomial. For it we generate 15 noiseless datapoints lying on the curve.
We will have 2 models for each target
- 2nd order polynomial
- 10th order polynomial
For noisy low-order target
This is a blatant case of overfitting
For noiseless high-order target
So, here too, it is overfitting. Even when we are fitting a 10th order polynomial to a 50th order target function
The reason is that the second case also has “noise”, not the traditional noise, but noise. We need to match the hypothesis complexity to the data resources and not the target function complexity.
Impact of noise level and target complexity
y = f(x) + ε(x)
ε(x) is noise with variance σ² (its “energy”)
f(x) is a normalized Qth order polynomial (normalized so that the noise level is meaningful) - we’ll use Legendre polynomials. So, the things affecting overfitting are: the noise level σ², the target complexity Q, and the number of data points N.
We would like to understand the dependency b/w these on overfitting
We fit the polynomials using our 2 hypothesis - H2 which is a 2nd order polynomial and H10 which is a 10th order polynomial. And we’ll compare the out-of-sample errors. Overfit measure: Eout(g10) - Eout(g2) So, a positive value means the more complex guy is doing worse, so overfitting. Negative means that the large model is doing better, no overfitting
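A scaled-down sketch of this experiment (plain polynomial fits via numpy instead of the lecture's Legendre setup; the targets, trial counts, and noise model are my simplifications):

```python
import numpy as np
from numpy.polynomial import Polynomial

def overfit_measure(N, sigma, Qf=20, trials=200, seed=7):
    """Average Eout(g10) - Eout(g2) when fitting N noisy samples of a random
    degree-Qf polynomial target; positive values mean the 10th-order model overfits."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1, 1, 500)
    diffs = []
    for _ in range(trials):
        f = Polynomial(rng.normal(size=Qf + 1))          # a random target polynomial
        x = rng.uniform(-1, 1, N)
        y = f(x) + sigma * rng.normal(size=N)            # noisy samples
        e_out = [np.mean((Polynomial.fit(x, y, Q)(x_test) - f(x_test)) ** 2)
                 for Q in (2, 10)]
        diffs.append(e_out[1] - e_out[0])
    return np.mean(diffs)

print(overfit_measure(N=15, sigma=1.0))    # positive: blatant overfitting with few points
print(overfit_measure(N=120, sigma=1.0))   # with more data the gap shrinks or reverses
```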
When we run this for 10s of millions of examples, we get this:
- Impact of σ2
- N vs. σ2
(Red is overfitting.) So the larger σ², i.e. the more noise, the more N we need to beat overfitting.
For all the simulations 🔝, the target complexity is fixed at 20th order polynomial
This was okayish, no surprises. What about the other case where we were getting overfitting without noise?
Here we fix the noise level at 0.1, but change the target function complexity. We see roughly the same pattern: overfitting happens with higher complexity of ƒ, but with more N we can beat it. (It is not as pronounced as before, because this "noise" has a pattern that we can partly capture, whereas in the other case we are trying to learn genuinely random noise, which takes a lot of N.) Also, overfitting does not start until the order of the target polynomial goes past about 10: until then, even our bigger model can capture the target, so the target needs to be sufficiently complex before this effect kicks in. Thus, from all this, there appears to be another factor, apart from conventional noise, that affects overfitting.
The first case, with σ2 energy, is called "stochastic noise." The second case is noise which isn't random; it is deterministic, but we just cannot capture it, so it acts like noise to us. We'll call it "deterministic noise."
The part of ƒ that H cannot capture → f(x) − h*(x)
Main differences with stochastic noise:
- depends on H
- fixed for a given x (the difference b/w f(x) and h*(x) at that point)
They behave exactly the same as far as learning is concerned.
So, if we have only 10 points and the hypothesis is very complex, it will try to fit in the noise - either stochastic or deterministic - and you overfit
Earlier, when we did the Bias-Variance decomposition, we got the bias term and the variance term. Adding noise to the mix and repeating the analysis, we get:
The 3rd term is the stochastic noise. The second term, the bias, is the deterministic noise. The noise terms increase the variance (the 1st term)
Two cures:
- Regularization - putting the brakes
- Validation - checking the bottom line - learning not to rely on Ein when deciding when to stop, but getting an estimate of Eout instead
This will allow us to use the 4th order polynomial but improve dramatically:
The 1st cure to overfitting is Regularization, 2nd is validation
Regularization has a mathematical basis: function approximation is an ill-posed problem, because many functions fit the given values, so you impose smoothness constraints.
It also has heuristic basis.
Earlier, we studied how a linear model fails on fitting a sine curve with only 2 points.
We can apply regularization here and restrict the offsets and overly steep slopes. This means we sacrifice a little on the fit of the datapoints.
This shows that the variance is reduced.
Quantifying it:
So, with regularization, the total (bias + variance) is better than with the constant model. There is a small increase in the bias, and the g-bar is different as well. This is because we are handicapping the flexibility of the hypothesis, so we don't get the same g-bar as before. The slight increase in bias can be considered a side effect of the treatment.
The g-bar is the average hypothesis: what you get if you draw infinitely many datasets of 2 datapoints each and average all the various hypotheses you form.
Regularization allows you to choose a model “in-between” the constant and linear model - a sort of a restricted linear model
What regularization did we apply?
Let’s define a hypothesis set: HQ = polynomials of order Q
The non-linear transformation that produces this polynomial: z
You take the input x and evaluate these polynomials of it; this is the nonlinear transformation. It converts the input into features of all orders up to Q.
The hypothesis set HQ is simply all linear combinations of Legendre polynomials up to order Q.
We can use linear regression to get the solution This is simple, we know the drill:
- We apply the transformation on the input points
- We write Ein(w), in vector form
- We know the formula: wlin = (ZTZ)-1ZTy (a minimal sketch follows below)
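A minimal sketch of that drill, assuming a plain-powers transform in place of the Legendre one:

```python
import numpy as np

def poly_transform(x, Q):
    # plain powers 1, x, x^2, ..., x^Q; the lecture uses Legendre polynomials,
    # which numpy also provides (np.polynomial.legendre) if you want the real thing
    return np.vander(x, Q + 1, increasing=True)

def fit_unregularized(x, y, Q):
    Z = poly_transform(x, Q)
    return np.linalg.pinv(Z) @ y          # w_lin = (Z^T Z)^-1 Z^T y
```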
What if we constrain the weights?
- we can consider H2 as a constrained version of H10
- But this is a hard constraint, with wq = 0 for q > 2
A softer constraint would be:
Here, we are decreasing the number of possible hypotheses, so dVC decreases and our chances of generalization improve. So, the optimization problem changes to:
Pictorially, we have our old error surface:
The lowest error lies at the center of this ellipse. This is what is reported by linear regression without regularization. With the constraint, we have:
So, we now have to pick a point that is as close to wlin as possible while still respecting the constraint; it will lie on the circle. Also, if C is too large, we will have something like this:
Here, the solution will be wlin, as the regularization was too liberal.
Now, consider a non-optimal point:
Here, the gradient of Ein is orthogonal to the ellipse. Also, the normal to the circle is the w vector itself (the vector from the origin to the point w is orthogonal to the circle's circumference). At the optimum, the two must point in opposite directions, so we can write:
We can put proportionality constant as:
So, the last equation looks like the minimization of:
So, we moved from constrained optimization to ordinary unconstrained optimization. λ depends on C (among other things), and they move in opposite directions: the larger C (the looser the constraint), the smaller the corresponding λ.
Augmented error We are now minimizing the augmented error - that is, error with the regularization term attached to it
These 🔝 two optimizations are fundamentally the same.
The bottom formulation lends itself to VC analysis, as we are restricting our hypothesis set explicitly, we have a lower dVC and better chances at generalization.
The solution is simple:
We have: Eaug(w) = Ein(w) + (λ/N) wTw, with Ein(w) = (1/N)(Zw − y)T(Zw − y)
Setting the gradient ∇Eaug(w) = 0 gives: ZT(Zw − y) + λw = 0
So, the solution with regularization is: wreg = (ZTZ + λI)-1ZTy (compare with wlin = (ZTZ)-1ZTy).
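Continuing the sketch above (same assumed `poly_transform`, numpy already imported), the regularized fit is one line:

```python
def fit_regularized(x, y, Q, lam):
    Z = poly_transform(x, Q)
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)   # (Z^T Z + lambda I)^-1 Z^T y
```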
Applying regularization on the previous problem:
We see that regularization increases bias and reduces variance; with too much regularization we go from overfitting to underfitting.
There is an optimal value of λ:
This regularization we studied, wTw ≤ C is called weight decay.
Why is this called weight decay?
We had in GD:
But now, we also have the gradient of the (λ/N)wTw term.
We see that λ plays the role of decaying the weights at each step. This applies to neural networks as well; wTw just becomes the sum of the squares of all the weights in the network.
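A sketch of a single weight-decay step for a toy linear model with squared error (the helper name and model are our own):

```python
import numpy as np

def sgd_step_weight_decay(w, x, y, eta, lam, N):
    """One SGD step on the augmented error e(w) + (lam/N) * w^T w,
    for a linear model with squared error e = (w.x - y)^2."""
    grad_e = 2.0 * (w @ x - y) * x                          # gradient of the plain error
    return (1.0 - 2.0 * eta * lam / N) * w - eta * grad_e   # the first factor "decays" the weights
```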
Apart from constraining the total weight magnitude uniformly, we can also decide that some weights matter more than others, using an importance factor γq: if γq is low, that weight need not be reduced much; if it is big, there is more emphasis on reducing that weight.
Example:
The algorithm will try to find a lower order fit
In neural networks, we give different layers different γ’s
One famous family of regularizers is the Tikhonov regularizer.
In the above form:
We apply the regularization only to the diagonal terms, i.e. w1², w2², etc. In the Tikhonov regularizer, written in matrix form, you can have off-diagonal terms as well and get all sorts of regularization effects - weight decay, low-order fit, high-order fit, etc.
Practical rules:
- stochastic noise is high frequency
- deterministic noise is also nonsmooth
- Constrain learning towards smoother hypotheses (because the noise is not smooth, so we harm the noise by going smooth)
This works because it stops the hypothesis from trying to fit the high-frequency noise (stochastic or deterministic). Generally, small weights correspond to smoother hypotheses.
The holy grail of machine learning is finding Ein which is a great proxy for Eout. If we get that, we can minimize Ein and go home. But there is this slack, this bound etc. Eaug is better at being a proxy for Eout compared to Ein.
In neural networks, you can do this ⬇️ to eliminate small weights and leave the large ones alone.
So, this means that we will be near the logical part (+/-1) of tanh and not the linear part. This is good because the neural network can then implement OR/AND gates and combine them to approximate any function.
Early stopping is also a form of regularization. Choosing the optimal λ requires validation.
When there is no noise, we don't need any λ; we need more λ when there is more noise.
This is for stochastic noise 🔝
We get the exact same thing for deterministic noise
This should seal the correspondence in your mind that, as far as overfitting and its cures are concerned, both kinds of noise behave exactly the same way.
We got the Eaug to minimize now:
The general form of regularization was:
The choice of λ is determined using validation.
We have, Eout(h) = Ein(h) + overfit penalty
Both regularization and validation deal with this “overfit penalty” 🔝
Regularization gets a handle on the overfit penalty by estimating a proxy for it and minimizing Eaug; validation tries to estimate Eout directly.
On out-of-sample points (x,y), we have e(h(x), y) as:
- squared error = (h(x)-y)2
- binary error = ⟦h(x) ≠ y⟧ (1 if they differ, 0 otherwise)
If we take the expected value of this error w.r.t. x, we get Eout: Eout(h) = Ex[e(h(x), y)]
There is a lot of variance because we have only 1 point here var[e(h(x), y)] = σ2
To reduce the variance, we move from a single point to a validation set of K points.
The error on the validation set is: Eval(h) = (1/K) Σ e(h(xk), yk), k → 1 to K
These K points 🔝 weren't used in training, and there is more than one of them, so we hope this is a good estimate of Eout.
So, Eout(h) = E[Eval(h)] which is:
For the variance of Eval(h): using the same analysis as earlier, the covariance (cross) terms go to zero since the points were chosen independently, and we get var[Eval(h)] = σ2/K.
So, we get: Eval(h) = Eout(h) ± O(1/√K)
Since K is taken out of N, we cannot increase it indefinitely.
One idea: K is put back into N
Divide D into 2 parts, N-K to train and K to validate
Train on N−K points to get g−, evaluate Eval(g−), then train on the full N points to get g and report that. We don't have a direct estimate of Eout(g), but we know it will be better than Eout(g−) due to the behavior of the learning curves, so Eval(g−) errs on the pessimistic side.
This is good because we get to train on the whole set. But bad because the validation error we are reporting is the error on a different hypothesis than the one we finally use. So, our error estimate is bad.
Rule of thumb: K = 20% of N, N/5
When we use the validation error for early stopping, we introduce an optimistic bias into our estimate of Eout. This is because there is a certain variance in Eval, and we ignore Eval when it happens to be high but take it when it happens to be low.
The main purpose of validation set is to help with model selection. We have M models to choose from.
We will use Dtrain to train and we'll get gm− for each model. We'll evaluate each using Dval and pick the best.
When you pick the best 🔝, there's an optimistic bias.
The output is gm* which is the best model trained on the complete D
Eval(gm*−) is a biased estimate of Eout(gm*−)
The bias diminishes with growing K
- The curves are going up because we have fewer datapoints left for training
- the 2 curves are getting closer together because the Eval estimate of Eout is getting better and better
We can quantify the optimistic bias. If you think about it, selecting the best hypothesis from the M options (by picking the model with the lowest Eval) is just regular training: Dval is used for "training" on the finalist hypotheses.
Hval = {g1−, g2−, …, gM−}
Using Hoeffding and VC!
So, we have an added term of ln M. Hence, if we use validation to select from 20 candidate models using 100 points, the bias is larger than if we are selecting from 2 (only by a log factor, but still).
Data contamination: We have Ein, Etest, Eval
If we use the data to make choices, we are contaminating it and reducing its ability to estimate real performance.
Training set: totally contaminated. Validation set: somewhat contaminated. Test set: totally clean.
We have the following chain of reasoning.
Eout(g) → this is what we want to estimate; g is the final hypothesis, trained on the full N datapoints. Eout(g−) → this is our proxy for Eout(g); g− is trained on only N−K points. Eval(g−) → this is our estimate of Eout(g−), measured on the K validation points (for the model with the lowest Eval of all M); this one has an optimistic bias in it.
We want small K so that g− is fairly close to g. We want large K so that Eval(g−) is a good proxy for Eout(g−) (due to reduced variance)
We will use N−1 points for training. This means g− will be very close to g. But Eval(g−) is then not a good estimate of Eout(g−) due to the huge variance of a single point. So, let's see what we can do.
We have Dn = D minus the nth point.
The final hypothesis learned from Dn is gn−. We have: en = Eval(gn−) = e(gn−(xn), yn). On its own, en is a terrible (high-variance) estimate of Eout.
If we do this once for each of the N points, we get N such estimates; their average has much lower variance - we'll call it ECV = (1/N) Σ en, n → 1 to N.
So, we have the best of both worlds, large K and small K
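A sketch of leave-one-out cross validation; `fit` and `predict` are placeholders for whatever model is being evaluated:

```python
import numpy as np

def loo_cv_error(X, y, fit, predict):
    """E_CV by leave-one-out. fit(X, y) -> model and predict(model, x) -> value
    are whatever learning model is being evaluated."""
    N = len(y)
    errs = []
    for n in range(N):
        mask = np.arange(N) != n
        g_minus = fit(X[mask], y[mask])                      # g_n^- trained without point n
        errs.append((predict(g_minus, X[n]) - y[n]) ** 2)    # e_n on the left-out point
    return np.mean(errs)                                     # E_CV = average of the e_n
```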
Demonstration: Let’s use CV to choose between a linear and constant model for this dataset
We get ECV:
Same for constant model:
It turns out that ECV for the constant model is lower, so we'll choose that. That is the right choice here: the datapoints were indeed generated by a constant target function plus noise. Using ECV gives you the right answer on average.
Another example: We can use CV to find out what order of polynomial to use for digit classification
We'll use 500 points for training and the rest for testing. The nonlinear transformation will be polynomial features up to 5th order.
We'll have 500 training sessions, each with 499 points. We get this curve:
We note how ECV is a good proxy for Eout. CV tells us to stop at around 5-7; we can stop at 6. Here's the plot:
The CV-chosen result is smoother and has a better Eout.
There is a problem with "leave one out": you'll have N training sessions (with N−1 points each). So, we generally leave K points out at a time.
This works nicely if N is big: the number of training sessions goes down, and we don't take a huge hit on the number of training points (N−K is still large).
Rule of thumb: 10-fold Cross validation works nicely in practice.
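The same sketch, 10-fold style (again with placeholder `fit`/`predict`):

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, folds=10, seed=0):
    """Leave N/folds points out each time, so only `folds` training sessions."""
    N = len(y)
    idx = np.random.default_rng(seed).permutation(N)
    errs = []
    for left_out in np.array_split(idx, folds):
        mask = np.ones(N, dtype=bool)
        mask[left_out] = False
        g_minus = fit(X[mask], y[mask])                      # trained on N - N/folds points
        errs.append(np.mean((predict(g_minus, X[left_out]) - y[left_out]) ** 2))
    return np.mean(errs)
```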
The whole lecture in a slide:
SVM - Classification King
For linearly separable data, we can have different separating lines. The ones with the largest margin are the best.
We can see that as the margin increases, the separating line gets better (even though Ein is the same for all of them), because a bigger margin tolerates noise better.
Can we solve for the w that maximizes the margin? Recall dichotomies and the growth function: the perceptron can shatter 3 points completely, so we have 8 dichotomies:
Each dichotomy’s max margin:
If we accept dichotomies with a mandatory minimum margin, we reject some dichotomies, so the growth function decreases, dVC decreases, we have better generalization chances. This is just like regularization.
Let xn be the nearest data point to the line/plane/hyperplane: wTx=0 (this is the separating surface) The distance b/w the separating plane and the nearest point is the margin
2 preliminary technicalities:
- normalize w
- we'll want the signal at the nearest point to be 1, so we scale w accordingly: |wTxn| = 1
- We'll pull out w0, the bias term, and call it b. So, the vector w is now only [w1, …, wd]
- So, the equation of the plane becomes: wTx+b = 0
Now, we can compute the distance between the nearest point xn and the plane wTx+b = 0 where |wTxn + b| = 1 So, we have:
Vector w is ⊥ to the plane in the X space (e.g., for the line y=x, the vector [1, -1] is ⊥ to it). Proof: take 2 points x' and x" on the plane. Since they lie on the plane, we have:
wTx' + b = 0 and wTx" + b = 0
So, we have the difference: wT(x’-x”) = 0
This 🔝 means that w vector is ⊥ to (x’-x”) vector Since x’-x” lies on the plane, w is ⊥ to the plane
Now the distance b/w xn and plane is:
- Take any point x on the plane
- Projection of xn - x on w is the required distance of xn from plane
So, the distance is w·(xn−x) divided by the magnitude of w. Also, w·(xn−x) is just wT(xn−x). We take the absolute value, so:
Or:
We can add and subtract b to get:
The 2nd group of terms is 0 since x lies on the plane, and the 1st group was scaled to be 1. So, we have the distance/margin as 1/|w|:
Hence, the resulting optimization problem:
We are maximizing the margin, subject to the scaling of w such that it makes the nearest point be at a distance of 1
Also, instead of maximizing 1/|w|, we equivalently minimize:
So, the problem becomes:
We’ll use quadratic programming to solve this
This is related to the optimization problem we had earlier with Regularization:
We'll formulate the Lagrangian by putting the constraints into the objective (moving the 1 to the other side to get yn(wTxn + b) − 1 ≥ 0 and adding a multiplier αn for each):
We are minimizing this 🔝 w.r.t. w, b and maximizing w.r.t. αn ≥ 0. We take the gradient of the Lagrangian w.r.t. w and b and set it to 0:
So, we get:
Substituting this back into the original Lagrangian:
So, we have:
Which is the same as:
Expanding the formula to see what matrices are passed to QP:
Along with the conditions:
This is passed to the QP package, which gives us the α's back. The objective is a convex function, which quadratic programming handles well.
So, we have our α's: α = [α1, …, αN]. We can get w by: w = Σ αn yn xn, n → 1 to N
The α vector is mostly zeros: for the interior points the constraint holds with slack, so their αn = 0.
This is similar to the case in regularization where the C was so large that we actually could get to wlin and λ was 0 there. For the cases where we actually had to compromise, we had a positive λ, so is the case here with our α-s.
The points xn with αn > 0 are the support vectors. They are the circled ones in the diagram:
The support vectors achieve the margin exactly: yn(wTxn + b) = 1
Also, since the α's for non-support vectors are 0, we can restrict the sum to the support vectors: w = Σ αn yn xn, over SVs only
Since the expansion of w has one parameter per support vector, few support vectors means effectively few parameters, hence a lower dVC and better generalization. So, getting the best (maximum-margin) separator using SVMs has a generalization dividend as well. Also, b is simple to get: solve yn(wTxn + b) = 1 for any support vector.
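A sketch of the whole chain, assuming cvxopt as the QP package (any QP solver would do; the helper name and the tiny ridge added to Q for numerical stability are our own choices):

```python
import numpy as np
from cvxopt import matrix, solvers   # cvxopt is just one common QP choice

def hard_margin_svm(X, y):
    """Solve the dual QP for alpha, then recover w and b.
    X is (N, d); y is (N,) with entries +1/-1."""
    N = X.shape[0]
    yx = y[:, None] * X
    Q = yx @ yx.T + 1e-10 * np.eye(N)                  # Q[n,m] = yn ym xn.xm (+ tiny ridge)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(Q), matrix(-np.ones(N)),   # minimize 1/2 a'Qa - 1'a ...
                     matrix(-np.eye(N)), matrix(np.zeros(N)),                 # ... s.t. a >= 0
                     matrix(y.reshape(1, -1).astype(float)), matrix(0.0))     # and y'a = 0
    alpha = np.ravel(sol['x'])
    w = (alpha * y) @ X                                # w = sum_n alpha_n yn xn
    sv = np.argmax(alpha)                              # any support vector (largest alpha)
    b = y[sv] - w @ X[sv]                              # from yn (w.xn + b) = 1
    return w, b, alpha
```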
Currently, we are talking only about linearly separable case, where all the points are strictly linearly separable. If they aren’t linearly separable, we can apply a non-linear transformation and get them to be linearly separable in that higher dimension space.
Recall, we had:
Here, we are passing the inner products xnTxm to the QP. Each is just an inner product of two d-dimensional vectors, where d is the dimensionality of the x space. So, if we apply a nonlinear transformation to a million-dimensional space, we'll have to take inner products of two million-dimensional vectors, but the optimization problem itself won't get any harder.
The w-s will belong to the z space. The SVs live in the Z space. In the X space, the Z-space hyperplane with the largest margin looks like this say, with the “pre-images” of support vectors highlighted (Support Vectors are defined in the Z-space):
The margins are in the Z space
Here, we have only 4 support vectors, so, the w is only defined by 4 parameters in the Z space. This after going to a million dimensional space! So, good generalization.
Generalization result: E[Eout] ≤ E[# of support vectors] / (N − 1)
Strictly, we need to run several runs and take the expected values on both sides:
So, SVMs are awesome because you get to go to high dimensions without the optimization getting harder and without paying in terms of generalization.
Kernel methods take this one step further, you won’t need to pay for getting the inner product as well. This will enable you to go to infinite dimensional spaces.
SVMs are classifiers with maximum margin. They allow us to go to higher dimensional spaces without paying for generalization and complexity.
This means, we get a complex final hypothesis (g) but our hypothesis set is “simple”
The only thing we need from the Z space is the inner product. If we can get that somehow, we can use the Z space without ever constructing it.
We need z in one more place as well: our final hypothesis. We have: g(x) = sign(wTz + b)
But we know w is:
So, this reduces to inner product as well.
What about the b? That is solvable by taking any SV and using this equation:
So, here as well, we need just the inner product.
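A sketch of the kernelized final hypothesis; `K` is whatever kernel function is being used, and the names are ours:

```python
import numpy as np

def svm_predict(alphas, y, X, b, K, x_new):
    """g(x) = sign( sum over SVs of alpha_n * yn * K(xn, x) + b ):
    only kernel evaluations, never an explicit z vector."""
    s = sum(a * yn * K(xn, x_new)
            for a, yn, xn in zip(alphas, y, X) if a > 1e-8)   # non-SVs contribute nothing
    return np.sign(s + b)
```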
This raises a very interesting possibility. If we can get the inner product of z, we can go to that space – even if we do not know what the z space is – even if z space is infinite dimensional.
The inner product that we are talking about, zTz', let's assume it is given to us by a function. For any 2 points x and x' ∈ X:
We have: zTz' = K(x, x')
where K is the kernel. The kernel will correspond to some Z space.
Example:
Let's take a 2nd order transformation φ. So, the z vector is: z = φ(x) = [1, x1, x2, x1², x2², x1x2]
Hence, K will be:
The trick: is it possible to compute this kernel (evaluate K) without transforming x and x’?
Let's consider K to be (1 + xTx')². This is a function of x and x', and it gives us an inner product in some Z space (we don't yet know which).
K(x, x') = (1 + xTx')² = (1 + x1x1' + x2x2')²
Squaring, we get:
This would look like the inner product of a 2nd order nonlinear transformation, if not for the 2s. It actually still is one, but the x vector is transformed not to (1, x1, x2, x1², x2², x1x2) but to: (1, x1², x2², √2·x1, √2·x2, √2·x1x2)
Hence, K(x, x') = (1 + xTx')² does compute an inner product in a 2nd order Z space, without converting to those coordinates first. If we replace the power 2 with power 100, we get the inner product for a 100th order polynomial transform - without ever having to go to that space.
This is the polynomial kernel
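A small check of this, with our own toy numbers: the kernel evaluated in the X space matches the explicit inner product in the (√2-scaled) Z space.

```python
import numpy as np

def poly2_kernel(x, xp):
    return (1.0 + x @ xp) ** 2            # evaluated entirely in the X space

def phi2(x):
    # the Z-space coordinates this kernel implicitly corresponds to (2-d input)
    x1, x2 = x
    return np.array([1.0, x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, np.sqrt(2) * x1 * x2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose(poly2_kernel(x, xp), phi2(x) @ phi2(xp))   # same number, no trip to Z space
```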