Sno | Topics |
---|---|
1 | The Learning Problem |
2 | Is Learning Feasible? |
3 | The Linear Model I |
4 | Error and Noise |
5 | Training versus Testing |
6 | Theory of Generalization |
7 | The VC Dimension |
8 | Bias-Variance Tradeoff |
9 | The Linear Model II |
10 | Neural Networks |
11 | Overfitting |
12 | Regularization |
13 | Validation |
14 | Support Vector Machines |
15 | Kernel Methods |
16 | Radial Basis Functions |
17 | Three Learning Principles |
18 | Epilogue |
The storyline:
- What is learning? - 1-4
- Can we learn? - 5-8
- How to do it? - 9-10
- How to do it well? - 11-16
- The philosophy - 17-18
Predicting how a viewer will rate a movie
Netflix offered a $1 million prize for a 10% improvement in its rating predictions.
3 conditions that make ML the right tool:
- a pattern exists
- if no pattern, ML cannot help
- we cannot pin it down mathematically
- we have data on it
We can describe each viewer with a vector of features. We can also create a same vector for the movie. When we take their inner product, we get a measure of how likely the user is to like the movie.
This approach is a problem because you have to construct those vectors by hand - watch the movie, interview the user, etc.
ML reverses the process. It starts from the ratings and then comes up with the vectors for the movies and users (starting from arbitrary initial vectors for both).
Metaphor: credit approval. Banks have historical data about customers - age, gender, annual salary, etc.
Formalization:
- Input: Χ
- customer vector
- Output: y
- to extend or deny credit
- Target function: ƒ
- ƒ: Χ → y
- this function has a binary co-domain (y can only be 1 or 0)
- ƒ is the target function, that is what we have to find
- Data: (x1, y1), (x2, y2), …, (xN, yN)
- this is the historical data
We use the “Data” to get an approximation of the “target function” called “hypothesis”
- Hypothesis: g: Χ → y
So, we use the training examples to find a hypothesis g that approximates the target function; the learning algorithm selects g from the hypothesis set H. The hypothesis set can be discrete or continuous, finite or infinite. In general, the hypothesis set is continuous and infinite (very infinite!) - but we will still be able to learn.
With theory, we will be able to put a number on the sophistication of the hypothesis set.
In the learning diagram, the red components (the hypothesis set and the learning algorithm) are the ones we get to choose.
Why do we need the hypothesis set? Why not let the learning algorithm select from universal set?
- this will help us formalize learning, we will see this later
So, the components of learning are:
- The hypothesis set:
- H = {h} (the set of candidate hypotheses), g ∈ H
- The learning algorithm
- Together, they form the learning model
Input is: x = (x1, …, xd)
Approve credit if:
Σ wixi > threshold, i → 1 to d
else deny
This linear formula h ∈ H can be written as:
h(x) = sign ( Σ wixi - threshold ), i → 1 to d
If h is +ve, we approve credit else we deny credit
We see that h is a function of w and threshold
If the data is linearly separable, the perceptron learning algorithm can start from any line and converge to a separating one.
We can rename -threshold to w0 and call it the bias - we add an artificial coordinate x0 whose value is always 1. Note: w0 is not 1; it is a free parameter that we learn. x0 is the one fixed at 1. Now the formula becomes:
h(x) = sign ( Σ wixi ), i → 0 to d
Vector form:
h(x) = sign (wTx)
w is a column vector, or [w0, …, wd]T and vector x is [x0, …, xd]
The perceptron learning algorithm takes a misclassified point and updates the weights such that they behave better on that point i.e. w ← w + ynxn
Consider these 2 cases of a point being misclassified. In the first case, the inner product wTx is negative but y is positive; the update rule moves w towards x (it adds x to w), so the inner product increases and the prediction moves towards positive.
In the second case, the inner product is positive but y is negative; the update rule moves w away from x (it adds -x to w), so the inner product decreases towards negative.
The catch is that fixing one point can break points that were previously classified correctly. Still, if you keep repeating this process and the data is linearly separable, the algorithm is guaranteed to end up classifying all points correctly (a sketch of the algorithm follows below).
(If it is not linearly separable, you can map it to a space where they are linearly separable)
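A minimal sketch of PLA in Python (the toy dataset and the zero initialization are illustrative, not from the lecture):

```python
import numpy as np

def pla(X, y, max_iters=10_000):
    """Perceptron learning algorithm.
    X: (N, d) inputs; y: (N,) labels in {-1, +1}."""
    Xb = np.hstack([np.ones((len(X), 1)), X])    # prepend the artificial coordinate x0 = 1
    w = np.zeros(Xb.shape[1])                    # start from any w (here: all zeros)
    for _ in range(max_iters):
        misclassified = np.where(np.sign(Xb @ w) != y)[0]
        if len(misclassified) == 0:              # every point classified correctly -> done
            break
        n = np.random.choice(misclassified)      # pick any misclassified point
        w = w + y[n] * Xb[n]                     # the update rule: w <- w + y_n x_n
    return w

# toy linearly separable data: true boundary is x1 + x2 = 0
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.sign(X[:, 0] + X[:, 1])
print(pla(X, y))
```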
The basic premise of learning
“using a set of observations to uncover an underlying process”
- Supervised Learning
- Unsupervised learning
- we only get the inputs (no labels), and can still learn a higher-level representation of the input data
- Reinforcement learning
Consider this problem, you are given 6 training examples which are labeled with the correct output For the new one, what is the label?
It is impossible to predict correctly - this is because the target function can be anything!
- It can be -1
- if you take the function to be -1 if top left square is black
- It can be +1
- if you take the function to be +1 if pattern is symmetric
We have this problem in machine learning as well. But it does not mean that learning is impossible - this is addressed in the next lecture.
- Learning is used when
- there is a pattern, we cannot write mathematical formula for it, we have data
- Notation:
- we don’t know the target function y = ƒ(x)
- we have the data set: (x1, y1), (x2, y2), …, (xN, yN)
- we have a learning algorithm that finds g from the hypothesis set such that g ≈ ƒ
- we were stuck at the puzzle, where the random function could be anything, how are we to learn?
This lecture will address this question
Consider an experiment: A bin has red and green marbles.
We need to find the probability of picking a red marble, let’s call it μ
P[red] = μ
So, P[green] = 1 - μ
We pick N marbles independently (recall N is for # of data points) so, sample: GRGGGGRRGG
Let the fraction of R in sample: ν
Does ν (sample frequency) say anything about μ (the actual frequency in bin)? The short answer is not on the face of it, but it does give us some bounds.
ν is likely to be close to μ
In a big sample, large N, ν is close to μ (within ε)
Formally:
P[|ν - μ| > ε] ≤ 2e^(-2ε²N)
This is Hoeffding’s inequality. This line of reasoning will eventually lead us to the VC dimension.
Hoeffding’s inequality in words:
- ν ≈ μ is “probably approximately correct” (PAC)
- to tighten the tolerance ε, the required N grows like 1/ε²; for a fixed ε the bound falls exponentially in N
This is the inequality you use when you want to know how closely a sample represents the actual truth. E.g., if you are checking for nulls in a database attribute, how many rows should you sample to be 90% confident in your estimate?
This formula is valid for any N and ε, and doesn’t depend on μ. There is a tradeoff though: smaller ε requires larger N.
Also note the logical direction: what the inequality states is ν ≈ μ (the sample follows the bin), but we use it as μ ≈ ν (inferring the bin from the sample).
How is the bin related to learning?
- Bin:
- the unknown is a number μ
- Learning:
- the unknown is a function ƒ: X → y
We can think of the bin as the input space. Each marble is a point x ∈ X. So all possible credit applicants are represented in the bin.
Now, let’s say we have a hypothesis. For every applicant our hypothesis gets right, we mark a green marble in the bin. What we want is for the fraction we observe in our sample (ν) to tell us something about the fraction over the entire input space (μ).
Reiterating:
- green if
- h(x) = f(x)
- red if:
- h(x) ≠ f(x)
Here, we don’t have all the points in the input space to check; we only have a sample. The sample comes from a probability distribution P on X, which favours some points of X over others - this distribution generates the dataset. We are not restricting P; we don’t even need to know it, because Hoeffding’s inequality does not depend on the probability distribution (or on μ).
We aren’t done yet, what we have discussed is, for this h, ν ≈ μ within some ε
However, this is just verification of h, not learning. Currently we chose some h and verified that it makes sense. We don’t want to pick the h ourselves, we want the algorithm to do it for us - we need to choose h from H
This is simple, we already have a probability distribution that gives us some data points from the input space(bin). We can test our h on each and choose the h that gives us the maximum right results from them and invoke Hoeffding’s to get our bound.
Notation:
- both μ and ν depend on h
- ν is “in sample”, called Ein
- μ is “out of sample”, called Eout
- Ein and Eout are actually Ein(h) and Eout(h) (functions of h) because they depend on h
Hoeffding’s becomes
P[|Ein(h) - Eout(h)| > ε] ≤ 2e^(-2ε²N)
We still have a problem!
We cannot just have multiple h and apply Hoeffding’s to them. Why? Consider this:
Take a coin and flip it 10 times. If we get 10 heads and then choose the hypothesis “always heads”, ν is 1, but that does not mean μ is also 1.
The probability of 10 heads in 10 tosses is 1/2¹⁰.
If we toss 1000 fair coins 10 times each, the probability that some coin gets 10 heads is:
1 - P[no coin gets 10 heads]
= 1 - P[a particular coin doesn’t get 10 heads]^1000
= 1 - (1 - 1/2¹⁰)^1000
≈ 0.63
So it is more likely than not that some coin shows 10 straight heads - and those 10 heads say nothing about the real probability of getting heads. We cannot scan for the h (or the sample) that happens to look perfect and then quote Hoeffding’s for it. Hoeffding’s applies to each hypothesis individually; for each one there is a small probability of its sample being badly off in some way, and if those “off” events hardly overlap, their probabilities add up and we end up with a bad system (a simulation of the coin experiment follows below).
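A quick simulation of the 1000-coin experiment (the run count is arbitrary, just an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
runs, hits = 2000, 0
for _ in range(runs):
    flips = rng.integers(0, 2, size=(1000, 10))   # 1000 fair coins, 10 flips each
    hits += (flips.sum(axis=1) == 10).any()       # did some coin come up all heads?
print("simulated P[some coin gets 10 heads]:", hits / runs)
print("analytic value:", 1 - (1 - 2**-10) ** 1000)   # ~0.62
```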
We need to find a way to deal with multiple h/bins.
An easy solution:
- recall we had: P[|Ein(h) - Eout(h)| > ε] ≤ 2e^(-2ε²N)
- we wanted to make a statement about Eout based on Ein
What we want now is:
- P[|Ein(g) - Eout(g)| > ε] ≤ ??
- so, we want to choose the best hypothesis g and want a bound for that choice
- by plain logic (not using Hoeffding’s or anything), since g ∈ H, we have:
- P[|Ein(g) - Eout(g)| > ε] ≤ P[ |Ein(h1) - Eout(h1)| > ε or |Ein(h2) - Eout(h2)| > ε or … or |Ein(hM) - Eout(hM)| > ε ]
- where M is the number of h ∈ H or the cardinality of H
- this is valid because g ∈ H, g is one of the h-s
- This is the union bound, which assumes no overlap, this is the worst case, it cannot get worse than this
- using Hoeffding’s we get:
- P[|Ein(g) - Eout(g)| > ε] ≤ Σ P[|Ein(hm) - Eout(hm)| > ε], m → 1 to M
- P[|Ein(g) - Eout(g)| > ε] ≤ Σ 2e^(-2ε²N), m → 1 to M
But the problem is:
- P[|Ein(g) - Eout(g)| > ε] ≤ 2Me^(-2ε²N)
And M is the cardinality of H, which is generally ∞, so all we get is P[something] ≤ ∞ - a vacuous bound. So the more sophisticated the model you use, the more loosely Ein tracks Eout.
- We ended with a problem, the loose tracking of Eout(g) by Ein(g)
- Since g has to be one of h1, …, hM we conclude that: (union bound)
- if | Ein(g) - Eout(g) | > ε then:
- | Ein(h1) - Eout(h1) | > ε or
- …
- | Ein(hM) - Eout(hM) | > ε
- This gives us an added M factor
- The union bound generally works for other things because the events are disjoint. E.g., on a die roll, P[5 or 4] = P[5] + P[4].
We need a tighter bound on the tracking of Eout(g) by Ein(g). The union bound ignores any overlap between the bad events; it is the worst case, appropriate when the events are essentially disjoint (as with the independent coins). When the hypotheses are correlated and the events overlap, the true probability is much smaller.
We have established the principle that through learning you can generalize. We will later establish that even when the cardinality of H is infinite, we can still generalize - that is the theory of generalization.
- We’ll work with the MNIST like dataset. 16x16 grey level pixels.
This is the raw input. 16x16 - 256 pixels
- ‘raw’ input x = (x0, x1, …, x256)
Recall, we added the mandatory x0 to make the formula better
Our parameters: (w0, w1, …, w256) - 257 parameters, one per pixel plus the bias.
We can reduce the parameters, i.e. we can extract features:
- intensity and symmetry x = (x0, x1, x2)
- we have thrown away mostly irrelevant information, but we may also have lost some useful information
Plotting just 5 and 1:
- we’ll generalize perceptron to linearly non separable case
- What does PLA do?
- at each iteration, it picks one misclassified point and corrects the weights for it
- the data is not linearly separable, so it will not terminate
- we see that the error jumps around a lot
- also note that Eout is being tracked nicely by Ein
Final perceptron boundary (after stopping at 1000 iterations):
- we can make a modification - the “pocket algorithm”
- run PLA for a fixed number of iterations and keep (in your pocket) the weights that had the lowest Ein so far (a sketch follows below)
- The errors now look like this:
We get this boundary:
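A sketch of the pocket modification (it assumes X already contains the x0 = 1 column; the helper names are mine):

```python
import numpy as np

def pocket(X, y, max_iters=1000, rng=np.random.default_rng(0)):
    """PLA with a pocket: keep the weights with the lowest Ein seen so far.
    X: (N, d+1) including the x0 = 1 column; y: (N,) in {-1, +1}."""
    def e_in(w):
        return np.mean(np.sign(X @ w) != y)       # fraction of misclassified points

    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), e_in(w)
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w                              # separable after all
        n = rng.choice(misclassified)
        w = w + y[n] * X[n]                       # ordinary PLA update
        if e_in(w) < best_err:                    # pocket the weights only if Ein improved
            best_w, best_err = w.copy(), e_in(w)
    return best_w
```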
- we’ll generalize the target function from being a binary function (a classifier) to a real valued function
Let’s discuss the credit problem again
Linear regression input:
- let’s say we have d input features:
- annual salary, years in residence, years in job, current debt etc
Linear regression output: h(x) = Σ wixi = wTx
- summation over i → 0 to d
Linear regression dataset:
- (x1, y1), …, (xN, yN)
- yn ∈ R, is the credit line for customer xn
Error:
- we can use the squared error: (h(x) - f(x))²
- this squared error has good analytical properties - helps with differentiation etc
When we plot linear regression, we never plot x0 because that is always 1.
The “line” or “hyperplane” is one dimension short of the space you are working with. E.g., with the mandatory x0, one feature, and the output, we have 2 dimensions, so the fit is a 1D line.
The red is the in sample error Ein
We had: Ein(w) = (1/N) Σ (wTxn - yn)², n → 1 to N
Which can be re-written in matrix form as: Ein(w) = (1/N) ‖Xw - y‖²
Minimizing Ein:
We have only w as the variable. To find the minimum of Ein(w) we take its gradient and set it equal to 0 (a vector of 0s); this gives w = (XTX)⁻¹XTy = X†y.
X† (“X-dagger”) is called the pseudo-inverse because X is not a square matrix - it is a very tall matrix - but X† still acts as an inverse: if you multiply X† with X, you get the identity matrix.
*actually d+1 and not d, because we also have constant x0 ⬇️
X is N×d and XT is d×N, so XTX is (d×N)·(N×d) = d×d. Inverting a d×d matrix is cheap (the number of features is usually small), so we have: (XTX)⁻¹XT = (d×d)·(d×N) = d×N → X†. Then X†y = (d×N)·(N×1) = d×1 → our weight vector w.
y is the target vector X is the input data matrix
This is “one step learning”
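The one-step solution in code (a sketch; the synthetic data is just a sanity check of mine):

```python
import numpy as np

def linear_regression(X, y):
    """One-step learning: w = X_dagger @ y, where X_dagger = (X^T X)^{-1} X^T.
    X: (N, d+1) data matrix including the x0 = 1 column; y: (N,) targets."""
    X_dagger = np.linalg.inv(X.T @ X) @ X.T    # pseudo-inverse (np.linalg.pinv is the robust choice)
    return X_dagger @ y                        # learned weights, shape (d+1,)

# sanity check on synthetic data
rng = np.random.default_rng(2)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=100)
print(linear_regression(X, y))                 # close to w_true
```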
You can use linear regression for classification: set y to ±1 and the algorithm still works, treating the labels as real values. You then use sign(wTx) as the output - effectively a threshold at 0.
Linear regression applied like so 🔝 is called linear classification There is a problem here: 🔝
The value for the red region is highly negative for the lower points. But their target value is just -1. So, the linear regression reports high error due to this and the boundary is pushed towards the center
So, one can use Linear Regression to give a jumpstart to the perceptron - a good initial weights to start with.
- Sometimes we need to transform data to make them linearly separable
- Sometimes we want non-linear features; in the credit example, we may not want “years in residence” to enter linearly - we may want something like 0 if it is under 1 year and 1 if it is over 5 years
- Non linear transformations remain within the realm of linear models.
This is because the variables of the function are the weights, not the features. So, we can do anything with the features, we are still in the linear realm.
So, we can do this transformation:
(x1, x2) → (x1², x2²)
We can now use the perceptron in the transformed space (a sketch follows below). We cannot use arbitrarily complex transformations; there is a catch about which we’ll see later.
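A sketch of the idea, using linear regression for classification (introduced just above) in the transformed space; the circular toy data is my own illustration:

```python
import numpy as np

def phi(X):
    """(x1, x2) -> (x1^2, x2^2): a circular boundary in X space becomes linear in Z space."""
    return X ** 2

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(0.5 - X[:, 0] ** 2 - X[:, 1] ** 2)   # +1 inside the circle x1^2 + x2^2 = 0.5

Z = np.hstack([np.ones((len(X), 1)), phi(X)])    # (1, x1^2, x2^2): add z0 = 1
w = np.linalg.pinv(Z) @ y                        # one-step learning in Z space
print("Ein in Z space:", np.mean(np.sign(Z @ w) != y))   # small: the data is separable here
```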
- Linear models use the “signal” wTx (which is vector form for Σ wixi)
- One step learning: w = (XTX)⁻¹XT · y
- Non linear transformation:
- wTx is linear in w
- any transformation x → z preserves this linearity and so can be used
We had Φ transformation last time:
Note, Φ is a non-linear transformation of the input space: straight, parallel lines drawn in the input space won’t remain straight and parallel in the transformed space. Different input points may map to the same point of Z space, and some points of Z space may have no pre-image in X.
Note, we can transform from the d-dimensional space to a d̃-dimensional space; there is no limit in principle.
We learn the weights in the transformed space, the targets remain in the original space.
The learned hypothesis is still g,
Note, Φ(x) gives us z
Error measures try to answer this:
- what does it mean for “h ≈ ƒ”?
- Error measure: E(h, ƒ)
- If error is 0, h perfectly represents ƒ and we are golden.
- It is almost always a “pointwise definition”:
- e(h(x), ƒ(x))
- example: squared error: e(h(x), ƒ(x)) = (h(x) - ƒ(x))²
- another example (binary error): e(h(x), ƒ(x)) = [[ h(x) ≠ ƒ(x) ]]
- [[ h(x) ≠ ƒ(x) ]] is 1 if h(x) ≠ ƒ(x), else 0
How do we go from the pointwise to overall? We take pointwise and average it to get to overall
- Thus, in-sample error: Ein(h) = (1/N) Σ e(h(xn), ƒ(xn)), n → 1 to N
- Also, out of sample error: Eout(h) = Ex[e(h(x), ƒ(x))]
- this is just the logical definition of Eout
It is the expectation of pointwise error e on data points x
So, the learning diagram becomes:
It depends on the system.
- For some systems, false positive would be expensive. For other systems, false negatives would be expensive
For classification, we can create a table:
For supermarket:
False negatives are expensive: if you are given a discount coupon, go to the store, and they say you can’t use it, that is horrible. A false positive is cheap: someone who shouldn’t have gotten the discount gets it anyway.
For the CIA, false positives (letting the wrong person in) are disastrous.
Sometimes we don’t have the ideal error measure, so we use alternative error measures
- if the noise is gaussian, we can use squared error
- in binary, we can use cross entropy error measure (this is what we used in logistic regression, the Bernoulli thingy)
Sometimes, we use friendly measures:
- squared error for linear regression is friendly because it has a closed-form solution. The cross-entropy error measure in classification turns out to be convex, so we can find the global minimum, etc.
Very important, because the target function is not always a “function” in true mathematical sense, it is noisy. This is because the target function cannot capture ALL the information that plays a part in determining the outcome
So, we use target distribution:
- Instead of y = ƒ(x) we get:
- P(y|x)
Earlier, (x, y) was generated by a probability distribution (the dataset was generated by PD, say P) but now, y is non-deterministic and is generated by a PD too. So, (x, y) is generated by a joint PD:
- P(x, y) = P(x)P(y|x)
- we can model noisy target as:
- Noisy target = deterministic target “ƒ(x) = E(y|x)” + noise “y - f(x)”
There is no loss of generality in assuming a probability distribution P(y|x) over y: even if the target is indeed a deterministic function, we can model it as
P(y|x) = 0 everywhere except at y = f(x)
So, discrete case:
- Probability is 1 for y=f(x), 0 elsewhere
So, continuous case:
- delta function at y=f(x), 0 elsewhere
Noisy target function?
- One view: the target is noisy because we aren’t able to model it completely. We model it with limited features and are not capturing all the factors it depends on, so the target we are trying to learn looks noisy. The ground-truth process generating the data is not noisy; our limited view of it is.
- Or is it inherently noisy? The function that generated the data in the input space, that true function ƒ is noisy itself?
Our updated learning diagram:
Now, the target function is replaced by a distribution. And we also have an error measure lying between learning algorithm and final hypothesis. It is the prize we pay for not being perfect in our approximation of the target function.
Note:
- P(x)
- This is the probability of generating the dataset that we train on
- We introduced this to accommodate Hoeffding’s
- it quantifies the relative importance of x - it is not what we are interested in
- this also dictates what region of the input space you get samples from. The probability distribution could be such that it never gives you points from particular regions of input space. Currently, our model will be agnostic of that, but in Bayesian learning, the confidence of the model will be lower in those regions where it hasn’t seen examples from
- also, once you have P(x), you draw samples from it independently, i.e. i.i.d - independent identically distributed
- P(x) is not a problem if it has a long tail or if it is a delta function or whatever; as long as it is used in both training and testing.
- P(y|x)
- This is the probability the target function gives y for an input x
- This is because the target function is inherently noisy
- this is what we are trying to learn
Both these PDs together generate our dataset: P(x, y). We must always remember that P(x, y) is really a mixture of 2 PDs that are conceptually very different.
What we know so far:
- we know that learning is feasible
- Eout(g) ≈ Ein(g)
- However, this alone is not really learning. It only says that the performance of our “selected” hypothesis on the sample is a good indicator of how it will perform out of sample. This is just good generalization.
- This guarantee that Ein is a good proxy for Eout is important for us to learn. It is a building block for learning.
- Real learning is actually:
- g ≈ f
- i.e. Eout(g) ≈ 0
- this measure how far are you from the target function
- g ≈ f
- We need Eout(g) ≈ 0
- This is achieved thru:
- Eout(g) ≈ Ein(g) and Ein(g) ≈ 0
- It’s like saying: I do well on the sample dataset, and I know this means I will do well outside the sample too
We know the answer to 1 is “Yes”, but we were left with the M factor on the right-hand side to deal with. The answer to 2 is what ML algorithms are supposed to do: give us a good fit on the sample data.
The theory will achieve this for us:
- characterize feasibility of learning for infinite M
- we will have a single parameter that tells us the sophistication of the hypothesis set
- characterize the tradeoff b/w model complexity and how well our Ein(g) tracks Eout(g)
Ein is averaged over the N sample points; Eout is, by definition, the expected pointwise error over the whole input space.
Also, last time we discussed the noisiness of the target: the target is inherently noisy, it is not a function in the strict mathematical sense. There is a probability distribution over the outcome y given the input, so y = ƒ(x) + noise, or y ~ P(y|x).
Earlier, when we considered y to be deterministic, we generated the sample by P(x) But now, y also comes from the PD, so we have the dataset generated from 2 PDs P(x, y) = P(x)P(y|x)
Eout(h) is now Ex,y[e(h(x), y)]
Say you have a final exam. You get some practice problems. This is training for final exam. If you directly do the final exam questions, you might not have learned. The end goal is low Eout which only happens if you study and learn the material.
We have:
- Testing
- How well you do in the final exam (Ein) tracks how well you do in the wild (Eout)
- Training
- How well you do in the practice does not track very well how you do in the wild
- There is a factor of M that comes in here
- M represents “how much you explore”, how many cases are possible
We need to replace M with a friendlier quantity, if we are able to do it, we are in business.
Recall, M is cardinality of hypothesis set. Our learning algorithm is free to choose any hypothesis it wants from the set and so to invoke Hoeffding’s inequality, we had to add Ps of all bad events i.e. we had:
P[B1 or … or BM] where B is the bad event: | Ein(hm) - Eout(hm) | > ε
By the union bound: P[B1 or … or BM] ≤ P[B1] + … + P[BM]
We treated the bad events as if they were disjoint, but in reality they are related and overlap a lot
We can solve this exactly for the perceptron for eg, but it would be a nightmare. We want to extract from a given hypothesis set H, a number that would characterize this overlap and give us a good bound
- Consider a perceptron
Also, we have Ein,
Here, we apply Hoeffding’s inequality and we know that Ein tracks Eout etc Now, consider another perceptron, slightly different:
The green line is another perceptron. In both the cases, Eout will be slightly different. Also, Ein will be different if there is any point that falls in the yellow region.
So | Ein(h1) - Eout(h1) | ≈ | Ein(h2) - Eout(h2) |: if one exceeds ε, the other (almost certainly) does as well. But the union bound treats these strongly correlated events as if they were unrelated. So just counting the number of hypotheses does not give a tight bound; we can do better.
Instead of the whole input space, we consider only a finite set of input points, and we distinguish hypotheses only if they classify those points differently. Given a set of points, we count all the classification patterns the hypothesis set can produce on them; each such pattern is called a dichotomy. This count is a good proxy for the complexity of the hypothesis set: a powerful hypothesis set can produce every possible pattern, while a weaker one cannot achieve some of them.
The number of dichotomies is a proxy for the number of hypotheses. It is based only on the sample points, not on the whole input space. Dichotomies are “mini hypotheses”.
A hypothesis is a function h: X → {-1, +1} which takes in the full input space and gives the classification
A dichotomy is also a function h: {x1, x2, …, xN} → {-1, +1} which takes in only the input points and gives the classification (each x, say x1 is a vector of all input features for first data point)
Number of hypotheses: |H| can be ∞. Number of dichotomies: |H(x1, x2, …, xN)| ≤ 2^N.
The number of dichotomies is a candidate for replacing M. We define the growth function mH(N) as the maximum number of dichotomies H can produce on any N points.
The subscript H is because the growth function is defined for a given hypothesis set.
We also know that mH(N) ≤ 2^N
Growth function for the perceptron:
- N = 3, i.e. mH(3) = 8
- recall this is the maximum dichotomies possible
- N = 4, i.e. mH(4) = 14
- this is because we cannot get this combination with our perceptron:
- It is defined on the real line
- it has a point, everything on right is +1, on left is -1
H is the set of h: R → {-1, +1}. For N points we get N + 1 dichotomies (all blue, all red, or the threshold in one of the N - 1 gaps between points), so mH(N) = N + 1
- here, we have an interval
- everything within is +1, everything else is -1
H is the set of h: R → {-1, +1}. We get a different dichotomy by choosing which 2 of the N + 1 gaps between points the interval’s ends fall into:
mH(N) = (N+1)C2, plus 1 for the case where both ends fall in the same gap (the interval contains no points, everything is -1)
So, mH(N) = (N+1)C2 + 1 = N²/2 + N/2 + 1
More powerful than the last one!
- we define a convex region as our +1 area
- a convex region is one where the line segment connecting any 2 points of the region lies entirely inside the region. H is the set of h: R² → {-1, +1} for which the region {x : h(x) = +1} is convex
What is the growth function? If we put our N points on the perimeter of a circle, we can realize ANY of the 2^N classifications, e.g.:
So the growth function is mH(N) = 2^N. Since the hypothesis set achieves all 2^N dichotomies, we say that H shatters the N points.
We have 3 growth functions: N + 1 (positive rays), N²/2 + N/2 + 1 (positive intervals), and 2^N (convex sets).
Note that dichotomy counting isn’t perfectly tight either: convex sets are a complex hypothesis set but not the most complex imaginable, yet they already hit the 2^N growth function (a dichotomy-counting sketch follows below).
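A brute-force sketch that counts the dichotomies positive intervals can realize on N points and checks against N²/2 + N/2 + 1 (the realizability test - the +1 points must form one contiguous block - is my own restatement):

```python
from itertools import product

def dichotomies_positive_intervals(N):
    """Count labellings of N sorted points realizable by 'inside an interval -> +1'."""
    count = 0
    for labels in product([-1, +1], repeat=N):
        ones = [i for i, v in enumerate(labels) if v == +1]
        # realizable iff the +1 points are contiguous (or there are none)
        count += (not ones) or (ones == list(range(ones[0], ones[-1] + 1)))
    return count

for N in range(1, 8):
    print(N, dichotomies_positive_intervals(N), N ** 2 / 2 + N / 2 + 1)   # the two counts agree
```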
So, talking about replacing M with mH(N)… We had:
We can replace M with mH(N) if mH(N) is a polynomial. This is because then we can increase N and get the RHS to be very small, so small that the bound makes sense. The only criterion is that mH(N) should be a polynomial so that we can defeat it (we cannot defeat M right now, it is infinite)
So, once we are able to prove that our hypothesis set’s growth is polynomial, we will be able to learn using that hypothesis set. We may need a lot of data points, but we will be able to generalize to the entire input space given the finite (albeit large) input points
Defined as a k at which the growth function falls below 2^k: mH(k) < 2^k. “If no data set of size k can be shattered by H, we call k a break point for H”
- For perceptrons it is 4.
So, break point for
- positive rays:
- k = 2
- For 2 points, we cannot get this: “<red> <blue>”
- we can only get: “<red> <red>” or “<blue> <blue>”
- positive intervals
- using the condition mH(k) < 2^k we have: k = 3
- we cannot get: <blue> <red> <blue>
- convex sets
- never fails, so, k = ∞
So, we have this result:
- No break point ⇒ mH(N) = 2^N
- Any break point ⇒ mH(N) is polynomial in N
So, to be able to learn with a given hypothesis set, we just need to prove that it has a break point
Example: Given 3 points, and given that the break point is 2, we can have only 4 dichotomies
Any additional dichotomy (out of the 2³ = 8) is not allowed, because it would force some pair of the points to be shattered (and that is not allowed, since the break point is 2)
The existence of a break point restricts the number of dichotomies drastically.
We have 2 things to do now:
- Prove that a growth function mH(N) with a break point is polynomial
- Prove that we can replace M with mH(N)
Recall that to show mH(N) is polynomial, we need to show it is bounded above by a polynomial in N in the general case. It is! mH(N) is bounded above by a polynomial of degree k - 1 in N, where k is the break point for the hypothesis set.
- examples:
- H is positive rays
- k is 2, so mH(N) should be bounded by a polynomial of degree 1. True, since mH(N) = N + 1
- H is positive intervals
- k is 3, so mH(N) should be bounded by a polynomial of degree 2. True, since mH(N) = (N+1)C2 + 1
- H is 2D perceptrons
- k is 4, so mH(N) should be bounded by a polynomial of degree 3. We did not compute mH(N) for it, but we know it is bounded above by such a polynomial
So, as long as there is a break point, we will be able to get a polynomial and thus learn.
The result that we can replace M with mH(N) - the Vapnik-Chervonenkis inequality, P[|Ein(g) - Eout(g)| > ε] ≤ 4 mH(2N) e^(-ε²N/8) - is the most important theoretical result in machine learning.
- We saw that mH(N) is bounded above by a polynomial of degree k - 1 in N, where k is the break point for the hypothesis set
- Also, that we can replace M with mH(N) and thus learn according to VC inequality
Thus, we can learn for any hypothesis set which has a break point. This lecture will give us the notion of the “VC dimension”, which characterizes the complexity of the hypothesis set
- It is a quantity defined for a hypothesis set H, denoted by dvc(H)
- It is the most points H can shatter
- i.e. it is the “largest” value of N for which mH(N) = 2^N
- it is k - 1, where k is the (smallest) break point
- N ≤ dvc(H) ⇒ H can shatter N points
- N > dvc(H) ⇒ N is a break point for H (anything above dvc(H) is a break point)
- in terms of break point k:
- for any hypothesis set with VC dimension dvc(H), the growth function is bounded by a polynomial of degree dvc(H)
- examples:
- H is positive rays
- dvc(H) is 1
- H is 2D perceptrons
- k is 4, so dvc(H) is 3
- H is convex sets
- k is ∞, so dvc(H) is infinite as well
- this is the most pessimistic reading: shattering only requires some arrangement of the points (the neat circle), and in practice our points won’t be arranged like that
- so in practice we may still be able to learn, even though the VC guarantee itself is vacuous
Reiterating: we have established that if dvc(H) is finite (i.e. a break point exists), g ∈ H will generalize - independently of the learning algorithm, the input distribution, and the target function.
- for 2D, dvc(H) was 3
- for 3D, dvc(H) is 4
- in general, dvc(H) = d + 1 where d is the dimension
- I.e., the perceptron in d dimensions can shatter d+1 points (and no more)
- Also, d + 1 is the number of parameters in the model - (w0, w1, …, wd)
- the #parameters are analog degrees of freedom
- the VC dimension makes them binary degrees of freedom - it is based on the #dichotomies possible, #of points you can shatter
- examples:
- H is positive rays
- dvc(H) is 1, so 1 parameter, 1 degree of freedom - which is what we have!
- H is positive intervals
- dvc(H) is 2, so 2 parameters, 2 degrees of freedom - which is what we have!
Actually, it’s not just parameters, it is degrees of freedom. A parameter may not contribute to the degree of freedom, then it won’t count.
Consider a 1D perceptron,
It has 2 parameters (variables): 1 weight and 1 bias/threshold. Now take its output and feed it into another perceptron, and so on, 3 more times; that final output is the answer. Thus, we have this:
Here we have 4×2 = 8 parameters, but still effectively 2 degrees of freedom, so dvc(H) is still 2. You don’t care about the internal structure; you ask yourself: what is the largest number of points I can shatter? That number is dvc(H). That’s all.
So, dvc(H) measures the effective number of parameters
- Number of data points needed:
- we had this statement according to the VC inequality:
Here the RHS behaves like N^dvc · e^(-N) (up to constants): a polynomial in N, where d is dvc(H) and N is the number of samples, times a decaying exponential. Plotting LHS (probability bound) vs N:
The polynomial dominates initially (and the bound is absurd, like LHS > 10), but eventually the exponential kills it and we get tighter and tighter bounds. This is why, if we train a linear model on a large number of sample points and get some error on them, we’ll get about the same error out of sample, with high probability. That is the case of a very small δ (the RHS), i.e. a very tight statement about |Ein - Eout| holding; but the ε itself may be large. That is to say: we are 99% sure that the in-sample error is within 60% of the out-of-sample error - “very confident that it is a bad system”.
Rule of thumb:
- N ≥ 10dvc(H)
- You need 10 times the dvc(H) number of examples
- We can rearrange things:
- We had this from the VC inequality:
From the RHS we can solve for ε in terms of δ: ε = sqrt( (8/N) ln(4 mH(2N)/δ) ). δ is just how much failure probability you allow in the statement you make.
So, we have:
- ε depends on N, δ, and mH (the growth function of the hypothesis set) - we call the RHS Ω (a numerical sketch follows after this list)
- if mH is large, dvc(H) is large, the guarantee on generalization(ε) is worse
- if δ is small, I.e. You want to be very sure of the statement you make, worse ε
- if N is large, ε becomes smaller
- eventually, the log is killed by linear N (just like polynomial is killed by exponential)
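A small numerical sketch of that bound, using the simple polynomial bound mH(N) ≤ N^dvc + 1 (the dvc = 10 example is arbitrary):

```python
import numpy as np

def vc_epsilon(N, dvc, delta):
    """Generalization bound: with probability >= 1 - delta, |Eout - Ein| <= epsilon,
    with epsilon = sqrt((8/N) * ln(4 * m_H(2N) / delta))."""
    m_H = (2 * N) ** dvc + 1                     # crude polynomial bound on m_H(2N)
    return np.sqrt(8.0 / N * np.log(4 * m_H / delta))

for N in (1_000, 10_000, 100_000, 1_000_000):
    print(N, round(vc_epsilon(N, dvc=10, delta=0.05), 3))
# the bound is very loose for small N and only slowly tightens:
# the log term grows, but it is eventually killed by the linear N
```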
We can re-write as:
With probability ≥ 1 - δ
Ω is a function of
- N → goes down with it
- the more examples, the tighter bound on generalization
- H → goes up with it
- more complex H, higher the VC dimension dvc(H) and so less generalization
- δ → goes up with smaller delta
- as you want to be more sure about the confidence in your statement, less generalization
We can simplify further as:
- remove the modulus; the direction we actually care about is Eout exceeding Ein
- so, Eout - Ein ≤ Ω
- Eout - Ein is called the generalization error
- moving Ein to RHS,
- Eout ≤ Ein + Ω(N, H, δ)
- So, the RHS bounds the Eout
- We have control on the RHS, Ein we are controlling and Ω is dependent on H, N, δ etc
- If we choose a more complex H, Ein goes down, but Ω goes up.
- Hence, there is a tradeoff involved here
This form will also give us the groundwork for regularization. Earlier, we used just Ein as a proxy for Eout. It seems here, that we can use Ein + Ω as the proxy and that may give us a better handle on Eout.
It is not always possible to find the dvc(H) of a hypothesis set, e.g. neural nets. We can say it is bounded above by the number of parameters, but it can be much, much lower, as we saw with the chained perceptrons.
- We saw that dvc(H) is just the max number of points a hypothesis set can shatter
- It is k-1, where k is the break point
- We have a rule of thumb: N ≥ 10·dvc(H)
- Also, we rearranged the generalization bound to:
- Eout ≤ Ein + Ω
- Ω is a function of:
- δ - what is the probability of error in the statement we make, I.e., what is the probability that Ein tracks Eout within ε
- ε - how loosely does Ein track Eout
- It is a stand-alone theory which gives us a different angle on generalization (as contrasted with the VC analysis)
- The tradeoff is b/w approximation and generalization
- Small Eout is what we want; good approximation of ƒ out of sample
- More complex H → better chance of approximating ƒ - we have a lot of options
- Less complex H → better chance of generalizing - with fewer options it is easier to zoom in on the right one
- Ideal is H = {ƒ}: the hypothesis set with only the target function in it (but we would need to know ƒ to begin with)
- The quantification of this tradeoff in VC dimension analysis is:
- Eout ≤ Ein + Ω
- Ein is approximation, because we are trying to fit a target function to input dataset
- Ω is generalization
Bias-Variance provides an alternate way of looking at this same tradeoff
- It decomposes Eout into:
- How well H can approximate ƒ - how flexible is your H
- How well we can zoom in on a good h ∈ H
- This applies to real valued targets. We’ll use squared-error because it helps with derivation
- We start with Eout
- We keep D in the notation because the Es depend on the dataset. Earlier, in VC analysis also it was present, but we never wrote it because it wasn’t used
- Eout(g(D)) = Ex[(g(D)(x) - ƒ(x))²]
- Expected error over entire space 🔝
- To remove dependency on a particular D, we take ED on both sides
- ED[Eout(g(D))] = ED[Ex[(g(D)(x) - ƒ(x))²]]
- we now manipulate this equation 🔝
- Let’s focus on just ED[(g(D)(x) - ƒ(x))²] for a fixed x (we can swap the order of the two expectations)
- we define an “average” hypothesis ḡ: ḡ(x) = ED[g(D)(x)]
- what is gbar?
- imagine you have many datasets, so gbar is the average of the hypothesis that you choose from them.
So, we have this derivation (add and subtract ḡ(x)):
ED[(g(D)(x) - ƒ(x))²] = ED[(g(D)(x) - ḡ(x) + ḡ(x) - ƒ(x))²]
= ED[(g(D)(x) - ḡ(x))²] + (ḡ(x) - ƒ(x))² + 2 (ḡ(x) - ƒ(x)) ED[g(D)(x) - ḡ(x)]
We get this 🔝 on expanding the square. The second term is independent of D, so ED of it is itself. In the third (cross) term, ED[g(D)(x) - ḡ(x)] = ḡ(x) - ḡ(x) = 0, so it vanishes.
So, we get:
ED[(g(D)(x) - ƒ(x))²] = ED[(g(D)(x) - ḡ(x))²] + (ḡ(x) - ƒ(x))²
This 🔝 ⬆️ is a very important equation. The first term tells you how much the hypothesis you got from your particular D differs from the average hypothesis ḡ; ḡ is roughly the best you can hope for, since averaging over datasets cancels out the dataset-to-dataset fluctuation.
There are 2 hops; the hop from your hypothesis to the best you can get, and the second hop is from the best you can get to the actual target function The second term is how far is the best possible you can get from the target function itself?
- The 1st term is variance
- This is how far you are from the best you could do within your H. You cannot actually obtain ḡ because you don’t have all possible datasets, only the one you were given
- The 2nd term is bias
- Because even the best your hypothesis set can offer (ḡ) may be far from the target function - the set is “biased” away from ƒ
Thus, recall we started with:
So, we get: ED[Eout(g(D))] = Ex[bias(x) + var(x)] = bias + var
So, we have broken down the out-of-sample error into its bias and variance components. E.g., if Eout is 0.3, we might find that 0.05 of it is bias and 0.25 is variance.
We have:
Pictorial representation:
If we have a H with only 1 function, we have no variance and we always choose that. But it may not be the best, our g ( = gbar) will be biased away from the target function
If we have a very complicated H, we won’t be able to choose the best - the target function because we have a lot of variance. We get a different D each time to learn etc. And depending on that D, we might not choose ƒ. The centroid of the red region is gbar. It would be quite close to ƒ. (The difference b/w them would be the bias, which is close to 0 here)
As the complexity of H goes up, bias goes down and variance goes up
- Let’s take our target function to be a sine curve.
- ƒ(x) = sin(π x), ƒ:[-1, 1] → ℜ
- Our dataset is only 2 datapoints.
- We have 2 hypothesis sets, H0 and H1
- H0 is the constant model - h(x) = b
- H1 is the linear model - h(x) = ax + b
- Which one is better?
- Better for what?
The line is better: since we know the whole target function, a line fits it better than a constant (we are using mean squared error).
The errors are in yellow:
Eout in this case is 0.2
For H0, we have the constant only. So, the hypothesis will be g(x) = 0 and the error will be huge:
Eout in this case is 0.5
So, the linear model wins. We can do better with a 3rd order polynomial, even better with a 17th etc. This is because we have no question of zooming in, we have all the information
We get 2 examples now. We don’t know the target function, we just have independently picked examples. So, we have:
So, H0 and H1 will be:
We can compute the error (the yellow regions) etc, but it depends on which 2 points we get, on our dataset. This is why we had the bias-variance tradeoff. Now, we can do a bias-variance decomposition on these two hypothesis sets.
So, we do a simulation in which we take 2 points on random from the target function and fit our constant hypothesis to them We get this:
The gray lines are all the various g-s we choose depending on the input dataset D We can take the average of it, to get g-bar
ḡ is just what we had in the approximation case. The average hypothesis is inherently better because it “cancels out the variance”; but the output of our learning process is one of the gray lines, not ḡ. The gray shaded region is the variance around ḡ - the standard deviation around ḡ.
The error b/w the green line and the target function will give you the bias. The gray region will be the variance
If we have the H1, the straight line, we get this (on the same dataset as in case H0):
Note we have a lot of variance.
On average we get:
The variance depends on x; the gray region shows the standard deviation. When we want a single number, we take the expected value of the squared width of the gray region. So we have better approximation but very bad (high) variance. Given only 2 points, we are better off with a constant.
So, if asked what is better at approximating a sine curve, a line or a constant? - A line of course! But, if asked what is better at approximating 2 points coming from some unknown curve? - A constant is better
Let’s quantify this:
Here, the bias is exactly what we had for the approximating case. The variance is 0.25
Whereas for H1,
The bias is 0.21. This is slightly higher than in the approximating case where it was 0.20. This is to be expected because here, we get only 2 points at a time and not the whole sine curve. The variance is huge.
So, H0 is better we see.
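A simulation sketch of this experiment (grid size, number of runs, and the least-squares fits are my choices):

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(np.pi * x)
x_grid = np.linspace(-1, 1, 201)
runs = 10_000

b0 = np.empty(runs)                  # learned constant for H0: h(x) = b
ab = np.empty((runs, 2))             # learned (a, b) for H1: h(x) = a x + b
for r in range(runs):
    x = rng.uniform(-1, 1, 2)        # dataset: 2 points on the (noiseless) sine curve
    y = f(x)
    b0[r] = y.mean()                                   # least-squares constant
    a = (y[1] - y[0]) / (x[1] - x[0])                  # the line through the 2 points
    ab[r] = a, y[0] - a * x[0]

def bias_var(preds):                 # preds: (runs, grid) predictions of each learned g
    g_bar = preds.mean(axis=0)
    return np.mean((g_bar - f(x_grid)) ** 2), np.mean(preds.var(axis=0))

print("H0 bias, var:", bias_var(np.tile(b0[:, None], (1, x_grid.size))))   # ~0.50, ~0.25
print("H1 bias, var:", bias_var(ab[:, :1] * x_grid + ab[:, 1:]))           # ~0.21, ~1.7
```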
In any ML scenario, we are matching the model complexity to the data resources, not the target complexity; we don’t know what the target function is.
Let’s say you have 100 datapoints. Is there any utility in taking 10 points at random and trying to learn a hypothesis? Doing this for many random samplings of 10 points and then averaging the hypothesis would give us something close to gbar. This is a way of dealing with variance, no?
Plot of Eout and Ein as a function of N, for a data set D of size N. Expected out-of-sample error: ED[Eout(g(D))]
- it depends on the dataset. If you want it to be agnostic of any particular dataset, we “integrate” D out and get expected value wrt D
Expected in-sample error: ED[Ein(g(D))]
- this measures how well you can fit the data you are given
How do these vary with N?
This is for the simple model.
The same thing holds for the complex model; the graph is just shifted to the right. The region where Ein is zero extends roughly up to N ≈ dvc: up to there the hypothesis set can fit everything perfectly, since the points can be shattered.
In VC analysis, we had Eout which comprised of in-sample error and generalization error (Ω) So, the figure looks like this:
One caveat: in the VC analysis the generalization error is not this small, it is much, much larger; it just follows this shape. The bound is not that tight.
If we plot the same thing for the BV analysis, we get:
Here, the bias is the error in g-bar, that is the best you can do. As N goes up, the variance decreases, and we get closer and closer to g-bar.
Let’s say we have a noisy target y = w*Tx + noise. D is {(x1, y1), (x2, y2), …, (xN, yN)}, with the input points picked independently.
Learning solution: w = (XTX)⁻¹XTy
In-sample error vector: Xw - y
Out-of-sample error: same input points but fresh noise, Xw - y′
On analysis we get:
σ² is the variance of the noise, and that is the best Eout you can hope for, since you can’t capture the noise itself. Also, for N up to d+1 you can fit the points perfectly (Ein = 0). As N grows, the variance gets averaged out and Eout comes down towards σ².
Best approximation error = σ²
Here we have an exact formula for the expected generalization error, 2σ²(d+1)/N, and we see that the N needed scales with dvc.
- We took the expected value of the Eout wrt the identity of dataset D, size fixed at N to get rid of variation due to datasets. And we got a clean decomposition of Eout into bias + variance terms.
- Bias is how far is your best hypothesis in your hypothesis set away from ƒ
- Variance is how close you can get to that best hypothesis, given that you only have one finite dataset.
- g(D)(x) → gbar(x) → ƒ(x)
- N ∝ dVC
We used non-linear transforms earlier: (x1, x2) → (z1, z2, …, zd̃), i.e. z = Φ(x). Example: z = (1, x1, x2, x1², x2², x1x2) - the quadratic transformation. Our final hypothesis g(x) always lives in X space.
For classification we take sign(w̃Tz); for regression we use w̃Tz as is.
In terms of generalization 🔝:
Earlier, in X space, we had d + 1 free parameters in w. In the transformed space we have d̃ + 1 free parameters in w̃, so dvc goes up and we pay a price in generalization.
However, we can do this to reduce the parameters:
Here, we are reducing the parameters to just 1. This won’t rescue generalization, because we looked at the data and did part of the “learning” ourselves: we decided that we need a circle and don’t need the cross terms, etc. We have to pay the dvc price for what we started with in our head, which is the full quadratic form.
Looking at the data before choosing the model is hazardous to your Eout: you effectively explore a large hypothesis space but charge yourself for only a small one. This is data snooping.
This is why you cannot arbitrarily go to very high dimensions during transformations, because the dVC gets high and generalization suffers.
We have seen so far:
- Linear perceptron
- Linear regression
Now we will see Logistic regression.
This will be our third linear model. Being linear means that we compute the signal as being a linear combination of the weights. And then we use the signal to do classification, estimation etc
The perceptron is this:
The staircase pattern is there because the perceptron applies a step function: we look at the sign of the signal and return +1/-1.
All the linear models differ in what they do to the signal. Perceptron looks at the sign 🔝
Regression uses s as is. We can say we apply the identity function.
Logistic regression squashes it to be b/w 0 and 1 by applying a nonlinearity to it (logistic function θ).
This is somewhat b/w perceptron and linear regression
The logistic function θ looks like this:
This is a soft threshold. There are many formulas that give this shape; we will use θ(s) = e^s / (1 + e^s).
So, we have: h(x) = θ(s), s = wTx. The signal s, can be thought of as a “risk score”
During training we only see a binary label on each datapoint in D, not a probability. The target function itself is noisy: it gives y = +1 with probability ƒ(x) and y = -1 with probability 1 - ƒ(x), and that underlying probability is what we are trying to learn.
We can either do a cost-benefit analysis and come up with a very plausible error measure. Or we can use something that is friendly to the optimizer.
For each (x, y), y = +1 is generated with probability ƒ(x). We can build a plausible error measure based on likelihood:
We will grade different hypothesis according to the likelihood that they generated the data. This gives us a way of grading the different hypotheses.
Let’s assume that the datapoints in D were generated by h (i.e. h = ƒ). This is the reverse of what we were doing earlier: before, we asked for the most probable hypothesis given the data - that was clean. Now we ask for the probability of the data given the hypothesis, which raises some concerns. Bayesians get around this by starting with a prior and then using the likelihood to get a posterior.
What we need to do is, we need to compute the LHS in 🔝 Then we select the h which has the greatest LHS
So, we do this: We replace h(x) in that formula 🔝 with:
h(x) = θ(wTx)
Also, we see:
θ(-s) = 1 - θ(s)
So, we get:
P(y | x) = θ(y · wTx)
This is for one point. For all the points in the dataset, the likelihood is the product Π θ(yn · wTxn), n → 1 to N:
So, since we find the w that maximizes the likelihood of getting this dataset, we learn something about the underlying probability distribution that generated this dataset.
How do we maximize the likelihood?
We have:
recall, we are maximizing wrt w
Instead, we can maximize the log-likelihood.
Equivalently, we can minimize the average negative log-likelihood: (1/N) Σ ln( 1/θ(yn wTxn) ), n → 1 to N
We can substitute θ(s) = 1/(1 + e^(-s)) and get this simplification:
Ein(w) = (1/N) Σ ln(1 + e^(-yn wTxn))
So, we call the term inside the Σ the pointwise error e(h(xn), yn) = ln(1 + e^(-yn wTxn))
The exponent -ynwTxn is intuitive: if the risk score wTxn is large and positive but yn is -1, the exponent is large and positive, so the error is large; if the score agrees in sign with yn, the exponent is very negative and the error is close to 0.
This is called the cross-entropy error. (b/w h and ƒ)
Since we know the error measure, we can minimize it
In linear regression we had a closed-form solution: the pseudo-inverse, one-step learning.
We cannot find a closed form solution for the logistic regression case. We need an iterative solution - gradient descent We have a convex error surface in logistic regression, so we can optimize it perfectly. We won’t have this 🔝 when we have neural networks etc.
We have:
We need the direction v̂. Earlier we just said that the direction of steepest descent is along the negative gradient of the surface at that point. An alternative way to arrive at the same idea:
We take a step of size η along a unit vector v̂ and look at the change in error: ΔEin = Ein(w(0) + ηv̂) - Ein(w(0)).
Using Taylor series expansion of the first term, we get:
If the surface were linear, we wouldn’t have the O(η²) term; the first-order approximation would be exact.
We just need the direction v̂ that makes this as negative as possible, which is v̂ = -∇Ein(w(0)) / ‖∇Ein(w(0))‖.
Size of η?
η should increase with the slope. So instead of a fixed step of size η along v̂, we take Δw = -η∇Ein(w): the step size now scales with the gradient.
So, we moved from a fixed step size to a fixed learning rate. We can manipulate that as well, with momentum etc.
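A sketch of logistic regression trained with batch gradient descent on the cross-entropy error (the synthetic data and step counts are illustrative):

```python
import numpy as np

def cross_entropy(w, X, y):
    """Ein(w) = (1/N) sum ln(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def gradient(w, X, y):
    """Gradient of the cross-entropy error: -(1/N) sum y_n x_n / (1 + exp(y_n w^T x_n))."""
    s = y * (X @ w)
    return -np.mean((y / (1 + np.exp(s)))[:, None] * X, axis=0)

def logistic_regression(X, y, eta=0.1, steps=2000):
    """Batch GD on the convex cross-entropy surface.
    X: (N, d+1) with the x0 = 1 column; y: (N,) in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * gradient(w, X, y)   # fixed learning rate; step size scales with the slope
    return w

rng = np.random.default_rng(5)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0]) + 0.3 * rng.normal(size=300))   # noisy linear target
w = logistic_regression(X, y)
print("final Ein:", cross_entropy(w, X, y))
```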
We have seen 3 linear models now:
In logistic regression we have a convex error surface, so we could use gradient descent and minimize the error. This is not the case with Neural Networks.
We had GD, which minimizes the in sample error
We need to evaluate the hypothesis at every point in D, so this is expensive. Each step is one epoch (epoch is when you have considered the full dataset at once)
This was “batch GD” 🔝
We can also have stochastic GD. Pick 1 example (xn, yn) at a time. Apply GD to e(h(xn), yn)
This works because the “average” direction in which we move is: En[-∇e(h(xn), yn)] = -∇Ein(w), the same as the batch gradient.
So at every step we move in this expected direction plus noise. This is randomized GD, aka stochastic GD.
- cheaper computation
- with the same expected movement
- randomization
- so you don’t get stuck in local minima always
- simple
- simplest possible optimization to GD
- Rule of thumb: η = 0.1 works
In the Netflix competition, we had user factor vectors uik for user i and movie factor vectors vjk for movie j. We also had a rating rij for user i and movie j.
We can define an error measure per observed rating as: e = (rij - Σk uik vjk)²
So, for each rating, SGD nudges the 2k parameters involved (the k user factors and the k movie factors) a little, moving the predicted rating towards the observed one (a sketch follows below).
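A sketch of that SGD update (the factor dimension, learning rate, and toy ratings are assumptions of mine):

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_movies, k=10, eta=0.05, epochs=50, seed=0):
    """SGD on the squared error (r_ij - u_i . v_j)^2 over the observed ratings.
    ratings: list of (user i, movie j, rating r) triples."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(n_users, k))      # start from small random factors
    V = 0.1 * rng.normal(size=(n_movies, k))
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):   # one rating at a time, in random order
            i, j, r = ratings[idx]
            ui, vj = U[i].copy(), V[j].copy()
            err = r - ui @ vj
            U[i] += eta * err * vj               # nudge the k user factors involved
            V[j] += eta * err * ui               # nudge the k movie factors involved
    return U, V

# toy usage: 3 users, 4 movies, a handful of observed ratings
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 3, 1.0)]
U, V = sgd_factorize(ratings, n_users=3, n_movies=4, k=2)
print("predicted rating (user 0, movie 0):", U[0] @ V[0])   # approaches 5.0 after training
```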
The building blocks are perceptrons; neural networks are combinations of them. The case where a single perceptron failed was the 4 points arranged in the XOR pattern.
But this can be solved by 2 perceptrons
We can combine them using OR/AND gates which are manifested by the weights on them
Note the step function just takes the sign to give +1/-1
Now, once we have OR and AND, we can create layers of these to get more complex function approximators, or logic circuits, etc.
We can get XOR with:
So, we have the 2 perceptrons h1 and h2; we form h1 AND (NOT h2) and (NOT h1) AND h2 with perceptron gates, and then OR the two (a sketch follows below)
So, the full multilayer perceptron that implemented the function that we wanted(where the single perceptron failed) is:
3 layers, feedforward architecture.
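A sketch of that construction with hard-threshold units on +/-1 signals (the first-layer weights are illustrative; the AND/OR thresholds follow the usual +/-1 logic-gate trick):

```python
import numpy as np

sign = lambda s: np.where(s >= 0, 1, -1)

def perceptron(w):
    """h(x) = sign(w0 + w . x) for a fixed weight vector (w[0] is the bias)."""
    return lambda x: sign(w[0] + np.dot(w[1:], x))

h1 = perceptron(np.array([-0.5, 1.0, 1.0]))   # two illustrative first-layer perceptrons
h2 = perceptron(np.array([ 0.5, 1.0, 1.0]))

def xor_unit(a, b):
    """XOR on +/-1 signals built from perceptron-style gates:
    (a AND NOT b) OR (NOT a AND b)."""
    c1 = sign(a - b - 1.5)      # a AND (NOT b)
    c2 = sign(-a + b - 1.5)     # (NOT a) AND b
    return sign(c1 + c2 + 1.5)  # OR

def f(x):                        # the full 3-layer feedforward network
    return xor_unit(h1(x), h2(x))

for a in (+1, -1):
    for b in (+1, -1):
        print(a, b, "->", xor_unit(a, b))   # +1 exactly when a and b disagree
```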
Now, you can get whatever surface you want to approximate:
2 red flags: generalization and optimization. For generalization, we know exactly what we are dealing with and can reason about it rationally. For optimization, what we can do is use soft thresholds throughout and only hard-threshold the final answer. Soft thresholds are better because they are smooth (differentiable), so we can use gradient descent readily.
We have this:
θ is any non-linearity that you want; we will use tanh, which goes from -1 to +1 (the logistic function is similar but goes from 0 to 1). Each of these units could even use a different non-linearity.
The input layer holds x; the layers are 1 ≤ l ≤ L, with l = L being the output layer.
The non-linearity θ can be: tanh (hyperbolic tangent)
This function behaves linearly for small signals and like a hard threshold for large ones.
The parameters of the neural network, the weights have elaborate indices.
d(l) is the dimension of the layer l inputs start with index 0 because each neuron has the mandatory input x0 = 1
So, 🔝 the value in the next layer is just the signal (computed from the weights+values of previous layer) and passed thru the nonlinear function θ.
So, we apply this from the input layer all the way to the output layer; the final value x1(L) is h(x).
All the weights w = {wij(l)} together determine h(x). The error on example (xn, yn), e(h(xn), yn), is therefore a function of the weights: e(w).
We need the gradient of the e(h(xn), yn)
To find ∂e(w)/∂wij(l) we could apply the chain rule directly, but doing that separately for every single weight is wasteful; we need a faster way of computing all of them.
We can use backpropagation algo to get the entire gradient in 1 shot
We can write, using the chain rule: ∂e(w)/∂wij(l) = ( ∂e(w)/∂sj(l) ) × ( ∂sj(l)/∂wij(l) )
The 2nd term is simple: it is just xi(l-1). The 1st term needs thought; call it δj(l). So the derivative is a product of the x from the layer below and the δ from the layer the weight feeds into - an attractive property.
We get δ at the output layer and then use that to compute the inner layer etc.
This is just a normal application of chain rules.
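A sketch of the forward and backward passes for one hidden layer with tanh units and squared error (the array shapes and the single-output assumption are mine):

```python
import numpy as np

def forward(x, W1, W2):
    """x includes x0 = 1; W1: (d0+1, d1), W2: (d1+1, 1)."""
    s1 = W1.T @ x                                   # signals entering the hidden layer
    x1 = np.concatenate(([1.0], np.tanh(s1)))       # hidden outputs, bias unit prepended
    s2 = W2.T @ x1                                  # signal entering the output layer
    return s1, x1, s2, np.tanh(s2)                  # the last item is h(x)

def backprop(x, y, W1, W2):
    """Gradients of e = (h(x) - y)^2 with respect to W1 and W2."""
    s1, x1, s2, out = forward(x, W1, W2)
    delta2 = 2 * (out - y) * (1 - np.tanh(s2) ** 2)            # delta at the output layer
    delta1 = (1 - np.tanh(s1) ** 2) * (W2[1:, 0] * delta2[0])  # propagate back through W2
    grad_W2 = np.outer(x1, delta2)    # de/dw_ij(2) = x_i(1) * delta_j(2)
    grad_W1 = np.outer(x, delta1)     # de/dw_ij(1) = x_i(0) * delta_j(1)
    return grad_W1, grad_W2

# one SGD step on a single example
rng = np.random.default_rng(6)
W1, W2 = 0.1 * rng.normal(size=(3, 4)), 0.1 * rng.normal(size=(5, 1))
x, y = np.array([1.0, 0.5, -0.2]), 1.0
g1, g2 = backprop(x, y, W1, W2)
W1, W2 = W1 - 0.1 * g1, W2 - 0.1 * g2     # the error on this example decreases
```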
The hidden layers are performing nonlinear transformation. But the transformations are learned, and we are already charging the generalization for that. The VC dimension is ≈ #weights
- We can combine perceptrons to get more complex boundaries
- The optimization is difficult, so we introduced neural networks that softened thresholds of the perceptrons.
- We used chain rule to get derivative of error wrt weights and so we can use SGD to minimize the error.
Let’s say we have a simple target function. We generate 5 data points with random noise in them.
Overfitting is when you go farther than you should have. If we try to fit the above ƒ (from those 5 noisy points) with a 4th order polynomial, we will overfit.
When training a neural network, if you plot Ein and Eout at each step and you get something like so:
Overfitting is when Ein goes down and at the same time, Eout is going up. So, we should stop at that moment and report that. This is called early stopping
Overfitting is “fitting the data more than warranted” The culprit is when we are fitting the noise.
Let’s have a 10th order target function. We generate 15 datapoints + noise
We also have a second target function which is a 50th order polynomial. For it we generate 15 noiseless datapoints lying on the curve.
We will have 2 models for each target
- 2nd order polynomial
- 10th order polynomial
For noisy low-order target
This is a blatant case of overfitting
For noiseless high-order target
So, here too, it is overfitting. Even when we are fitting a 10th order polynomial to a 50th order target function
The reason is that the second case also has “noise”, not the traditional noise, but noise. We need to match the hypothesis complexity to the data resources and not the target function complexity.
Impact of noise level and target complexity
y = f(x) + ε(x)
ε(x) is noise with variance σ² (its “energy”)
f(x) is a normalized Qth order polynomial (normalized so that the noise level is meaningful) - we’ll use Legendre polynomials. So, the things affecting overfitting are: the noise level σ², the target complexity Q, and the number of data points N.
We would like to understand the dependency b/w these on overfitting
We fit the polynomials using our 2 hypothesis - H2 which is a 2nd order polynomial and H10 which is a 10th order polynomial. And we’ll compare the out-of-sample errors. Overfit measure: Eout(g10) - Eout(g2) So, a positive value means the more complex guy is doing worse, so overfitting. Negative means that the large model is doing better, no overfitting
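A scaled-down sketch of this experiment (plain polynomial fits via numpy instead of the lecture's Legendre setup; the targets, trial counts, and noise model are my simplifications):

```python
import numpy as np
from numpy.polynomial import Polynomial

def overfit_measure(N, sigma, Qf=20, trials=200, seed=7):
    """Average Eout(g10) - Eout(g2) when fitting N noisy samples of a random
    degree-Qf polynomial target; positive values mean the 10th-order model overfits."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1, 1, 500)
    diffs = []
    for _ in range(trials):
        f = Polynomial(rng.normal(size=Qf + 1))          # a random target polynomial
        x = rng.uniform(-1, 1, N)
        y = f(x) + sigma * rng.normal(size=N)            # noisy samples
        e_out = [np.mean((Polynomial.fit(x, y, Q)(x_test) - f(x_test)) ** 2)
                 for Q in (2, 10)]
        diffs.append(e_out[1] - e_out[0])
    return np.mean(diffs)

print(overfit_measure(N=15, sigma=1.0))    # positive: blatant overfitting with few points
print(overfit_measure(N=120, sigma=1.0))   # with more data the gap shrinks or reverses
```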
When we run this for 10s of millions of examples, we get this:
- Impact of σ2
- N vs. σ2
(Red is overfitting.) So the larger σ², i.e. the more noise, the more N we need to beat overfitting.
For all the simulations 🔝, the target complexity is fixed at 20th order polynomial
This was okayish, no surprises. What about the other case where we were getting overfitting without noise?
Here we fix the noise level at 0.1, but change the target function complexity. We see roughly the same pattern: overfitting happens with higher complexity of ƒ, but with more N we can beat it. (It is not as pronounced as before, because this "noise" has a pattern that we can partly capture, whereas in the other case we are trying to learn genuinely random noise, which takes a lot of N.) Also, overfitting does not start until the order of the target polynomial goes past about 10: until then, even our bigger model can capture the target, so the target needs to be sufficiently complex before this effect kicks in. Thus, from all this, there appears to be another factor, apart from conventional noise, that affects overfitting.
The first case, with σ2 energy, is called "stochastic noise." The second case is noise which isn't random; it is deterministic, but we just cannot capture it, so it acts like noise to us. We'll call it "deterministic noise."
The part of ƒ that H cannot capture → f(x) − h*(x)
Main differences with stochastic noise:
- depends on H
- fixed for a given x (the difference b/w f(x) and h*(x) at that point)
They behave exactly the same as far as learning is concerned.
So, if we have only 10 points and the hypothesis is very complex, it will try to fit in the noise - either stochastic or deterministic - and you overfit
Earlier, when we did the Bias-Variance decomposition, we got the bias term and the variance term. Adding noise to the mix and repeating the analysis, we get:
The 3rd term is the stochastic noise. The second term, the bias, is the deterministic noise. The noise terms increase the variance (the 1st term)
Two cures:
- Regularization - putting the brakes
- Validation - checking the bottom line - learning not to rely on Ein when deciding when to stop, but getting an estimate of Eout instead
This will allow us to use the 4th order polynomial but improve dramatically:
The 1st cure to overfitting is Regularization, 2nd is validation
Regularization has a mathematical basis: function approximation is an ill-posed problem, because many functions fit the given values, so you impose smoothness constraints.
It also has heuristic basis.
Earlier, we studied how a linear model fails on fitting a sine curve with only 2 points.
We can apply regularization here and restrict the offsets and overly steep slopes. This means we sacrifice a little on the fit of the datapoints.
This shows that the variance is reduced.
Quantifying it:
So, with regularization, the total (bias + variance) is better than with the constant model. There is a small increase in the bias, and the g-bar is different as well. This is because we are handicapping the flexibility of the hypothesis, so we don't get the same g-bar as before. The slight increase in bias can be considered a side effect of the treatment.
The g-bar is the average hypothesis: what you get if you draw infinitely many datasets of 2 datapoints each and average all the various hypotheses you form.
Regularization allows you to choose a model “in-between” the constant and linear model - a sort of a restricted linear model
What regularization did we apply?
Let’s define a hypothesis set: HQ = polynomials of order Q
The non-linear transformation that produces this polynomial: z
You take the input x and evaluate these polynomials of it; this is the nonlinear transformation. It converts the input into features of all orders up to Q.
The hypothesis set HQ is simply all linear combinations of Legendre polynomials up to order Q.
We can use linear regression to get the solution This is simple, we know the drill:
- We apply the transformation on the input points
- We write Ein(w), in vector form
- We know the formula: wlin = (ZTZ)-1ZTy (a minimal sketch follows below)
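A minimal sketch of that drill, assuming a plain-powers transform in place of the Legendre one:

```python
import numpy as np

def poly_transform(x, Q):
    # plain powers 1, x, x^2, ..., x^Q; the lecture uses Legendre polynomials,
    # which numpy also provides (np.polynomial.legendre) if you want the real thing
    return np.vander(x, Q + 1, increasing=True)

def fit_unregularized(x, y, Q):
    Z = poly_transform(x, Q)
    return np.linalg.pinv(Z) @ y          # w_lin = (Z^T Z)^-1 Z^T y
```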
What if we constrain the weights?
- we can consider H2 as a constrained version of H10
- But this is a hard constraint, with wq = 0 for q > 2
A softer constraint would be:
Here, we are decreasing the number of possible hypotheses, so dVC decreases and our chances of generalization improve. So, the optimization problem changes to:
Pictorially, we have our old error surface:
The lowest error lies at the center of this ellipse. This is what is reported by linear regression without regularization. With the constraint, we have:
So, we now have to pick a point that is as close to wlin as possible while still respecting the constraint; it will lie on the circle. Also, if C is too large, we will have something like this:
Here, the solution will be wlin, as the regularization was too liberal.
Now, consider a non-optimal point:
Here, the gradient of Ein is orthogonal to the ellipse. Also, the normal to the circle is the w vector itself (the vector from the origin to the point w is orthogonal to the circle's circumference). At the optimum, the two must point in opposite directions, so we can write:
We can put proportionality constant as:
So, the last equation looks like the minimization of:
So, we moved from constrained optimization to ordinary unconstrained optimization. λ depends on C (among other things), and they move in opposite directions: the larger C (the looser the constraint), the smaller the corresponding λ.
Augmented error We are now minimizing the augmented error - that is, error with the regularization term attached to it
These 🔝 two optimizations are fundamentally the same.
The bottom formulation lends itself to VC analysis, as we are restricting our hypothesis set explicitly, we have a lower dVC and better chances at generalization.
The solution is simple:
We have: Eaug(w) = Ein(w) + (λ/N) wTw, with Ein(w) = (1/N)(Zw − y)T(Zw − y)
Setting the gradient ∇Eaug(w) = 0 gives: ZT(Zw − y) + λw = 0
So, the solution with regularization is: wreg = (ZTZ + λI)-1ZTy (compare with wlin = (ZTZ)-1ZTy).
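Continuing the sketch above (same assumed `poly_transform`, numpy already imported), the regularized fit is one line:

```python
def fit_regularized(x, y, Q, lam):
    Z = poly_transform(x, Q)
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)   # (Z^T Z + lambda I)^-1 Z^T y
```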
Applying regularization on the previous problem:
We see that regularization increases bias and reduces variance; with too much regularization we go from overfitting to underfitting.
There is an optimal value of λ:
This regularization we studied, wTw ≤ C is called weight decay.
Why is this called weight decay?
We had in GD:
But now, we also have the gradient of the (λ/N)wTw term.
We see that λ plays the role of decaying the weights at each step. This applies to neural networks as well; wTw just becomes the sum of the squares of all the weights in the network.
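A sketch of a single weight-decay step for a toy linear model with squared error (the helper name and model are our own):

```python
import numpy as np

def sgd_step_weight_decay(w, x, y, eta, lam, N):
    """One SGD step on the augmented error e(w) + (lam/N) * w^T w,
    for a linear model with squared error e = (w.x - y)^2."""
    grad_e = 2.0 * (w @ x - y) * x                          # gradient of the plain error
    return (1.0 - 2.0 * eta * lam / N) * w - eta * grad_e   # the first factor "decays" the weights
```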
Apart from constraining the total weight magnitude uniformly, we can also decide that some weights matter more than others, using an importance factor γq: if γq is low, that weight need not be reduced much; if it is big, there is more emphasis on reducing that weight.
Example:
The algorithm will try to find a lower order fit
In neural networks, we give different layers different γ’s
One famous family of regularizers is the Tikhonov regularizer.
In the above form:
We apply the regularization only to the diagonal terms, i.e. w1², w2², etc. In the Tikhonov regularizer, written in matrix form, you can have off-diagonal terms as well and get all sorts of regularization effects - weight decay, low-order fit, high-order fit, etc.
Practical rules:
- stochastic noise is high frequency
- deterministic noise is also nonsmooth
- Constrain learning towards smoother hypotheses (because the noise is not smooth, so we harm the noise by going smooth)
This works because it stops the hypothesis from trying to fit the high-frequency noise (stochastic or deterministic). Generally, small weights correspond to smoother hypotheses.
The holy grail of machine learning is finding Ein which is a great proxy for Eout. If we get that, we can minimize Ein and go home. But there is this slack, this bound etc. Eaug is better at being a proxy for Eout compared to Ein.
In neural networks, you can do this ⬇️ to eliminate small weights and leave the large ones alone.
So, this means that we will be near the logical part (+/-1) of tanh and not the linear part. This is good because the neural network can then implement OR/AND gates and combine them to approximate any function.
Early stopping is also a form of regularization. Choosing the optimal λ requires validation.
When there is no noise, we don't need any λ; we need more λ when there is more noise.
This is for stochastic noise 🔝
We get the exact same thing for deterministic noise
This should seal the correspondence in your mind that, as far as overfitting and its cures are concerned, both kinds of noise behave exactly the same way.
We got the Eaug to minimize now:
The general form of regularization was:
The choice of λ is determined using validation.
We have, Eout(h) = Ein(h) + overfit penalty
Both regularization and validation deal with this “overfit penalty” 🔝
Regularization gets a handle on the overfit penalty by estimating a proxy for it and minimizing Eaug; validation tries to estimate Eout directly.
On out-of-sample points (x,y), we have e(h(x), y) as:
- squared error = (h(x)-y)2
- binary error = ⟦h(x) ≠ y⟧ (1 if they differ, 0 otherwise)
If we take the expected value of this error w.r.t. x, we get Eout: Eout(h) = Ex[e(h(x), y)]
There is a lot of variance because we have only 1 point here var[e(h(x), y)] = σ2
To reduce the variance, we move from a single point to a validation set of K points.
The error on the validation set is: Eval(h) = (1/K) Σ e(h(xk), yk), k → 1 to K
These K points 🔝 weren't used in training, and there is more than one of them, so we hope this is a good estimate of Eout.
So, Eout(h) = E[Eval(h)] which is:
For the variance of Eval(h): using the same analysis as earlier, the covariance (cross) terms go to zero since the points were chosen independently, and we get var[Eval(h)] = σ2/K.
So, we get: Eval(h) = Eout(h) ± O(1/√K)
Since K is taken out of N, we cannot increase it indefinitely.
One idea: K is put back into N
Divide D into 2 parts, N-K to train and K to validate
Train on N−K points to get g−, evaluate Eval(g−), then train on the full N points to get g and report that. We don't have a direct estimate of Eout(g), but we know it will be better than Eout(g−) due to the behavior of the learning curves, so Eval(g−) errs on the pessimistic side.
This is good because we get to train on the whole set. But bad because the validation error we are reporting is the error on a different hypothesis than the one we finally use. So, our error estimate is bad.
Rule of thumb: K = 20% of N, N/5
When we use the validation error for early stopping, we introduce an optimistic bias into our estimate of Eout. This is because there is a certain variance in Eval, and we ignore Eval when it happens to be high but take it when it happens to be low.
The main purpose of validation set is to help with model selection. We have M models to choose from.
We will use Dtrain to train and we'll get gm− for each model. We'll evaluate each using Dval and pick the best.
When you pick the best 🔝, there's an optimistic bias.
The output is gm* which is the best model trained on the complete D
Eval(gm*−) is a biased estimate of Eout(gm*−)
The bias diminishes with growing K
- The curves are going up because we have fewer datapoints left for training
- the 2 curves are getting closer together because the Eval estimate of Eout is getting better and better
We can quantify the optimistic bias. If you think about it, selecting the best hypothesis from the M options (by picking the model with the lowest Eval) is just regular training: Dval is used for "training" on the finalist hypotheses.
Hval = {g1−, g2−, …, gM−}
Using Hoeffding and VC!
So, we have an added term of ln M. Hence, if we use validation to select from 20 candidate models using 100 points, the bias is larger than if we are selecting from 2 (only by a log factor, but still).
Data contamination: We have Ein, Etest, Eval
If we use the data to make choices, we are contaminating it and reducing its ability to estimate real performance.
Training set: totally contaminated. Validation set: somewhat contaminated. Test set: totally clean.
We have the following chain of reasoning.
Eout(g) → this is what we want to estimate; g is the final hypothesis, trained on the full N datapoints. Eout(g−) → this is our proxy for Eout(g); g− is trained on only N−K points. Eval(g−) → this is our estimate of Eout(g−), measured on the K validation points (for the model with the lowest Eval of all M); this one has an optimistic bias in it.
We want small K so that g− is fairly close to g. We want large K so that Eval(g−) is a good proxy for Eout(g−) (due to reduced variance)
We will use N−1 points for training. This means g− will be very close to g. But Eval(g−) is then not a good estimate of Eout(g−) due to the huge variance of a single point. So, let's see what we can do.
We have Dn = D minus the nth point.
The final hypothesis learned from Dn is gn−. We have: en = Eval(gn−) = e(gn−(xn), yn). On its own, en is a terrible (high-variance) estimate of Eout.
If we do this once for each of the N points, we get N such estimates; their average has much lower variance - we'll call it ECV = (1/N) Σ en, n → 1 to N.
So, we have the best of both worlds, large K and small K
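A sketch of leave-one-out cross validation; `fit` and `predict` are placeholders for whatever model is being evaluated:

```python
import numpy as np

def loo_cv_error(X, y, fit, predict):
    """E_CV by leave-one-out. fit(X, y) -> model and predict(model, x) -> value
    are whatever learning model is being evaluated."""
    N = len(y)
    errs = []
    for n in range(N):
        mask = np.arange(N) != n
        g_minus = fit(X[mask], y[mask])                      # g_n^- trained without point n
        errs.append((predict(g_minus, X[n]) - y[n]) ** 2)    # e_n on the left-out point
    return np.mean(errs)                                     # E_CV = average of the e_n
```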
Demonstration: Let’s use CV to choose between a linear and constant model for this dataset
We get ECV:
Same for constant model:
It turns out that ECV for the constant model is lower, so we'll choose that. That is the right choice here: the datapoints were indeed generated by a constant target function plus noise. Using ECV gives you the right answer on average.
Another example: We can use CV to find out what order of polynomial to use for digit classification
We'll use 500 points for training and the rest for testing. The nonlinear transformation will be polynomial features up to 5th order.
We'll have 500 training sessions, each with 499 points. We get this curve:
We note how ECV is a good proxy for Eout. CV tells us to stop at around 5-7; we can stop at 6. Here's the plot:
The CV-chosen result is smoother and has a better Eout.
There is a problem with "leave one out": you'll have N training sessions (with N−1 points each). So, we generally leave K points out at a time.
This works nicely if N is big: the number of training sessions goes down, and we don't take a huge hit on the number of training points (N−K is still large).
Rule of thumb: 10-fold Cross validation works nicely in practice.
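The same sketch, 10-fold style (again with placeholder `fit`/`predict`):

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, folds=10, seed=0):
    """Leave N/folds points out each time, so only `folds` training sessions."""
    N = len(y)
    idx = np.random.default_rng(seed).permutation(N)
    errs = []
    for left_out in np.array_split(idx, folds):
        mask = np.ones(N, dtype=bool)
        mask[left_out] = False
        g_minus = fit(X[mask], y[mask])                      # trained on N - N/folds points
        errs.append(np.mean((predict(g_minus, X[left_out]) - y[left_out]) ** 2))
    return np.mean(errs)
```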
The whole lecture in a slide:
SVM - Classification King
For linearly separable data, we can have different separating lines. The ones with the largest margin are the best.
We can see that as the margin increases, the separating line gets better (even though Ein is the same for all of them), because a bigger margin tolerates noise better.
Can we solve for the w that maximizes the margin? Recall dichotomies and the growth function: the perceptron can shatter 3 points completely, so we have 8 dichotomies:
Each dichotomy’s max margin:
If we accept dichotomies with a mandatory minimum margin, we reject some dichotomies, so the growth function decreases, dVC decreases, we have better generalization chances. This is just like regularization.
Let xn be the nearest data point to the line/plane/hyperplane: wTx=0 (this is the separating surface) The distance b/w the separating plane and the nearest point is the margin
2 preliminary technicalities:
- normalize w
- we'll want the signal at the nearest point to be 1, so we scale w accordingly: |wTxn| = 1
- We'll pull out w0, the bias term, and call it b. So, the vector w is now only [w1, …, wd]
- So, the equation of the plane becomes: wTx+b = 0
Now, we can compute the distance between the nearest point xn and the plane wTx+b = 0 where |wTxn + b| = 1 So, we have:
Vector w is ⊥ to the plane in the X space (e.g., for the line y=x, the vector [1, -1] is ⊥ to it). Proof: take 2 points x' and x" on the plane. Since they lie on the plane, we have:
wTx' + b = 0 and wTx" + b = 0
So, we have the difference: wT(x’-x”) = 0
This 🔝 means that w vector is ⊥ to (x’-x”) vector Since x’-x” lies on the plane, w is ⊥ to the plane
Now the distance b/w xn and plane is:
- Take any point x on the plane
- Projection of xn - x on w is the required distance of xn from plane
So, the distance is w·(xn−x) divided by the magnitude of w. Also, w·(xn−x) is just wT(xn−x). We take the absolute value, so:
Or:
We can add and subtract b to get:
The 2nd group of terms is 0 since x lies on the plane, and the 1st group was scaled to be 1. So, we have the distance/margin as 1/|w|:
Hence, the resulting optimization problem:
We are maximizing the margin, subject to the scaling of w such that it makes the nearest point be at a distance of 1
Also, instead of maximizing 1/|w|, we equivalently minimize:
So, the problem becomes:
We’ll use quadratic programming to solve this
This is related to the optimization problem we had earlier with Regularization:
We'll formulate the Lagrangian by putting the constraints into the objective (moving the 1 to the other side to get yn(wTxn + b) − 1 ≥ 0 and adding a multiplier αn for each):
We are minimizing this 🔝 w.r.t. w, b and maximizing w.r.t. αn ≥ 0. We take the gradient of the Lagrangian w.r.t. w and b and set it to 0:
So, we get:
Substituting this back into the original Lagrangian:
So, we have:
Which is the same as:
Expanding the formula to see what matrices are passed to QP:
Along with the conditions:
This is passed to the QP package, which gives us the α's back. The objective is a convex function, which quadratic programming handles well.
So, we have our α's: α = [α1, …, αN]. We can get w by: w = Σ αn yn xn, n → 1 to N
The α vector is mostly zeros: for the interior points the constraint holds with slack, so their αn = 0.
This is similar to the case in regularization where the C was so large that we actually could get to wlin and λ was 0 there. For the cases where we actually had to compromise, we had a positive λ, so is the case here with our α-s.
The points xn with αn > 0 are the support vectors. They are the circled ones in the diagram:
The support vectors achieve the margin exactly: yn(wTxn + b) = 1
Also, since the α's for non-support vectors are 0, we can restrict the sum to the support vectors: w = Σ αn yn xn, over SVs only
Since the expansion of w has one parameter per support vector, few support vectors means effectively few parameters, hence a lower dVC and better generalization. So, getting the best (maximum-margin) separator using SVMs has a generalization dividend as well. Also, b is simple to get: solve yn(wTxn + b) = 1 for any support vector.
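A sketch of the whole chain, assuming cvxopt as the QP package (any QP solver would do; the helper name and the tiny ridge added to Q for numerical stability are our own choices):

```python
import numpy as np
from cvxopt import matrix, solvers   # cvxopt is just one common QP choice

def hard_margin_svm(X, y):
    """Solve the dual QP for alpha, then recover w and b.
    X is (N, d); y is (N,) with entries +1/-1."""
    N = X.shape[0]
    yx = y[:, None] * X
    Q = yx @ yx.T + 1e-10 * np.eye(N)                  # Q[n,m] = yn ym xn.xm (+ tiny ridge)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(Q), matrix(-np.ones(N)),   # minimize 1/2 a'Qa - 1'a ...
                     matrix(-np.eye(N)), matrix(np.zeros(N)),                 # ... s.t. a >= 0
                     matrix(y.reshape(1, -1).astype(float)), matrix(0.0))     # and y'a = 0
    alpha = np.ravel(sol['x'])
    w = (alpha * y) @ X                                # w = sum_n alpha_n yn xn
    sv = np.argmax(alpha)                              # any support vector (largest alpha)
    b = y[sv] - w @ X[sv]                              # from yn (w.xn + b) = 1
    return w, b, alpha
```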
Currently, we are talking only about linearly separable case, where all the points are strictly linearly separable. If they aren’t linearly separable, we can apply a non-linear transformation and get them to be linearly separable in that higher dimension space.
Recall, we had:
Here, we are passing the inner products xnTxm to the QP. Each is just an inner product of two d-dimensional vectors, where d is the dimensionality of the x space. So, if we apply a nonlinear transformation to a million-dimensional space, we'll have to take inner products of two million-dimensional vectors, but the optimization problem itself won't get any harder.
The w-s will belong to the z space. The SVs live in the Z space. In the X space, the Z-space hyperplane with the largest margin looks like this say, with the “pre-images” of support vectors highlighted (Support Vectors are defined in the Z-space):
The margins are in the Z space
Here, we have only 4 support vectors, so, the w is only defined by 4 parameters in the Z space. This after going to a million dimensional space! So, good generalization.
Generalization result: E[Eout] ≤ E[# of support vectors] / (N − 1)
Strictly, we need to run several runs and take the expected values on both sides:
So, SVMs are awesome because you get to go to high dimensions without the optimization getting harder and without paying in terms of generalization.
Kernel methods take this one step further, you won’t need to pay for getting the inner product as well. This will enable you to go to infinite dimensional spaces.
SVMs are classifiers with maximum margin. They allow us to go to higher dimensional spaces without paying for generalization and complexity.
This means, we get a complex final hypothesis (g) but our hypothesis set is “simple”
The only thing we need from the Z space is the inner product. If we can get that somehow, we can use the Z space without ever constructing it.
We need z in one more place as well: our final hypothesis. We have: g(x) = sign(wTz + b)
But we know w is:
So, this reduces to inner product as well.
What about the b? That is solvable by taking any SV and using this equation:
So, here as well, we need just the inner product.
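A sketch of the kernelized final hypothesis; `K` is whatever kernel function is being used, and the names are ours:

```python
import numpy as np

def svm_predict(alphas, y, X, b, K, x_new):
    """g(x) = sign( sum over SVs of alpha_n * yn * K(xn, x) + b ):
    only kernel evaluations, never an explicit z vector."""
    s = sum(a * yn * K(xn, x_new)
            for a, yn, xn in zip(alphas, y, X) if a > 1e-8)   # non-SVs contribute nothing
    return np.sign(s + b)
```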
This raises a very interesting possibility. If we can get the inner product of z, we can go to that space – even if we do not know what the z space is – even if z space is infinite dimensional.
The inner product that we are talking about, zTz', let's assume it is given to us by a function. For any 2 points x and x' ∈ X:
We have: zTz' = K(x, x')
where K is the kernel. The kernel will correspond to some Z space.
Example:
Let's take a 2nd order transformation φ. So, the z vector is: z = φ(x) = [1, x1, x2, x1², x2², x1x2]
Hence, K will be:
The trick: is it possible to compute this kernel (evaluate K) without transforming x and x’?
Let's consider K to be (1 + xTx')². This is a function of x and x', and it gives us an inner product in some Z space (we don't yet know which).
K(x, x') = (1 + xTx')² = (1 + x1x1' + x2x2')²
Squaring, we get:
This would look like the inner product of a 2nd order nonlinear transformation, if not for the 2s. It actually still is one, but the x vector is transformed not to (1, x1, x2, x1², x2², x1x2) but to: (1, x1², x2², √2·x1, √2·x2, √2·x1x2)
Hence, K(x, x') = (1 + xTx')² does compute an inner product in a 2nd order Z space, without converting to those coordinates first. If we replace the power 2 with power 100, we get the inner product for a 100th order polynomial transform - without ever having to go to that space.
This is the polynomial kernel
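A small check of this, with our own toy numbers: the kernel evaluated in the X space matches the explicit inner product in the (√2-scaled) Z space.

```python
import numpy as np

def poly2_kernel(x, xp):
    return (1.0 + x @ xp) ** 2            # evaluated entirely in the X space

def phi2(x):
    # the Z-space coordinates this kernel implicitly corresponds to (2-d input)
    x1, x2 = x
    return np.array([1.0, x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, np.sqrt(2) * x1 * x2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose(poly2_kernel(x, xp), phi2(x) @ phi2(xp))   # same number, no trip to Z space
```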