- Training data and testing data are sampled from the same distribution (i.i.d.).
- How do we use an algorithm to make $E_{in}$ small enough? (Optimization)
- How do we make sure that a small $E_{in}$ also makes $E_{out}$ small? -> the statistical guarantee (Generalization)
- If our loss surface is CONVEX, the GD algorithm can converge to the global minimum.
- But the loss surface of deep learning is usually NON-CONVEX.
Old theorem
Over-fitting is caused by a high VC dimension (high model complexity).
The VC theorem states that the upper bound on the testing error is:
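A standard statement of this bound (written here from the usual VC generalization result, not copied from the slides): with probability at least $1-\delta$,

$$E_{out}(h) \le E_{in}(h) + \sqrt{\frac{8}{n}\ln\frac{4(2n)^{d}}{\delta}},$$

where $n$ is the number of training samples and $d$ is the VC dimension of the hypothesis set.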
- $n$ reflects data diversity; $n$ should be larger than $d$ so that the VC bound is small.
- If $d \gg n$, the hypothesis set can shatter the data (high VC bound).
Shattering and over-fitting?
But that is not what actually happens in practice.
paper : Rethinking Optimization and Generalization.
We have experiments:
How about the generalization?
Looks good, but why?
We focus on the data:
| data | error | note |
| --- | --- | --- |
| original | small | |
| random | large | |
| noise level | large | for reference |
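A minimal sketch of this kind of random-label experiment (my own illustration, not the paper's exact setup; it uses scikit-learn's digits dataset and `MLPClassifier` for convenience):

```python
# Fit the same over-parameterized MLP on true labels vs. shuffled labels and
# compare train/test accuracy. Illustration only, not the paper's exact setup.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
y_random = rng.permutation(y_tr)  # destroy the relation between x and y

for name, labels in [("original", y_tr), ("random", y_random)]:
    clf = MLPClassifier(hidden_layer_sizes=(512,), max_iter=2000, random_state=0)
    clf.fit(X_tr, labels)
    print(name,
          "train acc:", clf.score(X_tr, labels),  # both can memorize the training set
          "test acc:", clf.score(X_te, y_te))     # only the original labels generalize
```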
Conclusion:
Noise hurts deep NN models (actually, it hurts all ML models...).
The equation looks like the VC bound (it is the PAC-Bayesian bound).
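For reference, one common (McAllester-style) form of the PAC-Bayesian bound, not copied from the slides and stated up to small constant/log factors: with probability at least $1-\delta$, for any posterior $Q$ over hypotheses and a fixed prior $P$,

$$\mathbb{E}_{h\sim Q}\big[E_{out}(h)\big] \le \mathbb{E}_{h\sim Q}\big[E_{in}(h)\big] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{n}{\delta}}{2(n-1)}}.$$

The $\mathrm{KL}(Q\,\|\,P)$ term grows when the trained parameters move far from the prior (e.g., the initialization).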
Conclusion: if the parameters change a lot during training, you might be over-fitting the data.
Otherwise, you are doing well.
Experiments?
- Parameters x 2 -> VC bound x 2 -> $E_{out}$ stays the same.
- Add noise (random labels) -> VC bound stays the same -> $E_{out}$ explodes! (PAC bound x 13)
- The PAC bound matches the results =)
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
- Data: normalized data (both $x$ and $y$)
- Model: a two-layer ReLU NN
- Loss: squared loss
Optimization:
Do some math tricks; taking the derivative of the ReLU, we get an indicator function.
What do we get?
The gradient has a meaningful structure, so we introduce the Gram matrix.
Its entries compare two data points fed into the model at iteration $k$.
- The $\infty$ superscript means the width of the layer goes to infinity.
After some math tricks (taking the expectation over the random initialization), we know its value.
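Concretely (as I recall from the paper and standard NTK derivations; the notation may differ slightly), for a width-$m$ two-layer ReLU net with first-layer weights $w_r(k)$ at iteration $k$,

$$H_{ij}(k) = \frac{1}{m}\sum_{r=1}^{m} x_i^{\top}x_j\,\mathbb{1}\{w_r(k)^{\top}x_i \ge 0,\; w_r(k)^{\top}x_j \ge 0\},$$

and in the infinite-width limit

$$H^{\infty}_{ij} = \mathbb{E}_{w\sim\mathcal{N}(0,I)}\big[x_i^{\top}x_j\,\mathbb{1}\{w^{\top}x_i \ge 0,\; w^{\top}x_j \ge 0\}\big] = \frac{x_i^{\top}x_j\,\big(\pi - \arccos(x_i^{\top}x_j)\big)}{2\pi},$$

where the closed form holds for unit-norm inputs.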
The Gram matrix can be eigen-decomposed (which gives us a lot of physical meaning).
Since we can decompose the Gram matrix, we can find representative virtual data that represent all the data (depending on the eigenvalues).
The Gram matrix is the Neural Tangent Kernel of a two-layer ReLU NN.
This kernel can resolve the puzzle left by the VC bound and the PAC bound!
If you use a large enough width, then as the iteration $k$ grows, you will reach the global minimum.
It shows that even though the loss surface is non-convex, with over-parameterization gradient descent still reaches a global minimum.
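The convergence result is roughly of the form (up to the paper's exact conditions on the width and the learning rate $\eta$):

$$\|y - u(k)\|_2^{2} \le \Big(1 - \tfrac{\eta\,\lambda_{0}}{2}\Big)^{k}\,\|y - u(0)\|_2^{2}, \qquad \lambda_{0} = \lambda_{\min}(H^{\infty}) > 0,$$

where $u(k)$ is the vector of network predictions on the training set at iteration $k$. Since $\lambda_{0} > 0$, the training loss goes to zero even though the loss surface is non-convex.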
They use function space to analyze it.
Functions can be treated as vectors, and functions can be orthogonal to each other.
The math tricks do their work, XD.
Conclusion:
- If the labels mostly align with eigenvectors of the Gram matrix that have large eigenvalues, the model converges faster (the data is easier to learn).
- Convergence is slower when the projections onto the eigenvectors are more uniform-like (see the decomposition sketched below).
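Writing $H^{\infty} = \sum_{i} \lambda_{i} v_{i} v_{i}^{\top}$, the training error decays (roughly, in the wide-network regime) as

$$\|y - u(k)\|_2 \approx \sqrt{\sum_{i=1}^{n}\big(1 - \eta\lambda_{i}\big)^{2k}\,\big(v_{i}^{\top}y\big)^{2}},$$

so the components of $y$ along large-eigenvalue directions vanish quickly, while components spread uniformly over small-eigenvalue directions converge slowly.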
It is all about the data, right :P
- true labels
- noisy labels: somebody labeled the data incorrectly.
The red line is the true labels (a mistake on the slides).
Worst case?
How to train your model faster!?
Conclusion:
- Clean data helps convergence!!!!
- Instead of tuning the optimization method, cleaning your data helps!
Given a dataset, we can calculate the Gram matrix; then we know the convergence rate and whether there is any representative virtual data (see the sketch below).
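A minimal sketch (my own, using the $H^{\infty}$ formula assumed above) of computing the Gram matrix from a dataset and inspecting the quantities that drive convergence and generalization:

```python
# Compute H^infinity for a two-layer ReLU net and inspect its spectrum and
# the label-alignment quantity y^T (H^inf)^{-1} y. Sketch only.
import numpy as np

def ntk_gram(X):
    """H^inf_ij = x_i.x_j * (pi - arccos(x_i.x_j)) / (2*pi), for unit-norm rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # the analysis assumes ||x_i|| = 1
    S = np.clip(Xn @ Xn.T, -1.0, 1.0)                  # pairwise cosine similarities
    return S * (np.pi - np.arccos(S)) / (2 * np.pi)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # toy data
y = np.sign(X[:, 0])                # a clean, easy-to-separate labeling

H = ntk_gram(X)
eigvals, eigvecs = np.linalg.eigh(H)    # spectrum -> per-direction convergence speed
alignment = y @ np.linalg.solve(H, y)   # y^T (H^inf)^{-1} y -> generalization bound term
print("smallest eigenvalue:", eigvals[0])
print("y^T (H^inf)^{-1} y :", alignment)
```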
What is the bound?
The left-hand side is the test error minus the in-sample error (the generalization gap).
What did we get?
- Still, we need more data $n$.
- The claim that deep learning needs a lot of data due to model complexity is wrong - it is over-parameterization (the bound does not depend on the number of parameters)!
- The Gram matrix shows that having the same label with similar $x$ helps convergence and gives a better result.
- Notice that $y^{T}(H^{\infty})^{-1}y$ shows that clean, separable data makes generalization better (see the bound sketched below).
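As I recall, the dominant term of the bound in Theorem 5.1 is roughly

$$L_{\mathcal{D}}(f) \;\lesssim\; \sqrt{\frac{2\,y^{\top}(H^{\infty})^{-1}y}{n}} \;+\; O\!\left(\sqrt{\frac{\log\frac{n}{\lambda_{0}\delta}}{n}}\right),$$

so a small $y^{\top}(H^{\infty})^{-1}y$ (labels aligned with the top eigenvectors of $H^{\infty}$) directly gives a small population loss.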
Experiments show the same guarantee!
This theorem looks good so far!
Theorem 5.1: focuses on the data.
VC bound: focuses on the number of data points and model complexity.
PAC-Bayesian bound: focuses on the model and the data, but it feels like cheating (you can only compute it after you have trained the model...).
- Implementation is very important!
- Deep learning so far is magic: there is no full theoretical guarantee!
- Deep learning theory (optimization and generalization) is still growing.
- You should believe the results of your experiments.
- How to read a math-heavy paper?
  - Write down each symbol and make sure you know what it is.
  - Work through a concrete example for any formula you don't understand.
- Is some data augmentation technique useless? - Yes: you need to make your data fit your testing data distribution.
- Are there tricks to make data labeling good while using less manpower? - Yes, there is some research on that.
- Rethinking Optimization and Generalization
- Generalization curve of the Over-Parameterized model
- PAC-Bayesian Bound
- Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
- Neural Tangent Kernel: Convergence and Generalization in Neural Networks