- Training data and testing data are sampled from the same distribution (i.i.d.).
- How do we use an algorithm to make $E_{in}$ small enough? (Optimization)
- How do we make sure that a small $E_{in}$ also makes $E_{out}$ small? -> the statistical guarantee (Generalization)
- If our loss surface is CONVEX, the GD algorithm can converge to the global minimum.
- But the loss surface of deep learning is usually NON-CONVEX.
Old theorem
Over-fitting is caused by a high VC dimension (high model complexity).
The VC theorem states that the upper bound on the testing error is:
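A standard statement of this bound (written here from the usual VC generalization result, not copied from the slides): with probability at least $1-\delta$,

$$E_{out}(h) \le E_{in}(h) + \sqrt{\frac{8}{n}\ln\frac{4(2n)^{d}}{\delta}},$$

where $n$ is the number of training samples and $d$ is the VC dimension of the hypothesis set.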
- $n$ reflects data diversity; $n$ should be larger than $d$ so that the VC bound is small.
- If $d \gg n$, the hypothesis set can shatter the data (high VC bound).
Shattering and over-fitting?
But that is not what actually happens in practice.
paper : Rethinking Optimization and Generalization.
We have experiments:
How about the generalization?
Looks good, but why?
We focus on the data:
| data | error | note |
| --- | --- | --- |
| original | small | |
| random | large | |
| noise level | large | for reference |
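A minimal sketch of this kind of random-label experiment (my own illustration, not the paper's exact setup; it uses scikit-learn's digits dataset and `MLPClassifier` for convenience):

```python
# Fit the same over-parameterized MLP on true labels vs. shuffled labels and
# compare train/test accuracy. Illustration only, not the paper's exact setup.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
y_random = rng.permutation(y_tr)  # destroy the relation between x and y

for name, labels in [("original", y_tr), ("random", y_random)]:
    clf = MLPClassifier(hidden_layer_sizes=(512,), max_iter=2000, random_state=0)
    clf.fit(X_tr, labels)
    print(name,
          "train acc:", clf.score(X_tr, labels),  # both can memorize the training set
          "test acc:", clf.score(X_te, y_te))     # only the original labels generalize
```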
Conclusion:
Noise hurts deep NN models (actually, it hurts all ML models...).
The equation looks like the VC bound (it is the PAC-Bayesian bound).
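For reference, one common (McAllester-style) form of the PAC-Bayesian bound, not copied from the slides and stated up to small constant/log factors: with probability at least $1-\delta$, for any posterior $Q$ over hypotheses and a fixed prior $P$,

$$\mathbb{E}_{h\sim Q}\big[E_{out}(h)\big] \le \mathbb{E}_{h\sim Q}\big[E_{in}(h)\big] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{n}{\delta}}{2(n-1)}}.$$

The $\mathrm{KL}(Q\,\|\,P)$ term grows when the trained parameters move far from the prior (e.g., the initialization).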
Conclusion: if the parameters change a lot during training, you might be over-fitting the data.
Otherwise, you are doing well.
Experiments?
- Parameters x 2 -> VC bound x 2 -> $E_{out}$ stays the same.
- Add noise (random labels) -> VC bound stays the same -> $E_{out}$ explodes! (PAC bound x 13)
- The PAC bound matches the results =)
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
- Data: normalized data (both $x$ and $y$)
- Model: a two-layer ReLU NN
- Loss: squared loss
Optimization:
Do some math tricks; taking the derivative of the ReLU, we get an indicator function.
What do we get?
The gradient has a meaningful structure, so we introduce the Gram matrix.
Its entries compare two data points fed into the model at iteration $k$.
- The $\infty$ superscript means the width of the layer goes to infinity.
After some math tricks (taking the expectation over the random initialization), we know its value.
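Concretely (as I recall from the paper and standard NTK derivations; the notation may differ slightly), for a width-$m$ two-layer ReLU net with first-layer weights $w_r(k)$ at iteration $k$,

$$H_{ij}(k) = \frac{1}{m}\sum_{r=1}^{m} x_i^{\top}x_j\,\mathbb{1}\{w_r(k)^{\top}x_i \ge 0,\; w_r(k)^{\top}x_j \ge 0\},$$

and in the infinite-width limit

$$H^{\infty}_{ij} = \mathbb{E}_{w\sim\mathcal{N}(0,I)}\big[x_i^{\top}x_j\,\mathbb{1}\{w^{\top}x_i \ge 0,\; w^{\top}x_j \ge 0\}\big] = \frac{x_i^{\top}x_j\,\big(\pi - \arccos(x_i^{\top}x_j)\big)}{2\pi},$$

where the closed form holds for unit-norm inputs.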
The Gram matrix can be eigen-decomposed (which gives us a lot of physical meaning).
Since we can decompose the Gram matrix, we can find representative virtual data that represent all the data (depending on the eigenvalues).
The Gram matrix is the Neural Tangent Kernel of a two-layer ReLU NN.
This kernel can resolve the puzzle left by the VC bound and the PAC bound!
If you use a large enough width, then as the iteration $k$ grows, you will reach the global minimum.
It shows that even though the loss surface is non-convex, with over-parameterization gradient descent still reaches a global minimum.
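The convergence result is roughly of the form (up to the paper's exact conditions on the width and the learning rate $\eta$):

$$\|y - u(k)\|_2^{2} \le \Big(1 - \tfrac{\eta\,\lambda_{0}}{2}\Big)^{k}\,\|y - u(0)\|_2^{2}, \qquad \lambda_{0} = \lambda_{\min}(H^{\infty}) > 0,$$

where $u(k)$ is the vector of network predictions on the training set at iteration $k$. Since $\lambda_{0} > 0$, the training loss goes to zero even though the loss surface is non-convex.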
They use function space to analyze it.
Functions can be treated as vectors, and functions can be orthogonal to each other.
The math tricks do their work, XD.
Conclusion:
- If the labels mostly align with eigenvectors of the Gram matrix that have large eigenvalues, the model converges faster (the data is easier to learn).
- Convergence is slower when the projections onto the eigenvectors are more uniform-like (see the decomposition sketched below).
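Writing $H^{\infty} = \sum_{i} \lambda_{i} v_{i} v_{i}^{\top}$, the training error decays (roughly, in the wide-network regime) as

$$\|y - u(k)\|_2 \approx \sqrt{\sum_{i=1}^{n}\big(1 - \eta\lambda_{i}\big)^{2k}\,\big(v_{i}^{\top}y\big)^{2}},$$

so the components of $y$ along large-eigenvalue directions vanish quickly, while components spread uniformly over small-eigenvalue directions converge slowly.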
It is all about the data, right :P
- true labels
- noisy labels: somebody labeled the data incorrectly.
The red line is the true labels (a mistake on the slides).
Worst case?
How to train your model faster!?
Conclusion:
- Clean data helps convergence!!!!
- Instead of tuning the optimization method, cleaning your data helps!
Given a dataset, we can calculate the Gram matrix; then we know the convergence rate and whether there is any representative virtual data (see the sketch below).
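A minimal sketch (my own, using the $H^{\infty}$ formula assumed above) of computing the Gram matrix from a dataset and inspecting the quantities that drive convergence and generalization:

```python
# Compute H^infinity for a two-layer ReLU net and inspect its spectrum and
# the label-alignment quantity y^T (H^inf)^{-1} y. Sketch only.
import numpy as np

def ntk_gram(X):
    """H^inf_ij = x_i.x_j * (pi - arccos(x_i.x_j)) / (2*pi), for unit-norm rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # the analysis assumes ||x_i|| = 1
    S = np.clip(Xn @ Xn.T, -1.0, 1.0)                  # pairwise cosine similarities
    return S * (np.pi - np.arccos(S)) / (2 * np.pi)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # toy data
y = np.sign(X[:, 0])                # a clean, easy-to-separate labeling

H = ntk_gram(X)
eigvals, eigvecs = np.linalg.eigh(H)    # spectrum -> per-direction convergence speed
alignment = y @ np.linalg.solve(H, y)   # y^T (H^inf)^{-1} y -> generalization bound term
print("smallest eigenvalue:", eigvals[0])
print("y^T (H^inf)^{-1} y :", alignment)
```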
What is the bound?
The left-hand side is the test error minus the in-sample error (the generalization gap).
What did we get?
- Still, we need more data $n$.
- The claim that deep learning needs a lot of data due to model complexity is wrong - it is over-parameterization (the bound does not depend on the number of parameters)!
- The Gram matrix shows that having the same label with similar $x$ helps convergence and gives a better result.
- Notice that $y^{T}(H^{\infty})^{-1}y$ shows that clean, separable data makes generalization better (see the bound sketched below).
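As I recall, the dominant term of the bound in Theorem 5.1 is roughly

$$L_{\mathcal{D}}(f) \;\lesssim\; \sqrt{\frac{2\,y^{\top}(H^{\infty})^{-1}y}{n}} \;+\; O\!\left(\sqrt{\frac{\log\frac{n}{\lambda_{0}\delta}}{n}}\right),$$

so a small $y^{\top}(H^{\infty})^{-1}y$ (labels aligned with the top eigenvectors of $H^{\infty}$) directly gives a small population loss.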
Experiments show the same guarantee!
This theorem looks good so far!
Theorem 5.1: focuses on the data.
VC bound: focuses on the number of data points and model complexity.
PAC-Bayesian bound: focuses on the model and the data, but it feels like cheating (you can only compute it after you have trained the model...).
- Implementation is very important!
- Deep learning so far is magic: there is no full theoretical guarantee!
- Deep learning theory (optimization and generalization) is still growing.
- You should believe the results of your experiments.
- How to read a math-heavy paper?
  - Write down each symbol and make sure you know what it is.
  - Work through a concrete example for any formula you don't understand.
- Is some data augmentation technique useless? - Yes: you need to make your data fit your testing data distribution.
- Are there tricks to make data labeling good while using less manpower? - Yes, there is some research on that.
- Rethinking Optimization and Generalization
- Generalization curve of the Over-Parameterized model
- PAC-Bayesian Bound
- Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
- Neural Tangent Kernel: Convergence and Generalization in Neural Networks