In Chapter 10, after the square root example, the text reads:

> Overall, the impact is the learning rates for parameters with smaller gradients are decreased slowly, while the parameters with larger gradients have their learning rates decreased faster.
I am confused by this; we are not updating the learning rate anywhere (other than with rate decay). Yes, the weights will be updated faster for parameters with bigger gradients, but much more slowly than they would have been if no normalization were used (see the sketch below).

Am I correct, or am I understanding it incorrectly?
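A minimal sketch may help make the "effective learning rate" idea concrete. It assumes the AdaGrad-style update from the chapter (a cache accumulating squared gradients, with the step divided by the cache's square root); the variable names and the constant gradients are illustrative, not the book's actual code.

```python
import numpy as np

# AdaGrad-style update, as described around the square root example:
#   cache += gradient**2
#   param -= lr * gradient / (sqrt(cache) + eps)
# The per-parameter "effective" learning rate is lr / (sqrt(cache) + eps),
# which shrinks as squared gradients accumulate in the cache.

lr = 1.0
eps = 1e-7

params = np.array([0.0, 0.0])
cache = np.zeros_like(params)

# Hypothetical constant gradients: one small, one large.
gradients = np.array([0.1, 10.0])

for step in range(3):
    cache += gradients ** 2
    effective_lr = lr / (np.sqrt(cache) + eps)
    params -= effective_lr * gradients
    print(f"step {step}: effective lr per parameter = {effective_lr}")
```

With a larger accumulated squared gradient, the second parameter's effective learning rate drops faster, even though the global `lr` itself is never changed except by rate decay.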