
I found that both LLaMA and MAE used a smaller beta2 in the AdamW optimizer during pre-training. Is there any intuition behind such a setting? #184

Open
Novestars opened this issue Nov 24, 2023 · 1 comment

Comments

@Novestars

No description provided.

@alexlioralexli

AdamW divides the update by the square root of its estimate of the gradient's second moment. If that estimate is out of date, it can lead to exploding updates (if the estimate is too small) or slow learning (if the estimate is too large). Decreasing beta2 from 0.999 to 0.95 helps address this by keeping the running estimate closer to the current gradient scale.
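
To make the effect concrete, here is a minimal sketch (not the LLaMA or MAE training code) of the second-moment update that AdamW maintains. With beta2 = 0.999 the exponential average spans roughly 1/(1 − beta2) ≈ 1000 steps, while beta2 = 0.95 spans only about 20 steps, so the estimate tracks a sudden change in gradient scale much faster. The gradient sequence below is made up purely for illustration.

```python
import torch

def adamw_second_moment(grads, beta2):
    """Running, bias-corrected second-moment estimate after each step (Adam/AdamW style)."""
    v = torch.zeros_like(grads[0])
    estimates = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g   # exponential moving average of g^2
        v_hat = v / (1 - beta2 ** t)          # bias correction as in Adam/AdamW
        estimates.append(v_hat.clone())
    return estimates

# Gradients whose scale suddenly drops: the slow estimate (beta2=0.999) stays
# large and suppresses the update; the fast estimate (beta2=0.95) adapts
# within a few steps.
grads = [torch.full((1,), 10.0)] * 100 + [torch.full((1,), 0.1)] * 20
for beta2 in (0.999, 0.95):
    v_hat = adamw_second_moment(grads, beta2)[-1]
    print(f"beta2={beta2}: final second-moment estimate = {v_hat.item():.4f}")
```

In PyTorch this corresponds to passing `betas=(0.9, 0.95)` to `torch.optim.AdamW` instead of the default `(0.9, 0.999)`.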
