
I found that both LLaMA and MAE used a smaller beta2 in the AdamW optimizer during pre-training. Is there any intuition behind such a setting? #184

Open
Novestars opened this issue Nov 24, 2023 · 1 comment

Comments

@Novestars

No description provided.

@alexlioralexli

AdamW divides the update by the square root of its estimate of the gradient's second moment. If that estimate is out of date, it can lead to exploding updates (if the estimate is too small) or slow learning (if the estimate is too large). Decreasing beta2 from 0.999 to 0.95 helps address this by keeping the running estimate closer to the current gradient scale.
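
To make the effect concrete, here is a minimal sketch (not the LLaMA or MAE training code) of the second-moment update that AdamW maintains. With beta2 = 0.999 the exponential average spans roughly 1/(1 − beta2) ≈ 1000 steps, while beta2 = 0.95 spans only about 20 steps, so the estimate tracks a sudden change in gradient scale much faster. The gradient sequence below is made up purely for illustration.

```python
import torch

def adamw_second_moment(grads, beta2):
    """Running, bias-corrected second-moment estimate after each step (Adam/AdamW style)."""
    v = torch.zeros_like(grads[0])
    estimates = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g   # exponential moving average of g^2
        v_hat = v / (1 - beta2 ** t)          # bias correction as in Adam/AdamW
        estimates.append(v_hat.clone())
    return estimates

# Gradients whose scale suddenly drops: the slow estimate (beta2=0.999) stays
# large and suppresses the update; the fast estimate (beta2=0.95) adapts
# within a few steps.
grads = [torch.full((1,), 10.0)] * 100 + [torch.full((1,), 0.1)] * 20
for beta2 in (0.999, 0.95):
    v_hat = adamw_second_moment(grads, beta2)[-1]
    print(f"beta2={beta2}: final second-moment estimate = {v_hat.item():.4f}")
```

In PyTorch this corresponds to passing `betas=(0.9, 0.95)` to `torch.optim.AdamW` instead of the default `(0.9, 0.999)`.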
