Hello, the paper only addresses the case where step decay is used as the learning rate schedule. What if cosine decay or another non-linear schedule is used? Are there experiments with these schedules? Thanks!
In the paper, we only experimented on networks trained with step decay, with linear learning rate warmup at the beginning of training. All three re-training techniques compared in the paper could still be applied with a non-linear schedule. It would definitely be interesting to compare the techniques on networks with other schedules: I suspect that the same findings will hold, though of course it's always possible that a non-linear schedule would change things.
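For anyone who wants to try this, here is a minimal sketch of the two kinds of schedule being discussed, written as plain functions of the epoch index so that either one can be plugged into a training loop. The function names and every hyperparameter value (`base_lr`, `warmup_epochs`, `milestones`, `gamma`, `total_epochs`) are illustrative assumptions, not the settings used in the paper.

```python
import math


def step_decay_with_warmup(epoch, base_lr=0.1, warmup_epochs=5,
                           milestones=(30, 60), gamma=0.1):
    """Linear warmup, then multiply the rate by `gamma` at each milestone.

    All default values are hypothetical, chosen only for illustration.
    """
    if epoch < warmup_epochs:
        # Ramp linearly up to base_lr over the warmup epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    # Count how many milestones have already been reached.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


def cosine_decay_with_warmup(epoch, total_epochs=90, base_lr=0.1,
                             warmup_epochs=5):
    """Linear warmup, then cosine decay from base_lr towards zero.

    All default values are hypothetical, chosen only for illustration.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup training that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print both schedules side by side at a few epochs.
    for epoch in (0, 4, 5, 29, 30, 60, 89):
        print(epoch,
              step_decay_with_warmup(epoch),
              cosine_decay_with_warmup(epoch))
```

Because both functions map an epoch to a learning rate, any re-training technique that replays part of the original schedule can be run unchanged against either one. Whether the findings carry over to the cosine case is exactly the open question above.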