
add Optimi (more fused-backward-pass optimizers and new functions) #1381

Open
sdbds wants to merge 4 commits into dev

Conversation

@sdbds (Contributor) commented Jun 24, 2024

https://optimi.benjaminwarner.dev/

New optimizers:
AdamW, Lion, Ranger, and StableAdamW from optimi

New features:
Low Precision Training with Kahan Summation (enabled automatically when the optimizers above are used with low-precision parameters)
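For reference, a minimal sketch (not this PR's integration code) of instantiating one of these optimizers for low-precision training; the kahan_sum argument name is taken from the optimi documentation:

```python
import torch
from torch import nn
from optimi import StableAdamW

# hold the model in bfloat16; Kahan summation compensates for the
# precision lost when updating low-precision weights
model = nn.Linear(128, 128, dtype=torch.bfloat16)

# kahan_sum defaults to auto-enabling for low-precision parameters;
# it can also be passed explicitly as shown here
optimizer = StableAdamW(model.parameters(), lr=1e-3, weight_decay=1e-2, kahan_sum=True)

loss = model(torch.randn(128, dtype=torch.bfloat16)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```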

Gradient Release (equivalent to the fused backward pass)
Enabled automatically when using the optimizers above.
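A minimal sketch of the gradient release flow, with function and argument names as given in the optimi documentation (prepare_for_gradient_release is the hook-registration helper discussed in the comments below); treat it as a reference, not the exact integration in this PR:

```python
import torch
from torch import nn
from optimi import AdamW, prepare_for_gradient_release, remove_gradient_release

model = nn.Sequential(nn.Linear(128, 256, dtype=torch.bfloat16),
                      nn.Linear(256, 128, dtype=torch.bfloat16))

# gradient_release=True moves the optimizer step into the backward pass,
# so only one layer's gradient needs to be kept in memory at a time
optimizer = AdamW(model.parameters(), lr=1e-3, gradient_release=True)

# register the backward hooks that perform the per-parameter step
prepare_for_gradient_release(model, optimizer)

loss = model(torch.randn(128, dtype=torch.bfloat16)).sum()
loss.backward()  # parameters are updated and gradients freed during backward

# step/zero_grad become effectively no-ops but can stay in the training loop
optimizer.step()
optimizer.zero_grad()

# unregister the hooks when training is finished
remove_gradient_release(model)
```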

Fully Decoupled Weight Decay (weight decay applied independently of the learning rate, similar in spirit to a decoupled lr)
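A hedged sketch, assuming optimi's decouple_lr flag enables this mode:

```python
from torch import nn
from optimi import AdamW

model = nn.Linear(128, 128)

# decouple_lr=True applies weight decay independently of the learning rate
# ("fully decoupled"); note the much smaller weight_decay typically used here
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5, decouple_lr=True)
```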

Optimizer Accumulation
Gradient accumulation reduces training memory by splitting a batch into micro-batches and accumulating micro-batch gradients into the larger batch. Gradient release reduces training memory by limiting gradients to one layer at any given time. Optimizer accumulation unifies these two disparate approaches by accumulating gradients directly into optimizer states while performing gradient release.
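A sketch of optimizer accumulation based on the optimi documentation; the optimizer_accumulation attribute and the stand-in model/dataloader are illustrative, not this PR's code:

```python
import torch
from torch import nn
from optimi import AdamW, prepare_for_gradient_release

model = nn.Linear(128, 128, dtype=torch.bfloat16)
optimizer = AdamW(model.parameters(), lr=1e-3, gradient_release=True)
prepare_for_gradient_release(model, optimizer)

# stand-in for a real dataloader
dataloader = [torch.randn(128, dtype=torch.bfloat16) for _ in range(8)]

accumulation_steps = 4
for step, batch in enumerate(dataloader):
    # while True, the backward hooks accumulate gradients directly into the
    # optimizer states; on the final micro-batch the parameters are updated
    optimizer.optimizer_accumulation = (step + 1) % accumulation_steps != 0

    loss = model(batch).sum()
    loss.backward()

    # effectively no-ops under gradient release
    optimizer.step()
    optimizer.zero_grad()
```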

@sdbds (Contributor, Author) commented Jun 24, 2024

The remaining problem is that VRAM usage keeps increasing after registering the hooks with prepare_for_gradient_release(); I'm not sure of the cause yet...

@feffy380 (Contributor) commented

Some of optimi's features do not support fp16. They should not replace the original optimizers.

@sdbds (Contributor, Author) commented Jul 1, 2024

> Some of optimi's features do not support fp16. They should not replace the original optimizers.

Sure, I will add an fp16 check, thanks.

@feffy380 (Contributor) commented Jul 1, 2024

I'm saying it should not be considered a substitute for the original optimizers. Some features do work with fp16 (like low precision training), but others don't.

You don't need an fp16 check; you need to treat the optimi versions as completely separate optimizers, the same way AdamW8bit is considered separate from AdamW.

People can also just load them with --optimizer_type optimi.Adamw. Only gradient release and optimizer accumulation need special handling.
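For reference, a hedged example of such an invocation (the launcher, script name, and other flags are placeholders for an existing sd-scripts command; only --optimizer_type is the point here):

```
accelerate launch train_network.py ... --optimizer_type optimi.AdamW --learning_rate 1e-4
```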
