AdamE optimizer with decoupled L1 and L2 regularization #1314
Conversation
Dear @vincenzo-scotti, thanks a lot for your contribution. The code looks really nice and we can see you put a lot of attention to detail, which is very appreciated! Tim and I discussed this briefly and had some difficulty locating the relevant scientific papers for it; you also didn't point to experimental evidence from others, or experiments of your own, to put this in perspective. Could you elaborate a bit more on these aspects? On the code side everything looks really clean, but no tests are present. Maybe take a look at the other optimizer tests, and perhaps @matthewdouglas has a few suggestions on this front? I would say we need a simple integration test that shows that learning happens, just something that validates that it works and makes us immediately aware if a code change introduces a regression. Wdyt?
Dear @Titus-von-Koeller, thank you very much for your feedback. To my knowledge, there is no paper to cite. I agree that some proper integration testing is needed; I will write some tests based on those you have for the other optimizers. Thank you again for your response.
Thanks @vincenzo-scotti, perfect, and thanks again for your contribution and for explaining the background a bit more. Looking forward to further info once you're ready. Please also consider evaluating your creation against other optimizers experimentally and sharing a bit about that. With BNB, all code implies maintenance cost over time, which can really add up. When merging new code, we want to be sure we have a good understanding of the value it adds, and an experimental evaluation against other approaches would really help gauge that. I imagine you would need that for the paper as well. Let us know if you need further input from us; no question is a stupid question, and we can always escalate tricky questions to Tim Dettmers. We're happy to support where we can and appreciate your initiative!
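As a rough illustration of the kind of integration test discussed above, a minimal sketch is shown below. It is not taken from the repository's test suite; the test name, the toy regression task, the hyperparameters, and the loss threshold are all illustrative, and the `AdamE` constructor signature is assumed to mirror the other Adam variants introduced in this PR.

```python
import pytest
import torch
import bitsandbytes as bnb


@pytest.mark.skipif(not torch.cuda.is_available(), reason="bitsandbytes optimizers require CUDA")
def test_adame_learns_tiny_regression():
    """Smoke test: AdamE should reduce the loss on a small least-squares problem."""
    torch.manual_seed(0)
    model = torch.nn.Linear(8, 1).cuda()
    x = torch.randn(64, 8, device="cuda")
    y = x @ torch.randn(8, 1, device="cuda")

    # `lasso` and `weight_decay` are the decoupled L1/L2 strengths introduced by this PR;
    # the remaining arguments follow the usual Adam conventions.
    opt = bnb.optim.AdamE(model.parameters(), lr=1e-2, lasso=1e-4, weight_decay=1e-4)

    initial_loss = torch.nn.functional.mse_loss(model(x), y).item()
    for _ in range(200):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

    # Learning happened if the final loss dropped well below the initial one.
    assert loss.item() < 0.5 * initial_loss
```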
Summary
This pull request introduces a new variant of the Adam optimizer called AdamE, where the E stands for Elastic Net.
AdamE incorporates both L1 (Lasso) and L2 (weight decay) regularization in decoupled form, with independent parameters so the two regularization strengths can be set separately.
This addition enhances the regularization capabilities of the optimizer, supporting different configurations for different parameter groups.
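For context, "decoupled" means the regularization terms act directly on the weights alongside the Adam update rather than being folded into the gradient (as AdamW does for the L2 term). The sketch below illustrates one plausible form of such an update in plain PyTorch; the sign-based L1 shrinkage and the exact ordering of the terms are assumptions for illustration, not necessarily the formulation used in this PR.

```python
import torch


def adame_step_reference(p, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                         eps=1e-8, lasso=0.0, weight_decay=0.0):
    """One AdamE-style update on a single tensor, written out for clarity.

    m and v are the running first/second moment estimates (same shape as p);
    they are updated in place and the new parameter value is returned.
    """
    beta1, beta2 = betas

    # Standard Adam moment updates with bias correction.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)

    # Adam step computed from the raw gradient only.
    p = p - lr * m_hat / (v_hat.sqrt() + eps)

    # Decoupled regularization: applied directly to the weights,
    # scaled by the learning rate, independently of the gradient.
    p = p - lr * weight_decay * p          # L2 (AdamW-style weight decay)
    p = p - lr * lasso * torch.sign(p)     # L1 (lasso) shrinkage -- assumed form
    return p
```

Setting `lasso` or `weight_decay` to 0 recovers the corresponding part of the plain Adam/AdamW update, which matches the defaults described in the implementation details below.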
Features and Changes
Implementation Details
- Added `bitsandbytes/optim/adame.py`. Here is the list of new optimizers that are now part of the `bitsandbytes/optim` module (a usage sketch follows this list):
  - `AdamE`
  - `AdamE8bit`
  - `AdamE32bit`
  - `PagedAdamE`
  - `PagedAdamE8bit`
  - `PagedAdamE32bit`
- Added the new parameter `lasso` for L1 regularization (the parameter defaults to 0 to avoid changing the behaviour of pre-existing optimizers) among the defaults of the following classes of `bitsandbytes/optim/optimizer.py`:
  - `Optimizer8bit`
  - `Optimizer2State` (parameter is also in the class constructor)
  - `Optimizer1State` (parameter is also in the class constructor)
- Added the new parameter `lasso` for L1 regularization (again defaulting to 0 to avoid changing the behaviour of pre-existing optimizers) among the parameters of the following optimizer update functions in `bitsandbytes/functional.py`:
  - `optimizer_update_32bit`
  - `optimizer_update_8bit`
  - `optimizer_update_8bit_blockwise`
- Updated the CUDA sources in `csrc`; in particular, the following files have been modified to introduce the new parameter `lasso` for L1 regularization:
  - `csrc/kernels.cu`
  - `csrc/kernels.cuh`
  - `csrc/ops.cu`
  - `csrc/ops.cuh`
  - `csrc/pythonInterface.cpp`
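As a usage sketch of the new classes (assuming a CUDA device; the hyperparameter values are illustrative, and the constructor signature beyond `lasso` and `weight_decay` is assumed to mirror the existing Adam variants):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# The decoupled regularization strengths are set independently:
# `lasso` controls the L1 term, `weight_decay` the L2 term.
optimizer = bnb.optim.PagedAdamE8bit(
    model.parameters(),
    lr=1e-3,
    lasso=1e-5,
    weight_decay=1e-2,
)

loss = model(torch.randn(16, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Because `lasso` defaults to 0, existing configurations are unaffected, and different regularization strengths can be assigned to different parameter groups through the usual parameter-group dictionaries inherited from `torch.optim.Optimizer`.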
Motivation
The addition of the AdamE optimizer provides a more robust and flexible approach to regularization, particularly in deep learning models with complex structures and varying regularization requirements.
By allowing decoupled L1 and L2 regularization, this optimizer can help prevent overfitting and improve model generalization.
In particular, this Adam variant introduces L1 regularization, which wasn't properly supported before.
Moreover, the fine-grained control provided by this implementation gives full freedom over the strength and type of regularization applied, without impacting usability.
Request for Feedback
I would appreciate any feedback on the implementation and documentation of the AdamE optimizer.
Any suggestions for further improvements or additional tests are welcome.
Thank you for considering this pull request.