
AdamE optimizer with decoupled L1 and L2 regularization #1314

Draft
vincenzo-scotti wants to merge 8 commits into main

Conversation


@vincenzo-scotti commented on Aug 10, 2024

Summary

This pull request introduces a new variant of the Adam optimizer called AdamE, where the E stands for Elastic Net.
AdamE incorporates decoupled L1 (lasso) and L2 (weight decay) regularization with independent coefficients, allowing flexible settings.
This addition extends the regularization capabilities of the optimizer and supports different configurations for different parameter groups.

Features and Changes

  • Decoupled L1 and L2 Regularization: The AdamE optimizer applies both L1 and L2 regularization, as in Elastic Net, through separate coefficients. The regularization is decoupled from the Adam weight-update component, as already happens in the AdamW optimizer for L2 regularization (see the update-rule sketch after this list).
    • Independent regularization coefficients: AdamE allows independent adjustment of the L1 and L2 regularization coefficients, providing greater control over the regularization of model parameters; setting either coefficient to 0 disables that term, so only the other regularization is applied.
    • Support for parameter groups: Different parameter groups can have specific L1 and L2 values, enabling targeted regularization strategies.
    • Support for different formats: AdamE comes in different flavours, including the base version, the 8-bit variant, and the paged variant.
  • Compatibility with Existing Optimizers: The shared optimizer base classes, the weight-update functions, and the CUDA components have been extended to include lasso regularization without modifying the interfaces of the pre-existing optimizers.
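For illustration, here is a minimal PyTorch sketch of what a decoupled elastic-net update step looks like conceptually. It is not the PR's actual kernel code: the coefficient name lasso follows the PR description, while the exact order of operations and bias-correction details are assumptions.

```python
import torch

@torch.no_grad()
def adame_step_sketch(p, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                      eps=1e-8, weight_decay=1e-2, lasso=1e-4):
    """One AdamE-style update on a single tensor (reference sketch only)."""
    beta1, beta2 = betas
    # Standard Adam moment estimates; the regularization terms are NOT folded
    # into the gradient, which is what "decoupled" means here.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Adam update direction.
    p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
    # Decoupled L2 (weight decay), as in AdamW.
    p.mul_(1 - lr * weight_decay)
    # Decoupled L1 (lasso): constant shrinkage of each weight toward zero.
    p.add_(p.sign(), alpha=-lr * lasso)
    return p
```

In this sketch, setting lasso=0 recovers the usual AdamW-style decoupled weight decay, while setting weight_decay=0 yields a purely L1-regularized decoupled Adam update.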

Implementation Details

  • Added new optimizer classes in bitsandbytes/optim/adame.py. The following new optimizers are now part of the bitsandbytes/optim module (a usage sketch follows this list):
    • AdamE
    • AdamE8bit
    • AdamE32bit
    • PagedAdamE
    • PagedAdamE8bit
    • PagedAdamE32bit
  • Introduced a new parameter, lasso, for L1 regularization (defaulting to 0 to avoid changing the behaviour of pre-existing optimizers) among the defaults of the following classes in bitsandbytes/optim/optimizer.py:
    • Optimizer8bit
    • Optimizer2State (parameter is also in the class constructor)
    • Optimizer1State (parameter is also in the class constructor)
  • Introduced the same lasso parameter for L1 regularization (again defaulting to 0 to preserve the behaviour of pre-existing optimizers) among the parameters of the following optimizer update functions in bitsandbytes/functional.py:
    • optimizer_update_32bit
    • optimizer_update_8bit
    • optimizer_update_8bit_blockwise
  • Updated the documentation to cover all versions of the new optimizer and to explain the new parameter in the low-level optimizer classes and the update functions.
  • Updated all CUDA kernels and interfaces inside csrc accordingly; in particular, the following files were modified to introduce the new lasso parameter for L1 regularization:
    • csrc/kernels.cu
    • csrc/kernels.cuh
    • csrc/ops.cu
    • csrc/ops.cuh
    • csrc/pythonInterface.cpp
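As a usage illustration, the new classes would presumably be instantiated like the existing bitsandbytes optimizers. The lasso keyword and per-parameter-group support come from the PR description; the exact constructor signature sketched here is an assumption, not the confirmed API.

```python
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Hypothetical usage: independent L1 ("lasso") and L2 ("weight_decay")
# coefficients, set per parameter group. Keyword names follow the PR
# description; the full signature is assumed, not confirmed.
optimizer = bnb.optim.AdamE(
    [
        {"params": model[0].parameters(), "lasso": 1e-4, "weight_decay": 1e-2},
        {"params": model[2].parameters(), "lasso": 0.0, "weight_decay": 0.0},
    ],
    lr=1e-3,
)

# The 8-bit and paged variants (AdamE8bit, PagedAdamE, ...) would be drop-in
# replacements exposing the same regularization parameters.
```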

Motivation

The addition of the AdamE optimizer provides a more robust and flexible approach to regularization, particularly in deep learning models with complex structures and varying regularization requirements.
By allowing decoupled L1 and L2 regularization, this optimizer can help prevent overfitting and improve model generalization.
In particular, this Adam variant introduces L1 regularization, which was not properly supported before.
Moreover, the fine-grained control offered by this implementation gives full choice over the amount and type of regularization without impacting usability.

Request for Feedback

I would appreciate any feedback on the implementation and documentation of the AdamE optimizer.
Any suggestions for further improvements or additional tests are welcome.

Thank you for considering this pull request.

@matthewdouglas added the enhancement (New feature or request) label on Aug 14, 2024
@Titus-von-Koeller (Collaborator)

Dear @vincenzo-scotti,

Thanks a lot for your contribution; the code looks really nice and we can see you put a lot of attention to detail, which is very much appreciated!

Tim and I discussed this briefly, and we had a bit of difficulty locating the relevant scientific papers for this; you also didn't point to experimental evidence from others or experiments of your own to put it in perspective. Could you elaborate a bit more on these aspects?

On the code side everything looks really clean, but no tests are present. Maybe take a look at the other optimizer tests; also, maybe @matthewdouglas has a few suggestions on this front? I would say we need a simple integration test that shows that learning happens, just something that validates that it works and makes us immediately aware if a code change introduces a regression. Wdyt?

@vincenzo-scotti (Author)

Dear @Titus-von-Koeller,

Thank you very much for your feedback.
Below I address each of the points you mentioned.

To my knowledge, there is no paper to mention.
I developed this variant of AdamW as part of a research project I am currently working on (hopefully I will manage to publish a paper to mention soon).
The rationale is the same as that behind having decoupled L2 regularisation but with L1, which enforces sparsity.
I will make sure to better document and justify the rationale behind AdamE.

I agree with you that some proper integration testing is needed; I will write some tests based on those you have for the other optimizers.

Thank you again for your response.
I will update you on everything as soon as possible.

@Titus-von-Koeller (Collaborator)

Thanks @vincenzo-scotti,

Perfect, thanks again for your contribution and for explaining the background a bit more. Looking forward to further information once you're ready.

Please also consider evaluating your creation against other optimizers experimentally and sharing a bit about that. With BNB, all code implies maintenance cost over time, which can really add up. When merging new code, we want to be sure we have a good understanding of the value it adds, and this experimental evaluation against other approaches would really help gauge that. I imagine you would need that for the paper as well. Let us know if you need further input from us; no question is a stupid question, and we can always escalate tricky questions to Tim Dettmers. We're happy to support where we can and appreciate your initiative!

@Titus-von-Koeller marked this pull request as draft on September 10, 2024