
AdamE optimizer with decoupled L1 and L2 regularization #1314

Draft
vincenzo-scotti wants to merge 8 commits into main

Conversation


@vincenzo-scotti commented on Aug 10, 2024

Summary

This pull request introduces a new variant of the Adam optimizer called AdamE, where the E stands for Elastic Net.
AdamE incorporates decoupled L1 (lasso) and L2 (weight decay) regularization with independent coefficients, allowing flexible settings.
This addition extends the regularization capabilities of the optimizer and supports different configurations for different parameter groups.

Features and Changes

  • Decoupled L1 and L2 Regularization: The AdamE optimizer applies both L1 and L2 regularization, as in Elastic Net, through separate coefficients. The regularization is decoupled from the Adam weight-update component, as already happens in the AdamW optimizer for L2 regularization (see the update-rule sketch after this list).
    • Independent regularization coefficients: AdamE allows independent adjustment of the L1 and L2 regularization coefficients, providing greater control over the regularization of model parameters; setting either coefficient to 0 disables that term, so only the other regularization is applied.
    • Support for parameter groups: Different parameter groups can have specific L1 and L2 values, enabling targeted regularization strategies.
    • Support for different formats: AdamE comes in different flavours, including the base version, the 8-bit variant, and the paged variant.
  • Compatibility with Existing Optimizers: The shared optimizer base classes, the weight-update functions, and the CUDA components have been extended to include lasso regularization without modifying the interfaces of the pre-existing optimizers.
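For illustration, here is a minimal PyTorch sketch of what a decoupled elastic-net update step looks like conceptually. It is not the PR's actual kernel code: the coefficient name lasso follows the PR description, while the exact order of operations and bias-correction details are assumptions.

```python
import torch

@torch.no_grad()
def adame_step_sketch(p, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                      eps=1e-8, weight_decay=1e-2, lasso=1e-4):
    """One AdamE-style update on a single tensor (reference sketch only)."""
    beta1, beta2 = betas
    # Standard Adam moment estimates; the regularization terms are NOT folded
    # into the gradient, which is what "decoupled" means here.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Adam update direction.
    p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
    # Decoupled L2 (weight decay), as in AdamW.
    p.mul_(1 - lr * weight_decay)
    # Decoupled L1 (lasso): constant shrinkage of each weight toward zero.
    p.add_(p.sign(), alpha=-lr * lasso)
    return p
```

In this sketch, setting lasso=0 recovers the usual AdamW-style decoupled weight decay, while setting weight_decay=0 yields a purely L1-regularized decoupled Adam update.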

Implementation Details

  • Added new optimizer classes in bitsandbytes/optim/adame.py. The following new optimizers are now part of the bitsandbytes/optim module (a usage sketch follows this list):
    • AdamE
    • AdamE8bit
    • AdamE32bit
    • PagedAdamE
    • PagedAdamE8bit
    • PagedAdamE32bit
  • Introduced a new parameter, lasso, for L1 regularization (defaulting to 0 to avoid changing the behaviour of pre-existing optimizers) among the defaults of the following classes in bitsandbytes/optim/optimizer.py:
    • Optimizer8bit
    • Optimizer2State (parameter is also in the class constructor)
    • Optimizer1State (parameter is also in the class constructor)
  • Introduced the same lasso parameter for L1 regularization (again defaulting to 0 to preserve the behaviour of pre-existing optimizers) among the parameters of the following optimizer update functions in bitsandbytes/functional.py:
    • optimizer_update_32bit
    • optimizer_update_8bit
    • optimizer_update_8bit_blockwise
  • Updated the documentation to cover all versions of the new optimizer and to explain the new parameter in the low-level optimizer classes and the update functions.
  • Updated all CUDA kernels and interfaces inside csrc accordingly; in particular, the following files were modified to introduce the new lasso parameter for L1 regularization:
    • csrc/kernels.cu
    • csrc/kernels.cuh
    • csrc/ops.cu
    • csrc/ops.cuh
    • csrc/pythonInterface.cpp
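As a usage illustration, the new classes would presumably be instantiated like the existing bitsandbytes optimizers. The lasso keyword and per-parameter-group support come from the PR description; the exact constructor signature sketched here is an assumption, not the confirmed API.

```python
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Hypothetical usage: independent L1 ("lasso") and L2 ("weight_decay")
# coefficients, set per parameter group. Keyword names follow the PR
# description; the full signature is assumed, not confirmed.
optimizer = bnb.optim.AdamE(
    [
        {"params": model[0].parameters(), "lasso": 1e-4, "weight_decay": 1e-2},
        {"params": model[2].parameters(), "lasso": 0.0, "weight_decay": 0.0},
    ],
    lr=1e-3,
)

# The 8-bit and paged variants (AdamE8bit, PagedAdamE, ...) would be drop-in
# replacements exposing the same regularization parameters.
```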

Motivation

The addition of the AdamE optimizer provides a more robust and flexible approach to regularization, particularly in deep learning models with complex structures and varying regularization requirements.
By allowing decoupled L1 and L2 regularization, this optimizer can help prevent overfitting and improve model generalization.
In particular, this Adam variant introduces L1 regularization, which was not properly supported before.
Moreover, the fine-grained control offered by this implementation gives full choice over the amount and type of regularization without impacting usability.

Request for Feedback

I would appreciate any feedback on the implementation and documentation of the AdamE optimizer.
Any suggestions for further improvements or additional tests are welcome.

Thank you for considering this pull request.

@matthewdouglas added the enhancement (New feature or request) label on Aug 14, 2024
@Titus-von-Koeller (Collaborator)

Dear @vincenzo-scotti,

Thanks a lot for your contribution; the code looks really nice and we can see you put a lot of attention to detail, which is very much appreciated!

Tim and I discussed this briefly, and we had a bit of difficulty locating the relevant scientific papers for this; you also didn't point to experimental evidence from others or experiments of your own to put it in perspective. Could you elaborate a bit more on these aspects?

On the code side everything looks really clean, but no tests are present. Maybe take a look at the other optimizer tests; also, maybe @matthewdouglas has a few suggestions on this front? I would say we need a simple integration test that shows that learning happens, just something that validates that it works and makes us immediately aware if a code change introduces a regression. Wdyt?

@vincenzo-scotti (Author)

Dear @Titus-von-Koeller,

Thank you very much for your feedback.
Below I address each of the points you mentioned.

To my knowledge, there is no paper to mention.
I developed this variant of AdamW as part of a research project I am currently working on (hopefully I will manage to publish a paper to mention soon).
The rationale is the same as that behind having decoupled L2 regularisation but with L1, which enforces sparsity.
I will make sure to better document and justify the rationale behind AdamE.

I agree with you that some proper integration testing is needed; I will write some tests based on those you have for the other optimizers.

Thank you again for your response.
I will update you on everything as soon as possible.

@Titus-von-Koeller (Collaborator)

Thanks @vincenzo-scotti,

Perfect, thanks again for your contribution and for explaining the background a bit more. Looking forward to further information once you're ready.

Please also consider evaluating your creation against other optimizers experimentally and sharing a bit about that. With BNB, all code implies maintenance cost over time, which can really add up. When merging new code, we want to be sure we have a good understanding of the value it adds, and this experimental evaluation against other approaches would really help gauge that. I imagine you would need that for the paper as well. Let us know if you need further input from us; no question is a stupid question, and we can always escalate tricky questions to Tim Dettmers. We're happy to support where we can and appreciate your initiative!

@Titus-von-Koeller marked this pull request as draft on September 10, 2024