🗺️ Implementation DiscoPOP Loss #2323

fanconic · 2024-11-04T18:36:47Z

What does this PR do?

This PR implements the DiscoPOP loss. The DiscoPOP loss, also described in the paper as "Log Ratio Modulated Loss" (LRML), was found through an LLM-driven discovery process that searches for objective functions for offline preference optimization. In the paper, the DiscoPOP loss outperformed (or performed competitively with) traditional offline preference optimization functions, such as DPO and Hinge loss, on several evaluation tasks (IMDb positive text generation, TL;DR text summarization, and Alpaca Eval 2.0).
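To make "log-ratio modulated" concrete, here is a minimal pure-Python sketch of the loss for a single preference pair, following my reading of the formulation in the DiscoPOP paper: a sigmoid of the scaled log-ratio difference blends a logistic (DPO-style) term with an exponential term. The function names, the argument names, and the `beta=0.1` default are assumptions made for illustration. The code this PR adds to `trl/trainer/dpo_trainer.py` is not shown on this page and may differ.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def discopop_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,   # assumed default, chosen only for this sketch
    tau: float = 0.05,   # the value the docs text calls the recommended default
) -> float:
    """Sketch of the log-ratio modulated loss for one (chosen, rejected) pair."""
    # Difference between the policy and reference log-ratios, scaled by beta.
    logits = beta * (
        (policy_chosen_logp - policy_rejected_logp)
        - (ref_chosen_logp - ref_rejected_logp)
    )
    # Mixing weight in (0, 1); tau controls how sharply it switches.
    modulation = sigmoid(logits / tau)
    logistic_component = -log_sigmoid(logits)
    # Note: math.exp overflows for extremely negative logits; fine for a sketch.
    exp_component = math.exp(-logits)
    return (1.0 - modulation) * logistic_component + modulation * exp_component
```

When the policy and reference log-ratios are equal, the logits are zero, the mixing weight is 0.5, and the loss is the average of `log(2)` and `1`. Smaller `tau` values make the switch between the two components sharper around zero.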

Before submitting

- This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
- Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Additional Notes

About tests: I have added tests, to the best of my knowledge, to make sure that this feature works. Some tests failed in unrelated parts of the code, but the ones I implemented passed. Please let me know if something needs to be fixed.

qgallouedec · 2024-11-04T18:48:02Z

docs/source/dpo_trainer.mdx

+### DiscoPOP loss
+
+The [DiscoPOP](https://huggingface.co/papers/2406.08414) paper uses LLMs to discover more efficient offline preference optimization losses. In the paper, the proposed DiscoPOP loss (which is a log-ratio modulated loss) outperformed other optimization losses on different tasks (IMDb positive text generation, Reddit TLDR summarization, and Alpaca Eval 2.0). To use this discovered loss, set the `loss_type` value to `discopop` in the [`DPOConfig`]. Additionally, you can change the `discopop_tau` value to change the shape of the DiscoPOP loss. However, the authors recommend the default value `discopop_tau=0.05`.
+


Suggested change

### DiscoPOP loss

The [DiscoPOP](https://huggingface.co/papers/2406.08414) paper uses LLMs to discover more efficient offline preference optimization losses. In the paper, the proposed DiscoPOP loss (which is a log-ratio modulated loss) outperformed other optimization losses on different tasks (IMDb positive text generation, Reddit TLDR summarization, and Alpaca Eval 2.0). To use this discovered loss, set the `loss_type` value to `discopop` in the [`DPOConfig`]. Additionally, you can change the `discopop_tau` value to change the shape of the DiscoPOP loss. However, the authors recommend the default value `discopop_tau=0.05`.

Can you make the final remark about discopop_tau in the table instead?
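For readers skimming this thread, the usage described in the docs text above might look like the following sketch. Only `loss_type="discopop"` and `discopop_tau=0.05` come from this PR. The helper name `build_discopop_config` and the `output_dir` value are placeholders invented for illustration, and the import is deferred so that the snippet loads even where `trl` is not installed.

```python
# Keyword arguments named in the documentation text of this PR.
discopop_kwargs = {
    "loss_type": "discopop",
    "discopop_tau": 0.05,  # described in the docs text as the recommended default
}


def build_discopop_config(output_dir: str):
    """Hypothetical helper: build a DPOConfig that selects the DiscoPOP loss.

    Requires `trl` (a version that includes this PR) to be installed.
    """
    from trl import DPOConfig  # deferred import: only needed when called

    return DPOConfig(output_dir=output_dir, **discopop_kwargs)


# Example call (placeholder path):
# config = build_discopop_config("dpo-discopop-run")
```

The resulting config would then be passed to the DPO trainer like any other `DPOConfig`.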

qgallouedec · 2024-11-04T18:49:31Z

trl/trainer/dpo_config.py

                - `"apo_zero"`: APO-zero loss from the [APO](https://huggingface.co/papers/2408.06266) paper.
                - `"apo_down"`: APO-down loss from the [APO](https://huggingface.co/papers/2408.06266) paper.
+                - `"discopop"`: DiscoPOP (a.k.a Log-Ratio Modulated Loss, LRML) loss from the [DiscoPOP](https://huggingface.co/papers/2406.08414) paper.


It's sorted by date

Suggested change

-                - `"apo_zero"`: APO-zero loss from the [APO](https://huggingface.co/papers/2408.06266) paper.
-                - `"apo_down"`: APO-down loss from the [APO](https://huggingface.co/papers/2408.06266) paper.
-                - `"discopop"`: DiscoPOP (a.k.a Log-Ratio Modulated Loss, LRML) loss from the [DiscoPOP](https://huggingface.co/papers/2406.08414) paper.
+                - `"discopop"`: DiscoPOP (a.k.a Log-Ratio Modulated Loss, LRML) loss from the [DiscoPOP](https://huggingface.co/papers/2406.08414) paper.
+                - `"apo_zero"`: APO-zero loss from the [APO](https://huggingface.co/papers/2408.06266) paper.
+                - `"apo_down"`: APO-down loss from the [APO](https://huggingface.co/papers/2408.06266) paper.

Why not "lrm" instead by the way?

I don't really have a clear answer to this. While lrml was the name proposed by the LLM during discovery, the authors agreed to name the best-performing loss DiscoPOP, which seemed to us like a catchy abbreviation for Discovered Preference Optimization.

I don't have a strong opinion on this, but LRML might be more informative than DiscoPOP. After all, there could be many different Discovered Preference Optimization losses.

qgallouedec · 2024-11-04T18:57:53Z

trl/commands/scripts

@@ -0,0 +1 @@
+/home/azureuser/caf83/trl/examples/scripts/


please remove this file

Sorry, this slipped in there when adding changed files. Good catch

qgallouedec · 2024-11-04T18:58:39Z

Thanks @fanconic! Do you have reference results to share?

qgallouedec · 2024-11-18T10:58:45Z

Thanks for contributing @fanconic 👊

HuggingFaceDocBuilderDev · 2024-11-18T11:02:56Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

fanconic added 2 commits November 4, 2024 18:24

Implement DiscoPOP Loss

45e19d2

Updated DiscoPOP documentation

c93ee1c

qgallouedec reviewed Nov 4, 2024

docs/source/dpo_trainer.mdx (outdated, resolved)

qgallouedec reviewed Nov 4, 2024

trl/trainer/dpo_config.py (outdated, resolved)

qgallouedec reviewed Nov 4, 2024

trl/trainer/dpo_config.py (outdated, resolved)

qgallouedec reviewed Nov 4, 2024

trl/trainer/dpo_trainer.py (outdated, resolved)

qgallouedec reviewed Nov 4, 2024

trl/trainer/dpo_trainer.py (outdated, resolved)

qgallouedec reviewed Nov 4, 2024

trl/trainer/dpo_trainer.py (outdated, resolved)

qgallouedec reviewed Nov 4, 2024

fanconic and others added 8 commits November 4, 2024 19:07

Corrected docs/source/dpo_trainer.mdx

f3e9f81

Co-authored-by: Quentin Gallouédec

Update docs/source/dpo_trainer.mdx

3d56cf3

Co-authored-by: Quentin Gallouédec

Update trl/trainer/dpo_config.py

228c2c9

Co-authored-by: Quentin Gallouédec

Update trl/trainer/dpo_trainer.py

05aea62

Co-authored-by: Quentin Gallouédec

Update trl/trainer/dpo_trainer.py

e905827

Co-authored-by: Quentin Gallouédec

Update trl/trainer/dpo_trainer.py

5ff1e1a

Co-authored-by: Quentin Gallouédec

Update trl/trainer/dpo_config.py

53d3ca0

Co-authored-by: Quentin Gallouédec

Update trl/trainer/dpo_config.py

4139972

Co-authored-by: Quentin Gallouédec

qgallouedec reviewed Nov 4, 2024

trl/trainer/dpo_config.py (outdated, resolved)

qgallouedec and others added 6 commits November 4, 2024 22:24

Merge branch 'main' into discopop_loss

ce60650

Update trl/trainer/dpo_config.py

01fd3ec

Merge branch 'main' into discopop_loss

866434a

Merge branch 'main' into discopop_loss

f1e2063

Merge branch 'main' into discopop_loss

7991057

Delete scripts directory

e4df418

qgallouedec approved these changes Nov 18, 2024

Merge branch 'main' into discopop_loss

b2f4316

qgallouedec added 2 commits November 18, 2024 12:11

style

97d8478

Merge branch 'discopop_loss' of https://github.com/fanconic/trl into …

533f5ab

…discopop_loss

qgallouedec changed the title ~~Implementation DiscoPOP Loss~~ 🗺️ Implementation DiscoPOP Loss Nov 18, 2024

empty commit

b8dd80e

qgallouedec merged commit cbf9abc into huggingface:main Nov 18, 2024
13 checks passed
