ADD: Raise warning for duplicated conversation types #1232

JohanWork · 2024-01-31T08:43:02Z

Rasise a warning if conversation is specified but already exists in the type for datasets. This is for example the case for ultra_apply_chatml.

winglian · 2024-01-31T11:31:55Z

Aren't the chatml.* already builtin formatters? I would think we wouldn't want to warn for those since they are correct.

winglian · 2024-01-31T11:45:46Z

src/axolotl/utils/config.py

+                    "chatml.intel",
+                    "chatml.argilla",


Suggested change

"chatml.intel",

"chatml.argilla",

Aren't these current formatters? I don't think we should expect the user to receive a warning if they use these two.

chatml.intel is used here https://github.com/OpenAccess-AI-Collective/axolotl/blob/5787e1a23f61be093fc26067f8a05342fbdfbe1a/tests/e2e/test_dpo.py#L46 and the test seams to work. So thought they where supported but might miss something :) chatml.argilla is mention here https://github.com/OpenAccess-AI-Collective/axolotl/blob/5787e1a23f61be093fc26067f8a05342fbdfbe1a/docs/rlhf.md?plain=1#L29

Right, if the chatml.* types.work, then we shouldn't we warning here? Or am I misunderstanding something?

@winglian I don't fully understand. My intention was to suggest that we should issue a warning or raise a ValueError if a conversation format is added, but there's already a type in place that formats the data. This situation arises, for example, with ultra_apply_chatml (see the example in the comment below). Here's another example where the conversation format is determined by the type: https://huggingface.co/cognitivecomputations/dolphin-2.6-mistral-7b-dpo/blob/5c32e515f3d79beefc110e8a07c3671269a0f5ab/configs/dolphin-dpo.yml#L14

winglian · 2024-01-31T11:46:35Z

src/axolotl/utils/config.py

+                ds_cfg.type
+                in [
+                    "ultra_apply_chatml",
+                    "toxic_apply_chatml",


I don't think we ever had toxic_apply_chatml in our previous dpo release pre-cleanup.

At least from how I've understood the DPO (and this took me some time, and I might have misunderstood it):

datasets: path: argilla/ultrafeedback-binarized-preferences-cleaned split: train type: ultra_apply_chatml conversation: chatml

This doesn't make any sense. You shouldn't set conversation here since the type actually also includes formatting for a ChatML conversation. In some cases, type serves both as the input format and the format used for conversation style, to my understanding. Therefore, I thought it would be valid to raise an error if you use a type format that also includes a conversation format and your setting conversation.

Regarding toxic_apply_chatml do you mean it is not supported or was previously not supported?

@winglian, I now understand your point. However, I still believe that setting conversations for DPO should not be possible since it's done in type. Or am I still overlooking something?

winglian reviewed Jan 31, 2024

View reviewed changes

JohanWork added 3 commits February 1, 2024 08:42

add config check and test

034d7e1

add one format

88fcdd6

run pre-commit

0774552

winglian force-pushed the add-warrning-model-tokenizer-chat-template branch from f8c97a0 to 0774552 Compare February 1, 2024 13:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ADD: Raise warning for duplicated conversation types #1232

ADD: Raise warning for duplicated conversation types #1232

JohanWork commented Jan 31, 2024

winglian commented Jan 31, 2024

winglian Jan 31, 2024

JohanWork Jan 31, 2024

winglian Feb 6, 2024

JohanWork Feb 13, 2024 •

edited

Loading

winglian Jan 31, 2024

JohanWork Jan 31, 2024 •

edited

Loading

JohanWork Feb 6, 2024

ADD: Raise warning for duplicated conversation types #1232

Are you sure you want to change the base?

ADD: Raise warning for duplicated conversation types #1232

Conversation

JohanWork commented Jan 31, 2024

winglian commented Jan 31, 2024

winglian Jan 31, 2024

Choose a reason for hiding this comment

JohanWork Jan 31, 2024

Choose a reason for hiding this comment

winglian Feb 6, 2024

Choose a reason for hiding this comment

JohanWork Feb 13, 2024 • edited Loading

Choose a reason for hiding this comment

winglian Jan 31, 2024

Choose a reason for hiding this comment

JohanWork Jan 31, 2024 • edited Loading

Choose a reason for hiding this comment

JohanWork Feb 6, 2024

Choose a reason for hiding this comment

JohanWork Feb 13, 2024 •

edited

Loading

JohanWork Jan 31, 2024 •

edited

Loading