Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADD: Raise warning for duplicated conversation types #1232

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

JohanWork
Copy link
Contributor

Rasise a warning if conversation is specified but already exists in the type for datasets. This is for example the case for ultra_apply_chatml.

@winglian
Copy link
Collaborator

Aren't the chatml.* already builtin formatters? I would think we wouldn't want to warn for those since they are correct.

Comment on lines +537 to +536
"chatml.intel",
"chatml.argilla",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"chatml.intel",
"chatml.argilla",

Aren't these current formatters? I don't think we should expect the user to receive a warning if they use these two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, if the chatml.* types.work, then we shouldn't we warning here? Or am I misunderstanding something?

Copy link
Contributor Author

@JohanWork JohanWork Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@winglian I don't fully understand. My intention was to suggest that we should issue a warning or raise a ValueError if a conversation format is added, but there's already a type in place that formats the data. This situation arises, for example, with ultra_apply_chatml (see the example in the comment below). Here's another example where the conversation format is determined by the type: https://huggingface.co/cognitivecomputations/dolphin-2.6-mistral-7b-dpo/blob/5c32e515f3d79beefc110e8a07c3671269a0f5ab/configs/dolphin-dpo.yml#L14

ds_cfg.type
in [
"ultra_apply_chatml",
"toxic_apply_chatml",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we ever had toxic_apply_chatml in our previous dpo release pre-cleanup.

Copy link
Contributor Author

@JohanWork JohanWork Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least from how I've understood the DPO (and this took me some time, and I might have misunderstood it):

datasets:
    path: argilla/ultrafeedback-binarized-preferences-cleaned
    split: train
    type: ultra_apply_chatml
    conversation: chatml 

This doesn't make any sense. You shouldn't set conversation here since the type actually also includes formatting for a ChatML conversation. In some cases, type serves both as the input format and the format used for conversation style, to my understanding. Therefore, I thought it would be valid to raise an error if you use a type format that also includes a conversation format and your setting conversation.

Regarding toxic_apply_chatml do you mean it is not supported or was previously not supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@winglian, I now understand your point. However, I still believe that setting conversations for DPO should not be possible since it's done in type. Or am I still overlooking something?

@winglian winglian force-pushed the add-warrning-model-tokenizer-chat-template branch from f8c97a0 to 0774552 Compare February 1, 2024 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants