-
Notifications
You must be signed in to change notification settings - Fork 534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add eos_text and bos_text defaults for convert_text_to_mds.py #826
Conversation
add_bos_token=False, | ||
add_eos_token=False) | ||
tokenizer.model_max_length = 5000000000 # Hack to prevent warnings from HuggingFace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this because if a user specifies a particular bos_text
or eos_text
the tokenizer should not automatically also add bos or eos token
bos_text (Optional[str]): Text to prepend to each example to separate concatenated samples | ||
If None, use the tokenizer's bos_token if tokenizer.add_bos_token is True, otherwise use an empty string. | ||
eos_text (Optional[str]): Text end to append to each example to separate concatenated samples | ||
If None, use the tokenizer's eos_token. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't love that these defaults aren't mirrors of each other, but this allows the defaults to match the finetuning cases we discussed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is trying to put the finetuning configuration in the wrong place. I think the default for this script should be to just do whatever the tokenizer does by default. That is what I would expect as a user. I think its reasonable to error if the tokenizer adds a bos (eos) and the user also specified bos (eos). That error would occur in ConcatTokensDataset
constructor (currently its a warning).
@dakinggg Hm, so with what you propose, a user could not specify an alternative bos if the tokenizer already adds one by default? OK with that being the case, i don't really have a strong opinion on that. |
@irenedea if we added a And today, they can add a custom bos, but it will be in addition to the one the tokenizer adds by default. Which I don't think is ever what is desired. |
closing in favor of #843 |
No description provided.