
Add eos_text and bos_text defaults for convert_text_to_mds.py #826

Closed
wants to merge 8 commits

Conversation

irenedea
Contributor

No description provided.

Comment on lines +225 to 227
add_bos_token=False,
add_eos_token=False)
tokenizer.model_max_length = 5000000000 # Hack to prevent warnings from HuggingFace
Contributor Author

Added this because, if a user specifies a particular bos_text or eos_text, the tokenizer should not also automatically add a BOS or EOS token.
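The rationale above can be illustrated with a minimal stand-in tokenizer (not the real HuggingFace class): when automatic BOS/EOS insertion stays on, the tokenizer's own special tokens stack on top of the user's bos_text/eos_text, doubling the delimiters between concatenated samples.

```python
# Illustrative stub only; a real HuggingFace tokenizer exposes the same flags.
class StubTokenizer:
    bos_token = '<s>'
    eos_token = '</s>'

    def __init__(self, add_bos_token: bool, add_eos_token: bool):
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token

    def wrap(self, text: str) -> str:
        # Mimics a tokenizer that inserts its special tokens automatically.
        if self.add_bos_token:
            text = self.bos_token + text
        if self.add_eos_token:
            text = text + self.eos_token
        return text

bos_text, eos_text = '<s>', '</s>'  # user-specified delimiters
with_auto = StubTokenizer(True, True).wrap(bos_text + 'doc' + eos_text)
without_auto = StubTokenizer(False, False).wrap(bos_text + 'doc' + eos_text)
# with_auto doubles the markers; without_auto keeps only the user-specified ones.
```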

Comment on lines 348 to 351
bos_text (Optional[str]): Text to prepend to each example to separate concatenated samples.
    If None, use the tokenizer's bos_token if tokenizer.add_bos_token is True, otherwise use an empty string.
eos_text (Optional[str]): Text to append to each example to separate concatenated samples.
    If None, use the tokenizer's eos_token.
Contributor Author

Don't love that these defaults aren't mirrors of each other, but this lets the defaults match the finetuning cases we discussed.
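The asymmetric defaults in the docstring above could be sketched as follows; `tokenizer` is any object exposing `bos_token`, `eos_token`, and `add_bos_token` (a real HuggingFace tokenizer would qualify), and the function name is hypothetical:

```python
from typing import Optional, Tuple

def resolve_bos_eos_text(tokenizer,
                         bos_text: Optional[str],
                         eos_text: Optional[str]) -> Tuple[str, str]:
    if bos_text is None:
        # BOS is inherited only when the tokenizer would have added one itself.
        bos_text = tokenizer.bos_token if getattr(tokenizer, 'add_bos_token', False) else ''
    if eos_text is None:
        # EOS defaults unconditionally to the tokenizer's eos_token.
        eos_text = tokenizer.eos_token
    return bos_text, eos_text
```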

Collaborator

@dakinggg left a comment

I think this is trying to put the finetuning configuration in the wrong place. I think the default for this script should be to just do whatever the tokenizer does by default; that is what I would expect as a user. I think it's reasonable to error if the tokenizer adds a BOS (EOS) and the user also specifies bos_text (eos_text). That error would occur in the ConcatTokensDataset constructor (currently it's a warning).
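The check proposed above (raising instead of the current warning when the tokenizer auto-adds a BOS/EOS and the user also supplied bos_text/eos_text) might look like the following sketch; the function name and message wording are illustrative only:

```python
def check_no_duplicate_delimiters(tokenizer, bos_text: str, eos_text: str) -> None:
    # Raise when a user-supplied delimiter would double up the tokenizer's own.
    if bos_text and getattr(tokenizer, 'add_bos_token', False):
        raise ValueError('Tokenizer already adds a BOS token; a non-empty '
                         'bos_text would prepend a second BOS to every sample.')
    if eos_text and getattr(tokenizer, 'add_eos_token', False):
        raise ValueError('Tokenizer already adds an EOS token; a non-empty '
                         'eos_text would append a second EOS to every sample.')
```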

@irenedea
Contributor Author

irenedea commented Jan 3, 2024

@dakinggg Hm, so with what you propose, a user could not specify an alternative BOS if the tokenizer already adds one by default? OK with that being the case; I don't really have a strong opinion on it.

@dakinggg
Collaborator

dakinggg commented Jan 4, 2024

@irenedea If we added a tokenizer_kwargs option, they could, by turning off the automatic BOS adding.

And today, they can add a custom BOS, but it will be in addition to the one the tokenizer adds by default, which I don't think is ever what is desired.
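The tokenizer_kwargs idea could be sketched like this, with a stub standing in for the real HuggingFace loader so the kwarg forwarding can be shown without a checkpoint; the names `load_tokenizer` and `StubTokenizer` are illustrative only:

```python
class StubTokenizer:
    # Mirrors a tokenizer whose default is to auto-add a BOS token.
    def __init__(self, add_bos_token: bool = True):
        self.add_bos_token = add_bos_token

def load_tokenizer(tokenizer_kwargs=None) -> StubTokenizer:
    # Forwarding user kwargs lets e.g. {'add_bos_token': False} switch off the
    # automatic BOS so a custom bos_text can take its place.
    return StubTokenizer(**(tokenizer_kwargs or {}))
```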

@irenedea
Contributor Author

irenedea commented Jan 8, 2024

Closing in favor of #843.

@irenedea irenedea closed this Jan 8, 2024