
Add eos_text and bos_text defaults for convert_text_to_mds.py #826

Closed
wants to merge 8 commits

Conversation

irenedea
Contributor

No description provided.

Comment on lines +225 to 227
add_bos_token=False,
add_eos_token=False)
tokenizer.model_max_length = 5000000000 # Hack to prevent warnings from HuggingFace
Contributor Author

Added this because, if a user specifies a particular bos_text or eos_text, the tokenizer should not also automatically add a BOS or EOS token.
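The rationale above can be illustrated with a minimal stand-in tokenizer (not the real HuggingFace class): when automatic BOS/EOS insertion stays on, the tokenizer's own special tokens stack on top of the user's bos_text/eos_text, doubling the delimiters between concatenated samples.

```python
# Illustrative stub only; a real HuggingFace tokenizer exposes the same flags.
class StubTokenizer:
    bos_token = '<s>'
    eos_token = '</s>'

    def __init__(self, add_bos_token: bool, add_eos_token: bool):
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token

    def wrap(self, text: str) -> str:
        # Mimics a tokenizer that inserts its special tokens automatically.
        if self.add_bos_token:
            text = self.bos_token + text
        if self.add_eos_token:
            text = text + self.eos_token
        return text

bos_text, eos_text = '<s>', '</s>'  # user-specified delimiters
with_auto = StubTokenizer(True, True).wrap(bos_text + 'doc' + eos_text)
without_auto = StubTokenizer(False, False).wrap(bos_text + 'doc' + eos_text)
# with_auto doubles the markers; without_auto keeps only the user-specified ones.
```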

Comment on lines 348 to 351
bos_text (Optional[str]): Text to prepend to each example to separate concatenated samples.
    If None, use the tokenizer's bos_token if tokenizer.add_bos_token is True, otherwise use an empty string.
eos_text (Optional[str]): Text to append to each example to separate concatenated samples.
    If None, use the tokenizer's eos_token.
Contributor Author

Don't love that these defaults aren't mirrors of each other, but this lets the defaults match the finetuning cases we discussed.
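The asymmetric defaults in the docstring above could be sketched as follows; `tokenizer` is any object exposing `bos_token`, `eos_token`, and `add_bos_token` (a real HuggingFace tokenizer would qualify), and the function name is hypothetical:

```python
from typing import Optional, Tuple

def resolve_bos_eos_text(tokenizer,
                         bos_text: Optional[str],
                         eos_text: Optional[str]) -> Tuple[str, str]:
    if bos_text is None:
        # BOS is inherited only when the tokenizer would have added one itself.
        bos_text = tokenizer.bos_token if getattr(tokenizer, 'add_bos_token', False) else ''
    if eos_text is None:
        # EOS defaults unconditionally to the tokenizer's eos_token.
        eos_text = tokenizer.eos_token
    return bos_text, eos_text
```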

Collaborator

@dakinggg left a comment

I think this is trying to put the finetuning configuration in the wrong place. I think the default for this script should be to just do whatever the tokenizer does by default; that is what I would expect as a user. I think it's reasonable to error if the tokenizer adds a BOS (EOS) and the user also specifies bos_text (eos_text). That error would occur in the ConcatTokensDataset constructor (currently it's a warning).
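The check proposed above (raising instead of the current warning when the tokenizer auto-adds a BOS/EOS and the user also supplied bos_text/eos_text) might look like the following sketch; the function name and message wording are illustrative only:

```python
def check_no_duplicate_delimiters(tokenizer, bos_text: str, eos_text: str) -> None:
    # Raise when a user-supplied delimiter would double up the tokenizer's own.
    if bos_text and getattr(tokenizer, 'add_bos_token', False):
        raise ValueError('Tokenizer already adds a BOS token; a non-empty '
                         'bos_text would prepend a second BOS to every sample.')
    if eos_text and getattr(tokenizer, 'add_eos_token', False):
        raise ValueError('Tokenizer already adds an EOS token; a non-empty '
                         'eos_text would append a second EOS to every sample.')
```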

@irenedea
Contributor Author

irenedea commented Jan 3, 2024

@dakinggg Hm, so with what you propose, a user could not specify an alternative BOS if the tokenizer already adds one by default? OK with that being the case; I don't really have a strong opinion on it.

@dakinggg
Collaborator

dakinggg commented Jan 4, 2024

@irenedea If we added a tokenizer_kwargs option, they could, by turning off the automatic BOS adding.

And today, they can add a custom BOS, but it will be in addition to the one the tokenizer adds by default, which I don't think is ever what is desired.
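The tokenizer_kwargs idea could be sketched like this, with a stub standing in for the real HuggingFace loader so the kwarg forwarding can be shown without a checkpoint; the names `load_tokenizer` and `StubTokenizer` are illustrative only:

```python
class StubTokenizer:
    # Mirrors a tokenizer whose default is to auto-add a BOS token.
    def __init__(self, add_bos_token: bool = True):
        self.add_bos_token = add_bos_token

def load_tokenizer(tokenizer_kwargs=None) -> StubTokenizer:
    # Forwarding user kwargs lets e.g. {'add_bos_token': False} switch off the
    # automatic BOS so a custom bos_text can take its place.
    return StubTokenizer(**(tokenizer_kwargs or {}))
```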

@irenedea
Contributor Author

irenedea commented Jan 8, 2024

Closing in favor of #843.

@irenedea irenedea closed this Jan 8, 2024