Since LLaMA models use left padding, the supervised-training DialogueDataCollator ends up padding label_mask in the opposite direction from tokenizer.pad (which left-pads input_ids and attention_mask), because the label_mask tensors are right-padded before torch.stack(label_mask).
Printing out the dataloader output in trainer_sft.py also confirms the issue; a minimal sketch of the misalignment (with made-up tensors instead of a real tokenizer) is shown below.
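```python
# Minimal sketch of the alignment bug (hypothetical tensors, no real tokenizer).
# With padding_side="left" the real tokens sit at the right end of input_ids,
# but a collator that right-pads label_mask before torch.stack leaves the mask
# anchored at the left end, so mask positions no longer line up with tokens.
import torch
import torch.nn.functional as F

pad_id, max_len = 0, 6

# Two examples of different lengths: token ids and a per-token label mask.
input_ids = [torch.tensor([11, 12, 13, 14]), torch.tensor([21, 22])]
label_mask = [torch.tensor([0, 0, 1, 1]), torch.tensor([0, 1])]

# tokenizer.pad with padding_side="left" (LLaMA default) -> pad on the LEFT.
padded_ids = torch.stack(
    [F.pad(t, (max_len - len(t), 0), value=pad_id) for t in input_ids]
)

# The collator right-pads the mask before stacking -> pad on the RIGHT.
padded_mask = torch.stack(
    [F.pad(m, (0, max_len - len(m)), value=0) for m in label_mask]
)

print(padded_ids)
# tensor([[ 0,  0, 11, 12, 13, 14],
#         [ 0,  0,  0,  0, 21, 22]])
print(padded_mask)
# tensor([[0, 0, 1, 1, 0, 0],   <- marks padding positions, not tokens 13/14
#         [0, 1, 0, 0, 0, 0]])  <- marks a padding position, not token 22
```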
I think padding_side is never set to "right" anywhere in the trainer_sft.py pipeline, so by default the LLaMA models we have trained so far are likely slightly faulty.
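One possible workaround, assuming the pipeline loads a Hugging Face tokenizer (the checkpoint name below is a placeholder, not the repo's actual config), is to force right padding so the tokenizer-padded tensors and the stacked label_mask agree on where padding goes:

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint path; substitute whatever the training config uses.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-checkpoint")
tokenizer.padding_side = "right"
```

Alternatively, the collator could left-pad label_mask to match the tokenizer's padding side instead of changing the tokenizer.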