
Fully compatible with the open clip Tokeniser #24925

Closed
laksjdjf opened this issue Jul 19, 2023 · 13 comments

@laksjdjf

Feature request

The open_clip tokenizer has a pad_token_id of 0, but this cannot be reproduced in transformers because CLIPTokenizer.__init__() does not allow setting a pad_token_id.

Motivation

Related to huggingface/diffusers#4153

Stable Diffusion v2 and XL use the open_clip tokenizer. To avoid adding dependencies, transformers should provide the same functionality.

Your contribution

It seems possible to set pad_token_id directly, but this is not a practical solution:

tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2-1", pad_token="<|endoftext|>")
tokenizer.pad_token_id = 0
@laksjdjf
Author

@patrickvonplaten

@sgugger
Collaborator

sgugger commented Jul 19, 2023

cc @patil-suraj and @ArthurZucker

@ArthurZucker
Collaborator

Hey, I am not really sure I understand the issue here. The pad_token_id is derived from the pad_token. By default the pad_token is set to '!', which maps to 0: tokenizer.decode(torch.tensor([0])) returns '!' and tokenizer.convert_tokens_to_ids('!') returns 0. If you set the padding token to a different value by hand, the following will happen:

>>> tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2-1", pad_token="<|endoftext|>")
>>> print(tokenizer.pad_token_id)
49407
>>> tokenizer.pad_token_id = 0
>>> print(tokenizer.pad_token)
 '!'

@laksjdjf
Author

Hey, I am not really sure I understand the issue here. The pad_token_id is derived from the pad_token. By default the pad_token is set to '!', which maps to 0: tokenizer.decode(torch.tensor([0])) returns '!' and tokenizer.convert_tokens_to_ids('!') returns 0. If you set the padding token to a different value by hand, the following will happen:

You are right, my implementation is not perfect. Its purpose is to make the text-to-token_id conversion match open_clip. What I want to know is how to set pad_token_id to 0 without affecting tokens that are normally used, like !.

@ArthurZucker
Collaborator

You can't really do that unless you train a new tokenizer 😅
You can add a new token, with a new index, so that ! keeps its normal meaning. The problem is that the embedding at position 0 might have been trained as the padding token and is therefore effectively a random tensor (never updated by gradient computation).
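A minimal sketch of that route (assuming a randomly initialised embedding for the new token is acceptable; the token string <|pad|> is a made-up placeholder, not something these checkpoints define):

from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="text_encoder")

# Register a dedicated pad token so that "!" keeps its normal meaning.
num_added = tokenizer.add_special_tokens({"pad_token": "<|pad|>"})  # hypothetical token string
if num_added > 0:
    # The new embedding row is randomly initialised and would still need training.
    text_encoder.resize_token_embeddings(len(tokenizer))

print(tokenizer.pad_token, tokenizer.pad_token_id)  # <|pad|> and a brand-new id, not 0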

@patrickvonplaten
Contributor

cc @patil-suraj to explain the context around Stable Diffusion and OAI vs. openCLIP here maybe

@patil-suraj
Contributor

Hey @laksjdjf ,

Indeed, there's a discrepancy between the pad_token_id in the open clip tokenizer and the CLIPTokenizer in transformers. But we can't change it for the existing models for backward compatibility reasons.

But note that for the tokenizer used in SD2 and SDXL it's already set correctly cf https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/tokenizer/special_tokens_map.json#L16
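A quick sanity check, sketched here under the assumption that the hub config stays as linked above:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="tokenizer")
print(tokenizer.pad_token, tokenizer.pad_token_id)  # expected: ! 0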

And a bit more context about the padding token in CLIP: CLIP doesn't care about the padding token, and a wrong padding token only affects inference when all token embeddings are used (as in Stable Diffusion). For training, even if the padding token is wrong (i.e. if we use eos instead of !), it shouldn't matter, because:

  • CLIP did not use attention_mask during training.
  • CLIPTextEncoder uses a causal mask, so tokens to the right don't influence the hidden states of tokens to the left.
  • CLIP is trained with a contrastive loss, which is computed using the projections; the text projection is computed from the pooled eos token embedding, and since CLIPTextEncoder is causal, the eos embedding isn't affected by tokens to its right, so it stays the same no matter what the padding token is.
  • For downstream training (like SD), as long as a consistent token is used for padding, it shouldn't severely affect the training. But for inference, we will need to use the same token.

So, given the way CLIP is trained, it doesn't care about the padding token. The padding token only affects inference if a token different from the one used for padding during training is used, and this is already taken care of in the SD 2 and SDXL repos.
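As a rough illustration of the points above (a sketch, not an official test): pad the same prompt with two different ids and check that nothing up to and including eos, and hence nothing that feeds the projection, changes.

import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

ids = tokenizer("a photo of a cat", return_tensors="pt").input_ids  # [bos, ..., eos]
eos_pos = ids.shape[1] - 1

def pad_to(pad_id, length=77):
    # Right-pad with an arbitrary id; no attention_mask is passed, mirroring CLIP training.
    pad = torch.full((1, length - ids.shape[1]), pad_id, dtype=ids.dtype)
    return torch.cat([ids, pad], dim=1)

with torch.no_grad():
    out_bang = text_encoder(input_ids=pad_to(0))       # pad with "!" (id 0)
    out_eos = text_encoder(input_ids=pad_to(49407))    # pad with eos

# Because of the causal mask, everything up to eos is identical either way,
# and so is the pooled (eos) output that feeds the text projection.
print(torch.allclose(out_bang.last_hidden_state[:, :eos_pos + 1],
                     out_eos.last_hidden_state[:, :eos_pos + 1]))      # True
print(torch.allclose(out_bang.pooler_output, out_eos.pooler_output))   # True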

@laksjdjf
Author

But note that for the tokenizer used in SD2 and SDXL it's already set correctly cf https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/tokenizer/special_tokens_map.json#L16

My concern is that the above process will result in ! no longer being available in its normal sense.

@patrickvonplaten
Contributor

Hey @laksjdjf,

As @patil-suraj mentioned CLIP never used a padding token for training. It was trained with a causal mask and only tokens until the eos token are taken into account when computing the CLIP contrastive loss. All tokens following the eos token have no influence on the model, so one could have added any token here.

Now, it only matters if a pretrained CLIP is further fine-tuned, as is done for SD. In this case the padding token was used to influence the loss, and in that sense SD does not distinguish between ! and a padding token. But this is purely due to the way SD uses CLIP for fine-tuning; it is not an inherent characteristic of CLIP.

@laksjdjf
Author

Hmmm, maybe I just don't understand, but my question is about the behaviour of the ! token, not the behaviour of the pad token. If the ! token is treated as the pad token, there is a difference when processing text that contains !:

>>> import open_clip
>>> from transformers import CLIPTokenizer
>>> tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", subfolder="tokenizer_2")
>>> prompt = "! !! !!!"
>>> input_ids = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt").input_ids

>>>print(input_ids)
tensor([[49406,     0,     0,     0,     0,     0,     0, 49407,     0,     0, ...

>>>print(open_clip.tokenize(prompt))
tensor([[49406,   256,   748,   995, 49407,     0,     0,     0,     0,     0, ...

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Clement-Lelievre

hello,

the fact that the pad tokens differ across the two SDXL-base-1.0 tokenizers raises a RuntimeError in a specific case when using the compel library.
See issue here on the compel repo to understand the specific case and here on SDXL's repo.

I am not 100% sure where the responsibility lies in this case, can anyone advise?
First question, why do these two pad tokens differ?

Thank you!

cc @patrickvonplaten

@ArthurZucker
Collaborator

Hey! I'd recommend opening an issue on diffusers, as this seems to be related to the tokenizer settings, not an internal bug!
