
Fully compatible with the open clip Tokeniser #24925

Closed
laksjdjf opened this issue Jul 19, 2023 · 13 comments

@laksjdjf

Feature request

The open_clip tokenizer has a pad_token_id of 0, but this cannot be reproduced in transformers because CLIPTokenizer.__init__() does not allow setting a pad_token_id.

Motivation

Related to huggingface/diffusers#4153

Stable Diffusion v2 and XL use the open_clip tokenizer. To avoid adding dependencies, transformers should provide the same functionality.

Your contribution

It seems possible to set pad_token_id directly, but this is not a practical solution:

tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2-1", pad_token="<|endoftext|>")
tokenizer.pad_token_id = 0
@laksjdjf
Author

@patrickvonplaten

@sgugger
Collaborator

sgugger commented Jul 19, 2023

cc @patil-suraj and @ArthurZucker

@ArthurZucker
Collaborator

Hey, I am not really sure I understand the issue here. The pad_token_id is derived from the pad_token. By default the pad_token is set to '!', which maps to 0: tokenizer.decode(torch.tensor([0])) returns '!' and tokenizer.convert_tokens_to_ids('!') returns 0. If you set the padding token to a different value by hand, the following will happen:

>>> tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2-1", pad_token="<|endoftext|>")
>>> print(tokenizer.pad_token_id)
49407
>>> tokenizer.pad_token_id = 0
>>> print(tokenizer.pad_token)
 '!'

@laksjdjf
Author

Hey, I am not really sure I understand the issue here. The pad_token_id is derived from the pad_token. By default the pad_token is set to '!', which maps to 0: tokenizer.decode(torch.tensor([0])) returns '!' and tokenizer.convert_tokens_to_ids('!') returns 0. If you set the padding token to a different value by hand, the following will happen:

You are right, my implementation is not perfect. Its purpose is to make the text-to-token_id conversion match open_clip. What I want to know is how to set pad_token_id to 0 without affecting tokens that are normally used, like !.

@ArthurZucker
Collaborator

You can't really do that unless you train a new tokenizer 😅
You can add a new token, with a new index, so that ! keeps its normal meaning. The problem is that the embedding at position 0 might have been trained as the padding token and is therefore effectively a random tensor (never updated by gradient computation).
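A minimal sketch of that route (assuming a randomly initialised embedding for the new token is acceptable; the token string <|pad|> is a made-up placeholder, not something these checkpoints define):

from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="text_encoder")

# Register a dedicated pad token so that "!" keeps its normal meaning.
num_added = tokenizer.add_special_tokens({"pad_token": "<|pad|>"})  # hypothetical token string
if num_added > 0:
    # The new embedding row is randomly initialised and would still need training.
    text_encoder.resize_token_embeddings(len(tokenizer))

print(tokenizer.pad_token, tokenizer.pad_token_id)  # <|pad|> and a brand-new id, not 0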

@patrickvonplaten
Contributor

cc @patil-suraj to explain the context around Stable Diffusion and OAI vs. openCLIP here maybe

@patil-suraj
Contributor

Hey @laksjdjf ,

Indeed, there's a discrepancy between the pad_token_id in the open clip tokenizer and the CLIPTokenizer in transformers. But we can't change it for the existing models for backward compatibility reasons.

But note that for the tokenizer used in SD2 and SDXL it's already set correctly cf https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/tokenizer/special_tokens_map.json#L16
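A quick sanity check, sketched here under the assumption that the hub config stays as linked above:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="tokenizer")
print(tokenizer.pad_token, tokenizer.pad_token_id)  # expected: ! 0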

And a bit more context about the padding token in CLIP: CLIP doesn't care about the padding token, and a wrong padding token only affects inference when all token embeddings are used (as in Stable Diffusion). For training, even if the padding token is wrong (i.e. if we use eos instead of !), it shouldn't matter, because:

  • CLIP did not use attention_mask during training.
  • CLIPTextEncoder uses a causal mask, so tokens to the right don't influence the hidden states of tokens to the left.
  • CLIP is trained with a contrastive loss, which is computed using the projections; the text projection is computed from the pooled eos token embedding, and since CLIPTextEncoder is causal, the eos embedding isn't affected by tokens to its right, so it stays the same no matter what the padding token is.
  • For downstream training (like SD), as long as a consistent token is used for padding, it shouldn't severely affect the training. But for inference, we will need to use the same token.

So, given the way CLIP is trained, it doesn't care about the padding token. The padding token only affects inference if a token different from the one used for padding during training is used, and this is already taken care of in the SD 2 and SDXL repos.
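As a rough illustration of the points above (a sketch, not an official test): pad the same prompt with two different ids and check that nothing up to and including eos, and hence nothing that feeds the projection, changes.

import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

ids = tokenizer("a photo of a cat", return_tensors="pt").input_ids  # [bos, ..., eos]
eos_pos = ids.shape[1] - 1

def pad_to(pad_id, length=77):
    # Right-pad with an arbitrary id; no attention_mask is passed, mirroring CLIP training.
    pad = torch.full((1, length - ids.shape[1]), pad_id, dtype=ids.dtype)
    return torch.cat([ids, pad], dim=1)

with torch.no_grad():
    out_bang = text_encoder(input_ids=pad_to(0))       # pad with "!" (id 0)
    out_eos = text_encoder(input_ids=pad_to(49407))    # pad with eos

# Because of the causal mask, everything up to eos is identical either way,
# and so is the pooled (eos) output that feeds the text projection.
print(torch.allclose(out_bang.last_hidden_state[:, :eos_pos + 1],
                     out_eos.last_hidden_state[:, :eos_pos + 1]))      # True
print(torch.allclose(out_bang.pooler_output, out_eos.pooler_output))   # True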

@laksjdjf
Author

But note that for the tokenizer used in SD2 and SDXL it's already set correctly cf https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/tokenizer/special_tokens_map.json#L16

My concern is that the above process will result in ! no longer being available in its normal sense.

@patrickvonplaten
Contributor

Hey @laksjdjf,

As @patil-suraj mentioned CLIP never used a padding token for training. It was trained with a causal mask and only tokens until the eos token are taken into account when computing the CLIP contrastive loss. All tokens following the eos token have no influence on the model, so one could have added any token here.

Now, it only matters if a pretrained CLIP is further fine-tuned, as is done for SD. In this case the padding token was used to influence the loss, and in that sense SD does not distinguish between ! and a padding token. But this is purely due to the way SD uses CLIP for fine-tuning; it is not an inherent characteristic of CLIP.

@laksjdjf
Author

Hmmm, maybe I just don't understand, but my question is about the behaviour of the ! token, not the behaviour of the pad token. If the ! token is treated as the pad token, there is a difference when processing text that contains !:

>>> import open_clip
>>> from transformers import CLIPTokenizer
>>> tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", subfolder="tokenizer_2")
>>> prompt = "! !! !!!"
>>> input_ids = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt").input_ids

>>>print(input_ids)
tensor([[49406,     0,     0,     0,     0,     0,     0, 49407,     0,     0, ...

>>>print(open_clip.tokenize(prompt))
tensor([[49406,   256,   748,   995, 49407,     0,     0,     0,     0,     0, ...

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Clement-Lelievre

hello,

the fact that the pad tokens differ across the two SDXL-base-1.0 tokenizers raises a RuntimeError in a specific case when using the compel library.
See issue here on the compel repo to understand the specific case and here on SDXL's repo.

I am not 100% sure where the responsibility lies in this case, can anyone advise?
First question, why do these two pad tokens differ?

Thank you!

cc @patrickvonplaten

@ArthurZucker
Collaborator

Hey! I'd recommend opening an issue on diffusers, as this seems to be related to the tokenizer settings, not an internal bug!
