`CLIPTokenizer` does not work as expected #2018

fdtomasi · 2024-12-11T16:00:01Z

To Reproduce

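from keras_hub import models

# Build the CLIP tokenizer with a fixed length and end-token padding.
tokenizer = models.Tokenizer.from_preset(
    "clip_vit_h_14_laion2b_s32b_b79k",
    sequence_length=77,
    pad_with_end_token=True,
)
# Wrap it in the preprocessor, which also returns a padding mask.
preprocessor = models.CLIPPreprocessor(tokenizer, sequence_length=77)
preprocessor(["a cat sitting on the table"])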

which returns

{'token_ids': <tf.Tensor: shape=(1, 77), dtype=int32, numpy=
 array([[49406,   320,  2368,  4919,   525,   518,  2175,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0, 49407]], dtype=int32)>,
 'padding_mask': <tf.Tensor: shape=(1, 77), dtype=bool, numpy=
 array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True]])>}

This is surprising for a few reasons. First, even with pad_with_end_token=True, the padding uses 0 (which corresponds to ! in this vocabulary). Second, the end token is placed after the padding instead of directly after the original sequence.
Finally, padding_mask is all True, while I would expect it to be False at the padding positions.
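
To make the expected layout concrete, here is a minimal pure-Python sketch of what I would expect for one row. It is not keras_hub code: `expected_clip_row` is a hypothetical helper, the start and end ids (49406, 49407) are simply read off the output above, and truncating to `sequence_length - 2` text tokens is my assumption.

```python
def expected_clip_row(text_ids, sequence_length, start_id=49406, end_id=49407):
    """Hypothetical helper: start + text + end, then pad with the end token."""
    # Assumption: leave room for the start and end tokens when truncating.
    tokens = [start_id] + list(text_ids)[: sequence_length - 2] + [end_id]
    num_real = len(tokens)
    num_pad = sequence_length - num_real
    token_ids = tokens + [end_id] * num_pad
    # True marks real tokens (including start/end), False marks padding.
    padding_mask = [True] * num_real + [False] * num_pad
    return token_ids, padding_mask


# Text ids for "a cat sitting on the table", copied from the output above.
text_ids = [320, 2368, 4919, 525, 518, 2175]
token_ids, padding_mask = expected_clip_row(text_ids, sequence_length=77)
```

With this layout the end token directly follows the text, every padding position holds the end token, and the mask is False exactly on the padding.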

Additional context
Using keras_hub==0.18.1, keras==3.7.0.

james77777778 · 2024-12-24T10:11:17Z

You can work around the issue by not passing sequence_length to Tokenizer.from_preset and setting it only on CLIPPreprocessor.
I have proposed a fix for this in #2031.

import keras_hub

preset = "clip_vit_h_14_laion2b_s32b_b79k"
text = ["a cat sitting on the table"]

tokenizer = keras_hub.models.Tokenizer.from_preset(
    preset, pad_with_end_token=True
)
preprocessor = keras_hub.models.CLIPPreprocessor(tokenizer, sequence_length=77)
print(preprocessor(text))

{'token_ids': Array([[49406,   320,  2368,  4919,   525,   518,  2175, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407]], dtype=int32), 'padding_mask': Array([[ True,  True,  True,  True,  True,  True,  True,  True, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False]], dtype=bool)}
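
A quick way to compare the two outputs is a small checker over one row. This is only a sketch in plain Python: `padding_problems` is a hypothetical helper, not part of keras_hub, and the rows below are rebuilt by hand from the printed arrays in this thread (6 text tokens, length 77).

```python
END_ID = 49407  # end-of-text id as it appears in the printed outputs


def padding_problems(token_ids, padding_mask, num_text_tokens, end_id=END_ID):
    """Hypothetical checker: list how one row deviates from the expected layout."""
    num_real = num_text_tokens + 2  # start token + text tokens + end token
    problems = []
    if token_ids[num_real - 1] != end_id:
        problems.append("end token does not directly follow the text")
    if any(t != end_id for t in token_ids[num_real:]):
        problems.append("padding is not the end token")
    if not all(padding_mask[:num_real]) or any(padding_mask[num_real:]):
        problems.append("padding_mask does not match the real tokens")
    return problems


prefix = [49406, 320, 2368, 4919, 525, 518, 2175]
# Row as reported in the issue: zeros as padding, end token last, mask all True.
reported = prefix + [0] * 69 + [END_ID]
reported_problems = padding_problems(reported, [True] * 77, num_text_tokens=6)
# Row from the workaround: end-token padding, mask False on padding.
workaround = prefix + [END_ID] * 70
workaround_problems = padding_problems(
    workaround, [True] * 8 + [False] * 69, num_text_tokens=6
)
```

The reported row fails all three checks, while the workaround row passes them.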

mehtamansi29 self-assigned this Dec 12, 2024

mehtamansi29 added the type:Bug Something isn't working label Dec 23, 2024

james77777778 linked a pull request Dec 24, 2024 that will close this issue

Fix sequence_length option in CLIPTokenizer #2031

Open
