VLM: special multimodal Tokenizer #34461

zucchini-nlp · 2024-10-28T07:28:38Z

What does this PR do?

Part of Major VLM standardization (#33948). Special tokens that are present in all VLMs will become attributes of XXXTokenizer. This will make our lives easier when doing processing manipulations and/or formatting the prompt manually, as we can simply call self.tokenizer.image_token.

Currently, any VLM special tokens we need are saved in the processor config, but not all models save them, since not all models use them when calling the processor. After this PR I'll go over the models and clean up the processing code given the changes. We might still have to support the old way, though, because we can't make changes that could break loading configs from the Hub.
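To make the "support the old way" point concrete, here is a minimal sketch with stand-in classes, not the real transformers ones. It assumes the processor prefers the tokenizer attribute and falls back to a value stored in its own config for older checkpoints. `ToyTokenizer`, `ToyProcessor`, `expand_prompt` and the `"<image>"` default are hypothetical names used only for illustration.

```python
from typing import Optional


class ToyTokenizer:
    """Stand-in for a tokenizer; NOT the transformers class."""

    def __init__(self, image_token: Optional[str] = None):
        # Newer checkpoints would expose the token as a tokenizer attribute.
        if image_token is not None:
            self.image_token = image_token


class ToyProcessor:
    """Stand-in processor that resolves the image token from either source."""

    def __init__(self, tokenizer: ToyTokenizer, image_token: str = "<image>"):
        self.tokenizer = tokenizer
        # Prefer the tokenizer attribute; fall back to the processor-config
        # value so that configs saved before the change keep working.
        self.image_token = getattr(tokenizer, "image_token", image_token)

    def expand_prompt(self, prompt: str, num_image_tokens: int) -> str:
        # Example of a processing manipulation: repeat the placeholder.
        return prompt.replace(self.image_token, self.image_token * num_image_tokens)


new_style = ToyProcessor(ToyTokenizer(image_token="<img>"))
old_style = ToyProcessor(ToyTokenizer(), image_token="<image>")
```

With this shape, processing code only ever reads `self.image_token`, regardless of where the value came from.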

HuggingFaceDocBuilderDev · 2024-10-28T07:55:26Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp · 2024-10-28T13:45:27Z

Should be ready for review @ArthurZucker !

I think we'll support simple non-multimodal tokenizers for quite a while in VLMs, no idea yet how/when to make this a new default

ArthurZucker

Okay, super super good! The only thing I don't like is the is_multimodal!
I think what you added gives a lot of freedom to all tokenizers -> audio_cls_token or anything that ends with token / ends with id will be properly processed!

Let's remove the is_multimodal and we should be good!
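A rough sketch of the kind of name-based lookup described here, assuming a toy class rather than the actual transformers implementation: any registered name resolves to its token string, and the same name with an `_id` suffix resolves to the vocab id. `ToySpecialTokens` and the token strings are hypothetical.

```python
class ToySpecialTokens:
    """Toy model of attribute lookup for named special tokens; not the transformers code."""

    def __init__(self, vocab: dict, named_tokens: dict):
        self._vocab = vocab                # token string -> id
        self._named_tokens = named_tokens  # attribute name -> token string

    def __getattr__(self, name: str):
        # __getattr__ only runs when normal lookup fails, so the private
        # attributes set in __init__ normally never reach this method.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._named_tokens:
            return self._named_tokens[name]
        if name.endswith("_id") and name[:-3] in self._named_tokens:
            return self._vocab[self._named_tokens[name[:-3]]]
        raise AttributeError(f"{type(self).__name__} has no attribute {name!r}")


tokens = ToySpecialTokens(
    vocab={"<image>": 7, "<audio_cls>": 8},
    named_tokens={"image_token": "<image>", "audio_cls_token": "<audio_cls>"},
)
```

Nothing in the lookup is specific to images, which is why a separate multimodal flag is not needed in this toy version.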

ArthurZucker

Perfect

ArthurZucker · 2024-10-30T07:55:47Z

docs/source/en/main_classes/tokenizer.md

+tokenizer = AutoTokenizer.from_pretrained(model_id)
+tokenizer.extra_special_tokens = ["image_token", "boi_token", "eoi_token"]


Suggested change

-tokenizer = AutoTokenizer.from_pretrained(model_id)
-tokenizer.extra_special_tokens = ["image_token", "boi_token", "eoi_token"]
+tokenizer = AutoTokenizer.from_pretrained(model_id, extra_special_tokens = ["image_token", "boi_token", "eoi_token"])

let's add a small test for this

Yes, this is actually not correct anymore hehe, I forgot to update the docs. And there is already a test for that, so we are good.

The new way of adding extra special tokens is a dict that maps attribute names to token strings: tokenizer.extra_special_tokens = {"eoi_token": "<s>", "image_token": "<image>"}. After adding this line and saving the tokenizer, loading it back will do the magic and the tokenizer will have a self.image_token attribute.

we should be able to pass it as input as well instead of forcing people to use the setter! 🤗

Yep, realized that later and added it to the docs in place of the "save and load back" flow. Also extended the test.
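The two paths discussed in this thread can be sketched with a toy tokenizer, assuming the dict form described above. This is not the transformers class: `ToyTokenizer`, its `save`/`load` methods and the `toy_tokenizer_config.json` file name are hypothetical and only mimic the behaviour.

```python
import json
import os
import tempfile


class ToyTokenizer:
    """Toy tokenizer that persists named special tokens; not the transformers class."""

    def __init__(self, extra_special_tokens=None):
        # A dict maps attribute names to token strings, e.g. {"image_token": "<image>"}.
        self.extra_special_tokens = dict(extra_special_tokens or {})
        for name, token in self.extra_special_tokens.items():
            setattr(self, name, token)

    def save(self, directory):
        path = os.path.join(directory, "toy_tokenizer_config.json")  # hypothetical file name
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"extra_special_tokens": self.extra_special_tokens}, f)

    @classmethod
    def load(cls, directory, **overrides):
        path = os.path.join(directory, "toy_tokenizer_config.json")
        with open(path, encoding="utf-8") as f:
            config = json.load(f)
        config.update(overrides)  # in this toy, load-time arguments win over saved values
        return cls(**config)


with tempfile.TemporaryDirectory() as tmp:
    # Path 1: set the dict on an existing tokenizer, save, and load back.
    tok = ToyTokenizer()
    tok.extra_special_tokens = {"eoi_token": "<s>", "image_token": "<image>"}
    tok.save(tmp)
    reloaded = ToyTokenizer.load(tmp)

    # Path 2: pass the dict at load time instead of using the setter.
    ToyTokenizer().save(tmp)
    direct = ToyTokenizer.load(tmp, extra_special_tokens={"image_token": "<image>"})
```

In the toy, `tok` itself has no `image_token` attribute until it is reloaded, which is the inconvenience that passing the dict at load time avoids.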

docs/source/en/main_classes/tokenizer.md

Co-authored-by: Arthur <[email protected]>

ArthurZucker

Very nice!

ArthurZucker · 2024-10-30T10:42:29Z

src/transformers/tokenization_utils_base.py

@@ -1633,6 +1443,9 @@ def __init__(self, **kwargs):

        super().__init__(**kwargs)

+        self.extra_special_tokens = kwargs.pop("extra_special_tokens", {})
+        self._set_model_specific_special_tokens(special_tokens=self.extra_special_tokens)


When we do this, we don't add them to the tokenizer vocab, right?

I think you are already checking that these tokens are added to the vocab if not already present, right?

If a special token is not present in the vocab, we do add it as a new token to the tokenizer vocab. Should we prevent users from adding new tokens and only allow tokens that are already in the vocab?

It happens because the tokenizer was already wired to do that, irrespective of the current changes:

# 4. If some of the special tokens are not part of the vocab, we add them, at the end.
# the order of addition is the same as self.SPECIAL_TOKENS_ATTRIBUTES following `tokenizers`

No, it's alright IMO, we have not really seen reports about that.
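For readers following along, the behaviour under discussion can be illustrated with a small stand-alone function. It is a toy, not the library code: `add_missing_special_tokens` is a hypothetical helper, and it assumes the vocab is a plain dict with contiguous ids starting at 0.

```python
def add_missing_special_tokens(vocab: dict, special_tokens: dict) -> list:
    """Append special tokens that are absent from `vocab`; return the ones added.

    Toy illustration only; `vocab` maps token string -> id and is modified in place.
    """
    added = []
    for token in special_tokens.values():
        if token not in vocab:
            vocab[token] = len(vocab)  # new ids go at the end of the vocab
            added.append(token)
    return added


vocab = {"<s>": 0, "</s>": 1, "hello": 2}
added = add_missing_special_tokens(vocab, {"eoi_token": "<s>", "image_token": "<image>"})
```

Here "<s>" is reused with its existing id, while "<image>" is appended as a new entry.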

* kinda works
* update
* add tests
* update
* use special tokens in processors
* typo
* fix copies
* fix
* fix moshi after rebase
* update
* fix tests
* update
* Update docs/source/en/main_classes/tokenizer.md

  Co-authored-by: Arthur <[email protected]>
* update docs
* test for load time adding tokens
* fix some more tests which are now fetched better
* one more fix

---------

Co-authored-by: Arthur <[email protected]>

zucchini-nlp added 5 commits October 5, 2024 16:14

kinda works

d4778c9

update

0ba73ec

add tests

116a49e

update

661b5df

use special tokens in processors

5ed9379

zucchini-nlp added 6 commits October 28, 2024 10:31

typo

a284f31

Merge remote-tracking branch 'upstream/main' into vlm-tokenizer

9ad5362

fix copies

1def73c

fix

8dfa536

fix moshi after rebase

200879b

Merge branch 'main' into vlm-tokenizer

e0bf53b

ArthurZucker reviewed Oct 29, 2024


zucchini-nlp added 4 commits October 29, 2024 17:07

update

7e5c4ba

Merge branch 'main' into vlm-tokenizer

b77f3be

fix tests

8b61969

update

276c55e

ArthurZucker approved these changes Oct 30, 2024


zucchini-nlp and others added 3 commits October 30, 2024 09:12

Update docs/source/en/main_classes/tokenizer.md

3177519

Co-authored-by: Arthur <[email protected]>

update docs

46a30e5

test for load time adding tokens

d7c4eb5

ArthurZucker approved these changes Oct 30, 2024


zucchini-nlp added 6 commits October 31, 2024 16:48

Merge branch 'main' into vlm-tokenizer

f3b102e

Merge branch 'main' into vlm-tokenizer

5240129

fix some more tests which are now fetched better

58bf9e7

one more fix

76b39aa

Merge branch 'main' into vlm-tokenizer

9994aed

Merge branch 'main' into vlm-tokenizer

c21997d

zucchini-nlp merged commit 187439c into huggingface:main Nov 4, 2024
26 checks passed

dvrogozh mentioned this pull request Nov 20, 2024

[Volta] [No flash attention] Llama 3.1 8B Instruct failed to start - "< not supported between instances of 'NoneType' and 'int'" huggingface/text-generation-inference#2440

Closed
