Fix error when decoding a token in the id gap (or out of range) in a tiktoken tokenizer #841
Previously, decoding an invalid token id would crash the tokenizer. For a trained model this is never an issue, since it will not produce invalid tokens, but an untrained model can randomly emit any id up to the embedding size, including ids that fall in vocabulary gaps or beyond the vocabulary. This PR makes the tokenizer return an empty string for these invalid tokens instead.
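A minimal sketch of the guarded decode path, assuming a wrapper that holds a `tiktoken` `Encoding` in `self.encoding`. The method name `_convert_id_to_token` mirrors the Hugging Face slow-tokenizer hook; the class and its internals here are illustrative, not the PR's verbatim diff.

```python
import tiktoken


class TiktokenWrapperSketch:
    """Illustrative wrapper showing the fallback for invalid token ids."""

    def __init__(self, model_name: str = 'gpt-4'):
        self.encoding = tiktoken.encoding_for_model(model_name)

    def _convert_id_to_token(self, index: int) -> str:
        """Return the token string for `index`, or '' if the id falls in a
        vocabulary gap or is out of range (e.g. sampled from an untrained
        model whose embedding size exceeds the vocab size)."""
        try:
            # decode_single_token_bytes raises KeyError for ids that are
            # not in the tiktoken vocabulary.
            token_bytes = self.encoding.decode_single_token_bytes(index)
            return token_bytes.decode('utf-8', errors='replace')
        except KeyError:
            return ''
```

With this guard, decoding a sequence containing a mix of valid and invalid ids simply drops the invalid ones rather than raising.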
This behavior matches some Hugging Face fast tokenizers, but not slow tokenizers, which raise on out-of-range indices. Even though the tiktoken wrapper is technically a slow tokenizer, the fast behavior seems better, since it avoids crashes on random models.
It also adds an explicit UTF-8 encoding to two file `open` calls. See mosaicml/composer#2824 for more details.
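Illustrative only (the path below is a hypothetical placeholder, not one of the PR's actual call sites): without an explicit encoding, `open` falls back to the platform default (e.g. cp1252 on Windows), which can corrupt or fail on non-ASCII vocabulary files.

```python
# Hypothetical vocab file path, for illustration only.
vocab_path = 'tokenizer.json'

# Explicit encoding keeps reads consistent across platforms.
with open(vocab_path, encoding='utf-8') as f:
    contents = f.read()
```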