Fix error when decoding a token in the id gap (or out of range) in a tiktoken tokenizer #841
Previously, decoding an invalid token id would crash the tokenizer. For a trained model this is never an issue, since it will not produce invalid tokens, but an untrained model can randomly emit any id up to the embedding size, including ids that fall in vocabulary gaps or beyond the vocabulary. This PR makes the tokenizer return an empty string for these invalid tokens instead.
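A minimal sketch of the guarded decode path, assuming a wrapper that holds a `tiktoken` `Encoding` in `self.encoding`. The method name `_convert_id_to_token` mirrors the Hugging Face slow-tokenizer hook; the class and its internals here are illustrative, not the PR's verbatim diff.

```python
import tiktoken


class TiktokenWrapperSketch:
    """Illustrative wrapper showing the fallback for invalid token ids."""

    def __init__(self, model_name: str = 'gpt-4'):
        self.encoding = tiktoken.encoding_for_model(model_name)

    def _convert_id_to_token(self, index: int) -> str:
        """Return the token string for `index`, or '' if the id falls in a
        vocabulary gap or is out of range (e.g. sampled from an untrained
        model whose embedding size exceeds the vocab size)."""
        try:
            # decode_single_token_bytes raises KeyError for ids that are
            # not in the tiktoken vocabulary.
            token_bytes = self.encoding.decode_single_token_bytes(index)
            return token_bytes.decode('utf-8', errors='replace')
        except KeyError:
            return ''
```

With this guard, decoding a sequence containing a mix of valid and invalid ids simply drops the invalid ones rather than raising.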
This behavior matches some Hugging Face fast tokenizers, but not slow tokenizers, which raise on out-of-range indices. Even though the tiktoken wrapper is technically a slow tokenizer, the fast behavior seems better, since it avoids crashes on random models.
It also adds an explicit UTF-8 encoding to two file `open` calls. See mosaicml/composer#2824 for more details.
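Illustrative only (the path below is a hypothetical placeholder, not one of the PR's actual call sites): without an explicit encoding, `open` falls back to the platform default (e.g. cp1252 on Windows), which can corrupt or fail on non-ASCII vocabulary files.

```python
# Hypothetical vocab file path, for illustration only.
vocab_path = 'tokenizer.json'

# Explicit encoding keeps reads consistent across platforms.
with open(vocab_path, encoding='utf-8') as f:
    contents = f.read()
```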