[test] fix transformers neuronx integration test failure #2539
Conversation
Do we need 4.45.2 for neuron at the moment? (I think yes for multimodal?)
```python
"""
underlying_tokenizer = tokenizer.backend_tokenizer
if underlying_tokenizer.pre_tokenizer is None:
    split_pattern = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
```
where does this come from?
This comes from the Llama 3.2 1B tokenizer.json: https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/tokenizer.json.
That file has the raw JSON version; here I have just converted it into an object:
```json
"pre_tokenizer": {
  "type": "Sequence",
  "pretokenizers": [
    {
      "type": "Split",
      "pattern": {
        "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
      },
      "behavior": "Isolated",
      "invert": false
    },
    {
      "type": "ByteLevel",
      "add_prefix_space": false,
      "trim_offsets": true,
      "use_regex": false
    }
  ]
}
```
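For reference, the same two-step pre-tokenizer (the Split rule followed by ByteLevel) can be reconstructed programmatically. This is a minimal sketch, assuming the `tokenizers` package is installed; note that the Python `ByteLevel` pre-tokenizer constructor does not expose `trim_offsets` (that option belongs to the post-processor), so it is omitted here:

```python
from tokenizers import Regex, pre_tokenizers

# The Split regex from Llama 3.2's tokenizer.json (single-backslash form,
# since a Python raw string does not need the JSON escaping).
split_pattern = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# Split on the regex, keeping each match as its own piece ("isolated"),
# then byte-level encode each piece without re-splitting (use_regex=False).
pre_tok = pre_tokenizers.Sequence([
    pre_tokenizers.Split(Regex(split_pattern), behavior="isolated", invert=False),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])

# Leading spaces are absorbed into the following word and then byte-mapped,
# so they appear as the "Ġ" marker after the ByteLevel step.
pieces = [piece for piece, _ in pre_tok.pre_tokenize_str("Hello world")]
```

Assigning `pre_tok` to `tokenizer.backend_tokenizer.pre_tokenizer` is what the fallback in the diff above effectively does when the loaded tokenizer has no pre-tokenizer configured.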
Yes. It is needed for the vLLM 0.6.2 version. #2518 multimodal support is also based on vLLM 0.6.2.
Spoke offline - we're going to look into whether we need the tokenizer-training step we currently do. If we can just use the tiny llama tokenizer directly, that simplifies things here.
Force-pushed from a47657f to d00d6c1
Description
EDIT: Even tiny models have a higher context length, so we were training a model and tokenizer at runtime to get a smaller 4k context length. I have now trained the model offline and uploaded it to S3, so the previous fix is not needed.
This should fix the TNX unit test failures in the integration tests. They are caused by transformers getting upgraded to 4.45.2. This does not affect our handlers; it only affects `train_new_from_iterator` in transformers. Raised a PR in transformers to fix it: huggingface/transformers#34661