
[test] fix transformers neuronx integration test failure #2539

Merged
sindhuvahinis merged 2 commits into deepjavalibrary:master from the fix branch on Nov 18, 2024

Conversation

@sindhuvahinis (Contributor) commented Nov 8, 2024

Description

EDIT: Even tiny models ship with a large context length, so we previously trained the model and tokenizer at runtime to use a smaller 4k context length. I have now trained the model offline and uploaded it to S3, so the previous fix is no longer needed.


This should fix the transformers-neuronx (tnx) unit test failures in the integration tests. The failures are caused by transformers being upgraded to 4.45.2. This does not affect our handlers; it only affects train_new_from_iterator in transformers.

Raised a PR in transformers to fix it: huggingface/transformers#34661
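
For context, here is a minimal sketch of the kind of runtime tokenizer training the tests relied on; the model name, corpus, and vocab size are illustrative, not the actual test code. It shows the train_new_from_iterator call affected by the 4.45.2 upgrade and the 4k context-length cap mentioned above.

from transformers import AutoTokenizer

# Illustrative only: any fast tokenizer works; the tests use a tiny llama model.
base = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

def text_iterator():
    # A tiny corpus, just enough to exercise train_new_from_iterator.
    yield "hello world"
    yield "transformers neuronx integration test"

# train_new_from_iterator is the transformers API affected by the 4.45.2 upgrade.
small_tokenizer = base.train_new_from_iterator(text_iterator(), vocab_size=1000)
small_tokenizer.model_max_length = 4096  # cap the context length at 4k
small_tokenizer.save_pretrained("tiny-llama-tokenizer")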

@sindhuvahinis sindhuvahinis requested review from zachgk and a team as code owners November 8, 2024 23:35
@siddvenk (Contributor):
Do we need 4.45.2 for neuron at the moment? (I think yes for multimodal?)

"""
underlying_tokenizer = tokenizer.backend_tokenizer
if underlying_tokenizer.pre_tokenizer is None:
split_pattern = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
Contributor:

where does this come from?

Contributor Author:

This comes from the Llama 3.2 1B tokenizer.json: https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/tokenizer.json.
Here is the raw JSON version; in the code I just converted it into an object.

  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  }
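
For completeness, a sketch (not part of this PR) of how that pre_tokenizer could be rebuilt with the tokenizers library when backend_tokenizer.pre_tokenizer is missing, assuming tokenizer is the fast tokenizer from the snippet above; it mirrors the Split + ByteLevel sequence in the JSON.

from tokenizers import Regex, pre_tokenizers

split_pattern = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

if tokenizer.backend_tokenizer.pre_tokenizer is None:
    tokenizer.backend_tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        # Split on the Llama 3.2 regex, matching the "Isolated" behavior in tokenizer.json.
        pre_tokenizers.Split(Regex(split_pattern), behavior="isolated", invert=False),
        # ByteLevel without a prefix space or its own regex, matching the second entry
        # (trim_offsets is left at its default here).
        pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
    ])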

@sindhuvahinis (Contributor Author):
Do we need 4.45.2 for neuron at the moment? (I think yes for multimodal?)

Yes. It is needed for vLLM 0.6.2. The multimodal support in #2518 is also based on vLLM 0.6.2.

@siddvenk (Contributor):
Spoke offline: we're going to look into whether we still need the tokenizer training we currently do. If we can use the tiny llama tokenizer directly, that simplifies things here.

@sindhuvahinis sindhuvahinis force-pushed the fix branch 2 times, most recently from a47657f to d00d6c1 Compare November 18, 2024 17:15
@sindhuvahinis sindhuvahinis merged commit 746fdd5 into deepjavalibrary:master Nov 18, 2024
9 checks passed
sindhuvahinis added a commit to sindhuvahinis/djl-serving that referenced this pull request Nov 18, 2024
@sindhuvahinis sindhuvahinis deleted the fix branch December 2, 2024 18:34