Feat(test): Add tests for alpaca chatml prompt tokenizer #1088

Merged

JohanWork
Contributor

@JohanWork JohanWork commented Jan 10, 2024

Related to #112: this PR adds tests for the Alpaca prompt tokenizer.

For reference and testing, I also compared the results against the script below.

from tokenizers import AddedToken
from transformers import AutoTokenizer

# Tokenization runs on CPU, so no device or dtype setup is needed here.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Register the ChatML markers: <|im_end|> replaces the EOS token,
# and <|im_start|> is appended as an additional token.
tokenizer.add_special_tokens(
    {
        "eos_token": AddedToken(
            "<|im_end|>", rstrip=False, lstrip=False, normalized=False
        )
    }
)
tokenizer.add_tokens(
    [
        AddedToken("<|im_start|>", rstrip=False, lstrip=False, normalized=False),
    ]
)

# Token ids to decode back into text, grouped by ChatML turn.
test = [
    # first turn
    1, 32001, 1587, 13, 20548, 336, 349, 396, 13126, 369, 13966, 264, 3638, 28725, 5881, 1360, 395, 396, 2787, 369, 5312, 3629, 2758, 28723, 12018, 264, 2899, 369, 6582, 1999, 2691, 274, 272, 2159, 28723, 32000, 28705, 13,
    # second turn
    32001, 2188, 13, 16627, 11931, 456, 12271, 354, 668, 3572, 304, 18756, 3479, 17179, 13, 2428, 854, 28711, 1497, 516, 11314, 304, 1749, 272, 1846, 324, 440, 32000, 28705, 13,
    # third turn
    32001, 13892, 13, 650, 5967, 516, 11314, 304, 1749, 272, 9926, 28723, 32000,
]
print(tokenizer.decode(test))

text = "<|im_start|> system\nBelow is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.<|im_end|>"
encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False)
print(encodeds)

text = "<|im_start|> user\nEvaluate this sentence for spelling and grammar mistakes\nHe finnished his meal and left the resturant<|im_end|>"
encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False)
print(encodeds)
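
For readers unfamiliar with the layout being compared, the following is a minimal sketch of how one Alpaca-style record might be framed as ChatML turns. `build_chatml_prompt` is a hypothetical helper written for illustration, not axolotl's implementation; the exact whitespace axolotl emits may differ, and the response text in the usage example is an assumption.

```python
# Hypothetical helper for illustration only; not part of axolotl.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(system: str, instruction: str, input_text: str, response: str) -> str:
    """Frame one Alpaca-style record as system, user, and assistant ChatML turns."""
    # The optional input is appended to the instruction on its own line.
    user = f"{instruction}\n{input_text}" if input_text else instruction
    turns = [("system", system), ("user", user), ("assistant", response)]
    # Each turn is wrapped as <|im_start|>role\ncontent<|im_end|>.
    return "\n".join(f"{IM_START}{role}\n{content}{IM_END}" for role, content in turns)


if __name__ == "__main__":
    print(
        build_chatml_prompt(
            system=(
                "Below is an instruction that describes a task, paired with an input "
                "that provides further context. Write a response that appropriately "
                "completes the request."
            ),
            instruction="Evaluate this sentence for spelling and grammar mistakes",
            input_text="He finnished his meal and left the resturant",
            # Assumed response text, used only to complete the example.
            response="He finished his meal and left the restaurant.",
        )
    )
```

Tokenizing the string this produces and comparing the ids against an expected list is one way to cross-check a prompt tokenizer test by hand.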

@JohanWork JohanWork marked this pull request as ready for review January 14, 2024 21:03
@JohanWork
Contributor Author

@NanoCode012 #112 Tagging you since you're the creator of the issue; feel free to comment and give feedback.

Collaborator

@NanoCode012 NanoCode012 left a comment

Hello. Thank you for this PR. It's good to have more tests. We do have some small alpaca tests here, but they're for regular alpaca.

https://github.com/OpenAccess-AI-Collective/axolotl/blob/c1b741d9fb51c6e2d5bcda960524bbdd6fc21bf7/tests/test_prompt_tokenizers.py#L269-L306

Yours seems more detailed and targets the chatml variant. Perhaps this class should be named TestAlpacaChatml.

Also, I noticed a few places where alpaca is misspelled as alpacha, including the file name. Would it be possible to update them to alpaca (without the h)?

@JohanWork
Contributor Author

JohanWork commented Jan 15, 2024

Thanks for the feedback! I agree with it and have updated the PR. I also addressed your comment about importing PromptStyle.

@NanoCode012 NanoCode012 changed the title from "Add: test alpacha prompt tokenizer" to "Feat(test): Add tests for alpaca chatml prompt tokenizer" Jan 17, 2024
@JohanWork
Contributor Author

@NanoCode012 let me know if there is anything more, or if it is good to go :)

Collaborator

@NanoCode012 NanoCode012 left a comment

Hey, thanks for the reminder!

@NanoCode012 NanoCode012 merged commit 5439707 into axolotl-ai-cloud:main Jan 23, 2024
7 checks passed
djsaunde pushed a commit that referenced this pull request Dec 17, 2024
* draft for adding test for tokenizer

* clean up

* clean up

* fix pre commit

* fix pylint

* Revert "fix pylint"

This reverts commit cd2cda3.

* add pylint exception for pytest fixture

* update comments

* Apply suggestions from code review

Co-authored-by: NanoCode012 <[email protected]>

* update spelling and import promptstyle

* reaname, restrucure

* clean up

* add fmt:on

---------

Co-authored-by: NanoCode012 <[email protected]>

3 participants