
Strange tokenizer results in notebook tab #6558

Open
1 task done
Weker01 opened this issue Dec 2, 2024 · 1 comment
Labels
bug Something isn't working

Comments


Weker01 commented Dec 2, 2024

Describe the bug

I am using the model MN-12B-Mag-Mell-Q8_0.gguf, which has special tokens for ChatML, but I have noticed this with other models too. Token 14, for example, is <|im_start|>.

When I run the llama.cpp server and query the /tokenize endpoint manually on <|im_start|>, I get the expected token 14, but in the notebook tab this is not the case:

I get the following tokens instead. Detokenizing these with the llama.cpp server reveals that they do indeed translate back to the text <|im_start|>. This also matches the number of tokens counted in the main (Raw) notebook tab.

1      -  ''
1060   -  '<'
1124   -  '|'
1329   -  'im'
18993  -  '_start'
1124   -  '|'
1062   -  '>'
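
For reference, the manual check against the llama.cpp server can be reproduced like this (a minimal sketch in Python; the server address http://localhost:8080 is an assumed default, and the expected ids are specific to this model):

```python
import requests

BASE = "http://localhost:8080"  # assumed default llama.cpp server address

# /tokenize parses the special token directly
resp = requests.post(f"{BASE}/tokenize", json={"content": "<|im_start|>"})
print(resp.json())  # expected for this model: {"tokens": [14]}

# /detokenize confirms that the split tokens from the notebook tab
# still map back to the same text
resp = requests.post(
    f"{BASE}/detokenize",
    json={"tokens": [1, 1060, 1124, 1329, 18993, 1124, 1062]},
)
print(resp.json())  # expected: {"content": "<|im_start|>"}
```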

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

1. Download a model with special ChatML tokens, like MN-12B-Mag-Mell or countless others.
2. Type the special ChatML token into the notebook.
3. Go to the Tokens tab and see that the special token is not produced (see the sketch after the token list above).

Screenshot

No response

Logs

There are no error logs specific to this as far as I know.

System Info

Arch Linux
Nvidia
Manual install directly from the git repo.
Weker01 added the bug label Dec 2, 2024

Weker01 commented Dec 2, 2024

Well, I guess it works with llamacpp_HF; there I get the expected tokens. But why can the llama.cpp server do this automatically?
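
The difference is presumably llama.cpp's special-token parsing flag: the server tokenizes with it enabled, while a plain tokenize call leaves it off. A minimal sketch with llama-cpp-python illustrating the flag (the model path and the exact id 14 are assumptions for this model):

```python
from llama_cpp import Llama

# vocab_only loads only the tokenizer, not the model weights
llm = Llama(model_path="MN-12B-Mag-Mell-Q8_0.gguf", vocab_only=True)

# With special-token parsing, <|im_start|> matches a single control token
print(llm.tokenize(b"<|im_start|>", add_bos=False, special=True))   # e.g. [14]

# Without it, the text is split into ordinary vocabulary pieces,
# like the output seen in the notebook tab
print(llm.tokenize(b"<|im_start|>", add_bos=False, special=False))
```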

Weker01 closed this as not planned Dec 2, 2024
Weker01 reopened this Dec 2, 2024