
Strange tokenizer results in notebook tab #6558

Open
1 task done
Weker01 opened this issue Dec 2, 2024 · 1 comment
Labels
bug Something isn't working

Comments


Weker01 commented Dec 2, 2024

Describe the bug

I am using the model MN-12B-Mag-Mell-Q8_0.gguf, which has special tokens for ChatML, but I have noticed this with other models too. Token 14, for example, is <|im_start|>.

When I run the llama.cpp server and query the /tokenize endpoint manually on <|im_start|>, I get the expected token 14, but in the notebook tab this is not the case:

I get the following tokens instead. Detokenizing these with the llama.cpp server reveals that they do indeed translate back to the text <|im_start|>. This also matches the number of tokens counted in the main (Raw) notebook tab.

1      -  ''
1060   -  '<'
1124   -  '|'
1329   -  'im'
18993  -  '_start'
1124   -  '|'
1062   -  '>'
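
For reference, the manual check against the llama.cpp server can be reproduced like this (a minimal sketch in Python; the server address http://localhost:8080 is an assumed default, and the expected ids are specific to this model):

```python
import requests

BASE = "http://localhost:8080"  # assumed default llama.cpp server address

# /tokenize parses the special token directly
resp = requests.post(f"{BASE}/tokenize", json={"content": "<|im_start|>"})
print(resp.json())  # expected for this model: {"tokens": [14]}

# /detokenize confirms that the split tokens from the notebook tab
# still map back to the same text
resp = requests.post(
    f"{BASE}/detokenize",
    json={"tokens": [1, 1060, 1124, 1329, 18993, 1124, 1062]},
)
print(resp.json())  # expected: {"content": "<|im_start|>"}
```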

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

1. Download a model with special ChatML tokens, like MN-12B-Mag-Mell or countless others.
2. Type the special ChatML token into the notebook.
3. Go to the Tokens tab and see that the special token is not produced (see the sketch after the token list above).

Screenshot

No response

Logs

There are no error logs specific to this as far as I know.

System Info

Arch Linux
Nvidia
Manual install directly from the git repo.
Weker01 added the bug label Dec 2, 2024

Weker01 commented Dec 2, 2024

Well, I guess it works with llamacpp_HF; there I get the expected tokens. But why can the llama.cpp server do this automatically?
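
The difference is presumably llama.cpp's special-token parsing flag: the server tokenizes with it enabled, while a plain tokenize call leaves it off. A minimal sketch with llama-cpp-python illustrating the flag (the model path and the exact id 14 are assumptions for this model):

```python
from llama_cpp import Llama

# vocab_only loads only the tokenizer, not the model weights
llm = Llama(model_path="MN-12B-Mag-Mell-Q8_0.gguf", vocab_only=True)

# With special-token parsing, <|im_start|> matches a single control token
print(llm.tokenize(b"<|im_start|>", add_bos=False, special=True))   # e.g. [14]

# Without it, the text is split into ordinary vocabulary pieces,
# like the output seen in the notebook tab
print(llm.tokenize(b"<|im_start|>", add_bos=False, special=False))
```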

Weker01 closed this as not planned Dec 2, 2024
Weker01 reopened this Dec 2, 2024