Anime/Manga Search: Chinese / Japanese Tokenization for TypeSense #553

irfan-dahir · 2024-09-30T05:50:23Z

As reported here on Discord: https://discord.com/channels/460491088004907029/732509662280417360/1286222187304456256

Summary:
There are some Chinese characters that are not detected when searching for titles.

Japanese uses kanji. In this case the Chinese title is “少女终末旅行” and the Japanese title is “少女終末旅行”; only one character differs (“終”). There are many other examples like this, although they don't come up very often in practice. For example, “伤物语” and “伤物語” (語).

There are many similar examples:
千年女優 and 千年女优, 未来日記 and 未来日记, 魔女の宅急便 and 魔女的宅急便, 狼と香辛料 and 狼与香辛料

These seem to search fine:
电波女与青春男 and 電波女と青春男, 血界戦線 and 血界战线, 虚構推理 and 虚构推理
Or 狼与香辛料 and 狼と香辛料, where only one Japanese character differs

Noted by kimigaiiwuyi on Discord
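
To make the mismatch concrete, here is a minimal sketch that lists the positions where a Chinese title and its Japanese counterpart differ by code point. The helper name `differing_chars` is hypothetical and only meant for illustration; it assumes both titles have the same length, which holds for the kanji-only pairs quoted above but not for pairs such as 狼与香辛料 and 狼と香辛料 in general.

```python
def differing_chars(a: str, b: str) -> list[tuple[int, str, str]]:
    """Return (index, char_a, char_b) for every position where the titles differ.

    Hypothetical helper: a plain positional comparison, so both titles
    must have the same number of characters.
    """
    if len(a) != len(b):
        raise ValueError("titles must have the same length for a positional comparison")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Title pairs taken from the report above (Chinese first, Japanese second).
pairs = [
    ("少女终末旅行", "少女終末旅行"),
    ("千年女优", "千年女優"),
    ("未来日记", "未来日記"),
]

for zh, ja in pairs:
    for index, zh_char, ja_char in differing_chars(zh, ja):
        # The two characters look related to a human reader, but they are
        # distinct code points, so an exact token match treats them as different.
        print(f"{zh} / {ja}: index {index}: "
              f"{zh_char} (U+{ord(zh_char):04X}) vs {ja_char} (U+{ord(ja_char):04X})")
```

Each pair differs in a single character, which is enough to break a match unless the search engine normalizes or tokenizes the two scripts in a compatible way.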

For Chinese tokenization, Typesense needs a field-level indicator that the field contains Chinese characters. The problem at hand could be solved if we do language detection: if the title contains Chinese characters, it would go in a new Typesense field, which would have the "zh" locale flag to tell Typesense that it should use Chinese tokenization, fixing this issue afaik.
So we can try to fix it on our end, but I have no idea if this would work.
https://threads.typesense.org/2L156

Noted by @pushrbx
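
The idea above could be sketched roughly as follows. This is not an existing implementation: the field name `title_zh`, the helper names, and the script-range heuristic are all assumptions made for illustration, and the sketch assumes the flag mentioned above corresponds to Typesense's field-level `locale` setting with the value `"zh"`. It only builds plain dictionaries, so it does not call Typesense itself.

```python
def contains_han(text: str) -> bool:
    """True if any character is in the CJK Unified Ideographs block or Extension A."""
    return any("\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf" for ch in text)


def contains_kana(text: str) -> bool:
    """True if any character is in the Hiragana or Katakana blocks."""
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def looks_chinese(text: str) -> bool:
    """Heuristic (assumption): Han characters present and no kana.

    Kanji-only Japanese titles such as 少女終末旅行 also pass this check,
    because script ranges alone cannot tell them apart from Chinese.
    """
    return contains_han(text) and not contains_kana(text)


# Hypothetical schema fragment: "title_zh" is an invented field name.
# "locale" is assumed to be the field-level setting the comment refers to.
TITLE_FIELDS = [
    {"name": "title", "type": "string"},
    {"name": "title_zh", "type": "string", "locale": "zh", "optional": True},
]


def build_document(doc_id: str, title: str) -> dict:
    """Copy the title into the Chinese-tokenized field when the heuristic matches."""
    doc = {"id": doc_id, "title": title}
    if looks_chinese(title):
        doc["title_zh"] = title
    return doc


print(build_document("1", "狼与香辛料"))  # routed to title_zh as well
print(build_document("2", "狼と香辛料"))  # contains kana, stays in title only
```

Whether a separate field like this actually fixes the matching problem would still need to be verified against a real Typesense instance, as the comment above notes.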

This is something that would require further triage. If anyone knows the best way to handle this, please feel free to contribute.

irfan-dahir added the "help wanted" and "needs triage" labels on Sep 30, 2024

KimigaiiWuyi mentioned this issue Oct 8, 2024

少女终末旅行 is misidentified as 小圆外传 (识别少女终末旅行识别成小圆外传) KimigaiiWuyi/Bangumi_Auto_Rename#9

Open
