Anime/Manga Search: Chinese / Japanese Tokenization for TypeSense #553

Open
irfan-dahir opened this issue Sep 30, 2024 · 0 comments

@irfan-dahir
Contributor

As reported here on Discord: https://discord.com/channels/460491088004907029/732509662280417360/1286222187304456256

Summary:
There are some Chinese characters that are not detected when searching for titles.

Japanese uses kanji, which often differ from the Chinese forms of the same characters. In this case, the Chinese title is “少女终末旅行” and the Japanese title is “少女終末旅行”; they differ by a single character, “終”. There are many other examples like this, although they don't come up very often in practice. Another example is “伤物语” and “伤物語” (語).

There are many similar examples:
千年女優 and 千年女优, 未来日記 and 未来日记, 魔女の宅急便 and 魔女的宅急便, 狼と香辛料 and 狼与香辛料

These seem to search fine:
电波女与青春男 and 電波女と青春男, 血界戦線 and 血界战线, 虚構推理 and 虚构推理
Or 狼与香辛料 and 狼と香辛料, where the only difference is one Japanese character.

Noted by kimigaiiwuyi on Discord
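
One plausible reading of the report above is that the Chinese and Japanese forms of a character are distinct Unicode code points, so an index that does not normalize variants treats them as unrelated tokens. The following is a minimal sketch to make the differences visible; the helper name `differing_chars` is hypothetical and not part of this project or of Typesense.

```python
def differing_chars(a: str, b: str) -> list[tuple[str, str]]:
    """Return (a_char, b_char) pairs at the positions where two equal-length titles differ."""
    if len(a) != len(b):
        raise ValueError("titles must have the same length")
    return [(x, y) for x, y in zip(a, b) if x != y]


# Title pairs taken from the report above (Chinese form, Japanese form).
pairs = [
    ("少女终末旅行", "少女終末旅行"),
    ("千年女优", "千年女優"),
    ("未来日记", "未来日記"),
]

for zh, ja in pairs:
    for x, y in differing_chars(zh, ja):
        # Print both characters with their code points to show they are not the same character.
        print(f"{zh} / {ja}: {x} (U+{ord(x):04X}) vs {y} (U+{ord(y):04X})")
```

Each pair differs in exactly one position, which matches the "only one character" observation in the report. Whether this is the actual cause of the failed searches still needs to be confirmed against the Typesense index.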

For Chinese tokenization, Typesense needs a field-level indicator that the field contains Chinese characters. The problem at hand could be solved with language detection: if a title contains Chinese characters, it would go into a new Typesense field that carries the "ch" flag, which tells Typesense to use Chinese tokenization. As far as I know, that would fix this issue.
So we can try to fix it on our end, but I have no idea if this would work.
https://threads.typesense.org/2L156

Noted by @pushrbx
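
The idea above (detect the script, then route the title into a field with its own tokenization setting) could be sketched roughly as follows. This is only an illustration under several assumptions: the field names `title_japanese` and `title_chinese`, the helper names `title_field` and `build_document`, and the `locale` values `"ja"` and `"zh"` are all assumed here and should be checked against the Typesense documentation (the note above refers to the flag as "ch"). The detection heuristic is deliberately crude.

```python
import re

# Assumption: hiragana/katakana marks a title as Japanese; Han characters
# without any kana are treated as Chinese. Kanji-only Japanese titles
# (e.g. 少女終末旅行) are therefore routed to the Chinese field.
KANA = re.compile(r"[\u3040-\u30ff]")  # Hiragana and Katakana blocks
HAN = re.compile(r"[\u4e00-\u9fff]")   # CJK Unified Ideographs block


def title_field(title: str) -> str:
    """Pick the (hypothetical) index field a title should be stored in."""
    if KANA.search(title):
        return "title_japanese"
    if HAN.search(title):
        return "title_chinese"
    return "title"


# Hypothetical schema fragment: one field per tokenization setting.
# The `locale` values are assumptions, not confirmed by this issue.
schema_fields = [
    {"name": "title", "type": "string"},
    {"name": "title_japanese", "type": "string", "optional": True, "locale": "ja"},
    {"name": "title_chinese", "type": "string", "optional": True, "locale": "zh"},
]


def build_document(doc_id: str, title: str) -> dict:
    """Build a document that stores the title under the detected field."""
    return {"id": doc_id, title_field(title): title}
```

A search would then presumably have to query all three fields. Note that this routing alone would not make “少女终末旅行” match “少女終末旅行”, since the two titles still contain different characters; it only controls which tokenizer is applied to each field.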


This requires further triage. If anyone knows the best way to handle this, please feel free to contribute.
