trigrams for japanese, chinese, korean? #42
This package doesn't support ja, zh, or ko yet. The algorithm was designed for languages with alphabetic writing systems, and I'm not sure it will work for languages with logographic ones, since they likely have many more possible trigrams. The trigrams observed in a short text may not even show up among the top-ranked trigrams of the language, simply because a language like Chinese has so many possible trigrams. I guess that's also why guess_language.py doesn't have trigrams for these languages. However, it might be easy to detect these languages based on other features. If you check what guess_language.py is using, we could perhaps take the same approach here. I imagine that unigrams might work if all these languages have logographs that are sufficiently frequent.
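The unigram idea in the last sentence could be sketched roughly like this: rank characters by frequency the same way the package ranks trigrams, then score a sample by how many of its characters appear in a language's reference ranking. This is a hypothetical sketch, not code from either package; `top_unigrams` and `unigram_overlap` are made-up names.

```python
from collections import Counter

def top_unigrams(text, n=50):
    """Rank characters by frequency, analogous to the trigram ranking."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return [ch for ch, _ in counts.most_common(n)]

def unigram_overlap(sample, reference_ranking):
    """Fraction of the sample's characters found in a reference ranking."""
    ranked = set(reference_ranking)
    chars = [ch for ch in sample if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch in ranked for ch in chars) / len(chars)
```

A language whose top-ranked characters cover most of the sample would win, which could work for logographic scripts exactly when their frequent characters really are frequent.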
Thanks for your response. My understanding is that guess_language.py uses https://github.com/kent37/guess-language/blob/master/guess_language/Blocks.txt to determine the writing system, but I don't understand how. (It looks like blocks.py contains a function that determines which block a single character is from?) I'm also not sure whether Emacs itself could simply detect a Unicode writing system?
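For reference, looking up a single character's block from ranges in the `Blocks.txt` format is straightforward with a binary search. The ranges below are a small hypothetical subset (the real file lists every block), and `block_of` is an illustrative name, not a function from either package:

```python
import bisect

# A few ranges in the same "start..end; name" spirit as Unicode's
# Blocks.txt (hypothetical subset for illustration).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
]
_STARTS = [b[0] for b in BLOCKS]

def block_of(ch):
    """Return the Unicode block name of a single character, or None."""
    cp = ord(ch)
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and BLOCKS[i][0] <= cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None
```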
If I understand correctly, guess_language.py checks for the presence of, e.g., Katakana, and if there is any, it decides that the text must be in Japanese. That's not ideal, because if you have an English paragraph with just a single Katakana character, it will misclassify the paragraph as Japanese. See here:
I suspect that's not how it works. The checks in that function run on the arg, and
so it sounds like it's adapted for the case you mention? I.e., the result of find_runs should only contain Katakana if it covers over 40% of the text? But I'm saying this without knowing the guess-language code beyond a casual glance...
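Under that reading (an assumption; neither of us has studied the code closely), the filter would look roughly like this. `script_of` and `significant_scripts` are hypothetical names, and matching on Unicode character names is a rough stand-in for the real block logic:

```python
import unicodedata
from collections import Counter

def script_of(ch):
    """Very rough script label derived from the character's Unicode name."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return None
    for script in ("HIRAGANA", "KATAKANA", "HANGUL", "CJK", "LATIN"):
        if script in name:
            return script
    return None

def significant_scripts(text, threshold=0.4):
    """Scripts covering more than `threshold` of the non-space characters,
    mimicking (our reading of) the find_runs filter."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return set()
    counts = Counter(s for s in map(script_of, chars) if s)
    return {s for s, n in counts.items() if n / len(chars) > threshold}
```

With a 40% threshold, one stray Katakana character in an English paragraph would not register, which would address the misclassification worry above.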
You may be right; I only had a quick glance at the code. Their approach may be reliable, but it's also not terribly elegant. I wonder if we can come up with a unified approach: what if we replace logographic characters with placeholders that simply indicate their category? Then we could perhaps apply the usual trigram approach again, so that no separate code path is needed. This should also work nicely for languages that mix different types of characters, like Japanese, which uses Chinese kanji, Japanese kanji, hiragana, and katakana, with some Latin characters sprinkled in here and there.
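A minimal sketch of that placeholder idea, assuming one stand-in letter per character category (all names and placeholder symbols here are hypothetical; the real mapping would need tuning, and training data would have to be collapsed the same way):

```python
import unicodedata

# Hypothetical placeholder letters per category; any fixed symbols work
# as long as training and classification agree on them.
PLACEHOLDERS = {"CJK": "H", "HIRAGANA": "h", "KATAKANA": "k", "HANGUL": "g"}

def collapse_scripts(text):
    """Replace logographic/syllabic characters with category placeholders
    so the usual trigram machinery can run on the result."""
    out = []
    for ch in text:
        try:
            name = unicodedata.name(ch)
        except ValueError:
            out.append(ch)
            continue
        for script, mark in PLACEHOLDERS.items():
            if script in name:
                out.append(mark)
                break
        else:
            out.append(ch)
    return "".join(out)

def trigrams(text):
    """Overlapping character trigrams, as in the usual approach."""
    return [text[i:i + 3] for i in range(len(text) - 2)]
```

Mixed Japanese text would then yield trigrams over kanji/hiragana/katakana category patterns plus any embedded Latin runs, with no separate code path.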
Hi, I'm interested in using this for the guess-language part only (i.e., not the typo-mode setting or spellchecking), but with all possible languages.
Is it possible that there's no Japanese (ja), Chinese (zh), or Korean (ko) in the trigrams data? Or am I confused about it somehow?
I did a few tests with Chinese and Japanese texts, and guess-language-region returned zu, i.e. Zulu. But I must be a little confused, as guess_language.py supports those languages, yet it doesn't have ja, zh, or ko in its trigrams files.
Perhaps the Python package simply selects those languages (and Greek) by their script, using the Blocks.txt file? Would it be possible to support that in guess-language.el too? I guess if that's the issue I'm encountering, it would require a bit of work to support those languages in this package...