trigrams for japanese, chinese, korean? #42

Open
mooseyboots opened this issue May 29, 2024 · 5 comments

Comments

@mooseyboots

hi, i'm interested in using this for the guess-language part only (i.e. not the typo-mode setting or spellchecking), but with all possible languages.

is it possible that there's no japanese (ja), chinese (zh), or korean (ko) in the trigrams data? or am i confused about it somehow?

i did a few tests with chinese and japanese texts and guess-language-region returned zu, i.e. Zulu.

but i must be a little confused, as guess_language.py supports those languages, yet it doesn't have ja, zh, or ko in its trigram files.

perhaps the python package simply selects those languages (and greek) by their script, using the Blocks.txt file? would it be possible to support that in guess-language.el as well?

i guess if that's the issue i'm encountering it would require a bit of work to support those languages in this package...

@tmalsburg
Owner

tmalsburg commented May 29, 2024

This package doesn't support ja, zh, or ko yet. The algorithm was designed for languages using alphabetic writing systems. I'm not sure it'll work for languages using logographic scripts, since they likely have many more possible trigrams. The trigrams observed in a short text may not even show up among the top-ranked trigrams of the language, simply because there are so many possible trigrams in a language like Chinese. I guess that's also the reason why guess_language.py doesn't have trigrams for these languages. However, it might be easy to detect these languages based on other features. If you check what guess_language.py is using, we could perhaps use the same approach here. I imagine that unigrams might work if all these languages have logographs that are sufficiently frequent.
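
Very roughly, the unigram idea could look something like this (an untested sketch, not code from this package; the function name and the hard-coded Unicode ranges are my own choices):

    (require 'cl-lib)

    ;; Count characters from a few script-specific Unicode ranges:
    ;; hiragana+katakana U+3040..U+30FF, hangul syllables
    ;; U+AC00..U+D7A3, CJK unified ideographs U+4E00..U+9FFF.
    (defun my/cjk-unigram-counts (text)
      "Return a plist with kana, hangul, and han character counts in TEXT."
      (let ((kana 0) (hangul 0) (han 0))
        (dolist (ch (string-to-list text))
          (cond ((<= #x3040 ch #x30ff) (cl-incf kana))
                ((<= #xac00 ch #xd7a3) (cl-incf hangul))
                ((<= #x4e00 ch #x9fff) (cl-incf han))))
        (list :kana kana :hangul hangul :han han)))

A caller could then guess ja when kana characters are present in force, ko when hangul dominates, and zh when it's essentially all han. The thresholds would need tuning, of course.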

@mooseyboots
Author

mooseyboots commented May 29, 2024

thanks for your response.

my understanding is that guess_language.py uses https://github.com/kent37/guess-language/blob/master/guess_language/Blocks.txt to determine the writing system, but i don't understand how. (it looks like blocks.py contains a function that determines which block a single character is from?)

i'm also not sure whether emacs itself could simply detect a unicode script?
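
though from a quick experiment, emacs's built-in char-script-table does at least classify individual characters by script:

    (aref char-script-table ?あ)  ; => kana
    (aref char-script-table ?字)  ; => han
    (aref char-script-table ?한)  ; => hangul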

@tmalsburg
Owner

If I understand correctly, guess_language.py checks for the presence of, e.g., Katakana, and if there is any, it decides that the text must be in Japanese. That's not ideal because if you have an English paragraph with just a single Katakana character, it will misclassify the paragraph as Japanese.

See here:
https://github.com/kent37/guess-language/blob/master/guess_language/guess_language.py#L375

@mooseyboots
Author

i suspect that's not how it works.

the checks in that function run on the scripts argument, and the function is called (in guessLanguage()) with the result of find_runs() as that argument.

and find_runs() explains itself thus:

    # return run types that used for 40% or more of the string
    # always return basic latin if found more than 15%
    # and extended additional latin if over 10% (for Vietnamese)

https://github.com/kent37/guess-language/blob/8983cc0f511ed81495684653e09b1643b8fd92e7/guess_language/guess_language.py#L359

so it sounds like it's adapted for the case you mention? i.e. the result of find_runs() should only contain Katakana if it makes up over 40% of the text?

but i'm saying this having given the guess-language code no more than a casual glance...
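
if it helps, here's my reading of find_runs() transliterated into (untested) emacs lisp — the function name is made up, and char-script-table stands in for Blocks.txt, so emacs's single latin script can only approximate the two latin cases in the python code:

    (require 'cl-lib)

    (defun my/find-runs (text)
      "Return the scripts covering >=40% of TEXT, plus latin at >=15%."
      (let ((counts nil)
            (total 0))
        ;; Tally how many characters belong to each Unicode script.
        (dolist (ch (string-to-list text))
          (let ((script (aref char-script-table ch)))
            (when script
              (cl-incf (alist-get script counts 0))
              (cl-incf total))))
        ;; Keep only the scripts that dominate the string.
        (let (result)
          (dolist (entry counts result)
            (let ((share (/ (cdr entry) (float (max total 1)))))
              (when (or (>= share 0.4)
                        (and (eq (car entry) 'latin) (>= share 0.15)))
                (push (car entry) result)))))))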

@tmalsburg
Owner

tmalsburg commented Jun 10, 2024

You may be right, I just had a quick glance at the code. Their approach may be reliable, but it's also not terribly elegant. I wonder if we can come up with a unified approach: What if we replaced logographic characters with placeholders that simply indicate their category? Then we could perhaps apply the usual trigram approach again, so that no separate code path would be needed. This should also work nicely for languages that mix different types of characters, like Japanese, which uses Chinese kanji, Japanese kanji, hiragana, and katakana, with some Latin characters sprinkled in here and there.
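
Concretely, the preprocessing step might look something like this (an untested sketch; the placeholder characters and the function name are arbitrary):

    ;; Map every han/kana/hangul character to one placeholder character
    ;; per script; everything else passes through unchanged.  The
    ;; resulting string could be fed to the existing trigram machinery.
    (defun my/collapse-scripts (text)
      "Replace logographic/syllabic characters in TEXT with placeholders."
      (mapconcat (lambda (ch)
                   (string (pcase (aref char-script-table ch)
                             ('han    ?H)
                             ('kana   ?K)
                             ('hangul ?G)
                             (_       ch))))
                 text
                 ""))

    ;; (my/collapse-scripts "漢字とかな mixed with Latin")
    ;;   => "HHKKK mixed with Latin"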
