trigrams for japanese, chinese, korean? #42
This package doesn't support ja, zh, or ko yet. The algorithm was designed for languages with alphabetic writing systems, and I'm not sure it will work for languages with logographic ones, since they likely have many more possible trigrams. The trigrams observed in a short text may not even show up among the top-ranked trigrams of the language, simply because a language like Chinese has so many possible trigrams. I guess that's also why guess_language.py doesn't have trigrams for these languages. However, it might be easy to detect these languages based on other features. If you check what guess_language.py is using, we could perhaps take the same approach here. I imagine that unigrams might work if all these languages have logographs that are sufficiently frequent.
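The unigram idea in the last sentence could be sketched roughly like this: rank characters by frequency the same way the package ranks trigrams, then score a sample by how many of its characters appear in a language's reference ranking. This is a hypothetical sketch, not code from either package; `top_unigrams` and `unigram_overlap` are made-up names.

```python
from collections import Counter

def top_unigrams(text, n=50):
    """Rank characters by frequency, analogous to the trigram ranking."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return [ch for ch, _ in counts.most_common(n)]

def unigram_overlap(sample, reference_ranking):
    """Fraction of the sample's characters found in a reference ranking."""
    ranked = set(reference_ranking)
    chars = [ch for ch in sample if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch in ranked for ch in chars) / len(chars)
```

A language whose top-ranked characters cover most of the sample would win, which could work for logographic scripts exactly when their frequent characters really are frequent.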
Thanks for your response. My understanding is that guess_language.py uses https://github.com/kent37/guess-language/blob/master/guess_language/Blocks.txt to determine the writing system, but I don't understand how. (It looks like blocks.py contains a function that determines which block a single character is from?) I'm also not sure whether Emacs itself could simply detect a Unicode writing system?
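For reference, looking up a single character's block from ranges in the `Blocks.txt` format is straightforward with a binary search. The ranges below are a small hypothetical subset (the real file lists every block), and `block_of` is an illustrative name, not a function from either package:

```python
import bisect

# A few ranges in the same "start..end; name" spirit as Unicode's
# Blocks.txt (hypothetical subset for illustration).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
]
_STARTS = [b[0] for b in BLOCKS]

def block_of(ch):
    """Return the Unicode block name of a single character, or None."""
    cp = ord(ch)
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and BLOCKS[i][0] <= cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None
```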
If I understand correctly, guess_language.py checks for the presence of, e.g., Katakana, and if there is any, it decides that the text must be in Japanese. That's not ideal, because if you have an English paragraph with just a single Katakana character, it will misclassify the paragraph as Japanese. See here:
I suspect that's not how it works. The checks in that function run on the arg, and
so it sounds like it's adapted for the case you mention? I.e., the result of find_runs should only contain Katakana if it covers over 40% of the text? But I'm saying this without knowing the guess-language code beyond a casual glance...
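Under that reading (an assumption; neither of us has studied the code closely), the filter would look roughly like this. `script_of` and `significant_scripts` are hypothetical names, and matching on Unicode character names is a rough stand-in for the real block logic:

```python
import unicodedata
from collections import Counter

def script_of(ch):
    """Very rough script label derived from the character's Unicode name."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return None
    for script in ("HIRAGANA", "KATAKANA", "HANGUL", "CJK", "LATIN"):
        if script in name:
            return script
    return None

def significant_scripts(text, threshold=0.4):
    """Scripts covering more than `threshold` of the non-space characters,
    mimicking (our reading of) the find_runs filter."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return set()
    counts = Counter(s for s in map(script_of, chars) if s)
    return {s for s, n in counts.items() if n / len(chars) > threshold}
```

With a 40% threshold, one stray Katakana character in an English paragraph would not register, which would address the misclassification worry above.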
You may be right; I only had a quick glance at the code. Their approach may be reliable, but it's also not terribly elegant. I wonder if we can come up with a unified approach: what if we replace logographic characters with placeholders that simply indicate their category? Then we could perhaps apply the usual trigram approach again, so that no separate code path is needed. This should also work nicely for languages that mix different types of characters, like Japanese, which uses Chinese kanji, Japanese kanji, hiragana, and katakana, with some Latin characters sprinkled in here and there.
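A minimal sketch of that placeholder idea, assuming one stand-in letter per character category (all names and placeholder symbols here are hypothetical; the real mapping would need tuning, and training data would have to be collapsed the same way):

```python
import unicodedata

# Hypothetical placeholder letters per category; any fixed symbols work
# as long as training and classification agree on them.
PLACEHOLDERS = {"CJK": "H", "HIRAGANA": "h", "KATAKANA": "k", "HANGUL": "g"}

def collapse_scripts(text):
    """Replace logographic/syllabic characters with category placeholders
    so the usual trigram machinery can run on the result."""
    out = []
    for ch in text:
        try:
            name = unicodedata.name(ch)
        except ValueError:
            out.append(ch)
            continue
        for script, mark in PLACEHOLDERS.items():
            if script in name:
                out.append(mark)
                break
        else:
            out.append(ch)
    return "".join(out)

def trigrams(text):
    """Overlapping character trigrams, as in the usual approach."""
    return [text[i:i + 3] for i in range(len(text) - 2)]
```

Mixed Japanese text would then yield trigrams over kanji/hiragana/katakana category patterns plus any embedded Latin runs, with no separate code path.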
Hi, I'm interested in using this for the guess-language part only (i.e., not the typo-mode setting or spellchecking), but with all possible languages.
Is it possible that there's no Japanese (ja), Chinese (zh), or Korean (ko) in the trigrams data? Or am I confused about it somehow?
I did a few tests with Chinese and Japanese texts, and guess-language-region returned zu, i.e. Zulu. But I must be a little confused, as guess_language.py supports those languages, yet it doesn't have ja, zh, or ko in its trigrams files.
Perhaps the Python package simply selects those languages (and Greek) by their script, using the Blocks.txt file? Would it be possible to support that in guess-language.el too? I guess if that's the issue I'm encountering, it would require a bit of work to support those languages in this package...