Enhance Kanji Recognition in Japanese Language Detection #381

Closed
wants to merge 9 commits
2 changes: 1 addition & 1 deletion accuracy-reports/aggregated-accuracy-values.csv
@@ -11,7 +11,7 @@ Bokmal,NaN,NaN,NaN,NaN,49,24,44,80,NaN,NaN,NaN,NaN,49,27,47,74,58,38,58,76
Bosnian,18,4,15,36,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,29,22,28,36,34,29,34,40
Bulgarian,65,31,72,92,69,44,67,96,NaN,NaN,NaN,NaN,77,56,80,96,86,70,91,99
Catalan,37,4,29,79,51,29,45,80,NaN,NaN,NaN,NaN,58,33,60,81,70,50,73,86
-Chinese,33,NaN,2,98,100,100,100,100,97,93,98,100,100,100,100,100,100,100,100,100
+Chinese,33,NaN,2,98,100,100,100,100,97,93,98,100,100,100,100,100,95,89,96,100
Croatian,51,33,46,72,61,34,54,94,NaN,NaN,NaN,NaN,59,36,57,85,72,53,74,90
Czech,73,50,79,90,63,42,66,82,NaN,NaN,NaN,NaN,70,54,71,87,80,65,84,91
Danish,59,26,56,94,53,31,45,84,NaN,NaN,NaN,NaN,70,45,70,95,81,61,83,97
10 changes: 5 additions & 5 deletions accuracy-reports/lingua-high-accuracy/Chinese.txt
@@ -1,14 +1,14 @@
##### Chinese #####

->>> Accuracy on average: 100%
+>>> Accuracy on average: 95.16%

>> Detection of 1000 single words (average length: 1 chars)
-Accuracy: 100%
-Erroneously classified as
+Accuracy: 89.1%
+Erroneously classified as Japanese: 10.9%

>> Detection of 1000 word pairs (average length: 2 chars)
-Accuracy: 100%
-Erroneously classified as
+Accuracy: 96.4%
+Erroneously classified as Japanese: 3.6%

>> Detection of 729 sentences (average length: 48 chars)
Accuracy: 100%
2 changes: 1 addition & 1 deletion src/constant.rs
@@ -24,7 +24,7 @@ use crate::alphabet::CharSet;
use crate::language::Language;

pub(crate) static JAPANESE_CHARACTER_SET: Lazy<CharSet> =
-Lazy::new(|| CharSet::from_char_classes(&["Hiragana", "Katakana", "Han"]));
+Lazy::new(|| CharSet::from_char_classes(&["Hiragana", "Katakana", "Japanese_Han"]));
pub(crate) static MULTIPLE_WHITESPACE: Lazy<Regex> = Lazy::new(|| Regex::new("\\s+").unwrap());
pub(crate) static NUMBERS: Lazy<Regex> = Lazy::new(|| Regex::new("\\p{N}").unwrap());
pub(crate) static PUNCTUATION: Lazy<Regex> = Lazy::new(|| Regex::new("\\p{P}").unwrap());
39 changes: 32 additions & 7 deletions src/detector.rs
@@ -795,6 +795,8 @@ impl LanguageDetector {
) -> Option<Language> {
let mut total_language_counts = HashMap::<Option<Language>, u32>::new();
let half_word_count = (words.len() as f64) * 0.5;
+let mut cjk_lang_uncertainty: usize = 0;
+let cjk_lang_uncertainty_max_ratio = 0.9999999999;

for word in words {
let mut word_language_counts = HashMap::<Language, u32>::new();
@@ -811,18 +813,19 @@
}

if !is_match {
if cfg!(feature = "chinese") && Alphabet::Han.matches_char(character) {

if cfg!(feature = "japanese") //we need to test for both and later guess at which one it is
&& JAPANESE_CHARACTER_SET.is_char_match(character)
{
self.increment_counter(
&mut word_language_counts,
Language::from_str("Chinese").unwrap(),
Language::from_str("Japanese").unwrap(),
1,
);
} else if cfg!(feature = "japanese")
&& JAPANESE_CHARACTER_SET.is_char_match(character)
{
} if cfg!(feature = "chinese") && Alphabet::Han.matches_char(character) {
self.increment_counter(
&mut word_language_counts,
Language::from_str("Japanese").unwrap(),
Language::from_str("Chinese").unwrap(),
1,
);
} else if Alphabet::Latin.matches_char(character)
@@ -854,11 +857,17 @@
&& word_language_counts.contains_key(&Language::from_str("Chinese").unwrap())
&& word_language_counts.contains_key(&Language::from_str("Japanese").unwrap())
{
+self.increment_counter(
+&mut total_language_counts,
+Some(Language::from_str("Chinese").unwrap()),
+1,
+);
self.increment_counter(
&mut total_language_counts,
Some(Language::from_str("Japanese").unwrap()),
1,
);
+cjk_lang_uncertainty += 1;
} else {
let sorted_word_language_counts = word_language_counts
.into_iter()
@@ -898,10 +907,26 @@
&& cfg!(feature = "japanese")
&& total_language_counts.contains_key(&Some(Language::from_str("Chinese").unwrap()))
&& total_language_counts.contains_key(&Some(Language::from_str("Japanese").unwrap()))
+&& (cjk_lang_uncertainty as f32 / words.len() as f32) >= cjk_lang_uncertainty_max_ratio
+&& self.is_low_accuracy_mode_enabled
Owner: Why is this rule applied in low accuracy mode only? The rule engine should operate independently of the selected accuracy mode.

Author: It's due to the fact that in low accuracy mode, lingua-rs doesn't use the n-gram model after running detect_language_with_rules (I'm not sure that's right). Regardless, by adding this case, a lot more Chinese words get recognized as Chinese in low accuracy mode; otherwise, they would be misidentified as unknown. If you want, I can move this logic to compute_language_confidence_values_for_languages.
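A minimal sketch of the flow described above, taking that description at face value; everything here is generic and illustrative rather than lingua-rs's actual internals, with `rule_result` standing in for the outcome of detect_language_with_rules and `ngram_fallback` for the statistical model:

```rust
// Illustrative only: a generic sketch of the detection flow as described in the
// comment above, not lingua-rs's real code paths.
fn resolve<L>(
    rule_result: Option<L>,
    low_accuracy_mode: bool,
    ngram_fallback: impl FnOnce() -> Option<L>,
) -> Option<L> {
    if low_accuracy_mode {
        // No statistical pass follows, so an inconclusive rule result
        // surfaces directly as "unknown" (None).
        rule_result
    } else {
        // High-accuracy mode can still resolve the text via the n-gram model.
        rule_result.or_else(ngram_fallback)
    }
}
```

Under that reading, the new rule mainly changes observable results in low accuracy mode because that is the only mode where the rule pass's answer is final.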

Author: Where do you want me to add the unit tests?

Owner: Just add a new unit test method in file detector.rs. I think it's best to use a parameterized test method. Just take a look at the other test methods and do it analogously.
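A rough sketch of what such a parameterized test could look like, using rstest (already used elsewhere in this repo's tests) and the public builder API rather than the internal fixtures the existing detector.rs tests rely on; the sample texts and expected results below are illustrative assumptions, not taken from this PR:

```rust
use lingua::{Language, LanguageDetectorBuilder};
use rstest::rstest;

#[rstest]
// Kana is unambiguously Japanese, so this text should resolve to Japanese.
#[case::kana_present("ひらがなが混ざった文です", Language::Japanese)]
// Han-only input is the case this PR targets; with equal per-word counts,
// the new tie-break prefers Chinese.
#[case::han_only("只有汉字的句子", Language::Chinese)]
fn assert_cjk_texts_are_detected_correctly(#[case] text: &str, #[case] expected: Language) {
    let detector =
        LanguageDetectorBuilder::from_languages(&[Language::Chinese, Language::Japanese])
            .with_low_accuracy_mode()
            .build();
    assert_eq!(detector.detect_language_of(text), Some(expected));
}
```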

{
-return Some(Language::from_str("Japanese").unwrap());
+// Retrieve the counts for the Chinese and Japanese languages.
+let chinese_count = *total_language_counts
+    .get(&Some(Language::Chinese))
+    .unwrap_or(&0);
+let japanese_count = *total_language_counts
+    .get(&Some(Language::Japanese))
+    .unwrap_or(&0);
+// Compare the counts and return the language with the higher count (ties go to Chinese).
+if chinese_count >= japanese_count {
+    return Some(Language::Chinese);
+} else {
+    return Some(Language::Japanese);
+}
}



let sorted_total_language_counts = total_language_counts
.into_iter()
.sorted_by(|(_, first_count), (_, second_count)| second_count.cmp(first_count))
1 change: 0 additions & 1 deletion src/language.rs
@@ -1095,7 +1095,6 @@

#[cfg(test)]
mod tests {
-use std::str::FromStr;

use crate::language::Language::*;

1 change: 0 additions & 1 deletion src/model.rs
@@ -206,7 +206,6 @@ fn get_utf8_slice(string: &str, start: usize, end: usize) -> &str {

#[cfg(test)]
mod tests {
-use itertools::Itertools;
use rstest::*;

use super::*;