nucleo-matcher: Normalization is incomplete. #51

markus-bauer · 2024-10-04T17:09:11Z

Thanks for this crate. But I had some issues with normalization.

Here are some random examples from the polish alphabet, that don't work as expected; there might be more issues, though.

use nucleo_matcher::pattern::{AtomKind, CaseMatching, Normalization};

fn main() {
    // Example: https://en.wikipedia.org/wiki/Polish_alphabet
    let tests = [
        ('ą', 'a'),
        ('ć', 'c'),
        ('ę', 'e'),
        ('ł', 'l'),
        ('ń', 'n'),
        ('ó', 'o'),
        ('ś', 's'),
        ('ź', 'z'),
        ('ż', 'z'),
        ('Ą', 'A'),
        ('Ć', 'C'),
        ('Ę', 'E'),
        ('Ł', 'L'),
        ('Ń', 'N'),
        ('Ó', 'O'),
        ('Ś', 'S'),
        ('Ź', 'Z'),
        ('Ż', 'Z'),
    ];

    // test normalization funcion
    println!("--- normalize ---");
    for (c, expected) in tests {
        let normalized = nucleo_matcher::chars::normalize(c);
        if normalized == expected {
            println!("okay: {} -> {}.", c, normalized);
        } else {
            println!("fail: {} -> {}. expected {}", c, normalized, expected);
            let c_lower = c.to_lowercase().next().unwrap();
            let expected_lower = expected.to_lowercase().next().unwrap();
            let normalized_lower = nucleo_matcher::chars::normalize(c_lower);
            if normalized_lower == expected_lower {
                println!(" but lowercase works: {} -> {}", c_lower, normalized_lower);
            }
        }
    }

    // test matcher
    println!("--- matcher ---");
    let mut matcher = nucleo_matcher::Matcher::new(nucleo_matcher::Config::DEFAULT);

    for (c, expected) in tests {
        // search for c in expected
        let haystack = nucleo_matcher::Utf32String::from(c.to_string());
        let atom = nucleo_matcher::pattern::Atom::new(
            &expected.to_string(),
            CaseMatching::Smart,
            Normalization::Smart,
            AtomKind::Exact,
            false,
        );

        if atom.score(haystack.slice_u32(..), &mut matcher).is_some() {
            println!("okay: {}", c)
        } else {
            println!("fail: {} not in {}", atom.needle_text(), haystack)
        }
    }
}

--- normalize ---
okay: ą -> a.
okay: ć -> c.
okay: ę -> e.
okay: ł -> l.
okay: ń -> n.
okay: ó -> o.
okay: ś -> s.
okay: ź -> z.
okay: ż -> z.
fail: Ą -> Ą. expected A
 but lowercase works: ą -> a
fail: Ć -> Ć. expected C
 but lowercase works: ć -> c
fail: Ę -> Ę. expected E
 but lowercase works: ę -> e
fail: Ł -> Ł. expected L
 but lowercase works: ł -> l
fail: Ń -> Ń. expected N
 but lowercase works: ń -> n
okay: Ó -> O.
fail: Ś -> Ś. expected S
 but lowercase works: ś -> s
fail: Ź -> Ź. expected Z
 but lowercase works: ź -> z
fail: Ż -> Ż. expected Z
 but lowercase works: ż -> z
--- matcher ---
okay: ą
okay: ć
okay: ę
okay: ł
okay: ń
okay: ó
okay: ś
okay: ź
okay: ż
fail: A not in Ą
fail: C not in Ć
fail: E not in Ę
fail: L not in Ł
fail: N not in Ń
okay: Ó
fail: S not in Ś
fail: Z not in Ź
fail: Z not in Ż

Also, I think you might have made some copy-paste/codegen mistakes (here and in other places):

nucleo/matcher/src/chars/normalize.rs

Lines 340 to 390 in 6df3cd0

    
           ('\u{1E37}', 'l'), //  WITH DOT BELOW, LATIN SMALL LETTER    ('\u{1E3B}', 'l'), //  WITH LINE BELOW, LATIN SMALL LETTER 
        
           ('\u{1E3D}', 'l'), //  WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER 
        
           ('\u{1E3F}', 'm'), //  WITH ACUTE, LATIN SMALL LETTER 
        
           ('\u{1E41}', 'm'), //  WITH DOT ABOVE, LATIN SMALL LETTER 
        
           ('\u{1E43}', 'm'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1E45}', 'n'), //  WITH DOT ABOVE, LATIN SMALL LETTER 
        
           ('\u{1E47}', 'n'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1E49}', 'n'), //  WITH LINE BELOW, LATIN SMALL LETTER 
        
           ('\u{1E4B}', 'n'), //  WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER 
        
           ('\u{1E55}', 'p'), //  WITH ACUTE, LATIN SMALL LETTER 
        
           ('\u{1E57}', 'p'), //  WITH DOT ABOVE, LATIN SMALL LETTER    ('\u{1E59}', 'r'), //  WITH DOT ABOVE, LATIN SMALL LETTER 
        
           ('\u{1E5B}', 'r'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1E5F}', 'r'), //  WITH LINE BELOW, LATIN SMALL LETTER 
        
           ('\u{1E61}', 's'), //  WITH DOT ABOVE, LATIN SMALL LETTER 
        
           ('\u{1E63}', 's'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1E6B}', 't'), //  WITH DOT ABOVE, LATIN SMALL LETTER 
        
           ('\u{1E6D}', 't'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1E6F}', 't'), //  WITH LINE BELOW, LATIN SMALL LETTER 
        
           ('\u{1E71}', 't'), //  WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER 
        
           ('\u{1E73}', 'u'), //  WITH DIAERESIS BELOW, LATIN SMALL LETTER 
        
           ('\u{1E75}', 'u'), //  WITH TILDE BELOW, LATIN SMALL LETTER    ('\u{1E77}', 'u'), //  WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER 
        
           ('\u{1E7D}', 'v'), //  WITH TILDE, LATIN SMALL LETTER 
        
           ('\u{1E7F}', 'v'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1E81}', 'w'), //  WITH GRAVE, LATIN SMALL LETTER 
        
           ('\u{1E83}', 'w'), //  WITH ACUTE, LATIN SMALL LETTER 
        
           ('\u{1E85}', 'w'), //  WITH DIAERESIS, LATIN SMALL LETTER    ('\u{1E87}', 'w'), //  WITH DOT ABOVE, LATIN SMALL LETTER 
        
           ('\u{1E89}', 'w'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1E8B}', 'x'), //  WITH DOT ABOVE, LATIN SMALL LETTER 
        
           ('\u{1E8D}', 'x'), //  WITH DIAERESIS, LATIN SMALL LETTER 
        
           ('\u{1E8F}', 'y'), //  WITH DOT ABOVE, LATIN SMALL LETTER 
        
           ('\u{1E91}', 'z'), //  WITH CIRCUMFLEX, LATIN SMALL LETTER    ('\u{1E93}', 'z'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1E95}', 'z'), //  WITH LINE BELOW, LATIN SMALL LETTER 
        
           ('\u{1E96}', 'h'), //  WITH LINE BELOW, LATIN SMALL LETTER 
        
           ('\u{1E97}', 't'), //  WITH DIAERESIS, LATIN SMALL LETTER 
        
           ('\u{1E98}', 'w'), //  WITH RING ABOVE, LATIN SMALL LETTER 
        
           ('\u{1E99}', 'y'), //  WITH RING ABOVE, LATIN SMALL LETTER 
        
           ('\u{1E9A}', 'a'), //  WITH RIGHT HALF RING, LATIN SMALL LETTER 
        
           ('\u{1E9B}', 's'), //  WITH DOT ABOVE, LATIN SMALL LETTER LONG 
        
           ('\u{1EA1}', 'a'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1EA3}', 'a'), //  WITH HOOK ABOVE, LATIN SMALL LETTER 
        
           ('\u{1EB9}', 'e'), //  WITH DOT BELOW, LATIN SMALL LETTER    ('\u{1EBB}', 'e'), //  WITH HOOK ABOVE, LATIN SMALL LETTER 
        
           ('\u{1EBD}', 'e'), //  WITH TILDE, LATIN SMALL LETTER 
        
           ('\u{1EC9}', 'i'), //  WITH HOOK ABOVE, LATIN SMALL LETTER 
        
           ('\u{1ECB}', 'i'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1ECD}', 'o'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1ECF}', 'o'), //  WITH HOOK ABOVE, LATIN SMALL LETTER 
        
           ('\u{1EE5}', 'u'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1EE7}', 'u'), //  WITH HOOK ABOVE, LATIN SMALL LETTER 
        
           ('\u{1EF3}', 'y'), //  WITH GRAVE, LATIN SMALL LETTER 
        
           ('\u{1EF5}', 'y'), //  WITH DOT BELOW, LATIN SMALL LETTER 
        
           ('\u{1EF7}', 'y'), //  WITH HOOK ABOVE, LATIN SMALL LETTER    ('\u{1EF9}', 'y'), //  WITH TILDE, LATIN SMALL LETTER

The text was updated successfully, but these errors were encountered:

alexrutar · 2024-11-17T18:33:34Z

I might pick this up at some point; but if some else does (and for my own reference in the future) I have linked the Wikipedia pages for reference to the usual Unicode blocks:

And then there are some weirder ones (not sure if these should be handled, and if so which planes)

Not sure if I am missing anything else. For codegen the unicode-normalization crate should work. I guess hand-rolling an optimized normalizer for latin-only blocks is quite good for performance reasons, especially because of the way graphemes are handled internally (i.e. we do not need to handle multi-char normalization since we just grab the first char in each grapheme cluster anyway).

There's also icu_normalizer for comparison. I guess any implementation should be benchmarked against icu_normalizer since I think it is the best-performing fully featured normalizer implementing all of UAX 15 which is also written in Rust. Though a custom implementation should be substantially faster / smaller.

pascalkuthe · 2024-11-17T20:01:31Z

This likely just a bug with the current lookup but the implementation is indeed optimized for performance and I don't want something exhaustive like the existing creates you linked here.

The tabels are copied from fzf and the normalization option in nucleo and fzf are specifically only about latin characters.

This improves the normalization for Latin characters, mainly to address the concerns in helix-editor#51. This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block which for some reason was missing every capital letter. I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of input (all ASCII, sparse Unicode, heavy Unicode, all outside normalizatio ranges) and there was no observable difference, but definitely not super rigorous. Finally, I inlined all of the char blocks, rather than replying on the 'sparse table' static generation which was implemented earlier. In particular, `normalization` is now a `const fn`. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.

alexrutar · 2024-12-07T20:42:03Z

This is now fixed in #60

alexrutar mentioned this issue Nov 18, 2024

Improve Latin char normalization #60

Merged

pascalkuthe closed this as completed Dec 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nucleo-matcher: Normalization is incomplete. #51

nucleo-matcher: Normalization is incomplete. #51

markus-bauer commented Oct 4, 2024

alexrutar commented Nov 17, 2024

pascalkuthe commented Nov 17, 2024

alexrutar commented Dec 7, 2024

nucleo-matcher: Normalization is incomplete. #51

nucleo-matcher: Normalization is incomplete. #51

Comments

markus-bauer commented Oct 4, 2024

alexrutar commented Nov 17, 2024

pascalkuthe commented Nov 17, 2024

alexrutar commented Dec 7, 2024