Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nucleo-matcher: Normalization is incomplete. #51

Closed
markus-bauer opened this issue Oct 4, 2024 · 3 comments
Closed

nucleo-matcher: Normalization is incomplete. #51

markus-bauer opened this issue Oct 4, 2024 · 3 comments

Comments

@markus-bauer
Copy link

Thanks for this crate. But I had some issues with normalization.

Here are some random examples from the polish alphabet, that don't work as expected; there might be more issues, though.

use nucleo_matcher::pattern::{AtomKind, CaseMatching, Normalization};

fn main() {
    // Example: https://en.wikipedia.org/wiki/Polish_alphabet
    let tests = [
        ('ą', 'a'),
        ('ć', 'c'),
        ('ę', 'e'),
        ('ł', 'l'),
        ('ń', 'n'),
        ('ó', 'o'),
        ('ś', 's'),
        ('ź', 'z'),
        ('ż', 'z'),
        ('Ą', 'A'),
        ('Ć', 'C'),
        ('Ę', 'E'),
        ('Ł', 'L'),
        ('Ń', 'N'),
        ('Ó', 'O'),
        ('Ś', 'S'),
        ('Ź', 'Z'),
        ('Ż', 'Z'),
    ];

    // test normalization funcion
    println!("--- normalize ---");
    for (c, expected) in tests {
        let normalized = nucleo_matcher::chars::normalize(c);
        if normalized == expected {
            println!("okay: {} -> {}.", c, normalized);
        } else {
            println!("fail: {} -> {}. expected {}", c, normalized, expected);
            let c_lower = c.to_lowercase().next().unwrap();
            let expected_lower = expected.to_lowercase().next().unwrap();
            let normalized_lower = nucleo_matcher::chars::normalize(c_lower);
            if normalized_lower == expected_lower {
                println!(" but lowercase works: {} -> {}", c_lower, normalized_lower);
            }
        }
    }

    // test matcher
    println!("--- matcher ---");
    let mut matcher = nucleo_matcher::Matcher::new(nucleo_matcher::Config::DEFAULT);

    for (c, expected) in tests {
        // search for c in expected
        let haystack = nucleo_matcher::Utf32String::from(c.to_string());
        let atom = nucleo_matcher::pattern::Atom::new(
            &expected.to_string(),
            CaseMatching::Smart,
            Normalization::Smart,
            AtomKind::Exact,
            false,
        );

        if atom.score(haystack.slice_u32(..), &mut matcher).is_some() {
            println!("okay: {}", c)
        } else {
            println!("fail: {} not in {}", atom.needle_text(), haystack)
        }
    }
}
--- normalize ---
okay: ą -> a.
okay: ć -> c.
okay: ę -> e.
okay: ł -> l.
okay: ń -> n.
okay: ó -> o.
okay: ś -> s.
okay: ź -> z.
okay: ż -> z.
fail: Ą -> Ą. expected A
 but lowercase works: ą -> a
fail: Ć -> Ć. expected C
 but lowercase works: ć -> c
fail: Ę -> Ę. expected E
 but lowercase works: ę -> e
fail: Ł -> Ł. expected L
 but lowercase works: ł -> l
fail: Ń -> Ń. expected N
 but lowercase works: ń -> n
okay: Ó -> O.
fail: Ś -> Ś. expected S
 but lowercase works: ś -> s
fail: Ź -> Ź. expected Z
 but lowercase works: ź -> z
fail: Ż -> Ż. expected Z
 but lowercase works: ż -> z
--- matcher ---
okay: ą
okay: ć
okay: ę
okay: ł
okay: ń
okay: ó
okay: ś
okay: ź
okay: ż
fail: A not in Ą
fail: C not in Ć
fail: E not in Ę
fail: L not in Ł
fail: N not in Ń
okay: Ó
fail: S not in Ś
fail: Z not in Ź
fail: Z not in Ż

Also, I think you might have made some copy-paste/codegen mistakes (here and in other places):

('\u{1E37}', 'l'), // WITH DOT BELOW, LATIN SMALL LETTER ('\u{1E3B}', 'l'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E3D}', 'l'), // WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER
('\u{1E3F}', 'm'), // WITH ACUTE, LATIN SMALL LETTER
('\u{1E41}', 'm'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E43}', 'm'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E45}', 'n'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E47}', 'n'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E49}', 'n'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E4B}', 'n'), // WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER
('\u{1E55}', 'p'), // WITH ACUTE, LATIN SMALL LETTER
('\u{1E57}', 'p'), // WITH DOT ABOVE, LATIN SMALL LETTER ('\u{1E59}', 'r'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E5B}', 'r'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E5F}', 'r'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E61}', 's'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E63}', 's'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E6B}', 't'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E6D}', 't'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E6F}', 't'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E71}', 't'), // WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER
('\u{1E73}', 'u'), // WITH DIAERESIS BELOW, LATIN SMALL LETTER
('\u{1E75}', 'u'), // WITH TILDE BELOW, LATIN SMALL LETTER ('\u{1E77}', 'u'), // WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER
('\u{1E7D}', 'v'), // WITH TILDE, LATIN SMALL LETTER
('\u{1E7F}', 'v'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E81}', 'w'), // WITH GRAVE, LATIN SMALL LETTER
('\u{1E83}', 'w'), // WITH ACUTE, LATIN SMALL LETTER
('\u{1E85}', 'w'), // WITH DIAERESIS, LATIN SMALL LETTER ('\u{1E87}', 'w'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E89}', 'w'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E8B}', 'x'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E8D}', 'x'), // WITH DIAERESIS, LATIN SMALL LETTER
('\u{1E8F}', 'y'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E91}', 'z'), // WITH CIRCUMFLEX, LATIN SMALL LETTER ('\u{1E93}', 'z'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E95}', 'z'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E96}', 'h'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E97}', 't'), // WITH DIAERESIS, LATIN SMALL LETTER
('\u{1E98}', 'w'), // WITH RING ABOVE, LATIN SMALL LETTER
('\u{1E99}', 'y'), // WITH RING ABOVE, LATIN SMALL LETTER
('\u{1E9A}', 'a'), // WITH RIGHT HALF RING, LATIN SMALL LETTER
('\u{1E9B}', 's'), // WITH DOT ABOVE, LATIN SMALL LETTER LONG
('\u{1EA1}', 'a'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1EA3}', 'a'), // WITH HOOK ABOVE, LATIN SMALL LETTER
('\u{1EB9}', 'e'), // WITH DOT BELOW, LATIN SMALL LETTER ('\u{1EBB}', 'e'), // WITH HOOK ABOVE, LATIN SMALL LETTER
('\u{1EBD}', 'e'), // WITH TILDE, LATIN SMALL LETTER
('\u{1EC9}', 'i'), // WITH HOOK ABOVE, LATIN SMALL LETTER
('\u{1ECB}', 'i'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1ECD}', 'o'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1ECF}', 'o'), // WITH HOOK ABOVE, LATIN SMALL LETTER
('\u{1EE5}', 'u'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1EE7}', 'u'), // WITH HOOK ABOVE, LATIN SMALL LETTER
('\u{1EF3}', 'y'), // WITH GRAVE, LATIN SMALL LETTER
('\u{1EF5}', 'y'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1EF7}', 'y'), // WITH HOOK ABOVE, LATIN SMALL LETTER ('\u{1EF9}', 'y'), // WITH TILDE, LATIN SMALL LETTER

@alexrutar
Copy link
Contributor

I might pick this up at some point; but if some else does (and for my own reference in the future) I have linked the Wikipedia pages for reference to the usual Unicode blocks:

And then there are some weirder ones (not sure if these should be handled, and if so which planes)

Not sure if I am missing anything else. For codegen the unicode-normalization crate should work. I guess hand-rolling an optimized normalizer for latin-only blocks is quite good for performance reasons, especially because of the way graphemes are handled internally (i.e. we do not need to handle multi-char normalization since we just grab the first char in each grapheme cluster anyway).

There's also icu_normalizer for comparison. I guess any implementation should be benchmarked against icu_normalizer since I think it is the best-performing fully featured normalizer implementing all of UAX 15 which is also written in Rust. Though a custom implementation should be substantially faster / smaller.

@pascalkuthe
Copy link
Member

This likely just a bug with the current lookup but the implementation is indeed optimized for performance and I don't want something exhaustive like the existing creates you linked here.

The tabels are copied from fzf and the normalization option in nucleo and fzf are specifically only about latin characters.

alexrutar added a commit to rutar-forks/nucleo that referenced this issue Nov 18, 2024
This improves the normalization for Latin characters, mainly to
address the concerns in helix-editor#51. This adds a very large number of new
normalizations, especially in the 'Latin Extended Additional' block
which for some reason was missing every capital letter.

I did not add normalizations in any new Unicode blocks, but I did
slightly extend the 'Latin 1' block to also capture some of the
subscripts; this is for consistency with the 'Subscripts and
Superscripts' block which was previously handled. I also preserved
the actual implementation of the `normalize` function in terms of
the check order, etc. In particular, the generated code should be
approximately the same. To verify this, I ran some crude
benchmarks on a variety of input (all ASCII, sparse Unicode, heavy
Unicode, all outside normalizatio ranges) and there was no
observable difference, but definitely not super rigorous.

Finally, I inlined all of the char blocks, rather than replying on
the 'sparse table' static generation which was implemented
earlier. In particular, `normalization` is now a `const fn`. At
least in my mind it is a bit easier to read in this form. It also
makes it much clearer when characters are missed.
alexrutar added a commit to rutar-forks/nucleo that referenced this issue Nov 18, 2024
This improves the normalization for Latin characters, mainly to
address the concerns in helix-editor#51. This adds a very large number of new
normalizations, especially in the 'Latin Extended Additional' block
which for some reason was missing every capital letter.

I did not add normalizations in any new Unicode blocks, but I did
slightly extend the 'Latin 1' block to also capture some of the
subscripts; this is for consistency with the 'Subscripts and
Superscripts' block which was previously handled. I also preserved
the actual implementation of the `normalize` function in terms of
the check order, etc. In particular, the generated code should be
approximately the same. To verify this, I ran some crude
benchmarks on a variety of input (all ASCII, sparse Unicode, heavy
Unicode, all outside normalizatio ranges) and there was no
observable difference, but definitely not super rigorous.

Finally, I inlined all of the char blocks, rather than replying on
the 'sparse table' static generation which was implemented
earlier. In particular, `normalization` is now a `const fn`. At
least in my mind it is a bit easier to read in this form. It also
makes it much clearer when characters are missed.
alexrutar added a commit to rutar-forks/nucleo that referenced this issue Nov 18, 2024
This improves the normalization for Latin characters, mainly to
address the concerns in helix-editor#51. This adds a very large number of new
normalizations, especially in the 'Latin Extended Additional' block
which for some reason was missing every capital letter.

I did not add normalizations in any new Unicode blocks, but I did
slightly extend the 'Latin 1' block to also capture some of the
subscripts; this is for consistency with the 'Subscripts and
Superscripts' block which was previously handled. I also preserved
the actual implementation of the `normalize` function in terms of
the check order, etc. In particular, the generated code should be
approximately the same. To verify this, I ran some crude
benchmarks on a variety of input (all ASCII, sparse Unicode, heavy
Unicode, all outside normalizatio ranges) and there was no
observable difference, but definitely not super rigorous.

Finally, I inlined all of the char blocks, rather than replying on
the 'sparse table' static generation which was implemented
earlier. In particular, `normalization` is now a `const fn`. At
least in my mind it is a bit easier to read in this form. It also
makes it much clearer when characters are missed.
alexrutar added a commit to rutar-forks/nucleo that referenced this issue Nov 18, 2024
This improves the normalization for Latin characters, mainly to
address the concerns in helix-editor#51. This adds a very large number of new
normalizations, especially in the 'Latin Extended Additional' block
which for some reason was missing every capital letter.

I did not add normalizations in any new Unicode blocks, but I did
slightly extend the 'Latin 1' block to also capture some of the
subscripts; this is for consistency with the 'Subscripts and
Superscripts' block which was previously handled. I also preserved
the actual implementation of the `normalize` function in terms of
the check order, etc. In particular, the generated code should be
approximately the same. To verify this, I ran some crude
benchmarks on a variety of input (all ASCII, sparse Unicode, heavy
Unicode, all outside normalizatio ranges) and there was no
observable difference, but definitely not super rigorous.

Finally, I inlined all of the char blocks, rather than replying on
the 'sparse table' static generation which was implemented
earlier. In particular, `normalization` is now a `const fn`. At
least in my mind it is a bit easier to read in this form. It also
makes it much clearer when characters are missed.
alexrutar added a commit to rutar-forks/nucleo that referenced this issue Nov 18, 2024
This improves the normalization for Latin characters, mainly to
address the concerns in helix-editor#51. This adds a very large number of new
normalizations, especially in the 'Latin Extended Additional' block
which for some reason was missing every capital letter.

I did not add normalizations in any new Unicode blocks, but I did
slightly extend the 'Latin 1' block to also capture some of the
subscripts; this is for consistency with the 'Subscripts and
Superscripts' block which was previously handled. I also preserved
the actual implementation of the `normalize` function in terms of
the check order, etc. In particular, the generated code should be
approximately the same. To verify this, I ran some crude
benchmarks on a variety of input (all ASCII, sparse Unicode, heavy
Unicode, all outside normalizatio ranges) and there was no
observable difference, but definitely not super rigorous.

Finally, I inlined all of the char blocks, rather than replying on
the 'sparse table' static generation which was implemented
earlier. In particular, `normalization` is now a `const fn`. At
least in my mind it is a bit easier to read in this form. It also
makes it much clearer when characters are missed.
alexrutar added a commit to rutar-forks/nucleo that referenced this issue Nov 18, 2024
This improves the normalization for Latin characters, mainly to
address the concerns in helix-editor#51. This adds a very large number of new
normalizations, especially in the 'Latin Extended Additional' block
which for some reason was missing every capital letter.

I did not add normalizations in any new Unicode blocks, but I did
slightly extend the 'Latin 1' block to also capture some of the
subscripts; this is for consistency with the 'Subscripts and
Superscripts' block which was previously handled. I also preserved
the actual implementation of the `normalize` function in terms of
the check order, etc. In particular, the generated code should be
approximately the same. To verify this, I ran some crude
benchmarks on a variety of input (all ASCII, sparse Unicode, heavy
Unicode, all outside normalizatio ranges) and there was no
observable difference, but definitely not super rigorous.

Finally, I inlined all of the char blocks, rather than replying on
the 'sparse table' static generation which was implemented
earlier. In particular, `normalization` is now a `const fn`. At
least in my mind it is a bit easier to read in this form. It also
makes it much clearer when characters are missed.
@alexrutar
Copy link
Contributor

This is now fixed in #60

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants