-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nucleo-matcher: Normalization is incomplete. #51
Comments
I might pick this up at some point; but if some else does (and for my own reference in the future) I have linked the Wikipedia pages for reference to the usual Unicode blocks:
And then there are some weirder ones (not sure if these should be handled, and if so which planes)
Not sure if I am missing anything else. For codegen the unicode-normalization crate should work. I guess hand-rolling an optimized normalizer for latin-only blocks is quite good for performance reasons, especially because of the way graphemes are handled internally (i.e. we do not need to handle multi-char normalization since we just grab the first char in each grapheme cluster anyway). There's also icu_normalizer for comparison. I guess any implementation should be benchmarked against |
This likely just a bug with the current lookup but the implementation is indeed optimized for performance and I don't want something exhaustive like the existing creates you linked here. The tabels are copied from fzf and the normalization option in nucleo and fzf are specifically only about latin characters. |
This improves the normalization for Latin characters, mainly to address the concerns in helix-editor#51. This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block which for some reason was missing every capital letter. I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of input (all ASCII, sparse Unicode, heavy Unicode, all outside normalizatio ranges) and there was no observable difference, but definitely not super rigorous. Finally, I inlined all of the char blocks, rather than replying on the 'sparse table' static generation which was implemented earlier. In particular, `normalization` is now a `const fn`. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.
This improves the normalization for Latin characters, mainly to address the concerns in helix-editor#51. This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block which for some reason was missing every capital letter. I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of input (all ASCII, sparse Unicode, heavy Unicode, all outside normalizatio ranges) and there was no observable difference, but definitely not super rigorous. Finally, I inlined all of the char blocks, rather than replying on the 'sparse table' static generation which was implemented earlier. In particular, `normalization` is now a `const fn`. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.
This improves the normalization for Latin characters, mainly to address the concerns in helix-editor#51. This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block which for some reason was missing every capital letter. I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of input (all ASCII, sparse Unicode, heavy Unicode, all outside normalizatio ranges) and there was no observable difference, but definitely not super rigorous. Finally, I inlined all of the char blocks, rather than replying on the 'sparse table' static generation which was implemented earlier. In particular, `normalization` is now a `const fn`. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.
This improves the normalization for Latin characters, mainly to address the concerns in helix-editor#51. This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block which for some reason was missing every capital letter. I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of input (all ASCII, sparse Unicode, heavy Unicode, all outside normalizatio ranges) and there was no observable difference, but definitely not super rigorous. Finally, I inlined all of the char blocks, rather than replying on the 'sparse table' static generation which was implemented earlier. In particular, `normalization` is now a `const fn`. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.
This improves the normalization for Latin characters, mainly to address the concerns in helix-editor#51. This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block which for some reason was missing every capital letter. I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of input (all ASCII, sparse Unicode, heavy Unicode, all outside normalizatio ranges) and there was no observable difference, but definitely not super rigorous. Finally, I inlined all of the char blocks, rather than replying on the 'sparse table' static generation which was implemented earlier. In particular, `normalization` is now a `const fn`. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.
This improves the normalization for Latin characters, mainly to address the concerns in helix-editor#51. This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block which for some reason was missing every capital letter. I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of input (all ASCII, sparse Unicode, heavy Unicode, all outside normalizatio ranges) and there was no observable difference, but definitely not super rigorous. Finally, I inlined all of the char blocks, rather than replying on the 'sparse table' static generation which was implemented earlier. In particular, `normalization` is now a `const fn`. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.
This is now fixed in #60 |
Thanks for this crate. But I had some issues with normalization.
Here are some random examples from the polish alphabet, that don't work as expected; there might be more issues, though.
Also, I think you might have made some copy-paste/codegen mistakes (here and in other places):
nucleo/matcher/src/chars/normalize.rs
Lines 340 to 390 in 6df3cd0
The text was updated successfully, but these errors were encountered: