Grapheme handling issue with `\r\n` #58

alexrutar · 2024-11-17T13:57:00Z

There seems to be a grapheme handling ambiguity for strings containing the "windows-style newline" \r\n?

Since \r\n is treated as a single grapheme by the Unicode segmentation crate, the highlight indices corresponding to a Utf32String constructed from a string slice consisting only of ASCII characters and \r\n will be offset relative to the grapheme indices of that string.

I am curious if this is intended behaviour? If very fast reading of the implementation in helix is correct, this is also a "bug" in helix.

It is reasonable to me that helix does not support multi-line objects at all, so in practice this never turns up. I noticed this behaviour while implementing support for multi-line matches in nucleo-picker; here is the downstream PR: autobib/nucleo-picker#13

The text was updated successfully, but these errors were encountered:

alexrutar · 2024-11-17T14:13:31Z

More generally I feel like there are a few rough edges to the Utf32String API?

My feeling somehow is that Utf32String itself being mutable is in principle incorrect, because you could easily construct a Utf32String which is entirely invalid by taking a large grapheme, iterating over chars, and pushing each char individually to the Utf32String. Of course the reasonable answer to this type of issue is 'stop hitting yourself in the head' (since pushing is totally valid if you just push complete graphemes), but in my opinion the API concerns are still an issue.
I mentioned this in Feature request: Better conversions between Utf32String and Utf32Str #50 but it is also a bit strange to me that Utf32String implements Display. For instance the conversion String -> Utf32String -> String using the From and ToString implementations is lossy.

pascalkuthe · 2024-11-17T19:59:22Z

That one is a bit ugly. I didn't really consider windows newlines because indeed nucleo is intended for single-line matches.

I guess you would want to normalize \r\n to \n instead of \r (and do that in the ascik case too).

Not sure about the API. Having a Display/From/Into implementations is useful. You can shoot yourself in the foot with it but it would be unfortunate to restrict the API. Maybe some naming changes to make things more explicit could be sufficient (or improving the docs). For example I think there can be usecases where you would want to use push() manually so just removing the API isn't great. Maybe call it push_grapheme or something isntead

alexrutar · 2024-11-17T20:58:08Z

Edit: See the comment in the closed PR instead: #59 (comment)

But leaving the old version below for posterity.

R.e. the \r\n issue, just to clarify, are you happy for this change to be implemented in nucleo, or do you think that I should perform normalization downstream? I suppose the relevant changes would be:
Replace the value.is_ascii() check with an extra check that the string does not contain \r\n:
value.is_ascii() && memmem::find(value.as_bytes, b"\r\n").is_none()
(or a an epsilon cheaper version only looking for \r; not sure how much extra overhead this has).
In the chars::graphemes function perform the extra check grapheme == "\r\n" and map to \n.
Also easy to implement downstream by just replacing \r\n with \n in any case.

I guess the newline support in general is an edge case in more scenarios, e.g. one might expect that foo$ would match foo\nbar. As you say nucleo as currently implemented isn't really designed for matching multiple lines.

Before putting this issue here I spent a bit of time searching for similar issues in helix and I only spotted one PR where there was some sort of multiline behaviour helix-editor/helix#11184

alexrutar · 2024-11-17T22:32:44Z

Edit: See the comment in the closed PR instead: #59 (comment)

But leaving the old version below for posterity.

Quick micro-benchmark shows about 10% performance hit on short ASCII strings, which I guess overall is quite mild since this almost certainly won't be a bottleneck in any reasonable situation (200k short string conversions takes about 10ms on my machine). On long Unicode strings it's essentially 0 since the grapheme iteration is so much slower anyway.

The performance hit is unsurprisingly much larger if the string actually contains a \r\n, since it then runs down the much slower Unicode path (about 6x slower). Replacing the unicode path in this situation with a custom \r\n -> \n substitution instead is about twice as slow.

This was referenced Nov 17, 2024

Properly handle windows-style newlines \r\n #59

Closed

Proper grapheme truncation for windows-style newline #61

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Grapheme handling issue with `\r\n` #58

Grapheme handling issue with `\r\n` #58

alexrutar commented Nov 17, 2024

alexrutar commented Nov 17, 2024

pascalkuthe commented Nov 17, 2024

alexrutar commented Nov 17, 2024 •

edited

Loading

alexrutar commented Nov 17, 2024 •

edited

Loading

Grapheme handling issue with \r\n #58

Grapheme handling issue with \r\n #58

Comments

alexrutar commented Nov 17, 2024

alexrutar commented Nov 17, 2024

pascalkuthe commented Nov 17, 2024

alexrutar commented Nov 17, 2024 • edited Loading

alexrutar commented Nov 17, 2024 • edited Loading

Grapheme handling issue with `\r\n` #58

Grapheme handling issue with `\r\n` #58

alexrutar commented Nov 17, 2024 •

edited

Loading

alexrutar commented Nov 17, 2024 •

edited

Loading