Properly handle windows-style newlines `\r\n` #59

alexrutar · 2024-11-17T23:20:06Z

Partially addresses #58 by checking for the presence of a carriage return \r and defaulting to Unicode handling in that situation.

The carriage return only check implemented here has maybe 10% performance hit on short ASCII strings; other solutions that I tried resulted in 10-20% performance differences. (The std is_ascii implementation is very optimized.)

Technically this could result in some false positives which means more grapheme iteration. I'm not sure what the most desirable solution here is. It is unclear to me that it is worth special-casing \r\n handling at the cost of normal code paths.

Is it worth optimizing for ASCII only? How much better is ASCII performance, vs Vec<char> performance, in the matcher?

alexrutar · 2024-11-17T23:37:05Z

The 'proper' implementation using memchr::memmem::find is a tiny bit slower:

fn is_non_cr_ascii(input: &str) -> bool {
    input.is_ascii() && find(input.as_bytes(), b"\r\n").is_none()
}

And another 'improper' solution using memchr::memchr is a tiny bit faster:

fn is_non_cr_ascii(input: &str) -> bool {
    input.is_ascii() && memchr(b'\r', input.as_bytes()).is_none()
}

alexrutar · 2024-11-18T20:07:50Z

Having done more downstream work, it is no longer clear to me that the changes in this PR are necessary.

I guess technically downstream I am depending on somewhat undefined behaviour, but the point is that the only thing that matters is whether or not grapheme generation is consistent: i.e. the required API guarantee for my downstream implementation is essentially:

If the match object is a Utf32String::Ascii, then the indices correspond to char indices, and
If the match type is a Utf32String::Unicode, then the indices correspond to grapheme indices.

I guess in an ideal world, these would literally be equivalent (since Ascii is Unicode), but at the cost of introducing additional checks, it seems less necessary to me that this is required.

However this is quite a subtle footgun and makes implementing on top of nucleo somewhat more difficult. But it isn't the least of the footguns out there with respect to the Utf32String handling.

I am going to close this for now, since it seems to me there might be a more general / robust solution hiding underneath all of this that does not require this honestly quite awkward special-casing implemented here.

Properly handle windows-style newlines '\r\n'

820097c

alexrutar closed this Nov 18, 2024

alexrutar mentioned this pull request Nov 18, 2024

Grapheme handling issue with \r\n #58

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Properly handle windows-style newlines `\r\n` #59

Properly handle windows-style newlines `\r\n` #59

alexrutar commented Nov 17, 2024 •

edited

Loading

alexrutar commented Nov 17, 2024 •

edited

Loading

alexrutar commented Nov 18, 2024

Properly handle windows-style newlines \r\n #59

Properly handle windows-style newlines \r\n #59

Conversation

alexrutar commented Nov 17, 2024 • edited Loading

alexrutar commented Nov 17, 2024 • edited Loading

alexrutar commented Nov 18, 2024

Properly handle windows-style newlines `\r\n` #59

Properly handle windows-style newlines `\r\n` #59

alexrutar commented Nov 17, 2024 •

edited

Loading

alexrutar commented Nov 17, 2024 •

edited

Loading