Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle windows-style newlines \r\n #59

Closed

Conversation

alexrutar
Copy link
Contributor

@alexrutar alexrutar commented Nov 17, 2024

Partially addresses #58 by checking for the presence of a carriage return \r and defaulting to Unicode handling in that situation.

The carriage return only check implemented here has maybe 10% performance hit on short ASCII strings; other solutions that I tried resulted in 10-20% performance differences. (The std is_ascii implementation is very optimized.)

Technically this could result in some false positives which means more grapheme iteration. I'm not sure what the most desirable solution here is. It is unclear to me that it is worth special-casing \r\n handling at the cost of normal code paths.

Is it worth optimizing for ASCII only? How much better is ASCII performance, vs Vec<char> performance, in the matcher?

@alexrutar
Copy link
Contributor Author

alexrutar commented Nov 17, 2024

The 'proper' implementation using memchr::memmem::find is a tiny bit slower:

fn is_non_cr_ascii(input: &str) -> bool {
    input.is_ascii() && find(input.as_bytes(), b"\r\n").is_none()
}

And another 'improper' solution using memchr::memchr is a tiny bit faster:

fn is_non_cr_ascii(input: &str) -> bool {
    input.is_ascii() && memchr(b'\r', input.as_bytes()).is_none()
}

@alexrutar
Copy link
Contributor Author

Having done more downstream work, it is no longer clear to me that the changes in this PR are necessary.

I guess technically downstream I am depending on somewhat undefined behaviour, but the point is that the only thing that matters is whether or not grapheme generation is consistent: i.e. the required API guarantee for my downstream implementation is essentially:

  1. If the match object is a Utf32String::Ascii, then the indices correspond to char indices, and
  2. If the match type is a Utf32String::Unicode, then the indices correspond to grapheme indices.

I guess in an ideal world, these would literally be equivalent (since Ascii is Unicode), but at the cost of introducing additional checks, it seems less necessary to me that this is required.

However this is quite a subtle footgun and makes implementing on top of nucleo somewhat more difficult. But it isn't the least of the footguns out there with respect to the Utf32String handling.

I am going to close this for now, since it seems to me there might be a more general / robust solution hiding underneath all of this that does not require this honestly quite awkward special-casing implemented here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant