Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize String _find_upper and _find_lower by handling low-bit characters (including normal latin) explicitly. #99971

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Ivorforce
Copy link
Contributor

@Ivorforce Ivorforce commented Dec 3, 2024

I ran across this and wanted to try if it can't be improved. The throughput of _find_upper and _find_lower for normal latin is improved severalfold, making measurable difference in several benchmarks.

Here are the benchmark results of 355dfad (this branch) vs ee4acfb (current master):

newplot

The change also removes 896 bytes from the caps table (while adding only a few more in code).

Explanation

This is taking advantage of the fact that most common (latin) text is in the lower-bit range (ASCII). This is exploited by running an if-based algorithm instead of a binary search for this case.
As can be seen in the benchmarks (e.g. find vs findn), conversion of capitalization of a single common latin char is now almost free, compared to other parts of the algorithms.

I have also adjusted moved the functions out of the generator script to make them easier to maintain (as plain C++ rather than strings). char_utils seemed like a fitting place since it already harbours similar functions.

@Ivorforce Ivorforce marked this pull request as ready for review December 3, 2024 16:26
@Ivorforce Ivorforce requested a review from a team as a code owner December 3, 2024 16:26
@Ivorforce Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch 2 times, most recently from 378b6d4 to 9c9e7d3 Compare December 3, 2024 16:36
@Mickeon Mickeon added this to the 4.x milestone Dec 3, 2024
@Mickeon
Copy link
Contributor

Mickeon commented Dec 3, 2024

Let me get this. Is the memory footprint roughly the same in this PR? The chart shows interesting results, but it's so minimal in most scenarios I'm not sure it's worth it.

@Ivorforce
Copy link
Contributor Author

Ivorforce commented Dec 3, 2024

Let me get this. Is the memory footprint roughly the same in this PR? The chart shows interesting results, but it's so minimal in most scenarios I'm not sure it's worth it if it's 1kb more.

I added an amendment to my statement; due to removed entries in the other two tables it should actually be about 370b less total even after the change to 32-bit tables :)

@MewPurPur
Copy link
Contributor

#93360 Seems to do the same

@Ivorforce
Copy link
Contributor Author

Ivorforce commented Dec 3, 2024

#93360 Seems to do the same

Interesting PR! It uses a different technique, switching to hashes instead of a binary search, which improves the speed for different reasons. My PR is likely to gain more speed than the other for most latin texts, but they are not mutually exclusive and could be combined.

@Ivorforce Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from 9c9e7d3 to 4daea36 Compare December 3, 2024 16:53
@Ivorforce Ivorforce changed the title Optimize String _find_upper and _find_lower by including a dense ASCII-range conversion table. Optimize String _find_upper and _find_lower by including a dense ASCII-range conversion table. Dec 3, 2024
@Ivorforce Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from 4daea36 to 6469b52 Compare December 3, 2024 17:16
@Ivorforce
Copy link
Contributor Author

Ivorforce commented Dec 3, 2024

Since we just realized the tables are using int instead of char32_t for no reason, I change it to char32_t in this PR. This should reduce the binary size and RAM use by about 5kb in 64-bit systems.

@Ivorforce Ivorforce changed the title Optimize String _find_upper and _find_lower by including a dense ASCII-range conversion table. Optimize String _find_upper and _find_lower by including a dense ASCII-range conversion table. Reduce size of caps tables by half (about 5kb). Dec 3, 2024
@kiroxas
Copy link
Contributor

kiroxas commented Dec 3, 2024

Are those suppose to be functions that find the upper/lower versions for every unicode codepoints ? And what you did is extract the ASCII range into its own look up table to have direct look up for that range ? Do I get that right ?

@Ivorforce
Copy link
Contributor Author

Are those suppose to be functions that find the upper/lower versions for every unicode codepoints ? And what you did is extract the ASCII range into its own look up table to have direct look up for that range ? Do I get that right ?

Exactly right!

@kiroxas
Copy link
Contributor

kiroxas commented Dec 3, 2024

Would be nice to see how it compares to old | 32 trick like in the if for toLower to have return ((ch >= 'A' && ch <= 'Z') || (ch >= 192 && ch <= 222)) ? (ch | 32) : ch and for toUpper return ((ch > 'a' && ch < 'z') || (ch > 224)) ? (ch & (~32)) : ch
It might not be faster than a look up table but it would be interesting to see how it compares

@Ivorforce
Copy link
Contributor Author

Ivorforce commented Dec 3, 2024

Would be nice to see how it compares to old | 32 trick

I actually tried this first.
Unfortunately it does not quite cover the 8 bit range. There's 2 blocks of cased letters in there and to cover them we'd need more ifs. Plus common non cased characters like . would not be caught, causing them to go into the binary search path - unless you add even more ifs...

I don't have a benchmark for it but I'd wager the branch mispredictions would make it quite slow.

@Chubercik
Copy link
Contributor

This PR stands as an opposite of what I did in #90726 😅

@kiroxas
Copy link
Contributor

kiroxas commented Dec 3, 2024

Would be nice to see how it compares to old | 32 trick

I actually tried this first. Unfortunately it does not quite cover the 8 bit range. There's 2 blocks of cased letters in there and to cover them we'd need more ifs. Plus common non cased characters like . would not be caught, causing them to go into the binary search path - unless you add even more ifs...

I don't have a benchmark for it but I'd wager the branch mispredictions would make it quite slow.

Well that's what I wrote, for example for to lower

if (ch < 0x00FF) {
		return ((ch >= 'A' && ch <= 'Z') || (ch >= 192 && ch <= 222)) ? (ch | 32) : ch;
	}  // else binary search

You should cover ascii ranges here (I think) without additional if cost

@Ivorforce
Copy link
Contributor Author

This PR stands as an opposite of what I did in #90726 😅

I don't think that's true, they are doing quite orthogonal things and could both be merged :)

@AThousandShips
Copy link
Member

Please keep in mind rapid CI runs on this as well, make sure things compile on your own branch before opening a PR and ensure it compiles locally to avoid taxing the CI too much

@Ivorforce
Copy link
Contributor Author

You should cover ascii ranges here (I think) without additional if cost

I tested the if based code. Performance is very similar to the one based on dense tables1. I am currently without internet but i will update the PR to use ifs for the ASCII range when that changes :)

Footnotes

  1. thinking about it, possibly it's because lower/uppercase letters are often nearby each other leading to fewer branch mispredictions

@Chubercik
Copy link
Contributor

This PR stands as an opposite of what I did in #90726 😅

I don't think that's true, they are doing quite orthogonal things and could both be merged :)

Oh, only a portion of the table that fits a narrower data type got converted - my bad, at first glance I thought you removed other matchings 😁

@Ivorforce Ivorforce marked this pull request as draft December 3, 2024 18:38
@kiroxas
Copy link
Contributor

kiroxas commented Dec 3, 2024

Since we just realized the tables are using int instead of char32_t for no reason, I change it to char32_t in this PR. This should reduce the binary size and RAM use by about 5kb in 64-bit systems.

int is 4 bytes on x64 on all major compilers, so there is no gain in size here I suspect. But being explicit on the size is a plus, no more ambiguity.

@Ivorforce
Copy link
Contributor Author

int is 4 bytes on x64 on all major compilers, so there is no gain in size here I suspect. But being explicit on the size is a plus, no more ambiguity.

Wow... I can't believe I didn't know that. Makes the change less critical (though still good to be specific).

@Ivorforce Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from 6469b52 to ebbb09c Compare December 4, 2024 15:04
@Ivorforce Ivorforce changed the title Optimize String _find_upper and _find_lower by including a dense ASCII-range conversion table. Reduce size of caps tables by half (about 5kb). Optimize String _find_upper and _find_lower by handling low-bit characters (including normal latin) explicitly. Dec 4, 2024
@Ivorforce Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from ebbb09c to fa40766 Compare December 4, 2024 15:10
@Ivorforce Ivorforce marked this pull request as ready for review December 4, 2024 15:12
@Ivorforce
Copy link
Contributor Author

Ivorforce commented Dec 4, 2024

Alright, the PR is ready for review again.

I did some more tests with different if based implementations, but they were all equal in speed to the dense table LUT implementation, so for both simplicity and byte use I switched to the current implementation. Thanks again to @kiroxas!

We also don't save 5kb; the tables were already in 32 bit mode, though i've kept the change from int to char32_t to be more explicit; it's just more correct and could potentially help portability of the code. Up to 896 bytes are saved from the removed entries in the caps tables, but I don't think that's an important selling point for the PR.

@Ivorforce Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from fa40766 to e8195a1 Compare December 4, 2024 15:24
@Ivorforce
Copy link
Contributor Author

Ivorforce commented Jan 30, 2025

#90726 was just merged. This is good, because it fixed the tables being outdated, but it also decreased performance of _find_upper and _find_lower further because the table size increased significantly.

I have updated the PR accordingly (moving the methods to char_utils so they aren't generated by the script, for cleanup)
and re-ran the benchmark.

newplot

Compare this to the old benchmark, and you can see the importance of this PR increased a fair bit:

oldplot

@Ivorforce Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from 355dfad to aa5200f Compare January 30, 2025 13:27
…haracters (including normal latin) explicitly.

Move `_find_upper` and `_find_lower` to non-generated `char_utils.h`.
@Ivorforce Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from aa5200f to 97a14e5 Compare January 30, 2025 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants