Optimize String `_find_upper` and `_find_lower` by handling low-bit characters (including normal latin) explicitly. #99971

Ivorforce · 2024-12-03T16:26:42Z

I ran across this and wanted to see whether it could be improved. The throughput of `_find_upper` and `_find_lower` for normal Latin text is improved severalfold, making a measurable difference in several benchmarks.

Here are the benchmark results of 355dfad (this branch) vs ee4acfb (current master):

The change also removes 896 bytes from the caps table (while adding only a few more in code).

Explanation

This takes advantage of the fact that most common (Latin) text falls in the low-bit range (ASCII). For that range, an if-based algorithm runs instead of a binary search.
As can be seen in the benchmarks (e.g. find vs findn), converting the case of a single common Latin character is now almost free compared to the other parts of the algorithms.
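
To make the idea concrete, here is a minimal sketch of a fast path in front of a binary search. It is not the code from this PR: every `example_` name is hypothetical, the three-entry table is a heavily truncated stand-in for the generated caps table, and the fast path here only covers ASCII.

```cpp
#include <cassert>

// Hypothetical, truncated stand-in for the generated caps table:
// (uppercase, lowercase) pairs, sorted by the uppercase code point.
static const char32_t example_caps_table[][2] = {
	{ 0x0100, 0x0101 }, // Ā -> ā
	{ 0x0391, 0x03B1 }, // Α -> α
	{ 0x0410, 0x0430 }, // А -> а
};
constexpr int example_caps_len = sizeof(example_caps_table) / sizeof(example_caps_table[0]);

// Slow path: binary search over the sorted table.
inline char32_t example_find_lower_slow(char32_t ch) {
	int low = 0;
	int high = example_caps_len - 1;
	while (low <= high) {
		const int middle = low + (high - low) / 2;
		if (ch < example_caps_table[middle][0]) {
			high = middle - 1;
		} else if (ch > example_caps_table[middle][0]) {
			low = middle + 1;
		} else {
			return example_caps_table[middle][1];
		}
	}
	return ch; // Not in the table: no lowercase mapping known.
}

// Fast path first: ASCII is resolved with a few comparisons
// and never reaches the binary search.
inline char32_t example_find_lower(char32_t ch) {
	if (ch < 0x80) {
		return (ch >= U'A' && ch <= U'Z') ? char32_t(ch + 32) : ch;
	}
	return example_find_lower_slow(ch);
}

// Small usage check.
inline void example_find_lower_checks() {
	assert(example_find_lower(U'G') == U'g'); // fast path, cased
	assert(example_find_lower(U'.') == U'.'); // fast path, uncased
	assert(example_find_lower(0x0391) == 0x03B1); // table path
}
```

Note that uncased ASCII characters such as `.` also exit through the fast path, so they no longer pay for a failed binary search.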

I have also moved the functions out of the generator script to make them easier to maintain (as plain C++ rather than strings). char_utils seemed like a fitting place since it already harbours similar functions.

Mickeon · 2024-12-03T16:40:06Z

Let me get this. Is the memory footprint roughly the same in this PR? The chart shows interesting results, but it's so minimal in most scenarios I'm not sure it's worth it.

Ivorforce · 2024-12-03T16:40:56Z

> Let me get this. Is the memory footprint roughly the same in this PR? The chart shows interesting results, but it's so minimal in most scenarios I'm not sure it's worth it if it's 1kb more.

I added an amendment to my statement; due to removed entries in the other two tables it should actually be about 370b less total even after the change to 32-bit tables :)

MewPurPur · 2024-12-03T16:41:23Z

#93360 Seems to do the same

Ivorforce · 2024-12-03T16:50:12Z

> #93360 Seems to do the same

Interesting PR! It uses a different technique, switching to hashes instead of a binary search, which improves the speed for different reasons. My PR is likely to gain more speed than the other for most latin texts, but they are not mutually exclusive and could be combined.

Ivorforce · 2024-12-03T17:20:31Z

Since we just realized the tables are using int instead of char32_t for no reason, I changed it to char32_t in this PR. This should reduce the binary size and RAM use by about 5kb on 64-bit systems.

kiroxas · 2024-12-03T17:30:46Z

Are those supposed to be functions that find the upper/lower versions for every Unicode codepoint? And what you did is extract the ASCII range into its own lookup table to have direct lookup for that range? Do I get that right?

Ivorforce · 2024-12-03T17:36:41Z

> Are those supposed to be functions that find the upper/lower versions for every Unicode codepoint? And what you did is extract the ASCII range into its own lookup table to have direct lookup for that range? Do I get that right?

Exactly right!
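
For readers following along, a dense table for the ASCII range could look roughly like the sketch below. This is an illustration only: the `example_` names are hypothetical, the table layout is an assumption rather than what the PR contained at this point, and the non-ASCII branch is a placeholder for the existing binary search.

```cpp
#include <cassert>

// Hypothetical dense table: one entry per ASCII code point, so the
// code point itself is the index and no search is needed.
struct ExampleAsciiLowerTable {
	char32_t values[128];

	constexpr ExampleAsciiLowerTable() :
			values() {
		for (int i = 0; i < 128; i++) {
			values[i] = (i >= 'A' && i <= 'Z') ? char32_t(i + 32) : char32_t(i);
		}
	}
};

// Built at compile time, so it ends up as constant data in the binary.
constexpr ExampleAsciiLowerTable example_ascii_lower{};

inline char32_t example_lower_dense(char32_t ch) {
	if (ch < 128) {
		return example_ascii_lower.values[ch]; // direct lookup
	}
	// Placeholder: the real code would fall back to the binary search
	// over the sparse caps table here.
	return ch;
}

// Small usage check.
inline void example_lower_dense_checks() {
	assert(example_lower_dense(U'Q') == U'q');
	assert(example_lower_dense(U'7') == U'7');
}
```

The trade-off against an if-based version is a small amount of extra constant data in exchange for a lookup without data-dependent branches inside the ASCII range.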

kiroxas · 2024-12-03T17:47:26Z

It would be nice to see how it compares to the old `| 32` trick, e.g. in the if for toLower: `return ((ch >= 'A' && ch <= 'Z') || (ch >= 192 && ch <= 222)) ? (ch | 32) : ch`, and for toUpper: `return ((ch >= 'a' && ch <= 'z') || (ch >= 224 && ch <= 254)) ? (ch & (~32)) : ch`.
It might not be faster than a lookup table, but it would be interesting to see how it compares.

Ivorforce · 2024-12-03T17:55:03Z

> Would be nice to see how it compares to old | 32 trick

I actually tried this first.
Unfortunately it does not quite cover the 8-bit range. There are 2 blocks of cased letters in there, and to cover them we'd need more ifs. Plus, common non-cased characters like `.` would not be caught, causing them to go into the binary search path, unless you add even more ifs...

I don't have a benchmark for it but I'd wager the branch mispredictions would make it quite slow.

Chubercik · 2024-12-03T18:05:56Z

This PR stands as an opposite of what I did in #90726 😅

kiroxas · 2024-12-03T18:08:48Z

> > Would be nice to see how it compares to old | 32 trick

> I actually tried this first. Unfortunately it does not quite cover the 8 bit range. There's 2 blocks of cased letters in there and to cover them we'd need more ifs. Plus common non cased characters like . would not be caught, causing them to go into the binary search path - unless you add even more ifs...

> I don't have a benchmark for it but I'd wager the branch mispredictions would make it quite slow.

Well that's what I wrote, for example for to lower

if (ch < 0x00FF) {
	return ((ch >= 'A' && ch <= 'Z') || (ch >= 192 && ch <= 222)) ? (ch | 32) : ch;
} // else binary search

You should cover ascii ranges here (I think) without additional if cost
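
One detail worth spelling out when extending the `| 32` trick to the 8-bit range: 215 (×) lies inside 192..222 and 247 (÷) inside 224..254, so an unguarded range check would turn × into ÷ and back. Below is a hedged sketch of the trick with those two code points excluded. The `example_` names are hypothetical and this is not the code that ended up in the PR.

```cpp
#include <cassert>

// Latin-1 lowercase via the bit trick: upper and lower forms differ
// only in bit 0x20 for A-Z and for À-Þ (0xC0-0xDE).
inline char32_t example_latin1_lower(char32_t ch) {
	const bool is_upper = (ch >= U'A' && ch <= U'Z') ||
			(ch >= 0xC0 && ch <= 0xDE && ch != 0xD7); // skip × (0xD7)
	return is_upper ? char32_t(ch | 0x20) : ch;
}

// Latin-1 uppercase, the mirror image: a-z and à-þ (0xE0-0xFE).
// Simplification: ß (0xDF), µ (0xB5) and ÿ (0xFF) are left unchanged,
// since their uppercase forms (if any) are not 0x20 away.
inline char32_t example_latin1_upper(char32_t ch) {
	const bool is_lower = (ch >= U'a' && ch <= U'z') ||
			(ch >= 0xE0 && ch <= 0xFE && ch != 0xF7); // skip ÷ (0xF7)
	return is_lower ? char32_t(ch & ~char32_t(0x20)) : ch;
}

// Small usage check.
inline void example_latin1_checks() {
	assert(example_latin1_lower(0xC9) == 0xE9); // É -> é
	assert(example_latin1_lower(0xD7) == 0xD7); // × stays ×
	assert(example_latin1_upper(0xE9) == 0xC9); // é -> É
}
```

A complete implementation would still need to route ÿ and µ somewhere, because their uppercase forms (U+0178 and U+039C) live outside the 8-bit range.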

Ivorforce · 2024-12-03T18:10:57Z

> This PR stands as an opposite of what I did in #90726 😅

I don't think that's true, they are doing quite orthogonal things and could both be merged :)

AThousandShips · 2024-12-03T18:13:45Z

Please keep the rapid CI runs on this in mind as well: make sure things compile locally on your own branch before opening a PR, to avoid taxing the CI too much.

Ivorforce · 2024-12-03T18:30:26Z

> You should cover ascii ranges here (I think) without additional if cost

I tested the if-based code. Performance is very similar to the one based on dense tables¹. I am currently without internet, but I will update the PR to use ifs for the ASCII range when that changes :)

¹ Thinking about it, this is possibly because lowercase and uppercase letters often occur near each other, leading to fewer branch mispredictions.

Chubercik · 2024-12-03T18:30:55Z

> > This PR stands as an opposite of what I did in #90726 😅

> I don't think that's true, they are doing quite orthogonal things and could both be merged :)

Oh, only a portion of the table that fits a narrower data type got converted - my bad, at first glance I thought you removed other matchings 😁

kiroxas · 2024-12-03T19:29:24Z

> Since we just realized the tables are using int instead of char32_t for no reason, I changed it to char32_t in this PR. This should reduce the binary size and RAM use by about 5kb on 64-bit systems.

int is 4 bytes on x64 on all major compilers, so there is no gain in size here I suspect. But being explicit on the size is a plus, no more ambiguity.
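
This is easy to check at compile time. A small sketch, with hypothetical `example_` names, that compares the footprint of a table of pairs for both element types:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// char32_t has to hold a UTF-32 code unit, so it is at least 32 bits wide.
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t must hold a UTF-32 code unit");

// Size in bytes of a table holding `entries` pairs of element type T.
template <typename T>
constexpr std::size_t example_table_bytes(std::size_t entries) {
	return entries * 2 * sizeof(T);
}

// True where int and char32_t share a width. This holds on the common
// x64 toolchains, but the C++ standard does not guarantee it.
constexpr bool example_same_width = sizeof(int) == sizeof(char32_t);

// Small usage check: with equal widths, switching the element type
// does not change the table size.
inline void example_table_size_checks() {
	if (example_same_width) {
		assert(example_table_bytes<int>(100) == example_table_bytes<char32_t>(100));
	}
}
```

Where the widths match, switching the element type changes nothing about the size, which is why the benefit is explicitness rather than memory.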

Ivorforce · 2024-12-03T21:32:40Z

> int is 4 bytes on x64 on all major compilers, so there is no gain in size here I suspect. But being explicit on the size is a plus, no more ambiguity.

Wow... I can't believe I didn't know that. Makes the change less critical (though still good to be specific).

Ivorforce · 2024-12-04T15:15:37Z

Alright, the PR is ready for review again.

I did some more tests with different if based implementations, but they were all equal in speed to the dense table LUT implementation, so for both simplicity and byte use I switched to the current implementation. Thanks again to @kiroxas!

We also don't save 5kb; the tables were already 32-bit, though I've kept the change from int to char32_t to be more explicit; it's just more correct and could potentially help portability of the code. Up to 896 bytes are saved from the removed entries in the caps tables, but I don't think that's an important selling point for the PR.

Ivorforce · 2025-01-30T13:23:39Z

#90726 was just merged. This is good, because it fixed the tables being outdated, but it also decreased performance of _find_upper and _find_lower further because the table size increased significantly.

I have updated the PR accordingly (moving the methods to char_utils so they aren't generated by the script, for cleanup) and re-ran the benchmark.

Compare this to the old benchmark, and you can see the importance of this PR increased a fair bit:

Ivorforce marked this pull request as ready for review December 3, 2024 16:26

Ivorforce requested a review from a team as a code owner December 3, 2024 16:26

Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch 2 times, most recently from 378b6d4 to 9c9e7d3 Compare December 3, 2024 16:36

Mickeon added enhancement topic:core performance labels Dec 3, 2024

Mickeon added this to the 4.x milestone Dec 3, 2024

Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from 9c9e7d3 to 4daea36 Compare December 3, 2024 16:53

Ivorforce changed the title ~~Optimize String _find_upper and _find_lower by including a dense ASCII-range conversion table.~~ Optimize String _find_upper and _find_lower by including a dense ASCII-range conversion table. Dec 3, 2024

Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from 4daea36 to 6469b52 Compare December 3, 2024 17:16

Ivorforce changed the title ~~Optimize String _find_upper and _find_lower by including a dense ASCII-range conversion table.~~ Optimize String _find_upper and _find_lower by including a dense ASCII-range conversion table. Reduce size of caps tables by half (about 5kb). Dec 3, 2024

Ivorforce marked this pull request as draft December 3, 2024 18:38

Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from 6469b52 to ebbb09c Compare December 4, 2024 15:04

Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from ebbb09c to fa40766 Compare December 4, 2024 15:10

Ivorforce marked this pull request as ready for review December 4, 2024 15:12

Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from fa40766 to e8195a1 Compare December 4, 2024 15:24

kiroxas approved these changes Dec 8, 2024

Ivorforce mentioned this pull request Dec 16, 2024

Optimize String.capitalize by running the capitalization algorithm inplace. #100486

Open

Ivorforce mentioned this pull request Jan 5, 2025

Improve ucaps.h _find_upper() and _find_lower() performance #93360

Open

Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from e8195a1 to 355dfad Compare January 30, 2025 13:20

Ivorforce requested a review from a team as a code owner January 30, 2025 13:20

Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from 355dfad to aa5200f Compare January 30, 2025 13:27

Optimize String _find_upper and _find_lower by handling low-bit c…

97a14e5

…haracters (including normal latin) explicitly. Move `_find_upper` and `_find_lower` to non-generated `char_utils.h`.

Ivorforce force-pushed the to-upper-lower-ascii-table-optimization branch from aa5200f to 97a14e5 Compare January 30, 2025 13:27
