Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding new loop_parse_if_digits function #291

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

RealTimeChris
Copy link

We can eliminate one of the subtractions from within is_made_of_eight_digits_fast and parse_eight_digits_unrolled by instead moving it out into when the value is collected. Additionally, we can also save some overhead by not collecting the value twice for each iteration of the loop within loop_parse_if_eight_digits.

The new function results in the following benchmark scores:
https://pastebin.com/raw/EQJvPF5G
with the following code:
https://github.com/RealTimeChris/BenchmarkSuite/blob/loop_parse_if_digits/StringComparison/main.cpp

We can eliminate one of the subtractions from within is_made_of_eight_digits_fast and parse_eight_digits_unrolled by instead moving it out into when the value is collected. Additionally, we can also save some overhead by not collecting the value twice for each iteration of the loop within loop_parse_if_eight_digits.
@dalle
Copy link
Collaborator

dalle commented Nov 25, 2024

I prefer the tests to be run before merging this

@dalle dalle requested a review from lemire November 25, 2024 13:58
Copy link
Member

@lemire lemire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will not merge pull requests that skip CI.

@lemire
Copy link
Member

lemire commented Nov 25, 2024

I encourage you to include a rationale as to why you think that compilers will generate faster code.

Be mindful that if you are recording timings on purely synthetic benchmarks, without performance counters, you can easily derive the wrong conclusions.

The bar is pretty high for us to accept an optimization in this library as it is quite mature. It is fairly easy to reorganize the code and get better results under on compiler for one type of synthetic data set.

Please consider there is a significant cost for us to accepting such as change as it may introduce a new bug, which would be terrible. So we need hard evidence that the result is definitively faster.

For benchmarking, I use the following framework...

https://github.com/lemire/simple_fastfloat_benchmark

It includes realistic files (derived from actual content).

I tried testing your PR, but it won't build.

We do very much encourage pull requests which might improve performance, but, as I wrote, the bar is high. I will be definitively doing my own analysis, on different hardware and different compilers.

@lemire
Copy link
Member

lemire commented Nov 25, 2024

PR #293 added benchmarks to the project. I invite you to sync your fork and run these benchmarks (before and after), ideally with performance counters.

@RealTimeChris
Copy link
Author

PR #293 added benchmarks to the project. I invite you to sync your fork and run these benchmarks (before and after), ideally with performance counters.

Alright thank you very much I will do that.

@RealTimeChris
Copy link
Author

I encourage you to include a rationale as to why you think that compilers will generate faster code.

Be mindful that if you are recording timings on purely synthetic benchmarks, without performance counters, you can easily derive the wrong conclusions.

The bar is pretty high for us to accept an optimization in this library as it is quite mature. It is fairly easy to reorganize the code and get better results under on compiler for one type of synthetic data set.

Please consider there is a significant cost for us to accepting such as change as it may introduce a new bug, which would be terrible. So we need hard evidence that the result is definitively faster.

For benchmarking, I use the following framework...

https://github.com/lemire/simple_fastfloat_benchmark

It includes realistic files (derived from actual content).

I tried testing your PR, but it won't build.

We do very much encourage pull requests which might improve performance, but, as I wrote, the bar is high. I will be definitively doing my own analysis, on different hardware and different compilers.

I included that. As I said, given that we are no longer subtracting 0x30303030303030 twice, here:

return !((((val + 0x4646464646464646) | (val - 0x3030303030303030)) &

and here:
val -= 0x3030303030303030;

And instead moving the subtraction out to here:
size_t value_new = read8_to_u64(string) - 0x3030303030303030;

As well as no longer collecting the value twice as is done here:
is_made_of_eight_digits_fast(read8_to_u64(p))) {
i = i * 100000000 +
parse_eight_digits_unrolled(read8_to_u64(
p)); // in rare cases, this will overflow, but that's ok

But instead only collect it once here:
size_t value_new = read8_to_u64(string) - 0x3030303030303030;

This should produce faster code, no?

@lemire
Copy link
Member

lemire commented Nov 25, 2024

This should produce faster code, no?

Your mental model is that you are reducing the number of instructions. If so, then the next step is to validate that it is true. If you run the benchmark I have just included in privileged mode (so that you have access to hardware performance counters), you should see the number of instructions retired.

E.g., the output should look like this...

$ ./build/benchmarks/realbenchmark
...
fastfloat (64)                          :  1164.85 MB/s (+/- 1.8 %)    66.94 Mfloat/s      18.32 i/B ...

See the 18.32 i/B? This is the number of instructions (on average) per byte of input. If this number goes down sufficiently, then, everything else being equal, the performance might be improved.

This depends crucially on the compiler and on the targeted system.

Alternatively, you can examine the binary output and manually check that there are fewer instructions being executed.

There are other elements to consider... but if you can show a substantial reduction in the number of retired instructions per input byte, then it is a very good start. (This is not the only way to optimize the code.)

@dalle dalle changed the title Adding new loop_parse_if_digits function. [skip ci] Adding new loop_parse_if_digits function Nov 26, 2024
@RealTimeChris
Copy link
Author

This should produce faster code, no?

Your mental model is that you are reducing the number of instructions. If so, then the next step is to validate that it is true. If you run the benchmark I have just included in privileged mode (so that you have access to hardware performance counters), you should see the number of instructions retired.

E.g., the output should look like this...

$ ./build/benchmarks/realbenchmark
...
fastfloat (64)                          :  1164.85 MB/s (+/- 1.8 %)    66.94 Mfloat/s      18.32 i/B ...

See the 18.32 i/B? This is the number of instructions (on average) per byte of input. If this number goes down sufficiently, then, everything else being equal, the performance might be improved.

This depends crucially on the compiler and on the targeted system.

Alternatively, you can examine the binary output and manually check that there are fewer instructions being executed.

There are other elements to consider... but if you can show a substantial reduction in the number of retired instructions per input byte, then it is a very good start. (This is not the only way to optimize the code.)

The current code achieved the following results using your framework, on an i7 8700k:

CLANG/UBUNTU:

New Code:

# Using hardware counters
####
# reading /home/chris/fast_float/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   726.38 MB/s (+/- 12.3 %)    41.74 Mfloat/s      21.22 i/B   387.24 i/f (+/- 0.0 %)      5.57 c/B   101.66 c/f (+/- 9.7 %)      3.81 i/c     60.44 b/f     11.78 bm/f      4.24 GHz
fastfloat (32)                          :   755.59 MB/s (+/- 7.7 %)    43.42 Mfloat/s      20.59 i/B   375.78 i/f (+/- 0.0 %)      5.38 c/B    98.10 c/f (+/- 5.5 %)      3.83 i/c     57.82 b/f     10.50 bm/f      4.26 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :  1429.88 MB/s (+/- 4.7 %)    41.09 Mfloat/s      10.29 i/B   375.51 i/f (+/- 0.0 %)      2.83 c/B   103.21 c/f (+/- 3.8 %)      3.64 i/c     59.46 b/f     10.94 bm/f      4.24 GHz
fastfloat (32)                          :  1500.15 MB/s (+/- 6.9 %)    43.10 Mfloat/s       9.98 i/B   364.06 i/f (+/- 0.0 %)      2.70 c/B    98.36 c/f (+/- 3.1 %)      3.70 i/c     56.84 b/f     10.02 bm/f      4.24 GHz
####
# reading /home/chris/fast_float/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   471.22 MB/s (+/- 3.5 %)    64.19 Mfloat/s      29.92 i/B   230.27 i/f (+/- 0.0 %)      8.39 c/B    64.56 c/f (+/- 2.0 %)      3.57 i/c     41.87 b/f      9.35 bm/f      4.14 GHz
fastfloat (32)                          :   450.11 MB/s (+/- 3.4 %)    61.32 Mfloat/s      32.52 i/B   250.30 i/f (+/- 0.0 %)      8.96 c/B    68.94 c/f (+/- 1.6 %)      3.63 i/c     42.50 b/f      9.99 bm/f      4.23 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :   990.81 MB/s (+/- 8.2 %)    67.49 Mfloat/s      14.95 i/B   230.09 i/f (+/- 0.0 %)      4.10 c/B    63.18 c/f (+/- 4.8 %)      3.64 i/c     41.50 b/f      9.73 bm/f      4.26 GHz
fastfloat (32)                          :   927.38 MB/s (+/- 28.7 %)    63.17 Mfloat/s      16.25 i/B   250.11 i/f (+/- 0.0 %)      4.39 c/B    67.51 c/f (+/- 24.2 %)      3.71 i/c     42.12 b/f     10.73 bm/f      4.26 GHz

Old Code:

# Using hardware counters
####
# reading /home/chris/test/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   693.12 MB/s (+/- 3.9 %)    39.83 Mfloat/s      20.52 i/B   374.37 i/f (+/- 0.0 %)      5.84 c/B   106.61 c/f (+/- 2.3 %)      3.51 i/c     60.44 b/f     10.99 bm/f      4.25 GHz
fastfloat (32)                          :   755.23 MB/s (+/- 5.2 %)    43.40 Mfloat/s      19.89 i/B   362.92 i/f (+/- 0.0 %)      5.37 c/B    97.99 c/f (+/- 3.5 %)      3.70 i/c     57.82 b/f      9.96 bm/f      4.25 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :  1231.28 MB/s (+/- 8.4 %)    35.38 Mfloat/s      10.29 i/B   375.51 i/f (+/- 0.0 %)      3.30 c/B   120.44 c/f (+/- 5.8 %)      3.12 i/c     59.46 b/f     11.15 bm/f      4.26 GHz
fastfloat (32)                          :  1271.75 MB/s (+/- 7.5 %)    36.54 Mfloat/s       9.98 i/B   364.06 i/f (+/- 0.0 %)      3.19 c/B   116.48 c/f (+/- 4.0 %)      3.13 i/c     56.84 b/f     10.07 bm/f      4.26 GHz
####
# reading /home/chris/test/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   489.71 MB/s (+/- 9.8 %)    66.71 Mfloat/s      29.15 i/B   224.37 i/f (+/- 0.0 %)      8.11 c/B    62.43 c/f (+/- 8.0 %)      3.59 i/c     41.87 b/f      9.49 bm/f      4.16 GHz
fastfloat (32)                          :   450.12 MB/s (+/- 33.4 %)    61.32 Mfloat/s      31.75 i/B   244.40 i/f (+/- 0.0 %)      8.82 c/B    67.92 c/f (+/- 32.1 %)      3.60 i/c     42.50 b/f     10.40 bm/f      4.16 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :   944.14 MB/s (+/- 19.5 %)    64.31 Mfloat/s      14.95 i/B   230.09 i/f (+/- 0.0 %)      4.30 c/B    66.26 c/f (+/- 15.2 %)      3.47 i/c     41.50 b/f      9.88 bm/f      4.26 GHz
fastfloat (32)                          :   842.46 MB/s (+/- 10.7 %)    57.38 Mfloat/s      16.25 i/B   250.11 i/f (+/- 0.0 %)      4.78 c/B    73.53 c/f (+/- 6.3 %)      3.40 i/c     42.12 b/f     10.07 bm/f      4.22 GHz

MSVC/WINDOWS:

New Code:

####
# reading C:/Users/Chris/source/repos/fast_float/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   366.17 MB/s (+/- 11.0 %)    21.04 Mfloat/s      47.52 ns/f
fastfloat (32)                          :   377.26 MB/s (+/- 8.8 %)    21.68 Mfloat/s      46.13 ns/f
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :   708.95 MB/s (+/- 13.4 %)    20.37 Mfloat/s      49.09 ns/f
fastfloat (32)                          :   698.66 MB/s (+/- 15.9 %)    20.07 Mfloat/s      49.81 ns/f
####
# reading C:/Users/Chris/source/repos/fast_float/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   235.39 MB/s (+/- 8.0 %)    32.07 Mfloat/s      31.19 ns/f
fastfloat (32)                          :   217.98 MB/s (+/- 11.0 %)    29.69 Mfloat/s      33.68 ns/f
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :   475.50 MB/s (+/- 6.0 %)    32.39 Mfloat/s      30.88 ns/f
fastfloat (32)                          :   435.04 MB/s (+/- 7.0 %)    29.63 Mfloat/s      33.75 ns/f

Old Code:

####
# reading C:/Users/Chris/source/repos/test/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   366.25 MB/s (+/- 13.7 %)    21.05 Mfloat/s
fastfloat (32)                          :   362.43 MB/s (+/- 17.4 %)    20.83 Mfloat/s
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :   662.16 MB/s (+/- 23.5 %)    19.03 Mfloat/s
fastfloat (32)                          :   661.80 MB/s (+/- 12.0 %)    19.02 Mfloat/s
####
# reading C:/Users/Chris/source/repos/test/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   228.54 MB/s (+/- 32.9 %)    31.13 Mfloat/s
fastfloat (32)                          :   212.88 MB/s (+/- 13.1 %)    29.00 Mfloat/s
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :   457.15 MB/s (+/- 6.2 %)    31.14 Mfloat/s
fastfloat (32)                          :   437.25 MB/s (+/- 4.6 %)    29.78 Mfloat/s

GCC/UBUNTU:

New Code:

# Using hardware counters
####
# reading /home/chris/fast_float/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   930.93 MB/s (+/- 5.7 %)    53.50 Mfloat/s      16.50 i/B   301.01 i/f (+/- 0.0 %)      4.44 c/B    80.92 c/f (+/- 2.0 %)      3.72 i/c     49.29 b/f     11.44 bm/f      4.33 GHz
fastfloat (32)                          :   934.23 MB/s (+/- 6.4 %)    53.69 Mfloat/s      16.49 i/B   300.93 i/f (+/- 0.0 %)      4.45 c/B    81.22 c/f (+/- 2.3 %)      3.71 i/c     47.80 b/f     10.11 bm/f      4.36 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :  1592.76 MB/s (+/- 5.2 %)    45.77 Mfloat/s       8.57 i/B   312.87 i/f (+/- 0.0 %)      2.55 c/B    92.93 c/f (+/- 3.6 %)      3.37 i/c     50.27 b/f     11.29 bm/f      4.25 GHz
fastfloat (32)                          :  1647.42 MB/s (+/- 8.2 %)    47.34 Mfloat/s       8.57 i/B   312.79 i/f (+/- 0.0 %)      2.46 c/B    89.89 c/f (+/- 4.5 %)      3.48 i/c     48.78 b/f     10.16 bm/f      4.25 GHz
####
# reading /home/chris/fast_float/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   737.92 MB/s (+/- 7.5 %)   100.52 Mfloat/s      21.67 i/B   166.80 i/f (+/- 0.0 %)      5.51 c/B    42.44 c/f (+/- 2.2 %)      3.93 i/c     28.38 b/f      9.04 bm/f      4.27 GHz
fastfloat (32)                          :   585.43 MB/s (+/- 3.0 %)    79.75 Mfloat/s      24.04 i/B   185.08 i/f (+/- 0.0 %)      6.95 c/B    53.49 c/f (+/- 0.9 %)      3.46 i/c     30.34 b/f      9.70 bm/f      4.27 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :  1357.46 MB/s (+/- 6.4 %)    92.46 Mfloat/s      11.19 i/B   172.19 i/f (+/- 0.0 %)      3.00 c/B    46.14 c/f (+/- 3.6 %)      3.73 i/c     28.76 b/f      9.40 bm/f      4.27 GHz
fastfloat (32)                          :  1202.43 MB/s (+/- 2.7 %)    81.90 Mfloat/s      12.37 i/B   190.47 i/f (+/- 0.0 %)      3.38 c/B    52.08 c/f (+/- 1.7 %)      3.66 i/c     30.71 b/f      9.26 bm/f      4.27 GHz

Old Code:

# Using hardware counters
####
# reading /home/chris/test/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   885.60 MB/s (+/- 5.7 %)    50.89 Mfloat/s      16.29 i/B   297.15 i/f (+/- 0.0 %)      4.69 c/B    85.58 c/f (+/- 1.7 %)      3.47 i/c     50.27 b/f     11.15 bm/f      4.36 GHz
fastfloat (32)                          :   971.87 MB/s (+/- 8.2 %)    55.85 Mfloat/s      16.28 i/B   297.07 i/f (+/- 0.0 %)      4.28 c/B    78.07 c/f (+/- 5.1 %)      3.81 i/c     48.78 b/f     11.27 bm/f      4.36 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :  1565.78 MB/s (+/- 3.9 %)    44.99 Mfloat/s       8.49 i/B   309.89 i/f (+/- 0.0 %)      2.60 c/B    94.75 c/f (+/- 2.9 %)      3.27 i/c     50.27 b/f     11.12 bm/f      4.26 GHz
fastfloat (32)                          :  1629.05 MB/s (+/- 4.3 %)    46.81 Mfloat/s       8.49 i/B   309.81 i/f (+/- 0.0 %)      2.50 c/B    91.08 c/f (+/- 3.3 %)      3.40 i/c     48.78 b/f     10.01 bm/f      4.26 GHz
####
# reading /home/chris/test/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   680.55 MB/s (+/- 4.7 %)    92.71 Mfloat/s      22.24 i/B   171.22 i/f (+/- 0.0 %)      5.98 c/B    46.01 c/f (+/- 2.3 %)      3.72 i/c     28.76 b/f      9.34 bm/f      4.27 GHz
fastfloat (32)                          :   620.27 MB/s (+/- 3.3 %)    84.50 Mfloat/s      24.62 i/B   189.49 i/f (+/- 0.0 %)      6.56 c/B    50.49 c/f (+/- 2.3 %)      3.75 i/c     30.71 b/f      9.29 bm/f      4.27 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :  1241.73 MB/s (+/- 7.4 %)    84.58 Mfloat/s      11.43 i/B   175.95 i/f (+/- 0.0 %)      3.28 c/B    50.44 c/f (+/- 5.2 %)      3.49 i/c     28.76 b/f      9.35 bm/f      4.27 GHz
fastfloat (32)                          :  1120.19 MB/s (+/- 2.8 %)    76.30 Mfloat/s      12.62 i/B   194.23 i/f (+/- 0.0 %)      3.62 c/B    55.72 c/f (+/- 2.2 %)      3.49 i/c     30.71 b/f      9.49 bm/f      4.25 GHz

It would seem that it's not exactly less instructions, although it seems to run faster, generally.

@lemire
Copy link
Member

lemire commented Nov 26, 2024

Thank you. Don't worry about linting further. We can do that later.

I will need to review this manually.

@@ -32,7 +32,7 @@ template <typename UC> fastfloat_really_inline constexpr bool has_simd_opt() {
// able to optimize it well.
template <typename UC>
fastfloat_really_inline constexpr bool is_integer(UC c) noexcept {
return !(c > UC('9') || c < UC('0'));
return static_cast<uint8_t>(c - UC('0')) < 10;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will overflow when sizeof(UC) > sizeof(uint8_t) resulting in false positives for:

wchar_t ch = '0';
ch += 256;
is_integer(ch); // incorrectly true

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will overflow when sizeof(UC) > sizeof(uint8_t) resulting in false positives for:

wchar_t ch = '0';
ch += 256;
is_integer(ch); // incorrectly true

How does it look now?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks better, but conditional_t require c++14.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright it should be good now.

This should avoid using unnecessarily sized unsigned types, while also avoiding overflow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants