Adding new loop_parse_if_digits function #291

RealTimeChris · 2024-11-24T19:21:04Z

We can eliminate one of the subtractions from within is_made_of_eight_digits_fast and parse_eight_digits_unrolled by instead moving it out into when the value is collected. Additionally, we can also save some overhead by not collecting the value twice for each iteration of the loop within loop_parse_if_eight_digits.

The new function results in the following benchmark scores:
https://pastebin.com/raw/EQJvPF5G
with the following code:
https://github.com/RealTimeChris/BenchmarkSuite/blob/loop_parse_if_digits/StringComparison/main.cpp

We can eliminate one of the subtractions from within is_made_of_eight_digits_fast and parse_eight_digits_unrolled by instead moving it out into when the value is collected. Additionally, we can also save some overhead by not collecting the value twice for each iteration of the loop within loop_parse_if_eight_digits.

dalle · 2024-11-25T11:37:19Z

I prefer the tests to be run before merging this

include/fast_float/ascii_number.h

lemire

We will not merge pull requests that skip CI.

lemire · 2024-11-25T15:11:05Z

I encourage you to include a rationale as to why you think that compilers will generate faster code.

Be mindful that if you are recording timings on purely synthetic benchmarks, without performance counters, you can easily derive the wrong conclusions.

The bar is pretty high for us to accept an optimization in this library as it is quite mature. It is fairly easy to reorganize the code and get better results under on compiler for one type of synthetic data set.

Please consider there is a significant cost for us to accepting such as change as it may introduce a new bug, which would be terrible. So we need hard evidence that the result is definitively faster.

For benchmarking, I use the following framework...

https://github.com/lemire/simple_fastfloat_benchmark

It includes realistic files (derived from actual content).

I tried testing your PR, but it won't build.

We do very much encourage pull requests which might improve performance, but, as I wrote, the bar is high. I will be definitively doing my own analysis, on different hardware and different compilers.

lemire · 2024-11-25T19:14:06Z

PR #293 added benchmarks to the project. I invite you to sync your fork and run these benchmarks (before and after), ideally with performance counters.

Co-authored-by: Anders Dalvander <[email protected]>

RealTimeChris · 2024-11-25T22:54:34Z

PR #293 added benchmarks to the project. I invite you to sync your fork and run these benchmarks (before and after), ideally with performance counters.

Alright thank you very much I will do that.

RealTimeChris · 2024-11-25T22:56:42Z

I encourage you to include a rationale as to why you think that compilers will generate faster code.

Be mindful that if you are recording timings on purely synthetic benchmarks, without performance counters, you can easily derive the wrong conclusions.

The bar is pretty high for us to accept an optimization in this library as it is quite mature. It is fairly easy to reorganize the code and get better results under on compiler for one type of synthetic data set.

Please consider there is a significant cost for us to accepting such as change as it may introduce a new bug, which would be terrible. So we need hard evidence that the result is definitively faster.

For benchmarking, I use the following framework...

https://github.com/lemire/simple_fastfloat_benchmark

It includes realistic files (derived from actual content).

I tried testing your PR, but it won't build.

We do very much encourage pull requests which might improve performance, but, as I wrote, the bar is high. I will be definitively doing my own analysis, on different hardware and different compilers.

I included that. As I said, given that we are no longer subtracting 0x30303030303030 twice, here:

fast_float/include/fast_float/ascii_number.h

Line 143 in d4c573d

return !((((val + 0x4646464646464646) | (val - 0x3030303030303030)) &

and here:

fast_float/include/fast_float/ascii_number.h

Line 124 in d4c573d

val -= 0x3030303030303030;

And instead moving the subtraction out to here:

fast_float/include/fast_float/ascii_number.h

Line 244 in d4c573d

size_t value_new = read8_to_u64(string) - 0x3030303030303030;

As well as no longer collecting the value twice as is done here:

fast_float/include/fast_float/ascii_number.h

Lines 227 to 230 in d4c573d

    
                is_made_of_eight_digits_fast(read8_to_u64(p))) { 
        
           i = i * 100000000 + 
        
               parse_eight_digits_unrolled(read8_to_u64( 
        
                   p)); // in rare cases, this will overflow, but that's ok

But instead only collect it once here:

fast_float/include/fast_float/ascii_number.h

Line 244 in d4c573d

size_t value_new = read8_to_u64(string) - 0x3030303030303030;

This should produce faster code, no?

lemire · 2024-11-25T23:37:20Z

This should produce faster code, no?

Your mental model is that you are reducing the number of instructions. If so, then the next step is to validate that it is true. If you run the benchmark I have just included in privileged mode (so that you have access to hardware performance counters), you should see the number of instructions retired.

E.g., the output should look like this...

$ ./build/benchmarks/realbenchmark
...
fastfloat (64)                          :  1164.85 MB/s (+/- 1.8 %)    66.94 Mfloat/s      18.32 i/B ...

See the 18.32 i/B? This is the number of instructions (on average) per byte of input. If this number goes down sufficiently, then, everything else being equal, the performance might be improved.

This depends crucially on the compiler and on the targeted system.

Alternatively, you can examine the binary output and manually check that there are fewer instructions being executed.

There are other elements to consider... but if you can show a substantial reduction in the number of retired instructions per input byte, then it is a very good start. (This is not the only way to optimize the code.)

include/fast_float/constexpr_feature_detect.h

RealTimeChris · 2024-11-26T08:49:23Z

This should produce faster code, no?

Your mental model is that you are reducing the number of instructions. If so, then the next step is to validate that it is true. If you run the benchmark I have just included in privileged mode (so that you have access to hardware performance counters), you should see the number of instructions retired.

E.g., the output should look like this...
$ ./build/benchmarks/realbenchmark
...
fastfloat (64)                          :  1164.85 MB/s (+/- 1.8 %)    66.94 Mfloat/s      18.32 i/B ...
See the 18.32 i/B? This is the number of instructions (on average) per byte of input. If this number goes down sufficiently, then, everything else being equal, the performance might be improved.

This depends crucially on the compiler and on the targeted system.

Alternatively, you can examine the binary output and manually check that there are fewer instructions being executed.

There are other elements to consider... but if you can show a substantial reduction in the number of retired instructions per input byte, then it is a very good start. (This is not the only way to optimize the code.)

The current code achieved the following results using your framework, on an i7 8700k:

CLANG/UBUNTU:

New Code:

# Using hardware counters
####
# reading /home/chris/fast_float/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   726.38 MB/s (+/- 12.3 %)    41.74 Mfloat/s      21.22 i/B   387.24 i/f (+/- 0.0 %)      5.57 c/B   101.66 c/f (+/- 9.7 %)      3.81 i/c     60.44 b/f     11.78 bm/f      4.24 GHz
fastfloat (32)                          :   755.59 MB/s (+/- 7.7 %)    43.42 Mfloat/s      20.59 i/B   375.78 i/f (+/- 0.0 %)      5.38 c/B    98.10 c/f (+/- 5.5 %)      3.83 i/c     57.82 b/f     10.50 bm/f      4.26 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :  1429.88 MB/s (+/- 4.7 %)    41.09 Mfloat/s      10.29 i/B   375.51 i/f (+/- 0.0 %)      2.83 c/B   103.21 c/f (+/- 3.8 %)      3.64 i/c     59.46 b/f     10.94 bm/f      4.24 GHz
fastfloat (32)                          :  1500.15 MB/s (+/- 6.9 %)    43.10 Mfloat/s       9.98 i/B   364.06 i/f (+/- 0.0 %)      2.70 c/B    98.36 c/f (+/- 3.1 %)      3.70 i/c     56.84 b/f     10.02 bm/f      4.24 GHz
####
# reading /home/chris/fast_float/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   471.22 MB/s (+/- 3.5 %)    64.19 Mfloat/s      29.92 i/B   230.27 i/f (+/- 0.0 %)      8.39 c/B    64.56 c/f (+/- 2.0 %)      3.57 i/c     41.87 b/f      9.35 bm/f      4.14 GHz
fastfloat (32)                          :   450.11 MB/s (+/- 3.4 %)    61.32 Mfloat/s      32.52 i/B   250.30 i/f (+/- 0.0 %)      8.96 c/B    68.94 c/f (+/- 1.6 %)      3.63 i/c     42.50 b/f      9.99 bm/f      4.23 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :   990.81 MB/s (+/- 8.2 %)    67.49 Mfloat/s      14.95 i/B   230.09 i/f (+/- 0.0 %)      4.10 c/B    63.18 c/f (+/- 4.8 %)      3.64 i/c     41.50 b/f      9.73 bm/f      4.26 GHz
fastfloat (32)                          :   927.38 MB/s (+/- 28.7 %)    63.17 Mfloat/s      16.25 i/B   250.11 i/f (+/- 0.0 %)      4.39 c/B    67.51 c/f (+/- 24.2 %)      3.71 i/c     42.12 b/f     10.73 bm/f      4.26 GHz

Old Code:

# Using hardware counters
####
# reading /home/chris/test/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   693.12 MB/s (+/- 3.9 %)    39.83 Mfloat/s      20.52 i/B   374.37 i/f (+/- 0.0 %)      5.84 c/B   106.61 c/f (+/- 2.3 %)      3.51 i/c     60.44 b/f     10.99 bm/f      4.25 GHz
fastfloat (32)                          :   755.23 MB/s (+/- 5.2 %)    43.40 Mfloat/s      19.89 i/B   362.92 i/f (+/- 0.0 %)      5.37 c/B    97.99 c/f (+/- 3.5 %)      3.70 i/c     57.82 b/f      9.96 bm/f      4.25 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :  1231.28 MB/s (+/- 8.4 %)    35.38 Mfloat/s      10.29 i/B   375.51 i/f (+/- 0.0 %)      3.30 c/B   120.44 c/f (+/- 5.8 %)      3.12 i/c     59.46 b/f     11.15 bm/f      4.26 GHz
fastfloat (32)                          :  1271.75 MB/s (+/- 7.5 %)    36.54 Mfloat/s       9.98 i/B   364.06 i/f (+/- 0.0 %)      3.19 c/B   116.48 c/f (+/- 4.0 %)      3.13 i/c     56.84 b/f     10.07 bm/f      4.26 GHz
####
# reading /home/chris/test/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   489.71 MB/s (+/- 9.8 %)    66.71 Mfloat/s      29.15 i/B   224.37 i/f (+/- 0.0 %)      8.11 c/B    62.43 c/f (+/- 8.0 %)      3.59 i/c     41.87 b/f      9.49 bm/f      4.16 GHz
fastfloat (32)                          :   450.12 MB/s (+/- 33.4 %)    61.32 Mfloat/s      31.75 i/B   244.40 i/f (+/- 0.0 %)      8.82 c/B    67.92 c/f (+/- 32.1 %)      3.60 i/c     42.50 b/f     10.40 bm/f      4.16 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :   944.14 MB/s (+/- 19.5 %)    64.31 Mfloat/s      14.95 i/B   230.09 i/f (+/- 0.0 %)      4.30 c/B    66.26 c/f (+/- 15.2 %)      3.47 i/c     41.50 b/f      9.88 bm/f      4.26 GHz
fastfloat (32)                          :   842.46 MB/s (+/- 10.7 %)    57.38 Mfloat/s      16.25 i/B   250.11 i/f (+/- 0.0 %)      4.78 c/B    73.53 c/f (+/- 6.3 %)      3.40 i/c     42.12 b/f     10.07 bm/f      4.22 GHz

MSVC/WINDOWS:

New Code:

####
# reading C:/Users/Chris/source/repos/fast_float/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   366.17 MB/s (+/- 11.0 %)    21.04 Mfloat/s      47.52 ns/f
fastfloat (32)                          :   377.26 MB/s (+/- 8.8 %)    21.68 Mfloat/s      46.13 ns/f
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :   708.95 MB/s (+/- 13.4 %)    20.37 Mfloat/s      49.09 ns/f
fastfloat (32)                          :   698.66 MB/s (+/- 15.9 %)    20.07 Mfloat/s      49.81 ns/f
####
# reading C:/Users/Chris/source/repos/fast_float/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   235.39 MB/s (+/- 8.0 %)    32.07 Mfloat/s      31.19 ns/f
fastfloat (32)                          :   217.98 MB/s (+/- 11.0 %)    29.69 Mfloat/s      33.68 ns/f
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :   475.50 MB/s (+/- 6.0 %)    32.39 Mfloat/s      30.88 ns/f
fastfloat (32)                          :   435.04 MB/s (+/- 7.0 %)    29.63 Mfloat/s      33.75 ns/f

Old Code:

####
# reading C:/Users/Chris/source/repos/test/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   366.25 MB/s (+/- 13.7 %)    21.05 Mfloat/s
fastfloat (32)                          :   362.43 MB/s (+/- 17.4 %)    20.83 Mfloat/s
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :   662.16 MB/s (+/- 23.5 %)    19.03 Mfloat/s
fastfloat (32)                          :   661.80 MB/s (+/- 12.0 %)    19.02 Mfloat/s
####
# reading C:/Users/Chris/source/repos/test/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   228.54 MB/s (+/- 32.9 %)    31.13 Mfloat/s
fastfloat (32)                          :   212.88 MB/s (+/- 13.1 %)    29.00 Mfloat/s
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :   457.15 MB/s (+/- 6.2 %)    31.14 Mfloat/s
fastfloat (32)                          :   437.25 MB/s (+/- 4.6 %)    29.78 Mfloat/s

GCC/UBUNTU:

New Code:

# Using hardware counters
####
# reading /home/chris/fast_float/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   930.93 MB/s (+/- 5.7 %)    53.50 Mfloat/s      16.50 i/B   301.01 i/f (+/- 0.0 %)      4.44 c/B    80.92 c/f (+/- 2.0 %)      3.72 i/c     49.29 b/f     11.44 bm/f      4.33 GHz
fastfloat (32)                          :   934.23 MB/s (+/- 6.4 %)    53.69 Mfloat/s      16.49 i/B   300.93 i/f (+/- 0.0 %)      4.45 c/B    81.22 c/f (+/- 2.3 %)      3.71 i/c     47.80 b/f     10.11 bm/f      4.36 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :  1592.76 MB/s (+/- 5.2 %)    45.77 Mfloat/s       8.57 i/B   312.87 i/f (+/- 0.0 %)      2.55 c/B    92.93 c/f (+/- 3.6 %)      3.37 i/c     50.27 b/f     11.29 bm/f      4.25 GHz
fastfloat (32)                          :  1647.42 MB/s (+/- 8.2 %)    47.34 Mfloat/s       8.57 i/B   312.79 i/f (+/- 0.0 %)      2.46 c/B    89.89 c/f (+/- 4.5 %)      3.48 i/c     48.78 b/f     10.16 bm/f      4.25 GHz
####
# reading /home/chris/fast_float/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   737.92 MB/s (+/- 7.5 %)   100.52 Mfloat/s      21.67 i/B   166.80 i/f (+/- 0.0 %)      5.51 c/B    42.44 c/f (+/- 2.2 %)      3.93 i/c     28.38 b/f      9.04 bm/f      4.27 GHz
fastfloat (32)                          :   585.43 MB/s (+/- 3.0 %)    79.75 Mfloat/s      24.04 i/B   185.08 i/f (+/- 0.0 %)      6.95 c/B    53.49 c/f (+/- 0.9 %)      3.46 i/c     30.34 b/f      9.70 bm/f      4.27 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :  1357.46 MB/s (+/- 6.4 %)    92.46 Mfloat/s      11.19 i/B   172.19 i/f (+/- 0.0 %)      3.00 c/B    46.14 c/f (+/- 3.6 %)      3.73 i/c     28.76 b/f      9.40 bm/f      4.27 GHz
fastfloat (32)                          :  1202.43 MB/s (+/- 2.7 %)    81.90 Mfloat/s      12.37 i/B   190.47 i/f (+/- 0.0 %)      3.38 c/B    52.08 c/f (+/- 1.7 %)      3.66 i/c     30.71 b/f      9.26 bm/f      4.27 GHz

Old Code:

# Using hardware counters
####
# reading /home/chris/test/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64)                          :   885.60 MB/s (+/- 5.7 %)    50.89 Mfloat/s      16.29 i/B   297.15 i/f (+/- 0.0 %)      4.69 c/B    85.58 c/f (+/- 1.7 %)      3.47 i/c     50.27 b/f     11.15 bm/f      4.36 GHz
fastfloat (32)                          :   971.87 MB/s (+/- 8.2 %)    55.85 Mfloat/s      16.28 i/B   297.07 i/f (+/- 0.0 %)      4.28 c/B    78.07 c/f (+/- 5.1 %)      3.81 i/c     48.78 b/f     11.27 bm/f      4.36 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64)                          :  1565.78 MB/s (+/- 3.9 %)    44.99 Mfloat/s       8.49 i/B   309.89 i/f (+/- 0.0 %)      2.60 c/B    94.75 c/f (+/- 2.9 %)      3.27 i/c     50.27 b/f     11.12 bm/f      4.26 GHz
fastfloat (32)                          :  1629.05 MB/s (+/- 4.3 %)    46.81 Mfloat/s       8.49 i/B   309.81 i/f (+/- 0.0 %)      2.50 c/B    91.08 c/f (+/- 3.3 %)      3.40 i/c     48.78 b/f     10.01 bm/f      4.26 GHz
####
# reading /home/chris/test/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64)                          :   680.55 MB/s (+/- 4.7 %)    92.71 Mfloat/s      22.24 i/B   171.22 i/f (+/- 0.0 %)      5.98 c/B    46.01 c/f (+/- 2.3 %)      3.72 i/c     28.76 b/f      9.34 bm/f      4.27 GHz
fastfloat (32)                          :   620.27 MB/s (+/- 3.3 %)    84.50 Mfloat/s      24.62 i/B   189.49 i/f (+/- 0.0 %)      6.56 c/B    50.49 c/f (+/- 2.3 %)      3.75 i/c     30.71 b/f      9.29 bm/f      4.27 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64)                          :  1241.73 MB/s (+/- 7.4 %)    84.58 Mfloat/s      11.43 i/B   175.95 i/f (+/- 0.0 %)      3.28 c/B    50.44 c/f (+/- 5.2 %)      3.49 i/c     28.76 b/f      9.35 bm/f      4.27 GHz
fastfloat (32)                          :  1120.19 MB/s (+/- 2.8 %)    76.30 Mfloat/s      12.62 i/B   194.23 i/f (+/- 0.0 %)      3.62 c/B    55.72 c/f (+/- 2.2 %)      3.49 i/c     30.71 b/f      9.49 bm/f      4.25 GHz

It would seem that it's not exactly less instructions, although it seems to run faster, generally.

lemire · 2024-11-26T15:40:25Z

Thank you. Don't worry about linting further. We can do that later.

I will need to review this manually.

dalle · 2024-11-27T11:41:39Z

include/fast_float/ascii_number.h

@@ -32,7 +32,7 @@ template <typename UC> fastfloat_really_inline constexpr bool has_simd_opt() {
 // able to optimize it well.
 template <typename UC>
 fastfloat_really_inline constexpr bool is_integer(UC c) noexcept {
-  return !(c > UC('9') || c < UC('0'));
+  return static_cast<uint8_t>(c - UC('0')) < 10;


This will overflow when sizeof(UC) > sizeof(uint8_t) resulting in false positives for:

wchar_t ch = '0'; ch += 256; is_integer(ch); // incorrectly true

This will overflow when sizeof(UC) > sizeof(uint8_t) resulting in false positives for:

wchar_t ch = '0'; ch += 256; is_integer(ch); // incorrectly true

How does it look now?

Looks better, but conditional_t require c++14.

Alright it should be good now.

This should avoid using unnecessarily sized unsigned types, while also avoiding overflow.

RealTimeChris force-pushed the main branch from 23be9d0 to 11a6232 Compare November 24, 2024 19:27

dalle requested a review from lemire November 25, 2024 13:58

dalle reviewed Nov 25, 2024

View reviewed changes