some performance optimizations in decoders #369
Conversation
(force-pushed 87a6ace → 4300372, then 4300372 → 1ac7557)
There is some improvement for most of the Update
(force-pushed 7341b00 → 637cd3a)
I think this is ready for review. I hope that these improvement results for the decoder evaluation (#369 (comment)) are plausible and sensible. Thank you!
Thanks, this is a good idea!
Currently I do not believe that most of these changes do much, though.
I like the
row_x = @view syndrome_samples[:,1:d.nx]
row_z = @view syndrome_samples[:,d.nx+1:end]
change and the improvement to `evaluate_guesses`, but I do not believe the other changes help and would prefer to have them reverted.
Could you do microbenchmarks to prove what works here?
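A microbenchmark of the kind requested here is usually done with BenchmarkTools.jl, interpolating the arguments with `$` so that setup cost is excluded from the measurement. A minimal sketch (the functions and sizes below are hypothetical, chosen only to illustrate the copy-vs-view question):

```julia
using BenchmarkTools

# Hypothetical input; size chosen only for illustration.
A = rand(0:1, 100, 100)

copy_slice(A) = A[:, 1:50]          # allocates a new matrix
view_slice(A) = @view A[:, 1:50]    # no copy, just a lightweight view

# `$` interpolates the argument so the benchmark measures the call only,
# not the cost of looking up the global variable.
@btime copy_slice($A);
@btime view_slice($A);
```

The allocation count reported by `@btime` is usually the clearest signal of whether a change like this actually helps.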
Resolved (outdated) review threads on:
- ext/QuantumCliffordLDPCDecodersExt/QuantumCliffordLDPCDecodersExt.jl (×2)
- ext/QuantumCliffordPyQDecodersExt/QuantumCliffordPyQDecodersExt.jl (×3)
- src/ecc/decoder_pipeline.jl, on the quoted lines:
```julia
for i in 1:nsamples
    is_decoded = true
    for j in 1:size(faults_matrix, 1)
        sum_mod = 0
        @inbounds @simd for k in 1:size(faults_matrix, 2)
            sum_mod += faults_matrix[j, k] * guesses[i, k]
        end
        sum_mod %= 2
        if sum_mod != measured_faults[i, j]
            is_decoded = false
            break
        end
    end
    decoded += is_decoded
end
```
This seems promising, but is it actually faster on its own? It probably needs to be benchmarked separately
Most of the performance improvement came from `evaluate_guesses`. I noticed that the in-place approach to `vcat` and `hcat` actually slowed things down and increased the memory/time ratio, which is why the improvement was not that visible. I have reverted it as per your comments. Now it looks better: #369 (comment)
# Microbenchmarks: evaluate_guesses
julia> using BenchmarkTools
julia> n_faults = 500;
julia> n_guesses = 100;
julia> measured_faults = rand(0:1, n_guesses, n_faults);
julia> guesses = rand(0:1, n_guesses, n_faults);
julia> faults_matrix = rand(0:1, n_faults, n_faults);
julia> function evaluate_guesses_view(measured_faults, guesses, faults_matrix)
nsamples = size(guesses, 1)
guess_faults = (faults_matrix * guesses') .% 2
decoded = 0
for i in 1:nsamples
(@view guess_faults[:,i]) == (@view measured_faults[i,:]) && (decoded += 1)
end
return (nsamples - decoded) / nsamples
end
julia> function evaluate_guesses_loop(measured_faults, guesses, faults_matrix)
nsamples = size(guesses, 1)
decoded = 0
for i in 1:nsamples
is_decoded = true
for j in 1:size(faults_matrix, 1)
sum_mod = 0
@inbounds @simd for k in 1:size(faults_matrix, 2)
sum_mod += faults_matrix[j, k] * guesses[i, k]
end
sum_mod %= 2
if sum_mod != measured_faults[i, j]
is_decoded = false
break
end
end
decoded += is_decoded
end
return (nsamples - decoded) / nsamples
end
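Before comparing timings, it is worth confirming that the two implementations agree; reusing the inputs and definitions from the session above:

```julia
julia> r_view = evaluate_guesses_view(measured_faults, guesses, faults_matrix);

julia> r_loop = evaluate_guesses_loop(measured_faults, guesses, faults_matrix);

julia> @assert r_view == r_loop  # identical logical-error rates
```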
julia> @benchmark evaluate_guesses_view($measured_faults, $guesses, $faults_matrix)
BenchmarkTools.Trial: 423 samples with 1 evaluation.
Range (min … max): 9.462 ms … 100.313 ms ┊ GC (min … max): 0.00% … 87.81%
Time (median): 11.613 ms ┊ GC (median): 0.00%
Time (mean ± σ): 11.808 ms ± 4.717 ms ┊ GC (mean ± σ): 2.29% ± 5.55%
▂▂ ▅█▇▄ ▂
██▅████▅▄▂▃▄▃▄▄▃▄▄▅▇▆▇▃▇▇▄█▇▇█▆▅█▆▇▆▇▆▅▇▄▃▂▅▂▃▃▃▂▂▃▂▃▁▂▁▁▁▂▂ ▄
9.46 ms Histogram: frequency by time 15.5 ms <
Memory estimate: 812.11 KiB, allocs estimate: 8.
julia> @benchmark evaluate_guesses_loop($measured_faults, $guesses, $faults_matrix)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 119.402 μs … 1.440 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 127.033 μs ┊ GC (median): 0.00%
Time (mean ± σ): 135.949 μs ± 49.806 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▇▃▄▃▂▁▁ ▁▁▁▂▁ ▁ ▁
█▆▄███████████████████████████▇▇██▇▇▇▆▆▇▇▆▆▅▅▅▄▆▇▆▄▆▄▄▄▄▅▃▃▅ █
119 μs Histogram: log(frequency) by time 246 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
# Microbenchmarks: slicing with and without @view
julia> syndrome_samples = rand(1000, 1000) ;
julia> d = Dict(:nx => 500);
julia> function with_view(syndrome_samples, d)
row_x = @view syndrome_samples[:, 1:d[:nx]]
row_z = @view syndrome_samples[:, d[:nx]+1:end]
return row_x, row_z
end
julia> function without_view(syndrome_samples, d)
row_x = syndrome_samples[:, 1:d[:nx]]
row_z = syndrome_samples[:, d[:nx]+1:end]
return row_x, row_z
end
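One caveat worth keeping in mind when swapping copies for `@view`: a view aliases the parent array, so mutations of the parent are visible through the view (and vice versa), whereas a sliced copy is independent. A minimal illustration, independent of the decoder code:

```julia
A = collect(reshape(1.0:9.0, 3, 3))

v = @view A[:, 1]   # aliases column 1 of A
c = A[:, 1]         # independent copy

A[1, 1] = 42.0
# The view sees the mutation; the copy does not.
@assert v[1] == 42.0
@assert c[1] == 1.0
```

This is harmless in read-only code like the slicing above, but it matters if a downstream function mutates its input.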
julia> @benchmark with_view($syndrome_samples, $d)
# Benchmarking with @view
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
Range (min … max): 24.653 ns … 225.071 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 24.912 ns ┊ GC (median): 0.00%
Time (mean ± σ): 27.309 ns ± 7.016 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▇▄▄▄ ▁ ▂
█▁█████▄▄▃▃▄▅▄▃▃▁▃▁▃▄▇▇▇▆▄▇▆▆█▇▇▇██▇█▆▆▅▅▅▄▄▆▆▇▆▅▅▆▇▇▄▄▆▅▆▆▇ █
24.7 ns Histogram: log(frequency) by time 60 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark without_view($syndrome_samples, $d)
BenchmarkTools.Trial: 3337 samples with 1 evaluation.
Range (min … max): 995.098 μs … 7.976 ms ┊ GC (min … max): 0.00% … 85.61%
Time (median): 1.064 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.491 ms ± 795.193 μs ┊ GC (mean ± σ): 6.96% ± 13.21%
██▆▄▂ ▁▃▃▂ ▂▃▂▁ ▁▂▃▂ ▁
██████▆▆▅▆▃▅▃▃▁▁▅▁▁▃▄▄▃▃▁▆█████▇▄▄▄▃▅▃▁▁▁▁▁▁▁▇█████▆▅▆█████▇▆ █
995 μs Histogram: log(frequency) by time 3.33 ms <
Memory estimate: 7.63 MiB, allocs estimate: 4.
Thanks for your comments. As requested, the microbenchmarks are attached here: #369 (comment). I have removed all the unnecessary changes. The results table is taken from: https://github.com/QuantumSavory/QuantumClifford.jl/actions/runs/11060472698/job/30731074834?pr=369#step:6:1140
This PR is about removing the performance detriments caused by allocations introduced by `vcat` and `hcat` in the decoders. In addition, `evaluate_guesses` is also made allocation-free. An attempt is made to improve `decode` and `batchdecode` as well. Instead of unnecessary copying, `@view` has been used. I have also checked that no new JET errors are introduced. Resource: the Performance Tips guide, for a better understanding of the context.
- Remove `vcat` and `hcat` from decoders, in accordance with "In-place and batch `decode` functions" #217
- Make `evaluate_guesses` allocation-free using a loop, in accordance with "none of the decoders and evaluators should allocate" #219...
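The `vcat`/`hcat` removal can be illustrated in isolation: instead of concatenating (which allocates a fresh array on every call), one can write into a buffer allocated once. A sketch with made-up sizes and hypothetical helper names, not the actual decoder code:

```julia
# Allocating version: builds a new vector on every call.
concat_alloc(x, z) = vcat(x, z)

# Preallocated version: reuses `out` across calls.
function concat_inplace!(out, x, z)
    nx = length(x)
    copyto!(out, 1, x, 1, nx)            # out[1:nx]      .= x
    copyto!(out, nx + 1, z, 1, length(z)) # out[nx+1:end] .= z
    return out
end

x = rand(500); z = rand(500)
out = similar(x, 1000)   # allocated once, reused on every subsequent call
@assert concat_inplace!(out, x, z) == concat_alloc(x, z)
```

In a hot loop such as batch decoding, the preallocated version amortizes the allocation to zero per iteration, which is the effect this PR is after.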