Refactor row major matrix #624

mcalancea · 2024-11-22T14:14:15Z

Written December 10th, most content and conversation predates this description.

Fixes #600.
Makes several changes to the way RowMajorMatrix works:

It no longer uses MaybeUninit. Rather, its contents are initialized to T::default() using parallel iterators.
Padding is no longer allocated in the constructor. It is also not the concern of any particular user of the matrix to ensure correct padding. Rather, users only pass the InstancePaddingStrategy argument which describes the type of padding that they want. This padding is then performed at the very latest stage (in the call to self.into_mles()).
self.into_mles() does more than before. de_interleaving is removed because the indirection through a bi-dimensional matrix is unnecessary. Notably, self.into_mles() parallelizes the work differently than de_interleaving used to.

hero78119 · 2024-11-22T15:32:02Z

This is a great effort! As some moment I am also curious what would be the impact if we remove MaybeUint feature and use raw vector as we only have one implementation.

I just do a very quick tried on this branch against master branch on riscv_add benchmark with command on remote ceno benchmark machine, with command

### on master branch
cargo bench --bench riscv_add --package ceno_zkvm -- --save-baseline baseline

### on this branch
cargo bench --bench riscv_add --package ceno_zkvm -- --baseline baseline

and the result turns out to be ~+6% slower on e2e latency.

## master branch
add_op_20/prove_add/prove_add_log2_20
                        time:   [1.5172 s 1.5615 s 1.6110 s]

### this branch
add_op_20/prove_add/prove_add_log2_20
                        time:   [1.6454 s 1.6695 s 1.6898 s]
                        change: [+4.1692% +6.3642% +8.4701%] (p = 0.00 < 0.05)

I believed with fibonacchi example it might be even much slower. So it's still nessesary to this feature for keeping high performance.

hero78119 · 2024-11-22T15:35:51Z

If we just talked about capture this unssignment error, I think we can achieve it from different path: in unittest we support init vector into 2 different default value with same execution trance, and then compare 2 witnesses which should be identical. An unequality indicate there are some unassigment bug. And with that, we can capture the unassignment error in early stage.

mcalancea · 2024-11-22T15:52:51Z

@hero78119

Thank you! This is still work in progress; I'm examining some ways to make it faster while keeping it clean/safe. My bench results are a bit better for some reason:

add_op_20/prove_add/prove_add_log2_20
                        time:   [4.7420 s 4.7740 s 4.8031 s]
                        change: [+1.3563% +2.2286% +3.0911%] (p = 0.00 < 0.05)
                        Performance has regressed.
add_op_21/prove_add/prove_add_log2_21
                        time:   [9.9516 s 10.189 s 10.450 s]
                        change: [-1.7437% +0.5352% +2.8403%] (p = 0.70 > 0.05)
                        No change in performance detected.

What regression % (evaluated at yours) would you consider acceptable?

naure

Looking nice so far 👍

ceno_zkvm/src/witness.rs

matthiasgoergens · 2024-11-25T03:55:19Z

Could you please add to the PR description a very brief overview over what we are doing, and more importantly why?

(The what can be really brief, because the text of the PR answers that question in detail.)

Thanks!

matthiasgoergens · 2024-11-25T04:04:03Z

This is a great effort! As some moment I am also curious what would be the impact if we remove MaybeUint feature and use raw vector as we only have one implementation.

Yes, I have been rather suspicious of MaybUninit. It looks like a bit of a minefield, and I'm not sure it's actually worth it.

hero78119 · 2024-11-25T04:28:05Z

... I'm examining some ways to make it faster while keeping it clean/safe. My bench results are a bit better for some reason:
add_op_20/prove_add/prove_add_log2_20
                        time:   [4.7420 s 4.7740 s 4.8031 s]
                        change: [+1.3563% +2.2286% +3.0911%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
What regression % (evaluated at yours) would you consider acceptable?

I noticed the different with yours vs remote ceno benchmark server probably on environment number of cores.
In multicore environment (16x2 cores), removing MaybeUinit the regression will be more significant, e.g. + ~6% regressed.

To me any regression of % is somehow unacceptable.

So I would suggest to go from another way: in unittest initialized with 2 different default value with same witness, then check the witness polynomial equality. This help us to identify the problem while not sacrificing performance.

matthiasgoergens · 2024-11-25T06:52:16Z

If you want some inspiration for how to do this kind of change, you might want to have a look at how plonky3 changed the organisation of its data compared to plonky2. (It might be a bit much too read, though.)

naure · 2024-11-25T09:54:27Z

@hero78119 There is a draft of the approach based on testing here: #597

naure · 2024-11-25T10:05:20Z

To get back to optimal performance without unsafe types, there could be another constructor that accepts a vector or an iterator. Callers can built their Vec with e.g. flatten and collect.

See also the issue #600

@mcalancea

### Context Existing benchmark `riscv_add` are just for testing single opcode performance. This PR aim to have a first e2e bench which is representative to a workload in real world. ### What has been cover - add fibonacci bench with sp1 platform hardcode - refactor e2e flow into sub-function to maximize code sharing ### what's not being covered Hey @mcalancea, key generation + witness assignment are still encaptulated in same function and not including in bench latency meature. Break down key generation and extract witness assignment will expose massive intermediate variables in between, so probably need another refactor and design to support #624

hero78119

Awesome job, and this PR bring more insighful takeaways for the future implementation guidelines 🔥 🔥

ceno_zkvm/src/tables/ops/ops_circuit.rs

ceno_zkvm/src/witness.rs

hero78119 · 2024-12-10T10:45:38Z

ceno_zkvm/src/witness.rs

-                    .par_iter_mut()
-                    .skip(i)
-                    .step_by(self.num_col)
-                    .map(|v| unsafe { mem::replace(v, mem::MaybeUninit::uninit()).assume_init() })


cited from @mcalancea: mutable vector in place change probably trigger false sharing

In comparison, new design just allocated new vector and collect.

👍 Just noting it's pure speculation, I haven't investigated it. But false sharing is perhaps something to keep in mind throughout the code-base.

ceno_zkvm/src/witness.rs

ceno_zkvm/src/tables/ram/ram_impl.rs

hero78119 · 2024-12-10T11:13:09Z

Benchmarks result is awesome 🔥 🔥 🔥

On fibonacchi witness generation vs master branch => ~ -20% throughput improvement

command

cargo bench --bench fibonacci_witness -- --save-baseline fibo_wit
cargo bench --bench fibonacci_witness -- --baseline fibo_wit

results (run result twice, each time flushing with complete new baseline)

1st result
fib_wit_max_steps_18446744073709551615/fibonacci_witness/fib_wit_max_steps_18446744073709551615
            time:  [1.8106 s 1.8265 s 1.8428 s]
            change: [-22.274% -21.483% -20.588%] (p = 0.00 < 0.05)
            Performance has improved.
2nd result 
fib_wit_max_steps_18446744073709551615/fibonacci_witness/fib_wit_max_steps_18446744073709551615
                        time:   [1.8091 s 1.8224 s 1.8338 s]
                        change: [-23.306% -22.657% -22.023%] (p = 0.00 < 0.05)
                        Performance has improved.

On fibonacchi e2e with 2^20 instance: ~ -15% throughput improvement

this is very amazing, still trying to figure out how witness optimisation bring such improvement on overall

results (run result twice, each time flushing with complete new baseline)

fibonacci_max_steps_1048576/prove_fibonacci/fibonacci_max_steps_1048576
                        time:   [4.1083 s 4.1283 s 4.1487 s]
                        change: [-16.756% -15.297% -13.856%] (p = 0.00 < 0.05)
                        Performance has improved.

fibonacci_max_steps_1048576/prove_fibonacci/fibonacci_max_steps_1048576
                        time:   [4.0866 s 4.1098 s 4.1323 s]
                        change: [-17.146% -15.677% -14.221%] (p = 0.00 < 0.05)
                        Performance has improved.

ceno_zkvm/src/instructions.rs

ceno_zkvm/src/witness.rs

hero78119

epic works and LGTM!
(just one comment of TODO cleanup)

ceno_zkvm/src/tables/ram/ram_impl.rs

mcalancea added 9 commits November 21, 2024 10:18

remove maybeuninit; need to fix mock_prover.rs

580e087

fix inference

196142d

cargo fix

d5f3357

suspend padding

e40719b

merge master

13a6409

experiments

d2055bc

remove ad-hoc paddings

717f7fb

some clean-up

95d066d

cargo fix

e5f0b6c

naure linked an issue Nov 22, 2024 that may be closed by this pull request

Refactor RowMajorMatrix with a safe API #600

Closed

mcalancea changed the title ~~Refactor row major matrix~~ (WIP) Refactor row major matrix Nov 22, 2024

naure reviewed Nov 22, 2024

View reviewed changes

ceno_zkvm/src/witness.rs Outdated Show resolved Hide resolved

ceno_zkvm/src/witness.rs Outdated Show resolved Hide resolved

revert changes in virtual_poly

edec789

naure mentioned this pull request Nov 26, 2024

fix/program-size4: fix padding in edge cases #637

Closed

mcalancea added 2 commits November 26, 2024 14:13

handle repeat on empty

bb388a2

merge master

7b11a81

mcalancea force-pushed the cleanup/safe-row-major-matrix-variant branch from 171ca51 to 7b11a81 Compare December 2, 2024 13:15

hero78119 mentioned this pull request Dec 4, 2024

add fibonacchi bench #690

Merged

mcalancea added 2 commits December 5, 2024 11:20

merge master

ec8d001

stash

5d1e4df

mcalancea added 6 commits December 9, 2024 12:51

Merge branch 'master' into cleanup/safe-row-major-matrix-variant

10c64a6

clean up

50d5426

restore tests

239eee2

clean-up

d93b390

remove sltu.rs (present due to imprudent merge)

3af1ef1

slight clean-up

e13ef15

mcalancea force-pushed the cleanup/safe-row-major-matrix-variant branch from 6e0960b to e13ef15 Compare December 10, 2024 08:08

mcalancea changed the title ~~(WIP) Refactor row major matrix~~ Refactor row major matrix Dec 10, 2024

mcalancea marked this pull request as ready for review December 10, 2024 08:25

Merge branch 'master' into cleanup/safe-row-major-matrix-variant

f179d5d

mcalancea force-pushed the cleanup/safe-row-major-matrix-variant branch from f81730d to f179d5d Compare December 10, 2024 08:43

mcalancea requested review from hero78119 and lispc December 10, 2024 10:29

hero78119 reviewed Dec 10, 2024

View reviewed changes

address some comments

0953c7e

mcalancea mentioned this pull request Dec 10, 2024

Audit code for "false sharing" anti-patterns #730

Open

mcalancea force-pushed the cleanup/safe-row-major-matrix-variant branch from bdf705c to da45ffc Compare December 11, 2024 14:17

incorporate well-formed address padding and add unit test

94001c9

mcalancea force-pushed the cleanup/safe-row-major-matrix-variant branch from da45ffc to 94001c9 Compare December 11, 2024 14:31

mcalancea requested a review from hero78119 December 11, 2024 15:46

Merge branch 'master' into cleanup/safe-row-major-matrix-variant

cb866c8

hero78119 reviewed Dec 12, 2024

View reviewed changes

ceno_zkvm/src/instructions.rs Outdated Show resolved Hide resolved

ceno_zkvm/src/witness.rs Show resolved Hide resolved

mcalancea added 2 commits December 12, 2024 10:13

address some changes and fix index bug

3c1c563

merge master

c17d9e2

hero78119 approved these changes Dec 12, 2024

View reviewed changes

ceno_zkvm/src/tables/ram/ram_impl.rs Outdated Show resolved Hide resolved

mcalancea added 2 commits December 12, 2024 12:52

address left-over TODO

19e085d

Merge branch 'master' into cleanup/safe-row-major-matrix-variant

5ed8e10

mcalancea merged commit 979ee30 into master Dec 12, 2024
6 checks passed

mcalancea deleted the cleanup/safe-row-major-matrix-variant branch December 12, 2024 11:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor row major matrix #624

Refactor row major matrix #624

mcalancea commented Nov 22, 2024 •

edited

Loading

hero78119 commented Nov 22, 2024 •

edited

Loading

hero78119 commented Nov 22, 2024

mcalancea commented Nov 22, 2024 •

edited

Loading

naure left a comment

matthiasgoergens commented Nov 25, 2024

matthiasgoergens commented Nov 25, 2024

hero78119 commented Nov 25, 2024 •

edited

Loading

matthiasgoergens commented Nov 25, 2024

naure commented Nov 25, 2024

naure commented Nov 25, 2024

hero78119 left a comment

hero78119 Dec 10, 2024

mcalancea Dec 10, 2024

hero78119 commented Dec 10, 2024 •

edited

Loading

hero78119 left a comment

Refactor row major matrix #624

Refactor row major matrix #624

Conversation

mcalancea commented Nov 22, 2024 • edited Loading

hero78119 commented Nov 22, 2024 • edited Loading

hero78119 commented Nov 22, 2024

mcalancea commented Nov 22, 2024 • edited Loading

naure left a comment

Choose a reason for hiding this comment

matthiasgoergens commented Nov 25, 2024

matthiasgoergens commented Nov 25, 2024

hero78119 commented Nov 25, 2024 • edited Loading

matthiasgoergens commented Nov 25, 2024

naure commented Nov 25, 2024

naure commented Nov 25, 2024

hero78119 left a comment

Choose a reason for hiding this comment

hero78119 Dec 10, 2024

Choose a reason for hiding this comment

mcalancea Dec 10, 2024

Choose a reason for hiding this comment

hero78119 commented Dec 10, 2024 • edited Loading

hero78119 left a comment

Choose a reason for hiding this comment

mcalancea commented Nov 22, 2024 •

edited

Loading

hero78119 commented Nov 22, 2024 •

edited

Loading

mcalancea commented Nov 22, 2024 •

edited

Loading

hero78119 commented Nov 25, 2024 •

edited

Loading

hero78119 commented Dec 10, 2024 •

edited

Loading