The Stwo talk at SBC presents their benchmarks in a very helpful way. In particular, the pie-chart from that talk gives an excellent overview of the time spent in the different phases of a proof.
These types of numbers are absolutely necessary in order to understand where we should spend resources optimizing the performance of sphinx and loam. Below is a wish-list of numbers that we should be tracking with each PR.
Arithmetization
Given that we are compiling Lair down to AirBuilder constraints, we are losing insight into how large our traces actually are. In the same way that we used to have expect-test to log how many constraints and witnesses we were using, we should have the same functionality in lair.
For each Chip, we should log the following information:
Total width of the chip
Total number of lookups
Number of byte operations
Number of provide/require interactions
Number of memory load/store
Number of constraints, and their "complexity" given by counting the number of additions vs multiplications necessary to evaluate them (this may require a "byte code" representation of the constraints, which could be embedded in the verification key later on)
Number of selectors, which represents the number of mutually exclusive branches that can be taken
For each branch, log the same information as above, as this can indicate expensive branches that could benefit from their own chip
Number of multiplicative inverses (each costing 37 field multiplications), since we could potentially batch these using Montgomery's trick
The above list is extensive, but given all of this information we would be able to better distinguish areas in our arithmetization that would benefit from "fine-tuning". Moreover, if this information is collected in a struct, it would allow us to more easily approximate the performance impact of future prover optimizations, such as an improved lookup protocol that bypasses the overflowing multiplicities issue. We would be able to deduce the shape of a shard with this new configuration and run a mock prover that accounts for the change in number of columns and potential extra rounds of interaction.
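As a rough illustration, the per-chip report could be gathered into something like the struct below. The type and field names are hypothetical, not an existing lair or sphinx API; they only show the kind of snapshot an expect-test could pin down per Chip.

```rust
/// Hypothetical per-chip statistics matching the wish-list above.
#[derive(Debug, Default, Clone)]
struct ChipStats {
    width: usize,            // total number of columns in the chip
    lookups: usize,          // total number of lookup interactions
    byte_operations: usize,
    provides: usize,
    requires: usize,
    memory_loads: usize,
    memory_stores: usize,
    constraints: usize,
    constraint_adds: usize,  // "complexity": additions needed to evaluate all constraints
    constraint_muls: usize,  // multiplications needed to evaluate all constraints
    selectors: usize,        // number of mutually exclusive branches
    inverses: usize,         // multiplicative inverses, candidates for batching
}

fn main() {
    // An expect-test would compare this debug output against a committed snapshot.
    let stats = ChipStats { width: 42, lookups: 7, ..ChipStats::default() };
    println!("{stats:#?}");
}
```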
Prover
Given the fixed Lurk VM arithmetization and a set of benchmark programs to run, we should log the same type of information that was given in the Stwo benchmark. Getting these numbers should only be a matter of adding tailored tracing to the prove_shard method. We could then get CI to output the following numbers:
Program execution
How much time does it take to just evaluate the program to get the output and populate the record of all events to be proved?
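One lightweight way to collect this and the other phase timings in CI is a small helper built on std::time::Instant. The report below is only a sketch: the phase names mirror the sections that follow, and the closures in main are stand-ins, not the real sphinx/loam calls.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-phase timing report; nothing here is an existing sphinx/loam API.
#[derive(Debug, Default)]
struct PhaseTimings {
    execution: Duration,
    trace_generation: Duration,
    commitment: Duration,
    opening: Duration,
}

/// Run a closure, adding its wall-clock duration to the given slot.
fn timed<T>(slot: &mut Duration, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    *slot += start.elapsed();
    out
}

fn main() {
    let mut timings = PhaseTimings::default();
    // Stand-ins for the real phases; in practice these closures would wrap the
    // evaluation loop, trace generation, and the calls inside prove_shard.
    let _record = timed(&mut timings.execution, || (0..1_000_000u64).sum::<u64>());
    let _traces = timed(&mut timings.trace_generation, || vec![vec![0u8; 1 << 16]; 8]);
    println!("{timings:#?}");
}
```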
Sharding information
For each chip, describe how many events will be proved in the chip. For example, how many Poseidon2 hashes of each size are required, how many arithmetic operations, and what our memory looks like.
We should do this for both sharded and unsharded traces, since we will want to figure out what a better sharding strategy would be for Lurk, where execution is not sequential as in RISC-V.
Trace generation
For each chip with a given number of events, how much time does it take to generate just that trace? This could be a good indicator of whether we need to optimize data storage and witness generation. For example, we could store integers as their native types and only perform the conversion to field elements during trace generation, rather than converting back and forth during execution.
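A toy illustration of that idea, with made-up event and row types rather than the real loam structures: events keep native u32 values during execution, and the field conversion happens once per event at trace-generation time.

```rust
/// Hypothetical event recorded during execution, kept in native integer form.
#[derive(Clone, Copy)]
struct AddEvent {
    a: u32,
    b: u32,
    out: u32,
}

/// Convert one event into a trace row of field elements (represented here as
/// canonical u64 values for simplicity). The conversion happens exactly once,
/// at trace-generation time, instead of on every step of the execution loop.
fn to_row(e: &AddEvent) -> [u64; 3] {
    [e.a as u64, e.b as u64, e.out as u64]
}

fn main() {
    let events = vec![AddEvent { a: 2, b: 3, out: 5 }];
    let trace: Vec<[u64; 3]> = events.iter().map(to_row).collect();
    println!("{trace:?}");
}
```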
It is also useful to just know the total time spent to generate all the traces, assuming proper parallelism. This is the part of the stack that can be optimized almost entirely in loam, as it does not depend on the proving strategy.
We also want to calculate the total trace generation time for the logUp columns, as these require the computation of multiplicative inverses which can be batched.
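For reference, here is a minimal sketch of Montgomery's batch-inversion trick over a toy prime field; the modulus and helpers are stand-ins, not loam's field type. The point is that n inverses cost a single field inversion plus roughly 3n multiplications, assuming all inputs are nonzero.

```rust
const P: u64 = 2013265921; // BabyBear-sized prime, used here only as an example

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Fermat inversion a^(p-2); fine for a sketch, a real field type provides this.
fn inv(a: u64) -> u64 {
    let (mut base, mut exp, mut acc) = (a, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Replace every (nonzero) element of `xs` by its inverse using one `inv` call.
fn batch_invert(xs: &mut [u64]) {
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = 1u64;
    for &x in xs.iter() {
        prefix.push(acc); // product of all previous elements
        acc = mul(acc, x);
    }
    let mut inv_acc = inv(acc); // the single field inversion for the whole batch
    for i in (0..xs.len()).rev() {
        let x = xs[i];
        xs[i] = mul(inv_acc, prefix[i]);
        inv_acc = mul(inv_acc, x);
    }
}

fn main() {
    let mut xs = vec![2u64, 3, 7, 11];
    batch_invert(&mut xs);
    assert!(xs.iter().zip([2u64, 3, 7, 11]).all(|(inv_x, x)| mul(*inv_x, x) == 1));
    println!("{xs:?}");
}
```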
Constraint evaluation
The Stwo prover spends about 30% of the proving time evaluating constraints in order to compute the quotient. This is where the prover calls Air::eval approximately 2n times, where n is the number of rows in the trace. This evaluation is always the same, and therefore should not exhibit any dynamic behavior, and ideally should not touch the heap at all (it should only apply arithmetic operations over slices of the trace).
At the moment, we are parsing the constraints dynamically at every iteration, resulting in the same vector allocations and pointer chasing at every call. I suspect that this has a strong negative impact on performance, though it may be the case that the CPU's branch predictor is able to cache the constraint and avoid the unnecessary memory operations. It is critical that we benchmark this portion of our prover against SP1 to understand whether we need to apply efforts towards optimizing our constraint evaluation.
One solution would be to build the constraint evaluation functions using macros or code generation, though this would add a lot of complexity.
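As a toy illustration of the target shape of such generated code, with a stand-in field and constraint rather than lair's actual types: the evaluation should compile down to straight-line arithmetic over trace slices, with no allocation, parsing, or dynamic dispatch per call.

```rust
/// Toy field element over a BabyBear-sized prime; a stand-in for the real field type.
const P: u64 = 2013265921;

#[derive(Clone, Copy, Debug, PartialEq)]
struct F(u64);

impl F {
    fn mul(self, o: F) -> F {
        F(((self.0 as u128 * o.0 as u128) % P as u128) as u64)
    }
    fn sub(self, o: F) -> F {
        F((self.0 + P - o.0) % P)
    }
}

/// Hand-written equivalent of what generated code for one toy constraint,
/// sel * (a * b - c) == 0, could look like: straight-line arithmetic over a
/// row, no Vec, no Box<dyn ...>, no per-call constraint parsing.
#[inline]
fn eval_row(main: &[F; 4]) -> F {
    let [sel, a, b, c] = *main;
    sel.mul(a.mul(b).sub(c))
}

fn main() {
    let row = [F(1), F(3), F(5), F(15)];
    // 3 * 5 == 15, so the constraint residual is zero on this row.
    assert_eq!(eval_row(&row), F(0));
    println!("residual: {:?}", eval_row(&row));
}
```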
Commitment
For all of preprocessed, main, permutation and quotient traces, we should log what proportion of the total time is spent performing iFFTs, FFTs and Merkle tree commitment. Note that these numbers depend on the hash function used, and the width/height of the trace.
Polynomial commitment opening
This phase should depend only on the shape of the shard and should be very similar to SP1 given traces of the same size.
Proof size
Number of hashes for authentication paths
Number of opening values
With these numbers, as well as estimates about the amount of time required to verify hashes inside a circuit, we would be able to estimate the "recursion overhead" without needing to implement a fully fledged verifier just yet.
We would also be able to estimate the savings against a future STIR implementation.
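A back-of-the-envelope estimator could look like the following sketch; every name and number in it is an illustrative placeholder rather than a measured cost, which is exactly why the per-item costs are left as parameters.

```rust
/// Hypothetical proof-size statistics gathered from a real proof.
struct ProofStats {
    authentication_path_hashes: u64, // hashes needed to check Merkle paths
    opened_values: u64,              // field elements opened across all traces
}

/// Very rough estimate of the recursion overhead: how much work a recursive
/// verifier circuit would spend re-checking this proof.
fn estimated_recursion_cost(stats: &ProofStats, cost_per_hash: u64, cost_per_value: u64) -> u64 {
    stats.authentication_path_hashes * cost_per_hash + stats.opened_values * cost_per_value
}

fn main() {
    // All numbers below are made up for illustration only.
    let stats = ProofStats { authentication_path_hashes: 20_000, opened_values: 5_000 };
    println!("~{} circuit 'units'", estimated_recursion_cost(&stats, 8, 1));
}
```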
Overall, we need to collect some of these numbers when comparing Lurk against SP1. We are doing some things very differently and we need to understand the places where we need to focus optimization efforts.
Verifier complexity
At the moment, we are not able to accurately benchmark Lurk in a "real-world" setting as we are not taking into account the complexity of the recursive verifier.
The following information will help draw a better picture:
Size of byte-code required for evaluating constraints
These translate to the number of instructions performed by the verifier when checking that the quotient was correct
Number of columns
The prover opens two rows of every trace, which are the leaves of the Merkle tree commitment. We open these for every FRI query, and so this leads to more hashes performed by the verifier.
Total number of hashes performed in the proof (a rough counting sketch follows below).
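To make the last two items concrete, here is a rough counting sketch: for each FRI query and each commitment, the verifier hashes the opened rows and walks a Merkle path to the root. The formula and the numbers in main are illustrative assumptions, not sphinx's exact accounting.

```rust
/// Rough estimate of verifier hash count as a function of FRI parameters.
fn estimated_query_hashes(
    num_queries: usize,
    num_commitments: usize,
    merkle_depth: usize,
    leaf_hashes_per_opening: usize,
) -> usize {
    num_queries * num_commitments * (leaf_hashes_per_opening + merkle_depth)
}

fn main() {
    // e.g. 100 queries, 4 commitments (preprocessed/main/permutation/quotient),
    // 2^20 rows => Merkle depth 20, a few hashes to absorb each pair of opened rows.
    println!("~{} hashes", estimated_query_hashes(100, 4, 20, 4));
}
```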
I suspect that including 3 variations of the Poseidon2 hash is likely the biggest contributor to the complexity of our constraints. This drastically increases both the number of columns and the number of arithmetic operations performed by the verifier. In almost every other VM, only a single instance of Poseidon2 is used with a width of 16, and a special VM circuit is used to call the permutation over arrays of field elements. While this model may have a small impact on the hashing speed of Lurk data, it may make a big difference when it comes to the recursive verifier.
Moreover, we would also be able to estimate the savings in recursive verification due to STIR, or a potential implementation of an accumulation scheme that replaces FRI.
Finally, it would help us understand if there are any benefits of modifying the prover to only open a single row in the case where there are no transition constraints (this would require the AirBuilder to expose a RowID selector that would replace our current "nonce").
In the meantime, we should also be logging the non-parallelized verification time of our proofs, as it would allow us to at least estimate how the changes we are making would affect a future recursive verifier.
Conclusion
The above is an exhaustive list of many numbers that would be very useful to track as we continue building Lurk on top of sphinx. Ideally, we can report some of these numbers as part of the CI pipeline directly when making changes to the VM or backend.
Some of these numbers could also be included in the code directly with expect-test. Given the many different aspects that can affect proving performance, we need to understand where our efforts should be targeted.