
Optimize Montgomery multiplication #402

Merged (15 commits) on Dec 16, 2024

Conversation


@recmo recmo commented Nov 4, 2024

Motivation

Solution

PR Checklist

  • Added Tests
  • Added Documentation
  • Updated the changelog

@recmo recmo marked this pull request as ready for review December 3, 2024 17:41
@recmo recmo requested a review from prestwich as a code owner December 3, 2024 17:41
@recmo
Owner Author

recmo commented Dec 3, 2024

@prestwich Would love to see this merged! This makes Ruint competitive with Arkworks-ff for finite field math.

@prestwich
Collaborator

Cool, I'm currently traveling, so it's been taking me a minute to get to this. I'm going to try to sit down with the linked algorithm documentation this weekend.

src/algorithms/mul_redc.rs — 4 review threads (outdated, resolved)
@recmo
Owner Author

recmo commented Dec 16, 2024

I've found concrete performance to be better with <const N: usize>(x: [u64; N]) rather than (x: &[u64]). In particular for Montgomery multiplication we benefit a lot from the compiler unrolling the loop when N is small and inlining the modulus when it is a compile time constant (which also eliminates the modulus-dependent branches).

I'm not sure how far we should push this observation. E.g. would we want to do this for the division and GCD algorithms as well?

Right now the library is mostly instantiated with a small compile-time N, typically N=4. For these cases the specialization would probably benefit performance. But I'm also toying with the idea of adding support for proper dynamically sized uints at some point, in which case slice-based algorithms are the only way, and keeping both around violates DRY.

It may be possible to abstract over this with something like <L: DerefMut<Target = [u64]>>(x: L), but that has its own complexity. For now, compile-time small N should be the priority.
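To make the trade-off concrete, here is a minimal sketch (assumed for illustration, not the PR's actual mul_redc code) contrasting the two signatures. With `N` known at compile time the loop has a constant trip count, so the compiler can fully unroll it and constant-fold through it; the slice version's trip count is only known at runtime.

```rust
// Const-generic form: N is a compile-time constant, so `for i in 0..N`
// can be fully unrolled and bounds checks eliminated.
fn add_carrying<const N: usize>(a: [u64; N], b: [u64; N]) -> [u64; N] {
    let mut out = [0u64; N];
    let mut carry = false;
    for i in 0..N {
        let (sum, c1) = a[i].overflowing_add(b[i]);
        let (sum, c2) = sum.overflowing_add(carry as u64);
        out[i] = sum;
        carry = c1 | c2;
    }
    out
}

// Slice-based equivalent: one copy of the code serves every length,
// but unrolling and constant propagation are much more limited.
fn add_carrying_slice(a: &[u64], b: &[u64], out: &mut [u64]) {
    let mut carry = false;
    for i in 0..a.len() {
        let (sum, c1) = a[i].overflowing_add(b[i]);
        let (sum, c2) = sum.overflowing_add(carry as u64);
        out[i] = sum;
        carry = c1 | c2;
    }
}
```

The same tension applies to the modulus argument in Montgomery multiplication: passed as a const-generic array from a known constant, it can be inlined and its dependent branches removed.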

@prestwich
Collaborator

prestwich commented Dec 16, 2024

CI failure is due to a backwards-incompatible change in proptest, and can be smoothed over by pinning to 1.5.0. Gonna investigate what's wrong upstream.
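The pin described here would look something like the following Cargo.toml fragment (the dev-dependency placement is an assumption; `=1.5.0` is Cargo's exact-version requirement syntax):

```toml
[dev-dependencies]
# Pin proptest to exactly 1.5.0 to sidestep the backwards-incompatible
# change in a later release.
proptest = "=1.5.0"
```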

@prestwich
Collaborator

prestwich commented Dec 16, 2024

> Right now the library is mostly instantiated with a small compile-time N, typically N=4. For these cases the specialization would probably benefit performance. But I'm also toying with the idea of adding support for proper dynamically sized uints at some point, in which case slice-based algorithms are the only way, and keeping both around violates DRY.

The idea being to allow expansion and do algorithm-switching based on the current size, like Go's big.Int?

> I've found concrete performance to be better with (x: [u64; N]) rather than (x: &[u64]). In particular for Montgomery multiplication we benefit a lot from the compiler unrolling the loop when N is small and inlining the modulus when it is a compile time constant (which also eliminates the modulus-dependent branches).

My naive question is whether, with small N, expressing the loop as for i in 0..N { match a[i] < b[i] { ... } } will be optimized more readily than for (left, right) in zip(a, b) { match left < right { ... } }. I may sit down and investigate this at some point.
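A hypothetical micro-comparison of the two loop shapes in question (illustrative, not code from this PR): with a const-generic N both loops have a compile-time trip count, and rustc/LLVM typically unrolls both, though the indexed form additionally relies on the optimizer eliding the bounds checks on `a[i]` and `b[i]`.

```rust
// Indexed form: explicit counter, bounds-checked accesses that the
// optimizer must prove in-range before unrolling cleanly.
fn first_lt_indexed<const N: usize>(a: [u64; N], b: [u64; N]) -> Option<usize> {
    for i in 0..N {
        if a[i] < b[i] {
            return Some(i);
        }
    }
    None
}

// Iterator/zip form: no explicit indexing, so no bounds checks to elide.
fn first_lt_zip<const N: usize>(a: [u64; N], b: [u64; N]) -> Option<usize> {
    a.iter().zip(b.iter()).position(|(left, right)| left < right)
}
```

Whether the generated assembly actually differs is exactly the kind of thing worth checking on a per-case basis, e.g. with cargo-asm or Compiler Explorer.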

@prestwich prestwich mentioned this pull request Dec 16, 2024
@prestwich
Collaborator

> CI failure is due to a backwards incompatible change in proptest, and can be smoothed over by pinning to 1.5.0. Gonna investigate what's wrong upstream

I went ahead and pushed a commit to do this and opened #409, as I'm looking to merge this branch today.

@recmo
Owner Author

recmo commented Dec 16, 2024

> The idea being to allow expansion and do algorithm-switching based on the current size like go's big.Int?

There are a couple directions, big.Int is just one of them:

  1. Compile time fixed, stack allocated: [u64; N] (current)
  2. Compile time fixed, heap allocated: Box<[u64; N]>
  3. Runtime fixed, heap allocated: Box<[u64]>
  4. Runtime dynamic, heap allocated: Vec<u64>

When working with large-ish sizes, e.g. U4096, stack size may be limited and one of approaches 2-4 is required.

When implementing runtime-negotiated cryptographic protocols (I stumbled on this in the icao-9303 lib), a runtime-sized representation (3 or 4) is required.

To approximate natural numbers with virtually unlimited size, 3 or 4 is required.

This is orthogonal to the problem of algorithm selection, which depends purely on size and would be the same for all approaches. Though practically, limited stack size prevents you from using approach 1 at sizes where anything other than base-case algorithms is relevant (GMP doesn't use fancy multiplication until 20-30 limbs on modern hardware: https://gmplib.org/devel/thres/MUL_TOOM22_THRESHOLD).

The main downside of 2-4 is that we lose the Copy trait, which is great for UX and optimizations.
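The four storage strategies above can be sketched as Rust types (names invented for illustration; ruint today corresponds to the first). Only the stack-allocated form can derive Copy, which is the downside noted for options 2-4.

```rust
// 1. Compile-time fixed, stack allocated: Copy, fully unrollable loops.
#[derive(Clone, Copy)]
struct StackUint<const N: usize> {
    limbs: [u64; N],
}

// 2. Compile-time fixed, heap allocated: same const-generic algorithms,
//    but Copy is lost as soon as a Box is involved.
struct BoxedUint<const N: usize> {
    limbs: Box<[u64; N]>,
}

// 3. Runtime fixed, heap allocated: length negotiated once, then constant.
struct DynUint {
    limbs: Box<[u64]>,
}

// 4. Runtime dynamic, heap allocated: grows on demand, like Go's big.Int.
struct GrowUint {
    limbs: Vec<u64>,
}
```

Note that 3 and 4 also force every algorithm onto the slice-based code path, since the limb count is no longer a type-level constant.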

@prestwich
Collaborator

Slightly adjusted my own Cargo.toml changes; approving and setting this to auto-merge.

Development

Successfully merging this pull request may close these issues.

modular: Make mul_redc alloc-free
3 participants