
Optimize Montgomery multiplication #402

Merged (15 commits) on Dec 16, 2024

Conversation


@recmo recmo commented Nov 4, 2024

Motivation

Solution

PR Checklist

  • Added Tests
  • Added Documentation
  • Updated the changelog

@recmo recmo marked this pull request as ready for review December 3, 2024 17:41
@recmo recmo requested a review from prestwich as a code owner December 3, 2024 17:41
@recmo
Owner Author

recmo commented Dec 3, 2024

@prestwich Would love to see this merged! This makes Ruint competitive with Arkworks-ff for finite field math.

@prestwich
Collaborator

Cool, I'm currently traveling, so it's been taking me a minute to get to this. I'm going to try to sit down with the linked algorithm documentation this weekend.

src/algorithms/mul_redc.rs — 4 review threads (outdated, resolved)
@recmo
Owner Author

recmo commented Dec 16, 2024

I've found concrete performance to be better with <const N: usize>(x: [u64; N]) rather than (x: &[u64]). In particular for Montgomery multiplication we benefit a lot from the compiler unrolling the loop when N is small and inlining the modulus when it is a compile time constant (which also eliminates the modulus-dependent branches).

I'm not sure how far we should push this observation. E.g. would we want to do this for the division and GCD algorithms as well?

Right now the library is mostly instantiated with a small compile-time N, typically N=4. For these cases the specialization would probably benefit performance. But I'm also toying with the idea of adding support for proper dynamically sized uints at some point, in which case slice-based algorithms are the only way, and keeping both around violates DRY.

It may be possible to abstract over this with something like <L: DerefMut<Target = [u64]>>(x: L), but that has its own complexity. For now, compile-time small N should be the priority.
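To make the trade-off concrete, here is a minimal sketch (assumed for illustration, not the PR's actual mul_redc code) contrasting the two signatures. With `N` known at compile time the loop has a constant trip count, so the compiler can fully unroll it and constant-fold through it; the slice version's trip count is only known at runtime.

```rust
// Const-generic form: N is a compile-time constant, so `for i in 0..N`
// can be fully unrolled and bounds checks eliminated.
fn add_carrying<const N: usize>(a: [u64; N], b: [u64; N]) -> [u64; N] {
    let mut out = [0u64; N];
    let mut carry = false;
    for i in 0..N {
        let (sum, c1) = a[i].overflowing_add(b[i]);
        let (sum, c2) = sum.overflowing_add(carry as u64);
        out[i] = sum;
        carry = c1 | c2;
    }
    out
}

// Slice-based equivalent: one copy of the code serves every length,
// but unrolling and constant propagation are much more limited.
fn add_carrying_slice(a: &[u64], b: &[u64], out: &mut [u64]) {
    let mut carry = false;
    for i in 0..a.len() {
        let (sum, c1) = a[i].overflowing_add(b[i]);
        let (sum, c2) = sum.overflowing_add(carry as u64);
        out[i] = sum;
        carry = c1 | c2;
    }
}
```

The same tension applies to the modulus argument in Montgomery multiplication: passed as a const-generic array from a known constant, it can be inlined and its dependent branches removed.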

@prestwich
Collaborator

prestwich commented Dec 16, 2024

CI failure is due to a backwards-incompatible change in proptest, and can be smoothed over by pinning to 1.5.0. Gonna investigate what's wrong upstream.
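The pin described here would look something like the following Cargo.toml fragment (the dev-dependency placement is an assumption; `=1.5.0` is Cargo's exact-version requirement syntax):

```toml
[dev-dependencies]
# Pin proptest to exactly 1.5.0 to sidestep the backwards-incompatible
# change in a later release.
proptest = "=1.5.0"
```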

@prestwich
Collaborator

prestwich commented Dec 16, 2024

> Right now the library is mostly instantiated with a small compile-time N, typically N=4. For these cases the specialization would probably benefit performance. But I'm also toying with the idea of adding support for proper dynamically sized uints at some point, in which case slice-based algorithms are the only way, and keeping both around violates DRY.

The idea being to allow expansion and do algorithm-switching based on the current size, like Go's big.Int?

> I've found concrete performance to be better with (x: [u64; N]) rather than (x: &[u64]). In particular for Montgomery multiplication we benefit a lot from the compiler unrolling the loop when N is small and inlining the modulus when it is a compile time constant (which also eliminates the modulus-dependent branches).

My naive question is whether, with small N, expressing the loop as for i in 0..N { match a[i] < b[i] { ... } } will be optimized more readily than for (left, right) in zip(a, b) { match left < right { ... } }. I may sit down and investigate this at some point.
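A hypothetical micro-comparison of the two loop shapes in question (illustrative, not code from this PR): with a const-generic N both loops have a compile-time trip count, and rustc/LLVM typically unrolls both, though the indexed form additionally relies on the optimizer eliding the bounds checks on `a[i]` and `b[i]`.

```rust
// Indexed form: explicit counter, bounds-checked accesses that the
// optimizer must prove in-range before unrolling cleanly.
fn first_lt_indexed<const N: usize>(a: [u64; N], b: [u64; N]) -> Option<usize> {
    for i in 0..N {
        if a[i] < b[i] {
            return Some(i);
        }
    }
    None
}

// Iterator/zip form: no explicit indexing, so no bounds checks to elide.
fn first_lt_zip<const N: usize>(a: [u64; N], b: [u64; N]) -> Option<usize> {
    a.iter().zip(b.iter()).position(|(left, right)| left < right)
}
```

Whether the generated assembly actually differs is exactly the kind of thing worth checking on a per-case basis, e.g. with cargo-asm or Compiler Explorer.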

@prestwich prestwich mentioned this pull request Dec 16, 2024
@prestwich
Collaborator

> CI failure is due to a backwards incompatible change in proptest, and can be smoothed over by pinning to 1.5.0. Gonna investigate what's wrong upstream

I went ahead and pushed a commit to do this and opened #409, as I'm looking to merge this branch today.

@recmo
Owner Author

recmo commented Dec 16, 2024

> The idea being to allow expansion and do algorithm-switching based on the current size like go's big.Int?

There are a couple directions, big.Int is just one of them:

  1. Compile time fixed, stack allocated: [u64; N] (current)
  2. Compile time fixed, heap allocated: Box<[u64; N]>
  3. Runtime fixed, heap allocated: Box<[u64]>
  4. Runtime dynamic, heap allocated: Vec<u64>

When working with large-ish sizes, e.g. U4096, stack size may be limited and one of approaches 2-4 is required.

When implementing runtime-negotiated cryptographic protocols (I stumbled on this in the icao-9303 lib), a runtime-sized representation (3 or 4) is required.

To approximate natural numbers with virtually unlimited size, 3 or 4 is required.

This is orthogonal to the problem of algorithm selection, which depends purely on size and would be the same for all approaches. Though practically, limited stack size prevents you from using approach 1 at sizes where anything other than base-case algorithms is relevant (GMP doesn't use fancy multiplication until 20-30 limbs on modern hardware: https://gmplib.org/devel/thres/MUL_TOOM22_THRESHOLD).

The main downside of 2-4 is that we lose the Copy trait, which is great for UX and optimizations.
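The four storage strategies above can be sketched as Rust types (names invented for illustration; ruint today corresponds to the first). Only the stack-allocated form can derive Copy, which is the downside noted for options 2-4.

```rust
// 1. Compile-time fixed, stack allocated: Copy, fully unrollable loops.
#[derive(Clone, Copy)]
struct StackUint<const N: usize> {
    limbs: [u64; N],
}

// 2. Compile-time fixed, heap allocated: same const-generic algorithms,
//    but Copy is lost as soon as a Box is involved.
struct BoxedUint<const N: usize> {
    limbs: Box<[u64; N]>,
}

// 3. Runtime fixed, heap allocated: length negotiated once, then constant.
struct DynUint {
    limbs: Box<[u64]>,
}

// 4. Runtime dynamic, heap allocated: grows on demand, like Go's big.Int.
struct GrowUint {
    limbs: Vec<u64>,
}
```

Note that 3 and 4 also force every algorithm onto the slice-based code path, since the limb count is no longer a type-level constant.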

@prestwich
Collaborator

Slightly adjusted my own Cargo.toml changes; approving and setting this to auto-merge.

Development

Successfully merging this pull request may close these issues.

modular: Make mul_redc alloc-free
3 participants