Msm/update docs #545

Merged
merged 11 commits into from
Jun 19, 2024
47 changes: 17 additions & 30 deletions docs/docs/icicle/golang-bindings/msm-pre-computation.md
Expand Up @@ -4,31 +4,30 @@ To understand the theory behind MSM pre computation technique refer to Niall Emm

## Core package

### MSM PrecomputeBases
### MSM PrecomputePoints

`PrecomputeBases` and `G2PrecomputeBases` exists for all supported curves.
`PrecomputePoints` and `G2PrecomputePoints` exist for all supported curves.

#### Description

This function extends each provided base point $(P)$ with its multiples $(2^lP, 2^{2l}P, ..., 2^{(precompute_factor - 1) \cdot l}P)$, where $(l)$ is a level of precomputation determined by the `precompute_factor`. The extended set of points facilitates faster MSM computations by allowing the MSM algorithm to leverage precomputed multiples of base points, reducing the number of point additions required during the computation.

The precomputation process is crucial for optimizing MSM operations, especially when dealing with large sets of points and scalars. By precomputing and storing multiples of the base points, the MSM function can more efficiently compute the scalar-point multiplications.
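
The description above can be illustrated with a minimal sketch. It does not use the ICICLE API: the "points" are integers modulo a prime under addition, standing in for curve points, and `extend`, `mulWithPrecompute`, and `modulus` are hypothetical names chosen for illustration. The sketch shows why storing $P, 2^lP, 2^{2l}P, \ldots$ lets a scalar be consumed in independent $l$-bit chunks without doublings at MSM time.

```go
package main

import "fmt"

// Toy group (assumption): integers modulo a prime under addition,
// used here only as a stand-in for elliptic curve points.
const modulus uint64 = 1_000_000_007

// extend returns [P, 2^l*P, 2^(2l)*P, ...] with precomputeFactor entries.
func extend(p uint64, precomputeFactor int, l uint) []uint64 {
	out := make([]uint64, precomputeFactor)
	cur := p % modulus
	for i := range out {
		out[i] = cur
		// l doublings move from 2^(i*l)*P to 2^((i+1)*l)*P.
		for j := uint(0); j < l; j++ {
			cur = (cur + cur) % modulus
		}
	}
	return out
}

// mulWithPrecompute computes s*P from the extended points by splitting s
// into l-bit chunks. It requires len(ext)*l >= 32 and l <= 16.
// In a real MSM each chunk*point product is handled by the bucket method.
func mulWithPrecompute(s uint32, ext []uint64, l uint) uint64 {
	mask := uint64(1)<<l - 1
	rest := uint64(s)
	var acc uint64
	for _, q := range ext {
		chunk := rest & mask
		acc = (acc + chunk*q) % modulus
		rest >>= l
	}
	return acc
}

func main() {
	const l = 8
	ext := extend(7, 4, l) // 4 chunks of 8 bits cover a 32-bit scalar
	s := uint32(123456789)
	direct := uint64(s) * 7 % modulus
	fmt.Println(mulWithPrecompute(s, ext, l) == direct)
}
```

The cost is memory: each base point is stored `precomputeFactor` times, which is why the output slice in the examples below is allocated with that factor.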

#### `PrecomputeBases`
#### `PrecomputePoints`

Precomputes bases for MSM by extending each base point with its multiples.
Precomputes points for MSM by extending each base point with its multiples.

```go
func PrecomputeBases(points core.HostOrDeviceSlice, precomputeFactor int32, c int32, ctx *cr.DeviceContext, outputBases core.DeviceSlice) cr.CudaError
func PrecomputePoints(points core.HostOrDeviceSlice, msmSize int, cfg *core.MSMConfig, outputBases core.DeviceSlice) cr.CudaError
```

##### Parameters

- **`points`**: A slice of the original affine points to be extended with their multiples.
- **`precomputeFactor`**: Determines the total number of points to precompute for each base point.
- **`c`**: Currently unused; reserved for future compatibility.
- **`ctx`**: CUDA device context specifying the execution environment.
- **`outputBases`**: The device slice allocated for storing the extended bases.
- **`msmSize`**: The size of a single MSM, used to determine optimal parameters.
- **`cfg`**: The MSM configuration parameters.
- **`outputBases`**: The device slice allocated for storing the extended points.

##### Example

Expand All @@ -50,28 +49,27 @@ func main() {
var precomputeOut core.DeviceSlice
precomputeOut.Malloc(points[0].Size()*points.Len()*int(precomputeFactor), points[0].Size())

err := bn254.PrecomputeBases(points, precomputeFactor, 0, &cfg.Ctx, precomputeOut)
err := bn254.PrecomputePoints(points, 1024, &cfg, precomputeOut)
if err != cr.CudaSuccess {
log.Fatalf("PrecomputePoints failed: %v", err)
}
}
```

#### `G2PrecomputeBases`
#### `G2PrecomputePoints`

This method is the same as `PrecomputeBases` but for G2 points. Extends each G2 curve base point with its multiples for optimized MSM computations.
This method is the same as `PrecomputePoints` but for G2 points. Extends each G2 curve base point with its multiples for optimized MSM computations.

```go
func G2PrecomputeBases(points core.HostOrDeviceSlice, precomputeFactor int32, c int32, ctx *cr.DeviceContext, outputBases core.DeviceSlice) cr.CudaError
func G2PrecomputePoints(points core.HostOrDeviceSlice, msmSize int, cfg *core.MSMConfig, outputBases core.DeviceSlice) cr.CudaError
```

##### Parameters

- **`points`**: A slice of G2 curve points to be extended.
- **`precomputeFactor`**: The total number of points to precompute for each base.
- **`c`**: Reserved for future use to ensure compatibility with MSM operations.
- **`ctx`**: Specifies the CUDA device context for execution.
- **`outputBases`**: Allocated device slice for the extended bases.
- **`points`**: A slice of the original affine points to be extended with their multiples.
- **`msmSize`**: The size of a single MSM, used to determine optimal parameters.
- **`cfg`**: The MSM configuration parameters.
- **`outputBases`**: The device slice allocated for storing the extended points.

##### Example

Expand All @@ -93,20 +91,9 @@ func main() {
var precomputeOut core.DeviceSlice
precomputeOut.Malloc(points[0].Size()*points.Len()*int(precomputeFactor), points[0].Size())

err := g2.G2PrecomputeBases(points, precomputeFactor, 0, &cfg.Ctx, precomputeOut)
err := g2.G2PrecomputePoints(points, 1024, &cfg, precomputeOut)
if err != cr.CudaSuccess {
log.Fatalf("G2PrecomputePoints failed: %v", err)
}
}
```

### Benchmarks

Benchmarks were performed on an NVIDIA RTX 3090 Ti.

| Pre-computation factor | bn254 size `2^20` MSM, ms. | bn254 size `2^12` MSM, size `2^10` batch, ms. | bls12-381 size `2^20` MSM, ms. | bls12-381 size `2^12` MSM, size `2^10` batch, ms. |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| 1 | 14.1 | 82.8 | 25.5 | 136.7 |
| 2 | 11.8 | 76.6 | 20.3 | 123.8 |
| 4 | 10.9 | 73.8 | 18.1 | 117.8 |
| 8 | 10.6 | 73.7 | 17.2 | 116.0 |
6 changes: 5 additions & 1 deletion docs/docs/icicle/golang-bindings/msm.md
Expand Up @@ -122,7 +122,7 @@ func GetDefaultMSMConfig() MSMConfig

## How do I toggle between the supported algorithms?

When creating your MSM Config you may state which algorithm you wish to use. `cfg.Ctx.IsBigTriangle = true` will activate Large triangle accumulation and `cfg.Ctx.IsBigTriangle = false` will activate Bucket accumulation.
When creating your MSM Config you may state which algorithm you wish to use. `cfg.Ctx.IsBigTriangle = true` will activate Large triangle reduction and `cfg.Ctx.IsBigTriangle = false` will activate iterative reduction.

```go
...
Expand Down Expand Up @@ -152,6 +152,10 @@ out.Malloc(batchSize*p.Size(), p.Size())
...
```

## Parameters for optimal performance

Please refer to the [primitive description](../primitives/msm#choosing-optimal-parameters).

## Support for G2 group

To activate G2 support first you must make sure you are building the static libraries with G2 feature enabled as described in the [Golang building instructions](../golang-bindings.md#using-icicle-golang-bindings-in-your-project).
Expand Down
172 changes: 139 additions & 33 deletions docs/docs/icicle/primitives/msm.md
Expand Up @@ -54,36 +54,142 @@ You can learn more about how MSMs work from this [video](https://www.youtube.com
- [Golang](../golang-bindings/msm.md)
- [Rust](../rust-bindings//msm.md)

## Supported algorithms

Our MSM implementation supports two algorithms `Bucket accumulation` and `Large triangle accumulation`.

### Bucket accumulation

The Bucket Accumulation algorithm is a method of dividing the overall MSM task into smaller, more manageable sub-tasks. It involves partitioning scalars and their corresponding points into different "buckets" based on the scalar values.

Bucket Accumulation can be more parallel-friendly because it involves dividing the computation into smaller, independent tasks, distributing scalar-point pairs into buckets and summing points within each bucket. This division makes it well suited for parallel processing on GPUs.

#### When should I use Bucket accumulation?

In scenarios involving large MSM computations with many scalar-point pairs, the ability to parallelize operations makes Bucket Accumulation more efficient. The larger the MSM task, the more significant the potential gains from parallelization.

### Large triangle accumulation

Large Triangle Accumulation is a method for optimizing MSM which focuses on reducing the number of point doublings in the computation. This algorithm is based on the observation that the number of point doublings can be minimized by structuring the computation in a specific manner.

#### When should I use Large triangle accumulation?

The Large Triangle Accumulation algorithm is more sequential in nature, as it builds upon each step sequentially (accumulating sums and then performing doubling). This structure can make it less suitable for parallelization but potentially more efficient for a **large batch of smaller MSM computations**.

## MSM Modes

ICICLE MSM also supports two different modes `Batch MSM` and `Single MSM`

Batch MSM allows you to run many MSMs with a single API call while single MSM will launch a single MSM computation.

### Which mode should I use?

This decision is highly dependent on your use case and design. However, if your design allows for it, using batch mode can significantly improve efficiency. Batch processing allows you to perform multiple MSMs simultaneously, leveraging the parallel processing capabilities of GPUs.

Single MSM mode should be used when batching isn't possible or when you have to run a single MSM.
## Algorithm description

We follow the bucket method algorithm. The GPU implementation consists of four phases:

1. Preparation phase - The scalars are split into smaller scalars of `c` bits each. These are the bucket indices. The points are grouped according to their corresponding bucket index and the buckets are sorted by size.
2. Accumulation phase - Each bucket accumulates all of its points using a single thread. More than one thread is assigned to large buckets, in proportion to their size. A bucket is considered large if its size is above the large bucket threshold that is determined by the `large_bucket_factor` parameter. The large bucket threshold is the expected average bucket size times the `large_bucket_factor` parameter.
3. Buckets Reduction phase - Bucket results are multiplied by their corresponding bucket number, and each bucket module is reduced to a small number of final results. By default, this is done by an iterative algorithm that is highly parallel. Setting `is_big_triangle` to `true` switches this phase to the running sum algorithm described in the YouTube talk linked above, which is much less parallel.
4. Final accumulation phase - The final results from the last phase are accumulated using the double-and-add algorithm.
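
The four phases above can be sketched sequentially on a toy group. This is only an illustration under stated assumptions, not the GPU implementation: integers modulo a prime under addition stand in for curve points, scalars are 32 bits, the bucket reduction uses the running sum variant, and `bucketMSM`, `naiveMSM`, and `modulus` are hypothetical names.

```go
package main

import "fmt"

// Toy group (assumption): integers modulo a prime under addition.
const modulus uint64 = 1_000_000_007

func add(a, b uint64) uint64 { return (a + b) % modulus }

// naiveMSM computes sum(scalars[i] * points[i]) directly, for comparison.
func naiveMSM(scalars []uint32, points []uint64) uint64 {
	var acc uint64
	for i, s := range scalars {
		acc = add(acc, uint64(s)*(points[i]%modulus)%modulus)
	}
	return acc
}

// bucketMSM follows the bucket method with window size c (1 <= c <= 16).
func bucketMSM(scalars []uint32, points []uint64, c uint) uint64 {
	const bitsize = 32
	nofWindows := (bitsize + c - 1) / c
	mask := uint32(1)<<c - 1
	windowSums := make([]uint64, nofWindows)
	for w := uint(0); w < nofWindows; w++ {
		// Preparation: the c-bit slice of each scalar is its bucket index.
		// Accumulation: every bucket sums the points that map to it.
		buckets := make([]uint64, int(1)<<c)
		for i, s := range scalars {
			idx := (s >> (w * c)) & mask
			if idx != 0 {
				buckets[idx] = add(buckets[idx], points[i]%modulus)
			}
		}
		// Reduction (running sum): computes sum over k of k * buckets[k].
		var running, total uint64
		for k := len(buckets) - 1; k >= 1; k-- {
			running = add(running, buckets[k])
			total = add(total, running)
		}
		windowSums[w] = total
	}
	// Final accumulation: double-and-add over the window results.
	var result uint64
	for w := int(nofWindows) - 1; w >= 0; w-- {
		for j := uint(0); j < c; j++ {
			result = add(result, result)
		}
		result = add(result, windowSums[w])
	}
	return result
}

func main() {
	scalars := []uint32{5, 1<<31 | 3, 123456789, 0}
	points := []uint64{7, 11, 13, 17}
	// c = 5 does not divide the 32-bit scalar size, so the top window is short.
	fmt.Println(bucketMSM(scalars, points, 4) == naiveMSM(scalars, points))
	fmt.Println(bucketMSM(scalars, points, 5) == naiveMSM(scalars, points))
}
```

In this sketch each window plays the role of one bucket module. On the GPU the per-bucket sums and the reduction run in parallel, which is where `c` and `large_bucket_factor` matter.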

## Batched MSM

The MSM supports batch mode, which runs multiple MSMs in parallel. As long as there is enough memory available, it's always better to use batch mode than to run single MSMs one after another. We support running a batch of MSMs that share the same points as well as a batch of MSMs that use different points.

## MSM configuration

```cpp
/**
* @struct MSMConfig
* Struct that encodes MSM parameters to be passed into the [MSM](@ref MSM) function. The intended use of this struct
* is to create it using [default_msm_config](@ref default_msm_config) function and then you'll hopefully only need to
* change a small number of default values for each of your MSMs.
*/
struct MSMConfig {
device_context::DeviceContext ctx; /**< Details related to the device such as its id and stream id. */
int points_size; /**< Number of points in the MSM. If a batch of MSMs needs to be computed, this should be
* a number of different points. So, if each MSM re-uses the same set of points, this
* variable is set equal to the MSM size. And if every MSM uses a distinct set of
* points, it should be set to the product of MSM size and [batch_size](@ref
* batch_size). Default value: 0 (meaning it's equal to the MSM size). */
int precompute_factor; /**< The number of extra points to pre-compute for each point. See the
* [precompute_msm_points](@ref precompute_msm_points) function, `precompute_factor` passed
* there needs to be equal to the one used here. Larger values decrease the
* number of computations to make, on-line memory footprint, but increase the static
* memory footprint. Default value: 1 (i.e. don't pre-compute). */
int c; /**< \f$ c \f$ value, or "window bitsize" which is the main parameter of the "bucket
* method" that we use to solve the MSM problem. As a rule of thumb, larger value
* means more on-line memory footprint but also more parallelism and less computational
* complexity (up to a certain point). Currently pre-computation is independent of
* \f$ c \f$, however in the future value of \f$ c \f$ here and the one passed into the
* [precompute_msm_points](@ref precompute_msm_points) function will need to be identical.
* Default value: 0 (the optimal value of \f$ c \f$ is chosen automatically). */
int bitsize; /**< Number of bits of the largest scalar. Typically equals the bitsize of scalar field,
* but if a different (better) upper bound is known, it should be reflected in this
* variable. Default value: 0 (set to the bitsize of scalar field). */
int large_bucket_factor; /**< Variable that controls how sensitive the algorithm is to the buckets that occur
* very frequently. Useful for efficient treatment of non-uniform distributions of
* scalars and "top windows" with few bits. Can be set to 0 to disable separate
* treatment of large buckets altogether. Default value: 10. */
int batch_size; /**< The number of MSMs to compute. Default value: 1. */
bool are_scalars_on_device; /**< True if scalars are on device and false if they're on host. Default value:
* false. */
bool are_scalars_montgomery_form; /**< True if scalars are in Montgomery form and false otherwise. Default value:
* true. */
bool are_points_on_device; /**< True if points are on device and false if they're on host. Default value: false. */
bool are_points_montgomery_form; /**< True if coordinates of points are in Montgomery form and false otherwise.
* Default value: true. */
bool are_results_on_device; /**< True if the results should be on device and false if they should be on host. If set
* to false, `is_async` won't take effect because a synchronization is needed to
* transfer results to the host. Default value: false. */
bool is_big_triangle; /**< Whether to do "bucket accumulation" serially. Decreases computational complexity
* but also greatly decreases parallelism, so only suitable for large batches of MSMs.
* Default value: false. */
bool is_async; /**< Whether to run the MSM asynchronously. If set to true, the MSM function will be
* non-blocking and you'd need to synchronize it explicitly by running
* `cudaStreamSynchronize` or `cudaDeviceSynchronize`. If set to false, the MSM
* function will block the current CPU thread. */
};
```

## Choosing optimal parameters

`is_big_triangle` should be `false` in almost all cases. It might provide better results only for very small MSMs (smaller than $2^8$) with a large batch (larger than 100) but this should be tested per scenario.

Large buckets exist in two cases:

1. When the scalar distribution isn't uniform.
2. When `c` does not divide the scalar bit-size.

A `large_bucket_factor` of 10 yields good results in most cases, but it's best to fine-tune this parameter per `c` and per scalar distribution.

The two most important parameters for performance are `c` and `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance, precomputation cannot be used. In this case the best `c` value is usually around $\log_2(msmSize) - 4$. However, in most protocols the points are known in advance and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined with a `c` value around $\log_2(msmSize) - 1$.
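
As a starting point for tuning, the rule of thumb above can be written down as a small helper. This is a sketch, not part of ICICLE: `suggestParams` is a hypothetical name, the result ignores memory limits, and the values should be treated as an initial guess to benchmark around.

```go
package main

import (
	"fmt"
	"math/bits"
)

// suggestParams returns a starting (c, precomputeFactor) pair following the
// rule of thumb: c is about log2(msmSize)-4 without precomputation, and
// about log2(msmSize)-1 with maximum precomputation (one bucket module).
// Hypothetical helper; memory limits are not taken into account.
func suggestParams(msmSize, bitsize int, pointsKnown bool) (c, precomputeFactor int) {
	logSize := bits.Len(uint(msmSize)) - 1 // floor(log2(msmSize))
	if pointsKnown {
		c = logSize - 1
	} else {
		c = logSize - 4
	}
	if c < 1 {
		c = 1
	}
	if !pointsKnown {
		return c, 1
	}
	// Maximum precomputation: ceil(bitsize / c) leaves a single bucket module.
	return c, (bitsize + c - 1) / c
}

func main() {
	// 253 is assumed here as the scalar bitsize of the curve in use.
	c, pf := suggestParams(1<<22, 253, true)
	fmt.Println("with precomputation:", c, pf)
	c, pf = suggestParams(1<<18, 253, false)
	fmt.Println("without precomputation:", c, pf)
}
```

If the suggested precompute factor does not fit in device memory, lowering it (and possibly `c`) is the usual fallback.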

## Memory usage estimation

The main memory requirements of the MSM are the following:

- Scalars - `sizeof(scalar_t) * msm_size * batch_size`
- Scalar indices - `~6 * sizeof(unsigned) * nof_bucket_modules * msm_size * batch_size`
- Points - `sizeof(affine_t) * msm_size * precompute_factor * batch_size`
- Buckets - `sizeof(projective_t) * nof_bucket_modules * 2^c * batch_size`

where `nof_bucket_modules = ceil(ceil(bitsize / c) / precompute_factor)`

During the MSM computation first the memory for scalars and scalar indices is allocated, then the indices are freed and points and buckets are allocated. This is why a good estimation for the required memory is the following formula:

$\max(scalars + scalarIndices, scalars + points + buckets)$

This gives a good approximation within 10% of the actual required memory for most cases.
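
The estimate can be evaluated with a few lines of code. The sketch below is not part of ICICLE: `estimateMSMBytes` and `Sizes` are hypothetical names, `sizeof(unsigned)` is assumed to be 4 bytes, and the element sizes passed in `main` (32-byte scalars, 96-byte affine points, 144-byte projective points) are assumptions for a curve with a 253-bit scalar field.

```go
package main

import "fmt"

// Sizes holds element sizes in bytes (caller-supplied assumptions).
type Sizes struct {
	Scalar, Affine, Projective int64
}

func ceilDiv(a, b int64) int64 { return (a + b - 1) / b }

// estimateMSMBytes evaluates
// max(scalars + scalarIndices, scalars + points + buckets).
func estimateMSMBytes(sz Sizes, msmSize, batchSize, precomputeFactor, c, bitsize int64) int64 {
	nofBucketModules := ceilDiv(ceilDiv(bitsize, c), precomputeFactor)
	scalars := sz.Scalar * msmSize * batchSize
	// ~6 * sizeof(unsigned), with sizeof(unsigned) assumed to be 4 bytes.
	scalarIndices := 6 * 4 * nofBucketModules * msmSize * batchSize
	points := sz.Affine * msmSize * precomputeFactor * batchSize
	buckets := sz.Projective * nofBucketModules * (int64(1) << c) * batchSize

	first := scalars + scalarIndices
	second := scalars + points + buckets
	if first > second {
		return first
	}
	return second
}

func main() {
	sz := Sizes{Scalar: 32, Affine: 96, Projective: 144} // assumed sizes
	est := estimateMSMBytes(sz, 1<<24, 1, 1, 18, 253)
	fmt.Printf("estimated memory: %.2f GB\n", float64(est)/1e9)
}
```

Running such an estimate before choosing `c` and `precompute_factor` helps rule out combinations that cannot fit on the device.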

## Example parameters

The table below shows optimal parameters for different MSMs. They are optimal for the BLS12-377 curve when running on an NVIDIA GeForce RTX 3090 Ti. This is the configuration used:

```cpp
msm::MSMConfig config = {
ctx, // DeviceContext
N, // points_size
precomp_factor, // precompute_factor
user_c, // c
0, // bitsize
10, // large_bucket_factor
batch_size, // batch_size
false, // are_scalars_on_device
false, // are_scalars_montgomery_form
true, // are_points_on_device
false, // are_points_montgomery_form
true, // are_results_on_device
false, // is_big_triangle
true // is_async
};
```

Here are the parameters and the results for the different cases:

| MSM size (log2) | Batch size | Precompute factor | c | Memory estimation (GB) | Actual memory (GB) | Single MSM time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 1 | 1 | 9 | 0.00227 | 0.00277 | 9.2 |
| 10 | 1 | 23 | 11 | 0.00259 | 0.00272 | 1.76 |
| 10 | 1000 | 1 | 7 | 0.94 | 1.09 | 0.051 |
| 10 | 1000 | 23 | 11 | 2.59 | 2.74 | 0.025 |
| 15 | 1 | 1 | 11 | 0.011 | 0.019 | 9.9 |
| 15 | 1 | 16 | 16 | 0.061 | 0.065 | 2.4 |
| 15 | 100 | 1 | 11 | 1.91 | 1.92 | 0.84 |
| 15 | 100 | 19 | 14 | 6.32 | 6.61 | 0.56 |
| 18 | 1 | 1 | 14 | 0.128 | 0.128 | 14.4 |
| 18 | 1 | 15 | 17 | 0.40 | 0.42 | 5.9 |
| 22 | 1 | 1 | 17 | 1.64 | 1.65 | 68 |
| 22 | 1 | 13 | 21 | 5.67 | 5.94 | 54 |
| 24 | 1 | 1 | 18 | 6.58 | 6.61 | 232 |
| 24 | 1 | 7 | 21 | 12.4 | 13.4 | 199 |

The optimal values can vary per GPU and per curve. It is best to try a few combinations until you get the best results for your specific case.