-
Notifications
You must be signed in to change notification settings - Fork 242
/
Copy pathmakemcs.txt
396 lines (256 loc) · 11.6 KB
/
makemcs.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
This file offers some guidance for developers who need to create assembly
language Macros for an unsupported processor to support the Comba or KCM
methods for modular multiplication. Note that the "standard" C build of MIRACL
may be fast enough, or else the provided C macros may be sufficient. Use c.mcs
or c1.mcs, but if mr_qltype is defined in mirdef.h, use c3.mcs
An .mcs file contains C or assembly language macros to implement some simple
operations on big number digits. On most modern Load/Store RISC processors
the optimal assembly language required is fortunately quite generic - and
that's why this approach is possible. The Instruction set need only support
unsigned multiply, add and subtract. It should also support standard indexed
addressing modes, where the effective memory address is calculated by adding
an offset to a register. The compiler should ideally support in-line assembly.
Correct memory address offsets and label names are automatically inserted by
the mex utility. It uses the standard C function "fprintf" to expand the
macros, and hence inserts appropriate numbers where-ever %d appears in the
macro. This implies that if the symbol % should be part of assembly language
syntax, it must be replaced by %%.
These macros when extracted by the mex.c program from the specified .mcs
file are used to implement some fast algorithms.
MULTIPLY
a2 a1 a0
b2 b1 b0
--------------
a2.b0 a1.b0 a0.b0
a2.b1 a1.b1 a0.b1
a2.b2 a1.b2 a0.b2
------------------------------
c4 c3 c2 c1 c0
Here (a1.b0) is a partial product. As we add up each column, the result is
accumulated in a triple-precision register x|y|z. When the column is totalled
z is written to memory and 0|x|y becomes the "carry" to the next column. Note
that x will always be "small", less than the number of partial products in
the longest column.
The algorithm for fast multi-precision multiplication multiplies a*b=c and
requires the following macros.
MUL_START
Push any registers if necessary, Initialise pointers to a, b and c,
and zero the triple-precision register required to accumulate the
partial products.
STEP
Calculate a partial product and add it to the triple-register.
The array offsets from the pointers to a and b are inserted
automatically by the mex utility.
MFIN
Store total for this column z, and calculate the carry for the next.
MUL_END
Store the left over carry in the left-most column, and pop registers
if necessary.
MULTUP
Here we just need the first (lower) half of c=a*b; So we use the same
MUL_START, STEP, and MFIN macros as above. However no carry is needed from
the last column, so a simpler LAST macro is employed here.
SQUARE
Multiprecision squaring can be done nearly twice as fast, as partial products
appear twice in most columns, but only need to be calculated once. Observe
that if a=b in the above example then a2.b0 = a0.b2 in the third column.
This algorithm finds c=a*a
SQR_START
Push registers, form pointers to a and c, zeroise triple register
DSTEP
Calculate a partial product and add it twice to column total in
the triple register.
SELF
Calculate a "diagonal element", e.g. (a1.a1) and add to total.
SFIN
Store total for this column z, and calculate the carry for the next
SQR_END
Store the left over carry in the left-most column, and pop registers
if necessary.
REDC
Montgomery's modular reduction algorithm calculates a%=b; It also accesses
the pre-computed variable "ndash".
REDC_START
Push registers, get pointers to a and b, and put ndash in a register.
Initialise the triple register x|y|z to 0|0|a[0]
RFINU
Multiply ndash*z and store lower half of result in a[i] for i-th
column. Multiply this number now by b[0]. Add the result to the
triple register x|y|z. Set the triple register to reflect the
"carry" to the next column 0|x|y. Add a[i+1] to the triple register.
RFIND
Store z into a[i]. Set triple register to reflect the "carry"
to the next column. Add a[i+1] to the triple register.
REDC_END
Store the left-over carry in the left-most two columns.
This algorithm also uses the STEP macro as described above.
ADDITION
This algorithm does a simple element-by-element adition c=a+b,
propagating the carries.
ADD_START
Gets pointers to a, b and c. Adds the first two elements
c[0]=a[0]+b[0]
ADD
Adds-with-carry c[i]=a[i]+b[i]
ADD_END
If the carry flag is still set, set the variable "carry"=1,
otherwise =0
INCREMENT
Very similar to the above, but this time calculates a+=b
INC_START
Gets pointers to a and b. Adds the first elements a[0]+=b[0]
INC
Adds-with-carry a[i]+=b[i]
INC_END
If the carry flag is still set, set the variable "carry=1",
otherwise 0
SUBTRACTION
Find c=a-b, propagating borrows
SUB_START
Gets pointers to a, b and c. Subtracts the first two elements
c[0]=a[0]-b[0]
SUB
Subtracts-with-borrow c[i]=a[i]-b[i]
SUB_END
If there is a "borrow" outstanding, set the variable "carry"=1,
otherwise =0. **NOTE** This MAY be indicated by carry flag=1 OR
by carry flag=0 - it depends on architecture, so be careful.
DECREMENT
Find a-=b, propagating borrows
DEC_START
Gets pointers to a and b. Subtracts the first two elements
a[0]-=b[0]
DEC
Subtracts-with-borrow a[i]-=b[i]
DEC_END
If there is a "borrow" outstanding, set the variable "carry"=1,
otherwise =0. **NOTE** This MAY be indicated by carry flag=1 OR
by carry flag=0 - it depends on architecture, so be careful.
SUMMATION
Adds c=a+b in a "for" loop. Each time around the loop a fixed block of
digits are added. A total of n such blocks are to be added.
KADD_START
Initialise pointers to a, b and c. Move n into a register. Set
the carry flag to zero. Provide a label for looping back to. Note
that label numbers are automatically inserted by the mex utility.
KASL
Decrement the n register. If its zero jump to label at the end of
this macro. If its not, advance the pointer registers to point to
the next block, and branch back to label in KADD_START.
**NOTE** It is vitally important that this macro does NOT affect
the carry flag
KADD_END
If the carry flag is still set, set the variable "carry=1",
otherwise 0
This algorithm also uses the ADD macro. See above
INCREMENTATION
Adds a+=b in a "for" loop. Each time around the loop a fixed block of
digits are added. A total of n such blocks are to be added.
KINC_START
Initialise pointers to a and b. Move n into a register. Set
the carry flag to zero. Provide a label for looping back to. Note
that label numbers are automatically inserted by the mex utility.
KIDL
Decrement the n register. If its zero jump to label at the end of
this macro. If its not, advance the pointer registers to point to
the next block, and branch back to label in KINC_START.
**NOTE** It is vitally important that this macro does NOT affect
the carry flag
KINC_END
If the carry flag is still set, set the variable "carry=1",
otherwise 0
This algorithm also uses the INC macro. See above
DECREMENTATION
Subtracts a-=b in a "for" loop. Each time around the loop a fixed block of
digits are subtracted. A total of n such blocks are to be subtracted.
KDEC_START
Initialise pointers to a and b. Move n into a register. Set
the carry flag to that state which indicates no borrow. Provide a
label for looping back to. Note that label numbers are automatically
inserted by the mex utility.
KDEC_END
Set the variable "carry" to 1 if the carry flag is in that state
which reflects an outstanding "borrow", otherwise 0
This algorithm also uses the KIDL and DEC macros. See above.
NEW! - April 2002
Interleaved steps can be used in multiplication to allow for improved
instruction scheduling. This could be a lot faster if the multiply unit takes
more than 1 clock cycle. The idea is to expose more ILP (Instruction Level
Parallelism) for a modern pipelined (and possibly super-scaler) load-store
processor to chew on.
In the calculation of the sum of partial products in a column, replace
STEPM - Multiply
STEPA - Add
STEPM - Multiply
STEPA - Add
STEPM - Multiply
STEPA - Add
STEPM - Multiply
STEPA - Add
with the interleaved
STEP1M - Multiply 1
STEP2M - Multiply 2
STEP1A - Add 1
STEP1M - Multiply 1
STEP2A - Add 2
STEP2M - Multiply 2
STEP1A - Add 1
STEP2A - Add 2
The same applies to DSTEP for squaring.
In this way the multiply instruction gets more time to complete to its
destination registers.
If the MEX program sees that STEP1M macro is present as well as STEP,
it will permit scheduling with the -s flag as above. Of course this requires
that there to be enough registers - STEP1x and STEP2x should ideally use
different registers.
If this is going to help it is particularly important that the destination
registers of the multiply steps STEP1M and STEP2M be distinct.
Many processors use hardware dynamic scheduling (like the Pentium II).
For such a processor scheduling the code like this will have little
effect. The ARM assembler re-orders the code automatically for
optimum scheduling, so again scheduling the code like this will have little
impact.
Some example *.mcs files in this format are included - see c.mcs and
arm.mcs for example.
This idea could be extended in a fairly obvious way, for example
STEP1M - Multiply 1
STEP2M - Multiply 2
STEP3M - Multiply 3
STEP1A - Add 1
STEP1M - Multiply 1
STEP2A - Add 2
STEP2M - Multiply 2
STEP3A - Add 3
STEP3M - Multiply 3
STEP1A - Add 1
STEP1M - Multiply 1
STEP2A - Add 2
STEP3A - Add 3
STEP1A - Add 1
This is currently NOT supported, it would require a relatively simple
modification to mex.c
This might be justified in the case of a very deeply pipelined multiplier.
The idea would be that Multiply 1, 2, and 3 might all be active
simultaneously.
New features 12/10/2007
NEW: PMUL, PMUL_START and PMUL_END macros to support fast reduction for
pseudo mersenne prime moduli.
PMULT
When reducing modulo p=2^m-d, where m is a multiple of the word length and
d is single precision, we have the useful identity 2^m=d mod p. The product
of two field elements can be considered as 2^m.U+L, where U is the upper
half of the product, and L is the lower half. Therefore the reduced value
is 2^m.U+L mod p = dU+L. The dU component can in turn be considered as
2^m.x+Y, where x will be single precision. So the final result is d.x+Y+L.
Finally a couple of subtractions of p might be required to get a result
less than p. The function PMULT calculates d.x and Y
PMUL_START
Initialises pointers to the top half of the product a[], and arrays
to hold d.x (b[]) and Y (c[]). Moves d to a register.
PMUL
Sets c[i]=a[i]*d and propagates carries. Sets b[i]=0
PMUL_END
Sets b[0] (and b[1]) to d.x
NEW: Macros to support Hybrid method - see http://eprint.iacr.org/2007/299
Basically a 2x2 or 4x4 block of partial products are calculated together
which reduces memory accesses. Uses more registers.