## Introduction

Modular Exponentiation is one of the fundamental functions in modern cryptography, used in RSA Encryption, the Diffie-Hellman Algorithm, and Elliptic Curve Cryptography as well as in Probabilistic Primality Tests such as the Miller-Rabin and Baillie-PSW Tests.

Because of these important uses, a great deal of time and research has gone into creating fast and efficient implementations, which I will introduce in this article. While most fast implementations are normally used with BigInteger arithmetic, I will introduce them using simple word-sized arithmetic to make it easier for everyone to understand what is going on. I will also be ignoring special cases such as division by zero: in production code these issues must be handled, but for this article I feel they would be a distraction and so will skip straight to the core parts of the code.

On a similar note, I call this a "Practical" introduction, as I will not be covering any of the theoretical aspects of these algorithms, as there are already plenty of those publicly available. The vast majority of books on computer arithmetic and modern cryptography include a section on modular exponentiation, and a simple web search will reveal hundreds of similar resources.

With that said, let's begin.

## Code

So let's start with the naive Exponentiation by Squaring.

uint32 modpow_expsqr(uint32 b, uint32 X, uint32 M)
{
uint64 B = b;
uint64 P = 1;
while(X != 0)
{
if((X & 1) == 1)
{
P = (P * B) % M;
}
B = (B * B) % M;
X >>= 1;
}
return (uint32)P;
}

So right off the bat you may notice there's an issue. In the while loop, B = (B * B) % M; occurs during every iteration of the loop. This means that's no matter what, there will always be an extra Modular Squaring. Multiply and Modulus are two of the slowest arithmetic operations, and when you start dealing with BigInteger Arithmetic, the time delay can be even more significant. So let's rearrange this so that extra code is gone.

uint32 xsqrmodpow(uint32 b, uint32 X, uint32 M)
{
uint64 B, D;
B = b;
B %= M;
D = 1;
if ((X & 1) == 1)
{
D = B;
}
while(X > 1)
{
X >>= 1;
B = (B * B) % M;
if ((X & 1) == 1)
{
D = (D * B) % M;
}
}
return (uint32)D;
}

For wordsize arithmetic, the difference isn't that much better. But again, when you start dealing with 1024 bit numbers, the savings is huge.

#### Single Precision Barrett Reduction

Of course, the slowest part of the function is the Modulus Reduction, which is where the majority of the optimizations for modpow were developed. One such optimization is Barrett Reduction, where you calculate an integer reciprocal. While this is an effective optimization, there is a slight problem with the naive Barrett algorithm, in that it requires a multiplication with double the precision of the Modulus. While not a major problem, it is unnecessary. An accurate trial quotient can be generated using only the upper bits of the product. This means that we can find the remainder using only 2 N bit multiplications with only a few more subtractions than the original Barrett Reductions

uint32 barmodpow(uint32 B, uint32 X, uint32 M)
{
unsigned int S;
uint32 D, R;
S = CeilLog2(M);
R = sprecip(M, S);
D = 1;
B %= M;
if ((X & 1) == 1)
{
D = B;
}
while ((X >>= 1) != 0)
{
B = barmodmul(B, B, M, R, S);
if ((X & 1) == 1)
{
D = barmodmul(D, B, M, R, S);
}
}
return D;
}
uint32 barmodmul(uint32 A, uint32 B, uint32 M, uint32 R, unsigned int S)
{
uint64 P, T;
P = A;
P *= (uint64)B;
if (P >= M)
{
T = P >> S;
T *= R;
T >>= (S - 1);
T *= M;
P -= T;
unsigned int Ct = 4;
while((P >= M) && (Ct-- != 0))
{
P -= M;
}
}
return (uint32)P;
}
uint32 CeilLog2(uint32 V)
{
size_t S = 0;
while (V > 0)
{
S++;
V >>= 1;
}
return S;
}
uint32 sprecip(uint32 N, unsigned int S)
{
uint64 D(1);
D <<= ((S << 1) - 1);
D /= N;
return (uint32)D;
}

#### Montgomery Reduction

Montgomery Reduction is probably the most well known form of Modular Reduction. Essentially what Montgomery Reduction does is allow us to modify the numbers so that we can use a different number to perform the Division and Modulus by. In our case, we'll use a power of two, as we can replace the division with a shift and modulus with a bitwise and operation.

The standard Binary-Montgomery Reduction will only work correction with odd moduli, which usually isn't a problem: in most cases where we use Modular Exponentiation (i.e. encryption) we have to use odd numbers as they are the product of two large primes. However we can use Montgomery Reduction with even numbers by first factoring the modulus into an odd number and a power of two. We then have to perform two separate modular exponentiations using the two factors. Thankfully the second modpow with the power of two is much, much faster than even Montgomery Reduction, and so there is little loss of performance. We can then come up with the final answer by using the Chinese Remainder Theorem to derive the final product.

The Montgomery Reduction is a little more complicated than the other sections, so let's examine the functions:

uint32 montin(uint32 A, uint32 M)
uint32 montout(uint32 A, uint32 K, uint32 M)

In order for Montgomery reduction to work, we have to transform the numbers into the Montgomery "domain" which allows us to work perform modular arithmetic with different numbers that can be optimized. `montin`

transforms the number into the Montgomery domain, and `montout`

takes the number out of the Montgomery domain.

void montgcd(uint32 A, uint32 B, uint32& R, uint32& K)

`montgcd`

is an optimized GCD algorithm from Hacker's Delight that finds the multiplicative inverse of A & B, which in our case are our original modulus M, and the highest power of 2 greater than M, which will serve as our new modulus.

uint32 montredc(uint32 A, uint32 B, uint32 M, uint32 R)

And lastly `montredc`

performs the actual Montgomery Reduction. It multiplies A & B and finds the remainder modulo M in the Montgomery Domain.

uint32 montmodpow(uint32 B, uint32 X, uint32 M)
{
if((M & 1) == 1)
{
return montmodpow_odd(B, X, M);
}
else
{
return montmodpow_even(B, X, M);
}
}
uint32 montmodpow_odd(uint32 B, uint32 X, uint32 M)
{
uint32 D, R, K;
montgcd(1lu << 31, M, R, K);
B %= M;
B = montin(B, M);
if ((X & 1) == 1)
{
D = B;
}
else
{
D = montin(1, M);
}
while ((X >>= 1) != 0)
{
B = montredc(B, B, M, K);
if ((X & 1) == 1)
{
D = montredc(D, B, M, K);
}
}
D = montout(D, R, M);
return D;
}
uint32 montmodpow_even(uint32 B, uint32 X, uint32 M)
{
uint32 R, K, I, P;
uint32 X1, X2, Y;
size_t C;
C = ntz32(M);
M >>= C;
P = 1lu << C;
montgcd(P >> 1, M, R, K);
I = P - K;
X1 = montmodpow_odd(B, X, M);
X2 = modpow2x(B, X, P);
if (X2 < X1)
{
Y = (X1 - X2);
Y &= (P - 1);
if (Y != 0)
{
Y = P - Y;
}
}
else
{
Y = (X2 - X1);
Y &= (P - 1);
}
Y *= I;
Y &= (P - 1);
X1 += (M * Y);
return X1;
}
uint32 ntz32(uint32 N)
{
uint32 C = 32;
if(N != 0)
{
C = 0;
while((N & 1) == 0)
{
C++;
N >>= 1;
}
}
return C;
}
uint32 montredc(uint32 A, uint32 B, uint32 M, uint32 R)
{
uint64 T(A), X;
bool Of;
T *= B;
X = T;
T &= 0xffffffff;
T *= R;
T &= 0xffffffff;
T *= M;
T += X;
Of = (T < X);
T >>= 32;
if (Of || (T >= M))
{
T -= M;
}
return (uint32)T;
}
uint32 montin(uint32 A, uint32 M)
{
uint64 B = A;
B <<= 32;
B %= M;
return (uint32)B;
}
uint32 montout(uint32 A, uint32 K, uint32 M)
{
uint64 B = A;
B *= K;
B %= M;
return (uint32)B;
}
void montgcd(uint32 A, uint32 B, uint32& R, uint32& K)
{
uint32 alpha, beta, u, v;
u = 1;
v = 0;
alpha = A;
beta = B;
while (A > 0)
{
A >>= 1;
v >>= 1;
if ((u & 1) == 1)
{
u = ((u ^ beta) >> 1) + (u & beta);
v += alpha;
}
else
{
u >>= 1;
}
}
R = u;
K = v;
}

## K-Ary Modpow

Unlike the other methods we've looked at so far, the K-Ary method is not a different type of Modular Exponentiation, but rather is a general way to speed up the the other forms by storing parts of the calculation that we only need to calculate once.

Now before we get to the actual algorithm, let's get back to the basics of modular exponentiation. All the different ways we've looked at so far have had more or less the same design:

- Initialize our variables. Let 'I' = 0, the index of the Least Significant Bit
- If the
`I`

th least significant bit of the `Exponent`

is set, multiply the `Result`

by the `Base`

, and reduce by the `Modulus`

. - Square the
`Base`

and reduce it by the `Modulus`

. - Increment
`I`

by 1. - If I is not greater than the Most Significant Bit, go to Step 2.

Notice that we perform the calculations based on the least significant bit of the exponent.

So all of the methods we've looked at so far are based on us examining the exponent from right to left, i.e. the least significant bit to the most significant bit....

But what if we want to start at the other end?

Well the algorithm is pretty much the same, except now instead of squaring the base, we square the result.

1. Initialize our variables. Let '`I`

' be the index of the Most Significant Bit

2. Square the `Result`

, and reduce it by the `Modulus`

.

3. If the `I`

th most significant bit is set, multiply the `Result`

by the `Base`

, and reduce by the `Modulus`

.

4. If `I`

is not equal to 0, then Decrement `I`

, and go to Step 2.

Ok, so why is this useful?

Well, let's look at a simple example of the operation: (25^15) % 37

So our variables are

Base = 25

Modulus = 37

Exponent = 15

and of course we start out Result at 1.

In binary, our Exponent is 1111.

So let's quickly work through the problem. Since each bit is set, we'll multiply by the `Base`

each time.

Loop 1:

a) Square the `Result`

, and reduce by the `Modulus`

1 = (1*1) % 37

b) Multiply the `Result`

by the `Base`

, and reduce by the `Modulus`

25 = (1 * 25) % 37

Loop 2:

a) Square the `Result`

, and reduce by the `Modulus`

33 = (25 * 25) % 37

b) Multiply the `Result`

by the `Base`

, and reduce by the `Modulus`

11 = (33 * 25) % 37

Loop 3:

a) Square the `Result`

, and reduce by the `Modulus`

10 = (11 * 11) % 37

b) Multiply the `Result`

by the `Base`

, and reduce by the `Modulus`

28 = (10 * 25) % 37

Loop 4:

a) Square the `Result`

, and reduce by the `Modulus`

7 = (28 * 28) % 37

b) Multiply the `Result`

by the `Base`

, and reduce by the `Modulus`

27 = (7 * 25) % 37

The K-Ary Method is based on the observation that we can save some multiplication, by using a part

of the answer we've already calculated.

Instead of performing operations based in each bit, we work with groups of digits.

So for instance, 15 in binary 1111, we can split it in half and get 11 11. Since 11 is 3 in decimal, we can represent our exponent as 15 = (3*4)+3.

Use a temporary variable '`T`

' equal to (25^3)%37

So 11 = (25^3) % 37

`T`

= 11.

Since 3 takes two binary digits, we will square and reduce our `Result`

twice.

`Result`

= (11 * 11) % 37

`Result`

= 10

Square and reduce again

`Result`

= (10 * 10) % 37

`Result`

= 26

And now multiply by our temporary `T`

and reduce

`Result`

= (26 * 11) % 37

`Result`

= 27

And we get our original answer.

The idea then is to split up our exponent into separate digits, each digit with N bits. We then calculate a table of our `Base`

to the power of each exponent from I = 0 to I < N. The nice thing is that we get two values automatically for free, since

`Base`

^ 0 = 1

and

`Base`

^ 1 = `Base`

Then for each loop, we square and reduce our answer N times, and we lookup the Most Significant Digit in the table, and multiply our `Result`

by that value.

The nice thing about the K-Ary Method is that it can be used to speed up Modular Exponentiation no matter what form of Modular Reduction you are using. So there are separate functions that perform the K-Ary Method for regular Division, Barrett Reduction, and Montgomery Reduction. The example code performs K-Ary reduction by splitting the Exponent into 4 bits.

uint32 ReverseNibbles(uint32 N)
{
N = ((N >> 4) & 0x0F0F0F0F) | ((N & 0x0F0F0F0F) << 4);
N = ((N >> 8) & 0x00FF00FF) | ((N & 0x00FF00FF) << 8);
N = (N >> 16) | (N << 16);
return N;
}
uint32 xsqrmodpow_kary_16(uint32 B, uint32 X, uint32 M)
{
uint32 Table[15];
unsigned int C, I;
uint32 T, D;
C = CeilLog2(X);
X = ReverseNibbles(X);
I = C & 3;
C >>= 2;
C += (I + 3) >> 2;
X >>= (8 - C) << 2;
B %= M;
D = 1;
Table[0] = B;
for(I = 1; I < 15; I++)
{
Table[I] = modmul(Table[I - 1], B, M);
}
do
{
T = X & 15;
X >>= 4;
for(I = 0; I < 4; I++)
{
D = modmul(D, D, M);
}
if (T != 0)
{
D = modmul(Table[T - 1], D, M);
}
}while(--C != 0);
return D;
}
uint32 barmodpow_kary_16(uint32 B, uint32 X, uint32 M)
{
uint32 Table[15];
unsigned int C, I, S;
uint32 T, D, R;
C = CeilLog2(X);
X = ReverseNibbles(X);
I = C & 3;
C >>= 2;
C += (I + 3) >> 2;
X >>= (8 - C) << 2;
S = CeilLog2(M);
R = sprecip(M, S);
B %= M;
D = 1;
Table[0] = B;
for (I = 1; I < 15; I++)
{
Table[I] = barmodmul(Table[I - 1], B, M, R, S);
}
do
{
T = X & 15;
X >>= 4;
for (I = 0; I < 4; I++)
{
D = barmodmul(D, D, M, R, S);
}
if (T != 0)
{
D = barmodmul(Table[T - 1], D, M, R, S);
}
} while (--C != 0);
return D;
}
uint32 montmodpow_kary_16(uint32 B, uint32 X, uint32 M)
{
if ((M & 1) == 1)
{
return montmodpow_odd_kary_16(B, X, M);
}
else
{
return montmodpow_even_kary_16(B, X, M);
}
}
uint32 montmodpow_odd_kary_16(uint32 B, uint32 X, uint32 M)
{
uint32 Table[15];
unsigned int C, I;
uint32 T, D, R, K;
C = CeilLog2(X);
X = ReverseNibbles(X);
I = C & 3;
C >>= 2;
C += (I + 3) >> 2;
X >>= (8 - C) << 2;
montgcd(1lu << 31, M, R, K);
B %= M;
B = montin(B, M);
D = montin(1, M);
Table[0] = B;
for (I = 1; I < 15; I++)
{
Table[I] = montredc(Table[I - 1], B, M, K);
}
do
{
T = X & 15;
X >>= 4;
for (I = 0; I < 4; I++)
{
D = montredc(D, D, M, K);
}
if (T != 0)
{
D = montredc(Table[T - 1], D, M, K);
}
} while (--C != 0);
D = montout(D, R, M);
return D;
}
uint32 montmodpow_even_kary_16(uint32 B, uint32 X, uint32 M)
{
uint32 R, K, I, P;
uint32 X1, X2, Y;
size_t C;
C = ntz32(M);
M >>= C;
P = 1lu << C;
montgcd(P >> 1, M, R, K);
I = P - K;
X1 = montmodpow_odd_kary_16(B, X, M);
X2 = modpow2x(B, X, P);
if (X2 < X1)
{
Y = (X1 - X2);
Y &= (P - 1);
if (Y != 0)
{
Y = P - Y;
}
}
else
{
Y = (X2 - X1);
Y &= (P - 1);
}
Y *= I;
Y &= (P - 1);
X1 += (M * Y);
return X1;
}

## Conclusion

My goal for this article was to write what I had wanted to find when I first started learning about Modular Exponentiation. While I agree that discussing the theory is very important, I've always found that it's easier to get started by looking at a working example.

I hope you all found this useful. If you have any questions or find any errors, please let me know.

Also, while not required, I would greatly appreciate that if you use this code for anything, you include a reference back to this article.

Thank you very much!

Jacob Wells