In modular arithmetic and computational algebra in general, you often need to raise a number to the $n$-th power — to do modular division, perform primality tests, or compute some combinatorial values — and you usually want to spend fewer than $\Theta(n)$ operations calculating it.

*Binary exponentiation*, also known as *exponentiation by squaring*, is a method that allows for computation of the $n$-th power using $O(\log n)$ multiplications, relying on the following observation:

$$
a^n = \begin{cases}
1, & n = 0 \\
(a^{n/2})^2, & 2 \mid n \\
a \cdot a^{n-1}, & 2 \nmid n
\end{cases}
$$

Since $n$ is at least halved every two recursive transitions, the depth of this recurrence and the total number of multiplications will be at most $O(\log n)$.
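For example, unrolling this recurrence for $n = 13$ takes just 5 multiplications instead of 12:

$$ a^{13} = a \cdot a^{12} = a \cdot (a^6)^2 = a \cdot ((a^3)^2)^2 = a \cdot ((a \cdot a^2)^2)^2 $$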

### #Recursive Implementation

As we already have a recurrence, it is natural to implement the algorithm as a case-matching recursive function:

```
const int M = 1e9 + 7; // modulo
typedef unsigned long long u64;

u64 binpow(u64 a, u64 n) {
    if (n == 0)
        return 1;
    if (n % 2 == 1)
        return binpow(a, n - 1) * a % M;
    else {
        u64 b = binpow(a, n / 2);
        return b * b % M;
    }
}
```

In our benchmark, we use $n = m - 2$ so that we compute the multiplicative inverse of $a$ modulo $m$: by Fermat's little theorem, $a^{m-1} \equiv 1 \pmod m$ for a prime $m$ and $a$ not divisible by $m$, and therefore $a^{m-2} \equiv a^{-1} \pmod m$:

```
u64 inverse(u64 a) {
    return binpow(a, M - 2);
}
```

We use $m = 10^9+7$, a modulo value commonly used in *competitive programming* to calculate checksums in combinatorial problems: it is prime (allowing inverses via binary exponentiation), sufficiently large while not overflowing `int` in addition or `long long` in multiplication, and easy to type as `1e9 + 7`. Since it is a compile-time constant, the compiler can optimize the modulo by replacing it with multiplication (even if it is not a compile-time constant, it is still cheaper to compute the magic constants by hand once and use them for fast reduction).

The execution path and hence the running time depends on the value of $n$. For this particular $n$, the baseline implementation takes around 330ns per call. As recursion introduces some overhead, it makes sense to unroll it into an iterative one.

### #Iterative Implementation

The result of $a^n$ can be represented as the product of $a$ to some powers of two — those that correspond to 1s in the binary representation of $n$. For example, if $n = 42 = 32 + 8 + 2$, then

$$ a^{42} = a^{32+8+2} = a^{32} \cdot a^8 \cdot a^2 $$

To calculate this product, we can iterate over the bits of $n$ while maintaining two variables: the value of $a^{2^k}$ and the current product after considering the $k$ lowest bits. On each step, we multiply the current product by $a^{2^k}$ if the $k$-th bit of $n$ is set, and, in either case, square $a^{2^k}$ to get $a^{2^k \cdot 2} = a^{2^{k+1}}$ that will be used on the next iteration.

```
u64 binpow(u64 a, u64 n) {
    u64 r = 1;
    while (n) {
        if (n & 1)
            r = r * a % M;
        a = a * a % M;
        n >>= 1;
    }
    return r;
}
```

The iterative implementation takes about 180ns per call. The heavy calculations are the same; the improvement mainly comes from the reduced dependency chain: `a = a * a % M` needs to finish before the loop can proceed, and it can now execute concurrently with `r = r * a % M`.

The performance also benefits from $n$ being a constant, making all branches predictable and letting the scheduler know what needs to be executed in advance. The compiler, however, does not take advantage of it and does not unroll the `while (n) n >>= 1` loop. We can rewrite it as a `for` loop that takes a constant 30 iterations:

```
u64 inverse(u64 a) {
    u64 r = 1;

    #pragma GCC unroll(30)
    for (int l = 0; l < 30; l++) {
        if ( (M - 2) >> l & 1 )
            r = r * a % M;
        a = a * a % M;
    }

    return r;
}
```

This forces the compiler to generate only the instructions we need, shaving off another 10ns and bringing the total running time to ~170ns.

Note that the performance depends not only on the binary length of $n$, but also on the number of binary 1s. If $n$ is $2^{30}$, the computation takes around 20ns less because it doesn't have to perform these off-path multiplications.