Integer Factorization

The problem of factoring integers into primes is central to computational number theory. It has been studied since at least the 3rd century BC, and many methods have been developed that are efficient for different inputs.

In this case study, we specifically consider the factorization of word-sized integers: those on the order of $10^9$ and $10^{18}$. Unusually for this book, you may actually learn an asymptotically better algorithm here: we start with a few basic approaches and gradually build up to the $O(\sqrt[4]{n})$-time Pollard's rho algorithm, optimizing it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms, ~3 times faster than the previous state-of-the-art.

#Benchmark

For all methods, we will implement a `find_factor` function that takes a positive integer $n$ and returns any of its non-trivial divisors (or 1 if the number is prime):

// I don't feel like typing "unsigned long long" each time
typedef __uint16_t u16;
typedef __uint32_t u32;
typedef __uint64_t u64;
typedef __uint128_t u128;

u64 find_factor(u64 n);

To find the full factorization, you can apply it to $n$, reduce $n$ by the divisor found, and continue until a new factor can no longer be found:

vector<u64> factorize(u64 n) {
    vector<u64> factorization;
    u64 d;
    while ((d = find_factor(n)) != 1) {
        factorization.push_back(d);
        n /= d;
    }
    if (n > 1)
        factorization.push_back(n); // the last remaining prime factor
    return factorization;
}

After each removed factor, the problem becomes considerably smaller, so the worst-case running time of the full factorization is essentially the worst-case running time of a single `find_factor` call.

For many factorization algorithms, including those presented in this section, the running time scales with the smaller prime factor. Therefore, to provide the worst-case input, we use semiprimes: products of two prime numbers $p \le q$ that are on the same order of magnitude. We generate a $k$-bit semiprime as the product of two random $\lfloor k / 2 \rfloor$-bit primes.
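
For reference, here is a minimal sketch of how such inputs can be generated (the `is_prime` and `random_semiprime` helpers are ours, not part of the benchmark interface; for the ≤30-bit primes used here, even a naive primality check is fast enough):

#include <random>

// naive primality check — sufficient for the small primes we multiply here
bool is_prime(u64 n) {
    for (u64 d = 2; d * d <= n; d++)
        if (n % d == 0)
            return false;
    return n >= 2;
}

// a k-bit semiprime: the product of two random (k/2)-bit primes
u64 random_semiprime(int k, std::mt19937_64 &rng) {
    auto random_prime = [&](int bits) {
        u64 p;
        do {
            p = (rng() % (1ull << (bits - 1))) | (1ull << (bits - 1)); // force the top bit
        } while (!is_prime(p));
        return p;
    };
    return random_prime(k / 2) * random_prime(k / 2);
}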

Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) rate of false-negative errors (when `find_factor` returns 1 despite $n$ being composite), although it can be reduced to almost zero without significant performance penalties.

#Trial division

The most basic approach is to try every integer smaller than $n$ as a divisor:

u64 find_factor(u64 n) {
    for (u64 d = 2; d < n; d++)
        if (n % d == 0)
            return d;
    return 1;
}

We can notice that if $n$ is divisible by $d < \sqrt n$, then it is also divisible by $\frac{n}{d} > \sqrt n$, and there is no need to check for the latter separately. This lets us stop trial division early and only check potential divisors that do not exceed $\sqrt n$:

u64 find_factor(u64 n) {
    for (u64 d = 2; d * d <= n; d++)
        if (n % d == 0)
            return d;
    return 1;
}

In our benchmark, $n$ is a semiprime, and we always find the lesser divisor, so both the $O(n)$ and $O(\sqrt n)$ implementations perform the same and are able to factorize ~2k 30-bit numbers per second — while taking a whole 20 seconds to factorize a single 60-bit number.

#Lookup Table

Nowadays, you can type `factor 57` in your Linux terminal or Google search bar to get the factorization of any number. But before computers were invented, it was more practical to use factorization tables: special books containing the factorizations of the first $N$ numbers.

We can also use this approach to compute these lookup tables during compile time. To save space, we can store only the smallest divisor of each number. Since the smallest divisor of a composite number does not exceed $\sqrt n$, we need just one byte per 16-bit integer:

template <int N = (1<<16)>
struct Precalc {
    unsigned char divisor[N];

    constexpr Precalc() : divisor{} {
        for (int i = 0; i < N; i++)
            divisor[i] = 1;
        for (int i = 2; i * i < N; i++)
            if (divisor[i] == 1)
                for (int k = i * i; k < N; k += i)
                    if (divisor[k] == 1) // keep only the smallest divisor
                        divisor[k] = i;
    }
};

constexpr Precalc P{};

u64 find_factor(u64 n) {
    return P.divisor[n]; // valid only for n < 2^16
}

With this approach, we can process 3M 16-bit integers per second, although it would probably get slower for larger tables that stop fitting in the cache. While it requires just a few milliseconds and 64KB of memory to calculate and store the smallest divisors of the first $2^{16}$ numbers, this approach does not scale well to larger inputs.
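
A smallest-divisor table also gives you the full factorization almost for free: each lookup peels off one prime factor, so a 16-bit number is fully factorized in at most $\log_2 n$ lookups. A sketch on top of the `Precalc` table above (the `factorize16` name is ours):

vector<u64> factorize16(u16 n) {
    vector<u64> factorization;
    while (P.divisor[n] != 1) { // while n is composite
        factorization.push_back(P.divisor[n]);
        n /= P.divisor[n];
    }
    if (n > 1)
        factorization.push_back(n); // the remaining prime
    return factorization;
}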

#Wheel factorization

To save paper space, pre-computer era factorization tables typically excluded numbers divisible by $2$ and $5$, making the factorization table ½ × ⅘ = 0.4 of its original size. In the decimal numeral system, you can quickly determine whether a number is divisible by $2$ or $5$ (by looking at its last digit) and keep dividing the number $n$ by $2$ or $5$ while it is possible, eventually arriving at some entry in the factorization table.

We can apply a similar trick to trial division by first checking if the number is divisible by $2$ and then only considering odd divisors:

u64 find_factor(u64 n) {
    if (n % 2 == 0)
        return 2;
    for (u64 d = 3; d * d <= n; d += 2)
        if (n % d == 0)
            return d;
    return 1;
}

With 50% fewer divisions to perform, this algorithm works twice as fast.

This method can be extended: if the number is not divisible by $3$, we can also ignore all multiples of $3$, and the same goes for all other divisors. The problem is, as we increase the number of primes to exclude, it becomes less straightforward to iterate only over the numbers not divisible by them, as they follow an irregular pattern — unless the number of primes is small.

For example, if we consider $2$, $3$, and $5$, then, among the first $90$ numbers, we only need to check:

(1,) 7, 11, 13, 17, 19, 23, 29,
31, 37, 41, 43, 47, 49, 53, 59,
61, 67, 71, 73, 77, 79, 83, 89…

You can notice a pattern: the sequence repeats itself every $30$ numbers. This is not surprising, since the remainder modulo $2 \times 3 \times 5 = 30$ is all we need to determine whether a number is divisible by $2$, $3$, or $5$. This means that we only need to check $8$ numbers with specific remainders out of every $30$, proportionally improving the performance:

u64 find_factor(u64 n) {
    for (u64 d : {2, 3, 5})
        if (n % d == 0)
            return d;
    u64 offsets[] = {0, 4, 6, 10, 12, 16, 22, 24}; // 7 + offset enumerates the residues coprime with 30
    for (u64 d = 7; d * d <= n; d += 30) {
        for (u64 offset : offsets) {
            u64 x = d + offset;
            if (n % x == 0)
                return x;
        }
    }
    return 1;
}

As expected, it works $\frac{30}{8} = 3.75$ times faster than the naive trial division, processing about 7.6k 30-bit numbers per second. The performance can be improved further by considering more primes, but the returns are diminishing: adding a new prime $p$ reduces the number of iterations by $\frac{1}{p}$ but increases the size of the skip-list by a factor of $p$, requiring proportionally more memory.
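
For illustration, here is a sketch of how the wheel for an arbitrary set of small primes can be generated (the `coprime_residues` helper is ours): the offsets used above are exactly the residues coprime with $30$, shifted so that the enumeration starts at $7$.

#include <numeric> // std::gcd

// residues modulo the wheel size that are coprime with all wheel primes;
// for {2, 3, 5}, it returns {1, 7, 11, 13, 17, 19, 23, 29} — 8 out of 30;
// adding 7 would grow the wheel to 210 numbers with 48 residues
vector<u64> coprime_residues(const vector<u64> &primes) {
    u64 wheel = 1;
    for (u64 p : primes)
        wheel *= p;
    vector<u64> residues;
    for (u64 r = 1; r < wheel; r++)
        if (std::gcd(r, wheel) == 1)
            residues.push_back(r);
    return residues;
}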

#Precomputed Primes

If we keep increasing the number of primes in wheel factorization, we eventually exclude all composite numbers and only check for prime factors. In this case, we no longer need the array of offsets — just the array of primes:

const int N = (1 << 16);

struct Precalc {
    u16 primes[6542]; // # of primes under N=2^16

    constexpr Precalc() : primes{} {
        bool marked[N] = {};
        int n_primes = 0;

        for (int i = 2; i < N; i++) {
            if (!marked[i]) {
                primes[n_primes++] = i;
                for (int j = 2 * i; j < N; j += i)
                    marked[j] = true;
            }
        }
    }
};

constexpr Precalc P{};

u64 find_factor(u64 n) {
    for (u16 p : P.primes)
        if (n % p == 0)
            return p;
    return 1;
}

This approach lets us process almost 20k 30-bit integers per second, but it does not work for larger (64-bit) numbers unless they have small ($< 2^{16}$) factors.

Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but constant fraction of divisors. If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$.

All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be optimized if we know the divisors in advance and allow for some additional precomputation. In our case, it is suitable to use the Lemire division check:

// ...precomputation is the same as before,
// but we store the "magic" reciprocal instead of the prime number itself
u64 magic[6542];
// for each prime i:
magic[n_primes++] = u64(-1) / i + 1; // ceil(2^64 / i)

u64 find_factor(u64 n) {
    for (u64 m : P.magic)
        // m * n (mod 2^64) is less than m if and only if its prime divides n
        if (m * n < m)
            return u64(-1) / m + 1; // recover the prime back from its magic
    return 1;
}
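
To see why the check works: $m = \lceil 2^{64} / p \rceil$, so multiplying a multiple of $p$ by $m$ wraps around $2^{64}$ to a value smaller than $m$, while non-multiples stay large. A small self-contained sanity check (ours, not part of the benchmark):

#include <cassert>

void lemire_check_example() {
    u64 d = 3;
    u64 m = u64(-1) / d + 1;      // ceil(2^64 / 3)
    assert(m * 9 < m);            // 9 is divisible by 3: the product wraps to 6
    assert(!(m * 10 < m));        // 10 is not: the product stays large
    assert(u64(-1) / m + 1 == d); // the divisor can be recovered back from m
}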

This makes the algorithm ~18x faster: we can now factorize ~350k 30-bit numbers per second, which is actually the most efficient algorithm we have for this number range. While it can probably be optimized even further by performing these checks in parallel with SIMD, we will stop there and try a different, asymptotically better approach.

#Pollard’s Rho Algorithm

Pollard’s rho is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm that makes use of the birthday paradox:

One only needs to draw $d = \Theta(\sqrt{n})$ random numbers between $1$ and $n$ to get a collision with high probability.

The reasoning behind it is that each of the $d$ added elements has a $\frac{d}{n}$ chance of colliding with some other element, implying that the expected number of collisions is $\frac{d^2}{n}$. If $d$ is asymptotically smaller than $\sqrt n$, then this ratio grows to zero as $n \to \infty$, and to infinity otherwise.

Consider some function $f(x)$ that takes a remainder $x \in [0, n)$ and maps it to some other remainder of $n$ in a way that seems random from the number theory point of view. Specifically, we will use $f(x) = x^2 + 1 \bmod n$, which is random enough for our purposes.

Now, consider a graph where each number-vertex $x$ has an edge pointing to $f(x)$. Such graphs are called functional. In functional graphs, the “trajectory” of any element — the path we walk if we start from that element and keep following the edges — eventually loops around (because the set of vertices is finite, and at some point we have to arrive at a vertex we have already visited).

The trajectory of an element resembles the Greek letter ρ (rho), which is what the algorithm is named after.

Consider a trajectory of some particular element $x_0$:

$$x_0, \; f(x_0), \; f(f(x_0)), \; \ldots$$

Let’s make another sequence out of this one by reducing each element modulo $p$, the smallest prime divisor of $n$.

Lemma. The expected length of the reduced sequence before it turns into a cycle is $O(\sqrt[4]{n})$.

Proof: Since $p$ is the smallest divisor, $p \leq \sqrt n$. Each time we follow a new edge, we essentially generate a random number between $0$ and $p$ (we treat $f$ as a “deterministically-random” function). The birthday paradox states that we only need to generate $O(\sqrt p) = O(\sqrt[4]{n})$ numbers until we get a collision and thus enter a loop — for a 60-bit semiprime, $p < 2^{30}$, so this is on the order of $2^{15} \approx 33000$ iterations.

Since we don’t know $p$, this mod-$p$ sequence is only imaginary, but if we find a cycle in it — that is, $i$ and $j$ such that

$$f^i(x_0) \equiv f^j(x_0) \pmod p$$

then we can also find $p$ itself as

$$p = \gcd(|f^i(x_0) - f^j(x_0)|, n)$$

The algorithm itself just finds this cycle and $p$ using this GCD trick and Floyd’s “tortoise and hare” algorithm: we maintain two pointers $i$ and $j = 2i$ and check that

$$\gcd(|f^i(x_0) - f^j(x_0)|, n) \neq 1$$

which is equivalent to comparing $f^i(x_0)$ and $f^j(x_0)$ modulo $p$. Since $j$ (the hare) is increasing at twice the rate of $i$ (the tortoise), their difference grows by $1$ each iteration and will eventually become equal to (or a multiple of) the cycle length, with $i$ and $j$ pointing to the same elements of the mod-$p$ sequence. And, as we proved half a page ago, reaching the cycle only requires $O(\sqrt[4]{n})$ iterations:

u64 f(u64 x, u64 mod) {
    return ((u128) x * x + 1) % mod;
}

u64 diff(u64 a, u64 b) {
    // a and b are unsigned and so is their difference, so we can't just call abs(a - b)
    return a > b ? a - b : b - a;
}

const u64 SEED = 42;

u64 find_factor(u64 n) {
    u64 x = SEED, y = SEED, g = 1;
    while (g == 1) {
        x = f(f(x, n), n); // advance x twice
        y = f(y, n);       // advance y once
        g = gcd(diff(x, y), n);
    }
    return g;
}

While it processes only ~25k 30-bit integers per second — which is almost 15 times slower than checking each prime with the fast division trick — it dramatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, factorizing around 90 of them per second.

#Pollard-Brent Algorithm

Floyd’s cycle-finding algorithm has a problem in that it moves iterators more than necessary: at least half of the vertices are visited one additional time by the slower iterator.

One way to solve it is to memorize the values $x_i$ that the faster iterator visits and, every two iterations, compute the GCD using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$. But it can also be done without extra memory using a different principle: the tortoise doesn’t move on every iteration; instead, it gets reset to the value of the faster iterator when the iteration number becomes a power of two. This lets us save iterations while still using the same GCD trick, now comparing $x_i$ against $x_{2^{\lfloor \log_2 i \rfloor}}$:

u64 find_factor(u64 n) {
    u64 x = SEED;
    
    for (int l = 256; l < (1 << 20); l *= 2) {
        u64 y = x; // the tortoise is reset to the hare's position at each power of two
        for (int i = 0; i < l; i++) {
            x = f(x, n);
            if (u64 g = gcd(diff(x, y), n); g != 1)
                return g;
        }
    }

    return 1;
}

Note that we also set an upper limit on the number of iterations so that the algorithm finishes in a reasonable amount of time and returns 1 if $n$ turns out to be prime.

It actually does not improve performance and even makes the algorithm ~1.5x slower, which probably has something to do with the fact that `y`, the value we compare against, stays stale. The algorithm spends most of the time computing the GCD and not advancing the iterator — in fact, its running time is currently $O(\sqrt[4]{n} \log n)$ because of it.

Instead of optimizing the GCD itself, we will reduce the number of its invocations. We can use the fact that if one of $a$ and $b$ contains the factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. By batching the GCD calculations in groups of $M = O(\log n)$, we remove the $\log n$ factor from the asymptotic:

const int M = 1024;

u64 find_factor(u64 n) {
    u64 x = SEED;
    
    for (int l = M; l < (1 << 20); l *= 2) {
        u64 y = x, p = 1;
        for (int i = 0; i < l; i += M) {
            for (int j = 0; j < M; j++) {
                x = f(x, n); // advance the faster iterator
                p = (u128) p * diff(x, y) % n;
            }
            if (u64 g = gcd(p, n); g != 1)
                return g;
        }
    }

    return 1;
}

Now it performs 425 factorizations per second, bottlenecked by the speed of the modulo operation.

#Optimizing the Modulo

The final step is to apply Montgomery multiplication. Since the modulo is constant, we can perform all computations — advancing the iterator, multiplication, and even computing the GCD — in the Montgomery space where reduction is cheap:

struct Montgomery {
    u64 n, nr;
    
    Montgomery(u64 n) : n(n) {
        nr = 1;
        // Newton's iteration: each step doubles the precision of the inverse of (odd) n modulo 2^64
        for (int i = 0; i < 6; i++)
            nr *= 2 - n * nr;
    }

    u64 reduce(u128 x) const {
        u64 q = u64(x) * nr;
        u64 m = ((u128) q * n) >> 64;
        return (x >> 64) + n - m;
    }

    u64 multiply(u64 x, u64 y) const {
        return reduce((u128) x * y);
    }
};

u64 f(u64 x, u64 a, const Montgomery &m) {
    return m.multiply(x, x) + a;
}

const int M = 1024;

u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
    Montgomery m(n);
    u64 x = x0;
    
    for (int l = M; l < (1 << 20); l *= 2) {
        u64 y = x, p = 1;
        for (int i = 0; i < l; i += M) {
            for (int j = 0; j < M; j++) {
                x = f(x, a, m);
                p = m.multiply(p, diff(x, y));
            }
            if (u64 g = gcd(p, n); g != 1)
                return g;
        }
    }

    return 1;
}

This implementation can process around 3k 60-bit integers per second, which is ~3x faster than what PARI / SageMath’s `factor` / `cat semiprimes.txt | time factor` measures.

#Further Improvements

Optimizations. There is still a lot of potential for optimization in our implementation of Pollard’s algorithm:

  • We could probably use a better cycle-finding algorithm, exploiting the fact that the graph is random. For example, there is little chance that we enter the loop within the first few iterations (the length of the cycle and the length of the path preceding it should be equal in expectation, since, before looping around, we choose each next vertex of the walked path independently), so we may just advance the iterator for some time before starting the trials with the GCD trick.
  • Our current approach is bottlenecked by advancing the iterator (the latency of Montgomery multiplication is much higher than its reciprocal throughput), and while we are waiting for it to complete, we could perform more than just one trial using the previous values.
  • If we run $p$ independent instances of the algorithm with different seeds in parallel and stop when one of them finds the answer, it would finish $\sqrt p$ times faster (the reasoning is similar to the birthday paradox; try to prove it yourself). We don’t have to use multiple cores for that: there is a lot of untapped instruction-level parallelism, so we could concurrently run two or three instances on the same thread, or use SIMD instructions to perform 4 or 8 multiplications in parallel — see the sketch below.
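
As a simple illustration of the last idea, here is a sketch (the `find_factor_multiseed` name and constants are ours) that tries several polynomials $f(x) = x^2 + a$ sequentially — a real implementation would instead interleave their iterations to exploit instruction-level parallelism or SIMD:

u64 find_factor_multiseed(u64 n) {
    for (u64 a : {1, 7, 13, 42}) // arbitrary distinct constants
        if (u64 g = find_factor(n, 2, a); g != 1 && g != n)
            return g;
    return 1;
}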

I would not be surprised to see another 3x improvement and throughput of ~10k/sec. If you implement some of these ideas, please let me know.

Errors. Another aspect that we need to handle in a practical implementation is possible errors. Our current implementation has a 0.7% error rate for 60-bit integers, and it grows higher as the numbers get smaller. These errors come from three main sources:

  • A cycle simply not being found (the algorithm is inherently random, and there is no guarantee that it will be found). In this case, we need to perform a primality test and optionally start again.
  • The `p` variable becoming zero (because both $p$ and $q$ can get into the product). This becomes increasingly likely as we decrease the size of the inputs or increase the constant `M`. In this case, we need to either restart the process or (better) roll back the last $M$ iterations and perform the trials one by one — see the sketch after this list.
  • Overflows in the Montgomery multiplication. Our current implementation is pretty loose with them, and if $n$ is large, we need to add more `x > mod ? x - mod : x` kinds of statements to deal with overflows.
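
For the second kind of error, here is a sketch of the rollback idea (reusing `f`, `diff`, `Montgomery`, and `M` from above; the `find_factor_with_rollback` name is ours): remember the iterator at the start of each batch, and if the accumulated product collapses into a multiple of $n$, replay the batch one trial at a time:

u64 find_factor_with_rollback(u64 n, u64 x0 = 2, u64 a = 1) {
    Montgomery m(n);
    u64 x = x0;

    for (int l = M; l < (1 << 20); l *= 2) {
        u64 y = x, p = 1;
        for (int i = 0; i < l; i += M) {
            u64 x_checkpoint = x;
            for (int j = 0; j < M; j++) {
                x = f(x, a, m);
                p = m.multiply(p, diff(x, y));
            }
            u64 g = gcd(p, n);
            if (g == n) { // p became 0 modulo n: roll back and retry one by one
                x = x_checkpoint;
                for (int j = 0; j < M; j++) {
                    x = f(x, a, m);
                    if (u64 g2 = gcd(diff(x, y), n); g2 != 1)
                        return g2 != n ? g2 : 1; // g2 == n means a real cycle: give up
                }
            } else if (g != 1) {
                return g;
            }
        }
    }

    return 1;
}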

Larger numbers. These issues become less important if we exclude small numbers and numbers with small prime factors using the algorithms we’ve implemented before. In general, the optimal approach should depend on the size of the numbers:

  • Smaller than $2^{16}$: use a lookup table;
  • Smaller than $2^{32}$: use a list of precomputed primes with a fast divisibility check;
  • Smaller than $2^{64}$ or so: use Pollard’s rho algorithm with Montgomery multiplication;
  • Smaller than $10^{50}$: switch to Lenstra elliptic curve factorization;
  • Smaller than $10^{100}$: switch to Quadratic Sieve;
  • Larger than $10^{100}$: switch to General Number Field Sieve.

The last three approaches are very different from what we’ve been doing and require much more advanced number theory, and they deserve an article (or a full-length university course) of their own.