Prefetching

To take advantage of the free concurrency available in the memory hardware, it can be beneficial to prefetch data that is likely to be accessed next, provided its location can be predicted. This is easy to do when there are no data or control hazards in the pipeline: the CPU simply runs ahead of the instruction stream and executes memory operations out of order.

But sometimes the future memory locations aren’t explicitly in the instruction stream, and yet they can still be predicted with high probability. In these cases, they can be prefetched by other means:

  • Explicitly, by reading in advance the next data word, or any byte in the same cache line, so that the whole line is brought up the cache hierarchy.
  • Implicitly, by using simple access patterns, such as linear iteration, that the memory hardware can detect and start prefetching automatically.

Hiding memory latency is crucial for achieving performance, so in this section, we will look into prefetching techniques.

#Hardware Prefetching

Let’s modify the pointer chasing benchmark to show the effect of hardware prefetching. Now, we generate our permutation in a way that makes the CPU request consecutive cache lines when iterating over the permutation, but still access the elements inside each cache line in random order:

int p[15], q[N];

iota(p, p + 15, 1); // p = {1, 2, ..., 15}: the offsets within a 16-element block

for (int i = 0; i + 16 < N; i += 16) {
    random_shuffle(p, p + 15);   // visit the elements of the block in random order
    int k = i;
    for (int j = 0; j < 15; j++)
        k = q[k] = i + p[j];     // chain the block into a random cycle...
    q[k] = i + 16;               // ...whose last element points to the next block
}

There is no point in drawing a graph because it would just be flat: the latency is 3 ns regardless of the array size. Even though the instruction scheduler still can’t tell what we are going to fetch next, the memory prefetcher can detect the pattern just by looking at the sequence of memory accesses and start loading the next cache line ahead of time, hiding the latency.

Hardware prefetching is smart enough for most use cases, but it only detects simple patterns. You can iterate forward and backward over multiple arrays in parallel, perhaps with small-to-medium strides, but that’s about it. For anything more complex, the prefetcher won’t figure out what’s happening, and we need to help it out ourselves.
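As a rough illustration, these are the kinds of simple patterns that hardware prefetchers typically pick up on their own (the arrays and their size here are placeholders, not part of the benchmark above):

const int M = (1 << 20);
int a[M], b[M];

long long s = 0;

for (int i = 0; i < M; i++)
    s += a[i];          // forward iteration, stride +1

for (int i = M - 1; i >= 0; i--)
    s += b[i];          // backward iteration, stride -1

for (int i = 0; i < M; i += 16)
    s += a[i];          // constant stride of one cache line (16 ints = 64 bytes)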

#Software Prefetching

The simplest way to do software prefetching is to load any byte of the cache line with a mov or any other memory instruction, but CPUs also have a separate prefetch instruction that pulls a cache line into the cache without doing anything else with it. This instruction isn’t part of the C or C++ standard, but it is available in most compilers as the __builtin_prefetch intrinsic:

__builtin_prefetch(&a[k]);

It’s quite hard to come up with a simple example where it can be useful. To make the pointer chasing benchmark benefit from software prefetching, we need to construct a permutation that simultaneously loops around the whole array, can’t be predicted by the hardware prefetcher, and has easily computable next addresses.

Luckily, the linear congruential generator has a useful property: if the modulus $n$ is a prime number, the period of the generator is long. In particular, $f(x) = (2x + 1) \bmod n$ cycles through all but one of the $n$ values whenever $2$ is a primitive root modulo $n$, so we get all the properties we need if we use a permutation generated by this LCG with the current index as its state:

const int n = find_prime(N); // largest prime not exceeding N

for (int i = 0; i < n; i++)
    q[i] = (2 * i + 1) % n;

When we run it, the performance matches that of a regular random permutation. But now we have the ability to peek ahead:

int k = 0;

for (int t = 0; t < K; t++) {
    for (int i = 0; i < n; i++) {
        // prefetch the element we will need on the next iteration
        __builtin_prefetch(&q[(2 * k + 1) % n]);
        k = q[k];
    }
}

There is some overhead to computing the next address, but for large enough arrays, it is almost two times faster.

Interestingly, we can prefetch more than just one element ahead, making use of this pattern in the LCG function:

$$
\begin{aligned}
f(x)   &= 2 \cdot x + 1 \\
f^2(x) &= 4 \cdot x + 2 + 1 \\
f^3(x) &= 8 \cdot x + 4 + 2 + 1 \\
       &\ldots \\
f^k(x) &= 2^k \cdot x + (2^k - 1)
\end{aligned}
$$

Hence, to prefetch the element $D$ steps ahead, we can do this:

__builtin_prefetch(&q[((1 << D) * k + (1 << D) - 1) % n]);

If we execute this request on every iteration, we will, on average, have $D$ prefetch requests in flight at any given time, increasing the throughput by up to $D$ times. Ignoring some issues, such as integer overflow when $D$ is too large, we can reduce the average latency arbitrarily close to the cost of computing the next index (which, in this case, is dominated by the modulo operation).
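For concreteness, here is a sketch of the whole loop with a fixed prefetch distance (the value of D is an arbitrary placeholder to be tuned experimentally; K, n, and q are the same as in the benchmark above, and the 64-bit shift is there only to sidestep the overflow issue just mentioned):

const int D = 8; // prefetch distance: how many elements ahead to request (placeholder value)

int k = 0;

for (int t = 0; t < K; t++) {
    for (int i = 0; i < n; i++) {
        // request the element D steps ahead: f^D(k) = 2^D * k + (2^D - 1)
        __builtin_prefetch(&q[((1ll << D) * k + (1 << D) - 1) % n]);
        k = q[k];
    }
}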

Note that this is an artificial example; more often than not, attempts to add software prefetching to practical programs fail. This is largely because you need to issue a separate memory instruction that may compete for resources with the others. At the same time, hardware prefetching is 100% harmless, as it only activates when the memory and cache buses are not busy.

When doing software prefetching, you can also specify the level of cache the data needs to be brought to, which is useful when you aren’t sure whether you will actually use the data and don’t want to evict what is already in the L1 cache. For that, you can use the _mm_prefetch intrinsic, which takes a hint constant as its second parameter, specifying the target cache level. This is useful in combination with non-temporal loads and stores.
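As a minimal sketch, assuming an x86 target and the <xmmintrin.h> header, the hint constants look like this (the exact behavior of each hint is microarchitecture-dependent, and the array and index are placeholders):

#include <xmmintrin.h>

int a[1024], k = 0; // placeholder array and index

_mm_prefetch((const char*) &a[k], _MM_HINT_T0);  // into all cache levels, including L1
_mm_prefetch((const char*) &a[k], _MM_HINT_T1);  // into L2 and further out, bypassing L1
_mm_prefetch((const char*) &a[k], _MM_HINT_T2);  // into L3 and further out
_mm_prefetch((const char*) &a[k], _MM_HINT_NTA); // non-temporal: minimize cache pollution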