Throughput Computing - Algorithmica

Optimizing for latency is usually quite different from optimizing for throughput:

When optimizing data structure queries or small one-time or branchy algorithms, you need to look up the latencies of its instructions, mentally construct the execution graph of the computation, and then try to reorganize it so that the critical path is shorter.
When optimizing hot loops and large-dataset algorithms, you need to look up the throughputs of their instructions, count how many times each one is used per iteration, determine which of them is the bottleneck, and then try to restructure the loop so that it is used less often.

The last advice only works for data-parallel loops, where each iteration is fully independent of the previous one. When there is some interdependency between consecutive iterations, there may potentially be a pipeline stall caused by a data hazard as the next iteration is waiting for the previous one to complete.

#Example

As a simple example, consider how the sum of an array is computed:

int s = 0;

for (int i = 0; i < n; i++)
    s += a[i];

Let’s assume for a moment that the compiler doesn’t vectorize this loop, the memory bandwidth isn’t a concern, and that the loop is unrolled so that we don’t pay any additional cost associated with maintaining the loop variables. In this case, the computation becomes very simple:

int s = 0;
s += a[0];
s += a[1];
s += a[2];
s += a[3];
// ...

How fast can we compute this? At exactly one cycle per element — because we need one cycle each iteration to add another value to s. The latency of the memory read doesn’t matter because the CPU can start it ahead of time.

But we can go higher than that. The throughput of add¹ is 2 on my CPU (Zen 2), meaning we could theoretically execute two of them every cycle. But right now this isn’t possible: while s is being used to accumulate $i$ -th element, it can’t be used for $(i+1)$ -th for at least one cycle.

The solution is to use two accumulators and just sum up odd and and even elements separately:

int s0 = 0, s1 = 0;
s0 += a[0];
s1 += a[1];
s0 += a[2];
s1 += a[3];
// ...
int s = s0 + s1;

Now our superscalar CPU can execute these two “threads” simultaneously, and our computation no longer has any critical paths that limit the throughput.

#The General Case

If an instruction has a latency of $x$ and a throughput of $y$ , then you would need to use $x \cdot y$ accumulators to saturate it. This also implies that you need $x \cdot y$ logical registers to hold their values, which is an important consideration for CPU designs, limiting the maximum number of usable execution units for high-latency instructions.

This technique is mostly used with SIMD and not in scalar code. You can generalize the code above and compute sums and other reductions faster than the compiler.

In general, when optimizing loops, you usually have just one or a few execution ports that you want to utilize to their fullest, and you engineer the rest of the loop around them. As different instructions may use different sets of ports, it is not always clear which one is going to be overused. In situations like this, machine code analyzers can be very helpful for finding the bottlenecks of small assembly loops.

The throughput of register-register add is 4, but since we are reading its second operand from memory, it is bottlenecked by the throughput of memory mov, which is 2 on Zen 2. ↩︎