Instruction Tables - Algorithmica

# Instruction Tables

Interleaving the stages of execution is a general idea in digital electronics, and it is applied not only in the main CPU pipeline, but also at the level of individual instructions and memory operations. Most execution units have their own little pipelines and can take another instruction just one or two cycles after the previous one.

In this context, it makes sense to use two different “costs” for instructions:

• Latency: how many cycles are needed to receive the results of an instruction.
• Throughput: how many instructions can be executed per cycle, on average.
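
The distinction matters in practice: the two functions below perform the same additions, but the first forms one long dependency chain and is bound by `add` *latency* (about one cycle per addition), while the second keeps four independent partial sums that can issue in the same cycle and is bound by `add` *throughput*. A minimal sketch (the function names are ours, for illustration):

```c
// Latency-bound: each add depends on the result of the previous one,
// so the CPU cannot start the next addition before the current finishes.
long sum_chained(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

// Throughput-bound: four independent dependency chains let the
// scheduler dispatch several additions to different execution units
// in the same cycle.
long sum_unrolled(const long *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    for (int i = n - n % 4; i < n; i++)  // leftover elements
        s += a[i];
    return s;
}
```

Modern compilers often perform this unrolling themselves, but only when they can prove it does not change the result (which is why it frequently fails for floating-point sums without `-ffast-math`).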

You can get latency and throughput numbers for a specific architecture from special documents called instruction tables. Here are some sample values for my Zen 2 (all specified for 32-bit operands, if there is any difference):

| Instruction | Latency | Reciprocal throughput |
|-------------|---------|-----------------------|
| `jmp`       | -       | 2                     |
| `mov r, r`  | -       | 1/4                   |
| `mov r, m`  | 4       | 1/2                   |
| `mov m, r`  | 3       | 1                     |
| `add`       | 1       | 1/3                   |
| `cmp`       | 1       | 1/4                   |
| `popcnt`    | 1       | 1/4                   |
| `mul`       | 3       | 1                     |
| `div`       | 13-28   | 13-28                 |
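
The 4-cycle latency in the `mov r, m` row is typically measured by *pointer chasing*: each load's address depends on the previous load's result, so consecutive loads cannot overlap, and the total running time divided by the number of steps gives the load latency. A rough sketch of the access pattern (timing code omitted; function names are ours):

```c
#include <stddef.h>

// Build a cyclic permutation: next[i] tells where to go from i.
// Real benchmarks shuffle the cycle randomly to defeat prefetching;
// a sequential cycle is used here only to keep the sketch short.
void build_cycle(size_t *next, size_t n) {
    for (size_t i = 0; i < n; i++)
        next[i] = (i + 1) % n;
}

// Chase the pointers: p = next[p] is a load whose address depends on
// the previous load, so the loop runs at one load *latency* per step,
// not at one load per reciprocal-throughput cycle.
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t p = start;
    for (size_t i = 0; i < steps; i++)
        p = next[p];
    return p;
}
```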

• Most instructions are pipelined, and if they have a reciprocal throughput of $n$, this usually means that their execution unit can take another instruction after $n$ cycles (and if it is below 1, this means that there are multiple execution units, all capable of taking another instruction on the next cycle). One notable exception is integer division: it is either very poorly pipelined or not pipelined at all.
• Some instructions have variable latency, depending not only on the size but also on the values of the operands. For memory operations (including fused ones like add), the latency is usually specified for the best case (an L1 cache hit).
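
Because division is so much slower than every other arithmetic instruction, compilers go out of their way to avoid emitting it: when the divisor is a compile-time constant, `div` is replaced with a multiplication by a fixed-point reciprocal followed by a shift. A sketch of what this transformation looks like for unsigned division by 3 (the magic constant here is $\lceil 2^{33} / 3 \rceil$, the same one GCC and Clang emit):

```c
#include <stdint.h>

// Divide a 32-bit unsigned integer by 3 without a div instruction:
// multiply by the fixed-point reciprocal 0xAAAAAAAB = ceil(2^33 / 3)
// and take the high bits of the 64-bit product.
uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```

On Zen 2, this trades a 13-28-cycle, poorly pipelined `div` for a 3-cycle, fully pipelined `mul` and a shift.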