In the pointer chasing benchmark, for simplicity, we didn’t use actual pointers, but integer indices relative to a base address:
```c++
for (int i = 0; i < N; i++)
    k = q[k];
```
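For reference, here is a minimal sketch of the setup this snippet assumes (the shuffle and the container choice are my assumptions): `p` is a random permutation of 0..N-1, and `q` is filled so that repeatedly doing `k = q[k]` walks one big random cycle.

```c++
// needs <algorithm>, <numeric>, <random>, <vector>
std::vector<int> p(N), q(N);
std::iota(p.begin(), p.end(), 0);                   // p = 0, 1, ..., N - 1
std::shuffle(p.begin(), p.end(), std::mt19937(42)); // p becomes a random permutation

int k = p[N - 1];
for (int i = 0; i < N; i++)
    k = q[k] = p[i]; // element p[i - 1] points to p[i], closing into a single cycle
```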
The memory addressing operator on x86 is fused with the address computation, so the `k = q[k]` line folds into just a single terse instruction that also does the multiplication by 4 and the addition under the hood:
```nasm
mov rax, DWORD PTR q[0+rax*4]
```
Although fully fused, these additional computations add some delay to memory operations. The latency of an L1 fetch is either 4 or 5 cycles — the latter being the case if we need to perform a complex computation of the address. For this reason, the permutation benchmark measures 3ns or 6 cycles per jump: 4+1 for the read and address computation and another one to move the result to the right register.
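As a quick sanity check, these two numbers are mutually consistent; together they pin down the effective clock rate of the benchmark machine:

$$
\frac{6\ \text{cycles}}{3\ \text{ns}} = 2\ \text{GHz}
$$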
### Pointers
We can make our benchmark run slightly faster if we replace “fake pointers” — indices — with actual pointers.
There are some syntactical issues in getting “pointer to pointer to pointer…” constructions to work, so instead we will define a struct that just wraps a pointer to its own type — this is how most pointer chasing works anyway:
```c++
struct node { node* ptr; };
```
Now we randomly fill our array with pointers and chase them instead:
```c++
node* k = q + p[N - 1];

for (int i = 0; i < N; i++)
    k = k->ptr = q + p[i];

for (int i = 0; i < N; i++)
    k = k->ptr;
```
This code now runs in 2ns / 4 cycles for arrays that fit in the L1 cache. Why not 4+1=5? Because Zen 2 has an interesting feature that allows zero-latency reuse of data accessed just by address, so the “move” here is transparent, saving us two whole cycles compared to the index version.
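For completeness, here is a self-contained sketch of how such a measurement can be reproduced (this is my own harness, not the author's benchmark code; the array size, repetition count, and RNG seed are arbitrary assumptions):

```c++
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

struct node { node* ptr; };

int main() {
    const int N = 1 << 11; // 2048 nodes * 8 bytes = 16 KB, small enough for L1
    const int R = 10000;   // repeat the walk to amortize timer overhead

    std::vector<int> p(N);
    std::iota(p.begin(), p.end(), 0);
    std::shuffle(p.begin(), p.end(), std::mt19937(42));

    std::vector<node> q(N);
    node* k = q.data() + p[N - 1];
    for (int i = 0; i < N; i++)       // link the nodes into one random cycle
        k = k->ptr = q.data() + p[i];

    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < R; r++)
        for (int i = 0; i < N; i++)
            k = k->ptr;
    auto end = std::chrono::steady_clock::now();

    node* volatile sink = k;          // keep the chase from being optimized away
    (void) sink;

    double ns = std::chrono::duration<double, std::nano>(end - start).count();
    printf("%.2f ns per jump\n", ns / ((double) N * R));
}
```

Compile it with optimizations enabled, otherwise loop overhead will drown out the effect you are trying to measure.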
Unfortunately, there is a problem with the pointer version on 64-bit systems: the pointers become twice as large, making the array spill out of cache much sooner than with 32-bit indices. The latency-versus-size graph looks as if it were shifted by one power of two to the left, exactly as it should be:
This problem is mitigated by switching to the 32-bit mode:
You need to go through some trouble getting 32-bit libraries to work on a computer made in this century (with GCC or Clang this typically means compiling with `-m32` and installing the 32-bit multilib packages), but otherwise it shouldn't pose problems unless you need to interoperate with 64-bit software or access more than 4GB of RAM.
### Bit Fields
The fact that on larger problem sizes the performance is bottlenecked by the memory system rather than the CPU lets us try something even stranger: using fewer than 4 bytes to store indices. This can be done with bit fields:
```c++
struct __attribute__ ((packed)) node { int idx : 24; };
```
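As a sanity check (a tiny sketch of my own), you can assert at compile time that the packed attribute really brings the struct down to 3 bytes:

```c++
static_assert(sizeof(node) == 3, "a packed 24-bit bit field should occupy exactly 3 bytes");
```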
You don’t need to do anything other than define a structure with the bit field — the compiler handles the 3-byte integer all by itself:
```c++
int k = p[N - 1];

for (int i = 0; i < N; i++)
    k = q[k].idx = p[i];

for (int i = 0; i < N; i++)
    k = q[k].idx;
```
This code measures at 6.5ns for the L1 cache. There is some room for improvement, as the default conversion procedure chosen by the compiler is suboptimal: we could manually load a 4-byte integer and truncate it ourselves (we also need to add one more element to the `q` array to make sure we own that extra byte of memory):
```c++
k = *((int*) (q + k));
k &= ((1 << 24) - 1);
```
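To make the over-allocation concrete, it can be as simple as reserving one extra node (a sketch; the container choice is my assumption):

```c++
// one extra node so that the 4-byte load at the last index stays
// within memory we own (only 3 of the 4 loaded bytes are meaningful)
std::vector<node> q(N + 1);
```

Strictly speaking, copying the bytes into an `int` with `memcpy` avoids the aliasing issues of the pointer cast, and optimizing compilers typically turn it into the same single load.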
With the manual truncation, the chase now runs in 4ns and produces the following graph:
If you zoom in close enough (the graph is an SVG), you’ll see that the pointers win on very small arrays, then, starting from around the L2-L3 cache boundary, our custom bit fields take over, and for very large arrays it doesn’t matter because we never hit the cache anyway.
This isn’t the kind of optimization that gives you a 5x improvement, but it’s still something to try when all other options are exhausted.