As we established in the previous section, branches that can’t be effectively predicted by the CPU are expensive as they may cause a long pipeline stall to fetch new instructions after a branch mispredict. In this section, we discuss the means of removing branches in the first place.
#Predication
We are going to continue the same case study we’ve started before — we create an array of random numbers and sum up all its elements below 50:
for (int i = 0; i < N; i++)
a[i] = rand() % 100;
volatile int s;
for (int i = 0; i < N; i++)
if (a[i] < 50)
s += a[i];
Our goal is to eliminate the branch caused by the if
statement. We can try to get rid of it like this:
for (int i = 0; i < N; i++)
s += (a[i] < 50) * a[i];
The loop now takes ~7 cycles per element instead of the original ~14. Also, the performance remains constant if we change 50
to some other threshold, so it doesn’t depend on the branch probability.
But wait… shouldn’t there still be a branch? How does (a[i] < 50)
map to assembly?
There are no Boolean types in assembly, nor any instructions that yield either one or zero based on the result of the comparison, but we can compute it indirectly like this: (a[i] - 50) >> 31
. This trick relies on the binary representation of integers, specifically on the fact that if the expression a[i] - 50
is negative (implying a[i] < 50
), then the highest bit of the result will be set to one, which we can then extract using a right-shift.
mov ebx, eax ; t = x
sub ebx, 50 ; t -= 50
sar ebx, 31 ; t >>= 31
imul eax, ebx ; x *= t
Another, more complicated way to implement this whole sequence is to convert this sign bit into a mask and then use bitwise and
instead of multiplication: ((a[i] - 50) >> 31 - 1) & a[i]
. This makes the whole sequence one cycle faster, considering that, unlike other instructions, imul
takes 3 cycles:
mov ebx, eax ; t = x
sub ebx, 50 ; t -= 50
sar ebx, 31 ; t >>= 31
; imul eax, ebx ; x *= t
sub ebx, 1 ; t -= 1 (causing underflow if t = 0)
and eax, ebx ; x &= t
Note that this optimization is not technically correct from the compiler’s perspective: for the 50 lowest representable integers — those in the $[-2^{31}, - 2^{31} + 49]$ range — the result will be wrong due to underflow. We know that all numbers are all between 0 and 100, and this won’t happen, but the compiler doesn’t.
But the compiler actually elects to do something different. Instead of going with this arithmetic trick, it used a special cmov
(“conditional move”) instruction that assigns a value based on a condition (which is computed and checked using the flags register, the same way as for jumps):
mov ebx, 0 ; cmov doesn't support immediate values, so we need a zero register
cmp eax, 50
cmovge eax, ebx ; eax = (eax >= 50 ? eax : ebx=0)
So the code above is actually closer to using a ternary operator like this:
for (int i = 0; i < N; i++)
s += (a[i] < 50 ? a[i] : 0);
Both variants are optimized by the compiler and produce the following assembly:
mov eax, 0
mov ecx, -4000000
loop:
mov esi, dword ptr [rdx + a + 4000000] ; load a[i]
cmp esi, 50
cmovge esi, eax ; esi = (esi >= 50 ? esi : eax=0)
add dword ptr [rsp + 12], esi ; s += esi
add rdx, 4
jnz loop ; "iterate while rdx is not zero"
This general technique is called predication, and it is roughly equivalent to this algebraic trick:
$$ x = c \cdot a + (1 - c) \cdot b $$This way you can eliminate branching, but this comes at the cost of evaluating both branches and the cmov
itself. Because evaluating the “>=” branch costs nothing, the performance is exactly equal to the “always yes” case in the branchy version.
#When Predication Is Beneficial
Using predication eliminates a control hazard but introduces a data hazard. There is still a pipeline stall, but it is a cheaper one: you only need to wait for cmov
to be resolved and not flush the entire pipeline in case of a mispredict.
However, there are many situations when it is more efficient to leave branchy code as it is. This is the case when the cost of computing both branches instead of just one outweighs the penalty for the potential branch mispredictions.
In our example, the branchy code wins when the branch can be predicted with a probability of more than ~75%.
This 75% threshold is commonly used by the compilers as a heuristic for determining whether to use the cmov
or not. Unfortunately, this probability is usually unknown at the compile time, so it needs to be provided in one of several ways:
- We can use profile-guided optimization which will decide for itself whether to use predication or not.
- We can use likeliness attributes and compiler-specific intrinsics to hint at the likeliness of branches:
__builtin_expect_with_probability
in GCC and__builtin_unpredictable
in Clang. - We can rewrite branchy code using the ternary operator or various arithmetic tricks, which acts as sort of an implicit contract between programmers and compilers: if the programmer wrote the code this way, then it was probably meant to be branchless.
The “right way” is to use branching hints, but unfortunately, the support for them is lacking. Right now these hints seem to be lost by the time the compiler back-end decides whether a cmov
is more beneficial. There is some progress towards making it possible, but currently, there is no good way of forcing the compiler to generate branch-free code, so sometimes the best hope is to just write a small snippet in assembly.
#Larger Examples
Strings. Oversimplifying things, an std::string
is comprised of a pointer to a null-terminated char
array (also known as a “C-string”) allocated somewhere on the heap and one integer containing the string size.
A common value for a string is the empty string — which is also its default value. You also need to handle them somehow, and the idiomatic approach is to assign nullptr
as the pointer and 0
as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings.
However, this requires a separate branch, which is costly (unless the majority of strings are either empty or non-empty). To remove the check and thus also the branch, we can allocate a “zero C-string,” which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction.
Binary search. The standard binary search can be implemented without branches, and on small arrays (that fit into cache) it works ~4x faster than the branchy std::lower_bound
:
int lower_bound(int x) {
int *base = t, len = n;
while (len > 1) {
int half = len / 2;
base += (base[half - 1] < x) * half; // will be replaced with a "cmov"
len -= half;
}
return *base;
}
Other than being more complex, it has another slight drawback in that it potentially does more comparisons (constant $\lceil \log_2 n \rceil$ instead of either $\lfloor \log_2 n \rfloor$ or $\lceil \log_2 n \rceil$) and can’t speculate on future memory reads (which acts as prefetching, so it loses on very large arrays).
In general, data structures are made branchless by implicitly or explicitly padding them so that their operations take a constant number of iterations. Refer to the article for more complex examples.
Data-parallel programming. Branchless programming is very important for SIMD applications because they don’t have branching in the first place.
In our array sum example, removing the volatile
type qualifier from the accumulator allows the compiler to vectorize the loop:
/* volatile */ int s = 0;
for (int i = 0; i < N; i++)
if (a[i] < 50)
s += a[i];
It now works in ~0.3 per element, which is mainly bottlenecked by the memory.
The compiler is usually able to vectorize any loop that doesn’t have branches or dependencies between the iterations — and some specific small deviations from that, such as reductions or simple loops that contain just one if-without-else. Vectorization of anything more complex is a very nontrivial problem, which may involve various techniques such as masking and in-register permutations.