Masking and Blending - Algorithmica

One of the bigger challenges of SIMD programming is that its options for control flow are very limited — because the operations you apply to a vector are the same for all its elements.

This makes the problems that are usually trivially resolved with an if or any other type of branching much harder. With SIMD, they have to be dealt with by the means of various branchless programming techniques, which aren’t always that straightforward to apply.

#Masking

The main way to make a computation branchless is through predication — computing the results of both branches and then using either some arithmetic trick or a special “conditional move” instruction:

for (int i = 0; i < N; i++)
    a[i] = rand() % 100;

int s = 0;

// branch:
for (int i = 0; i < N; i++)
    if (a[i] < 50)
        s += a[i];

// no branch:
for (int i = 0; i < N; i++)
    s += (a[i] < 50) * a[i];

// also no branch:
for (int i = 0; i < N; i++)
    s += (a[i] < 50 ? a[i] : 0);

To vectorize this loop, we are going to need two new instructions:

_mm256_cmpgt_epi32, which compares the integers in two vectors and produces a mask of all ones if the first element is more than the second and a mask of full zeros otherwise.
_mm256_blendv_epi8, which blends (combines) the values of two vectors based on the provided mask.

By masking and blending the elements of a vector so that only the selected subset of them is affected by computation, we can perform predication in a manner similar to the conditional move:

const reg c = _mm256_set1_epi32(49);
const reg z = _mm256_setzero_si256();
reg s = _mm256_setzero_si256();

for (int i = 0; i < N; i += 8) {
    reg x = _mm256_load_si256( (reg*) &a[i] );
    reg mask = _mm256_cmpgt_epi32(x, c);
    x = _mm256_blendv_epi8(x, z, mask);
    s = _mm256_add_epi32(s, x);
}

(Minor details such as horizontal summation and accounting for the remainder of the array are omitted for brevity.)

This is how predication is usually done in SIMD, but it isn’t always the most optimal approach. We can use the fact that one of the blended values is zero, and use bitwise and with the mask instead of blending:

const reg c = _mm256_set1_epi32(50);
reg s = _mm256_setzero_si256();

for (int i = 0; i < N; i += 8) {
    reg x = _mm256_load_si256( (reg*) &a[i] );
    reg mask = _mm256_cmpgt_epi32(c, x);
    x = _mm256_and_si256(x, mask);
    s = _mm256_add_epi32(s, x);
}

This loop performs slightly faster because on this particular CPU, the vector and takes one cycle less than blend.

Several other instructions support masks as inputs, most notably:

The _mm256_blend_epi32 intrinsic is a blend that takes an 8-bit integer mask instead of a vector (which is why it doesn’t have v at the end).
The _mm256_maskload_epi32 and _mm256_maskstore_epi32 intrinsics that load/store a SIMD block from memory and and it with a mask in one go.

We can also use predication with built-in vector types:

vec *v = (vec*) a;
vec s = {};

for (int i = 0; i < N / 8; i++)
    s += (v[i] < 50 ? v[i] : 0);

All these versions work at around 13 GFLOPS as this example is so simple that the compiler can vectorize the loop all by itself. Let’s move on to more complex examples that can’t be auto-vectorized.

#Searching

In the next example, we need to find a specific value in an array and return its position (aka std::find):

const int N = (1<<12);
int a[N];

int find(int x) {
    for (int i = 0; i < N; i++)
        if (a[i] == x)
            return i;
    return -1;
}

To benchmark the find function, we fill the array with numbers from $0$ to $(N - 1)$ and then repeatedly search for a random element:

for (int i = 0; i < N; i++)
    a[i] = i;

for (int t = 0; t < K; t++)
    checksum ^= find(rand() % N);

The scalar version gives ~4 GFLOPS of performance. This number includes the elements we haven’t had to process, so divide this number by two in your head (the expected fraction of the elements we have to check).

To vectorize it, we need to compare a vector of its elements with the searched value for equality, producing a mask, and then somehow check if this mask is zero. If it isn’t, the needed element is somewhere within this block of 8.

To check if the mask is zero, we can use the _mm256_movemask_ps intrinsic, which takes the first bit of each 32-bit element in a vector and produces an 8-bit integer mask out of them. We can then check if this mask is non-zero — and if it is, also immediately get the index with the ctz instruction:

int find(int needle) {
    reg x = _mm256_set1_epi32(needle);

    for (int i = 0; i < N; i += 8) {
        reg y = _mm256_load_si256( (reg*) &a[i] );
        reg m = _mm256_cmpeq_epi32(x, y);
        int mask = _mm256_movemask_ps((__m256) m);
        if (mask != 0)
            return i + __builtin_ctz(mask);
    }

    return -1;
}

This version gives ~20 GFLOPS or about 5 times faster than the scalar one. It only uses 3 instructions in the hot loop:

vpcmpeqd  ymm0, ymm1, YMMWORD PTR a[0+rdx*4]
vmovmskps eax, ymm0
test      eax, eax
je        loop

Checking if a vector is zero is a common operation, and there is an operation similar to test in SIMD that we can use:

int find(int needle) {
    reg x = _mm256_set1_epi32(needle);

    for (int i = 0; i < N; i += 8) {
        reg y = _mm256_load_si256( (reg*) &a[i] );
        reg m = _mm256_cmpeq_epi32(x, y);
        if (!_mm256_testz_si256(m, m)) {
            int mask = _mm256_movemask_ps((__m256) m);
            return i + __builtin_ctz(mask);
        }
    }

    return -1;
}

We are still using movemask to do ctz later, but the hot loop is now one instruction shorter:

vpcmpeqd ymm0, ymm1, YMMWORD PTR a[0+rdx*4]
vptest   ymm0, ymm0
je       loop

This doesn’t improve performance much because both both vptest and vmovmskps have a throughput of one and will bottleneck the computation regardless of anything else we do in the loop.

To work around this limitation, we can iterate in blocks of 16 elements and combine the results of independent comparisons of two 256-bit AVX2 registers using a bitwise or:

int find(int needle) {
    reg x = _mm256_set1_epi32(needle);

    for (int i = 0; i < N; i += 16) {
        reg y1 = _mm256_load_si256( (reg*) &a[i] );
        reg y2 = _mm256_load_si256( (reg*) &a[i + 8] );
        reg m1 = _mm256_cmpeq_epi32(x, y1);
        reg m2 = _mm256_cmpeq_epi32(x, y2);
        reg m = _mm256_or_si256(m1, m2);
        if (!_mm256_testz_si256(m, m)) {
            int mask = (_mm256_movemask_ps((__m256) m2) << 8)
                     +  _mm256_movemask_ps((__m256) m1);
            return i + __builtin_ctz(mask);
        }
    }

    return -1;
}

With this obstacle removed, the performance now peaks at ~34 GFLOPS. But why not 40? Shouldn’t it be twice as fast?

Here is how one iteration of the loop looks in assembly:

vpcmpeqd ymm2, ymm1, YMMWORD PTR a[0+rdx*4]
vpcmpeqd ymm3, ymm1, YMMWORD PTR a[32+rdx*4]
vpor     ymm0, ymm3, ymm2
vptest   ymm0, ymm0
je       loop

Every iteration, we need to execute 5 instructions. While the throughputs of all relevant execution ports allow to do that in one cycle on average, we can’t do that because the decode width of this particular CPU (Zen 2) is 4. Therefore, the performance is limited by ⅘ of what it could have been.

To mitigate this, we can once again double the number of SIMD blocks we process on each iteration:

unsigned get_mask(reg m) {
    return _mm256_movemask_ps((__m256) m);
}

reg cmp(reg x, int *p) {
    reg y = _mm256_load_si256( (reg*) p );
    return _mm256_cmpeq_epi32(x, y);
}

int find(int needle) {
    reg x = _mm256_set1_epi32(needle);

    for (int i = 0; i < N; i += 32) {
        reg m1 = cmp(x, &a[i]);
        reg m2 = cmp(x, &a[i + 8]);
        reg m3 = cmp(x, &a[i + 16]);
        reg m4 = cmp(x, &a[i + 24]);
        reg m12 = _mm256_or_si256(m1, m2);
        reg m34 = _mm256_or_si256(m3, m4);
        reg m = _mm256_or_si256(m12, m34);
        if (!_mm256_testz_si256(m, m)) {
            unsigned mask = (get_mask(m4) << 24)
                          + (get_mask(m3) << 16)
                          + (get_mask(m2) << 8)
                          +  get_mask(m1);
            return i + __builtin_ctz(mask);
        }
    }

    return -1;
}

It now shows the throughput of 43 GFLOPS — or about 10x faster than the original scalar implementation.

Extending it to 64 values per cycle doesn’t help: small arrays suffer from the overhead of all these additional movemask-s when we hit the condition, and larger arrays are bottlenecked by memory bandwidth anyway.

#Counting Values

As the final exercise, let’s find the count of a value in an array instead of just its first occurrence:

int count(int x) {
    int cnt = 0;
    for (int i = 0; i < N; i++)
        cnt += (a[i] == x);
    return cnt;
}

To vectorize it, we just need to convert the comparison mask to either one or zero per element and calculate the sum:

const reg ones = _mm256_set1_epi32(1);

int count(int needle) {
    reg x = _mm256_set1_epi32(needle);
    reg s = _mm256_setzero_si256();

    for (int i = 0; i < N; i += 8) {
        reg y = _mm256_load_si256( (reg*) &a[i] );
        reg m = _mm256_cmpeq_epi32(x, y);
        m = _mm256_and_si256(m, ones);
        s = _mm256_add_epi32(s, m);
    }

    return hsum(s);
}

Both implementations yield ~15 GFLOPS: the compiler can vectorize the first one all by itself.

But a trick that the compiler can’t find is to notice that the mask of all ones is minus one when reinterpreted as an integer. So we can skip the and-the-lowest-bit part and use the mask itself, and then just negate the final result:

int count(int needle) {
    reg x = _mm256_set1_epi32(needle);
    reg s = _mm256_setzero_si256();

    for (int i = 0; i < N; i += 8) {
        reg y = _mm256_load_si256( (reg*) &a[i] );
        reg m = _mm256_cmpeq_epi32(x, y);
        s = _mm256_add_epi32(s, m);
    }

    return -hsum(s);
}

This doesn’t improve the performance in this particular architecture because the throughput is actually bottlenecked by updating s: there is a dependency on the previous iteration, so the loop can’t proceed faster than one iteration per CPU cycle. We can make use of instruction-level parallelism if we split the accumulator in two:

int count(int needle) {
    reg x = _mm256_set1_epi32(needle);
    reg s1 = _mm256_setzero_si256();
    reg s2 = _mm256_setzero_si256();

    for (int i = 0; i < N; i += 16) {
        reg y1 = _mm256_load_si256( (reg*) &a[i] );
        reg y2 = _mm256_load_si256( (reg*) &a[i + 8] );
        reg m1 = _mm256_cmpeq_epi32(x, y1);
        reg m2 = _mm256_cmpeq_epi32(x, y2);
        s1 = _mm256_add_epi32(s1, m1);
        s2 = _mm256_add_epi32(s2, m2);
    }

    s1 = _mm256_add_epi32(s1, s2);

    return -hsum(s1);
}

It now gives ~22 GFLOPS of performance, which is as high as it can get.

When adapting this code for shorter data types, keep in mind that the accumulator may overflow. To work around this, add another accumulator of larger size and regularly stop the loop to add the values in the local accumulator to it and then reset the local accumulator. For example, for 8-bit integers, this means creating another inner loop that does $\lfloor \frac{256-1}{8} \rfloor = 15$ iterations.