Sums and Other Reductions - Algorithmica

# Sums and Other Reductions

Reduction (also known as folding in functional programming) is the action of computing the value of some associative and commutative operation (i.e., $(a \circ b) \circ c = a \circ (b \circ c)$ and $a \circ b = b \circ a$) over a range of arbitrary elements.

The simplest example of reduction is calculating the sum an array:

int sum(int *a, int n) {
int s = 0;
for (int i = 0; i < n; i++)
s += a[i];
return s;
}


The naive approach is not so straightforward to vectorize, because the state of the loop (sum $s$ on the current prefix) depends on the previous iteration. The way to overcome this is to split a single scalar accumulator $s$ into 8 separate ones, so that $s_i$ would contain the sum of every 8th element of the original array, shifted by $i$:

$$s_i = \sum_{j=0}^{n / 8} a_{8 \cdot j + i }$$

If we store these 8 accumulators in a single 256-bit vector, we can update them all at once by adding consecutive 8-element segments of the array. With vector extensions, this is straightforward:

int sum_simd(v8si *a, int n) {
//       ^ you can just cast a pointer normally, like with any other pointer type
v8si s = {0};

for (int i = 0; i < n / 8; i++)
s += a[i];

int res = 0;

// sum 8 accumulators into one
for (int i = 0; i < 8; i++)
res += s[i];

// add the remainder of a
for (int i = n / 8 * 8; i < n; i++)
res += a[i];

return res;
}


You can use this approach for for other reductions, such as for finding the minimum or the xor-sum of an array.

### #Horizontal Summation

The last part, where we sum up the 8 accumulators stored in a vector register into a single scalar to get the total sum, is called “horizontal summation.”

Although extracting and adding every scalar one by one only takes a constant number of cycles, it can be computed slightly faster using a special instruction that adds together pairs of adjacent elements in a register.

Since it is a very specific operation, it can only be done with SIMD intrinsics — although the compiler probably emits roughly the same procedure for the scalar code anyway:

int hsum(__m256i x) {
__m128i l = _mm256_extracti128_si256(x, 0);
__m128i h = _mm256_extracti128_si256(x, 1);
l = _mm_add_epi32(l, h);
l = _mm_hadd_epi32(l, l);
return _mm_extract_epi32(l, 0) + _mm_extract_epi32(l, 1);
}


There are other similar instructions, e.g., for integer multiplication or calculating absolute differences between adjacent elements (used in image processing).

There is also one specific instruction, _mm_minpos_epu16, that calculates the horizontal minimum and its index among eight 16-bit integers. This is the only horizontal reduction that works in one go: all others are computed in multiple steps.

### #Instruction-Level Parallelism

Our implementation matches what the compiler produces automatically, but it is actually suboptimal: when we use just one accumulator, we have to wait one cycle between the loop iterations for vector addition to complete, while its throughput is 2 on this microarchitecture.

If we again divide the array in $B \geq 2$ parts and use a separate accumulator for each, we can saturate the throughput of vector addition and increase the performance twofold:

const int B = 2;

int sum_simd(v8si *a, int n) {
v8si b[B] = {0};

for (int i = 0; i < n / 8; i += B)
for (int j = 0; j < B; j++)
b[j] += a[i + j];

for (int i = 1; i < B; i++)
b[0] += b[i];

int s = 0;

for (int i = 0; i < 8; i++)
s += b[0][i];

return s;
}


If you have more than 2 relevant execution ports, you can increase B accordingly. But the n-fold performance increase will only apply to arrays that fit L1 cache — memory bandwidth will be the bottleneck for anything larger.