Cache-Oblivious Algorithms - Algorithmica

# Cache-Oblivious Algorithms

In the context of cache hierarchies, there are two types of efficient external memory algorithms:

• Cache-aware algorithms that are efficient for known $B$ and $M$.
• Cache-oblivious algorithms that are efficient for any $B$ and $M$.

For example, external merge sort is cache-aware, but not cache-oblivious: we need to know memory characteristics of the system, namely the ratio of available memory to the block size, to find the right $k$ to do k-way merge sort.

Cache-oblivious algorithms are interesting because they automatically become optimal for all memory levels in the cache hierarchy, and not just the one for which they were specifically tuned. In this article we will consider some of their applications in matrix calculations.

## Matrix Transpose

Assume we have a square matrix $A$ of size $N \times N$ and we need to transpose it. The naive by-definition approach would go something like this:

for (int i = 0; i < n; i++)
for (int j = 0; j < i; j++)
swap(a[j * N + i], a[i * N + j]);


Here we used a single pointer to the beginning of the memory region instead of a 2d array to be more explicit about its memory operations.

The I/O complexity of this code is $O(N^2)$ because the writes are not sequential. If you try to swap the iteration variables it is going to be the the other way around, but the result will be the same.

### Algorithm

The cache-oblivious way relies on the following block matrix identity:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^T= \begin{pmatrix} A^T & C^T \\ B^T & D^T \end{pmatrix}$$

It lets us solve the problem recursively using a divide-and-conquer approach:

1. Divide the input matrix into 4 smaller matrices.
2. Transpose each one recursively.
3. Combine results by swapping the corner result matrices.

Implementing D&C on matrices is a bit more complex than on arrays, but the main idea is the same. Instead of copying submatrices explicitly, we want to use “views” into them, and also switch to the naive method when the data starts fitting in L1 cache (or pick something small like $32 \times 32$ if you don’t know it in advance). We also need to carefully handle the case when we have odd $n$ and thus can’t split the matrix into 4 equal submatrices.

void transpose(int *a, int n, int N) {
if (n <= 32) {
for (int i = 0; i < n; i++)
for (int j = 0; j < i; j++)
swap(a[i * N + j], a[j * N + i]);
} else {
int k = n / 2;

transpose(a, k, N);
transpose(a + k, k, N);
transpose(a + k * N, k, N);
transpose(a + k * N + k, k, N);

for (int i = 0; i < k; i++)
for (int j = 0; j < k; j++)
swap(a[i * N + (j + k)], a[(i + k) * N + j]);

if (n & 1)
for (int i = 0; i < n - 1; i++)
swap(a[i * N + n - 1], a[(n - 1) * N + i]);
}
}


The I/O complexity of the algorithm is $O(\frac{N^2}{B})$, as we only need to touch roughly half the memory blocks during each merge stage, meaning that on each stage our problem becomes smaller.

Adapting this code for the general case of non-square matrices is left as an exercise to the reader.

## Matrix Multiplication

Next, let’s consider something slightly more complex: matrix multiplication.

$$C_{ij} = \sum_k A_{ik} B_{kj}$$

The naive algorithm just translates its definition into code:

// don't forget to initialize c[][] with zeroes
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
for (int k = 0; k < n; k++)
c[i * n + j] += a[i * n + k] * b[k * n + j];


It needs to access $O(N^3)$ blocks in total as each scalar multiplication needs a new block read.

Many people know that one good optimization is to transpose transpose $B$ first:

for (int i = 0; i < n; i++)
for (int j = 0; j < i; j++)
swap(b[j][i], b[i][j])
// ^ or use our faster transpose from before

for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
for (int k = 0; k < n; k++)
c[i * n + j] += a[i * n + k] * b[j * n + k]; // <- note the indices


Regardless of whether the transpose is done naively or with the cache-oblivious method we just developed, the matrix multiplication with one of the matrices transposed would work in $O(N^3/B + N^2)$ as all memory accesses are now sequential.

It seems like we can’t do better, but turns out we can.

### Algorithm

Cache-oblivious matrix multiplication involves essentially the same trick. We need to divide the data until it fits into lowest cache (i. e. $N^2 \leq M$). For matrix multiplication, this equates to using this formula:

$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \\ \end{pmatrix} = \begin{pmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22}\\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}\\ \end{pmatrix}$$

It is slightly harder to implement though, as we now have 8 recursive matrix multiplications:

void matmul(const float *a, const float *b, float *c, int n, int N) {
if (n <= 32) {
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
for (int k = 0; k < n; k++)
c[i * N + j] += a[i * N + k] * b[k * N + j];
} else {
int k = n / 2;

// c11 = a11 b11 + a12 b21
matmul(a,     b,         c, k, N);
matmul(a + k, b + k * N, c, k, N);

// c12 = a11 b12 + a12 b22
matmul(a,     b + k,         c + k, k, N);
matmul(a + k, b + k * N + k, c + k, k, N);

// c21 = a21 b11 + a22 b21
matmul(a + k * N,     b,         c + k * N, k, N);
matmul(a + k * N + k, b + k * N, c + k * N, k, N);

// c22 = a21 b12 + a22 b22
mul(a + k * N,     b + k,         c + k * N + k, k, N);
mul(a + k * N + k, b + k * N + k, c + k * N + k, k, N);

if (n & 1) {
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
for (int k = (i < n - 1 && j < n - 1) ? n - 1 : 0; k < n; k++)
c[i * N + j] += a[i * N + k] * b[k * N + j];
}
}
}


Because there are many other factors in play here, we are not going to benchmark this implementation, and instead just do its theoretical performance analysis in external memory model.

### Analysis

Arithmetic complexity of the algorithm remains is the same, because the recurrence

$$T(N) = 8 \cdot T(N/2) + \Theta(N^2)$$

is solved by $T(N) = \Theta(N^3)$.

It doesn’t seem like we “conquered” anything yet, but let’s think about its I/O complexity:

$$T(N) = \begin{cases} O(\frac{N^2}{B}) & N \leq \sqrt M & \text{(we only need to read it)} \\ 8 \cdot T(N/2) + O(\frac{N^2}{B}) & \text{otherwise} \end{cases}$$ The recurrence is dominated by $O((\frac{N}{\sqrt M})^3)$ base cases, meaning that the total complexity is $$T(N) = O\left(\frac{(\sqrt{M})^2}{B} \cdot \left(\frac{N}{\sqrt M}\right)^3\right) = O\left(\frac{N^3}{B\sqrt{M}}\right)$$

This is better than just $O(\frac{N^3}{B})$ by quite a lot.

### Strassen Algorithm

In a spirit similar to the Karatsuba algorithm, matrix multiplication can be decomposed in a way that involves 7 matrix multiplications of size $\frac{n}{2}$, and master theorem tells us the such divide-and-conquer algorithm would work in $O(n^{\log_2 7}) \approx O(n^{2.81})$ time and a similar asymptotic in external memory model.

This technique, known as the Strassen algorithm, similarly splits each matrix into 4:

$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \\ \end{pmatrix} =\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \\ \end{pmatrix}$$ It then computes intermediate products of the $\frac{N}{2} \times \frac{N}{2}$ matrices and combines them to get matrix $C$: \begin{aligned} M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}) & C_{11} &= M_1 + M_4 - M_5 + M_7 \\ M_2 &= (A_{21} + A_{22}) B_{11} & C_{12} &= M_3 + M_5 \\ M_3 &= A_{11} (B_{21} - B_{22}) & C_{21} &= M_2 + M_4 \\ M_4 &= A_{22} (B_{21} - B_{11}) & C_{22} &= M_1 - M_2 + M_3 + M_6 \\ M_5 &= (A_{11} + A_{12}) B_{22} \\ M_6 &= (A_{21} - A_{11}) (B_{11} + B_{12}) \\ M_7 &= (A_{12} - A_{22}) (B_{21} + B_{22}) \end{aligned}

You can verify these formulas with simple substitution if you feel like it.

As far as we know, none of the mainstream optimized linear algebra libraries use the Strassen algorithm, although there are some prototype implementations that become efficient for matrices larger than 4000 or so.

This technique can and actually has been extended multiple times to reduce the asymptotic even further by considering more submatrix products. As of 2020, current world record is $O(n^{2.3728596})$. Whether you can multiply matrices in $O(n^2)$ or at least $O(n^2 \log^k n)$ time is an open problem.