While the external memory model is reasonably accurate for computations involving HDDs and network storage, where the cost of arithmetic operations on in-memory values is negligible compared to that of external I/O, it is too imprecise for the lower levels of the cache hierarchy, where the costs of these operations become comparable.
To perform more fine-grained optimization of in-memory algorithms, we have to start taking into account many specific details of the CPU cache system. And instead of studying loads of boring Intel documents with dry specs and theoretically achievable limits, we will estimate these parameters experimentally: by running numerous small benchmark programs with access patterns that resemble the ones that often occur in practical code.
As before, I will be running all experiments on Ryzen 7 4700U, which is a “Zen 2” CPU with the following main cache-related specs:
- 8 physical cores (without hyper-threading) clocked at 2GHz (4.1GHz in boost mode, which we disable);
- 256K of 8-way set-associative L1 data cache (32K per core);
- 4M of 8-way set-associative L2 cache (512K per core);
- 8M of 16-way set-associative L3 cache, shared between all 8 cores;
- 16GB (2x8G) of DDR4 RAM @ 2667MHz.
You can compare it with your own hardware by running `dmidecode -t cache` or `lshw -class memory` on Linux, or by installing CPU-Z on Windows. You can also find additional details about the CPU on WikiChip and 7-CPU. Not all conclusions will generalize to every CPU platform in existence.
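On Linux, the kernel also exposes the same information through sysfs; a quick way to dump it for one core (the paths below are standard, but some virtualized environments leave them unpopulated) is:

```shell
# Print the level, type, size, associativity, and line size of each
# cache reported by the kernel for core 0; silently skip anything
# the environment doesn't expose
for c in /sys/devices/system/cpu/cpu0/cache/index*; do
    [ -d "$c" ] || continue
    echo "L$(cat "$c"/level 2>/dev/null) $(cat "$c"/type 2>/dev/null): \
$(cat "$c"/size 2>/dev/null), \
$(cat "$c"/ways_of_associativity 2>/dev/null)-way, \
$(cat "$c"/coherency_line_size 2>/dev/null)B lines"
done
```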
Due to difficulties in preventing the compiler from optimizing away unused values, the code snippets in this article are slightly simplified for exposition purposes. Check the code repository if you want to reproduce them yourself.
This chapter is inspired by “Gallery of Processor Cache Effects” by Igor Ostrovsky and “What Every Programmer Should Know About Memory” by Ulrich Drepper, both of which can serve as good accompanying reading.