This is an upcoming high performance computing book titled “Algorithms for Modern Hardware” by Sergey Slotin.
Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from $O(n \log n)$ to $O(n \log \log n)$.
Bug/typo fixes. If you spot an error on any page, please do one of the following, in order of preference:
- fix it right away by either clicking the pencil icon in the top right corner of the page (it opens the Prose editor) or, more traditionally, modifying the page directly on GitHub (the link to the source is also in the top right corner);
- create an issue on GitHub;
- tell me about it directly;
Release date. The book is split into several parts that I plan to finish sequentially with long breaks in-between. Part I, Performance Engineering, is ~75% complete as of March 2022 and will hopefully be >95% complete by this summer.
A “release” for an open-source book like this essentially means:
- finishing all essential sections and filling all the TODOs,
- mostly freezing the table of contents (except for the case studies),
- doing one final round of heavy copyediting (hopefully, with the help of a professional editor — I still haven’t figured out how commas work in English),
- drawing illustrations (I stole a lot of those that are currently displayed),
- making a print-optimized PDF and figuring out the best way to distribute it.
After that, I will mostly be fixing errors and only doing some minor edits reflecting the changes in technology or new algorithm advancements. The e-book/printed editions will most likely be sold on a “pay what you want” basis, and in any case, the web version will always be fully available online.
Pre-ordering / financially supporting the book. Due to my unfortunate citizenship and place of birth, you can’t — that is, until I find a way that at the same time complies with international sanctions, doesn’t sponsor the war, and won’t put me in prison for tax evasion.
So, don’t bother. If you want to support this book, just share the articles you like on link aggregators and social media and help fix typos — that would be enough.
Translations. The website has separate functionality for creating and managing translations, and I’ve already been contacted by some nice people willing to translate the book into Italian and Chinese (and I will personally translate at least some of it into my native Russian).
However, as the book is still evolving, it is probably not the best idea to start translating it at least until Part I is finished. That said, you are very much encouraged to make translations of any articles and publish them in your blogs — just send me the link so that we can merge it back when a centralized translation process starts.
“Translating” the Russian version. The articles hosted at ru.algorithmica.org/cs/ are not about advanced performance engineering but mostly about classical computer science algorithms, without discussing how to speed them up beyond asymptotic complexity. Most of the information there is not unique and already exists in English elsewhere on the internet: for example, on the similarly-spirited cp-algorithms.com.
Teaching performance engineering in colleges. One of my goals for writing this book is to change the way computer science — algorithm design, to be more precise — is taught in colleges. Let me elaborate on that.
There are two highly impactful textbooks on which most computer science courses are built. Both are undoubtedly outstanding, but one of them is 50 years old, and the other is 30 years old, and computers have changed a lot since then. Asymptotic complexity is not the sole deciding factor anymore. In modern practical algorithm design, you choose the approach that makes better use of different types of parallelism available in the hardware over the one that theoretically does fewer raw operations on galaxy-scale inputs.
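To make this concrete with a toy example (one written for this page, not taken from either textbook): a plain linear scan does far more comparisons than binary search, but it has no unpredictable branches and compilers can typically vectorize it, so on small arrays it often wins anyway.

```cpp
#include <algorithm>
#include <cstdio>

// Toy illustration: a branch-free linear "lower bound" that counts how many
// elements are smaller than x. Asymptotically worse than binary search, but
// easy for the compiler to vectorize and free of branch mispredictions.
int lower_bound_linear(const int *a, int n, int x) {
    int i = 0;
    for (int j = 0; j < n; j++)
        i += (a[j] < x);
    return i;
}

int main() {
    int a[16] = {1, 3, 4, 7, 9, 11, 15, 16, 18, 21, 22, 25, 28, 30, 31, 40};
    std::printf("%d %d\n", lower_bound_linear(a, 16, 15),
                int(std::lower_bound(a, a + 16, 15) - a)); // both print 6
}
```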
And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as “Performance Engineering of Software Systems” from MIT, “Programming Parallel Computers” from Aalto University, and some non-academic ones like Denis Bakhvalov’s “Performance Ninja” — most computer science graduates still treat the hardware like something from the 1990s.
What I really want to achieve is for performance engineering to be taught right after the introductory algorithms course. Writing the first comprehensive textbook on the subject is a large part of that, which is why I am rushing to finish it by the summer so that colleges can pick it up in the next academic year. But creating a new course requires more than a textbook: you need a balanced curriculum, course infrastructure, lecture slides, lab assignments… so for some time after finishing the main book, I will be working on course materials and tools for teaching performance engineering, and I am looking forward to collaborating with other people who want to make this a reality as well.
Part I: Performance Engineering
The first part covers the basics of computer architecture and optimization of single-threaded algorithms.
It walks through the main CPU optimization topics such as caching, SIMD and pipelining, and provides brief examples in C++, followed by large case studies where we usually achieve a significant speedup over some STL algorithm or data structure.
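To give a taste of what those brief examples look like, here is a minimal sketch (written for this overview, not taken from the book) of one classic pipelining trick: summing an array with two independent accumulators so that consecutive additions do not form one long dependency chain.

```cpp
#include <cstddef>

// Minimal sketch: two independent accumulators let the CPU overlap additions.
// With floats, the compiler will not do this reassociation on its own
// (it changes rounding) unless something like -ffast-math is enabled.
float sum(const float *a, std::size_t n) {
    float s0 = 0, s1 = 0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)        // handle a possible leftover element
        s0 += a[i];
    return s0 + s1;
}
```

The same idea, scaled up to SIMD registers and more accumulators, is roughly the direction the sum and prefix sum case studies listed below take.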
Planned table of contents:
0. Preface
1. Complexity Models: 1.1. Modern Hardware, 1.2. Programming Languages, 1.3. Models of Computation, 1.4. When to Optimize
2. Computer Architecture: 2.1. Instruction Set Architectures, 2.2. Assembly Language, 2.3. Loops and Conditionals, 2.4. Functions and Recursion, 2.5. Indirect Branching, 2.6. Machine Code Layout, 2.7. Interrupts and System Calls, 2.8. Virtualization
3. Instruction-Level Parallelism: 3.1. Pipeline Hazards, 3.2. The Cost of Branching, 3.3. Branchless Programming, 3.4. Instruction Tables, 3.5. Instruction Scheduling, 3.6. Throughput Computing, 3.7. Theoretical Performance Limits
4. Compilation: 4.1. Stages of Compilation, 4.2. Flags and Targets, 4.3. Situational Optimizations, 4.4. Contract Programming, 4.5. Non-Zero-Cost Abstractions, 4.6. Compile-Time Computation, 4.7. Arithmetic Optimizations, 4.8. What Compilers Can and Can't Do
5. Profiling: 5.1. Instrumentation, 5.2. Statistical Profiling, 5.3. Program Simulation, 5.4. Machine Code Analyzers, 5.5. Benchmarking, 5.6. Getting Accurate Results
6. Arithmetic: 6.1. Floating-Point Numbers, 6.2. Interval Arithmetic, 6.3. Newton's Method, 6.4. Fast Inverse Square Root, 6.5. Integers, 6.6. Integer Division, 6.7. Bit Manipulation, (6.8. Data Compression)
7. Number Theory: 7.1. Modular Inverse, 7.2. Montgomery Multiplication, (7.3. Finite Fields), (7.4. Error Correction), 7.5. Cryptography, 7.6. Hashing, 7.7. Random Number Generation
8. External Memory: 8.1. Memory Hierarchy, 8.2. Virtual Memory, 8.3. External Memory Model, 8.4. External Sorting, 8.5. List Ranking, 8.6. Eviction Policies, 8.7. Cache-Oblivious Algorithms, 8.8. Spatial and Temporal Locality, (8.9. B-Trees), (8.10. Sublinear Algorithms), (Memory Management)
9. RAM & CPU Caches: 9.1. Memory Bandwidth, 9.2. Memory Latency, 9.3. Cache Lines, 9.4. Memory Sharing, 9.5. Memory-Level Parallelism, 9.6. Prefetching, 9.7. Alignment and Packing, 9.8. Pointer Alternatives, 9.9. Cache Associativity, 9.10. Memory Paging, 9.11. AoS and SoA
10. SIMD Parallelism: 10.1. Intrinsics and Vector Types, 10.2. Loading and Writing Data, 10.3. Sums and Other Reductions, 10.4. Masking and Blending, 10.5. In-Register Shuffles, 10.6. Auto-Vectorization
11. Algorithm Case Studies: 11.1. Binary GCD, (11.2. Prime Number Sieves), 11.3. Integer Factorization, 11.4. Logistic Regression, 11.5. Big Integers & Karatsuba Algorithm, 11.6. Fast Fourier Transform, 11.7. Number-Theoretic Transform, 11.8. Argmin with SIMD, 11.9. Prefix Sum with SIMD, 11.10. Reading Decimal Integers, 11.11. Writing Decimal Integers, (11.12. Reading and Writing Floats), (11.13. String Searching), 11.14. Sorting, 11.15. Matrix Multiplication
12. Data Structure Case Studies: 12.1. Binary Search, 12.2. Static B-Trees, (12.3. Search Trees), 12.4. Segment Trees, (12.5. Tries), (12.6. Range Minimum Query), 12.7. Hash Tables, (12.8. Bitmaps), (12.9. Probabilistic Filters)
Among the cool things that we will speed up:
- 2x faster GCD (compared to `std::gcd`)
- 8-15x faster binary search (compared to `std::lower_bound`)
- 5-10x faster segment trees (compared to Fenwick trees)
- 5x faster hash tables (compared to `std::unordered_map`)
- 2x faster popcount (compared to repeatedly calling `popcnt`)
- 35x faster parsing series of integers (compared to `scanf`)
- ?x faster sorting (compared to `std::sort`)
- 2x faster sum (compared to `std::accumulate`)
- 2-3x faster prefix sum (compared to naive implementation)
- 10x faster argmin (compared to naive implementation)
- 10x faster array searching (compared to `std::find`)
- 15x faster search tree (compared to `std::set`)
- 100x faster matrix multiplication (compared to “for-for-for”)
- optimal word-size integer factorization (~0.4ms per 60-bit integer)
- optimal Karatsuba Algorithm
- optimal FFT
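To give a flavor of these case studies, here is a minimal sketch of the branchless idea behind the binary search item above (written for this overview, not taken from the book); the actual chapter goes much further, e.g., by changing the memory layout and adding prefetching.

```cpp
#include <algorithm>
#include <cstdio>

// Sketch of a branchless binary search: instead of branching on the
// comparison, conditionally advance the base pointer (typically compiled to
// a cmov), removing the unpredictable branch from the hot loop.
// Assumes n >= 1 and a sorted array.
int lower_bound_branchless(const int *a, int n, int x) {
    const int *base = a;
    int len = n;
    while (len > 1) {
        int half = len / 2;
        base += (base[half - 1] < x) * half; // no branch here
        len -= half;
    }
    return int(base - a) + (*base < x); // index of the first element >= x
}

int main() {
    int a[8] = {1, 3, 5, 7, 9, 11, 13, 15};
    std::printf("%d %d\n", lower_bound_branchless(a, 8, 8),
                int(std::lower_bound(a, a + 8, 8) - a)); // both print 4
}
```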
This work is largely based on blog posts, research papers, conference talks and other work authored by a lot of people:
- Agner Fog
- Daniel Lemire
- Andrei Alexandrescu
- Chandler Carruth
- Wojciech Muła
- Malte Skarupke
- Travis Downs
- Brendan Gregg
- Andreas Abel
- Jakob Kogler
- Igor Ostrovsky
- Steven Pigeon
- Denis Bakhvalov
- Paul Khuong
- Pat Morin
- Victor Eijkhout
- Robert van de Geijn
- Edmond Chow
- Peter Cordes
- Geoff Langdale
- Matt Kulukundis
- Georg Sauthoff
- Marshall Lochbaum
- Pavel Zemtsov
- Z boson
Volume: 450-600 pages
Release date: Q2 2022
Part II: Parallel Algorithms
Concurrency, models of parallelism, green threads and concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking and graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication and sorting.
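As a tiny illustration of the kind of primitive this part will cover, here is an OpenMP parallel reduction (an illustrative snippet written for this overview, not taken from the book), compiled with `-fopenmp`:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> a(1 << 20, 1.0);
    double s = 0;
    // Each thread accumulates a private partial sum; OpenMP combines them at the end.
    #pragma omp parallel for reduction(+: s)
    for (std::size_t i = 0; i < a.size(); i++)
        s += a[i];
    std::printf("%.0f\n", s); // prints 1048576
}
```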
Volume: 150-200 pages
Release date: 2023?
Part III: Distributed Computing
Communication-constrained algorithms, message passing, actor model, partitioning, MapReduce, consistency and reliability at scale, storage, compression, scheduling and cloud computing, distributed deep learning.
Release date: ??? (more likely to be completed than not)
Part IV: Compilers and Domain-Specific Architectures
LLVM IR, compiler optimizations, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++ and oneAPI, XLA, Verilog, FPGAs, ASICs, TPUs and other AI accelerators.
Release date: ??? (less likely to be completed than not)
Disclaimer: Technology Choices
The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles we aim to convey are not specific to them.
To clear my conscience: I’m not happy with any of these choices. These technologies just happen to be the most widespread and stable at the moment and are thus more helpful to the reader. I would have picked C / Rust, LLVM, Arm, OpenCL, and Dask, respectively; maybe there will be a second edition in which some of the tech stack is changed.