This is an upcoming high performance computing book titled “Algorithms for Modern Hardware” by Sergey Slotin.
Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from $O(n \log n)$ to $O(n \log \log n)$.
All materials are hosted on GitHub, with code in a separate repository. This isn’t a collaborative project, but any contributions and feedback are very much welcome.
Part I: Performance Engineering
The first part covers the basics of computer architecture and optimization of single-threaded algorithms.
It walks through the main CPU optimization topics such as caching, SIMD and pipelining, and provides brief examples in C++, followed by large case studies where we usually achieve a significant speedup over some STL algorithm or data structure.
Planned table of contents:
0. Preface 1. Complexity Models 1.1. Modern Hardware 1.2. Programming Languages 1.3. Models of Computation 1.4. Levels of Optimization 2. Computer Architecture 1.1. Instruction Set Architectures 1.2. Assembly Language 1.2. Loops and Conditionals 1.3. Functions and Recursion 1.4. Indirect Branching 1.5. Interrupts and System Calls 1.6. Machine Code Layout 3. Instruction-Level Parallelism 3.1. Pipeline Hazards 3.2. The Cost of Branching 3.3. Branchless Programming 3.4. Instruction Tables 3.5. Instruction Scheduling 3.6. Throughput Computing 3.7. Theoretical Performance Limits 4. Compilation 4.1. Stages of Compilation 4.2. Flags and Targets 4.3. Situational Optimizations 4.4. Contracts Programming 4.5. Non-Zero-Cost Abstractions 4.6. Compile-Time Computation 4.7. Arithmetic Optimizations 4.8. What Compilers Can and Can't Do 5. Profiling 5.1. Instrumentation 5.2. Statistical Profiling 5.3. Program Simulation 5.4. Machine Code Analyzers 5.5. Benchmarking 6. Arithmetic 6.1. Floating-Point Numbers 6.2. Interval Arithmetic 6.3. Newton's Method 6.4. Fast Inverse Square Root 6.5. Integers 6.6. Integer Division 6.7. Bit Manipulation (6.8. Data Compression) 7. Number Theory 7.1. Modular Inverse 7.2. Montgomery Multiplication (7.3. Finite Fields) (7.4. Error Correction) 7.5. Cryptography 7.6. Hashing 7.7. Random Number Generation 8. External Memory 8.1. External Sorting 8.2. List Ranking 8.3. Eviction Policies 8.4. Data Locality 8.5. Cache Blocking 8.6. Cache-Oblivious Algorithms (8.7. B-Trees) (8.8. Sublinear Algorithms) 9. RAM & CPU Caches 9.1. Memory Bandwidth 9.2. Cache Lines and Memory Alignment 9.3. Bit Fields and Packing 9.4. Memory Paging 9.5. Cache Associativity 9.6. Memory Latency 9.7. Memory-Level Parallelism 9.8. Prefetching 9.9. Pointers and Their Alternatives (9.10. Memory Management) (9.11. memcpy and memset) 10. SIMD Parallelism 10.1. Using SIMD in C/C++ 10.2. Reductions 10.3. Auto-Vectorization 10.4. Data Twiddling 10.5. SSE & AVX Cookbook 11. Algorithm Case Studies 11.1. Binary GCD (11.2. Prime Number Sieves) 11.3. Integer Factorization 11.4. Logistic Regression 11.5. Big Integers & Karatsuba Algorithm 11.6. Fast Fourier Transform 11.7. Number-Theoretic Transform 11.8. Argmin with SIMD 11.9. Reading and Writing Integers (11.10. Reading and Writing Floats) (11.11. String Searching) 11.12. Sorting 11.13. Matrix Multiplication 12. Data Structure Case Studies 12.1. Binary Search 12.2. Dynamic Prefix Sum (12.3. Ordered Trees) (12.4. Range Minimum Query) 12.5. Hash Tables (12.6. Bitmaps) (12.7. Probabilistic Filters)
Among cool things that we will speed up:
- 2x faster GCD (compared to
- 5x faster binary search (compared to
- 7x faster segment trees
- 5x faster hash tables (compared to
?x faster popcount
- 2x faster parsing series of integers (compared to
- ?x faster sorting (compared to
- 2x faster sum (compared to
- 100x faster matrix multiplication (compared to “for-for-for”)
- optimal word-size integer factorization (~0.4ms per 60-bit integer)
- optimal Karatsuba Algorithm
- optimal FFT
- argmin at the speed of memory
This work is largely based on blog posts, research papers, conference talks and other work authored by a lot of people:
- Agner Fog
- Daniel Lemire
- Andrei Alexandrescu
- Chandler Carruth
- Wojciech Muła
- Malte Skarupke
- Travis Downs
- Brendan Gregg
- Andreas Abel
- Jakob Kogler
- Igor Ostrovsky
- Steven Pigeon
- Denis Bakhvalov
- Paul Khuong
- Pat Morin
- Victor Eijkhout
- Robert van de Geijn
- Edmond Chow
- Peter Cordes
- Geoff Langdale
- Matt Kulukundis
Volume: 300-400 pages
Release date: early 2022
Part II: Parallel Algorithms
Concurrency, models of parallelism, green threads and runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking and graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication and sorting.
Volume: 150-200 pages
Release date: late 2022 / 2023?
Part III: Distributed Computing
Communication-constrained algorithms, message passing, actor model, partitioning, MapReduce, consistency and reliability at scale, storage, compression, scheduling and cloud computing, distributed deep learning.
Release date: ???
Part IV: Compilers and Domain-Specific Architectures
LLVM IR, main optimization techniques from the dragon book, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++ and oneAPI, XLA, FPGAs and Verilog, ASICs, TPUs and other AI accelerators.
Release date: ???
Disclaimer: Technology Choices
The examples in this book use C++, GCC, x86-64, CUDA and Spark, although the underlying principles we aim to convey are not specific to them.
To clear my conscience, I’m not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment, and thus more helpful for the reader. I would have respectively picked C / Rust, LLVM, arm, OpenCL and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed.