Functions and Recursion

To “call a function” in assembly, you need to jump to its beginning and then jump back. But then two important problems arise:

What if the caller stores data in the same registers as the callee?
Where is “back”?

Both of these concerns can be solved by having a dedicated location in memory where we can write all the information we need to return from the function before calling it. This location is called the stack.

#The Stack

The hardware stack works the same way software stacks do and is similarly implemented as just two pointers:

The base pointer marks the start of the stack and is conventionally stored in rbp.
The stack pointer marks the last element of the stack and is conventionally stored in rsp.

When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances; e.g., when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers.

You can implement all that with the usual memory operations and jumps, but because of how frequently it is used, there are 4 special instructions for doing this:

push writes data at the stack pointer and decrements it.
pop reads data from the stack pointer and increments it.
call puts the address of the following instruction on top of the stack and jumps to a label.
ret reads the return address from the top of the stack and jumps to it.

You would call them “syntactic sugar” if they weren’t actual hardware instructions — they are just fused equivalents of these two-instruction snippets:

; "push rax"
sub rsp, 8
mov QWORD PTR[rsp], rax

; "pop rax"
mov rax, QWORD PTR[rsp]
add rsp, 8

; "call func"
push rip ; <- instruction pointer (although accessing it like that is probably illegal)
jmp func

; "ret"
pop  rcx ; <- choose any unused register
jmp rcx

The memory region between rbp and rsp is called a stack frame, and this is where local variables of functions are typically stored. It is pre-allocated at the start of the program, and if you push more data on the stack than its capacity (8MB by default on Linux), you encounter a stack overflow error. Because modern operating systems don’t actually give you memory pages until you read or write to their address space, you can freely specify a very large stack size, which acts more like a limit on how much stack memory can be used, and not a fixed amount every program has to use.

#Calling Conventions

The people who develop compilers and operating systems eventually came up with conventions on how to write and call functions. These conventions enable some important software engineering marvels such as splitting compilation into separate units, reusing already-compiled libraries, and even writing them in different programming languages.

Consider the following example in C:

int square(int x) {
    return x * x;
}

int distance(int x, int y) {
    return square(x) + square(y);
}

By convention, a function should take its arguments in rdi, rsi, rdx, rcx, r8, r9 (and the rest in the stack if those weren’t enough), put the return value into rax, and then return. Thus, square, being a simple one-argument function, can be implemented like this:

square:             ; x = edi, ret = eax
    imul edi, edi
    mov  eax, edi
    ret

Each time we call it from distance, we just need to go through some trouble preserving its local variables:

distance:           ; x = rdi/edi, y = rsi/esi, ret = rax/eax
    push rdi
    push rsi
    call square     ; eax = square(x)
    pop  rsi
    pop  rdi

    mov  ebx, eax   ; save x^2
    mov  rdi, rsi   ; move new x=y

    push rdi
    push rsi
    call square     ; eax = square(x=y)
    pop  rsi
    pop  rdi

    add  eax, ebx   ; x^2 + y^2
    ret

There are a lot more nuances, but we won’t go into detail here because this book is about performance, and the best way to deal with functions calls is actually to avoid making them in the first place.

#Inlining

Moving data to and from the stack creates noticeable overhead for small functions like these. The reason you have to do this is that, in general, you don’t know whether the callee is modifying the registers where you store your local variables. But when you have access to the code of square, you can solve this problem by stashing the data in registers that you know won’t be modified.

distance:
    call square
    mov  ebx, eax
    mov  edi, esi
    call square
    add  eax, ebx
    ret

This is better, but we are still implicitly accessing stack memory: you need to push and pop the instruction pointer on each function call. In simple cases like this, we can inline function calls by stitching the callee’s code into the caller and resolving conflicts over registers. In our example:

distance:
    imul edi, edi       ; edi = x^2
    imul esi, esi       ; esi = y^2
    add  edi, esi
    mov  eax, edi       ; there is no "add eax, edi, esi", so we need a separate mov
    ret

This is fairly close to what optimizing compilers produce out of this snippet — only they use the lea trick to make the resulting machine code sequence a few bytes smaller:

distance:
    imul edi, edi       ; edi = x^2
    imul esi, esi       ; esi = y^2
    lea  eax, [rdi+rsi] ; eax = x^2 + y^2
    ret

In situations like these, function inlining is clearly beneficial, and compilers mostly do it automatically, but there are cases when it’s not — and we will talk about them in a bit.

#Tail Call Elimination

Inlining is straightforward to do when the callee doesn’t make any other function calls, or at least if these calls are not recursive. Let’s move on to a more complex example. Consider this recursive computation of a factorial:

int factorial(int n) {
    if (n == 0)
        return 1;
    return factorial(n - 1) * n;
}

Equivalent assembly:

; n = edi, ret = eax
factorial:
    test edi, edi   ; test if a value is zero
    jne  nonzero    ; (the machine code of "cmp rax, 0" would be one byte longer)
    mov  eax, 1     ; return 1
    ret
nonzero:
    push edi        ; save n to use later in multiplication
    sub  edi, 1
    call factorial  ; call f(n - 1)
    pop  edi
    imul eax, edi
    ret

If the function is recursive, it is still often possible to make it “call-less” by restructuring it. This is the case when the function is tail recursive, that is, it returns right after making a recursive call. Since no actions are required after the call, there is also no need for storing anything on the stack, and a recursive call can be safely replaced with a jump to the beginning — effectively turning the function into a loop.

To make our factorial function tail-recursive, we can pass a “current product” argument to it:

int factorial(int n, int p = 1) {
    if (n == 0)
        return p;
    return factorial(n - 1, p * n);
}

Then this function can be easily folded into a loop:

; assuming n > 0
factorial:
    mov  eax, 1
loop:
    imul eax, edi
    sub  edi, 1
    jne  loop
    ret

The primary reason why recursion can be slow is that it needs to read and write data to the stack, while iterative and tail-recursive algorithms do not. This concept is very important in functional programming, where there are no loops and all you can use are functions. Without tail call elimination, functional programs would require way more time and memory to execute.