Rounding Errors

The way rounding works in hardware floats is remarkably simple: it occurs if and only if the result of the operation is not representable exactly, and by default gets rounded to the nearest representable number (in case of a tie preferring the number that ends with a zero).

Consider the following code snippet:

float x = 0;
for (int i = 0; i < (1 << 25); i++)
    x++;
printf("%f\n", x);

Instead of printing $2^{25} = 33554432$ (what the result mathematically should be), it outputs $16777216 = 2^{24}$. Why?

When we repeatedly increment a floating-point number $x$, we eventually hit a point where it becomes so big that $(x + 1)$ gets rounded back to $x$. The first such number is $2^{24}$ (the exponent being the number of mantissa bits plus one) because

$$2^{24} + 1 = 2^{24} \cdot 1.\underbrace{0\ldots0}_{\times 23} 1$$

has the exact same distance from $2^{24}$ and $(2^{24} + 2)$ (the two nearest representable numbers) but gets rounded down to $2^{24}$ by the above-stated tie-breaker rule. At the same time, the increment of everything lower than that can be represented exactly, so no rounding happens in the first place.
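
We can also observe the tie-breaking rule directly by converting integers just above $2^{24}$ to float:

printf("%f\n", (float) 16777217);  // 2^24 + 1 is a tie between 2^24 and 2^24 + 2
                                   // and goes to the "even" one: 16777216.000000
printf("%f\n", (float) 16777219);  // 2^24 + 3 is a tie between 2^24 + 2 and 2^24 + 4
                                   // and also goes to the even one: 16777220.000000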

#Rounding Errors and Operation Order

The result of a floating-point computation may depend on the order of operations, even though these orderings are algebraically equivalent.

For example, while addition and multiplication are commutative and associative in the pure mathematical sense, their floating-point counterparts are commutative but not associative: when we have three floating-point variables $x$, $y$, and $z$, the result of $(x + y + z)$ depends on the order of summation. The same non-associativity applies to most other floating-point operations.
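
For a concrete illustration, we can reuse the $2^{24}$ threshold from above:

float x = 1 << 24;               // 2^24
printf("%f\n", (x + 1.0f) - x);  // 0.000000: (x + 1) rounds back to x before the subtraction
printf("%f\n", x + (1.0f - x));  // 1.000000: both operations in this order are exact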

Compilers are not allowed to produce non-spec-compliant results, so this annoying nuance disables some potential optimizations that involve rearranging operands in arithmetic. You can disable this strict compliance with the -ffast-math flag in GCC and Clang. If we add it and re-compile the code snippet above, it runs considerably faster and also happens to output the correct result, 33554432 (although you need to be aware that the compiler also could have chosen a less precise computation path).

#Rounding Modes

Apart from the default mode (round to nearest, ties to even, also known as Banker’s rounding), you can set one of 4 other rounding modes:

  • round to nearest, with perfect ties always rounding “away” from zero;
  • round up (toward +∞; negative results thus round toward zero);
  • round down (toward -∞; negative results thus round away from zero);
  • round toward zero (a truncation of the binary result).

For example, if you call fesetround(FE_UPWARD) before running the loop above, it outputs not $2^{24}$, and not even $2^{25}$, but $67108864 = 2^{26}$. This happens because when we get to $2^{24}$, $(x + 1)$ starts rounding to the next nearest representable number $(x + 2)$, and we reach $2^{25}$ in half the time, and after that, $(x + 1)$ rounds up to $(x + 4)$, and we start going four times as fast.
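
Here is roughly what that experiment looks like; fesetround and the FE_* constants come from <cfenv>, and you may also need to compile with -frounding-math so that the compiler doesn’t assume the default rounding mode:

fesetround(FE_UPWARD); // from <cfenv>

float x = 0;
for (int i = 0; i < (1 << 25); i++)
    x++;
printf("%f\n", x);     // prints 67108864 (2^26)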

One use for the alternative rounding modes is diagnosing numerical instability: if the results of an algorithm vary substantially when switching between rounding toward positive and negative infinity, it indicates susceptibility to round-off errors.

This test is often better than switching all computations to a lower precision and checking whether the result changes by too much, because the default round-to-nearest policy converges to the correct “expected” value given enough averaging: half of the time the errors round up, and the other half they round down, so, statistically, they cancel each other out.
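
A sketch of what such a test might look like for a simple array summation (the array a and its size n are assumed to be defined elsewhere):

fesetround(FE_UPWARD);
float hi = 0;
for (int i = 0; i < n; i++)
    hi += a[i];

fesetround(FE_DOWNWARD);
float lo = 0;
for (int i = 0; i < n; i++)
    lo += a[i];

fesetround(FE_TONEAREST); // restore the default
printf("%f %f\n", lo, hi); // a wide gap between the two hints at round-off sensitivity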

#Measuring Errors

It may seem surprising to expect this guarantee from hardware that performs complex calculations such as square roots, but that is exactly what the standard promises for the basic operations: you are guaranteed to get the most precise representable result. (For transcendental functions such as the natural logarithm, correct rounding is only recommended by the standard, not required, though good implementations typically stay within a couple of ulps.) This makes it remarkably easy to analyze round-off errors, as we will see in a bit.

There are two natural ways to measure computational errors:

  • The engineers who create hardware or spec-compliant exact software are concerned with units in the last place (ulps): the distance between two numbers in terms of how many representable numbers can fit between the precise real value and the actual result of the computation.
  • People who work on numerical algorithms care about relative precision: the absolute value of the approximation error divided by the real answer, $|\frac{v - v'}{v}|$.

In either case, the usual tactic to analyze errors is to assume the worst case and simply bound them.

If you perform a single basic arithmetic operation, then the worst thing that can happen is the result rounding to the nearest representable number, meaning that the error does not exceed 0.5 ulps. To reason about relative errors the same way, we can define a number $\epsilon$ called machine epsilon, equal to the difference between $1$ and the next representable value (which is $2$ to the negative power of however many bits are dedicated to the mantissa: $2^{-23}$ for float and $2^{-52}$ for double).

This means that if after a single arithmetic operation you get result $x$, then the real value is somewhere in the range

$$[x \cdot (1 - \epsilon),\; x \cdot (1 + \epsilon)]$$
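
We can peek at this constant directly: std::nextafter from <cmath> gives the next representable value after its first argument (in the direction of the second), so the gap between $1$ and the next float is exactly the machine epsilon:

float eps = std::nextafter(1.0f, 2.0f) - 1.0f;          // distance from 1 to the next float
printf("%g\n", eps);                                    // 1.19209e-07, i.e. 2^-23
printf("%g\n", std::numeric_limits<float>::epsilon());  // the same constant, from <limits>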

The omnipresence of errors is especially important to remember when making discrete “yes or no” decisions based on the results of floating-point calculations. For example, here is how you should check for equality:

const float eps = std::numeric_limits<float>::epsilon(); // ~2^(-23)
bool eq(float a, float b) {
    return fabs(a - b) <= eps; // fabs, not the integer abs
}

The value of eps should depend on the application: the one above — the machine epsilon for float — is only good for no more than one floating-point operation.
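
A common generalization (a sketch, not a one-size-fits-all recipe) is to scale the tolerance by the magnitude of the operands and by a rough count of the operations that produced them, so that the check tracks relative rather than absolute error:

// fabs is from <cmath>, std::max from <algorithm>
bool eq(float a, float b, float n_ops = 1) {
    const float eps = std::numeric_limits<float>::epsilon();
    return fabs(a - b) <= n_ops * eps * std::max(fabs(a), fabs(b));
}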

#Interval Arithmetic

An algorithm is called numerically stable if its error, whatever its cause, does not grow much larger during the calculation. This can only happen if the problem itself is well-conditioned, meaning that the solution changes only by a small amount if the input data are changed by a small amount.

When analyzing numerical algorithms, it is often useful to adopt the same method that is used in experimental physics: instead of working with unknown real values, we will work with the intervals they may lie in.

For example, consider a chain of operations where we consecutively multiply a variable by arbitrary real numbers:

float x = 1;
for (int i = 0; i < n; i++)
    x *= a[i];

After the first multiplication, the value of $x$ relative to the value of the real product is bounded by $(1 + \epsilon)$, and after each additional multiplication, this upper bound is multiplied by another $(1 + \epsilon)$. By induction, after $n$ multiplications, the computed value is bound by $(1 + \epsilon)^n = 1 + n \epsilon + O(\epsilon^2)$ and a similar lower bound.

This implies that the relative error is $O(n \epsilon)$, which is sort of okay, because usually $n \ll \frac{1}{\epsilon}$.
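
As a rough empirical sanity check, we can run the same chain of multiplications in float and in double and compare the results (the factors below are arbitrary, chosen only to be close to $1$ so that the product doesn’t overflow):

const int n = 1000;
float x = 1;
double y = 1;

for (int i = 0; i < n; i++) {
    float f = 1 + 0.001f * sinf(i);  // some arbitrary factor
    x *= f;
    y *= f;                          // the same factor, accumulated in double
}

printf("%g\n", fabs(x - y) / fabs(y));  // the relative error: well below n * eps ≈ 1.2e-4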

For an example of a numerically unstable computation, consider the function

$$f(x, y) = x^2 - y^2$$

Assuming $x > y$, the maximum value this function can return is roughly

$$x^2 \cdot (1 + \epsilon) - y^2 \cdot (1 - \epsilon)$$

corresponding to the absolute error of

$$x^2 \cdot (1 + \epsilon) - y^2 \cdot (1 - \epsilon) - (x^2 - y^2) = (x^2 + y^2) \cdot \epsilon$$

and hence the relative error of

$$\frac{x^2 + y^2}{x^2 - y^2} \cdot \epsilon$$

If $x$ and $y$ are close in magnitude, $x^2 - y^2$ is much smaller than $x^2 + y^2$, and this relative error blows up: for instance, if $|x - y| \approx 1$, it is on the order of $\epsilon \cdot |x|$.

Under direct computation, the subtraction “magnifies” the errors of squaring. But this can be fixed by instead using the following formula:

$$f(x, y) = x^2 - y^2 = (x + y) \cdot (x - y)$$

In this one, it is easy to show that the relative error is bounded by a small constant multiple of $\epsilon$: the absolute error is now proportional to the result itself rather than to $x^2 + y^2$. It is also faster because it needs 2 additions and 1 multiplication: one fast addition more and one slow multiplication less compared to the original.
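
A quick comparison on a typical machine where float arithmetic is carried out in single precision (the values are picked so that the squares don’t fit into the 24-bit mantissa while the sum and the difference do):

float x = 10000.25, y = 10000;      // both exactly representable
printf("%f\n", x * x - y * y);      // 5000.000000: the squares lost their low-order bits
printf("%f\n", (x + y) * (x - y));  // 5000.062500: the exact answer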

#Kahan Summation

From the previous example, we can see that long chains of operations are not a problem, but adding and subtracting numbers of different magnitude is. The general approach to dealing with such problems is to try to keep big numbers with big numbers and small numbers with small numbers.

Consider the standard summation algorithm:

float s = 0;
for (int i = 0; i < n; i++)
    s += a[i];

Since we are performing summations and not multiplications, its relative error is no longer just bounded by $O(\epsilon \cdot n)$, but heavily depends on the input.

In the most ridiculous case, if the first value is $2^{24}$ and the other values are equal to $1$, the sum is going to be $2^{24}$ regardless of $n$, which can be verified by executing the following code and observing that it simply prints $16777216 = 2^{24}$ twice:

const int n = (1<<24);
printf("%d\n", n);

float s = n;
for (int i = 0; i < n; i++)
    s += 1.0;

printf("%f\n", s);

This happens because float has only 23 mantissa bits, and so $2^{24} + 1$ is the first integer number that can’t be represented exactly and has to be rounded down, which happens every time we try to add $1$ to $s = 2^{24}$. The error is indeed $O(n \cdot \epsilon)$, but only in proportion to the magnitude of the summands, not to the final result: in the example above, the computed sum is off by a factor of two, and the relative error would become infinite if the array additionally ended with $-2^{25}$, making the exact sum zero while the computed one ends up at $-2^{24}$.

The obvious solution is to switch to a larger type such as double, but this isn’t really a scalable method. An elegant solution is to store the part of each summand that wasn’t added in a separate variable, which is then added back to the next element:

float s = 0, c = 0;
for (int i = 0; i < n; i++) {
    float y = a[i] - c; // c is zero on the first iteration
    float t = s + y;    // s may be big and y may be small, losing low-order bits of y
    c = (t - s) - y;    // (t - s) cancels high-order part of y
    s = t;
}

This trick is known as Kahan summation. Its relative error is bounded by $2 \epsilon + O(n \epsilon^2)$: the first term comes from the very last summation, and the second term is due to the fact that we work with less-than-epsilon errors on each step.
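
As a quick check, re-running the $2^{24}$ example from above with the compensated loop recovers the exact sum:

const int n = (1 << 24);

float s = n, c = 0;
for (int i = 0; i < n; i++) {
    float y = 1.0f - c;
    float t = s + y;
    c = (t - s) - y;
    s = t;
}

printf("%f\n", s);  // prints 33554432 = 2^25, the exact answer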

Of course, a more general approach that works not just for array summation would be to switch to a more precise data type, like double, also effectively squaring the machine epsilon. Furthermore, it can (sort of) be scaled by bundling two double variables together: one for storing the value and another for its non-representable errors so that they represent the value $a + b$. This approach is known as double-double arithmetic, and it can be similarly generalized to define quad-double and higher precision arithmetic.
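
Here is a minimal sketch of how a double-double type could look (assuming strict IEEE semantics, so it breaks under -ffast-math; real implementations handle normalization and special cases more carefully):

// a value is stored as an unevaluated sum hi + lo of two doubles;
// two_sum is Knuth's error-free transformation: s + err == a + b exactly
struct dd {
    double hi, lo;
};

dd two_sum(double a, double b) {
    double s = a + b;
    double bv = s - a;                       // the part of b that made it into s
    double err = (a - (s - bv)) + (b - bv);  // what was rounded away
    return {s, err};
}

dd add(dd a, dd b) {
    dd s = two_sum(a.hi, b.hi);
    s.lo += a.lo + b.lo;         // accumulate the low-order parts (this step still rounds)
    return two_sum(s.hi, s.lo);  // renormalize so that hi carries most of the value
}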