Errors In Floating-point Calculations

The C/C++ Users Journal, July, 2000

If you read the C and C++ newsgroups you’re probably tired of seeing the question about why 1.0 / 10.0 multiplied by 10.0 does not compare equal to 1.0. The correct answer, of course, is that floating point arithmetic can be inexact: roundoff errors in floating point math often make the result of operator == misleading. Sometimes, though, the answer that’s given is that the results of floating point computations are random. Now, the person who says this usually doesn’t mean that floating point results are selected by a random number generator; what they mean is that floating point math produces a result that is usually slightly different from the correct result, and that the variation from the correct result is random. Calling the difference random, though, is misleading. Here in The Journeyman’s Shop we know that the result of any floating point computation is well defined and completely reproducible. If you write a program that does some floating point computation several times, the result should be the same each time1. So why do some people say that the results are random? I suspect that by this they mean that figuring out the details of the actual floating point computation is too hard; they prefer not to say that they don’t understand what’s going on, so they say that the result is random.
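Here's a small sketch of the effect, assuming IEEE-754 doubles: adding 0.1 ten times produces a sum that does not compare equal to 1.0, yet the "wrong" answer is identical on every run; there is nothing random about it.

#include <cmath>
#include <cstdio>

int main()
{
    double first = 0.0, second = 0.0;
    for (int i = 0; i < 10; ++i) first += 0.1;    // each 0.1 is already slightly inexact
    for (int i = 0; i < 10; ++i) second += 0.1;   // ...and inexact in exactly the same way

    std::printf("%d\n", first == 1.0);                   // prints 0 on IEEE-754 doubles
    std::printf("%d\n", first == second);                // prints 1: completely reproducible
    std::printf("%d\n", std::fabs(first - 1.0) < 1e-9);  // prints 1: compare with a tolerance
    return 0;
}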

As intellectual precedent for this assertion, they probably have in mind fields like statistical mechanics, where large-scale measures like temperature, pressure, and volume replace the admittedly intractable computations of the interactions of far too many individual molecules. What this overlooks is that the large-scale measures of statistical mechanics are rigorously defined and their interrelationships, as expressed, for example, in the ideal gas law (PV = nRT), can be derived from the random behavior that is assumed for individual molecules. Indeed, it is this alternate approach to the description of the behavior of a large number of molecules, not the assertion that there are so many interactions that they cannot be treated rigorously, that makes statistical mechanics useful. In the absence of an analogous high-level analysis of the behavior of floating point computations, asserting that their results are random is an evasion of responsibility. Floating point computations are not inherently intractable. Many times the effort needed to analyze them thoroughly is not justified, and it’s reasonable to say that we don’t know how far a computed result is from the actual value. But that’s not because the result is random. It’s because we haven’t done the work to understand what the computations are actually doing.

This month we’re going to look at the sources of errors in floating point computations, and at how early errors propagate through to later results. We’ll look at roundoff errors in detail, because those are the errors that we have the most control over, and we’ll look at how to control roundoff in the floating point emulator that we discussed last month. We’ll begin by giving a more precise meaning to the word "error."

Measuring Errors

Since my background is in physics, I use "error" to mean the extent to which a correctly computed result differs from the actual value. By "correctly computed" I mean simply that the formula that was used to compute the result is the right one and that it was implemented correctly. If I accidentally typed 50 instead of 60 when I converted seconds to minutes that’s certainly an error, but for purposes of this discussion we’ll assume that the code doesn’t contain that sort of mistake.

Errors come in three forms: measurement errors, input conversion errors, and roundoff errors2. Measurement errors arise because we cannot measure physical properties with infinite precision. When we say that something is a foot long, what we’re really saying is that its actual length is close to one foot. If we’re trying to speak accurately we’ll add an estimate of the possible range of actual values, by saying something like its length is one foot, plus or minus an eighth of an inch, or its length is one foot plus or minus one percent. I won’t go into the details of how to estimate measurement errors, or of what those error bounds actually mean, because that’s not really pertinent to our task as programmers. It’s important to recognize that such errors exist, because these measurement errors imply a lower bound on the error in our computed results. If we can show that the error in the final result is dominated by the effects of the measurement errors, then we’ve done all we as programmers need to do to reduce errors introduced by the computation. Additional effort to reduce errors won’t improve the result3.

Input conversion errors arise because the internal representation of a floating point value often cannot exactly represent the value being converted. This occurs whenever the floating point representation has fewer bits than are needed to hold the value being converted. The most common source for this form of error is probably base conversion. For example, the value 0.1 has an infinitely repeating representation as a base two fraction. If floating point numbers on our target system use a binary representation we have to cut that infinitely repeating binary fraction off to fit it into a finite-sized floating point representation, and the resulting value is not exactly 0.1. But the finite representation affects more than infinitely repeating values. Even if an input value can be exactly represented in the base used by the floating point representation, it can be too long to fit in the number of bits available. In that case it will also be cut off, and the resulting value will not be exactly right.
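A quick way to see the base conversion problem, assuming IEEE-754 doubles, is to print the stored value of 0.1 with more digits than the conversion can preserve:

#include <cstdio>

int main()
{
    // 0.1 has no finite binary representation, so the double holds only the
    // nearest representable value.
    std::printf("%.20f\n", 0.1);   // typically prints 0.10000000000000000555
    return 0;
}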

Now, those two forms of input conversion error arise from the limitations of the internal representation of floating point values. There is another source of input conversion error that is not inherent in floating point math, but is, nevertheless, often present. The floating point input code in the runtime libraries supplied with every C and C++ compiler that I’m familiar with does a good but less than ideal job of translating input strings into floating point values. It’s possible to write input functions that produce the floating point value that is closest to the actual value represented by the input string, but the code for doing it is tricky, lengthy, and slow4. For most purposes this extra overhead isn’t necessary -- if the data that you’re dealing with is accurate to one place in one thousand, a few bits difference down at the low end of a 53-bit fraction isn’t going to matter. But don’t overlook this potential source of errors if you do have high precision input.

Propagating Errors

When you’re starting a project that’s going to involve floating point computations it’s a good idea to start out by estimating the error in each computed value that results from measurement errors. That’s not very hard to do. Suppose you’re doing something simple like reading in a length, multiplying it by eight, and writing out the result. If the input value is off by one eighth of an inch from the actual value, the result will be off by one inch from the actual result. In general, if you apply a function f(x) to the input value x, the error in the result will be the derivative df(x)/dx, multiplied by the error in the input value. When the result depends on more than one input value, the error is the square root of the sum of the squares of the individual errors contributed by each input value, where each individual error is the partial derivative of f with respect to that particular input value multiplied by the error in that input value. For example, if f(x,y) = x^2 + y, then df/dx = 2x, and df/dy = 1. For input values of x = 1.0 ± 0.2 and y = 2.0 ± 0.1, the contribution from x to the error is 2*1*0.2, or 0.4, and the contribution from y is 1*0.1, or 0.1. The sum of the squares of these two values is 0.17, and the resulting error is the square root of 0.17, or about 0.4 once we round it to one significant digit5. So the computed value of this function is 1.0^2 + 2.0, or 3.0 ± 0.4. Since the error that results from propagating measurement errors is more than ten percent, any error resulting from roundoff during our calculations will not be noticeable. Of course, in most cases the propagated error will be much smaller, and roundoff during a lengthy computation can be much more significant.
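Here is that arithmetic written out as code, a direct transcription of the example above; the 0.4 in the text is just this result rounded to one significant digit:

#include <cmath>
#include <cstdio>

int main()
{
    const double x = 1.0, dx = 0.2;      // x = 1.0 +/- 0.2
    const double y = 2.0, dy = 0.1;      // y = 2.0 +/- 0.1

    const double f      = x * x + y;     // f(x,y) = x^2 + y = 3.0
    const double from_x = 2.0 * x * dx;  // (df/dx) * dx = 0.4
    const double from_y = 1.0 * dy;      // (df/dy) * dy = 0.1
    const double err    = std::sqrt(from_x * from_x + from_y * from_y);

    std::printf("f = %g +/- %g\n", f, err);   // prints f = 3 +/- 0.412311
    return 0;
}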

Roundoff Errors

Now, what does "roundoff error" actually mean? It means the difference between the actual representation of the result of some floating point operation and the result that would have been obtained if floating point operations were performed with infinite precision. As we’ve already seen, input values are coerced to fit into the bits that are available in the floating point representation. When we look at roundoff errors we don’t want to mix in the results of input conversion errors, so we begin with the values that are actually stored in the floating point variables that are involved in the computation. That way we can look at the effects of each separate operation, without being distracted by things that we can’t control. Of course, when we look at the overall error in the result we have to take all of these individual errors into account. But for purposes of evaluating roundoff errors, we can treat the stored floating point values that take part in the computation as exact values.

When we multiply two numbers of the same size together we get a number with twice as many significant digits. When we store the result back into a floating point variable we have to throw away the extra digits. That’s a roundoff error. Similarly, when we add two numbers of the same size together we sometimes get a number that has one more significant digit than the two that we added. When we store the result back into a floating point variable we have to throw away the extra digit. That’s a roundoff error.

For example, if I multiply 1.2*10^2 by 2.3*10^3, the result that my calculator gives is 2.76*10^5. If I insist on sticking to the two significant digits that the two factors have, then the product must be rounded. Most of us learned to round this value up, to 2.8*10^5, because the low-order digit that we’re discarding is greater than 5. Since floating point computations work with a fixed-size representation, they can’t hang on to any extra low-order bits, so they, too, must round their results. The rule that we usually apply in hand computations, known as "round to nearest," is also the usual default for rounding in floating point computations. It’s not the only possible rule, though. The IEEE-754 specification for floating point arithmetic requires round to nearest as the default, but also requires support for three other rounding modes, selectable by the application programmer.

In addition to round to nearest, IEEE-754 requires support for round toward positive infinity, round toward negative infinity, and round toward zero. When rounding a positive value toward positive infinity, discarding any non-zero value in the low order bits requires adding one to the low bit in the portion of the value that is being retained. In a decimal example, 2.71*10^5 would be rounded to 2.8*10^5 when rounding toward positive infinity, even though we’d round it to 2.7*10^5 when rounding to nearest. Similarly, when rounding toward negative infinity, any extra bits in a positive value are simply discarded. 2.76*10^5 becomes 2.7*10^5 when rounding toward negative infinity.

When we apply round toward positive infinity and round toward negative infinity to negative numbers in our decimal examples, the rewriting rules go just the other way: to round toward positive infinity, drop the extra digits; to round toward negative infinity, add one to the low digit. So -2.71*10^5, when rounded toward positive infinity, becomes -2.7*10^5, and -2.76*10^5, when rounded toward negative infinity, becomes -2.8*10^5. In writing decimal values we use a sign-magnitude representation, and adjusting a negative number towards positive infinity requires the same change to the magnitude as adjusting a positive number towards negative infinity. When we look at how to implement these different rounding modes in the emulator we’ll see the same rules, since it, too, uses a sign-magnitude representation.

With that out of the way, we can see that round toward zero is easy to do: just drop the extra bits. So 2.76*10^5 becomes 2.7*10^5, and -2.76*10^5 becomes -2.7*10^5. That’s all there is to it.

Well, not quite. I did skip over one complication in round to nearest: what to do when the two adjacent values are equally near. For example, 2.75*10^5 is equally distant from 2.7*10^5 and from 2.8*10^5. I was taught to round exact halves up, but the IEEE-754 specification doesn’t do that. It uses what’s known as "banker’s rounding," requiring that exact halves be rounded to the adjacent value with a 0 for its low bit. Translated into decimal notation, this means that exact halves round to the value that has an even number for its low digit. So 2.75*10^5 rounds up, to 2.8*10^5, while 2.65*10^5 rounds down, to 2.6*10^5. The theory is that this more or less evenly distributes upward and downward rounds of exact halves, which might reduce the overall error in a computation that involves many roundings.
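If you’d like to watch banker’s rounding without writing an emulator, the nearbyint function (standardized in C99 and later adopted by C++, so it postdates this column) rounds according to the current rounding mode, which by default is round to nearest:

#include <cmath>
#include <cstdio>

int main()
{
    // Exact halves go to the neighbor whose low digit is even.
    std::printf("%g %g %g %g\n",
                std::nearbyint(0.5),    // 0, not 1
                std::nearbyint(1.5),    // 2
                std::nearbyint(2.5),    // 2, not 3
                std::nearbyint(3.5));   // 4
    return 0;
}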

Using Rounding Modes

What can we do with all these rounding modes6? For most computations, use the default -- round to nearest gives good results. Round toward positive infinity and round toward negative infinity are used to implement "interval arithmetic," where you do the same computation twice, one time adjusting the rounding mode to produce an upper bound for the result, and the other time adjusting the rounding mode to produce a lower bound7. Round toward zero can be used to truncate a floating point value when converting it to an integral value, as required by both C and C++.
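The last of these is easy to see in ordinary code, since the conversion from floating point to an integral type that C and C++ perform behaves exactly like round toward zero:

#include <cstdio>

int main()
{
    // Conversion to an integral type discards the fractional part, truncating
    // toward zero for both positive and negative values.
    std::printf("%d %d\n", (int)2.7, (int)-2.7);   // prints "2 -2"
    return 0;
}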

Unfortunately, if you want to explore the effects of rounding modes in your own code you have to dig into how (or whether) your compiler supports setting rounding modes. Neither C nor C++ has any standard way of doing this. If you’re using Borland’s or Microsoft’s compilers, read the documentation for _control87. On UNIX systems you may have to hunt around. Sun provides fpsetround and fpgetround; gcc for Linux on Intel hardware provides two macros, _FPU_SETCW and _FPU_GETCW.
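For the record, C99 added a portable interface in <fenv.h>, which C++ later picked up as <cfenv>; compiler support was scarce when this column was written. Here is a minimal sketch of the interval idea from the previous section built on that interface. As footnote 7 warns, a real interval package needs considerably more care, and a strictly conforming program also needs the STDC FENV_ACCESS pragma before the compiler promises to honor mode changes:

#include <cfenv>     // fegetround, fesetround, FE_UPWARD, FE_DOWNWARD
#include <cstdio>

int main()
{
    // Note: an optimizer that ignores the rounding mode may fold these sums
    // at compile time; see the caveats mentioned above.
    const double values[] = { 0.1, 0.2, 0.3 };
    const int old_mode = std::fegetround();

    std::fesetround(FE_DOWNWARD);                    // every addition rounds down
    double lo = 0.0;
    for (int i = 0; i < 3; ++i) lo += values[i];

    std::fesetround(FE_UPWARD);                      // every addition rounds up
    double hi = 0.0;
    for (int i = 0; i < 3; ++i) hi += values[i];

    std::fesetround(old_mode);                       // restore the previous mode
    std::printf("%.17g <= sum <= %.17g\n", lo, hi);
    return 0;
}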

Implementing Rounding

Implementing these rounding modes in the emulator that we looked at last month requires a couple of changes. Listing 1 has the new class definition. It adds an enum to specify the rounding mode, two functions, set_round and get_round, to set and retrieve the rounding mode, and a private function, round, to perform the rounding. Listing 2 has the code for the two functions normalize and round. I’ve changed normalize a bit to simplify the code. Instead of working with a signed fraction all the way through, it starts out by setting the is_neg member variable and converting the fraction to its absolute value. In addition, the code that it used to have to handle rounding has been replaced by a call to the member function round8.

At the point in normalize where round is called the fraction holds a 30-bit fractional value which will later be truncated to 15 bits. In the function round we adjust the bits in that fraction so that truncating it will produce the correctly rounded result. The simplest form of rounding here is round toward zero. As you can see from the code, it doesn’t do anything -- truncating the fraction produces the correct result. The other rounding modes are a bit more complicated. As we saw earlier, to round toward positive infinity we simply truncate negative values. For positive values, if the part of the fraction that we’re going to discard does not equal zero we have to add one to the low bit of the part of the fraction that we’re going to keep. The code for the rounding mode rm_up does this with the help of the function low_bits, which returns a copy of the bits that are going to be discarded. Rounding toward negative infinity is similar: positive values are simply truncated, and negative values are incremented if the part of the fraction that will be discarded does not equal zero.

The code for round to nearest is the most complicated. It begins by checking whether the top bit that is going to be discarded is set. If not, the part that is going to be discarded is less than one half, and truncating the fraction will produce the right result, so round doesn’t need to make any changes. If that top bit is set and no other low-order bit in the part that is going to be discarded is set then the discarded part is exactly one half. In that case the code checks whether the low bit of the part that is going to be kept is zero. If so, truncating will produce the correctly rounded result, and round doesn’t need to make any changes. Finally, if neither of the previous cases applies, the value of the part of the fraction that will be kept must be adjusted to the next higher value.
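The actual code is in Listing 2, but here is a rough sketch of the logic just described, with guessed names and widths (a 30-bit magnitude in frac, the low 15 bits about to be truncated); it illustrates the decisions rather than reproducing the listing:

enum round_mode { rm_nearest, rm_down, rm_up, rm_zero };

// frac holds the 30-bit magnitude; the caller truncates the low 15 bits afterward.
unsigned long round_fraction(unsigned long frac, bool is_neg, round_mode mode)
{
    const unsigned long low  = frac & 0x7FFFUL;   // bits that will be discarded
    const unsigned long ulp  = 0x8000UL;          // one in the lowest kept bit
    const unsigned long half = 0x4000UL;          // top bit of the discarded part

    switch (mode)
    {
    case rm_zero:                                 // truncation needs no adjustment
        break;
    case rm_up:                                   // toward positive infinity
        if (!is_neg && low != 0)
            frac += ulp;
        break;
    case rm_down:                                 // toward negative infinity
        if (is_neg && low != 0)
            frac += ulp;                          // the magnitude grows for negative values
        break;
    case rm_nearest:                              // round half to even
        if (low > half || (low == half && (frac & ulp) != 0))
            frac += ulp;
        break;
    }
    // A carry out of the top bit here would force the caller to renormalize.
    return frac;
}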

The program in Listing 3 demonstrates all four rounding modes. The function show first produces two carefully contrived values whose sum is an odd number plus a fractional part whose low order bits have a value of exactly one half. It adds them together and displays the result, then adds their negations together and displays the result. By calling show with each of the four rounding modes set we can see how rounding modes affect the resulting values. Running this program produces the following output:

round to nearest
3.00024 (+,2,6002)
-3.00024 (-,2,6002)

round toward negative infinity
3.00012 (+,2,6001)
-3.00024 (-,2,6002)

round toward positive infinity
3.00024 (+,2,6002)
-3.00012 (-,2,6001)

round toward zero
3.00012 (+,2,6001)
-3.00012 (-,2,6001)

The first thing to notice in this output is the fraction values, displayed in hexadecimal as the rightmost element of the parenthesized portion of each line. Note that there are only two different values for the fraction in all eight results. That’s because the original values were set up to produce a result whose full precision fraction falls exactly halfway between those two hex values. That is, the fraction, before rounding, had the hex value 6001.8. Rounding produces either the value 6001 or 6002. Rounding affects only the low order bit in the result.

The second thing to notice is that the decimal values that the code displays differ by .00012, depending on how the rounding is being done. That’s the result of converting a binary value into decimal: small differences in bits can show up as apparently large differences in decimal digits. But that’s really no different from displaying the decimal value of 1/2 as 0.5 and displaying the decimal value of 1/2 + 1/16 as 0.5625. Those low order bits just don’t map very well into decimal digits.

Effects of Rounding

Since rounding only affects one bit of the result, it follows that doing several rounding operations in the course of a computation only affects a few low bits of the result. In the examples we’ve just looked at, each result was within half a bit of the correct value. If we did a second addition it could increase the error by half a bit, or it could cancel the previous roundoff error and produce the correct result. Since we need to know how far off the computed value can be from the actual value, we take the worst case, and assume that every floating point operation can increase the cumulative roundoff error by half a bit. So ten additions can produce a result that’s off by five bits. In our simple emulator that’s a large error, but if you have 53 bits of precision, as in most implementations of doubles these days, it’s not a big deal.

But don’t let this make you complacent about roundoff errors. They can be magnified by what may look like a very safe operation. In the examples above, suppose we then subtracted 3.00012 from each of the results. Half of those subtractions would produce zero, and half would produce .00012. Interestingly, none of those subtractions involves any rounding. The difference in the results arises solely from the rounding that was done earlier. Although cumulative rounding errors cause only a small variance in the computed result, subtracting values that are very close together magnifies these small variances into much more noticeable differences. So be careful when you’re subtracting floating point values9: if their values are close together the result may be meaningless.
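The same thing is easy to reproduce with ordinary doubles, assuming IEEE-754 arithmetic: two values that agree to sixteen decimal digits, whose difference consists of nothing but earlier rounding:

#include <cstdio>

int main()
{
    double a = 0.1 * 3.0;    // stored as roughly 0.30000000000000004
    double b = 0.3;          // stored as roughly 0.29999999999999999
    double diff = a - b;     // the subtraction itself is exact...

    // ...but everything it leaves behind is earlier conversion and rounding
    // error; the "true" answer is 0.
    std::printf("%.17g\n", diff);   // prints about 5.55e-17
    return 0;
}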

Another problem occurs when you add two numbers of very different magnitudes. If you look at the add function in Listing 4 (it’s the same as last month’s code) you’ll see that if the two exponents are too far apart it doesn’t bother adding the smaller number, and just returns the larger one. That’s an optimization; the code could have done the addition anyway, but the rounding code would have discarded any bits that the smaller number contributed. To see this more clearly, look at what happens when we add 1.0 and 0.0005. The result is 1.0005, and if we round 1.0005 to two significant digits we end up with 1.0, right back where we started. So be careful when you’re adding floating point values: if their values are far apart the result may not be affected by the smaller value.
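The same absorption shows up in doubles whenever the smaller operand falls below half a unit in the last place of the larger one; a minimal example, again assuming IEEE-754 doubles:

#include <cstdio>

int main()
{
    double big   = 1.0e16;   // adjacent doubles are 2.0 apart at this magnitude
    double small = 0.5;      // well below half of that spacing
    std::printf("%d\n", big + small == big);   // prints 1: the 0.5 vanished
    return 0;
}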

Coming Up

Next month we’ll look more carefully at what happens when things go wrong. We might produce a value that’s too small to fit in our binary representation, but that isn’t zero. We might produce a value that’s too large to fit in our binary representation. We might do something that simply doesn’t make sense, such as taking the square root of a negative number. For many compilers the effect of some of these errors is to abort the program. But the IEEE-754 specification supports other ways of handling these problems, and robust floating point code can continue the computation and often get the right result if it takes these possibilities into account. We’ll look at gradual underflow, infinities, and NaNs, as part of an examination of floating point exceptions.

1. Except, perhaps, in Java, where a just-in-time compiler sometimes produces different results from those produced by interpreted code. See http://www.naturalbridge.com/floatingpoint/index.html.

2. Numerical analysts regard input conversion errors as roundoff errors. For many purposes that’s useful, but for our purposes it’s clearer to talk about them separately.

3. Of course, this assumes that we know the magnitude of the measurement errors that we are dealing with. If we’re writing a general purpose library we need to do all we can to eliminate errors introduced by the computation itself.

4. Java requires this approach. I downloaded a C library (available at http://cm.bell-labs.com/netlib/fp/index.html) that does this, ported the code to Java, and spent three weeks getting the ported code to do the same thing as the original C code. The resulting Java code is about 1500 lines long, while the simpler but less precise Java code that it replaced was about 50 lines long.

5. Yes, I know that your calculator says that the square root of 0.17 is .412310562562. One significant digit of input should result in one significant digit of output, so round the result to 0.4.

6. Actually, four rounding modes isn’t all that many. Java’s BigDecimal class has eight rounding modes.

7. This is more complicated than simply setting the rounding mode to round toward positive infinity for one pass and to round toward negative infinity for the other pass. See, for example, "Borneo 1.0.2, Adding IEEE 754 Floating Point Support to Java", section 6.7.5.2 (available at http://www.cs.berkeley.edu/~darcy/Borneo).

8. The full code for this version of the emulator is available in fp.zip.

9. Or, of course, adding a positive and a negative value.