Floating-point Basics

The C/C++ Users Journal, June, 2000

Programmers mostly fall into one of three categories in their understanding of floating point math: There are some who don’t know enough about it to recognize that its results are not completely reliable; there are some who know just enough about it to think that its results are never reliable; and there are a few who understand it thoroughly and know exactly how reliable it is. As Donald Knuth puts it:1

There’s a credibility gap: We don’t know how much of the computer’s answers to believe. Novice computer users solve this problem by implicitly trusting in the computer as an infallible authority; they tend to believe that all digits of a printed answer are significant. Disillusioned computer users have just the opposite approach; they are constantly afraid that their answers are almost meaningless.

Here in The Journeyman’s Shop we try to fit ourselves into yet another category: those who know enough about floating point math to understand and avoid the most common pitfalls, and to recognize the times when our knowledge isn’t sufficient to produce reliable results, the times when we need help from an expert.

I’m not an expert on floating point math. On a good day I can tell you what `ulp’ means2, but if someone starts explaining how to get an additional half an ulp in the computation of atanh(x) when x is close to 1 I’ll nod and try to keep my eyes open while imagining that I’m on a beach soaking up the sunshine. Don’t misunderstand me: I admire people who know enough to be able to do this sort of analysis. But the intimate details of producing maximally accurate results are not something that I need to know for the work that I do, and they’re complex enough that they aren’t something that I want to dig into for recreation3. For most of us it simply isn’t worthwhile to spend the time needed to truly understand floating point math. Nevertheless, we need to know something about it in order to use it with a reasonable likelihood of success. From Knuth, once again:4

[E]very well-rounded programmer ought to have a knowledge of what goes on during the elementary steps of floating point arithmetic. This subject is not at all as trivial as most people think, and it involves a surprising amount of interesting information.

This month we’ll begin by looking at examples of some of the ways that floating point math can go astray, and then, in the interest of well-roundedness, we’ll look at the implementation of a class that supports floating point math. From time to time in future columns we’ll come back to that class to improve our understanding of some of the subtleties of floating point math.

Floating Point Values are Often Inexact

Beginners are often surprised by the behavior of code like this:

for (float index = 0.0; index != 1.0; index += 0.1)
    printf("%f\n", index);

They expect it to loop ten times and then stop. It doesn’t, of course; it loops forever, or until they run out of patience and kill the program. And if they’ve thought to capture the output in a file, they see that it displayed exactly the values that they expected:


Naturally, they want to know why the loop didn’t stop when index reached 1.0. Most of us know the answer: The increment value, 0.1, cannot be represented exactly in a binary floating point value, so each time through the loop the value of index increases by an amount that’s close to but not equal to 0.1. On the tenth time through the loop its value isn’t exactly 1.0, so the test for equality fails. The way to "fix" this is to change the loop condition from index != 1.0 to index < 1.0. Of course, that’s only half the answer: printf seems to think that the value of index the tenth time through the loop is exactly 1.0.

Floating Point Values are Often Rounded for Display

We all know that when we divide two positive integers the result is truncated: the result of the division is the largest integer that’s less than or equal to the result that we’d get if we used real numbers instead of integers. So nobody except a rank beginner is surprised that one divided by three is zero. Integers represent only the integral part of a value, and any fractional part is simply discarded.

The same sort of thing happens in floating point math. The discarding is more subtle, so we often don’t notice that it is happening. But consider the same division performed with floating point math: one divided by three is 0.333333, with however many threes are supported by the floating point package that is being used. We’re used to thinking of this sort of number as a shorthand representation for an infinitely repeating decimal, so it doesn’t surprise us when we multiply that result by three and get back our original value of one. Try it on your calculator.

Despite what the calculator shows, the result of the multiplication is not one. The calculator is rounding the result to the number of digits are being displayed, and when you round 0.999999 to a shorter representation it comes out as 1.0. If that’s what your calculator is showing you, try increasing the number of digits that it displays to the right of the decimal point. If I set my HP-48GX to display ten digits to the right of the decimal point it shows the result as 1.0000000000E0. If I increase it to 11 digits it shows as 9.99999999999E-1. If your calculator won’t display enough digits to show that the result really isn’t one, try subtracting one from the result.

Once you get the rounding out of the way, it’s clear that the result is actually 0.999999, with however many nines are supported by the floating point package in the calculator. And if we remember that the result of the original division is not an infinite sequence of threes, but a truncated sequence of threes, limited by the resolution of the calculator’s floating point math, then the result makes sense: (1.0/3.0)*3.0 is 0.999999, not 1.0. When you’re looking at the results of a floating point computation, don’t let rounding in the output fool you. The actual answer may be somewhat different from what is being displayed.5

Representing Floating Point Values

To understand how floating point values are represented internally, it helps to think about what’s commonly known as "scientific notation". Very large and very small values are represented by showing the digits of their values, and a power of ten to multiply the digits by. For example, Avogadro’s number is 6.023*10^23. The convention here is to reduce the digits to a value that’s greater than or equal to one and less than ten, and to adjust the exponent accordingly. We could, if we wanted to, use 60.23*10^22 for this value, but that would raise eyebrows because it is not in the usual form.

In a floating point math package the same basic convention is applied. A number is represented as a set of digits and an exponent. The tricky part in understanding floating point representations is that the base for the digits and the exponent varies from system to system. Calculators typically use base 10, so conversion problems such as the internal representation of 0.1 that we looked at earlier rarely come up because the internal representation of the value is in the same form as the external one. Most floating point packages on computers, however, use some form of binary base, most often base 2. That is, the digits in the value must be converted to base 2, and the exponent must be converted from a power of 10 to the appropriate power of 2. Further, the usual convention for a floating point math package is that the digits are normalized so that their value is greater than 1/2 and less than 16. So the value 1.0*10^1 translated to a binary floating point value would be .101*2^4.

In order to be able to manipulate floating point values efficiently, a floating point package packs the parts of the value into a few bytes of memory. Intel chips, for example, support 32-bit, 64-bit, and 80-bit floating point values. These are stored in 4, 8, and 10 bytes, respectively, which makes them easy for the hardware to manipulate. Assignment between floating point values, for example, doesn’t require any knowledge of how the parts of the value are laid out in the memory block. Copying the block is all that’s needed. Arithmetic, on the other hand, requires pulling out the various parts of the value so that they can be manipulated separately. The details of the internal representation are implementation-specific -- the C and C++ standards do not impose requirements on how the parts of a floating point value should be laid out. You can get at those parts through the function frexp, but what you get from frexp does not necessarily have exactly the same bit patterns as the internal representation of the parts.

For the 32-bit representation that Intel uses, the high bit contains the sign of the value. A 0 means that the value is positive, and a 1 means that the value is negative. The low 23 bits contain the fractional part, that is, what I’ve been calling the "digits" in the previous paragraphs. There’s a quirk here, though, that can fool you. Remember that the digits are normalized, so that their value is always between ½ and 1. That is, the high bit in the digits will always be 1. In Intel’s internal representation this bit is not stored. The floating point package puts it back in when it’s needed in a computation. So the 23-bit fractional part actually gives 24 bits of precision. One more thing to keep in mind: the bits in the fractional part are the same for a positive value and its negative counterpart. Only the sign bit changes7.

Finally, the exponent is represented in the remaining 8 bits in biased form: 127 is added to the actual value. So an exponent of 0 is represented in the 32-bit format as the value 127, an exponent of 1 is 128, and an exponent of -1 is 126. To convert an internal value back to an actual exponent subtract 127 from the internal value. Since the largest value that can be stored in an 8-bit field is 255, the largest possible exponent in this 32-bit representation is 128. The smallest possible exponent is -127. These exponents will only rarely show up in normal computations, however. An exponent of 128 is used to represent infinite values (both positive and negative), and invalid values, known as NaNs, which stands for Not a Number. An exponent of -127 with a fraction of 0 represents the value 0. All other values with an exponent of -127 are denormalized values. We’ll talk about Infinities, NaNs, and denormalized values some other time. For now, the possible values of the exponent range from -126 to 127, and an exponent of -127 for the values 0.0 and -0.0.

A Class to Represent Floating Point Values

Listing 1 contains the class definition for the class fp, which provides basic floating point operations. It separates the three components of a floating point value in order to make it easier to see how these parts are manipulated. The bool member is_neg indicates whether the value is negative. The unsigned char member exp holds the exponent, stored in biased form with a bias value of 127. The int member frac holds the fractional part containing 15 bits. It does not drop the high order bit, in order to simplify the code a bit.

I’ve chosen these particular types in order to produce a floating point implementation that resembles Intel’s, but can be written in portable C++. For example, the language definition guarantees that an unsigned char can hold values from 0 to 255, so the values of exponents for the class fp are limited to this range, even though some C++ implementations might support a larger range of values in an unsigned char. Similarly, the fraction is limited to 15 bits in order to guarantee that the result of multiplying two fractions are will fit in a long. The goal here is not efficient use of space, but clarity. Please don’t write to me about ways of packing more bits into this class. That’s not its purpose.

Multiplying Two Floating Point Values

Multiplying two floating point values is fairly easy, although there are a couple of edge conditions that we have to be careful of. To see how it’s done, let’s go back to scientific notation and look at how to compute the square of Avogadro’s number:


To get this raw result we multiplied the two digit parts and added the two exponents. To put it back into the usual form, though, we have to do two things: we have to adjust the digit part so that its value is less than 10, and we have to get rid of the extra digits that we got on the end of the value. We only had four significant figures in our original value, and we should have four significant figures in the result. We begin by normalizing the result: we replace 36.276529*10^46 with 3.6276529*10^47. The we round the resulting value to four significant digits, producing 3.628*10^47. We usually do these latter two steps as part of the multiplication itself, so instead of writing out the multiplication as we did above we’d write it like this:


If you look at the implementation of operator *= in Listing 2 you can see how this technique is applied in the class fp. The code begins by checking for zeros. If either value in the multiplication is zero the result is set to zero, with the appropriate sign. Otherwise it computes the exponent of the result by adding the exponents of the two factors and subtracting the bias, and it computes the fractional part of the result by multiplying the fractional parts of the two factors. When it computes the exponent the code has to subtract the bias to compensate for the fact that both exponents are biased. This could also be done by subtracting the bias from each exponent, adding the results, and adding the bias back in, but subtracting the bias is simpler. When it computes the fractional part of the result the code uses the member function lf which combines the fraction with the sign and converts the result to a long. Multiplying the two resulting signed fractions produces a result that fits in a long. The results of this multiplication and the computation of the new exponent are passed to the function normalize, which takes the raw result and converts it into a properly normalized form, just as we did above in scientific notation.

The function normalize is where we deal with various edge conditions. The first two if statements recognize zero values and overflowed fractions. Those don’t occur in multiplication or division. We’ll look at them when we talk about addition. The while loop that follows them normalizes fraction values whose high bit is zero. Again, think about scientific notation: .1 times .1 is .01, which has a leading zero digit. The same thing occurs in binary multiplication. We require that the high bit of the fraction be 1, so we multiply the fraction by 2 and reduce the exponent by 1 as many times as are needed. The next if statement checks whether the highest bit in the part of the fractional value that we’re going to discard is a 1. If so, it rounds the value up or down, according to its sign. Rounding can turn a normalized value into a value that is not normalized. Looking again at scientific notation, consider 9.9999*10^0. If we round it to two figures to the right of the decimal point the result is 10.00*10^0. To make the digits part of this value less than 10 we need to move the decimal point to the left by one position and increase the exponent by 1, producing the value 1.00*10^1. The floating point code does the same thing: if bit 31 is 1, the fractional part is too large to represent in our internal notation, so the code divides it by 2 and increases the exponent by 1. Now that the code has done all the adjustments needed to get the result into normalized form, it checks whether the resulting exponent is too large or too small. If so, it calls the member function exception, which handles the problem by displaying an error message and exiting from the program. In a later column we’ll look at less drastic (but not necessarily better) ways of handling these problems.

Finally, after all these adjustments and checks, the code is ready to store the new exponent and fractional part into the object. There’s one last adjustment needed to the fractional part, though. We started out with two 15-bit fractions, and multiplied them to produce a 30-bit value. All of the normalization code has dealt with this 30-bit value. Before we can store this value into an fp object we have to convert it back into a 15-bit fraction. That’s what the right shift in the next to last line of normalize does.

Dividing Floating Point Values

With the code for normalizing results in place, division is quite easy. It’s done with operator /= in Listing 2. Just like multiplication, the code begins by checking for zeros. A zero numerator produces a result of zero8. If the numerator is not zero but the denominator is, the code calls exception to report division by zero. If neither of the values is zero, the code computes the new exponent and the new fractional part and passes them to normalize for all of the necessary adjustments. The new exponent is the difference between the two exponents, again adjusted for the bias. The computation of the fractional part looks tricky, but it isn’t really. It shifts the numerator left by 15 bits to produce a 30-bit value, then divides that value by the denominator, producing a 15 bit quotient, then shifts the quotient left by 15 bits to produce a 30-bit value to pass to normalize.

Adding Two Floating Point Values

To add two floating point values we have to adjust the one that’s closer to zero so that they both have the same exponent. This means shifting the fractional part of the smaller value to the right and increasing its exponent appropriately. We do the same thing when we use scientific notation. To compute 1.0*10^1 + 1.0*10^0 (that is, 10 + 1), we adjust 1.0*10^0 by dividing its digits by 10 and increasing its exponent by 1 to produce 0.1*10^1. This is not in the usual form, but we’re willing to tolerate that during the addition, so long as the result we get is properly adjusted. Now we finish the addition by adding 1.0*10^1 and 0.1*10^1, producing the result 1.1*10^1.

The code in Listing 2 goes through two layers of functions to perform this additional adjustment. The code in operator += compares the exponents of the two values being added to decide which should be adjusted. It calls the function add with the value that has the larger exponent as the first argument and the value that has the smaller exponent as the second argument.

The function add begins by converting the fraction from the first argument into a 30-bit value. It then looks at the difference between the exponents of the two arguments. If the second argument’s exponent so small that the second argument won’t affect the result it ignores the second argument. Otherwise it gets the fraction from the second argument, shifts it left to produce a 30-bit value, shifts that result right by the difference between the two exponents, and adds the resulting value to the 30-bit value that it got from the first argument. The resulting value is passed to normalize, along with the exponent for the result.

Since the sum of two non-zero numbers can be zero, normalize checks to see whether the fraction that was passed to it is zero. If so, it stores a zero fraction and a zero exponent in the fp object in order to properly represent a value of zero. Otherwise, it checks whether bit 31 in the fraction is on. If so, it divides the fraction by 2 and increases the exponent by 1 to remove the overflow. This occurs when two values that are fairly close together are added. In scientific notation, consider 9.0*10^1 + 8.0*10^1. The result is 17.0*10^1, which must be adjusted to 1.7*10^2. The same thing can occur with the fraction part of a floating point value: adding two 15-bit values can produce a 31-bit value, which has to be adjusted back to 30 bits in order for the succeeding code to work correctly.

After these two adjustments, normalize goes through all of the steps that we looked at earlier. Some of those steps aren’t actually needed for addition, but checking for them doesn’t do any harm9.

Subtracting Floating Point Values

With the code for adding two floating point values in place, subtraction is easy. The code in operator -= uses the identity10 a - b == a + (-b): it negates the second argument, and adds the result to the first argument.


If you want to understand how floating point math works you need to experiment with it. The exploration that we’ve gone through in this column provides a sound basis for analyzing why floating point math produces the results that it does, but until you’ve seen the details in actual computations it’s hard to have a sound grasp of how it works. You can download the code for the class fp from fp.zip. The archive also includes a four-function command-line calculator that you can use to explore floating point computations in more detail than we’ve gone through here. Here are a few things that you might want to look into:

What is the smallest positive value that can be added to 1.0 to produce a value that is different from 1.0?

When you evaluate the expression ((1-x^2)/(1-x)) for values of x that are very close to 1, why do the results differ from the value of the expression (1+x) for the same values of x?

We looked at the expression (1.0/3.0)*3.0, and concluded that its value should be slightly less than 1.0. If you use the class fp to evaluate this expression you get exactly 1.0. Why?

1. Knuth, Donald E., The Art of Computer Programming, vol. 2, Seminumerical Algorithms, 3rd edition, 1998, ISBN 0-201-89683-4, p. 229.

2. No, it’s not the sound of the loud swallow that you make when you suddenly realize that you have no idea what’s going on. It stands for "unit in the last place," and it’s used to describe the accuracy of a floating point result.

3. I’m also sure that their eyes would glaze over if they had to listen to an explanation of some of the things that I think are fascinating

4. Knuth, op. cit., p. 214.

5. Use the C function frexp to see what’s really going on.

6. More generally, the fractional part must have a value that is less than 1 and greater than or equal to 1/b, where b is the base of the floating point package. For a binary package this becomes ¨. For a decimal package it would be 1/10.

7. Yes, this means that 0.0 and -0.0 have different representations. We’ll look at the implications of negative zeros in a later column.

8. This code ought to check for a zero denominator before deciding that the result is actually zero. We’ll look at how to handle 0.0/0.0 in a later column, when we talk about NaN values.

9. Except, of course, for speed.

10. Be careful about assuming that common identities apply to floating point math. For example, if a+b==a+c, it does not follow that b==c. The identity that we’re using for subtraction does work for floating point values, however.