One way to tell whether a piece of code was written by a beginner or by an experienced professional is to look at how it handles errors. Beginners often start by writing code that implements the desired algorithm and then, if they think of it, adding code to handle errors. The result can be a hodge-podge of ad hockeries: abandoning a computation when an input error produces noticeable symptoms, or attempting to patch a computation that has gone astray, rather than detecting early the input conditions that will lead to problems later and limiting the damage before the real work starts. Here in The Journeyman’s Shop we’ve learned that error handling is often the most difficult part of implementing an algorithm. We’ve learned that spending time identifying potential sources of problems, and deciding how to handle them, before we start writing code reduces the time we spend implementing and debugging an algorithm. We’ve also learned that analyzing possible problems before coding can help us understand an algorithm better, sometimes producing insights that result in much simpler and cleaner code for the algorithm itself. We all enjoy writing code -- it’s the most obviously productive part of our job -- and we look forward to the feeling of satisfaction that we get when our code passes all of the rigorous tests that we’ve prepared for it. Experience teaches us that a furious rush toward that end can lead to disappointment, while a more deliberate pace, postponing completion until we’ve built a sound foundation, ultimately produces a more satisfying result.

In numerical analysis this rule works overtime. The biggest difficulty is often not writing the code, but deciding what code to write. In an earlier column we looked at some of the sources of errors in floating point computations. A numerical analyst has to understand those sources of errors and use that understanding to design a computation so that those errors won’t invalidate the result. I’m not qualified to do that, so I won’t try to discuss it in detail. But I do want to give one example, to emphasize how easily numerical computations can go astray if we aren’t careful. Back in high school we all learned the closed form solution for the roots of a quadratic equation. Given an equation of the form

```
ax^2 + bx + c = 0
```

we can find its roots with the following formulae:

```
x1 = (-b + sqrt(b^2 - 4ac))/2a
x2 = (-b - sqrt(b^2 - 4ac))/2a
```

We also learned that if `4ac` is greater than `b^2` the equation has no real roots. Obviously, if we are going to write code to compute the roots of a quadratic equation we have to watch out for these cases. In Java we might be tempted to simply ignore the problem, and rely on the square root function to detect the attempt to take the square root of a negative number and throw an exception. In C and C++ we don’t have that option. The behavior of the `sqrt` function is undefined when it is called with a negative number, so we have to write code to detect this case and handle it appropriately:

```
double disc = b*b - 4 * a * c;
if (disc < 0)
    printf("no real roots\n");
else
{
    /* continue the computation */
}
```

That’s also the right solution in Java; it makes the control flow clear, and doesn’t abuse the exception mechanism by using it to handle ordinary, non-exceptional conditions.

The possibility of having no real roots is inherent in solving quadratic equations. When we try to solve them using floating point math we introduce an additional set of complications that don’t come with the original problem. Rather, they arise from our use of limited precision representations of values that often cannot be expressed exactly in a finite number of bits. That is, we run into the roundoff errors that we discussed earlier. The simplest example is the case where the equation has only one root. In that case, as you probably remember, the value of the discriminant, `b^2 - 4ac`, is zero. But that’s only true in the theoretical world of abstract algebra. In the concrete world of finite representations of floating point values, the values of `a`, `b`, and `c` have been rounded to fit in floating point variables, so they might not have the exact values that theory would indicate. Further, the multiplications that we have to make to calculate the discriminant can introduce additional roundoff errors. By the time we subtract `4ac` from `b^2` we have probably introduced enough errors into those two values that the result of the subtraction will not be zero. So our algorithm will probably erroneously conclude not that we have a single root, but that we have either no roots or two roots.

Even if `b^2` is much larger than `4ac` we can still run into problems. There’s another subtraction here which could magnify roundoff errors and produce incorrect results. If `b` is positive, then the numerator in the calculation of `x1` is equivalent to

```
sqrt(disc) - b
```

and if `sqrt(disc)` and `b` are fairly close in value, roundoff errors can easily dominate the result. Similarly, if `b` is negative, the numerator in the calculation of `x2` is equivalent to

```
-b - sqrt(disc)
```

a difference of two positive values, and again roundoff errors can easily dominate the result.

The important thing to notice here is that one of the two results can only be obtained by subtracting two numbers that may be close together. So numerical analysts usually don’t use this formula for solving quadratic equations. They use other approaches. One of those approaches is to divide both sides of the original equation by `a`, so that its coefficients look like this:

```
x^2 + 2bx + c = 0
```

Note that we’ve also expressed the coefficient of `x` as `2b`, rather than `b`. That is, the value of `b` that we’ll use in solving this equation is the coefficient of `x`, divided by 2. Now the two solutions can be written as:

```
x1 = -b + sqrt(b^2 - c)
x2 = -b - sqrt(b^2 - c)
```

This doesn’t eliminate the problem we saw earlier in calculating the discriminant, but with the additional observation that the product of the two roots will be `c`, we can eliminate the second subtraction. If `b` is negative, we first get the value of `x1` from our formula. This involves the addition of two positive numbers, so it avoids the magnification of roundoff errors that subtraction produces. Once we have that result, we divide it into `c` to get the value of `x2`. Similarly, if `b` is positive, we first get the value of `x2` from our formula, and divide it into `c` to get the value of `x1`. In both cases we’ve avoided one of the subtractions that could have led to erroneous results^{1}.

Sometimes the best approach to a calculation is simply to do it and check afterwards whether the result is sensible. For example, we may know two different ways of computing the result we’re looking for, one of which is fast when it works, but can end up dividing by zero for some input values, and the other of which is much slower, but works for all input values. In that case we might decide to try the first one, and if it fails, drop back to the second. To do this our code needs to be able to detect that a division by zero occurred, so that it can switch to the slower method.

If you try this with the floating point emulator that we’ve been building in the past few columns, you’ll see that it doesn’t handle this sort of situation gracefully. Giving the four-function calculator an expression that divides by zero terminates the program:

```
d:> 4bang 1.0/0.0
floating point error: divide by zero
d:>
```

That doesn’t help much if you need to detect the failure in order to
try another approach. So these days, many floating point systems provide
less abrupt failure mechanisms. Most of these follow, more or less, the
specifications set out in the IEEE Standard for Binary Floating-Point
Arithmetic (IEEE-754)^{2}. The IEEE
specification provides for error handling through exceptions, flags,
traps, and special floating point values that indicate that errors have
occurred. In the remainder of this column we’ll see how these error
handling mechanisms are defined, by looking at how they can be
implemented in our floating point emulator.

A floating point exception is not the same thing as an exception in
C++. They just happen to have the same name, and a certain superficial
similarity, since they both deal with computations that have somehow
gone astray. Don’t confuse the two terms: a floating point exception has
nothing to do with a C++ exception^{3}.

A floating point exception occurs when something goes wrong in a floating point computation. IEEE-754 defines five types of floating point exception: invalid operation, division by zero, overflow, underflow, and inexact. For each of these exception types you can install your own handler, known as a "trap handler," which will be called whenever an exception of the corresponding type occurs. If you don’t install your own trap handler, the default behavior of an IEEE-754 implementation is to set a flag indicating the type of exception that occurred, and, depending on the type of exception, to produce a special value that indicates that an exception occurred. The flags are sometimes known as "sticky bits." They are set when an exception occurs, but floating point computations never clear them. That means that when a computation has finished you can check the flags to see if any exceptions occurred.

With an updated emulator that supports IEEE-754 exceptions, the four-function calculator produces the following output for division by zero:

```
d:> 4bang 1.0/0.0
Inf (+,128,0)
d:>
```

That is, the result of `1.0/0.0` is the special value `Infinity`. Unlike the previous version of the emulator, dividing by zero does not terminate execution, so we can continue with our computation without having to know that we divided by zero:

```
d:> 4bang 1.0/0.0+1.0
Inf (+,128,0)
d:>
```

With that in mind, let’s look at the code that implements the default behavior of the five IEEE-754 categories of exceptions. Listing 1 is the new version of the header file fp.h. There’s been quite a bit added to support floating point exceptions. The enumerated type `fp_exception`, which indicates the type of the exception that occurred, has been moved from the private section to the public section. It is followed by a type declaration, to simplify trap handling, then five functions that deal with flags and two functions that deal with traps. We’ll look at those in the next column. The private section of class `fp` now has an additional enumerated type, `fpx`, and the static member function `exception`, which is called whenever an exception occurs, now takes a second argument, of type `fp`, and returns a result, also of type `fp`. These are followed by seven new static data members that are a convenient way of getting the special floating point values that we’ll be looking at later.

The static member `exc_hndlrs` is an array of pointers to functions. When an exception occurs, the function `exception` uses the type code to generate an index into this array and call the corresponding exception handler.

The class `fp` also has five static member functions that implement the default handling for the five exception types. Their names take the form `ex_XXX`, where the letters `XXX` are replaced by a three-letter code for the exception type that the particular function handles. The class also declares five friend functions, with names of the form `def_XXX`, where `XXX`, again, is a code for the exception type that the function handles. This two-layer system makes trap handling simpler. The member functions `ex_XXX` implement the default trap handling. Since user-installed traps cannot be members of the class `fp`, the trap handling mechanism must be able to manage non-member functions. That’s the reason for the `def_XXX` functions: they are installed as the default trap handlers, and each of them simply calls the corresponding `ex_XXX` function. Since the function `exception` does not have to distinguish between default handling and user-installed trap handlers, it is very simple to write:

```
fp fp::exception(fpx except, fp arg)
{   // handle floating point exceptions
    return exc_hndlrs[except]((fp_exception)(1 << except), arg);
}
```

The handling of the inexact exception is the simplest: when rounding drops low bits from a value the inexact flag is set. Here’s the code that does this:

```
fp fp::ex_ine(fp::fp_exception except, fp arg)
{
    set_flag(except);
    return arg;
}
```

The function `set_flag`, as its name suggests, sets the flag that indicates that an exception of the specified type occurred. All of the default exception handlers call this function. After calling `set_flag`, this handler just returns the value of its second argument. The value that an exception handler returns will be used in the floating point computation in place of the value that triggered the exception. In this case we want to use the rounded value that has already been computed, so the exception handler simply returns that value.

On division by zero, the default behavior is to set the flag and return positive or negative infinity, depending on the signs of the operands in the division. The default handler looks like this:

```
fp fp::ex_dbz(fp::fp_exception except, fp arg)
{
    set_flag(except);
    return arg.is_neg ? ninf : pinf;
}
```

This code relies on a trick that doesn’t conform to the IEEE-754 standard: the value passed in arg has its sign set according to the sign of the desired result. The function uses that sign value to determine the sign of the infinite value that it returns.

On overflow, the default behavior is to set a flag and to return a value that depends on the sign of the result and on the rounding mode. When the rounding mode is set to round toward zero, overflows produce the largest finite representable value with the appropriate sign. When the rounding mode is round down, an overflow in the positive direction produces the largest positive finite representable value. When the rounding mode is round up, an overflow in the negative direction produces the largest negative finite representable value. In all other cases an overflow produces an infinity with the appropriate sign. The code for the default overflow handler looks like this:

```
fp fp::ex_exo(fp::fp_exception except, fp arg)
{
    set_flag(except);
    return rm == rm_nearest
        ? arg.is_neg ? ninf : pinf
        : rm == rm_zero
        ? arg.is_neg ? nmax : pmax
        : rm == rm_down
        ? arg.is_neg ? ninf : pmax
        : arg.is_neg ? nmax : pinf;
}
```
```

Underflow requires us to introduce a new concept. In the previous versions of the emulator, when an arithmetic operation produced a value whose exponent was too small to represent, we converted the value to zero. IEEE-754 provides the notion of gradual underflow: when an exponent is too small to represent, the value is represented with the smallest representable exponent and a denormalized fraction, that is, a fraction in which the high bit is not one. In our emulator, the smallest representable exponent is -127. The smallest normalized value has an exponent of -127 and a fraction of 0x4000. If we divide this value by two, since we can’t decrease the exponent below -127, we decrease the fraction by dividing it by two, producing a value with an exponent of -127 and a fraction of 0x2000. The smallest positive non-zero value that can be represented has an exponent of -127 and a fraction of 1. If we divide this value by two we end up with an exponent of -127 and a fraction of 0, which is the representation for the value zero.

The thing to keep in mind when dealing with denormals is that when the fraction has high-order bits equal to zero it has fewer significant bits than a normalized value. So gradual underflow is accompanied by gradual loss of precision. Sometimes that’s okay. If the low bits that are discarded in producing the denormalized fraction are all zero, then the denormalized fraction hasn’t actually lost any bits, and subsequent computations using this value won’t be compromised by the denormalization. Consequently, an underflow exception is generated only when the result of a computation is too small to represent in normalized form, and some low order non-zero bits were discarded in producing the denormalized value. The underflow handler is called with a floating point value whose exponent is 192 more than the exponent that would have been generated, and a normalized fraction. That is, the value that the handler gets is 2^192 times the correct value. This value can be represented with full precision in a normal floating point value. Our default handler simply removes this factor of 2^192 to produce the actual exponent, then shifts the fraction, denormalizing it:

```
fp fp::ex_exu(fp::fp_exception except, fp arg)
{
    set_flag(except);
    int ex = arg.exp - 192;
    int fr = arg.frac >> -ex;
    return fp(0 - BIAS, fr, arg.is_neg);
}
```
```

Finally, the invalid operation exception introduces another special value, `NaN`, short for "not a number." These come in two forms: signaling `NaN`s and quiet `NaN`s. Floating point computations never produce signaling `NaN`s. They only arise if you write code to create them. We’ll see later how `NaN`s propagate through floating point computations. The default handler for an invalid operation exception sets the flag to indicate that this exception occurred and returns a quiet `NaN`. It looks like this:

```
fp fp::ex_inv(fp::fp_exception except, fp)
{
    set_flag(except);
    return qnan;
}
```

Now that we’ve seen the code that handles floating point exceptions we need to look at where these functions are called. The code changes from the old emulator are mostly tedious additions to the member functions `fp::add`, `fp::operator *=`, `fp::operator /=`, and `fp::normalize`.

The new version of `fp::add` is in Listing 2. The code begins by testing whether either argument is a `NaN`. If so, the result of the addition is also a `NaN`. Then it checks for addition of infinities of opposite signs. Such an operation doesn’t have any sensible result, so it generates an invalid operation exception. As we saw earlier, the default behavior for an invalid operation exception is to return a `NaN`. Next, the function checks whether its first argument is an infinity. If so, it returns that value. If not, it checks whether its second argument is an infinity. If so, it returns that value. Finally, if none of those special cases has occurred, it falls through into the addition code for ordinary floating point values.

The order of these tests is important. Any arithmetic operation on a `NaN` must produce a `NaN`^{4}, so we check for `NaN` values first. Any operation involving infinities but no `NaN`s must produce either a `NaN` (for addition of infinities of opposite signs) or an infinity, so we check next for infinities. Once we’ve dealt with any `NaN`s and infinities we’re left with two ordinary floating point values. We don’t have to look for denormals here -- ordinary arithmetic operations work correctly with denormals.

The new version of `fp::operator *=` is in Listing 3. It begins by converting its arguments with the macro `ARG`. This macro checks whether its argument is a signaling `NaN`, and if so, raises an invalid operation exception^{5}. As we saw earlier, the default result of an invalid operation exception is a quiet `NaN`^{6}. After these adjustments, the code checks whether either of the function’s arguments is a `NaN`, and if so, assigns that `NaN` to the return value. If neither argument is a `NaN`, the code checks for the two cases of infinity times zero and zero times infinity. These trigger an invalid operation exception, resulting, again, in a `NaN`. If neither of these cases has occurred, the code checks whether either argument is zero. If so, it returns a value of zero with the appropriate sign. Next it checks whether either value is infinity, and if either one is, it returns a value of infinity with the appropriate sign. Otherwise it does the multiplication and returns the result.

The code for `fp::operator /=` is in Listing 4. It looks quite a bit like the code for `fp::operator *=`. The first difference is in the tests that produce an invalid operation exception. For multiplication, zero times infinity is an invalid operation. For division, infinity divided by infinity and zero divided by zero are invalid operations. The second difference is in the handling when the second argument is zero, which triggers a division by zero exception.

Finally, the function `fp::normalize` has been changed a bit from the previous version. The new code is in Listing 5. The function `round` has been changed to return a boolean value indicating whether any low-order bits will be discarded in the rounding. The code saves the result of the call to `round` for later use. The tests for underflow and overflow at the end of the function have to test more cases because of the additional rules imposed by IEEE-754.

The first test checks whether the exponent is smaller than the smallest representable exponent. If so, we have to denormalize the result. If rounding the result will discard non-zero bits we have an underflow exception. If not, we just generate a denormalized value.

The next test checks whether the exponent is larger than the largest representable exponent. If so, we have an overflow exception.

The next test checks whether rounding will discard non-zero bits. If so, we have an inexact exception.

If we get through all those tests, we have the pieces of a properly normalized value, and we simply generate that value and return.

After you’ve downloaded the code in fp.zip, build the four-function calculator and try out some of the edge cases we’ve talked about. You’ll need to compile `4bang.cpp` and `fp.cpp`. The four-function calculator is a bit limited in the form of input that it takes, so you sometimes have to be a little tricky to hit the edge cases. But with a little ingenuity and some carefully placed parentheses you can generate anything you’d like. Here are a few things you might try.

To generate infinity:

```
d:> 4bang 1.0/0.0
Inf (+,128,0)
d:>
```

To generate negative infinity:

```
d:> 4bang -1.0/0.0
-Inf (-,128,0)
d:>
```

To generate `NaN`:

```
d:> 4bang 0.0/0.0
NaN (+,128,1)
d:>
d:> 4bang (1.0/0.0)/(-1.0/0.0)
NaN (+,128,1)
d:>
```

### Next Month

Next month we’ll wrap up this exploration of floating point math, looking at user-defined trap handlers and the changes made to the C language definition by the recent C99 revision.

**1.** This approach is taken from Forman S. Acton, "Real Computing Made Real," Princeton University Press, 1996, ISBN 0-691-03663-2, p. 5.

**2.** Available at shop.ieee.org/store/.

**3.** Some compilers turn floating point exceptions into C++ exceptions, but this is not allowed by the C++ language definition.

**4.** We’ll see next month that comparisons of values
involving NaNs produce boolean results.

**5.** This doesn’t conform to the IEEE-754 standard, because it ends up generating two exceptions when an argument to `operator *=` is a signaling `NaN`: one when the signaling `NaN` is converted to a quiet `NaN`, and another when the code later tests for a `NaN`.

**6.** The same check is performed in `fp::operator+` and `fp::operator-` before these functions call `add`.

Copyright © 2000-2006 by Pete Becker. All rights reserved.