I said last month that one of the things we’d look at this month
would be the support provided for floating point math in the recent
revision to the C language standard^{1}.
But that’s a large subject, and deserves a column or two of its own, so
I’ve decided to put if off for a while. This month we’ll wrap up our
examination of floating point math by looking at trap handlers and
sticky bits, then coming back to roughly where we started, by looking at
techniques for comparing floating point values.

Last month we looked at the five types of floating point exceptions
defined by the IEEE 754 standard: invalid operation, division by zero,
overflow, underflow, and inexact. We also looked at what an IEEE-754
conforming floating point implementation should do when any of these
exceptions occurs. The default exception handling specified by IEEE-754
is suitable for most of what most of us do most of the time. If you’re
in a situation where the default behavior isn’t suitable, though, you
can change what happens when any particular exception occurs by
installing a trap handler for that exception. A trap handler is simply a
function that you write. You install it by calling a function that tells
the floating point package to call your handler whenever an exception of
the corresponding type occurs. In our emulator, trap handlers are
managed by the member functions set_handler and get_handler. The code
for `set_handler`

looks like this:

```
fp::exc_hnd fp::set_handler(
fp::fp_exception except, fp::exc_hnd hnd)
{ // install user-specified exception handler
fpx exc = to_fpx(except);
exc_hnd prev = exc_hndlrs[except];
exc_hndlrs[except] = hnd ? hnd : def_hndlrs[except];
return prev == def_hndlrs[except] ? 0 : prev;
}
```

It takes two arguments. The type of the first is an enumeration,
defined in the class `fp`

, named
`fp_exception`

:

```
enum fp_exception
{ // bit flags to identify exceptions
div_by_zero = 0x01,
exp_underflow = 0x02,
exp_overflow = 0x04,
inexact = 0x08,
invalid = 0x10
};
```

The second is a pointer to a handler. Passing a null pointer for this
argument tells the floating point package to use the default handling
for the specified exception. Otherwise, the pointer must point to a
function whose signature matches the typedef
`fp::exc_hnd`

:

```
typedef fp (*exc_hnd)(fp_exception, fp);
```

The `fp_exception`

value passed to
`set_handler`

is translated into an array index by the
function `to_fpx`

:

```
static inline fp::fpx fp::to_fpx(fp::fp_exception except)
{ // translate exception bit flag to table index
return except & div_by_zero ? dbz
: except & exp_underflow ? exu
: except & exp_overflow ? exo
: except & inexact ? ine
: inv;
}
```

The resulting index is used to retrieve the address of the current handler for that exception from the handler table and store it in the variable prev. If the caller passed a null pointer as the pointer to the new handler then the address of the default handler is stored into the handler table. Otherwise, the address of the new handler is stored. The function returns the address of the previous handler, or a null pointer if there was no user- installed handler.

The function `get_handler`

takes a single argument of type
`fp_exception`

. It returns a pointer to the installed handler
for the specified exception, or a null pointer if there is no
user-installed handler.

```
fp::exc_hnd fp::get_handler(fp_exception except)
{ // return pointer to user-specified exception handler
fpx exc = to_fpx(except);
return exc_hndlrs[exc] == def_hndlrs[exc]
? 0 : exc_hndlrs[exc];
}
```

Internally, when a floating point exception occurs the emulator calls
the private member function `exception`

and passes an
argument of type `fpx`

, which serves as an index into the
handler table, and an argument of type `fp`

, whose value is
determined according to the rules for the particular exception. We’ll
look at those rules a little later. The function `exception`

is straightforward:

```
fp fp::exception(fpx exc, fp arg)
{ // handle floating point exceptions
return exc_hndlrs[exc](
(fp_exception)(1 << exc), arg);
}
```

Note that the function translates the table index into a value
corresponding to an enumerator of type `fp_exception`

in
order to call the trap handler.

Now that we have a mechanism for installing trap handlers, you may be asking what they can be used for. I suggested earlier that they’re useful when the default handling of floating point exceptions isn’t what you need. For example, if you don’t want your computation to continue if any of these exceptional conditions occurs, you can install your own trap handler for each of the five exception types. In that handler you’d print a message and terminate the program:

```
fp handle_exceptions(fp::fp_exception except, fp)
{
std::cout << "Floating point error: "
<< (except == fp::div_by_zero ? "divide by zero"
: except == fp::exp_underflow ? "underflow"
: except == fp::exp_overflow ? "overflow"
: except == fp::inexact ? "inexact"
: "invalid operation") << `\n’;
exit(EXIT_FAILURE);
}
```

You’d then install this handler for each of the five exception types:

```
int main()
{
fp::set_handler(fp::div_by_zero, handle_exceptions);
fp::set_handler(fp::exp_underflow, handle_exceptions);
fp::set_handler(fp::exp_overflow, handle_exceptions);
fp::set_handler(fp::inexact, handle_exceptions);
fp::set_handler(fp::invalid, handle_exceptions);
// do the floating point computation
return 0;
}
```

This sort of brute force approach may not be appropriate in some
cases. You can leave the default behavior in place for any of the
exception types by not installing a handler. You can also provide a more
sophisticated handler than this one that returns a value to be used as
the result of the operation that caused the exception. That’s what the
default handlers in the emulator do: they compute an appropriate
replacement value and return it. For example, when an invalid operation
exception occurs, the IEEE-754 specification says that the result should
be a `NaN`

. The default handler for this exception does that
substitution through a layer of indirection. The trap handler that is
installed by default for an invalid operation exception looks like
this:

```
static fp def_inv(fp::fp_exception except, fp arg)
{ // default handler for invalid exception
return fp::ex_inv(except, arg);
}
```

The member function `fp::ex_inv`

is responsible for
actually handling the exception. It looks like this:

```
fp fp::ex_inv(fp::fp_exception except, fp)
{ // handle invalid exception
set_flag(except);
return qnan;
}
```

It first sets the internal flag that indicates that an invalid
operation exception occurred, then returns a `NaN`

.

Under the IEEE-754 specification, the trap handlers for overflow and underflow are called with a floating point value with the fraction and sign that the result would have had, and an exponent that is adjusted to be within the valid range for exponents in the particular floating point representation being used. For example, in our emulator, when an overflow exception occurs, the function exception is called with a floating point value whose exponent is 192 less than the result would have been. If we try to multiply a value whose fraction is 1/2 and whose exponent is 120 by itself we’d expect a result whose fraction is 1/4 and whose exponent is 240. After normalization this value would have a fraction of 1/2 and an exponent of 239. But 239 is too large an exponent in our representation, so the multiplication causes an exception. The value passed as the second argument to the function exception has a fraction of 1/2 and an exponent of 239 - 192, or 47. That way, if we need to know in our trap handler what the value would have been we can figure it out.

The default handlers for overflow and underflow both use this information to compute the replacement value that they return. In the case of overflow the result has to be suitably rounded. For example, if we’re rounding toward zero an overflow results in the largest positive representable value or the largest negative representable value, depending on the sign of the value that overflowed. The code that picks the appropriate overflowed value looks like this:

```
fp fp::ex_exo(fp::fp_exception except, fp arg)
{ // handle overflow exception
set_flag(except);
return rm == rm_nearest
? arg.is_neg ? ninf : pinf
: rm == rm_zero
? arg.is_neg ? nmax : pmax
: rm == rm_down
? arg.is_neg ? ninf : pmax
: arg.is_neg ? nmax : pinf;
}
```

For an underflow exception the computed value has to be turned into a suitable denormalized representation. That’s done by this code:

```
fp fp::ex_exu(fp::fp_exception except, fp arg)
{ // handle underflow exception
set_flag(except);
int ex = arg.exp - BIAS - 192;
int fr = arg.frac;
while (ex < 0)
{ // denormalize
++ex;
fr /= 2;
}
return fp(ex - BIAS, fr, arg.is_neg);
}
```

Note that it determines the value of the actual signed exponent by subtracting the exponent bias, then subtracting 192 to offset the adjustment that was made in order to create a valid floating point value. Once that has been done, the code simply divides the fraction by two and increases the exponent until the exponent reaches zero.

Chances are you won’t have much use for trap handlers. But if you do end up using them make sure you understand how the value that is passed to the handler relates to the computed value. In particular, the exponent adjustment of 192 that we’ve just looked at is used for the IEEE-754 32-bit floating point representation. Other representations use other adjustments. Also, IEEE-754 doesn’t specify a calling convention for trap handlers, and the convention varies among compiler implementations. Read any appropriate documentation before you write a trap handler.

You’ve probably noticed that all of the default exception handlers
begin by calling `set_flag(except)`

. That’s because the
default behavior for any exception under IEEE-754 is to set a flag to
indicate that an exception of that type occurred. These flags are often
referred to as "sticky bits" because the floating point
implementation never clears them. If you want them cleared you must do
so yourself.

In our emulator the sticky bits are accessed through these five member functions:

```
static void set_flag(fp_exception);
static bool get_flag(fp_exception);
static void clear_flags();
static void set_flags(int);
static int get_flags();
```

The class `fp`

has a static member of type int that holds
the current flag settings. If you look back at the values of the
enumeration `fp_exception`

you’ll see that each enumeration
is represented by a value with a single bit set. That means that we can
use logical operations to manipulate the flag representation. I won’t go
through the implementation details; they’re pretty simple. Call
`set_flag`

to turn on a particular flag and
`get_flag`

to test whether a particular flag is on. Call
`clear_flags`

to turn all the flags off. Call
`get_flags`

to get an int value that represents the state of
all the flags. You can manipulate specific flags within this value by
using the `fp_exception`

values as masks. You can also set
the internal flag values to whatever pattern you want by calling
`set_flags`

with an int value having the desired pattern. For
example, to turn on the invalid operation flag and the overflow flag and
turn all the other flags off, call ```
fp::set_flags(fp::invalid |
fp::exp_overflow)
```

.

In talking about exceptions we’ve looked at the immediate results of
various operations. If we had to write code that checked for a
`NaN`

or an infinity after every operation we’d end up with
code that was unreadable: all the tests would obscure the logic of the
computation itself. One of the strengths of the IEEE-754 exception
mechanism is that exceptional values propagate sensibly through the
computation. `NaN`

s in particular tend to stick around.

We saw earlier that `0.0/0.0`

produces a `NaN`

.
Once you have that result, any arithmetic operation that uses it will
also produce a `NaN`

. This means that you don’t have to check
for a `NaN`

after every operation. Instead, you can do some
logical chunk of the computation and see if you got a `NaN`

.
If you got one, something went astray. If not, you can continue the
computation.

That’s actually a bit of an oversimplification. There are situations
where `NaN`

s will disappear. For example, the function
`hypot`

^{2} returns the square
root of the sum of the squares of its two arguments. If one of those
arguments is infinity, it doesn’t matter what the other argument is. The
result will always be infinity. So `hypot(1.0/0.0, 0.0/0.0)`

is infinity, even though the second argument is a `NaN`

.

Infinities, of course, can also disappear, in the obvious way: any finite value divided by infinity produces zero.

What all this means is that checking for `NaN`

s and
infinites won’t tell you whether something went wrong in your
computation. What it will tell you is that at the point where you
checked, your values are probably okay. If you need more assurance than
this you have to check the sticky bits. That’s why they’re never cleared
by the floating point package: they provide information that could
otherwise get lost.

Under the IEEE-754 standard there are four possible relationships
between two floating point values. They can be equal, the first can be
less than the second, the first can be greater than the second, and they
can be unordered. If any `NaN`

s are involved, the two values
are unordered. Further, any comparison that involves a `NaN`

raises an invalid operation exception.

In order to define meaningful predicates for comparing two floating
point values we have to take into account the possibility that the
values will be unordered. You may have heard the rule that all
comparisons involving `NaN`

s are false. That’s true as far as
it goes: under the IEEE-754 specification, the numerical comparisons
that we’re used to in C and C++ are all supposed to be false. This leads
to the peculiar-looking result that testing for equality of two numbers
can return false, and that testing for inequality of the same two
numbers can also return false. In order to do meaningful comparisons
when `NaN`

s can be involved we have to expand the range of
tests that we can perform. We need to add predicates that explicitly
allow for the possibility that two values can be unordered.

The IEEE-754 specification provides a list of all the meaningful predicates, along with descriptive names for them. Table 1 is a list of all of the predicates from the IEEE- 754 specification, along with their definitions. I’ve given them names based on the suggested FORTRAN names.

Implementing these predicates is straightforward. The class
`fp`

defines an enumerated type named `cmp_res`

with enumerations for each of the four possible relationships:

```
enum cmp_res
{
cmp_lt,
cmp_eq,
cmp_gt,
cmp_un
};
```

The static member function `fp::cmp`

takes two arguments
of type `fp`

and compares them, returning the appropriate
enumeration:

```
fp::cmp_res fp::cmp(const fp& f1, const fp& f2)
{
if (IS_NAN(f1) || IS_NAN(f2))
return cmp_un;
else if (IS_ZERO(f1) && IS_ZERO(f2)
|| f1.is_neg == f2.is_neg
&& f1.frac == f2.frac
&& f1.exp == f2.exp)
return cmp_eq;
else if (f1.is_neg != f2.is_neg)
return f1.is_neg ? cmp_lt : cmp_gt;
else if (f1.exp != f2.exp)
return f1.exp < f2.exp != f1.is_neg
? cmp_lt : cmp_gt;
else
return f1.frac < f2.frac != f1.is_neg
? cmp_lt : cmp_gt;
}
```

Each predicate function calls `cmp`

and returns the
appropriate result, in accordance with the definitions in Table 1. I
won’t go through all of them, because they’re pretty much the same.
Here’s the code for `ule`

(unordered or less or equal):

```
bool fp::ule(const fp& f1, const fp& f2)
{
cmp_res res = cmp(ARG(f1), ARG(f2));
return res == cmp_un
|| res == cmp_lt
|| res == cmp_eq;
}
```

In addition, most of these have meaningful negations. I’ve named the
negations with the prefix `not_`

. They all follow the obvious
pattern:

```
inline bool not_ule(const fp& f1, const fp& f2)
{ return !fp::ule(f1, f2); }
```

While it’s important to know about all of the complications that
`NaN`

s can introduce into comparisons, for many everyday
applications it’s probably better to simply avoid the possibility of
comparing `NaN`

s. Treat a `NaN`

as an error, and
abandon the computation. That lets you think of comparisons in the way
that you’re used to, without having to worry about the possibility that
two values will be unordered.

Unfortunately, that doesn’t help with the problem that we started out
with many months ago, that `(1.0/10.0)*10.0`

on most systems
does not compare equal to `1.0`

. As we’ve seen, that’s
because of rounding in the division. Since one tenth cannot be
represented exactly as a binary fraction (it has an infinite expansion,
just like one third as a decimal), some bits get lost. Multiplying by
ten cannot restore those lost bits, so the result isn’t exactly one. For
most purposes, though, it’s close enough. We need to be able to express
the notion of "close enough" meaningfully.

Most people’s first attempt at the notion of "close enough" is to define an acceptable tolerance, and check whether the difference between the two values being compared is less than this tolerance:

```
if (abs(f1 - f2) < .0001)
// close enough
```

If you’ve been following the discussion closely enough you know why
this doesn’t work: if `f1`

and `f2`

are numbers
with large exponents, .0001 is too small to be meaningful. The
difference between 1.0001e40 and 1.0000e40 is .0001e40, which is much
larger than .0001. For large numbers the only difference that’s smaller
than .0001 is zero, and we’re right back where we started. This test
also fails for small numbers, but in the opposite direction: the
difference between .0001 and .00019 is less than .0001, but these two
numbers differ by nearly a factor of two. They’re close enough under
this naãve test, but that’s almost certainly not what we want.

The solution is to scale the tolerance according to the magnitude of the numbers that we’re dealing with. A simple approach is to divide the difference by one of the numbers:

```
if (abs((f1 - f2) / f2) < .0001)
// close enough
```

If we’re using a floating point package that conforms to IEEE-754
this will work the way we expect. Some floating point packages abort
execution on division by zero, and for a package like that we have to
check whether `f2`

is zero before doing the division:

```
if (abs((f1 - f2) / (f2 == 0 ? 1.0 : F2) < .0001)
// close enough
```

This will work the way we expect in almost all the situations we’ll
run into in ordinary programming. For a floating point package that
doesn’t handle overflows gracefully there’s still the possibility that
`f1`

will be so much larger than `f2`

that the
division will overflow and abort execution. If this is a possibility for
the numbers you’re dealing with, it’s worth adding a check that the
exponents are equal. If they’re not equal, the number with the greater
exponent is greater than the other. If they’re equal the division won’t
overflow, and the test code above will work reliably.

There’s a more subtle problem, though, in all of these tests. We started out trying to compensate for the effects of rounding on our answers. So the question that we’re actually trying to answer is whether two values are close enough together that rounding could account for their difference. The value .0001 has nothing to do with this question. Rather than pick an arbitrary tolerance, we need to use a value that reflects the magnitude of the errors introduced by rounding.

When we looked at rounding last July we saw that each rounding
operation affects only one bit of the answer. As we chain operations
together the rounding error propagates, so we can end up with several
low bits that aren’t particularly meaningful. When we compare two
numbers we ought to use a tolerance that allows us to ignore those low
bits. We can’t pick a tolerance without knowing details of how floating
point values are represented. But we don’t have to figure that out for
ourselves. Most programming languages provide a way of getting at the
value of `epsilon`

, which is the difference between one and
the least value greater than one that is representable in the floating
point type. In C we have the macros `FLT_EPSILON`

for floats,
`DBL_EPSILON`

for doubles, and `LDBL_EPSILON`

for
long doubles. In C++ we have
`numeric_limits<type>::epsilon()`

.

In a binary representation, the least value greater than one that is representable in the floating point type is the value of 1.0 with its low bit changed to one. That is, its value is 1.0 plus the error introduced by a single rounding error. Subtracting 1.0 from this value gives the value of a single rounding error for results close to 1.0. This gives us a basis for computing a tolerance for floating point comparisons: we just multiply the expected number of rounding errors by the value of epsilon. So our test for close enough becomes:

```
if (abs((f1 - f2) / (f2 == 0 ? 1.0 : f2)
< NROUNDS * std::numeric_limits<double>::epsilon())
// close enough
```

More generally, rather than hard-code the type of the floating point value, we can use a template:

```
template<class FP>
bool close_enough(FP f1, FP f2, int NROUNDS)
{
return abs((f1 - f2) / (f2 == FP(0.0) ? FP(1.0) : f2)
< FP(NROUNDS) * std::numeric_limits<FP>::epsilon());
}
```

The casts are necessary both to avoid forcing inappropriate type
conversions and to enable the template to be used for user-defined types
like our class `fp`

. Listing
1 is a short program that demonstrates this template in action.

Be careful with this template, though. It has some properties that
may be surprising. In particular, it is not transitive. That is, for
three values `f1`

, `f2`

, and `f3`

, it’s
possible for `close_enough(f1, f2, 1)`

and
`close_enough(f2, f3, 1)`

to both return true and for
`close_enough(f1, f3, 1)`

to return false. So don’t think of
it as an equality test. Think of it as what it’s name suggests: a way of
deciding whether two values are close enough to each other that they can
be considered equal.

The complete code for the final version of the floating point
emulator is in js200011fp.zip. It includes
all of the code we’ve looked at in this column, as well as a
specialization of `std::numeric_limits<fp>`

.

My goal in this series of columns has been to present the underlying issues of floating point math in a way that makes them more understandable to people who don’t make their living doing this sort of work. There’s a great deal that I’ve left out, some because I haven’t had space, and most because I don’t understand it well enough to present it coherently. If you understand what we’ve looked at in this series you ought to be able to get good results from floating point math in many situations, and to identify the situations where you need help. Floating point math can be tricky, and there are times when you need to consult an expert. But there’s nothing magic about it, and there’s no need to panic when you’re faced with a problem that requires accurate floating point results.

Next month we’ll look at another topic that is much more subtle than most people realize. I recently spent a few weeks implementing quick sort for Dinkumware’s Jcore library, and came up with a new wrinkle that I call an inside-out sort. If you want a preview, the inside-out sort is also used in the latest version of the Dinkum C++ library, available for purchase on our web site.

**1.** Formally known as ISO/IEC 9899:1999, Programming
Languages - C, and informally known as C99.

**2.** This example comes from a paper by Prof. William
Kahan, entitled "Lecture Notes on the Status of IEEE Standard 754
for Binary Floating-Point Arithmetic", available at www.cs.
berkeley.edu/~wkahan/ieee754status/ieee754.ps. Kahan was one of the
architects of Intel’s approach to floating point math, much of which was
incorporated into the IEEE-754 standard.

Copyright © 2000-2006 by Pete Becker. All rights reserved.