Trap Handlers, Sticky Bits, and Floating-point Comparisons

The C/C++ Users Journal, December, 2000

I said last month that one of the things we’d look at this month would be the support provided for floating point math in the recent revision to the C language standard1. But that’s a large subject, and deserves a column or two of its own, so I’ve decided to put if off for a while. This month we’ll wrap up our examination of floating point math by looking at trap handlers and sticky bits, then coming back to roughly where we started, by looking at techniques for comparing floating point values.

Trap Handlers

Last month we looked at the five types of floating point exceptions defined by the IEEE 754 standard: invalid operation, division by zero, overflow, underflow, and inexact. We also looked at what an IEEE-754 conforming floating point implementation should do when any of these exceptions occurs. The default exception handling specified by IEEE-754 is suitable for most of what most of us do most of the time. If you’re in a situation where the default behavior isn’t suitable, though, you can change what happens when any particular exception occurs by installing a trap handler for that exception. A trap handler is simply a function that you write. You install it by calling a function that tells the floating point package to call your handler whenever an exception of the corresponding type occurs. In our emulator, trap handlers are managed by the member functions set_handler and get_handler. The code for set_handler looks like this:

fp::exc_hnd fp::set_handler(
    fp::fp_exception except, fp::exc_hnd hnd)
    {   // install user-specified exception handler
    fpx exc = to_fpx(except);
    exc_hnd prev = exc_hndlrs[except];
    exc_hndlrs[except] = hnd ? hnd : def_hndlrs[except];
    return prev == def_hndlrs[except] ? 0 : prev;

It takes two arguments. The type of the first is an enumeration, defined in the class fp, named fp_exception:

enum fp_exception
    {   // bit flags to identify exceptions
    div_by_zero = 0x01,
    exp_underflow   = 0x02,
    exp_overflow    = 0x04,
    inexact     = 0x08,
    invalid     = 0x10

The second is a pointer to a handler. Passing a null pointer for this argument tells the floating point package to use the default handling for the specified exception. Otherwise, the pointer must point to a function whose signature matches the typedef fp::exc_hnd:

typedef fp (*exc_hnd)(fp_exception, fp);

The fp_exception value passed to set_handler is translated into an array index by the function to_fpx:

static inline fp::fpx fp::to_fpx(fp::fp_exception except)
    {   // translate exception bit flag to table index
    return except & div_by_zero ? dbz
        : except & exp_underflow ? exu
        : except & exp_overflow ? exo
        : except & inexact ? ine
        : inv;

The resulting index is used to retrieve the address of the current handler for that exception from the handler table and store it in the variable prev. If the caller passed a null pointer as the pointer to the new handler then the address of the default handler is stored into the handler table. Otherwise, the address of the new handler is stored. The function returns the address of the previous handler, or a null pointer if there was no user- installed handler.

The function get_handler takes a single argument of type fp_exception. It returns a pointer to the installed handler for the specified exception, or a null pointer if there is no user-installed handler.

fp::exc_hnd fp::get_handler(fp_exception except)
    { // return pointer to user-specified exception handler
    fpx exc = to_fpx(except);
    return exc_hndlrs[exc] == def_hndlrs[exc]
        ? 0 : exc_hndlrs[exc];

Internally, when a floating point exception occurs the emulator calls the private member function exception and passes an argument of type fpx, which serves as an index into the handler table, and an argument of type fp, whose value is determined according to the rules for the particular exception. We’ll look at those rules a little later. The function exception is straightforward:

fp fp::exception(fpx exc, fp arg)
    {   // handle floating point exceptions
    return exc_hndlrs[exc](
        (fp_exception)(1 << exc), arg);

Note that the function translates the table index into a value corresponding to an enumerator of type fp_exception in order to call the trap handler.

Using Trap Handlers

Now that we have a mechanism for installing trap handlers, you may be asking what they can be used for. I suggested earlier that they’re useful when the default handling of floating point exceptions isn’t what you need. For example, if you don’t want your computation to continue if any of these exceptional conditions occurs, you can install your own trap handler for each of the five exception types. In that handler you’d print a message and terminate the program:

fp handle_exceptions(fp::fp_exception except, fp)
    std::cout << "Floating point error: "
        << (except == fp::div_by_zero ? "divide by zero"
            : except == fp::exp_underflow ? "underflow"
            : except == fp::exp_overflow ? "overflow"
            : except == fp::inexact ? "inexact"
            : "invalid operation") << `\n’;

You’d then install this handler for each of the five exception types:

int main()
    fp::set_handler(fp::div_by_zero, handle_exceptions);
    fp::set_handler(fp::exp_underflow, handle_exceptions);
    fp::set_handler(fp::exp_overflow, handle_exceptions);
    fp::set_handler(fp::inexact, handle_exceptions);
    fp::set_handler(fp::invalid, handle_exceptions);
    // do the floating point computation
    return 0;

This sort of brute force approach may not be appropriate in some cases. You can leave the default behavior in place for any of the exception types by not installing a handler. You can also provide a more sophisticated handler than this one that returns a value to be used as the result of the operation that caused the exception. That’s what the default handlers in the emulator do: they compute an appropriate replacement value and return it. For example, when an invalid operation exception occurs, the IEEE-754 specification says that the result should be a NaN. The default handler for this exception does that substitution through a layer of indirection. The trap handler that is installed by default for an invalid operation exception looks like this:

static fp def_inv(fp::fp_exception except, fp arg)
    {   // default handler for invalid exception
    return fp::ex_inv(except, arg);

The member function fp::ex_inv is responsible for actually handling the exception. It looks like this:

fp fp::ex_inv(fp::fp_exception except, fp)
    {   // handle invalid exception
    return qnan;

It first sets the internal flag that indicates that an invalid operation exception occurred, then returns a NaN.

The Value Passed to the Trap Handler

Under the IEEE-754 specification, the trap handlers for overflow and underflow are called with a floating point value with the fraction and sign that the result would have had, and an exponent that is adjusted to be within the valid range for exponents in the particular floating point representation being used. For example, in our emulator, when an overflow exception occurs, the function exception is called with a floating point value whose exponent is 192 less than the result would have been. If we try to multiply a value whose fraction is 1/2 and whose exponent is 120 by itself we’d expect a result whose fraction is 1/4 and whose exponent is 240. After normalization this value would have a fraction of 1/2 and an exponent of 239. But 239 is too large an exponent in our representation, so the multiplication causes an exception. The value passed as the second argument to the function exception has a fraction of 1/2 and an exponent of 239 - 192, or 47. That way, if we need to know in our trap handler what the value would have been we can figure it out.

The default handlers for overflow and underflow both use this information to compute the replacement value that they return. In the case of overflow the result has to be suitably rounded. For example, if we’re rounding toward zero an overflow results in the largest positive representable value or the largest negative representable value, depending on the sign of the value that overflowed. The code that picks the appropriate overflowed value looks like this:

fp fp::ex_exo(fp::fp_exception except, fp arg)
    {   // handle overflow exception
    return rm == rm_nearest
            ? arg.is_neg ? ninf : pinf
        : rm == rm_zero
            ? arg.is_neg ? nmax : pmax
        : rm == rm_down
            ? arg.is_neg ? ninf : pmax
            : arg.is_neg ? nmax : pinf;

For an underflow exception the computed value has to be turned into a suitable denormalized representation. That’s done by this code:

fp fp::ex_exu(fp::fp_exception except, fp arg)
    {   // handle underflow exception
    int ex = arg.exp - BIAS - 192;
    int fr = arg.frac;
    while (ex < 0)
        {   // denormalize
        fr /= 2;
    return fp(ex - BIAS, fr, arg.is_neg);

Note that it determines the value of the actual signed exponent by subtracting the exponent bias, then subtracting 192 to offset the adjustment that was made in order to create a valid floating point value. Once that has been done, the code simply divides the fraction by two and increases the exponent until the exponent reaches zero.

Writing Your Own Trap Handler

Chances are you won’t have much use for trap handlers. But if you do end up using them make sure you understand how the value that is passed to the handler relates to the computed value. In particular, the exponent adjustment of 192 that we’ve just looked at is used for the IEEE-754 32-bit floating point representation. Other representations use other adjustments. Also, IEEE-754 doesn’t specify a calling convention for trap handlers, and the convention varies among compiler implementations. Read any appropriate documentation before you write a trap handler.

Sticky Bits

You’ve probably noticed that all of the default exception handlers begin by calling set_flag(except). That’s because the default behavior for any exception under IEEE-754 is to set a flag to indicate that an exception of that type occurred. These flags are often referred to as "sticky bits" because the floating point implementation never clears them. If you want them cleared you must do so yourself.

In our emulator the sticky bits are accessed through these five member functions:

static void set_flag(fp_exception);
static bool get_flag(fp_exception);
static void clear_flags();
static void set_flags(int);
static int get_flags();

The class fp has a static member of type int that holds the current flag settings. If you look back at the values of the enumeration fp_exception you’ll see that each enumeration is represented by a value with a single bit set. That means that we can use logical operations to manipulate the flag representation. I won’t go through the implementation details; they’re pretty simple. Call set_flag to turn on a particular flag and get_flag to test whether a particular flag is on. Call clear_flags to turn all the flags off. Call get_flags to get an int value that represents the state of all the flags. You can manipulate specific flags within this value by using the fp_exception values as masks. You can also set the internal flag values to whatever pattern you want by calling set_flags with an int value having the desired pattern. For example, to turn on the invalid operation flag and the overflow flag and turn all the other flags off, call fp::set_flags(fp::invalid | fp::exp_overflow).

NaNs and Infinities in Real Computations

In talking about exceptions we’ve looked at the immediate results of various operations. If we had to write code that checked for a NaN or an infinity after every operation we’d end up with code that was unreadable: all the tests would obscure the logic of the computation itself. One of the strengths of the IEEE-754 exception mechanism is that exceptional values propagate sensibly through the computation. NaNs in particular tend to stick around.

We saw earlier that 0.0/0.0 produces a NaN. Once you have that result, any arithmetic operation that uses it will also produce a NaN. This means that you don’t have to check for a NaN after every operation. Instead, you can do some logical chunk of the computation and see if you got a NaN. If you got one, something went astray. If not, you can continue the computation.

That’s actually a bit of an oversimplification. There are situations where NaNs will disappear. For example, the function hypot2 returns the square root of the sum of the squares of its two arguments. If one of those arguments is infinity, it doesn’t matter what the other argument is. The result will always be infinity. So hypot(1.0/0.0, 0.0/0.0) is infinity, even though the second argument is a NaN.

Infinities, of course, can also disappear, in the obvious way: any finite value divided by infinity produces zero.

What all this means is that checking for NaNs and infinites won’t tell you whether something went wrong in your computation. What it will tell you is that at the point where you checked, your values are probably okay. If you need more assurance than this you have to check the sticky bits. That’s why they’re never cleared by the floating point package: they provide information that could otherwise get lost.

Comparing Values under IEEE-754

Under the IEEE-754 standard there are four possible relationships between two floating point values. They can be equal, the first can be less than the second, the first can be greater than the second, and they can be unordered. If any NaNs are involved, the two values are unordered. Further, any comparison that involves a NaN raises an invalid operation exception.

In order to define meaningful predicates for comparing two floating point values we have to take into account the possibility that the values will be unordered. You may have heard the rule that all comparisons involving NaNs are false. That’s true as far as it goes: under the IEEE-754 specification, the numerical comparisons that we’re used to in C and C++ are all supposed to be false. This leads to the peculiar-looking result that testing for equality of two numbers can return false, and that testing for inequality of the same two numbers can also return false. In order to do meaningful comparisons when NaNs can be involved we have to expand the range of tests that we can perform. We need to add predicates that explicitly allow for the possibility that two values can be unordered.

The IEEE-754 specification provides a list of all the meaningful predicates, along with descriptive names for them. Table 1 is a list of all of the predicates from the IEEE- 754 specification, along with their definitions. I’ve given them names based on the suggested FORTRAN names.

Implementing these predicates is straightforward. The class fp defines an enumerated type named cmp_res with enumerations for each of the four possible relationships:

enum cmp_res

The static member function fp::cmp takes two arguments of type fp and compares them, returning the appropriate enumeration:

fp::cmp_res fp::cmp(const fp& f1, const fp& f2)
    if (IS_NAN(f1) || IS_NAN(f2))
        return cmp_un;
    else if (IS_ZERO(f1) && IS_ZERO(f2)
        || f1.is_neg == f2.is_neg
            && f1.frac == f2.frac
            && f1.exp == f2.exp)
        return cmp_eq;
    else if (f1.is_neg != f2.is_neg)
        return f1.is_neg ? cmp_lt : cmp_gt;
    else if (f1.exp != f2.exp)
        return f1.exp < f2.exp != f1.is_neg
            ? cmp_lt : cmp_gt;
        return f1.frac < f2.frac != f1.is_neg
            ? cmp_lt : cmp_gt;

Each predicate function calls cmp and returns the appropriate result, in accordance with the definitions in Table 1. I won’t go through all of them, because they’re pretty much the same. Here’s the code for ule (unordered or less or equal):

bool fp::ule(const fp& f1, const fp& f2)
    cmp_res res = cmp(ARG(f1), ARG(f2));
    return res == cmp_un
        || res == cmp_lt
        || res == cmp_eq;

In addition, most of these have meaningful negations. I’ve named the negations with the prefix not_. They all follow the obvious pattern:

inline bool not_ule(const fp& f1, const fp& f2)
    { return !fp::ule(f1, f2); }

Comparing Values in Real Life

While it’s important to know about all of the complications that NaNs can introduce into comparisons, for many everyday applications it’s probably better to simply avoid the possibility of comparing NaNs. Treat a NaN as an error, and abandon the computation. That lets you think of comparisons in the way that you’re used to, without having to worry about the possibility that two values will be unordered.

Unfortunately, that doesn’t help with the problem that we started out with many months ago, that (1.0/10.0)*10.0 on most systems does not compare equal to 1.0. As we’ve seen, that’s because of rounding in the division. Since one tenth cannot be represented exactly as a binary fraction (it has an infinite expansion, just like one third as a decimal), some bits get lost. Multiplying by ten cannot restore those lost bits, so the result isn’t exactly one. For most purposes, though, it’s close enough. We need to be able to express the notion of "close enough" meaningfully.

Most people’s first attempt at the notion of "close enough" is to define an acceptable tolerance, and check whether the difference between the two values being compared is less than this tolerance:

if (abs(f1 - f2) < .0001)
    // close enough

If you’ve been following the discussion closely enough you know why this doesn’t work: if f1 and f2 are numbers with large exponents, .0001 is too small to be meaningful. The difference between 1.0001e40 and 1.0000e40 is .0001e40, which is much larger than .0001. For large numbers the only difference that’s smaller than .0001 is zero, and we’re right back where we started. This test also fails for small numbers, but in the opposite direction: the difference between .0001 and .00019 is less than .0001, but these two numbers differ by nearly a factor of two. They’re close enough under this na├úve test, but that’s almost certainly not what we want.

The solution is to scale the tolerance according to the magnitude of the numbers that we’re dealing with. A simple approach is to divide the difference by one of the numbers:

if (abs((f1 - f2) / f2) < .0001)
    // close enough

If we’re using a floating point package that conforms to IEEE-754 this will work the way we expect. Some floating point packages abort execution on division by zero, and for a package like that we have to check whether f2 is zero before doing the division:

if (abs((f1 - f2) / (f2 == 0 ? 1.0 : F2) < .0001)
    // close enough

This will work the way we expect in almost all the situations we’ll run into in ordinary programming. For a floating point package that doesn’t handle overflows gracefully there’s still the possibility that f1 will be so much larger than f2 that the division will overflow and abort execution. If this is a possibility for the numbers you’re dealing with, it’s worth adding a check that the exponents are equal. If they’re not equal, the number with the greater exponent is greater than the other. If they’re equal the division won’t overflow, and the test code above will work reliably.

There’s a more subtle problem, though, in all of these tests. We started out trying to compensate for the effects of rounding on our answers. So the question that we’re actually trying to answer is whether two values are close enough together that rounding could account for their difference. The value .0001 has nothing to do with this question. Rather than pick an arbitrary tolerance, we need to use a value that reflects the magnitude of the errors introduced by rounding.

When we looked at rounding last July we saw that each rounding operation affects only one bit of the answer. As we chain operations together the rounding error propagates, so we can end up with several low bits that aren’t particularly meaningful. When we compare two numbers we ought to use a tolerance that allows us to ignore those low bits. We can’t pick a tolerance without knowing details of how floating point values are represented. But we don’t have to figure that out for ourselves. Most programming languages provide a way of getting at the value of epsilon, which is the difference between one and the least value greater than one that is representable in the floating point type. In C we have the macros FLT_EPSILON for floats, DBL_EPSILON for doubles, and LDBL_EPSILON for long doubles. In C++ we have numeric_limits<type>::epsilon().

In a binary representation, the least value greater than one that is representable in the floating point type is the value of 1.0 with its low bit changed to one. That is, its value is 1.0 plus the error introduced by a single rounding error. Subtracting 1.0 from this value gives the value of a single rounding error for results close to 1.0. This gives us a basis for computing a tolerance for floating point comparisons: we just multiply the expected number of rounding errors by the value of epsilon. So our test for close enough becomes:

if (abs((f1 - f2) / (f2 == 0 ? 1.0 : f2)
    < NROUNDS * std::numeric_limits<double>::epsilon())
    // close enough

More generally, rather than hard-code the type of the floating point value, we can use a template:

template<class FP>
    bool close_enough(FP f1, FP f2, int NROUNDS)
    return abs((f1 - f2) / (f2 == FP(0.0) ? FP(1.0) : f2)
        < FP(NROUNDS) * std::numeric_limits<FP>::epsilon());

The casts are necessary both to avoid forcing inappropriate type conversions and to enable the template to be used for user-defined types like our class fp. Listing 1 is a short program that demonstrates this template in action.

Be careful with this template, though. It has some properties that may be surprising. In particular, it is not transitive. That is, for three values f1, f2, and f3, it’s possible for close_enough(f1, f2, 1) and close_enough(f2, f3, 1) to both return true and for close_enough(f1, f3, 1) to return false. So don’t think of it as an equality test. Think of it as what it’s name suggests: a way of deciding whether two values are close enough to each other that they can be considered equal.


The complete code for the final version of the floating point emulator is in It includes all of the code we’ve looked at in this column, as well as a specialization of std::numeric_limits<fp>.

My goal in this series of columns has been to present the underlying issues of floating point math in a way that makes them more understandable to people who don’t make their living doing this sort of work. There’s a great deal that I’ve left out, some because I haven’t had space, and most because I don’t understand it well enough to present it coherently. If you understand what we’ve looked at in this series you ought to be able to get good results from floating point math in many situations, and to identify the situations where you need help. Floating point math can be tricky, and there are times when you need to consult an expert. But there’s nothing magic about it, and there’s no need to panic when you’re faced with a problem that requires accurate floating point results.

Next month we’ll look at another topic that is much more subtle than most people realize. I recently spent a few weeks implementing quick sort for Dinkumware’s Jcore library, and came up with a new wrinkle that I call an inside-out sort. If you want a preview, the inside-out sort is also used in the latest version of the Dinkum C++ library, available for purchase on our web site.

1. Formally known as ISO/IEC 9899:1999, Programming Languages - C, and informally known as C99.

2. This example comes from a paper by Prof. William Kahan, entitled "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic", available at www.cs. Kahan was one of the architects of Intel’s approach to floating point math, much of which was incorporated into the IEEE-754 standard.