Come on in. Shut the door -- it’s getting cold in here. Welcome to The Journeyman’s Shop. Pour yourself a cup of coffee if you like. Sorry the room is such a mess -- as soon as things calm down a little I’m going to clean the place up and get organized. Warm your hands by the woodstove, and let’s talk a bit about software development before we get down to work.
Software projects always start out with architectural decisions: where the software will get its data from, what it will do to the data, and how it will report its results. Even if you work alone and develop software by hacking code all day long you still have to make all of those architectural decisions. You might not have written up an architectural specification, but you have it there in your head, and you’re updating it as you go.
Here at The Journeyman’s Shop we don’t develop architectural specifications. We leave that to the application architect. Our job is to build the pieces that fit together to become the application. We take the blueprint that the architect supplies, we make parts according to the blueprint, and we assemble the parts to make the application. Of course, we don’t have to be blind to problems that we see in the blueprint, and should call any serious problems to the architect’s attention. But for the most part those high level decisions have already been made, and our job is to implement them. Don’t worry, though: there’s lots of room for creative innovation in what we do. That architect hasn’t worked out all the details. If you’re into categorizing the steps in the development process, what the architect does is high level design. What we do in The Journeyman’s Shop is low level design and implementation. In more mundane terms, we make the application work. To do that we need to understand a wide range of programming disciplines. We need to be able to divide a task into a coherent set of functions and design the interfaces between those functions. We need to know how to choose the best sorting algorithm given the amount of data to be sorted, the time constraints, the amount of available memory, and whether the data is in memory or on disk. We need to be able to design and implement error handling within the code that we’re writing, and make sure that our error handling strategy integrates easily into the application’s error handling strategy. More generally, we need to be able to break a task down into smaller pieces, make those pieces work efficiently, and then assemble those pieces into a larger component that satisfies the application’s architectural requirements. These design decisions affect the maintainability, efficiency, and correctness of the application.
Over the years I’ve often heard from programmers who made a design decision that didn’t quite work out and then asked for help in coding their way out of trouble. They seem to think that they’re like the cartoon character who can paint himself into a corner, paint a door in the wall, open the door, step through, and reach back and paint the part of the floor where he was standing. Trouble is, we’re not cartoon characters. When we paint that door on the wall, no matter how hard we try, it’s not going to open. There’s a good side to this, though: it’s not really paint, and we can go back and start fresh without messing up our shoes. Part of being a good programmer is being able to recognize when an earlier design decision isn’t working out, and having the courage to abandon what we’ve done and start over. It hurts our pride, and it may produce pitying looks from our colleagues, but sometimes it’s the best solution. Don’t think, though, that I’m urging you to simply give up when things get hard. Far from it. That drive to complete what we started, the desire to tear down all obstacles, and the thrill that we get when we succeed in solving what looked like an intractable problem all contribute to the excitement and sense of satisfaction that we get from our profession. On the other hand, the frustration of continuing to slog along through a maze of twisty little passages, all alike, as we struggle to implement a design that we know, deep down inside, won’t work contributes to burnout and job hopping. We must all learn to recognize when we’ve run into a dead end, and be willing to discard what we’ve done and start over.
In the application that we’re building, however, we usually don’t have that flexibility. The code that we’ve written is out there in the field, and our customers expect it to work. If something goes wrong we can’t expect to be called in to fix the problem. Like it or not, our code must be written to anticipate difficulties and to handle them in appropriate ways. This aspect of program design is, of course, what’s usually referred to as "error handling," and it’s a subject that many of us try to postpone thinking about or avoid altogether. We’d much rather focus on getting the code to produce results than on figuring out how our code can go wrong. But producing an application that’s robust and reliable requires that we pay attention to error handling. We must design it into the application, and not try to retrofit it after we’ve done the parts that we like better. Otherwise we’ll leave footprints all over the painted floor.
Simply stated, an error occurs when a function is unable to produce a required result. When we are designing the code to perform some computation we must look at the ways that the code we’re about to write could fail, and with that information, we must make decisions about how to handle those potential failures. That doesn’t mean that every design should incorporate explicit error checking and handling -- the C character classification functions, for example, accept without complaint values that are not valid character representations. As we’ll see later, in some cases deciding to ignore an error is a reasonable choice. In others it is not. In all cases, however, failing to decide how to handle an error is itself an error. One thing we should always do before we conclude that a function we’ve written is finished is to ask ourselves whether we’ve considered all of the possible ways that our function could fail to produce a required result.
There are four broad categories of errors that we should consider: our function can be called with arguments that are outside the range that it is designed to handle; it may be unable to get resources needed for the computation; the operation it has been asked to perform may violate the application’s security policies; and we may have made a coding error when we wrote it. There’s a certain amount of overlap in these categories. For example, if we made a coding error that inflates the length of a string to a couple of gigabytes we’ll probably find that we can’t get the resources needed to create that string. This makes a coding error look like a resource failure. When we’re identifying possible errors we need to focus on where errors come from, because that makes it easier to spot them in our code. Once we’ve spotted them we can figure out how to detect them. When we’re debugging, on the other hand, we need to focus on what effects an error will have, that is, on how to recognize it. Once we’ve found out what’s going wrong it’s much easier to track down the source of the problem. At the moment we’re talking about identifying possible ways that our code can fail, so we should think in terms of how errors can arise.
The first category of errors consists of calls to our function with arguments that are outside the range that it is designed to handle. For example, think about the C function
double sqrt(double x);
Since it returns a double and not a complex value, it cannot be used to calculate the square root of a negative number. Calling sqrt with a negative argument is an error. If we’re writing the sqrt function, one of the error cases we have to think about is being called with a negative argument.
The second category of errors is those in which necessary resources are not available. The obvious example of a scarce resource is memory. As C programmers we’ve been told ever since we started programming that we must always check the return value of malloc to see whether it is NULL. If it is, the runtime library was unable to allocate the memory that we requested, and our computation probably cannot continue without taking corrective action. More generally we should be careful about anything that we have to ask for before we can use it. It might not be available.
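For instance, here is a minimal sketch of that familiar idiom; the helper function itself is invented for illustration:

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a copy of str, or NULL if the
   memory for the copy cannot be obtained. */
char *copy_string(const char *str)
{
    char *buf = malloc(strlen(str) + 1);
    if (buf == NULL)
        return NULL;          /* the resource was not available */
    strcpy(buf, str);
    return buf;
}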
The third category of errors is security violations. Java has made programmers more aware of the significance of security policies in programming, but security didn’t start with Java. For example, Windows NT has, from the start, supported security controls in applications. In any event, if the application that we are writing is subject to any sort of security control then we have to consider possible security violations when we design our code. If our application or its user does not have sufficient security rights to do whatever it is that we’ve been asked to do, our code cannot produce a meaningful result. This is an error that we must consider in our design.
The fourth category of errors is coding errors. These happen to all of us at one time or another. If we’ve done our jobs well these errors don’t make it past our internal reviews and testing. Still, we must consider the possibility that coding errors will invalidate the results of our computation.
Once we’ve identified the possible sources of errors in the function that we’re writing we should look at our specification once again, to decide whether these possible errors are covered by the specification. If the specification does not address them it may need to be revised. For example, suppose we’ve been asked to write a function that takes a pointer to a null terminated array of char containing characters representing the digits 0 through 9, and translates those characters into an integral value. Our first pass at writing this function might look like this:
int translate(const char *str)
{
    int val = 0;
    while (*str != '\0')
        val = val * 10 + *str++ - '0';
    return val;
}
When we look at this code we see several possible error conditions. First, our code returns 0 when the first character in the array is the null terminator. That’s the most natural way to implement it, but it may not be what the writer of the specification intended. Second, if the array contains characters that are not digits we will incorporate them into our result anyway, producing a value that doesn’t make sense. Third, str might be a null pointer. Fourth, the value represented by the digits in the array might be too large to store in an int. The first three are actually covered by our specification if we read it literally: they are not valid input to this function. On the other hand, if the specification that we’re working from is an informal description of capabilities, it may be that the writer did not consider these possibilities. In that case we need to ask what the appropriate action is for these input conditions. This calls for an exercise of judgment: if the architect always says things precisely and accurately, then reading the specification literally is usually the right thing to do. We can’t fall back on a literal reading of the specification as a defense for writing code that is obviously flawed, however. Application development is a cooperative effort by a team of programmers, and while our role is primarily to implement what the specification describes, we may be in a better position than the architect to see some problems and to recommend changes to the specification.
The fourth problem, getting a result that is too large to fit in an int, is clearly a failure in the specification. We cannot fix this problem ourselves, because the solution affects how the function is used. For example, if our function simply drops the high bits as the value overflows, we produce the wrong value. That’s acceptable if the specification tells users of the function what the maximum allowable value is. If, instead, our function somehow indicates that an error occurred, then the caller of our function must be prepared to handle that error indication. We have to ask the architect what to do if this error occurs, perhaps suggesting what we think is an appropriate solution, and we have to make sure that the answer becomes part of the specification. That way users of this function will know what to expect.
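Suppose, purely for illustration, that the architect settles on an error-return convention: the function returns a status code and delivers its value through a pointer. A sketch of the hardened function might then look like this (the revised interface is an assumption, not something our specification dictates):

#include <limits.h>
#include <stddef.h>

/* Hypothetical revision: returns 0 on success, -1 for a null or
   empty string, a non-digit character, or a value too large to
   store in an int. The translated value is stored through result. */
int translate_checked(const char *str, int *result)
{
    int val = 0;
    if (str == NULL || *str == '\0')
        return -1;                    /* invalid or empty input */
    while (*str != '\0') {
        if (*str < '0' || *str > '9')
            return -1;                /* not a digit */
        if (val > (INT_MAX - (*str - '0')) / 10)
            return -1;                /* the next digit would overflow */
        val = val * 10 + *str++ - '0';
    }
    *result = val;
    return 0;
}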
The next step is to decide what to do with each of those possible errors. There are two broad answers here: do nothing, or detect the error and handle it. Doing nothing has several advantages: it’s simple to understand, it doesn’t introduce any additional control flows into our code, and it doesn’t increase the size of our application. In some cases it’s a reasonable approach.
For example, if it’s easy for the caller to avoid using invalid arguments then there is less need for the function itself to check for them. One example is the function sqrt that we talked about earlier. The rule for the caller is simple: don’t call sqrt with a negative value. It’s easy for the caller to write code that ensures that sqrt is never called with a negative value, either because the caller explicitly checks for a negative value or because the logic of the calling code always deals with non-negative values. For example:

for (i = 0; i < 10; ++i)
    printf("%d: %f\n", i, sqrt(i));

In this code sqrt can never be called with a negative value. If the user of our function can be relied on never to call it with invalid values then we don’t need to check for them.
On the other hand, if recognizing invalid input is hard to do, we shouldn’t push that burden onto our users. Consider a function that computes the two real roots of a quadratic equation:
void quad(double a, double b, double c,
          double *r1, double *r2)
{
    double disc = b*b - 4*a*c;
    *r1 = (-b + sqrt(disc))/(2*a);
    *r2 = (-b - sqrt(disc))/(2*a);
}
If disc is less than 0 the roots are complex numbers, and cannot be represented by double values. We could document that this function fails if the roots are not real, but that’s hard for users to recognize. We could go a step further, and document that the function fails if b*b-4*a*c is less than 0, but that isn’t much of an improvement. This is a case where we probably should not ignore the problem. It’s too hard for users to avoid it.
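If the architect agrees that the function should do the checking itself, one possible shape is a sketch like the following, where the status-return convention is invented for illustration:

#include <math.h>

/* Hypothetical variant: returns 0 on success, -1 when the roots
   are not real, so callers need not test b*b - 4*a*c themselves. */
int quad_checked(double a, double b, double c,
                 double *r1, double *r2)
{
    double disc = b*b - 4*a*c;
    if (disc < 0)
        return -1;                    /* complex roots: report failure */
    *r1 = (-b + sqrt(disc))/(2*a);
    *r2 = (-b - sqrt(disc))/(2*a);
    return 0;
}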
If we decide that we need to check for errors in the code that we’re writing, we have to add code to our function to check for these errors. That’s pretty straightforward: just insert if statements in the appropriate places. However, there’s a vocabulary that’s grown up around error checking that you should be familiar with. When we write code at the top of a function to check for invalid argument values we’re testing a "precondition." When we write code at the end of a function to check for a correct result we’re testing a "postcondition." Preconditions and postconditions, together, are the constituents of the notion of "programming by contract." The contract is: if you call this function with arguments that satisfy the preconditions, the function will return with results that satisfy the postcondition.
In the case of sqrt, the precondition is that the argument is not negative. The postcondition is that the square of the result is equal (to within rounding error) to the argument. If we write code to check both the precondition and the postcondition, the function looks something like this:
double sqrt(double x)
{
    double res;
    assert(0 <= x);
    /* some lengthy computation */
    assert(fabs(res*res - x) < MAX_ERROR);
    return res;
}
While we’re working on the code in sqrt the postcondition test is very useful: it lets us know if we’ve produced an incorrect result. The precondition test, on the other hand, is more useful to users of our code: it lets them know that they have called our code with an invalid value. Once the entire application has been completed and thoroughly tested we may decide to remove the explicit precondition and postcondition tests. That is, we might change our design decision about checking for these errors. This should be done cautiously, however, because recompiling the application with these tests removed can bring out symptoms of problems that were masked before. Be sure to allow sufficient time for retesting and debugging after removing such tests.
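The standard assert macro anticipates exactly this kind of change: if NDEBUG is defined when <assert.h> is included, each assert expands to a no-op, so the checks can be disabled without editing the function itself:

#define NDEBUG        /* usually defined on the compiler command line */
#include <assert.h>   /* assert(expr) now expands to ((void)0) */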
If we find that an error has occurred, we have to decide what to do with it. There are four possibilities here: abort, avoid, protect, and report.
Aborting program execution when an error occurs is, of course, not at all appropriate in, say, a pacemaker. There are times, though, when it’s the best thing to do. In particular, if the error indicates that the application’s internal data structures are so hopelessly corrupted that there is no way that the program can continue to run at all, the best thing to do is to quit before we do any further damage. We should give the user the best possible description that we can of what’s wrong, and then stop. In less hopeless situations, there may, nevertheless, be nothing that our function can do to make sense of the data that it has been passed. As an extreme example, consider the case of a compiler being asked to compile a file that actually holds data from a spreadsheet. This simply won’t work, and most compilers have a limit on the number of coding errors that they will report before they decide that it’s time to quit.
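In code, that last-resort response might be as simple as a message and a call to abort. Here is a sketch; the integrity check is invented, standing in for whatever consistency test the application can perform, and the fragment assumes <stdio.h> and <stdlib.h>:

/* symbol_table_is_consistent is a hypothetical sanity check. */
if (!symbol_table_is_consistent(&table)) {
    fprintf(stderr, "internal error: symbol table corrupted; "
                    "cannot continue\n");
    abort();        /* stop before we do any further damage */
}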
Avoiding the problem usually means trying a different approach. For example, some of the algorithms in the Standard C++ Library can be implemented to run faster if there is extra memory available for storing intermediate results. It could be an error for the implementor of one of these algorithms to simply assume that the extra memory is available and use only the fast version. If there is a possibility that the extra memory won’t be available, the code should check whether it can get the extra memory. If it can, the code can then use the faster version. If it cannot, it can fall back on the slower one. Another example of avoiding a problem occurs in code that asks the program’s user for input, then checks whether that input is acceptable. If not, it asks again.
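The first of those examples might be sketched like this; both sorting routines are invented names standing in for the fast and slow implementations:

#include <stdlib.h>

/* Hypothetical helpers, assumed to exist elsewhere: */
void merge_sort_buffered(int *data, size_t n, int *scratch); /* fast */
void insertion_sort(int *data, size_t n);        /* slow, in place */

void sort_items(int *data, size_t n)
{
    int *scratch = malloc(n * sizeof *scratch);
    if (scratch != NULL) {
        merge_sort_buffered(data, n, scratch);   /* fast path */
        free(scratch);
    } else {
        insertion_sort(data, n);   /* fall back: needs no extra memory */
    }
}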
Protecting ourselves from an error may seem like an odd notion, but it’s actually fairly common. C++’s iostreams do exactly this: when an operation fails, all subsequent operations on that stream will also fail, without attempting to perform any actual input or output. Every stream has a data member that can be examined to determine whether the stream is in a usable state, and every operation on that stream checks that flag before doing any actual work. When an operation fails it sets this flag, so no further attempts will be made to use this stream. This means that the user of the stream doesn’t have to check every stream operation for successful completion, but can wait until a logically related set of operations have been performed, and check for success at the end.
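The same idea can be sketched in C: each operation checks a status flag before doing any real work, and sets it on failure. The reader type here is invented for illustration:

#include <stdio.h>

struct reader {
    FILE *fp;
    int failed;      /* sticky failure flag, like a stream's state */
};

/* Read one integer. After any failure the reader is marked, and
   every later call returns immediately without attempting input. */
void read_int(struct reader *r, int *out)
{
    if (r->failed)
        return;                      /* already failed: do nothing */
    if (fscanf(r->fp, "%d", out) != 1)
        r->failed = 1;               /* record the failure */
}

A caller can invoke read_int a dozen times and examine the failed flag just once at the end, exactly as with a stream.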
Another example occurs in what will probably become the new C standard in a year or so. In C as it exists today, floating point operations that produce values too large to store in a floating point number produce the value HUGE_VAL, which is defined in <math.h>. The actual value of HUGE_VAL is not specified by the C standard, but it is often the case that it is a large but finite floating point value. If floating point code doesn’t check for HUGE_VAL, it may end up performing some operation on it, such as dividing it by 1000, that produces a value that looks reasonable but just happens to be dead wrong. The solution to this problem in the working paper for the new standard is to adopt the IEEE specification for floating point computations. This adds three values to the usual range of floating point values: positive infinity, negative infinity, and NaN (not a number). Instead of producing HUGE_VAL, operations that produce values too large to store in a floating point number produce the value positive infinity. Unlike HUGE_VAL, positive infinity cannot be quietly turned back into an ordinary-looking value: dividing it by a number greater than 0 produces positive infinity, dividing it by a negative value produces negative infinity, and dividing one infinity by another produces NaN. Further, all arithmetic operations involving NaNs result in NaN. If you think it through, you can see that once we’ve gotten one of these special values in some computation, it will normally propagate all the way through to the final result. This means that floating point code can defer checking for errors until the computation is finished. If an error of this sort has occurred the result will be one of these special values, and the code can recognize immediately that something went wrong.
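Under such an implementation, deferred checking might look like this sketch; isnan and isinf are the classification macros that the working paper adds to <math.h>, and the error return value is invented for illustration:

#include <math.h>

/* Perform the whole computation, then test once at the end. */
double scaled_ratio(double num, double den)
{
    double r = num / den;        /* may produce infinity or NaN */
    double s = r / 1000.0;       /* special values propagate through */
    if (isinf(s) || isnan(s))
        return -1.0;             /* hypothetical error indication */
    return s;
}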
In both of these examples, protecting ourselves from an error does not solve the underlying problem. It does permit us to simplify our code, because we don’t have to deal with error handling throughout it. However, when we do something like this we still must report back to the calling code that something went wrong.
Finally, if we can’t ignore the problem and we can’t avoid it and we can’t just quit, we’ve got to report it. We’ve run out of things that we can do in our own part of the code, so we’ve got to pass the responsibility on to the code that called our code. There are many techniques for telling the calling code that our function was unable to do what it was asked to do. That’s a big topic, and we’ll dig into it next month.
Copyright © 1998-2006 by Pete Becker. All rights reserved.