Common Design Mistakes, Part II

The C/C++ Users Journal, February 2000

One of the things that intrigues me about Java is its appeal to beginning programmers. In part, that appeal is the result of a deliberate decision to keep the language simple. When beginners ask questions about how to do something they won’t be deluged with subtly different answers followed by lengthy debates about which approach is better. With fewer alternatives, less discussion is needed [1]. That makes it easier to start writing code in Java than in C or C++. The other aspect of Java that contributes to its appeal to beginners is that much of the Java library was designed and implemented by people who were new to object-oriented programming. They made typical beginner’s mistakes, and that makes the library comfortable for beginners, because it’s written the way they think.

Having a library designed by beginners leads to a serious problem, however: as beginners learn more about programming, the way they look at programming problems changes, and they begin to look for more sophisticated approaches to their programming tasks. This produces pressure to change the Java libraries to accommodate the gradual increase in sophistication of the average Java user. So byte-oriented I/O has been supplemented by character-oriented I/O, the AWT has been replaced by Swing, classes have proliferated [2], and the proposed Java standard has turned into an 8,400-page document [3]. Those of us who are implementing the Java library had hoped that standardization would slow down this otherwise unchecked growth, but Sun has once again decided that the well-established procedures for producing international technical standards are unsuitable for Java [4]. So Java may well continue to grow in "internet time," which seems to provide a convenient excuse for inadequately tested class designs and sloppy documentation.

Last month we looked at half a dozen beginner’s mistakes in the design of the Java libraries. This month we’ll look at another half dozen mistakes, all arising from trying to move too fast.

Pushing designs beyond their capabilities

The Java package java.io has a class named StreamTokenizer. As its name suggests, you create an object of type StreamTokenizer when you need to read an input stream and parse its contents into tokens such as numbers, words, punctuators, and quoted strings. In its original incarnation, an object of type StreamTokenizer read a byte at a time from an object whose type was derived from java.io.InputStream. You could set whether any particular byte value was part of a word, whether it introduced a comment, whether it began a quote, and so on. This makes it easy to write code that understands the structure of data in text files.

The byte values returned by an input stream range from 0 to 255, so the obvious way to implement this class is with an array of 256 integer elements, one for each possible input value. Each array element contains a flag that tells the parser what to do with a character having the value of that element’s index. The values in this array are set by various methods in the class:


public void quoteChar(int ch)
	{   // mark value as quote delimiter
	if (0 <= ch && ch < ctype.length)
		ctype[ch] |= CC_QTE;
	}

The main parsing loop, simplified, looks something like this:


int ch;
while ((ch = in.read()) != -1)
	{   // process a token
	if ((ctype[ch] & CC_QTE) != 0)
		// process quote
	else if ((ctype[ch] & CC_CHR) != 0)
		// process word
	else if ((ctype[ch] & CC_NUM) != 0)
		// process number
	else
		// process ordinary character
	}

The reason for checking the return value from InputStream.read is that it returns a number in the range -1 to 255, with -1 indicating the end of the input stream. For any other value the code checks the value in the ctype array to decide what to do.

That worked fine, until the Java designers decided that they needed to provide better support for international character sets. They changed the library by adding a set of classes derived from java.io.Reader and java.io.Writer which traffic in 16-bit characters instead of 8-bit bytes, providing support for input and output operations of Unicode characters. Now, that’s not a bad thing, but in the case of StreamTokenizer it’s a problem. The obvious way to extend StreamTokenizer to handle 16-bit characters, by increasing the size of the internal array, isn’t practical: an array of 65,536 elements is too big to impose on every use of this class. So the design compromise for StreamTokenizer is to read input data from a Reader, with input values in the range -1 to 65,535, but to preserve the old implementation to the extent possible by treating character values greater than 255 only as ordinary characters.

The parser’s main loop changes very little. All it needs is an initial check for a value outside the range of the array:


int ch;
while ((ch = in.read()) != -1)
	{   // process a token
	if (ctype.length <= ch)
		// process ordinary character
	else if ((ctype[ch] & CC_QTE) != 0)
		// process quote
	else if ((ctype[ch] & CC_CHR) != 0)
		// process word
	else if ((ctype[ch] & CC_NUM) != 0)
		// process number
	else
		// process ordinary character
	}

In addition, StreamTokenizer got a new constructor that takes an object whose type is derived from Reader, and the old constructor, taking an object whose type is derived from InputStream, is now deprecated. The result is that existing code that uses the InputStream-based class hierarchy will continue to work as it did before, and new code can use the Reader-based hierarchy.

That’s all fine, except for one problem: the reason for changing from byte representations of text to character representations throughout the library was to be able to handle text from various international character sets transparently. StreamTokenizer does not do this. Aside from the obvious problem of not being able to set the type of an arbitrary character, there’s the problem of numbers: many non-ASCII character sets have their own set of characters that represent numbers, and the values that represent those characters are outside the range that a StreamTokenizer can handle. All numbers must be written in their ASCII equivalent so that the parser can recognize them as numbers.
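
The limitation is easy to demonstrate. The sketch below runs against a modern JDK: the ASCII digits of "42" come back as a number token, while the Arabic-Indic digit three (U+0663) does not. (The check deliberately asserts only that the non-ASCII digit is not a number, since exactly how the class categorizes it is an implementation detail.)

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class DigitDemo {
    public static void main(String[] args) throws IOException {
        // "42" in ASCII, then the Arabic-Indic digit three (U+0663)
        StreamTokenizer st =
            new StreamTokenizer(new StringReader("42 \u0663"));

        st.nextToken();
        System.out.println(st.ttype == StreamTokenizer.TT_NUMBER); // true
        System.out.println(st.nval);                               // 42.0

        st.nextToken();
        // U+0663 is a digit in Unicode, but not to StreamTokenizer.
        System.out.println(st.ttype == StreamTokenizer.TT_NUMBER); // false
    }
}
```

Any program that needs to parse numbers written in non-ASCII digits has to translate them itself before handing the text to the tokenizer.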

This is a problem for programmers because the interface to StreamTokenizer, by supporting construction with a Reader as its input source, implies that it treats all Unicode characters equally when in fact it only permits meaningful operations on the ASCII subset of the Unicode characters. This could have been avoided in either of two ways: by rewriting the parser to handle all Unicode characters, or by accepting the limitation to ASCII characters by not adding the constructor that takes a Reader. The former would make the class much more complex and significantly slower. The latter would cause problems for people who wanted to parse text from a Reader that they knew returned only ASCII characters. In fact, both of these solutions could have been adopted, using the same technique that the library used to introduce 16-bit streams: leaving the existing StreamTokenizer class alone, and adding a similar class that handled 16-bit characters.

If you find yourself tempted to write a class that handles only a subset of the data values that can be passed to it, you should probably write it so that callers can pass only values in that subset.
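
One way to follow that advice is to enforce the subset at the boundary. The class below is a hypothetical sketch (the name Latin1Reader and the choice of CharConversionException are mine, not the library’s): it wraps any Reader and fails loudly on the first character a byte-range parser couldn’t handle, instead of quietly mis-tokenizing it later.

```java
import java.io.CharConversionException;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical wrapper: delivers only characters in the range 0-255,
// the subset that a byte-oriented parser can actually handle.
class Latin1Reader extends FilterReader {
    Latin1Reader(Reader in) { super(in); }

    public int read() throws IOException {
        int c = super.read();
        if (c > 255)
            throw new CharConversionException("character out of range: " + c);
        return c;
    }

    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; ++i)
            if (buf[off + i] > 255)
                throw new CharConversionException("character out of range");
        return n;
    }
}
```

A parser built on such a reader can still only understand the byte-range subset, but at least it refuses input it would otherwise silently mangle.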

Failing to understand the implications of design changes

Another change that came with the addition of 16-bit characters to Java was the addition of the PrintWriter class. Like the PrintStream class, it translates numeric values and objects into String representations and inserts the String’s text into a stream. A PrintWriter object writes 16-bit Unicode values. A PrintStream object writes 8-bit ASCII values. At the same time that the PrintWriter class was added, the PrintStream class was deprecated. But this was done in a slightly peculiar way: in the JDK 1.1.5 documentation, both constructors for PrintStream are deprecated [5]. If your code creates a PrintStream object you’ll get a warning that you are using a deprecated feature. If you use PrintStream methods, however, you won’t get any warnings.

The reason for this is probably that the PrintStream class is used for writing to the console from a Java application. The class java.lang.System has two static data members of type PrintStream named out and err. Java code is riddled with lines like this:


System.out.println("Hello, world");

Deprecating all the methods in this class would have been impractical: far too much code relies on it, although usually without explicitly creating any PrintStream objects. So the library design has a compromise: you aren’t supposed to create PrintStream objects, but if you do get your hands on one it’s okay to use it.

There’s a problem with this: java.lang.System permits programs to replace the standard streams with streams created by the program. You might do this, for example, if you wanted to redirect standard out to a file programmatically. In order to do this, though, you have to create a PrintStream object to use as the replacement for the standard output stream. If you take deprecation seriously, as you should, you can’t do that: the PrintStream constructors are deprecated. Fortunately, Sun recognized this problem, and in JDK 1.2 the PrintStream constructors are no longer deprecated. Unfortunately, this sets a precedent, and makes it just a little harder to believe the documentation when it says that something is deprecated.
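
Here’s what that replacement looks like in practice. This sketch redirects System.out to an in-memory buffer rather than a file, to keep it self-contained, but the key step is the same either way: constructing the PrintStream that JDK 1.1 told you not to construct.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class RedirectDemo {
    public static void main(String[] args) {
        PrintStream saved = System.out;

        // Replacing the standard stream requires constructing a
        // PrintStream -- the constructor that JDK 1.1 deprecated.
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        System.setOut(new PrintStream(captured));

        System.out.println("Hello, world");   // goes into the buffer

        System.setOut(saved);                 // restore the original stream
        System.out.print(captured.toString());
    }
}
```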

If you find yourself tempted to tell users of your code that they shouldn’t rely on some feature, you should make sure that there are reasonable alternatives. Otherwise your users will insist that you change your mind.

Removing documented features

The class java.io.InputStream has a method named read with the following signature:


public int read(byte b[],
	int off, int len) throws IOException
	{ ..... }

The method reads up to len bytes from the stream, storing them in the array b beginning at index off. In addition, in early versions of Java, if this method was called with a null argument for the byte array, it would read and discard up to len bytes from the stream. Now that’s obviously a silly specification: it runs into the same problems as the overworked constructor for java.util.SimpleTimeZone that we looked at last month. But it was there, and now it’s gone. Today, if you call read with a null argument it will throw an object of type NullPointerException [6]. That is, code that used to be legal will now fail at runtime.

This is a particularly insidious problem in Java, because the Java library is not distributed as part of an application. Instead, the application gets the library that exists on the system where it is running. Sun has been very careful in this regard: this is the only example of this problem that I know of in the Java standard library. In your own code you also have to be careful, because changes in runtime behavior can only be found by testing, and then only by a test case that executes the changed code.
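
The change is easy to observe on any recent JDK. In this sketch, the call that the old documentation said would skip two bytes instead fails at runtime:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class NullReadDemo {
    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] { 1, 2, 3 });
        try {
            in.read(null, 0, 2);   // once documented: read and discard 2 bytes
            System.out.println("skipped two bytes");
        } catch (NullPointerException e) {
            System.out.println("NullPointerException"); // what happens today
        }
    }
}
```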

If you find yourself tempted to change library code in a way that makes previously valid input values invalid, you should resist the temptation. Such changes lead to program defects that can often only be detected by extensive and possibly expensive testing. Sometimes you have to live with your mistakes.

Changing the semantics of features

The class java.util.BitSet, as the name suggests, supports set, clear, and test operations for an arbitrarily sized set of bits. Although its documentation doesn’t say so, the implementation in JDK 1.1.5 is thread-safe: you can perform operations on a single BitSet object from multiple threads without having to worry about muddling the object’s internal storage. That’s because the methods are synchronized internally. That’s not a good implementation decision, because it makes what should be a fairly speedy test of a single bit very slow. The time spent getting and releasing the object’s mutex lock overwhelms the time needed for an individual bit manipulation. While it’s possible that programmers will sometimes need to share a BitSet object between threads, it’s probably much more common to create one for use in a single thread, where the overhead imposed by synchronization is wasted.

Sun changed this in JDK 1.2: the methods are no longer synchronized. But it’s too late for that change. Even though the documentation is silent concerning synchronization, inevitably users will look at the source code to determine whether BitSet objects can be used from multiple threads. Having once provided this capability, Sun should not have removed it. It would have been better to acknowledge the mistake, document the behavior of the BitSet class, and provide a new class, perhaps FastBitSet, providing the same interface but without the internal synchronization. That way the old version would continue to work in the same way, and users who didn’t want the speed penalty could switch to the new version.
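
With the JDK 1.2 behavior, code that really does share a BitSet between threads has to synchronize externally. A minimal sketch (it uses BitSet.cardinality, so it assumes a reasonably modern JDK):

```java
import java.util.BitSet;

public class SharedBitSet {
    public static void main(String[] args) throws InterruptedException {
        final BitSet bits = new BitSet(128);

        // The JDK 1.2 methods are unsynchronized, so concurrent
        // writers must lock the object themselves.
        Thread evens = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 128; i += 2)
                    synchronized (bits) { bits.set(i); }
            }
        });
        Thread odds = new Thread(new Runnable() {
            public void run() {
                for (int i = 1; i < 128; i += 2)
                    synchronized (bits) { bits.set(i); }
            }
        });

        evens.start(); odds.start();
        evens.join();  odds.join();
        System.out.println(bits.cardinality()); // 128: no updates lost
    }
}
```

Users who never share the object pay no locking cost at all, which is the point of the JDK 1.2 change; only the sharers pay.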

If you find yourself tempted to remove visible but undocumented behavior from a class, you should instead provide a new version of the class. Telling your customers that they shouldn’t have relied on it because it wasn’t documented is a good way of converting them into ex-customers.

Documenting features that don’t exist

I mentioned earlier that java.io.InputStream used to allow calling read with a null argument, and the effect would be to skip bytes in the stream. The JDK 1.1.5 documentation tells you that that’s how the read method works, but in fact the code for this is not present in the 1.1.5 implementation. Despite what the documentation says, if you call read with a null argument you’ll get a NullPointerException. The problem here, of course, is failing to keep the documentation up to date as the code changes. This is usually the result of rushing the product out the door, which is usually done by cutting quality checks.

If you find yourself tempted to eliminate quality checks in order to meet a deadline, you should instead eliminate features or get the deadline extended.

Documenting descriptively instead of prescriptively

The class java.io.InputStream has a method named skip that takes an argument of type long. It’s pretty clear from the name that when you call it, it discards as many bytes from the input stream as you specified with the argument in the call. The documentation in JDK 1.2 describes it like this:

public long skip(long n)
	throws IOException

Skips over and discards n bytes of data from this
input stream. The skip method may, for a variety of
reasons, end up skipping over some smaller number of
bytes, possibly 0. This may result from any of a number
of conditions; reaching end of file before n bytes
have been skipped is only one possibility. The actual number 
of bytes skipped is returned. If n is negative, no bytes
are skipped.

The skip method of InputStream creates a byte array and
then repeatedly reads into it until n bytes have been
read or the end of the stream has been reached. Subclasses
are encouraged to provide a more efficient implementation of
this method.

From reading this description, what does the specification require the following method to do?


public long skipOne(InputStream in)
	throws IOException
	{
	return in.skip(1);
	}

The answer is that this specification imposes almost no requirements at all. Nothing in it obliges skip to do anything other than report how many bytes it actually skipped, and skipping zero bytes is explicitly allowed. That’s what it says: "[t]he skip method may, for a variety of reasons, end up skipping over some smaller number of bytes, possibly 0." It does not say anything about the circumstances under which a subclass is allowed to do this. So if I were writing a class that, say, reads incoming data from a modem, I could write the skip method like this:


public class ModemStream
	extends InputStream
{
public long skip(long n)
	{
	return 0;	// skip nothing; still conforming
	}
}

This conforms to the specification. And it’s utterly useless. Unless you know the actual type of the InputStream object that you’re dealing with, you cannot call skip and expect it to actually change the stream input position.

The problem here is that the documentation attempts to summarize, rather vaguely, what the skip methods in the various subclasses of InputStream in the Java standard library actually do. ByteArrayInputStream can skip all the bytes that are asked for, up to the end of the file. BufferedInputStream can always skip all the bytes that are currently in the buffer, and more if the stream that it is buffering can skip any bytes. SequenceInputStream, as implemented in the JDK, has quirky behavior. An object of type SequenceInputStream acts like the Unix cat command: it concatenates streams. But for some reason its skip method was written so that it only works within the current stream, so if you want to skip more bytes than remain in the current stream you have to call skip more than once. Once the stream switches to the next input stream it will continue to skip bytes, up to the end of that stream.

Faced with such a variety of possible behavior, whoever wrote the documentation decided, wisely, not to try to describe all of the possibilities. Unfortunately, they didn’t understand the method well enough to tell writers of subclasses what they have to do.

Writing good documentation for virtual methods is hard. You have to write requirements, not descriptions. This requires understanding how the class is intended to be used, and understanding the role of the virtual method in implementing the class’s specification. For example, InputStream.skip could have been documented like this:

Skips over and discards m bytes of data from
this input stream, where m is the minimum of the
number of bytes requested, the number of bytes from
the current input position to the end of file, and
the number of bytes that can be read without blocking.

By clearly stating the circumstances under which the method is permitted to skip fewer bytes than were requested, the specification makes it clear what implementers are required to do. This would make the ModemStream example that I gave earlier invalid. It would also make the implementation of SequenceInputStream in the JDK invalid. In both cases, the class would be much more useful.
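
Under that wording, writing a conforming skip is straightforward for any stream that knows how many bytes it can deliver without blocking. Here is a sketch for a hypothetical in-memory stream (the class name ArrayStream is mine):

```java
import java.io.InputStream;

// Hypothetical stream over a byte array. Its skip method meets the
// stricter specification: it skips min(n, bytes remaining), and every
// remaining byte can be read without blocking.
class ArrayStream extends InputStream {
    private final byte[] data;
    private int pos = 0;

    ArrayStream(byte[] data) { this.data = data; }

    public int read() {
        return pos < data.length ? data[pos++] & 0xFF : -1;
    }

    public int available() {
        return data.length - pos;
    }

    public long skip(long n) {
        if (n < 0)
            return 0;                       // "If n is negative, no bytes are skipped."
        long m = Math.min(n, available());  // never past end of file
        pos += (int) m;
        return m;
    }
}
```

With this stricter contract, a caller can rely on skip(1) advancing the stream whenever it is not at end of file.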

If you find yourself tempted to document a virtual method by describing what it does, you should give more thought to what you expect subclasses to do. Once you understand what the requirements for the method are, you can write its specification.

Conclusion

Last month we looked at design errors in the Java libraries. This month we looked at errors that result from rushing the development of the libraries. I’ve used Java for these examples because it provides fertile ground for this sort of investigation. By looking at the nature of the mistakes that have been made in Java we can understand better the types of mistakes that we can make in all of our code.

1. On the other hand, having fewer alternatives often makes it harder to express an algorithm succinctly. But that’s a topic for another column.

2. The Java 1.0 library had 7 packages and 188 classes. The Java 1.1 library had 22 packages and 477 classes. The Java 1.2 library has 61 packages and 1575 classes.

3. The C standard is a little over 200 pages long. The C++ standard is about 750 pages long.

4. See http://www.infoworld.com/cgi-bin/displayStory.pl?991119.ececma.htm, "Sun, ECMA Fall Out Over Java Standardization."

5. Technically, stating that a feature is deprecated means that it might not be present in a future version, so you should be careful in relying on it. Etymologically it is a much stronger statement: the word "deprecate" comes from the Latin "deprecari," meaning "to avert by prayer."

6. I think it’s interesting that Java proponents make a big point of how Java doesn’t have pointers, without mentioning that it has a NullPointerException.