One of the things that intrigues me about Java is its appeal to beginning programmers. In part, that appeal is the result of a deliberate decision to keep the language simple. When beginners ask questions about how to do something, they won’t be deluged with subtly different answers followed by lengthy debates about which approach is better. With fewer alternatives, less discussion is needed [1]. That makes it easier to start writing code in Java than in C or C++. The other aspect of Java that contributes to its appeal to beginners is that much of the Java library was designed and implemented by people who were new to object-oriented programming. They made typical beginner’s mistakes, and that makes the library comfortable for beginners, because it’s written the way they think.
Having a library designed by beginners leads to a serious problem, however: as beginners learn more about programming, the way they look at programming problems changes and they begin to look for more sophisticated approaches to their programming tasks. This produces pressure to change the Java libraries to accommodate the gradual increase in sophistication of the average Java user. So byte-oriented I/O has been supplemented by character-oriented I/O, AWT has been replaced by Swing, classes have proliferated [2], and the proposed Java standard has turned into an 8,400-page document [3]. Those of us who are implementing the Java library had hoped that standardization would slow down this otherwise unchecked growth, but Sun has once again decided that the well-established procedures for producing international technical standards are unsuitable for Java [4]. So Java may well continue to grow in "internet time," which seems to provide a convenient excuse for inadequately tested class designs and sloppy documentation. Last month we looked at half a dozen beginner’s mistakes in the design of the Java libraries. This month we’ll look at another half dozen mistakes, all arising from trying to move too fast.
The Java package java.io has a class named StreamTokenizer. As its name suggests, you create an object of type StreamTokenizer when you need to read an input stream and parse its contents into tokens such as numbers, keywords, punctuators, and quotes. In its original incarnation, an object of type StreamTokenizer read a byte at a time from an object whose type was derived from java.io.InputStream. You could set whether any particular byte value was part of a keyword, whether it introduced a comment, whether it began a quote, and so on. This makes it easy to write code that understands the structure of data in text files.
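As a sketch of what typical use looks like, here is a small tokenizing loop built on the real StreamTokenizer API, using the original InputStream-based constructor that the article discusses (the input text and the helper name `tokenize` are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;

public class TokenDemo {
    // Tokenize a string using the original (now-deprecated)
    // InputStream-based constructor; tags each token as a word or number.
    static List<String> tokenize(String text) throws IOException {
        StreamTokenizer st =
            new StreamTokenizer(new ByteArrayInputStream(text.getBytes()));
        List<String> tokens = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_NUMBER)
                tokens.add("number:" + st.nval);   // nval is a double
            else if (st.ttype == StreamTokenizer.TT_WORD)
                tokens.add("word:" + st.sval);
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // → [word:width, number:42.0, word:height, number:10.0]
        System.out.println(tokenize("width 42 height 10"));
    }
}
```

The default syntax table already treats alphabetic characters as word characters and parses numbers, so no calls to quoteChar and friends are needed for this simple case.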
The byte values returned by an input stream range from 0 to 255, so the obvious way to implement this class is with an array of 256 integer elements, one for each possible input value. Each array element contains a flag that tells the parser what to do with a character having the value of that element’s index. The values in this array are set by various methods in the class:
public void quoteChar(int ch)
{   // mark value as quote delimiter
    if (0 <= ch && ch < ctype.length)
        ctype[ch] |= CC_QTE;
}
The main parsing loop, simplified, looks something like this:
int ch;
while ((ch = in.read()) != -1)
{   // process a token
    if ((ctype[ch] & CC_QTE) != 0)
        // process quote
    else if ((ctype[ch] & CC_CHR) != 0)
        // process word
    else if ((ctype[ch] & CC_NUM) != 0)
        // process number
    else
        // process ordinary character
}
The reason for checking the return value from InputStream.read is that it returns a number in the range -1 to 255, with -1 indicating the end of the input stream. For any other value the code checks the value in the ctype array to decide what to do. That worked fine, until the Java designers decided that they needed to provide better support for international character sets. They changed the library by adding a set of classes derived from java.io.Reader and java.io.Writer which traffic in 16-bit characters instead of 8-bit bytes, providing support for input and output operations on Unicode characters. Now, that’s not a bad thing, but in the case of StreamTokenizer it’s a problem. The obvious way to extend StreamTokenizer to handle 16-bit characters, by increasing the size of the internal array, isn’t practical: an array of 65,536 elements is too big to impose on every use of this class. So the design compromise for StreamTokenizer is to read input data from a Reader, with input values in the range -1 to 65,535, but to preserve the old implementation to the extent possible by treating character values greater than 255 only as ordinary characters. The parser’s main loop changes very little. All it needs is an initial check for a value outside the range of the array:
int ch;
while ((ch = in.read()) != -1)
{   // process a token
    if (ctype.length <= ch)
        // process ordinary character
    else if ((ctype[ch] & CC_QTE) != 0)
        // process quote
    else if ((ctype[ch] & CC_CHR) != 0)
        // process word
    else if ((ctype[ch] & CC_NUM) != 0)
        // process number
    else
        // process ordinary character
}
In addition, StreamTokenizer got a new constructor that takes an object whose type is derived from Reader, and the old constructor, taking an object whose type is derived from InputStream, is now deprecated. The result is that existing code that uses the InputStream-based class hierarchy will continue to work as it did before, and new code can use the Reader-based hierarchy.
That’s all fine, except for one problem: the reason for changing from byte representations of text to character representations throughout the library was to be able to handle text from various international character sets transparently. StreamTokenizer does not do this. Aside from the obvious problem of not being able to set the type of an arbitrary character, there’s the problem of numbers: many non-ASCII character sets have their own set of characters that represent numbers, and the values that represent those characters are outside the range that a StreamTokenizer can handle. All numbers must be written in their ASCII equivalent so that the parser can recognize them as numbers.
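A short demonstration of this mismatch: the platform knows that ARABIC-INDIC DIGIT THREE (U+0663) is a digit, but StreamTokenizer’s syntax table stops at \u00FF, so that character is never parsed as a number (exactly how an out-of-range character is classified can vary by JDK version, but it is not TT_NUMBER). The helper name `tokenType` is made up for illustration:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class UnicodeDigits {
    // Return the token type of the first token in the given text.
    static int tokenType(String text) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.nextToken();
        return st.ttype;
    }

    public static void main(String[] args) throws IOException {
        // The platform classifies U+0663 as a digit...
        System.out.println(Character.isDigit('\u0663'));
        // ...but only the ASCII digit is tokenized as a number.
        System.out.println(tokenType("3") == StreamTokenizer.TT_NUMBER);
        System.out.println(tokenType("\u0663") == StreamTokenizer.TT_NUMBER);
    }
}
```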
This is a problem for programmers because the interface to StreamTokenizer, by supporting construction with a Reader as its input source, implies that it treats all Unicode characters equally when in fact it only permits meaningful operations on the ASCII subset of the Unicode characters. This could have been avoided in either of two ways: by rewriting the parser to handle all Unicode characters, or by accepting the limitation to ASCII characters and not adding the constructor that takes a Reader. The former would make the class much more complex and significantly slower. The latter would cause problems for people who wanted to parse text from a Reader that they knew returned only ASCII characters. In fact, both of these solutions could have been adopted, using the same technique that the library used to introduce 16-bit streams: leaving the existing StreamTokenizer class alone, and adding a similar class that handled 16-bit characters.
If you find yourself tempted to write a class that only handles a subset of the data values that you can pass to it, you should probably write it so that you can only pass values in the usable subset.
Another change that came with the addition of 16-bit characters to Java was the addition of the PrintWriter class. Like the PrintStream class, it translates numeric values and objects into String representations and inserts the String’s text into a stream. A PrintWriter object writes 16-bit Unicode values. A PrintStream object writes 8-bit ASCII values. At the same time that the PrintWriter class was added, the PrintStream class was deprecated. But this was done in a slightly peculiar way: in the JDK 1.1.5 documentation, both constructors for PrintStream are deprecated [5]. If your code creates a PrintStream object you’ll get a warning that you are using a deprecated feature. If you use PrintStream methods, however, you won’t get any warnings.
The reason for this is probably that the PrintStream class is used for writing to the console from a Java application. The class java.lang.System has two static data members of type PrintStream named out and err. Java code is riddled with lines like this:
System.out.println("Hello, world");
Deprecating all the methods in this class would have been impractical: far too much code relies on it, although usually without explicitly creating any PrintStream objects. So the library design has a compromise: you aren’t supposed to create PrintStream objects, but if you do get your hands on one it’s okay to use it.
There’s a problem with this: java.lang.System permits programs to replace the standard streams with streams created by the program. You might do this, for example, if you wanted to redirect standard output to a file programmatically. In order to do this, though, you have to create a PrintStream object to use as the replacement for the standard output stream. If you take deprecation seriously, as you should, you can’t do that: the PrintStream constructors are deprecated. Fortunately, Sun recognized this problem, and in JDK 1.2 the PrintStream constructors are no longer deprecated. Unfortunately, this sets a precedent, and makes it just a little harder to believe the documentation when it says that something is deprecated.
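To see why the constructors could not stay deprecated, here is a sketch of replacing standard output through System.setOut; the replacement must itself be a PrintStream, so you cannot avoid calling a PrintStream constructor (the helper name `captureOutput` is made up for illustration; it redirects into a memory buffer rather than a file):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class RedirectDemo {
    // Run some code with System.out redirected into a buffer,
    // then restore the original stream and return what was printed.
    static String captureOutput(Runnable r) {
        PrintStream saved = System.out;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        System.setOut(new PrintStream(buf));   // requires a PrintStream constructor
        try {
            r.run();
        } finally {
            System.setOut(saved);              // always restore the real stream
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        String out = captureOutput(() -> System.out.println("Hello, world"));
        System.out.print("captured: " + out);
    }
}
```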
If you find yourself tempted to tell users of your code that they shouldn’t rely on some feature, you should make sure that there are reasonable alternatives. Otherwise your users will insist that you change your mind.
The class java.io.InputStream has a method named read with the following signature:

public int read(byte b[], int off, int len)
    throws IOException
{ ..... }
The method reads up to len bytes from the stream, storing them in the array b beginning at index off. In addition, in early versions of Java, if this method was called with a null argument for the byte array, it would read and discard up to len bytes from the stream. Now that’s obviously a silly specification: it runs into the same problems as the overworked constructor for java.util.SimpleTimeZone that we looked at last month. But it was there, and now it’s gone. Today, if you call read with a null argument it will throw an object of type NullPointerException [6]. That is, code that used to be legal will now fail at runtime. This is a particularly insidious problem in Java, because the Java library is not distributed as part of an application. Instead, the application gets the library that exists on the system where it is running. Sun has been very careful in this regard: this is the only example of this problem that I know of in the Java standard library. In your own code you also have to be careful, because changes in runtime behavior can only be found by testing, and then only by a test case that executes the changed code.
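The change is easy to demonstrate in a current JDK: passing a null buffer fails at runtime instead of silently discarding bytes as the old specification allowed. The helper name `throwsOnNull` is made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class NullReadDemo {
    // Report whether read(null, ...) fails with NullPointerException.
    static boolean throwsOnNull(InputStream in) throws IOException {
        try {
            in.read(null, 0, 4);   // old JDKs would skip 4 bytes here
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5});
        System.out.println(throwsOnNull(in));
    }
}
```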
If you find yourself tempted to change library code in a way that makes previously valid input values invalid, you should resist the temptation. Such changes lead to program defects that can often only be detected by extensive and possibly expensive testing. Sometimes you have to live with your mistakes.
The class java.util.BitSet, as the name suggests, supports set, clear, and test operations for an arbitrarily sized set of bits. Although its documentation doesn’t say so, the implementation in JDK 1.1.5 is thread-safe: you can perform operations on a single BitSet object from multiple threads without having to worry about muddling the object’s internal storage. That’s because the methods are synchronized internally. That’s not a good implementation decision, because it makes what should be a fairly speedy test of a single bit very slow. The time spent getting and releasing the object’s mutex lock overwhelms the time needed for an individual bit manipulation. While it’s possible that programmers will sometimes need to share a BitSet object between threads, it’s probably much more common to create one for use in a single thread, where the overhead imposed by synchronization is wasted. Sun changed this in JDK 1.2: the methods are no longer synchronized. But it’s too late for that change. Even though the documentation is silent concerning synchronization, inevitably users will look at the source code to determine whether BitSet objects can be used from multiple threads. Having once provided this capability, Sun should not have removed it. It would have been better to acknowledge the mistake, document the behavior of the BitSet class, and provide a new class, perhaps FastBitSet, providing the same interface but without the internal synchronization. That way the old version would continue to work in the same way, and users who didn’t want the speed penalty could switch to the new version.
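With the JDK 1.2-style unsynchronized BitSet, a program that still needs to share one instance across threads must add the locking itself; single-threaded users then pay no locking cost. A minimal sketch of such a wrapper (the class name SharedBits is made up for illustration):

```java
import java.util.BitSet;

public class SharedBits {
    private final BitSet bits = new BitSet();

    // Serialize access to the underlying, unsynchronized BitSet.
    public void set(int i) {
        synchronized (bits) {
            bits.set(i);
        }
    }

    public boolean get(int i) {
        synchronized (bits) {
            return bits.get(i);
        }
    }

    public static void main(String[] args) {
        SharedBits s = new SharedBits();
        s.set(3);
        System.out.println(s.get(3));   // true
    }
}
```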
If you find yourself tempted to remove visible but undocumented behavior from a class, you should instead provide a new version of the class. Telling your customers that they shouldn’t have relied on it because it wasn’t documented is a good way of converting them into ex-customers.
I mentioned earlier that java.io.InputStream used to allow calling read with a null argument, and the effect would be to skip bytes in the stream. The JDK 1.1.5 documentation tells you that that’s how the read method works, but in fact the code for this is not present in the 1.1.5 implementation. Despite what the documentation says, if you call read with a null argument you’ll get a NullPointerException. The problem here, of course, is failing to keep the documentation up to date as the code changes. This is usually the result of rushing the product out the door, which is usually done by cutting quality checks.
If you find yourself tempted to eliminate quality checks in order to meet a deadline, you should instead eliminate features or get the deadline extended.
The class java.io.InputStream has a method named skip that takes an argument of type long. It’s pretty clear from the name that when you call it, it discards as many bytes from the input stream as you specified with the argument in the call. The documentation in JDK 1.2 describes it like this:
public long skip(long n) throws IOException

    Skips over and discards n bytes of data from this input stream.
    The skip method may, for a variety of reasons, end up skipping
    over some smaller number of bytes, possibly 0. This may result
    from any of a number of conditions; reaching end of file before
    n bytes have been skipped is only one possibility. The actual
    number of bytes skipped is returned. If n is negative, no bytes
    are skipped.

    The skip method of InputStream creates a byte array and then
    repeatedly reads into it until n bytes have been read or the end
    of the stream has been reached. Subclasses are encouraged to
    provide a more efficient implementation of this method.
From reading this description, what does the specification require the following method to do?
public long skipOne(InputStream in) throws IOException
{
    return in.skip(1);
}
The answer is that this specification imposes essentially no requirement at all: the function is allowed to do nothing and return 0, whether or not any input remains. Nothing in the specification requires that skip actually advance the stream position. That’s what it says: "[t]he skip method may, for a variety of reasons, end up skipping over some smaller number of bytes, possibly 0." It does not say anything about the circumstances under which a subclass is allowed to do this. So if I were writing a class that, say, reads incoming data from a modem, I could write the skip method like this:
public class ModemStream extends InputStream
{
    // read() omitted: it returns bytes arriving from the modem

    public long skip(long n)
    {
        return 0;   // conforming: "some smaller number of bytes, possibly 0"
    }
}
This conforms to the specification. And it’s utterly useless. Unless you know the actual type of the InputStream object that you’re dealing with, you cannot call skip and expect it to actually change the stream input position.
The problem here is that the documentation attempts to summarize, rather vaguely, what the skip methods in the various subclasses of InputStream in the Java standard library actually do. ByteArrayInputStream can skip all the bytes that are asked for, up to the end of the file. BufferedInputStream can always skip all the bytes that are currently in the buffer, and more if the stream that it is buffering can skip any bytes. SequenceInputStream, as implemented in the JDK, has quirky behavior. An object of type SequenceInputStream acts like the Unix cat command: it concatenates streams, but for some reason the skip method was written so that it only works within the current stream. So if you want to skip more bytes than there are in the current stream you have to call skip more than once. Once the stream switches to the next input stream it will continue to skip bytes, up to the end of that stream. Faced with such a variety of possible behavior, whoever wrote the documentation decided, wisely, not to try to describe all of the possibilities. Unfortunately, they didn’t understand the method well enough to tell writers of subclasses what they have to do.
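Given this variety of behavior, a caller who really needs n bytes discarded cannot trust a single call to skip. A defensive sketch (the helper name `skipFully` is made up; it loops on skip and falls back to read when skip makes no progress, so even a lazy implementation like the ModemStream example cannot stall it):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SkipFully {
    // Discard up to n bytes, returning how many were actually discarded.
    static long skipFully(InputStream in, long n) throws IOException {
        long skipped = 0;
        while (skipped < n) {
            long k = in.skip(n - skipped);
            if (k <= 0) {                // skip made no progress
                if (in.read() == -1)     // genuine end of stream
                    break;
                k = 1;                   // consumed one byte by reading it
            }
            skipped += k;
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[10]);
        System.out.println(skipFully(in, 7));   // 7
        System.out.println(skipFully(in, 7));   // 3 - only three bytes were left
    }
}
```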
Writing good documentation for virtual methods is hard. You have to write requirements, not descriptions. This requires understanding how the class is intended to be used, and understanding the role of the virtual method in implementing the class’s specification. For example, InputStream.skip could have been documented like this:
Skips over and discards m bytes of data from this input stream, where m is the minimum of the number of bytes requested, the number of bytes from the current input position to the end of file, and the number of bytes that can be read without blocking.
By clearly stating the circumstances under which the method is permitted to skip fewer bytes than were requested, the specification makes it clear what implementers are required to do. This would make the ModemStream example that I gave earlier invalid. It would also make the implementation of SequenceInputStream in the JDK invalid. In both cases, the class would be much more useful.
If you find yourself tempted to document a virtual method by describing what it does, you should give more thought to what you expect subclasses to do. Once you understand what the requirements for the method are, you can write its specification.
Last month we looked at design errors in the Java libraries. This month we looked at errors that result from rushing the development of the libraries. I’ve used Java for these examples because it provides fertile ground for this sort of investigation. By looking at the nature of the mistakes that have been made in Java we can understand better the types of mistakes that we can make in all of our code.
1. On the other hand, having fewer alternatives often makes it harder to express an algorithm succinctly. But that’s a topic for another column.
2. The Java 1.0 library had 7 packages and 188 classes. The Java 1.1 library had 22 packages and 477 classes. The Java 1.2 library has 61 packages and 1575 classes.
3. The C standard is a little over 200 pages long. The C++ standard is about 750 pages long.
4. See http://www.infoworld.com/cgi-bin/displayStory.pl?991119.ececma.htm, "Sun, ECMA Fall Out Over Java Standardization."
5. Technically, stating that a feature is deprecated means that it might not be present in a future version, so you should be careful in relying on it. Etymologically it is a much stronger statement: the word "deprecate" comes from the Latin "deprecari," meaning "to avert by prayer."
6. I think it’s interesting that Java proponents make a big point of how Java doesn’t have pointers, without mentioning that it has a NullPointerException.
Copyright © 2000-2006 by Pete Becker. All rights reserved.