I’d like to begin the new year here in The Journeyman’s Shop with an examination of something fairly old, by computing standards, that you’re probably familiar with: the quicksort algorithm. It was first published by C.A.R. Hoare in 1962, and since then it has become one of the most widely used, most widely analyzed, and perhaps most widely misunderstood techniques for sorting data held in memory. As usual, you can find more information than you’ll ever need in Donald Knuth’s "The Art of Computer Programming," vol. 3, "Sorting and Searching."
The quicksort algorithm has undergone some significant improvements since Hoare’s initial version. These improvements have been driven by the need to avoid the quadratic time complexity and linear space requirement that quicksort can fall into. As the conditions that lead to these problems have been recognized, the details of the algorithm have been changed to avoid them. Thus, time and space requirements have been improved in the worst cases, usually at the cost of somewhat slower execution in the ordinary cases. In this column we’ll look at these improvements to quicksort, with the goal of better understanding how the time and space problems arise in quicksort, how these problems are solved, and what effect solving these problems has on performance. Although you probably won’t ever have to write an efficient version of quicksort, the techniques that we use in analyzing quicksort ought to be useful to you in tuning your own code.
Given two iterators first and last that define a half-open range [first, last) and an operator < that can be used to compare two objects of the type that first and last point to, our task is to write an algorithm that arranges the elements in that range in increasing order. More formally, after we have applied the algorithm, it must be true that for every value of j in the half-open range [0, last - first - 1) [1], !(*(first + j + 1) < *(first + j)) [2].
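Purely as an illustration, a hypothetical checking aid (the name is mine, and it is not part of the sort itself) that tests this condition using nothing but operator < and iterator arithmetic might look like this:

template<class RanIt>
bool is_nondecreasing(RanIt first, RanIt last)
{
    for (RanIt it = first; 1 < last - it; ++it)
        if (*(it + 1) < *it)       // an adjacent pair is out of order
            return false;
    return true;
}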
In order to make sure that we don’t accidentally use an operation that isn’t necessarily available for the items that we’ll be sorting, we need to write a simple class that provides only the basic operations that are essential to our algorithm. We need to be able to compare two items with the operator <, and we need to be able to copy items. We also need to be able to create items that have distinct values, and it will be useful to be able to display the values of these items. That leads directly to the following class definition:
class item
{
public:
// create:
explicit item(int v) : val(v) {}
// copy:
item(const item& other) : val(other.val) {}
item& operator = (const item& other)
{ val = other.val; return *this; }
// compare:
bool operator < (const item& other) const
{ return val < other.val; }
// display:
friend std::ostream& operator <<
(std::ostream& os, const item& it)
{ return os << it.val; }
private:
int val;
};
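As an aside, a hypothetical fragment showing how a driver might create a set of items with distinct values and scramble them before sorting could look like the following; the column’s actual driver is part of Listing 1, and ELTS is the element-count constant discussed below:

#include <algorithm>    // std::random_shuffle
#include <vector>

const int ELTS = 100000;           // number of elements to sort

std::vector<item> make_data()
{
    std::vector<item> v;
    for (int i = 0; i < ELTS; ++i)
        v.push_back(item(i));      // distinct, initially ordered values
    std::random_shuffle(v.begin(), v.end());   // put them in random order
    return v;
}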
The quicksort algorithm uses the divide and conquer approach: the range to be sorted is divided into two parts, such that each of the elements in the first part is less than each of the elements in the second part. The algorithm is then re-applied to each of the two parts. As long as we make sure that neither part is ever empty, each time that we partition a range we’ll end up with two parts that each contain fewer elements than the range that we just partitioned. Sooner or later we’ll get down to parts that contain only one element, and those parts don’t need any further work: they’re already in the correct order. So we have the two essential parts of a recursive algorithm: a recursion process that breaks large pieces into smaller ones, and a terminating condition so that we know when to stop the recursion.
The implementation of any algorithm begins by defining how it is invoked. In the case of quicksort, since we’re sorting the elements in a range, we need to specify the beginning and the end of the range to be sorted. And since we’re writing an algorithm that can handle almost arbitrary types we need to write a template:
template<class RanIt>
void sort(RanIt first, RanIt last);
I’ve chosen the name RanIt
for the type of the
iterator to suggest that the iterator should be a random access
iterator. Some of the operations that we perform on these iterators are
supported only by random access iterators.
As with most recursive algorithms, the first thing that the code should do is check whether the terminating condition is true. The terminating condition is that the range contains only one element. However, it costs nothing to also test for zero elements, which makes the algorithm safe when called with an empty range:
if (last - first <= 1)
return;
Having survived that test, our code next needs to begin the partitioning process. In its simplest form, the quicksort algorithm uses the first element in the range as a pivot value, and separates the rest of the range into one piece that contains elements that are less than the pivot value and one piece that contains elements that are greater than or equal to the pivot value. The usual technique is to build up one piece starting from the left end of the range and to build up the other piece starting from the right end of the range. As we shuffle elements around, the left-hand piece grows to the right and the right-hand piece grows to the left. In order to talk about these two new ranges more easily we’ll introduce four new iterator variables, one pair for each of the two pieces that we’re creating, plus an iterator that points to the pivot element:
RanIt pivot = first; // pivot element
RanIt first1, last1; // left-hand piece
RanIt first2, last2; // right-hand piece
Since we haven’t yet made any comparisons of elements, each of the two subranges must be empty. The left-hand subrange begins with the element immediately after the pivot element, so its iterators both get initialized to point to that element. The right-hand subrange ends at the end of the input range, so its iterators both get initialized with the value of the iterator for the end of the range:
first1 = last1 = first + 1;
first2 = last2 = last;
Now we’re ready to expand both pieces, maintaining the two
invariants that all of the elements in the half-open range
[first1,last1)
are less than the pivot element and that all
of the elements in the half-open range [first2,last2)
are
greater than or equal to the pivot element. Since both ranges are
initially empty, these two invariants are both true when the algorithm
begins. We start by examining the element just past the end of the first
range. If its value is less than the value of the pivot element then it
belongs to that range, and we can move the top end of that range up to
include it, then go back and try the next element:
while (last1 < first2 && *last1 < *pivot)
++last1;
When this loop exits, last1 either points to the beginning of the upper range, in which case we won’t find any more elements that belong in the lower range, or it points to an element whose value is greater than or equal to the value of the pivot element.
Similarly, we expand the upper range downward by looking at the element immediately below the lower end of the range, and adding it to the range if it is greater than or equal to the value of the pivot element:
while (last1 <= first2 - 1 && *pivot < *(first2 - 1))
--first2;
When this loop exits, first2
points to the upper end of
the lower range or to an element whose value is less than the value of
the pivot element.
After executing both of these loops, if the two iterators haven’t met in the middle they point to elements that don’t belong in the subrange where they were found. We exchange these two elements, and resume the comparisons. If the iterators met then we’ve partitioned the input range into the two desired subranges, and it’s time to use quicksort on the two subranges. The initial version of quicksort, along with a driver program, is in Listing 1.
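Leaving aside the driver and the instrumentation discussed below, the pieces fit together roughly as in the following sketch. Here the exchange step advances both iterators so that the loop always makes progress, which means an element equal to the pivot can end up in either piece; Listing 1 may handle such details differently. Since first1 and last2 never change, the sketch leaves them out:

#include <algorithm>    // std::iter_swap

template<class RanIt>
void sort(RanIt first, RanIt last)
{
    if (last - first <= 1)
        return;                     // zero or one element: nothing to do
    RanIt pivot = first;            // pivot element
    RanIt last1 = first + 1;        // left-hand piece is [first + 1, last1)
    RanIt first2 = last;            // right-hand piece is [first2, last)
    while (last1 < first2)
    {
        while (last1 < first2 && *last1 < *pivot)
            ++last1;                // grow the left-hand piece upward
        while (last1 <= first2 - 1 && *pivot < *(first2 - 1))
            --first2;               // grow the right-hand piece downward
        if (last1 < first2)
        {                           // both iterators stopped at misplaced elements
            std::iter_swap(last1, first2 - 1);
            ++last1;                // step past the exchanged elements
            --first2;
        }
    }
    std::iter_swap(pivot, last1 - 1);   // put the pivot between the two pieces
    sort(first, last1 - 1);         // elements no greater than the pivot
    sort(last1, last);              // elements no less than the pivot
}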
I mentioned earlier that the quicksort algorithm can fall into quadratic time complexity and linear space usage. In order to be able to investigate these performance problems I’ve added two variables to the implementation of the algorithm. The variable maxdepth keeps track of the deepest nesting of the recursive calls to sort. The variable ops keeps rough track of the number of operations that have been performed. With these two variables we can perform simple experiments to investigate the performance of the algorithm. We can change the value of the constant ELTS to change the number of elements that we’re sorting, and we can vary the patterns in the data that we sort. As we’ll see, there are some data patterns that lead to very bad performance with this version of the algorithm.
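The exact mechanism is part of Listing 1, but one plausible shape for the instrumentation, assuming simple file-scope counters (the listing may well do this differently), is:

long ops = 0;        // rough count of operations performed
int depth = 0;       // current nesting of recursive calls
int maxdepth = 0;    // deepest nesting seen so far

template<class RanIt>
void sort(RanIt first, RanIt last)
{
    if (last - first <= 1)
        return;                     // no deeper nesting for trivial ranges
    if (maxdepth < ++depth)
        maxdepth = depth;           // record a new deepest nesting
    // ... partitioning and recursive calls as in the earlier sketch,
    //     with ++ops next to each comparison and exchange ...
    --depth;                        // leaving this level of recursion
}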
You probably know that the time to sort a range of values using
quicksort should be proportional to NlogN, where N is the number of
elements being sorted. To see this happen, begin by running the program
in Listing 1 as is. It sorts 100,000 values, initially in random order.
Then change the value of ELTS
to 1000000 and run it again,
and finally, change the value of ELTS
to 10000000 and run
it again. For one hundred thousand elements, NlogN is 500,000 (taking logarithms to base 10; the base only scales every value by the same factor, so the ratios that follow don’t depend on it). For a million elements it’s 6,000,000, and for ten million it’s 70,000,000. So running this sort implementation with one million elements ought to require about twelve times as many operations as running it with one hundred thousand elements, and running it with ten million elements ought to require about 140 times as many. With the compiler I used I got the following values [3]:
Elements       Operations      Predicted Operations
   100,000      2,139,243      ---
 1,000,000     25,146,540      25,670,916
10,000,000    304,778,130      299,494,020
As you can see, the measured values are within about two percent of the predicted values.
This version of quicksort falls apart when the data that it is given
is already in order. To see this happen, comment out the call to
std::random_shuffle
in main
and run the
program again. You might want to reduce the value of ELTS
before you do this: with ELTS
set to 1,000 I got 500,499
operations. With ELTS
set to 10,000, I got 50,004,999
operations. A little math shows that increasing the number of elements
by a factor of ten increased the number of operations by a factor of one
hundred. For data that is already in order, the time complexity of this
version of quicksort is O(n^2).
There’s another problem with this version of quicksort: look at the call depth when it processes a range of sorted data. With 1,000 elements it reaches a depth of 998 nested calls; with 10,000 elements it reaches a depth of 9,998. Since each nested call requires a fixed amount of space for its local variables, the space usage of this version of quicksort is linear in the number of elements being sorted when it is applied to data that is already in order. That’s far too much space. In fact, when I tried to sort 100,000 elements that were already in order I got a stack overflow. A sort implementation that crashes when you give it 100,000 elements isn’t very useful.
We can fix the space problem by eliminating some of the recursive
calls. Notice that the last statement in the sort function is a
recursive call to sort itself. This is known as "tail
recursion." Tail recursion can usually be eliminated by writing a
loop instead of the final call. Listing
2 has a new version of the sort template with the tail recursion
removed. Replace the sort implementation in Listing 1 with this version,
then repeat the tests that we just did to see what effect this has on
memory usage. It may seem a bit strange to see a maximum depth of 0, but
that’s just a side effect of the already-sorted input. Put the
std::random_shuffle
back in and you’ll see that we
still get about the same stack usage as before for random data.
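In outline, the transformation looks something like this structural sketch; the partitioning code in the middle is unchanged from the earlier version, and Listing 2 may differ in detail:

template<class RanIt>
void sort(RanIt first, RanIt last)
{
    while (1 < last - first)        // a loop replaces the tail call
    {
        // ... choose the pivot, partition [first + 1, last), and move
        //     the pivot into place at last1 - 1, exactly as before ...
        sort(first, last1 - 1);     // still recurse on the left-hand piece
        first = last1;              // then loop to handle the right-hand piece
    }
}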
We’re not finished yet, though. Replace the call to std::random_shuffle with a call to std::reverse, so that the incoming data is sorted in reverse order. For this data the stack usage goes way up, although not as far as it did for already sorted data before we eliminated the tail recursion.
The reason that the stack usage was so high in the first place is
that when we split the input range into two pieces, the algorithm chose
the first value as the pivot value. When the data is already in order,
this means that the pivot value is the lowest value in the range, so the
first piece consists of just the pivot value, and the second piece
consists of all the values except for the pivot. That is, each time we
split the input range we create a subrange with one element and a
subrange with N - 1 elements. Although we do make progress with each of
these partitions, we don’t make much progress. We ended up calling
sort recursively once for each element in the input sequence.
Eliminating the tail recursion fixes this space problem when the input
range is already sorted, but doesn’t help much when the input
range is sorted in reverse order. In order to get rid of the problem
here, too, we need to make the code that eliminates the tail recursion
just a little smarter. If we call sort
for the smaller of
the two pieces, and loop on the larger, we’ll reduce stack usage
significantly. This version of the algorithm is shown in Listing 3.
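In outline, again as a sketch with the unchanged partitioning code elided, the idea is to compare the sizes of the two pieces, recurse on the smaller, and adjust first or last so the loop continues with the larger:

template<class RanIt>
void sort(RanIt first, RanIt last)
{
    while (1 < last - first)
    {
        // ... choose the pivot, partition, and move the pivot into
        //     place at last1 - 1, exactly as before ...
        if (last1 - 1 - first < last - last1)
        {                           // the left-hand piece is the smaller one
            sort(first, last1 - 1); // recurse on it
            first = last1;          // loop on the larger, right-hand piece
        }
        else
        {                           // the right-hand piece is no larger
            sort(last1, last);      // recurse on it
            last = last1 - 1;       // loop on the larger, left-hand piece
        }
    }
}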
None of these changes affected the number of operations needed to sort any particular input sequence. We’ve avoided stack overflows, but we haven’t increased the speed of the sort. For a sorted input sequence and for a reverse sorted input sequence the time complexity is still O(n^2). That’s because in each of these cases we’re choosing the worst possible pivot. Ideally we’d like to have a pivot value that will end up near the middle of the sorted sequence, so that the two subranges are about the same size. The usual solution is to look at several values and to pick the middle value of those examined. Most often this is done with three values, and this approach is known as "median of three."
The code in Listing 4 implements this approach. It uses a bubble sort to sort the elements pointed to by mid, first, and last - 1. The result is that the median of the three values pointed to by first, mid, and last - 1 ends up at the beginning of the range being sorted, so that’s the value that will be used as the pivot. If you rerun the previous tests with this version of the algorithm you’ll see that the number of operations needed to sort a range of already sorted values drops significantly [4]. For sorted and reverse-sorted data I get 119,535 operations to sort a range with 10,000 elements, and 1,534,481 operations to sort a range with 100,000 elements, giving a ratio of about 12.8. That’s what we’d expect if the time complexity is O(NlogN). So we’ve eliminated the quadratic behavior for sorted data, without affecting the time complexity for random data.
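In sketch form, the median-of-three step might look something like this, placed just before the partitioning loop; picking mid as the middle element of the range is my assumption, and Listing 4 may choose it differently:

RanIt mid = first + (last - first) / 2;
// a three-element bubble sort over the positions mid, first, and
// last - 1, arranged so that the median of the three sampled values
// ends up at first:
if (*first < *mid)
    std::iter_swap(mid, first);
if (*(last - 1) < *first)
    std::iter_swap(first, last - 1);
if (*first < *mid)
    std::iter_swap(mid, first);
// *first now holds the median, so the partitioning code can use it as
// the pivot without further changes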
More recent investigations of the quicksort algorithm have focused on another issue: how to make the algorithm run quickly for equal data. If I generate 10,000 equal elements and sort them, the latest version of the algorithm performs 125,264 operations. It ought to be possible to recognize that all the elements are equal with only 9,999 operations. If we can modify the quicksort algorithm to deal quickly with equal data elements, without slowing it down too much for random data, we’ll have yet another significant improvement in performance. Next month we’ll look at Sedgewick’s version of the quicksort algorithm, which is probably the most commonly used technique today, and we’ll look at the approach that Dinkumware now uses in our libraries, which I have named the Inside Out sort.
1. Subject, of course, to the condition that (last - first - 1) is not negative.
2. It would be more natural to write this last condition as *(first + j) <= *(first + j + 1), thus comparing elements in their natural order, but that would require an operator <=. Since we are assuming here that we only have an operator <, we must resort to this somewhat less obvious way of expressing the order of each adjacent pair of elements.
3. You may get slightly different results, depending on the random number generator that your compiler provides. This determines the initial order of the elements.
4. In addition to median of three, Hoare suggested picking a pivot element at random. This will almost always avoid quadratic behavior.
Copyright © 2001-2006 by Pete Becker. All rights reserved.