The first string sorting algorithm that
we're going to look at is actually the
basis for several more complicated
algorithms is called key-indexed counting
and it's very useful on a particular
special situation. but let's take a quick
review of where we left off with sorting.
so we considered a number of sorting
algorithms starting with insertion sort,
and then merge sort, quick sort and
heapsort. And we got to the point where we
could find an algorithm that's heapsort
that guarantees to sort NL items and time
proportional to N log N without using
extra space, unfortunately not stable. And
all these algorithms were useful, or are
useful for any type of generic key as long
as it implements the comparative
operation. and not only that, we prove
that, any algorithm that just uses
compares has to use number of compares
proportional to N log base 2N. So, in a
very important sense, merge sort, or
heapsort for example, or optimal. you
can't use syntatically fewer compares,
for, either one and, and, with heapsort,
you can't use less, extra space. So why do
we consider, other sorting algorithms?
there was a lower bound, what, what, why
are we thinking about this? And the
question is, can we do better and
obviously we're here because, the answer
is that we can do better and if we don't
depend on compares. The lower bound, the
one assumption made by the lower bound is
that we use compares, but we don't always
need to use compares. And so let's look at
an example. Key-indexed counting is a fine
example of that. and it's representative
of fairly common situation where, in
sorting application, where it happens to
be that the keys that we're using to sort
are small integers. So, in this, this
case, this is supposed to mimic an
application where there's students and
they're assigned to sections. There's not
too many sections and we want to get the
thing sorted. so we want to distribute the
students by section and so we want to sort
according to the section number and that's
a small integer. And the implication of
knowing that the key is a small integer
is, that we can use the key as an array
index and, and by knowing that the key is
an array index, we can arrange for fast
sort. so lots of applications for that
when we have maybe a phone numbers, we can
sort by area code or if you have a string
you just want to sort by the first letter,
you can do it that way. And actually, that
idea leads to efficient sorting algorithm
actually two different ways. now don't
forget that we're sorting according to a
sort key, but usually we're sorting bigger
generic items that have other information
associated with if you were just sorting
the small integers, you could just count
how many 1's there are, how many 2's there
are and like that, and then in one path
and if there's three 1's, just output
three 1's and so forth. but the
complication is that we have to carry the
associated information along so we have to
work a bit harder than that. so, here's
the code for this method called
key-indexed counting, and, let's look at a
demo. So here's the key-indexed counting
demo. Now, to make this a little less
confusing in not so many numbers, we're
going to use lower case a for zero, b for
one, c for two and like that. So it's the
a - first letter of the alphabet, and
however you want to think of it. so, and
we're already going to look at six.
So we're supposing that we're sorting this
array that has six different small
integers and we're using lower case
letters to represent the integers so that
we can easily distinguish between the keys
and this. So now, let's look at, the
processing for this. So the first thing
that we do is we go through and we count
the frequency of occurrence of each letter
and, so the way that we do that is we keep
an array. now their arrays actually got to
be one bigger than number of different
key, keys that we have and the number of
different small integers that we have. So
in this case array of size seven.
And just this, make the code a little
cleaner, we keep the number of a's in
count of one, the number of b's in count
of two and so forth. and s o if we're,
once we've set up, that's what we want to
do, then its trivial to go ahead and count
the frequencies. we'd simply go through,
for i from zero to N, we'd go through our
input. and when we, a of i, when we access
a value in our input, it's a small
integer. So it's zero, one, two, three,
four, or five and we simply add one to
that integer, and use it to index into
that array. So when we see an a, that's
zero, then we're, incrementing count of
one, and we see a b, that's one, we're
incrementing count of two and so forth. so
in this case, we increment count
corresponding to b and then a and c and
like that. And everytime we encounter a
new key, we simply increment one of these
counters. In one pass through we get an
array that gives us the number of a's,
b's, c's, d's, e's and f's. That's the
first path of key-indexed counting, count
the frequencies of each letter, using the
key as an index. now the next step is, is
called computing cumulus, and that's a
really easy thing as well. all we do is we
go through the count array and simply at
each step we add the current one to the
sum computed so far. So if we look before,
we had two a's and three b's, so that
means there's five letters less than c.
that's the a's and the b's, and they're
six letters less than d and eight letters
less than e and so forth. And that's just
obtained by, we start with two, add three
to it, get five, add one to it to get
five.
And with that one passed through the count
array, then we can find out for example,
there are six keys less than d and eight
keys less than e. And those cumulus tell
us where the d's go in the output. There's
six keys less than d, and eight keys less
than e, so the d's have to go in a6 and
a7. So this is an array of indices that is
going to tell us how to distribute the
keys in the output. So that's the next
step, is access the cumulus using the key
as the index to move items. So let's take
a look at. So now, remember when we see an
a, we're just going to count that as zero,
so we're going to go to count zero and
that will access this entry in the count
ar ray.
So, we go through the whole array to be
sorted and we move each key exactly to
where it has to go and we'll do that one
at, at time now. So when I is zero we're
looking at the d, the count array
corresponding to d has six, so it says,
just put d in there and increment that. It
means if you get another d, it's going to
go into seven.
And these things, the way we pre-computed
them, are not going to run into one
another. So now, a, we go, that goes in
zero, and we increment the count array
corresponding to a. next thing is c, and
so that's going to says to put it in five
and then increment, the count array
corresponding to c, and f, it says put it
in nine.
Next is b, we put it in two.
Sorry another f, we put in ten.
Next is b that we put in wo. So you can
see the keys from the input are getting
distributed in the output according to the
counts and the cumulus that we've
pre-computed. so now we get the other d
which goes into seven, we get the, another
b which goes into three, and then
increment the four for where the next one
goes, f goes into eleven.
The last b goes into four, the e goes into
a, and the second a goes into one.
So that's move items again, simply by
using the key as index into the count
array. and then the last step is to just
copy the sorted array back into the
original input. that's a demo of
key-indexed counting. Quick summary of
key-indexed counting, we make one pass
through the array to count frequencies of
each letter using the key as an index,
then we go through that count array to,
compute cumulus just by adding each new
one into the running sum. then we use
those cumulates, and access that using key
as index to actually move items over and
get them in sorted order, and then move
back into the original array. what's the
running time of this algorithm? Well, the
analysis is actually quite simple, because
it's just a couple of loops through the
array that we sorted, and through the
count array. and the key, key fact to
note, that it takes time proportional to N
plus R and space proportional to N + R.
Now R, remember is our array that excess
the number of different character values.
so, so for asking, maybe that's the 205th.
and, and for generic data maybe it's four.
And N, we're assuming we're sorting huge
files. So really, this is linear time in
many, many practical situations. there's
also the question of, is it stable? Yeah,
it's actually stable. because when we do
the move, we move things with equal keys
in the order that we see them, we keep
them in the order that we see them. That's
just the way the method works. So we have
for this special situation, we have a
linear time stable sorting method which
beats the N log N and it's useful in many
practical situations.