we can also get strings sorted by moving
from left to right.
That's called MSD string sorter.
Most-significant-digit-first string sort.
And this is going to kind of, kind of
related to what we did with quick sort,
and maybe viewed as a generalization of
quick sort.
So the idea is that we start with the
first character, to use key index counting
on the first character of the array and
that will partition it into the strings
that start with each character.
So.
If we use the first character, the
left-most character in a string as, as the
first character in this case, then we get
all the strings that start with A followed
by all the strings that start with B and
so forth.
And then recursively use the same method
for each subfile.
Essentially, we have a subarray for, each
of the characters.
And, that cumulate that we built with key
index counting gives us the subarrays and
actually completely delineates the
subarrays that we need to sort.
Remember we had built up this array that
said that there's six keys less than d and
eight keys less than e and so forth.
So that tells us exactly precisely the
boundaries of the subarray that contain
the keys to start with d,
And then we can just use the same method
recursively to sort each of the subarrays
one for each character.
Mm.
Now, this is a little bit complicated
trace,
But if you look at it more carefully after
the lecture you'll see that it's pretty
simple, setup for what we need to do.
So, here's our input and we sort on the
first character.
If we sort on the first character,
In this example, we have lots of s's.
So the subfile for the s's, it's already
sorted on the first character.
This d is the digit that we're currently
working on. So then, we sort that on,
it's, the rest of it on its first
character.
So it's the second character and all the
words that start with s.
And then there's a lot of them that start
with e,
So we move on to the third character for
those,
And then there's some that start with a
and so forth.
So, recursively,
Every time that we move on one character,
we have to keep going until one thing we
have to do, is if there's two keys that
are equal, we have to examine every
character in the key.
We never find out that the two keys are
equal until the end.
And if we reach the end of a string, we
can just assume that goes before any
character value.
And with those two things, then you, you
can trace through the recursion to see how
MSD string sort of works.
So and again, in this case, it sorts by
the first character, A and B have only
size once, so you don't have to do
anything.
So then most of this slide is showing what
happens in the keys that start with S, and
then at the end, there's the two keys that
start with T, and they have both h's and
both e's, we have to go through the whole
thing.
So that's an example of MSD string sort.
Now, if strings are variable length like
they were in that example, we just treat
them as if they had an extra character at
the end that's smaller than any character.
So they'll appear before they're supposed
to appear alphabetically, before strings
that have the same starting characters
that are longer.
And you can do that by overloading, you
know, implementing charAt to just return
minus one, if, if we're past the character
that we're looking at.
So, that's the easy part of the
implementation.
One thing to point out, the in the C
programming language, the representation
of strings puts an extra character that's
zero at the end and those string has the
zero character.
So, actually not to do anything at all.
But when that change the code for MSD
string sort is also really an extremely
simple modification or extension to key
index counting.
So we've got our sort and it takes its
input array and then, output buffer,
And then we have to take the deliminators
of the subfile we're going to sort, low
and high just like we've done for other
recursive sorts.
We go ahead and we do the key index
counting, again, to take care of, we have
to take care of the fact that we have an
extra character that kind of the end of
string character.
And also, pull out the d character of our
string just like we did for LSD sort.
And we just go through the whole thing and
sort according to the given character.
And then, what we do next is just do a
recursive call for every entry in the
counter array where we, we just sort the
part of the array from count of r to lo
plus count of r or lo plus count of r plus
one.
Really just one line of curve for each
subarray.
And then we move to the right one
character.
So, again, this is very little code to get
a very useful and powerful sorting method,
That's MSD string sorting.
One thing to notice is that, with the,
extra output buffer,
We can use that even with the recursive
calls,
But not the counter array,
We have to keep the counter array around.
So it's part, it's gotta be local to the
recursive procedure,
Cuz when we call for the next character,
we're going to need, a new counter array
for that character but have to save the
old one to do the next subarray for, the
calling, method.
Well that turns out to be important
because there's a potential for big
problems with MSD sort when we start
looking at the analysis.
And the thing to notice is that if you
have a tiny array, say an array of size
two, it doesn't matter how small your
array is, you need to have a counter array
for each potential character in the
alphabet.
So for ASCII, just to, to initialize the
counter and to set it to zero, just create
it, it's going to be a hundred times
slower, then just sorting the thing or
just copying it back, even if you're using
Unicode, it's 32,000 times slower. And
what's worse is it's a recursive program,
so there's going to be lots of small
subarrays.
Sorry.
This feature or characteristic of MSD
string sort actually [INAUDIBLE], we're
having a lot of applications that were
using it when people switched from ASCII
to Unicode a while back.
All of a sudden programs that were really
efficient sorts,
All of a sudden, became hundreds of times
slower, with the switch to Unicode,
Because these big arrays in the recursive,
these big arrays in the recursive
procedure.
It was a, a serious performance problem,
So we definitely have to watch out for
that, with, MSD string sort.
Now there is a good solution to avoid this
danger and it's the same solution we've
used before.
If you've got a small subarray and the
sort is going to be slower just cut off
the insertion sort.
The other thing you can do is and save
some more time, is to just have insertion
sort start at the character that we're
currently working on.
Cuz we know things are, equal, to the left
and we're just looking for the right.
And, that's easy to implement just by
changing the implementation of the compare
function, to take into account which
character we're at.
Notice it's quicker,
And, and it happens to be quicker in Java
to just pull out the substrings and use
compare to, than to go in and get the
charAt that's just a feature of the
implementation.
So switching to, cutting off the insertion
sort for small subarrays is definitely a
good idea for MSD string sort.
And so, what about the performance of this
method?
Well the key characteristic of, of MSD
string sorting, it examines just the
characters that it needs to examine in
order to get the C key sorted.
So, that means that its performance is
going to be really dependent on the data.
Now, in the, worst case for the algorithm,
it has to examine all the data, in this
case all the characters in all the
strings, and that's, for example, when
they're all equal, it's going to look at
all the characters.
If you have some duplicate keys, it might
have to examine all the characters in
those duplicate keys, but there's plenty
of other strings that it doesn't examine
all the characters. so,
That, depending on, and end of the keys
are random in some way, if they, or
they're approximated by random keys.
Then it's not going to examine very many
characters at all,
Actually and this is a typical case, say,
for account numbers or some situation
library call numbers or some situation
like that,
In a case like that.
Msd string sort will examine only a small
fraction of the data,
A small constant fraction of the data,
And it's always going to be sublinear.
Now, it's also possible that sorts that do
comparisons can be sublinear, but MSD
string sort is, you, you know, good and
that it really only examines the
characters that need to be examined in
order to get the sort done.
So this gives us another line on, on the
table,
And the key here is that if the data is
approximated by random log base r of N is
going to be pretty well approximated by a
constant in the real world.
So this is going to be a fast method.
The one danger is that, you have to worry
out [inaudible], worry about using too
much extra space, with those big counter
arrays.
But that's a important and useful,
practical sorting method.
So now let's look at, MSD string sort
versus Quicksort for strings.
There, it, they are similar in many ways.
They're both recursive methods that,
partition a file up.
So one of the problems with MSD string
sort is that, It tends to, when it's doing
the counting, it's kind of making random
accesses to memory when it, particularly
when it's distributing the keys out.
So on modern systems that have caches not
everything is in the fastest memory at, at
the same time and programs that move from
one piece of data to another right next
are, are going to be much more efficient.
And quicksorts like that, an MSD quicksort
is not.
Another disadvantage of MSD string sort is
that there's actually quite a few
instructions in, in our loop.
With the indexing and accounting.
And the plus and [inaudible] in the
accumulating and so forth,
Whereas Quicksort remember is fast,
because it has only a few instructions in
the inner loop.
And amnesty strings are used as extra
space,
Whereas quick sort, there's two things,
one is, it's not linear in these
applications, where MSD string sort is,
and kind of the reason that it's not
linear, is that if you've got keys that
have a lot of the same characters at the
beginning, when it's doing the compares,
it has to rescan or recompare, lots of
those characters.
So, what were going to look at next, is
try to get this, achieve this goal, of
combining the advantages of, of these two
sorting algorithms.