1
00:00:03,500 --> 00:00:08,584
The first string sorting algorithm that
we're going to look at is actually the

2
00:00:08,584 --> 00:00:13,734
basis for several more complicated
algorithms is called key-indexed counting

3
00:00:13,734 --> 00:00:19,287
and it's very useful on a particular
special situation. but let's take a quick

4
00:00:19,287 --> 00:00:25,486
review of where we left off with sorting.
so we considered a number of sorting

5
00:00:25,486 --> 00:00:32,117
algorithms starting with insertion sort,
and then merge sort, quick sort and

6
00:00:32,584 --> 00:00:39,497
heapsort. And we got to the point where we
could find an algorithm that's heapsort

7
00:00:39,964 --> 00:00:47,436
that guarantees to sort NL items and time
proportional to N log N without using

8
00:00:47,436 --> 00:00:55,002
extra space, unfortunately not stable. And
all these algorithms were useful, or are

9
00:00:55,002 --> 00:01:02,052
useful for any type of generic key as long
as it implements the comparative

10
00:01:02,052 --> 00:01:08,721
operation. and not only that, we prove
that, any algorithm that just uses

11
00:01:08,721 --> 00:01:15,960
compares has to use number of compares
proportional to N log base 2N. So, in a

12
00:01:15,960 --> 00:01:23,397
very important sense, merge sort, or
heapsort for example, or optimal. you

13
00:01:23,397 --> 00:01:30,835
can't use syntatically fewer compares,
for, either one and, and, with heapsort,

14
00:01:30,835 --> 00:01:38,272
you can't use less, extra space. So why do
we consider, other sorting algorithms?

15
00:01:38,538 --> 00:01:44,382
there was a lower bound, what, what, why
are we thinking about this? And the

16
00:01:44,382 --> 00:01:50,241
question is, can we do better and
obviously we're here because, the answer

17
00:01:50,241 --> 00:01:56,071
is that we can do better and if we don't
depend on compares. The lower bound, the

18
00:01:56,071 --> 00:02:02,438
one assumption made by the lower bound is
that we use compares, but we don't always

19
00:02:02,438 --> 00:02:08,727
need to use compares. And so let's look at
an example. Key-indexed counting is a fine

20
00:02:08,727 --> 00:02:14,638
example of that. and it's representative
of fairly common situation where, in

21
00:02:14,638 --> 00:02:21,280
sorting application, where it happens to
be that the keys that we're using to sort

22
00:02:21,280 --> 00:02:26,666
are small integers. So, in this, this
case, this is supposed to mimic an

23
00:02:26,666 --> 00:02:32,968
application where there's students and
they're assigned to sections. There's not

24
00:02:32,968 --> 00:02:40,181
too many sections and we want to get the
thing sorted. so we want to distribute the

25
00:02:40,181 --> 00:02:46,635
students by section and so we want to sort
according to the section number and that's

26
00:02:46,635 --> 00:02:53,001
a small integer. And the implication of
knowing that the key is a small integer

27
00:02:53,001 --> 00:02:59,543
is, that we can use the key as an array
index and, and by knowing that the key is

28
00:02:59,543 --> 00:03:05,839
an array index, we can arrange for fast
sort. so lots of applications for that

29
00:03:05,839 --> 00:03:12,708
when we have maybe a phone numbers, we can
sort by area code or if you have a string

30
00:03:12,708 --> 00:03:19,495
you just want to sort by the first letter,
you can do it that way. And actually, that

31
00:03:19,495 --> 00:03:27,196
idea leads to efficient sorting algorithm
actually two different ways. now don't

32
00:03:27,196 --> 00:03:35,162
forget that we're sorting according to a
sort key, but usually we're sorting bigger

33
00:03:35,162 --> 00:03:41,536
generic items that have other information
associated with if you were just sorting

34
00:03:41,536 --> 00:03:47,475
the small integers, you could just count
how many 1's there are, how many 2's there

35
00:03:47,475 --> 00:03:52,980
are and like that, and then in one path
and if there's three 1's, just output

36
00:03:52,980 --> 00:03:58,267
three 1's and so forth. but the
complication is that we have to carry the

37
00:03:58,267 --> 00:04:04,718
associated information along so we have to
work a bit harder than that. so, here's

38
00:04:04,718 --> 00:04:10,771
the code for this method called
key-indexed counting, and, let's look at a

39
00:04:10,771 --> 00:04:16,621
demo. So here's the key-indexed counting
demo. Now, to make this a little less

40
00:04:16,621 --> 00:04:21,663
confusing in not so many numbers, we're
going to use lower case a for zero, b for

41
00:04:21,663 --> 00:04:27,318
one, c for two and like that. So it's the
a - first letter of the alphabet, and

42
00:04:27,318 --> 00:04:31,815
however you want to think of it. so, and
we're already going to look at six.

43
00:04:31,815 --> 00:04:37,130
So we're supposing that we're sorting this
array that has six different small

44
00:04:37,130 --> 00:04:42,445
integers and we're using lower case
letters to represent the integers so that

45
00:04:42,445 --> 00:04:48,684
we can easily distinguish between the keys
and this. So now, let's look at, the

46
00:04:48,684 --> 00:04:54,626
processing for this. So the first thing
that we do is we go through and we count

47
00:04:54,626 --> 00:05:01,089
the frequency of occurrence of each letter
and, so the way that we do that is we keep

48
00:05:01,089 --> 00:05:09,323
an array. now their arrays actually got to
be one bigger than number of different

49
00:05:09,323 --> 00:05:14,672
key, keys that we have and the number of
different small integers that we have. So

50
00:05:14,672 --> 00:05:19,729
in this case array of size seven.
And just this, make the code a little

51
00:05:19,729 --> 00:05:24,920
cleaner, we keep the number of a's in
count of one, the number of b's in count

52
00:05:24,920 --> 00:05:30,517
of two and so forth. and s o if we're,
once we've set up, that's what we want to

53
00:05:30,517 --> 00:05:36,180
do, then its trivial to go ahead and count
the frequencies. we'd simply go through,

54
00:05:36,180 --> 00:05:43,508
for i from zero to N, we'd go through our
input. and when we, a of i, when we access

55
00:05:43,508 --> 00:05:49,036
a value in our input, it's a small
integer. So it's zero, one, two, three,

56
00:05:49,036 --> 00:05:55,261
four, or five and we simply add one to
that integer, and use it to index into

57
00:05:55,261 --> 00:06:01,557
that array. So when we see an a, that's
zero, then we're, incrementing count of

58
00:06:01,557 --> 00:06:08,106
one, and we see a b, that's one, we're
incrementing count of two and so forth. so

59
00:06:08,106 --> 00:06:14,150
in this case, we increment count
corresponding to b and then a and c and

60
00:06:14,150 --> 00:06:18,869
like that. And everytime we encounter a
new key, we simply increment one of these

61
00:06:18,869 --> 00:06:24,147
counters. In one pass through we get an
array that gives us the number of a's,

62
00:06:24,147 --> 00:06:29,044
b's, c's, d's, e's and f's. That's the
first path of key-indexed counting, count

63
00:06:29,044 --> 00:06:35,911
the frequencies of each letter, using the
key as an index. now the next step is, is

64
00:06:36,697 --> 00:06:45,438
called computing cumulus, and that's a
really easy thing as well. all we do is we

65
00:06:45,733 --> 00:06:54,573
go through the count array and simply at
each step we add the current one to the

66
00:06:54,868 --> 00:07:03,690
sum computed so far. So if we look before,
we had two a's and three b's, so that

67
00:07:03,690 --> 00:07:09,974
means there's five letters less than c.
that's the a's and the b's, and they're

68
00:07:09,974 --> 00:07:16,552
six letters less than d and eight letters
less than e and so forth. And that's just

69
00:07:16,552 --> 00:07:22,576
obtained by, we start with two, add three
to it, get five, add one to it to get

70
00:07:22,576 --> 00:07:26,301
five.
And with that one passed through the count

71
00:07:26,301 --> 00:07:32,642
array, then we can find out for example,
there are six keys less than d and eight

72
00:07:32,642 --> 00:07:39,214
keys less than e. And those cumulus tell
us where the d's go in the output. There's

73
00:07:39,214 --> 00:07:45,540
six keys less than d, and eight keys less
than e, so the d's have to go in a6 and

74
00:07:45,540 --> 00:07:51,709
a7. So this is an array of indices that is
going to tell us how to distribute the

75
00:07:51,709 --> 00:07:58,778
keys in the output. So that's the next
step, is access the cumulus using the key

76
00:07:58,778 --> 00:08:05,634
as the index to move items. So let's take
a look at. So now, remember when we see an

77
00:08:05,634 --> 00:08:11,603
a, we're just going to count that as zero,
so we're going to go to count zero and

78
00:08:11,603 --> 00:08:15,106
that will access this entry in the count
ar ray.

79
00:08:15,111 --> 00:08:20,931
So, we go through the whole array to be
sorted and we move each key exactly to

80
00:08:20,931 --> 00:08:26,901
where it has to go and we'll do that one
at, at time now. So when I is zero we're

81
00:08:26,901 --> 00:08:31,445
looking at the d, the count array
corresponding to d has six, so it says,

82
00:08:31,445 --> 00:08:36,052
just put d in there and increment that. It
means if you get another d, it's going to

83
00:08:36,052 --> 00:08:40,017
go into seven.
And these things, the way we pre-computed

84
00:08:40,017 --> 00:08:45,161
them, are not going to run into one
another. So now, a, we go, that goes in

85
00:08:45,161 --> 00:08:51,463
zero, and we increment the count array
corresponding to a. next thing is c, and

86
00:08:51,463 --> 00:08:58,356
so that's going to says to put it in five
and then increment, the count array

87
00:08:58,356 --> 00:09:03,127
corresponding to c, and f, it says put it
in nine.

88
00:09:03,131 --> 00:09:08,319
Next is b, we put it in two.
Sorry another f, we put in ten.

89
00:09:08,321 --> 00:09:14,774
Next is b that we put in wo. So you can
see the keys from the input are getting

90
00:09:14,774 --> 00:09:20,985
distributed in the output according to the
counts and the cumulus that we've

91
00:09:20,985 --> 00:09:27,437
pre-computed. so now we get the other d
which goes into seven, we get the, another

92
00:09:27,437 --> 00:09:33,487
b which goes into three, and then
increment the four for where the next one

93
00:09:33,487 --> 00:09:38,999
goes, f goes into eleven.
The last b goes into four, the e goes into

94
00:09:38,999 --> 00:09:45,771
a, and the second a goes into one.
So that's move items again, simply by

95
00:09:45,771 --> 00:09:53,979
using the key as index into the count
array. and then the last step is to just

96
00:09:54,272 --> 00:10:01,503
copy the sorted array back into the
original input. that's a demo of

97
00:10:01,503 --> 00:10:07,958
key-indexed counting. Quick summary of
key-indexed counting, we make one pass

98
00:10:07,958 --> 00:10:13,870
through the array to count frequencies of
each letter using the key as an index,

99
00:10:13,870 --> 00:10:20,078
then we go through that count array to,
compute cumulus just by adding each new

100
00:10:20,078 --> 00:10:26,360
one into the running sum. then we use
those cumulates, and access that using key

101
00:10:26,360 --> 00:10:32,346
as index to actually move items over and
get them in sorted order, and then move

102
00:10:32,346 --> 00:10:38,646
back into the original array. what's the
running time of this algorithm? Well, the

103
00:10:38,646 --> 00:10:46,217
analysis is actually quite simple, because
it's just a couple of loops through the

104
00:10:46,217 --> 00:10:52,257
array that we sorted, and through the
count array. and the key, key fact to

105
00:10:52,257 --> 00:10:59,022
note, that it takes time proportional to N
plus R and space proportional to N + R.

106
00:10:59,022 --> 00:11:05,626
Now R, remember is our array that excess
the number of different character values.

107
00:11:05,868 --> 00:11:16,043
so, so for asking, maybe that's the 205th.
and, and for generic data maybe it's four.

108
00:11:16,045 --> 00:11:23,026
And N, we're assuming we're sorting huge
files. So really, this is linear time in

109
00:11:23,026 --> 00:11:30,036
many, many practical situations. there's
also the question of, is it stable? Yeah,

110
00:11:30,036 --> 00:11:36,521
it's actually stable. because when we do
the move, we move things with equal keys

111
00:11:36,521 --> 00:11:42,772
in the order that we see them, we keep
them in the order that we see them. That's

112
00:11:42,772 --> 00:11:49,023
just the way the method works. So we have
for this special situation, we have a

113
00:11:49,023 --> 00:11:55,899
linear time stable sorting method which
beats the N log N and it's useful in many

114
00:11:55,899 --> 00:11:57,540
practical situations.