1
00:00:00,000 --> 00:00:04,367
This optional video will be, more or less,
the last word that we have on sorting for

2
00:00:04,367 --> 00:00:08,313
the purposes of this course. And it'll
answer the question, can we do better?

3
00:00:08,313 --> 00:00:12,575
Remember, that's the mantra of any good
algorithm designer. I've shown you N log N

4
00:00:12,575 --> 00:00:16,890
algorithms for sorting, Merge Short in the
worst case, Quick Sort, on average. Can we

5
00:00:16,890 --> 00:00:20,836
do better than N log N? Indeed, for the
selection problem, we saw we could do

6
00:00:20,836 --> 00:00:25,256
better than N log N. We could linear time.
Maybe we can do linear time for sorting as

7
00:00:25,256 --> 00:00:29,537
well. The purpose of this video is to
explain to you why we cannot, do sorting,

8
00:00:29,537 --> 00:00:34,216
in linear time. So this is a rare problem
where we understand quite precisely how

9
00:00:34,216 --> 00:00:38,399
well it can be solved at least for a
particular class of [inaudible] called

10
00:00:38,399 --> 00:00:42,802
comparison based sorts which I'll explain
in a moment. So here's the form of the

11
00:00:42,802 --> 00:00:48,035
theorem, I want to give you the gist of in
this video. So in addition to restricting

12
00:00:48,035 --> 00:00:51,525
to comparison based sorts which is
necessary as we'll see in a second, I'm

13
00:00:51,525 --> 00:00:55,440
going to make a second assumption which is
not necessary but is convenient for the

14
00:00:55,440 --> 00:00:59,261
lecture which is that I'm only going to
think about deterministic algorithms for

15
00:00:59,261 --> 00:01:03,352
the moment. I encourage you to think about
why the same style of arguments gives an N

16
00:01:03,352 --> 00:01:07,070
log and lower bound on the expected
running time of any randomized algorithm.

17
00:01:07,070 --> 00:01:10,835
Maybe I'll put that on the course site as
an optional theory problem. So, in

18
00:01:10,835 --> 00:01:14,891
particular, a quick sort is optimal in the
randomized sense. It have average and long

19
00:01:14,891 --> 00:01:18,753
end time and then again claims that no
comparison based sort can be better than

20
00:01:18,753 --> 00:01:22,470
that, even on average. So, I need to tell
you what I mean by a comparison based

21
00:01:22,470 --> 00:01:26,612
sorting algorithm. What it means, it's a
sorting algorithm that accesses the

22
00:01:26,612 --> 00:01:30,975
elements of the input array. Only via
comparisons, it does not do any kind of

23
00:01:30,975 --> 00:01:35,510
direct manipulation on a single array
element. All it does, is it picks pairs of

24
00:01:35,510 --> 00:01:40,259
elements and asks the question is the left
one bigger or is the right one bigger. I

25
00:01:40,259 --> 00:01:44,292
like to think of comparison based sorts as
general purpose sorting routines. They

26
00:01:44,292 --> 00:01:48,376
make no assumptions about what the data is
other than that it's from some totally

27
00:01:48,376 --> 00:01:52,509
ordered set. I like to think of it really
as a function that takes as an argument a

28
00:01:52,509 --> 00:01:56,393
function pointer that allows it to do
comparisons between abstract data types.

29
00:01:56,393 --> 00:02:00,326
There's no way to access the guts of the
elements. All you can do is go through

30
00:02:00,326 --> 00:02:04,210
this API, which allows you to make
comparisons. And indeed if you look at the

31
00:02:04,210 --> 00:02:08,144
sorting routine and say the unit's
operating system, that's exactly how it's

32
00:02:08,144 --> 00:02:12,056
set up. You just patch in a function
pointer to a comparison operator. I know

33
00:02:12,056 --> 00:02:15,746
this sounds super abstract so, I think it
becomes clear once we talk about some

34
00:02:15,746 --> 00:02:19,435
examples. There's famous examples of
comparison based sort including everything

35
00:02:19,435 --> 00:02:22,797
we've discussed in the class so far.
There's also famous examples of non

36
00:02:22,797 --> 00:02:26,487
comparison based sort which we're not
gonna cover, but perhaps some of you have

37
00:02:26,487 --> 00:02:30,410
heard of or at the very least they're very
easy to look up on Wikipedia or wherever.

38
00:02:30,410 --> 00:02:35,393
So examples include the two sorting
algorithms we discussed so far, mergesort.

39
00:02:35,393 --> 00:02:40,634
The only way that mergesort interacts with
the elements in the input array is by

40
00:02:40,634 --> 00:02:45,698
comparing them and by copying them.
Similarly, the only think Quick Sort does

41
00:02:45,698 --> 00:02:50,456
with the input array elements is compare
them and swap them in place. For those of

42
00:02:50,456 --> 00:02:54,807
you that know about the heap data
structure which we'll be reviewing later

43
00:02:54,807 --> 00:02:59,506
in the class. Heap sort. Where you just,
heapify a bunch of elements, and then

44
00:02:59,506 --> 00:03:04,147
extract the minimum N times. That also
uses only comparisons. So what are some

45
00:03:04,147 --> 00:03:08,847
famous non examples? I think this will
make it even more clear what we're talking

46
00:03:08,847 --> 00:03:12,694
about. So bucket sort is one very useful
one. So, bucket sort's used most

47
00:03:12,694 --> 00:03:16,576
frequently when you have some kind of
distributional assumption on the data that

48
00:03:16,576 --> 00:03:20,266
you're sorting. Remember that's exactly
what I'm focusing on avoiding in this

49
00:03:20,266 --> 00:03:24,100
class. I'm focusing on general purpose
subroutines where you don't know anything

50
00:03:24,100 --> 00:03:28,078
about the data. If you do know stuff about
the data, bucket sorting can sometimes be

51
00:03:28,078 --> 00:03:31,720
a really useful method. For example,
suppose you can model your data as I-I-D

52
00:03:31,720 --> 00:03:35,650
samples from the uniform distribution on
zero one. So they're all rational numbers,

53
00:03:35,650 --> 00:03:39,436
bigger than zero, less than one, and you
expect them to be evenly spread through

54
00:03:39,436 --> 00:03:43,151
that interval. Then what you can do in
bucket sort is you can just. Preallocate

55
00:03:43,151 --> 00:03:46,870
end buckets where you're gonna collect
these elements. Each one is gonna have the

56
00:03:46,870 --> 00:03:50,452
same width, width one over n. The first
bucket you just do linear pass with the

57
00:03:50,452 --> 00:03:54,172
input array. Everything that's between
zero and one over n you stick in the first

58
00:03:54,172 --> 00:03:57,800
bucket. Everything in between one over n
and two over n you stick in the second

59
00:03:57,800 --> 00:04:01,565
bucket. Two over end and three over n you
sick in the third bucket and so on. So

60
00:04:01,565 --> 00:04:05,627
with the single pass. You've classified
the input elements according to which

61
00:04:05,627 --> 00:04:09,942
bucket they belong in, now because the
data is assumed to be uniform at random,

62
00:04:09,942 --> 00:04:14,423
that means you expect each of the buckets
to have a very small population, just a

63
00:04:14,423 --> 00:04:18,157
few elements in it. So remember if it.
Elements are drawing uniform from the

64
00:04:18,157 --> 00:04:21,532
interval zero one, then it's equally
likely to be in each of the N available

65
00:04:21,532 --> 00:04:25,085
buckets. And since there's N elements that
means you only expect one element per

66
00:04:25,085 --> 00:04:28,727
bucket. So that each one is gonna have a
very small population. Having bucketed the

67
00:04:28,727 --> 00:04:32,325
data, you can now just use, say, insertion
sort on each bucket independently. You're

68
00:04:32,325 --> 00:04:35,744
gonna be doing insertion sort on a tiny
number of elements, so that'll run in

69
00:04:35,744 --> 00:04:39,297
constant time, and then there's gonna be
linear number of buckets, so it's linear

70
00:04:39,297 --> 00:04:43,447
time overall. So the upshot is. If you're
willing to make really strong assumptions

71
00:04:43,447 --> 00:04:48,053
about your data like it's drawn uniformly
at random from the interval zero one then

72
00:04:48,053 --> 00:04:52,333
there's not an N log in lower bound in
fact you can allude the lower bound and

73
00:04:52,333 --> 00:04:56,622
sort them in your time. So, just to be
clear. In what sense is bucket sort not

74
00:04:56,622 --> 00:05:01,156
comparison based? In what sense does it
look at the guts of its elements and do

75
00:05:01,156 --> 00:05:05,920
something other than access them by pairs
of comparisons? Well, it actually looks at

76
00:05:05,920 --> 00:05:10,684
an element at input array and it says what
is its value, and it checks if its value

77
00:05:10,684 --> 00:05:15,390
is.17 versus.27 versus.77, and according
to what value it sees inside this element,

78
00:05:15,390 --> 00:05:20,039
it makes the decision of which bucket to
allocate it to. So, it actually stares at

79
00:05:20,039 --> 00:05:24,688
the guts of an element to decide how, what
to do next. Another non-example, which eh,

80
00:05:24,688 --> 00:05:29,678
can be quite useful is count and sort. So
this sorting algorithm is good when your

81
00:05:29,678 --> 00:05:34,657
data again we're gonna make an assumption
on the data, when their integers, and

82
00:05:34,657 --> 00:05:40,083
their small integers, so they're between
zero and K where K is say ideally at most

83
00:05:40,083 --> 00:05:44,423
linear in N. So then what you do, is you
do a single pass through the input array.

84
00:05:44,423 --> 00:05:48,260
Again, you just bucket the elements
according to what their value is. It's

85
00:05:48,260 --> 00:05:52,307
somewhere between zero and K, and it's an
integer by assumption. So you need K

86
00:05:52,307 --> 00:05:56,459
buckets. And then you do a pass, and you
sort of depopulate the buckets and copy

87
00:05:56,459 --> 00:06:00,873
them into an output array. And that gives
you a, a sorting algorithm which runs in

88
00:06:00,873 --> 00:06:04,762
time, O of N Plus K. Where K is the size
of the biggest integer. So the upshot with

89
00:06:04,762 --> 00:06:09,072
counting sort is that, if you're willing
to assume that datas are integers bounded

90
00:06:09,072 --> 00:06:12,961
above by some factor linear in N,
proportional to N, then you can sort them

91
00:06:12,961 --> 00:06:17,109
in linear time. Again county sort does not
access the rail and it's merely through

92
00:06:17,109 --> 00:06:20,921
comparisons. It actually stares at an
element, figures out what it's value is,

93
00:06:20,921 --> 00:06:24,832
and uses that value to determine what
bucket to put the element in. So in that

94
00:06:24,832 --> 00:06:28,994
sense it's not a comparison case sort and
it can under compare it's assumptions to

95
00:06:28,994 --> 00:06:32,806
beat the end log and lower it down. So a
final example is the one that would

96
00:06:32,806 --> 00:06:36,918
[inaudible] them rated sort. I think that
this is sort of an extension of counting

97
00:06:36,918 --> 00:06:40,879
sort, although you don't have to use
counting sort as the interloop you can use

98
00:06:40,879 --> 00:06:44,841
other so called stable sorts as well. It's
the stuff you can read about in many

99
00:06:44,841 --> 00:06:48,678
programming books or on the web. And up
shot at rated sort. [inaudible]. You, you

100
00:06:48,678 --> 00:06:51,975
again you assume that the date are
integers. You think of them in digit

101
00:06:51,975 --> 00:06:55,644
representation, say binary representation.
And now you just sort one bit at time,

102
00:06:55,644 --> 00:06:59,313
starting from the least significant bits
and going all the way out to the most

103
00:06:59,313 --> 00:07:03,028
significant bits. And so the upside of
rating sort, it's an extension of counting

104
00:07:03,028 --> 00:07:06,232
sort is the sense that if your data is
integers that are not too big,

105
00:07:06,232 --> 00:07:11,653
polynomially bounded in N. Then it lets
you sort in linear time. So, summarizing,

106
00:07:11,653 --> 00:07:15,667
a comparison based sort is one that can
only access the input array through this

107
00:07:15,667 --> 00:07:19,879
API, that lets you do comparisons between
two elements. You cannot access the value

108
00:07:19,879 --> 00:07:24,042
of an element, so in particular you cannot
do any kind of bucketing technique. Bucket

109
00:07:24,042 --> 00:07:27,808
sort, counting sort, and rating sort all
fundamentally are doing some kind of

110
00:07:27,808 --> 00:07:31,772
bucketing and that's why when you're
willing to make assumptions about what the

111
00:07:31,772 --> 00:07:35,439
data is and how you are permitted to
access that data, that's when you can

112
00:07:35,439 --> 00:07:39,503
bypass in all of those cases, this analog
and lower value. But if you're stuck with

113
00:07:39,503 --> 00:07:43,606
a comparison based sort, if you wanna have
something. General purpose. You're gonna

114
00:07:43,606 --> 00:07:48,479
be doing n log n comparisons in the worst
case. Let's see why. So we have to prove a

115
00:07:48,479 --> 00:07:52,691
lower band for every single comparison
based sorting method, so a fixed one. And

116
00:07:52,691 --> 00:07:57,634
let's focus on a particular input length.
Call it N. Okay, so now, let's simplify

117
00:07:57,634 --> 00:08:01,217
our lives. Now that we're focused on a
comparison based sorting method, one that

118
00:08:01,217 --> 00:08:04,801
doesn't look at the values of the array
elements just in the relative order. We

119
00:08:04,801 --> 00:08:08,112
may as well think of the array as just
containing the elements... One, two,

120
00:08:08,112 --> 00:08:11,423
three, all the way up to N, in some
jumbled order. Now, some other algorithm

121
00:08:11,423 --> 00:08:15,233
could make use of the fact that everything
is small integers. But a comparison based

122
00:08:15,233 --> 00:08:18,952
sorting method cannot. So there's no loss
in just thinking about an unsorted array

123
00:08:18,952 --> 00:08:22,348
containing the integers [inaudible] N
inclusive. Now, depsite seemingly

124
00:08:22,348 --> 00:08:27,000
restricting the space of inputs that we're
thinking about, even here, there's kind of

125
00:08:27,000 --> 00:08:31,320
a lot of different inputs we've gotta
worry about, right? So N elements can, can

126
00:08:31,320 --> 00:08:35,308
show up, and N factorial different
orderings, right? There's N choices for

127
00:08:35,308 --> 00:08:39,573
who the first element is, then N-1 choices
for the second element, M minus two

128
00:08:39,573 --> 00:08:43,837
choices for the third element, and so on.
So, there's N factorial for how these

129
00:08:43,837 --> 00:08:47,995
elements are, are arranged in the input
array. So I don't wanna prove this super

130
00:08:47,995 --> 00:08:51,866
formally, but I wanna give you, the gist,
I think, the good intuition. Now, we're

131
00:08:51,866 --> 00:08:56,178
interested in lower bounding the number of
comparisons that, this method makes in the

132
00:08:56,178 --> 00:09:00,000
worst case. So let's introduce a parameter
K, which is its worst case number of

133
00:09:00,000 --> 00:09:03,626
comparisons. That is, for every input,
each of these end factorial inputs, by

134
00:09:03,626 --> 00:09:07,664
assumption, this method makes no more than
K comparisons. The idea behind the proof

135
00:09:07,664 --> 00:09:11,200
is that, because we have N factorial
fundamentally different inputs, the

136
00:09:11,200 --> 00:09:14,884
sorting method has to execute in a
fundamentally different way on each of

137
00:09:14,884 --> 00:09:18,818
those inputs. But since the only thing
that causes a branch in the execution of

138
00:09:18,818 --> 00:09:22,404
the sorting method is the resolution of
the comparison, and we have only

139
00:09:22,404 --> 00:09:26,537
[inaudible] comparisons, it can only have
two to the K different execution paths. So

140
00:09:26,537 --> 00:09:30,570
that forces two to the K to be at least N
factorial. And a calculation then shows

141
00:09:30,570 --> 00:09:35,030
that, that forces K to be at least Omega N
log N. So let me just quickly fill in the

142
00:09:35,030 --> 00:09:38,339
details. So cross all in-factorial
possible inputs just as a thought

143
00:09:38,339 --> 00:09:42,212
experiment. We can imagine running this
method in factorial times just looking at

144
00:09:42,212 --> 00:09:45,560
the pattern of how the comparison is
resolved. Right? For each of these

145
00:09:45,560 --> 00:09:49,289
in-factorial inputs, we run it through
this sorting method, it makes comparison

146
00:09:49,289 --> 00:09:52,733
number one, then comparison number two,
then comparison number three, then

147
00:09:52,733 --> 00:09:56,510
comparison number four, then comparison
number five, and you know it gets back a

148
00:09:56,510 --> 00:10:00,384
zero, then a one, then a one, then a zero.
Give in some other input and it gets back

149
00:10:00,384 --> 00:10:04,018
a one, then a one, then a zero, then a
zero and so on. The point is, for each of

150
00:10:04,018 --> 00:10:07,796
these in-factorial inputs, it makes at
most K comparisons, we can associate that

151
00:10:07,796 --> 00:10:11,916
with a K bit string, and because it. Is
there's only K bits we're only going to

152
00:10:11,916 --> 00:10:16,239
see two to the K different K-bit strings
two to the K different ways that a

153
00:10:16,239 --> 00:10:21,114
sequence of comparisons results. Now to
finish the proof we are gonna apply

154
00:10:21,114 --> 00:10:25,323
something which I don't get to use as much
as I'd like in an evident class but it's

155
00:10:25,323 --> 00:10:30,288
always fun when it comes up, which is the
pigeon-hole principle. The [inaudible]

156
00:10:30,288 --> 00:10:34,432
principle you recall is the essentially
obvious fact that if you try to stuff K

157
00:10:34,432 --> 00:10:38,732
plus one pigeons into just K cubby holes,
one of those K cubby holes has got to get

158
00:10:38,732 --> 00:10:42,980
two of the pigeons. Okay at least one of
the cubby holes gets at least two pigeons.

159
00:10:43,260 --> 00:10:47,717
So for us what are the pigeons and what
are the holes? So our pigeons are these in

160
00:10:47,717 --> 00:10:51,956
factorial different inputs. The different
ways you can scramble the images one

161
00:10:51,956 --> 00:10:56,250
through. And, what are our holes? Those
are the two indicate different executions

162
00:10:56,250 --> 00:11:04,526
that the sorting method can possibly take
on. Now if. The number of comparisons K

163
00:11:04,526 --> 00:11:09,307
used is so small, that two to the K, the
number of distinct execution, number of

164
00:11:09,307 --> 00:11:13,966
distinct ways comparisons can resolve
themselves, is less than the number of

165
00:11:13,966 --> 00:11:19,139
different inputs that have to be correctly
sorted. Then by the pivotal principal. One

166
00:11:19,139 --> 00:11:24,328
Color [inaudible] gets two holes. That is,
two different inputs get treated in

167
00:11:24,328 --> 00:11:29,517
exactly the same way, by the sorting
method. They are asked, exactly the same k

168
00:11:29,517 --> 00:11:34,837
comparisons and the comparisons resolve
identically. [inaudible] one. Jumbling of

169
00:11:34,837 --> 00:11:39,937
one through N, then you get a 01101 then
it's a totally different jumbling of N and

170
00:11:39,937 --> 00:11:44,669
then again you get a 01101 and if this
happens the algorithm is toast, in the

171
00:11:44,669 --> 00:11:49,339
sense that it's definitely not correct,
right, cuz we've fed it two different

172
00:11:49,339 --> 00:11:53,839
inputs. And it is unable to resolve which
of the two it is. Right? So, it may be one

173
00:11:53,839 --> 00:11:57,908
premutation of one through N, or this
totally different premutation of one

174
00:11:57,908 --> 00:12:02,032
through N. The algorithm has tried to
learn about what the input is through

175
00:12:02,032 --> 00:12:06,432
these K comparisons, but it has exactly
the same data about the input in two, the

176
00:12:06,432 --> 00:12:10,886
two cases. So, if it outputs the correct
sorted version in one case, it's gonna get

177
00:12:10,886 --> 00:12:15,285
the other one wrong. So, you can't have a
common execution of a sorting algorithm

178
00:12:15,285 --> 00:12:20,052
unscramble totally different premutations.
It can't be done. So what have we learned?

179
00:12:20,052 --> 00:12:24,745
We've learned that by correctness, two to
the K is in fact at least in the

180
00:12:24,745 --> 00:12:28,982
factorial. So how does that help us? Well,
we wanna lower bound K. K is the number of

181
00:12:28,982 --> 00:12:33,135
comparisons this arbitrary storing method
is using. They wanna show that's at least

182
00:12:33,135 --> 00:12:36,737
N log N. So we, to lower bound K, we
better lower bound N factorial. So, you

183
00:12:36,737 --> 00:12:40,239
know, you could use Stirling's
Approximation or something fancy. But we

184
00:12:40,239 --> 00:12:44,342
don't need anything fancy here. We'll just
do something super crude. We'll just say,

185
00:12:44,342 --> 00:12:48,095
well, look. This is the product of N
things, right? N times N minus one time N

186
00:12:48,095 --> 00:12:51,997
minus two, blah, blah, blah, blah. And the
largest of those, the N over two largest

187
00:12:51,997 --> 00:12:56,216
of those N terms are all at least N over
two. The rest we'll just ignore. Pretty

188
00:12:56,216 --> 00:13:03,615
sloppy, but it gives us a lower bound of N
divided by two raised to the N divided by

189
00:13:03,615 --> 00:13:10,838
two. Now we'll just take log base two of
both sides, and we get the K is at least N

190
00:13:10,838 --> 00:13:17,796
over two, log base two of N over two, also
known as omega of N log N. And that my

191
00:13:17,796 --> 00:13:25,019
friends is why a heretics deterministic
sorting algorithm that's comparison based

192
00:13:25,019 --> 00:13:29,600
has gotta use N log N comparisons in the
worst case.