I've said pretty much everything I want to say about sorting at this point, but I do want to cover one more related topic, namely the selection problem. This is the problem of computing order statistics of an array, with computing the median of an array being a special case. Analogous to our coverage of quicksort, the goal is going to be the design and analysis of a super-practical randomized algorithm that solves the problem. And this time, we'll even achieve an expected running time that is linear in the length of the input array, that is, big O of n for input arrays of length n, as opposed to the O(n log n) time that we had for the expected running time of quicksort. Like quicksort, the mathematical analysis is also going to be quite elegant. In addition, these two required videos on this very practical algorithm will motivate two optional videos that are on very cool topics, but of a similar, more theoretical nature. The first optional video is going to be on how you solve the selection problem in deterministic linear time, that is, without using randomization. And the second optional video will be on a sorting lower bound, that is, why no comparison-based sort can be better than merge sort; none can have a better running time than big O of n log n.

A few words about what you should have fresh in your mind before you watch this video. I'm definitely assuming that you've watched the quicksort videos, and not just watched them, but that you have that material pretty fresh in your mind. In particular, the quicksort video about the partition subroutine: this is where you take an input array, choose a pivot, and do repeated swaps, rearranging the array so that everything less than the pivot is to the left of it and everything bigger than the pivot is to the right of it. You should remember that subroutine, and you should also remember the previous discussion about pivot choices: the idea that the quality of a pivot depends on how balanced a split into two subproblems it gives you. Those are both going to be important. For the analysis of this randomized linear-time selection algorithm, I need you to remember the concepts from probability review, part one: in particular, random variables, their expectation, and linearity of expectation.
That said, let's move on and formally define what the selection problem is. The input is the same as for the sorting problem: you're given an array of n distinct entries. But in addition, you're told what order statistic you're looking for. That's going to be a number i, an integer between 1 and n. And the goal is to output just a single number, namely the ith order statistic, that is, the ith smallest entry in this input array.

Just to be clear: if you had an array of, let's say, 4 elements, containing the numbers 10, 8, 2, and 4, and you were looking for, say, the 3rd order statistic, that would be this 8. The first order statistic is just the minimum element of the array; that's easy to find with a linear scan. The nth order statistic is just the maximum, again easy to find with a linear scan. The middle element is the median, and you should think of that as the canonical version of the selection problem.

Now, when n is odd, it's obvious what the median is: it's just the middle element, so the ((n+1)/2)th order statistic. If the array has even length, there are two possible medians, so let's just take the smaller of them; that's the (n/2)th order statistic.

You might wonder why you'd ever want to compute the median of an array rather than the mean, that is, the average. It's easy to see that you can compute the average with just a simple linear scan. As for the median, one motivation is that it's a more robust version of the mean: if you have a data entry problem and it corrupts one element of an input array, it can totally screw up the average value of the array, but it generally has very little impact on the median.

A final comment about the problem: I am going to assume that the array entries are distinct, that is, that there are no repeated elements. But just like in our discussions of sorting, this is not a big assumption. I encourage you to think about how to adapt these algorithms to work even if the arrays do have duplicates. You can, indeed, still get the same very practical, very fast algorithms with duplicate elements.

Now, if you think about it, we already have a pretty darn good algorithm that solves the selection problem. Here's the algorithm.
It's two simple steps, and it runs in O(n log n) time. Step one: sort the input array. We have various subroutines to do that; let's say we pick merge sort. Now, what is it we're trying to do? We're trying to find the ith smallest element of the input array. Well, once we've sorted it, we certainly know where the ith smallest element is: it's in the ith position of the sorted array.

So that's pretty cool. We've just done what a computer scientist would call a reduction, and that's a super-useful and super-fundamental concept. It's when you realize that you can solve one problem by reducing it to another problem that you already know how to solve. What we just showed is that the selection problem reduces easily to the sorting problem. We already know how to solve the sorting problem in n log n time, so that gives an n log n time solution to the selection problem.

But again, remember the mantra of any algorithm designer worth their salt: can we do better? We should avoid complacency. Just because we got n log n doesn't mean we should stop there; maybe we can be even faster. Now, certainly we're going to have to look at all the elements of the input array in the worst case, so you shouldn't expect to do better than linear. But hey, why not linear time?

Actually, if you think about it, we probably should have asked that question back when we were studying the sorting problem. Why were we so content with the n log n time bound for merge sort, and the O(n log n) time-on-average bound for quicksort? Well, it turns out we have a really good reason to be happy with our n log n upper bounds for the sorting problem. It turns out, and this is not obvious and will be the subject of an optional video, that you actually can't sort an input array of length n better than in n log n time, either in the worst case or on average. So in other words, if we insist on solving the selection problem via a reduction to the sorting problem, then we're stuck with this n log n time bound. Okay, strictly speaking, that's for something called comparison-based sorts; see the optional video for more details. But the upshot is that for a general-purpose algorithm, n log n is the best you can do for sorting.
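Before moving on, here's the two-step reduction written as a minimal Python sketch. The function name and the 1-indexed convention for i are illustrative choices of mine, and Python's built-in sort stands in for merge sort, since any O(n log n) sort works:

```python
def select_by_sorting(a, i):
    """Return the i-th order statistic (1-indexed) of the list a.

    Step 1: sort the input array (O(n log n); merge sort would do,
    Python's built-in sort is used here for brevity).
    Step 2: read off position i of the sorted array (index i - 1
    with 0-based indexing).
    """
    assert 1 <= i <= len(a)
    return sorted(a)[i - 1]

# Example from earlier: the 3rd order statistic of [10, 8, 2, 4] is 8.
print(select_by_sorting([10, 8, 2, 4], 3))  # -> 8
```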
So if we want to do better than n log n for selection, we have to use ingenuity beyond this reduction. We have to prove that selection is a strictly easier problem than sorting. That's the only way we're going to get an algorithm that beats n log n; that's the only way we can conceivably get a linear-time algorithm. And that is exactly what is up next on our plates. We're going to show that selection is indeed fundamentally easier than sorting: we can have a linear-time algorithm for it, even though we can't get a linear-time algorithm for sorting.

You can think of the algorithm we're going to discuss as a modification of quicksort, and in the same spirit as quicksort, it will be a randomized algorithm, and the running time will be an expected running time that holds for any input array.

Now, for the sorting problem, we know that quicksort runs in n log n time on average, where the average is over the coin flips done by the code. But we also know that, if we wanted to, we could get a sorting algorithm running in n log n time that doesn't use randomization; the merge sort algorithm is one such solution. Here, we're giving a linear-time solution for selection, for finding order statistics, that uses randomization. And it would be natural to wonder: is there an analog of merge sort? Is there an algorithm which does not use randomization and gets this exact same linear-time bound? In fact, there is. The algorithm is a little more complicated, and therefore not quite as practical as this randomized algorithm, but it's still very cool. It's a really fun algorithm to learn and to teach, so I will have an optional video about linear-time selection without randomization. For those of you who aren't going to watch that video but want to know the key idea: the idea is to choose the pivot deterministically, in a very careful way, using a trick called the median of medians. That's all I'm going to say about it now; you should watch the optional video if you want more details. I do feel compelled to warn you that if you're actually going to implement a selection algorithm, you should use the one that we discuss in this video, not the deterministic linear-time one, because the one we'll discuss in this video has both smaller constants and works in place.
So what I want to do next is develop the idea that we can modify the quicksort paradigm in order to directly solve the selection problem. To get an idea of how that works, let me review the partition subroutine. As in quicksort, this subroutine will be our workhorse for the selection algorithm. What the partition subroutine does is take as input some jumbled-up array, and solve a problem which is much more modest than sorting. In partitioning, we first choose a pivot element somehow; we'll have to discuss what a good strategy for choosing a pivot element is. But suppose that in this particular input array it chooses the first element, this 3, as the pivot. The responsibility of the partition subroutine is then to rearrange the elements of the array so that the following properties are satisfied. Anything less than the pivot is to the left of it, and it can be in jumbled order; but if you're less than the pivot, you'd better be to the left, like this 2 and 1, which are less than 3. If you're bigger than the pivot, then again you can be in jumbled order amongst those elements, but all of them have to be to the right of the pivot, and that's true for the numbers 4 through 8: they are all to the right of the pivot 3, in jumbled order. In particular, this puts the pivot in its rightful position, where it will belong in the final sorted array. And, at least for quicksort, it enabled us to recursively sort two smaller subproblems.
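As a refresher, here is one plausible rendering of the partition subroutine in Python, following the in-place swap scheme from the quicksort videos. The example array is an assumption of mine, chosen to match the description above (pivot 3 first, with 1 and 2 ending up to its left and 4 through 8 to its right):

```python
def partition(a, left, right):
    """Partition a[left..right] (inclusive) around the pivot a[left].

    Afterwards the pivot sits in its rightful sorted position:
    smaller elements are (in jumbled order) to its left, bigger
    elements (in jumbled order) to its right. Returns the pivot's
    final index. Assumes distinct elements, as in the lecture.
    """
    pivot = a[left]
    i = left + 1                    # a[left+1 .. i-1] holds elements < pivot
    for j in range(left + 1, right + 1):
        if a[j] < pivot:            # grow the "< pivot" region by one swap
            a[i], a[j] = a[j], a[i]
            i += 1
    a[left], a[i - 1] = a[i - 1], a[left]   # put the pivot in place
    return i - 1

xs = [3, 8, 2, 5, 1, 4, 7, 6]
p = partition(xs, 0, len(xs) - 1)
print(p, xs)                        # -> 2 [1, 2, 3, 5, 8, 4, 7, 6]
```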
So this is where I want you to think a little bit about how we should adapt this paradigm. Suppose I told you that the first step of our selection algorithm is going to be to choose a pivot and partition the array. Now the question is: how are we going to recurse? We need to understand how to find the ith order statistic of the original input array. It suffices to recurse on just one subproblem of smaller size and find a suitable order statistic in it. So how should we do that? Let me ask you that through a quiz, with some very concrete examples of what pivot we choose and what order statistic we're looking for, and see what you think.

So the correct answer to this quiz is the second answer. We can get away with recursing just once, and in this particular example, we're going to recurse on the right side of the array. And instead of looking for the fifth order statistic, like we would originally, we're going to recursively search for the second order statistic. So why is that? Well, first, why do we recurse on the right side of the array? By assumption, we have this array of ten elements; we choose the pivot; we do partitioning. Remember, the pivot winds up in its rightful position; that's what partitioning does. Since the pivot winds up in the third position, we know it's the third smallest element in the array. Now, that's not what we were looking for; we were looking for the fifth smallest element in the array. That, of course, is bigger than the third smallest element of the array. So, after partitioning, where is the fifth smallest element going to be? It's got to be to the right of this third smallest element, to the right of the pivot. So we know for sure that the fifth order statistic of the original array lies to the right of the pivot; that is guaranteed. So we know to recurse on the right-hand side.

Now, what are we looking for? We are no longer looking for the fifth order statistic, the fifth smallest element. Why? Well, we've thrown out both the pivot and everything smaller than it; remember, we're only recursing on the right-hand side. So we've thrown out the pivot, the third smallest element, and everything less than it, namely the minimum and the second minimum. Having deleted the three smallest elements, and originally looking for the fifth smallest, in what remains, in what we're recursing on, we're looking for the second smallest element.

So the selection algorithm, in general, is just the generalization of this idea to arbitrary arrays and to the different situations of whether the pivot comes back equal to, less than, or bigger than the element you are looking for. So let me be more precise. I'm going to call this algorithm RSelect, for randomized selection.
According to the problem definition, it takes as input, as usual, an array A of some length n, and also the order statistic that we're looking for, which we'll call i; of course, we assume that i is an integer between 1 and n, inclusive. For the base case, if the array has size one, then the only element we could be looking for is the first order statistic, and we just return the sole element of the array.

Now we have to partition the array around a pivot element. And just like in quicksort, we're going to be very lazy about choosing the pivot: we're going to choose it uniformly at random from the n possibilities and hope things work out. That will be the crux of the analysis: proving that random pivots are good enough sufficiently often. Having chosen the pivot p, we now just invoke the standard partitioning subroutine. As usual, that's going to give us the partitioned array: you'll have the pivot element, with everything less than the pivot to the left and everything bigger to the right. As usual, I'll call everything to the left the first part of the partitioned array, and everything bigger the second part.

Now we have a couple of cases, depending on whether the pivot is bigger or less than the element we're looking for, so I need a little notation to talk about that. Let's let j be the order statistic that the pivot p is. So if p winds up being the third smallest element, like in the quiz, then j is going to be equal to three. Equivalently, we can think of j as the position of the pivot in the partitioned version of the array.

Now, there's one case which is very unlikely to occur, but we should include it just for completeness. If we're really lucky, then a random pivot just happens to be the order statistic we were looking for; that's when i equals j. We're looking for the ith smallest element, and if by dumb luck the pivot winds up being the ith smallest element, we're done. We can just return it; we don't have to recurse.

Now, in general, of course, we don't randomly choose the very element we're looking for; we choose something that could be bigger or could be smaller than it. In the quiz, we chose a pivot that was smaller than what we were looking for.
Actually, that's the harder case, so let's first start with the case where the pivot winds up being bigger than the element we're looking for. That means that j is bigger than i: we're looking for the ith smallest, and we randomly chose the jth smallest, for j bigger than i. This is the opposite case from the quiz. This is where we know that what we're looking for has to be to the left of the pivot: the pivot is the jth smallest, everything less than it is to the left, and we're looking for the ith smallest with i less than j, so that's got to be on the left. That's where we recurse. Moreover, it's clear that we're looking for exactly the same order statistic: if we're looking for the third smallest element, we're only throwing out stuff which is bigger than something even bigger than the third smallest element, so we're still looking for the third smallest of what remains. And naturally, the new array size is j minus 1, because that's what's to the left of the pivot.

Finally, the last case is when the random element that we choose is less than what we're looking for, and then we're just like the quiz. Namely, what we're looking for is bigger than the pivot, so it's got to be on the right-hand side, and we know we've got to recurse on the right-hand side. Now, the right-hand side has n minus j elements: we throw out everything up to and including the pivot, so we throw out j things, and there are n minus j left. All of those j things we threw out are less than what we're looking for. So if we used to be looking for the ith smallest element, now we're looking for the (i minus j)th smallest element.

So that is the whole algorithm. That is how we take the approach we used for the sorting problem in quicksort and adapt it to the problem of selection.
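Putting the pieces together, here is a sketch of RSelect in Python. To keep the three cases easy to see, it builds the two parts with list comprehensions rather than partitioning in place; a tuned implementation would reuse the in-place partition subroutine above and pass index ranges instead. It also assumes distinct elements, per the problem setup:

```python
import random

def r_select(a, i):
    """Return the i-th order statistic (1-indexed) of the list a."""
    n = len(a)
    if n == 1:                          # base case: only the first order
        return a[0]                     # statistic can be requested
    pivot = random.choice(a)            # pivot chosen uniformly at random
    first = [x for x in a if x < pivot]     # "first part": less than pivot
    second = [x for x in a if x > pivot]    # "second part": bigger than pivot
    j = len(first) + 1                  # the pivot is the j-th smallest
    if i == j:                          # dumb luck: the pivot is the answer
        return pivot
    elif i < j:                         # answer lies in the first part,
        return r_select(first, i)       # still the i-th smallest there
    else:                               # answer lies in the second part; the
        return r_select(second, i - j)  # j discarded elements were all smaller

print(r_select([10, 8, 2, 4], 3))       # -> 8
```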
So, is this algorithm any good? Let's start studying its properties and understand how well it works. Let's begin with correctness. The claim is that no matter how the algorithm's coin flips come up, no matter what random pivots we choose, the algorithm is correct, in the sense that it's guaranteed to output the ith order statistic. The proof is by induction and proceeds very similarly to the one for quicksort, so I'm not going to give it here. If you're curious about how these proofs go, there's an optional video about the correctness of quicksort; if you watch and understand that, it should be clear how to adapt that inductive argument to apply to this selection algorithm as well.

As usual for divide-and-conquer algorithms, the interesting part is not so much understanding why the algorithm works, but rather understanding how fast it runs. So the big question is: what is the running time of this selection algorithm? To understand this, we have to understand the ramifications of pivot choices on the running time. You've seen the quicksort videos, and they're fresh in your mind, so what should be clear is that, just like in quicksort, how fast this algorithm runs is going to depend on how good the pivots are, and a good pivot means a pivot that guarantees a balanced split. The next quiz will make sure that you understand this point; it asks you to think about just how bad the running time of the selection algorithm could be if you get extremely unlucky in your pivot choices.

The correct answer to this question is exactly the same as the answer for quicksort: the worst-case running time, if the pivots are chosen in a really unlucky way, is actually quadratic in the array length. Remember, we're shooting for linear time, so this quadratic bound is a total disaster. How could this happen? Well, suppose you're looking for the median, and suppose you choose the minimum element as the pivot every single time. If this is what happens, then just like in quicksort, every time you recurse, all you succeed in doing is peeling off a single element from the input array. Now, you're not going to find the median element until you've made roughly n over 2 recursive calls, each on an array that has size at least a constant fraction of the original one. So it's a linear number of recursive calls, each on an array of size at least some constant times n, and that gives you a total running time of quadratic overall.
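To spell out the arithmetic behind that quadratic bound: writing c for the constant hidden in the linear-time partitioning work (my notation, not the lecture's), the first n/2 recursive calls each operate on an array of size at least n/2, so

```latex
T(n) \;\ge\; \underbrace{\frac{n}{2}}_{\#\text{ of calls}} \cdot \underbrace{c\,\frac{n}{2}}_{\text{cost per call}}
\;=\; \frac{c\,n^{2}}{4} \;=\; \Omega(n^{2})
```

and since there are at most n calls in total, each doing at most c times n work, the total is also O(n squared); so this unlucky scenario really does cost quadratic time.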
350 00:17:23,780 --> 00:17:27,250 Frankly, your computer is more likely to be struck by a meteor than it is for 351 00:17:27,250 --> 00:17:30,530 the pivot to be chosen as the minimum element in every recursive call. 352 00:17:30,530 --> 00:17:34,260 But if you really have an absolutely worst case choice of pivots, 353 00:17:34,260 --> 00:17:35,910 it would give this quadratic run time down. 354 00:17:37,740 --> 00:17:40,780 So the upshot then is that the running time of this randomized selection 355 00:17:40,780 --> 00:17:43,280 algorithm depends on how good our pivots are. 356 00:17:43,280 --> 00:17:47,460 And for a worse case chose of pivots the running time can be as large as m squared. 357 00:17:47,460 --> 00:17:49,970 Now hopefully most of the time we're going to have much better pivots. 358 00:17:49,970 --> 00:17:52,770 So the analysis receives by making that idea precise. 359 00:17:54,190 --> 00:17:58,130 So the key to a fast running time is going to be the usual property that we want to 360 00:17:58,130 --> 00:18:02,768 see in the divide and conquer algorithms, namely every time that recurse the problem 361 00:18:02,768 --> 00:18:07,140 size better not just be smaller but it better be smaller by a significant factor. 362 00:18:07,140 --> 00:18:09,960 How would that happen in the selection approach based on the partition 363 00:18:09,960 --> 00:18:10,612 subroutine? 364 00:18:10,612 --> 00:18:13,780 Well if both of the sub-problems are not too big, 365 00:18:13,780 --> 00:18:17,660 then we're guaranteed that when we recurse we make a lot of progress. 366 00:18:17,660 --> 00:18:20,890 So let's think about what the best possible pivot would be 367 00:18:20,890 --> 00:18:24,440 in the sense of giving a "balanced" split, right, so of course in some sense the best 368 00:18:24,440 --> 00:18:27,745 pivot is you just choose your statistic group you're looking for. 369 00:18:27,745 --> 00:18:29,525 Then you're done in constant time. 370 00:18:29,525 --> 00:18:32,005 But that's extremely unlikely, and it's not worth worrying about. 371 00:18:32,005 --> 00:18:33,965 So ignore the fact that we might guess the pivot. 372 00:18:33,965 --> 00:18:36,645 What's the best pivot if we want to guarantee an aggressive 373 00:18:36,645 --> 00:18:39,815 decrease in the input side for the next iteration. 374 00:18:39,815 --> 00:18:44,550 Well, the best pivot is the one that gives as most balanced split as possible. 375 00:18:44,550 --> 00:18:47,440 So what's the pivot that gives us the most balanced split? 376 00:18:47,440 --> 00:18:49,060 A 50/50 split. If you think 377 00:18:49,060 --> 00:18:50,920 about it it's exactly the median. 378 00:18:50,920 --> 00:18:53,980 Of course, this is not super-helpful, 379 00:18:53,980 --> 00:18:56,760 because the median might well be what we're looking for in the first place. 380 00:18:56,760 --> 00:18:59,690 So this is sort of a circular idea. 381 00:18:59,690 --> 00:19:02,530 But for intuition, it's still worth exploring what kind of running time we 382 00:19:02,530 --> 00:19:04,040 would get in the best case, right? 383 00:19:04,040 --> 00:19:06,840 If we're not going to get linear time even in this magical best case, 384 00:19:06,840 --> 00:19:11,160 we certainly wouldn't expect to get it on average over random choices of the pivots. 385 00:19:11,160 --> 00:19:14,830 So what would happen if we actually did luckily choose the median as the pivot 386 00:19:14,830 --> 00:19:16,390 every single time? 
Well, we get the following recurrence, where T(n) denotes the running time the algorithm requires on arrays of length n. There's only going to be one recursive call; this is the big difference from quicksort, where we had to recurse on both sides and had two recursive calls. Here, we only have one recursive call. In the magical case where our pivot is always equal to the median, both subproblem sizes are only half as large as the original one, so when we recurse, it's on a problem of size guaranteed to be at most n over 2. And outside the recursive call, pretty much all we do is one partitioning invocation, and we know that that takes linear time. So the recurrence we get is: T(n) is at most T(n/2) plus big O of n. This is totally ready to get plugged into the master method; it winds up being case two of the master method, and indeed we get exactly what we wanted, linear time.

To reiterate, this is not interesting in its own right; this is just for intuition. It is a sanity check that, at least for a best-case choice of pivots, we'd get what we want, a linear-time algorithm, and we do.
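For concreteness, here is that best-case calculation written out, again with c standing for the constant in the linear partitioning work (my notation). Unrolling the recurrence gives a geometric series:

```latex
T(n) \;\le\; T\!\left(\frac{n}{2}\right) + c\,n
\;\le\; c\,n\left(1 + \frac{1}{2} + \frac{1}{4} + \cdots\right)
\;\le\; 2\,c\,n \;=\; O(n).
```

In master-method terms: one recursive call (a = 1), input size halved (b = 2), linear work outside the call (d = 1); since a is less than b to the d, we're in the case that yields O(n to the d), which is O(n).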
Now, the question is: how well do we do with random pivots? The intuition, the hope, is exactly as it was for quicksort: that random pivots are a perfectly good surrogate for the median, for the perfect pivot. Having the analysis of quicksort under our belt, where random pivots do indeed closely approximate the performance you get with best-case pivots, maybe now we have reason to believe that this is true. That said, as a mathematical statement, it is totally not obvious, and it's going to take a proof; that's the subject of the next video. Let me just be clear about exactly what we're claiming. Here is the running-time guarantee that RSelect provides: for an arbitrary input array of length n, the expected running time of this randomized selection algorithm is linear, big O of n.

Let me reiterate a couple of points I made for the analogous guarantee for the quicksort algorithm. The first is that we're making no assumptions about the data whatsoever; in particular, we're not assuming that the data is random. This guarantee holds no matter what input array you feed into this randomized algorithm. In that sense, this is a totally general-purpose subroutine. So where, then, does this averaging come from? Where does the expectation come from? The randomness is not in the data; rather, the randomness is in the code, and we put it there ourselves. Now let's proceed to the analysis.