I've said pretty much everything I want to say about sorting at this point, but I do want to cover one more related topic, namely the selection problem. This is the problem of computing order statistics of an array, with computing the median of an array being a special case. Analogous to our coverage of quicksort, the goal is going to be the design and analysis of a super-practical randomized algorithm that solves the problem. And this time we'll even achieve an expected running time that is linear in the length of the input array, that is, O(n) for an input array of length n, as opposed to the O(n log n) time that we had for the expected running time of quicksort. Like quicksort, the analysis is also going to be quite elegant.

In addition, these two required videos on this very practical algorithm will motivate two optional videos on very cool topics of a similar, more theoretical nature. The first optional video is going to be on how you solve the selection problem in deterministic linear time, that is, without using randomization. The second optional video will be a sorting lower bound: that is, why no comparison-based sort can have a better running time than merge sort's O(n log n).

A few words about what you should have fresh in your mind before you watch this video. I'm definitely assuming that you've watched the quicksort videos, and not just watched them, but that you have that material pretty fresh in your mind. In particular, remember the quicksort video about the partition subroutine: this is where you take an input array, choose a pivot, and, by repeated swaps, rearrange the array so that everything less than the pivot is to the left of it and everything bigger than the pivot is to the right of it. You should remember that subroutine. You should also remember the previous discussion about pivot choices, the idea that the quality of a pivot depends on how balanced a split into two subproblems it gives you. Those are both going to be important. For the analysis of this randomized linear-time selection algorithm, I also need you to remember the concepts from probability review part one, in particular random variables, their expectation, and linearity of expectation.

That said, let's move on and formally define what the selection problem is. The input is the same as for the sorting problem: you're given an array of n distinct entries.
But in addition, you're told what order statistic you're looking for. That's going to be a number i, an integer between 1 and n. The goal is to output just a single number, namely the ith order statistic, that is, the ith smallest entry in the input array. Just to be clear: if we have an array of, say, four elements, containing the numbers 10, 8, 2, and 4, and you're looking for, say, the third order statistic, that would be the 8. The first order statistic is just the minimum element of the array; that's easy to find with a linear scan. The nth order statistic is just the maximum; again, easy to find with a linear scan. The middle element is the median, and you should think of that as the canonical version of the selection problem. Now, when n is odd, it's obvious what the median is: it's just the middle element, the ((n+1)/2)th order statistic. If the array has even length, there are two possible medians, so let's just take the smaller of them; that's the (n/2)th order statistic.

You might wonder why you'd ever want to compute the median of an array rather than the mean, that is, the average. It's easy to see that you can compute the average with a simple linear scan, while the median you cannot. One motivation is that the median is a more robust version of the mean: if a data entry problem corrupts one element of an input array, it can totally screw up the average value of the array, but it generally has very little impact on the median.

A final comment about the problem: I am going to assume that the array entries are distinct, that is, there are no repeated elements. But just like in our discussions of sorting, this is not a big assumption. I encourage you to think about how to adapt these algorithms to work even if the arrays do have duplicates; you can indeed still get the same very practical, very fast algorithms with duplicate elements.

Now, if you think about it, we already have a pretty darn good algorithm that solves the selection problem. Here it is: two simple steps, and it runs in O(n log n) time. Step one, sort the input array; we have various subroutines to do that, so let's say we pick merge sort. Now, what is it we're trying to do? We're trying to find the ith smallest element of the input array. Well, once we've sorted it, we certainly know where the ith smallest element is: it's in the ith position of the sorted array. So that's pretty cool.
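In code, this two-step approach is trivial. Here's a minimal Python sketch; the function names are mine, and Python's built-in sort stands in for merge sort (any O(n log n) sort works):

```python
def select_by_sorting(a, i):
    # The O(n log n) baseline: reduce selection to sorting, then read off
    # the i-th smallest entry. Order statistics are 1-indexed, as in lecture.
    return sorted(a)[i - 1]

def median(a):
    # The lecture's convention in one formula: the ((n+1)//2)-th order
    # statistic equals (n+1)/2 for odd n, and n/2 (the smaller of the two
    # candidate medians) for even n.
    return select_by_sorting(a, (len(a) + 1) // 2)
```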
We've just done what a computer scientist would call a reduction, and that's a super useful and super fundamental concept. It's when you realize that you can solve one problem by reducing it to another problem that you already know how to solve. What we just showed is that the selection problem reduces easily to the sorting problem: we already know how to solve the sorting problem in O(n log n) time, so that gives us an O(n log n)-time solution to the selection problem.

But again, remember the mantra of any algorithm designer worth their salt: can we do better? We should avoid contentment. Just because we got O(n log n) doesn't mean we can stop there; maybe we can be even faster. Certainly we're going to have to look at all the elements of the input array in the worst case, so we shouldn't expect to do better than linear. But hey, why not linear time?

Actually, if you think about it, we probably should have asked that question back when we were studying the sorting problem. Why were we so content with the O(n log n) time bound for merge sort, and the O(n log n) time-on-average bound for quicksort? Well, it turns out we have a really good reason to be happy with our O(n log n) upper bounds for the sorting problem. It turns out, and this is not obvious and will be the subject of an optional video, that you actually can't sort an input array of length n better than O(n log n) time, either in the worst case or on average. So, in other words, if we insist on solving the selection problem via a reduction to the sorting problem, then we're stuck with this O(n log n) time bound. Strictly speaking, that's for something called comparison sorts; see the optional video for more details.

But the upshot is that if we want a general-purpose algorithm and we want to do better than O(n log n) for selection, we have to do it using ingenuity beyond this reduction. We have to prove that selection is a strictly easier problem than sorting. That's the only way we're going to get an algorithm that beats O(n log n); it's the only way we can conceivably get a linear-time algorithm. And that is exactly what is up next on our plates. We're going to show that selection is indeed fundamentally easier than sorting: we can have a linear-time algorithm for it, even though we can't have a linear-time algorithm for sorting.
You can think of the algorithm we're going to discuss as a modification of quicksort, and in the same spirit as quicksort, it will be a randomized algorithm whose expected running time holds for any input array. Now, for the sorting problem, we know that quicksort runs in O(n log n) time on average, where the average is over the coin flips done by the code. But we also know that, if we wanted to, we could get an O(n log n)-time sorting algorithm that doesn't use randomization; merge sort is one such solution. So, given that we're presenting a linear-time solution for selection, for finding order statistics, that uses randomization, it's natural to wonder: is there an analog of merge sort? Is there an algorithm which does not use randomization and achieves this exact same linear running time?

In fact, there is. That algorithm is a little more complicated, and therefore not quite as practical as this randomized algorithm, but it's still very cool; it's a really fun algorithm to learn and to teach. So I will have an optional video about linear-time selection without randomization. For those of you who aren't going to watch that video but want to know the key idea: it's to choose the pivots deterministically, in a very careful way, using a trick called the median of medians. That's all I'm going to say about it now; watch the optional video if you want more details. I do feel compelled to warn you that if you're actually going to implement a selection algorithm, you should use the one we discuss in this video and not the deterministic one, because the one we'll discuss here has both smaller constants and works in place.

So, what I want to do next is develop the idea that we can modify the quicksort paradigm to directly solve the selection problem. To get an idea of how that works, let me review the partition subroutine; like in quicksort, this subroutine will be our workhorse for the selection algorithm. What the partition subroutine does is take as input some jumbled-up array and solve a problem which is much more modest than sorting. In partitioning, we first choose a pivot element somehow; we'll have to discuss what a good strategy for choosing a pivot element is. But suppose that, in this particular input array, it chooses the first element, this 3, as the pivot element.
The responsibility of the partition subroutine, then, is to rearrange the elements in this array so that the following properties are satisfied. Anything less than the pivot is to the left of it; those elements can be in jumbled order, but if you're less than the pivot, you'd better be to the left, like the 2 and the 1, which are less than the 3. If you're bigger than the pivot, then again you can be in jumbled order amongst those elements, but all of them have to be to the right of the pivot; that's true for the numbers 4 through 8, which all lie to the right of the pivot 3, in jumbled order. In particular, this puts the pivot in its rightful position, where it belongs in the final sorted array, and, at least for quicksort, it enabled us to recursively sort two smaller subproblems.

So this is where I want you to think a little bit about how we should adapt this paradigm. Suppose I told you the first step of our selection algorithm is going to be to choose a pivot and partition the array. Now the question is: how are we going to recurse? We need to understand how to find the ith order statistic of the original input array; it suffices to recurse on just one subproblem of smaller size and find a suitable order statistic in it. So how should we do that? Let me ask you that with a very concrete example of what pivot we chose and what order statistic we're looking for, and see what you think.

So, the correct answer to this quiz is the second answer: we can get away with recursing just once, and in this particular example we're going to recurse on the right side of the array. And instead of looking for the fifth order statistic, like we were originally, we're going to recursively search for the second order statistic. Why is that? Well, first, why do we recurse on the right side of the array? By assumption, we have an array with ten elements. We choose the pivot and we do partitioning. Remember, the pivot winds up in its rightful position; that's what partitioning does. So if the pivot winds up in the third position, we know it's the third smallest element in the array. Now, that's not what we were looking for; we were looking for the fifth smallest element in the array, which of course is bigger than the third smallest element. So after partitioning, where is the fifth smallest element going to be? It has to be to the right of this third smallest element, to the right of the pivot. So we know for sure
that the fifth order statistic of the original array lies to the right of the pivot. That is guaranteed, so we know to recurse on the right-hand side. Now, what are we looking for? We are no longer looking for the fifth order statistic, the fifth smallest element. Why? Well, we've thrown out both the pivot and everything smaller than it; remember, we're only recursing on the right-hand side. So we've thrown out the pivot, the third smallest element, and everything less than it, namely the minimum and the second minimum. Having deleted the three smallest elements, and originally looking for the fifth smallest, in what remains, in what we're recursing on, we're looking for the second smallest element.

The selection algorithm, in general, is just the generalization of this idea to arbitrary arrays and to arbitrary situations of whether the pivot comes back equal to, less than, or bigger than the element you're looking for. So let me be more precise. I'm going to call this algorithm RSelect, for randomized selection. According to the problem definition, it takes as input, as usual, an array A of some length n, but also the order statistic i that we're looking for; of course, we assume that i is some integer between 1 and n, inclusive. For the base case, if the array has size one, then the only element we could be looking for is the first order statistic, and we just return the sole element of the array. Otherwise, we partition the array around a pivot element. Just like in quicksort, we're going to be very lazy about choosing the pivot: we'll choose it uniformly at random from the n possibilities and hope things work out. That will be the crux of the analysis, proving that random pivots are good enough sufficiently often. Having chosen the pivot, we just invoke the standard partitioning subroutine. As usual, that gives us the partitioned array: the pivot element, with everything less than the pivot to its left and everything bigger than the pivot to its right. I'll call everything to the left of the pivot the first part of the partitioned array, and everything bigger the second part. Now we have a couple of cases, depending on whether the pivot is bigger or smaller than the element we're looking for, so I need some notation to talk about that. Let's let j be the order statistic that the pivot p is.
So if p winds up being the third smallest element, like in the quiz, then j is going to be equal to 3. Equivalently, we can think of j as the position of the pivot in the partitioned version of the array.

Now, there's one case which is very unlikely to occur, but we should include it just for completeness. If we're really lucky, our random pivot just happens to be the order statistic we were looking for; that's when i equals j. We're looking for the ith smallest element, and if by dumb luck the pivot winds up being the ith smallest element, we're done: we just return it, and we don't have to recurse. In general, of course, we don't randomly choose the element we're looking for; we choose something that could be bigger or smaller than it. In the quiz, we chose a pivot that was smaller than what we're looking for; actually, that's the harder case.

So let's first handle the case where the pivot winds up being bigger than the element we're looking for, meaning j is bigger than i: we're looking for the ith smallest, and we randomly chose the jth smallest, with j bigger than i. This is the opposite case from the quiz. Here we know what we're looking for has to be to the left of the pivot: the pivot is the jth smallest, everything less than it is to its left, and we're looking for the ith smallest with i less than j, so that has to be on the left. That's where we recurse. Moreover, we're looking for exactly the same order statistic: if we're looking for the third smallest element, we're only throwing out things that are bigger than the third smallest element, so we're still looking for the third smallest of what remains. And naturally, the new array size is j - 1, because that's what lies to the left of the pivot.

The final case is when the random pivot we chose is less than what we're looking for, and then we're just like the quiz. Namely, what we're looking for is bigger than the pivot, so it has to be on the right-hand side, and we recurse on the right-hand side. The right-hand side has n - j elements: we threw out everything up to and including the pivot, that's j things, leaving n - j. And all of the j things we threw out are less than what we're looking for, so what used to be the ith smallest element is now the (i - j)th smallest element of what remains. So that is the whole algorithm.
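To collect the pieces in one place, here is a minimal Python sketch of RSelect as just described. The names rselect and partition are mine, not canonical; it assumes distinct entries and 1-indexed order statistics, and it rearranges the array in place:

```python
import random

def partition(a, l, r):
    # Partition a[l..r] around the pivot a[l] by repeated swaps, as in the
    # quicksort video: afterwards everything smaller than the pivot is to
    # its left, everything bigger to its right. Returns the pivot's index.
    p = a[l]
    i = l + 1
    for j in range(l + 1, r + 1):
        if a[j] < p:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[l], a[i - 1] = a[i - 1], a[l]
    return i - 1

def rselect(a, i, l=None, r=None):
    # Return the i-th smallest element (1-indexed) of a[l..r].
    if l is None:
        l, r = 0, len(a) - 1
    if l == r:                    # base case: size one, so i must be 1
        return a[l]
    k = random.randint(l, r)      # pivot chosen uniformly at random
    a[l], a[k] = a[k], a[l]       # move it to the front, then partition
    pos = partition(a, l, r)
    j = pos - l + 1               # j = the order statistic the pivot is
    if j == i:                    # lucky case: the pivot is the answer
        return a[pos]
    elif j > i:                   # answer is in the first part; same i,
        return rselect(a, i, l, pos - 1)       # new size j - 1
    else:                         # answer is in the second part; n - j
        return rselect(a, i - j, pos + 1, r)   # elements, target i - j
```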
That is how we adapt the approach we took toward the sorting problem in quicksort to the problem of selection. So, is this algorithm any good? Let's start studying its properties and understand how well it works.

Let's begin with correctness. The claim is that no matter how the algorithm's coin flips come up, no matter what random pivots we choose, the algorithm is correct, in the sense that it's guaranteed to output the ith order statistic. The proof is by induction; it proceeds very similarly to the one for quicksort, so I'm not going to give it here. If you're curious about how these proofs go, there's an optional video about the correctness of quicksort; if you watch and understand that, it should be clear how to adapt the inductive argument to this selection algorithm as well.

As usual for divide-and-conquer algorithms, the interesting part is not so much understanding why the algorithm works, but rather understanding how fast it runs. So the big question is: what is the running time of this selection algorithm? To understand this, we have to understand the ramifications of pivot choices on the running time. You've seen the quicksort videos and they're fresh in your mind, so what should be clear is that, just like in quicksort, how fast this algorithm runs is going to depend on how good the pivots are, and a good pivot means a pivot that guarantees a balanced split. In the next quiz, we'll make sure that you understand this point, and ask you to think about just how bad this selection algorithm could be if you get extremely unlucky in your pivot choices.

So, the correct answer to this question is exactly the same as the answer for quicksort: the worst-case running time, if the pivots are chosen in a really unlucky way, is actually quadratic in the array length. Remember, we're shooting for linear time, so this quadratic bound is a total disaster. How could this happen? Well, suppose you're looking for the median, and suppose you choose the minimum element as the pivot every single time.
So if this is what happens, if every time you choose the pivot to be the minimum, then, just like in quicksort, every time you recurse, all you succeed in doing is peeling off a single element from the input array. You're not going to find the median element until you've done roughly n/2 recursive calls, each on an array that has size at least a constant fraction of the original one. So that's a linear number of recursive calls, each on an array of size at least some constant times n, which gives a total running time that is quadratic overall. Of course, this is an absurdly unlikely event; frankly, your computer is more likely to be struck by a meteor than it is for the pivot to be chosen as the minimum element in every recursive call. But if you really had an absolutely worst-case choice of pivots, it would give this quadratic running time bound.

So the upshot is that the running time of this randomized selection algorithm depends on how good our pivots are, and for a worst-case choice of pivots, the running time can be as large as n squared. Hopefully, most of the time we'll have much better pivots, and the analysis proceeds by making that idea precise. The key to a fast running time is the usual property that we want to see in divide-and-conquer algorithms: every time we recurse, the problem size had better be not just smaller, but smaller by a significant factor. How would that happen in this selection approach based on the partition subroutine? Well, if both of the subproblems are not too big, then we're guaranteed that when we recurse, we make a lot of progress.

So let's think about what the best possible pivot would be, in the sense of giving a balanced split. Of course, in some sense the best pivot is the very order statistic you're looking for, since then you're done in constant time, but that's extremely unlikely and not worth worrying about; ignore the possibility that we guess the pivot exactly. What's the best pivot if we want to guarantee an aggressive decrease in the input size before the next recursive call? It's the one that gives us as balanced a split as possible. And what's the pivot that gives us the most balanced split, a 50-50 split? If you think about it, it's exactly the median. Of course, this is not super helpful, because the median might well be what we're looking for in the first place.
So this is, sort of, a circular idea. But for intuition, it's still worth exploring what kind of running time we would get in this best case; if we're not going to get linear time even in this magical best case, we certainly wouldn't expect to get it on average over random choices of the pivots. So what would happen if we did luckily choose the median as the pivot every single time? Well, we get a recurrence for the running time T(n) that the algorithm requires on an array of length n. There's only going to be one recursive call; this is the big difference from quicksort, where we had to recurse on both sides and therefore had two recursive calls. Here we have only one recursive call, and in the magical case where our pivots are always equal to the median, both subproblems are only half as large as the original one, so when we recurse, the problem size is guaranteed to be at most n/2. Outside the recursive call, pretty much all we do is one invocation of partition, and we know that's linear time. So the recurrence we get is T(n) ≤ T(n/2) + O(n). This is totally ready to get plugged into the master method; it winds up being case two, and indeed we get exactly what we wanted: linear time. To reiterate, this is not interesting in its own right; it's just for intuition, a sanity check that at least for a best-case choice of pivots we get what we want, a linear-time algorithm. And we do.

Now the question is: how well do we do with random pivots? The intuition, the hope, is exactly as it was for quicksort, namely that random pivots are perfectly good surrogates for the perfect pivot, the median. And with the analysis of quicksort under our belts, where random pivots did approximate very closely the performance of best-case pivots, maybe now we have reason to believe this is true here as well. That said, as a mathematical statement this is totally not obvious, and it's going to take a proof; that's the subject of the next video. But let me be clear about exactly what we're claiming. Here is the running time guarantee that randomized selection provides: for an arbitrary input array of length n, the average running time of this randomized selection algorithm is linear, O(n).
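To spell out the master-method step above (a sanity check; this uses the course's parameterization T(n) ≤ a·T(n/b) + O(n^d) and its case numbering):

$$T(n) \;\le\; 1 \cdot T\!\left(\frac{n}{2}\right) + O\!\left(n^{1}\right), \qquad a = 1,\; b = 2,\; d = 1 \;\Longrightarrow\; a < b^{d} \;\Longrightarrow\; T(n) = O\!\left(n^{d}\right) = O(n).$$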
Let me reiterate a couple of points I made for the analogous guarantee for the quicksort algorithm. The first is that we're making no assumptions about the data whatsoever; in particular, we're not assuming that the data is random. This guarantee holds no matter what input array you feed into this randomized algorithm, and in that sense it's a totally general-purpose subroutine. So where, then, does the averaging come from? Where does the expectation come from? The randomness is not in the data; rather, the randomness is in the code, and we put it there ourselves. Now let's proceed to the analysis.
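(As a final aside before the analysis: here's a quick end-to-end check of the hypothetical sketches above, cross-checking rselect against the sorting-based baseline.)

```python
import random

xs = random.sample(range(10**6), 101)   # distinct entries, as assumed
i = (len(xs) + 1) // 2                  # the median order statistic
assert rselect(xs[:], i) == select_by_sorting(xs, i)
```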