Okay, so let's move on and actually discuss the pseudo-code for the merge sort algorithm. First, let me just tell you the pseudo-code, leaving aside exactly how the merging subroutine is implemented. At a high level it should be very simple and clear at this point: there are gonna be two recursive calls, and then there's gonna be a merging step.

Now, I owe you a few comments, 'cause I'm being a little sloppy. Again, as I promised, this isn't something you would directly translate into code, although it's pretty close. So what are a couple of the ways in which I'm being sloppy? Well, first of all, in any recursive algorithm you've gotta have some base cases, the idea that when the input is sufficiently small you don't do any recursion, you just return some trivial answer. So in the sorting problem, the base case would be: if you're handed an array that has either zero or one elements, well, it's already sorted, there's nothing to do, so you just return it without any recursion. Okay, so to be clear, I haven't written down the base cases, although of course you would if you were actually implementing merge sort, so just make a note of that.

A couple of other things I'm ignoring. I'm ignoring what to do if the array has odd length; so if it has, say, nine elements, obviously you have to somehow break that into five and four, or four and five, and you would just do that either way, and that would be fine. And secondly, I'm ignoring the details of what it really means to recursively sort; so, for example, I'm not discussing exactly how you would pass these subarrays on to the recursive calls. That's something that would depend somewhat on the programming language, and that's exactly what I want to avoid. I really want to talk about the concepts, which transcend any particular programming language implementation. So that's why I'm going to describe algorithms at this level, okay?

Alright, so the hard part, relatively speaking, is: how do you implement the merge step? The recursive calls have done their work; we have these two sorted halves of the numbers, the left half and the right half. How do we combine them into one? In English, I already told you on the last slide: the idea is you just populate the output array in sorted order by traversing through the two sorted subarrays in parallel. So let's look at that in some more detail.
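For reference, here is a minimal Python sketch of the high-level recursion just described. The function name merge_sort, the example input, and the use of the standard-library heapq.merge as a temporary stand-in for the merge step are my own illustrative choices, not part of the lecture's pseudo-code; a hand-rolled merge is sketched alongside the merge-step discussion below.

```python
import heapq

def merge_sort(a):
    """Sort a list of numbers; a minimal sketch of the recursion described above."""
    # Base case the lecture alludes to: arrays of 0 or 1 elements are already sorted.
    if len(a) <= 1:
        return a
    mid = len(a) // 2            # odd lengths split as, say, 4 and 5; either way is fine
    left = merge_sort(a[:mid])   # first recursive call, on the left half
    right = merge_sort(a[mid:])  # second recursive call, on the right half
    # Stand-in for the merge step (hand-rolled version sketched later).
    return list(heapq.merge(left, right))

if __name__ == "__main__":
    print(merge_sort([5, 4, 1, 8, 7, 2, 6, 3]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```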
Okay, so here is the pseudo-code for the merge step. So let me begin by introducing some names for the characters in what we're about to discuss. Let's use C to denote the output array; this is what we're supposed to spit out, with the numbers in sorted order. And then I'm gonna use A and B to denote the results of the two recursive calls, okay? So the first recursive call has given us array A, which contains the left half of the input array in sorted order; similarly, B contains the right half of the input array, again in sorted order.

So, as I said, we're gonna need to traverse the two sorted subarrays, A and B, in parallel. So I'm gonna introduce a counter i to traverse through A, and j to traverse through B. i and j will both be initialized to one, to be at the beginning of their respective arrays. And now we're gonna do a single pass through the output array, populating it in increasing order, always taking the smallest element remaining in the union of the two sorted subarrays.

If there's one idea in this merge step, it's just the realization that the minimum element you haven't yet looked at in A and B has to be at the front of one of the two lists. Right? So, for example, at the very beginning of the algorithm, where is the minimum element overall? Well, whichever of the two arrays it lands in, A or B, it has to be the smallest one there, okay? So the smallest element overall is either the smallest element of A or the smallest element of B. So you just check both places; the smaller of the two is the overall smallest, you copy it over, and you repeat. That's it.

So the purpose of k is just to traverse the output array from left to right; that's the order in which we're gonna populate it. We're currently looking at position i in the first array and position j in the second array; that's how far we've gotten, how deeply we've probed into both of those arrays. We look at which one has the current smallest element, and we copy that one over, okay? So if the entry in the i-th position of A is smaller, we copy that one over; of course, we have to increment i, we probe one deeper into the list A. And symmetrically for the case where the current position in B has the smaller element. Now again, I'm being a little bit sloppy, so that we can focus on the forest and not get bogged down in the trees.
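Here is a runnable Python sketch of the merge step as just described. The names c, i, j, and k follow the lecture, but the indices are zero-based rather than the one-based i = j = 1 of the pseudo-code, since that's the Python convention; the end-of-array checks (the "end cases" discussed next) are already included so that the sketch actually runs.

```python
def merge(a, b):
    """Merge two sorted lists a and b into one sorted output list c."""
    c = []          # the output array C
    i, j = 0, 0     # i traverses a, j traverses b
    for k in range(len(a) + len(b)):   # k walks the output positions left to right
        # Take from a if b is exhausted, or if a's front element is the smaller one.
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            c.append(a[i])   # front of a is the smallest remaining element
            i += 1
        else:
            c.append(b[j])   # front of b is the smallest remaining element
            j += 1
    return c

print(merge([1, 3, 5, 7], [2, 4, 6, 8]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```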
I'm ignoring some end cases, so if you really wanted to implement this, you'd have to add a little bit to keep track of when you fall off the end of either A or B: you need additional checks for when i or j reaches the end of its array, at which point you copy over all the remaining elements of the other array into C. Alright, so I'm gonna give you a cleaned-up version of that pseudo-code, so that you don't have to tolerate my questionable handwriting any longer than is absolutely necessary. This, again, is just the same thing that we wrote on the last slide, okay? The pseudo-code for the merge step.

Now, so that's the merge sort algorithm. Now let's get to the meaty part of this lecture, which is: okay, so merge sort produces a sorted array. What makes it, if anything, better than much simpler non-divide-and-conquer algorithms, like, say, insertion sort? In other words, what is the running time of the merge sort algorithm? Now, I'm not gonna give you a completely precise definition of what I mean by running time, and there's a good reason for that, as we'll discuss shortly. But intuitively, for the running time of an algorithm, you should imagine that you're just running the algorithm in a debugger: every time you press Enter, you advance by one line of the program. And then, basically, the running time is just the number of operations executed, the number of lines of code executed. So the question is: how many times do you have to hit Enter in the debugger before the program finally terminates? So we're interested in how many such lines of code get executed for merge sort when the input array has n numbers.

Okay, so that's a fairly complicated question, so let's start with a more modest goal. Rather than thinking about the number of operations executed by merge sort, which is this crazy recursive algorithm calling itself over and over and over again, let's just think about how many operations are gonna get executed when we do a single merge of two sorted subarrays. That seems like an easier place to start. So let me remind you of the pseudo-code of the merge subroutine; here it is. So let's just go and count up how many operations are gonna get used. So there's the initialization step; let's say that I'm gonna charge us one operation for each of these two initializations.
So let's call this two operations: we set i equal to one and j equal to one. Then we have this for loop, which executes a total of n times. In each of these n iterations of the for loop, how many instructions get executed? Well, we have one comparison: we compare A(i) to B(j). And either way the comparison comes out, we then do two more operations: we do an assignment, here or here, and then we do an increment of the relevant variable, either here or here. So that's gonna be three operations per iteration. And then maybe I'll also say that, in order to increment k, we're gonna charge a fourth operation. Okay? So for each of these n iterations of the for loop, we're gonna do four operations. Alright? So, putting it all together, what we have is the running time for merge.

So let's see the upshot. The upshot is that the running time of the merge subroutine, given an array of m numbers, is at most 4m + 2. So, a couple of comments. First of all, I've changed a letter on you, so don't get confused: on the previous slide we were thinking about an input size of n; here I've just changed the name of the variable to m. That's gonna be convenient once we think about merge sort, which is recursing on smaller subproblems, but it's exactly the same thing. So merging into an array of m entries takes at most 4m + 2 lines of code.

The second thing is, there's some ambiguity in exactly how we counted lines of code on the previous slide. So you might argue that, you know, really, each loop iteration should count as two operations, not just one, 'cause you don't just have to increment k, you also have to compare it to its upper bound. Eh, maybe; then it would have been 5m + 2 instead of 4m + 2. But it turns out these small differences in how you count up the number of lines of code executed are not gonna matter, and we'll see why shortly. So, amongst friends, let's just agree to call it 4m + 2 operations for merge to execute on an array of exactly m entries.

So let me abuse our friendship now a little bit further, with an inequality which is true but extremely sloppy; I promise it'll make our lives easier in some future calculations. So rather than 4m + 2, 'cause the 2 is sorta getting on my nerves, let's just call this at most 6m, because m is at least one. Okay, you have to admit it's true: 6m is at least 4m + 2.
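Writing out the bookkeeping just described, as a sketch that assumes the deliberately rough count of four operations per loop iteration from above:

```latex
% Operation count for the merge subroutine on an array of m entries,
% using the rough accounting above (2 initializations, then 4 operations
% in each of the m loop iterations):
\[
  \underbrace{2}_{\text{initialize } i,\ j}
  \;+\;
  \underbrace{4m}_{4 \text{ ops} \,\times\, m \text{ iterations}}
  \;=\; 4m + 2
  \;\le\; 4m + 2m \;=\; 6m,
  \qquad \text{since } m \ge 1 \text{ implies } 2 \le 2m.
\]
```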
It's very sloppy; these quantities are not particularly close to each other for large m. But let's just go ahead and be sloppy in the interest of future simplicity. Okay. Now, I don't expect anyone to be impressed with this rather crude upper bound on the number of lines of code that the merge subroutine needs to execute. The key question, you recall, was how many lines of code merge sort requires to correctly sort the input array, not just this subroutine. And in fact, analyzing merge sort seems a lot more intimidating, because it keeps spawning off these recursive versions of itself. So the number of recursive calls, the number of things we have to analyze, is blowing up exponentially as we think about various levels of the recursion.

Now, if there's one thing we have going for us, it's that every time we make a recursive call, it's on a quite a bit smaller input than what we started with: it's on an array only half the size of the input array. So there's some kind of tension between, on the one hand, an explosion, a proliferation of subproblems, and, on the other, the fact that successive recursive calls only have to solve smaller and smaller subproblems. Resolving these two forces is what's going to drive our analysis of merge sort.

So the good news is, I'll be able to show you a complete analysis of exactly how many lines of code merge sort takes, and in fact I'll be able to give you a very precise upper bound. So here's the claim that we're gonna prove in the remainder of this lecture: merge sort never needs more than 6n times the logarithm of n (log base two, if you're keeping track), plus an extra 6n operations, to correctly sort an input array of n numbers.

Okay, so let's discuss for a second: is this good, is this a win, knowing that this is an upper bound on the number of lines of code that merge sort takes? Well, yes it is, and it shows the benefits of the divide-and-conquer paradigm. Recall that for the simpler sorting methods we briefly discussed, like insertion sort, selection sort, and bubble sort, I claimed that their performance was governed by a quadratic function of the input size; that is, they need a constant times n squared operations to sort an input array of length n. Merge sort, by contrast, needs at most a constant times n log n, not n squared but n times log n, lines of code to correctly sort an input array.
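To put the claim side by side with the quadratic bound it is being compared against: the constant c below is not pinned down in the lecture, and the n = 1000 numbers are only a rough illustration using log base 2 of 1000, which is approximately 10.

```latex
% Claimed upper bound for merge sort vs. the quadratic bound for the simpler sorts:
\[
  T_{\text{merge sort}}(n) \;\le\; 6\, n \log_2 n + 6n
  \qquad\text{versus}\qquad
  T_{\text{insertion sort}}(n) \;\le\; c \cdot n^2 \ \text{ for some constant } c.
\]
% For n = 1000: roughly 6 \cdot 1000 \cdot 10 + 6000 = 66{,}000 operations,
% versus on the order of 1000^2 = 1{,}000{,}000.
```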
So to get a feel for what kind of win this is, let me just remind you, for those of you who are rusty, or who for whatever reason have lived in fear of the logarithm, exactly what the logarithm is. Okay? So the way to think about the logarithm is as follows. You have the x-axis, where you have n going from one up to infinity. For comparison, let's think about just the identity function, the function f(n) = n. And let's contrast this with the logarithm.

So what is the logarithm? Well, for our purposes, we can just think of the logarithm as follows, okay? The log of n, log base 2 of n, is: you type the number n into your calculator, then you hit divide by two, and you keep dividing by two, counting how many times you divide, until you get down to a number that is at most one. Okay? So if you plug in 32, you've gotta divide by two five times to get down to one; log base two of 32 is five. If you put in 1024, you have to divide by two ten times to get down to one; so log base two of 1024 is ten, and so on. So the point is, you can already see that if the log of a number around a thousand is only something like ten, then the logarithm is much, much smaller than the input.

So graphically, the logarithm looks like a curve that becomes very flat very quickly as n grows large, okay? That's f(n) = log base 2 of n. And I encourage you to plot this, perhaps a little more precisely, on a computer or a graphing calculator at home. But the log grows much, much, much slower than the identity function. And as a result, a sorting algorithm which runs in time proportional to n times log n is much, much faster, especially as n grows large, than a sorting algorithm with a running time that's a constant times n squared.
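As a tiny sketch of the "keep dividing by two" definition of the logarithm just given, matching the 32 and 1024 examples from the lecture (the function name is my own):

```python
def log2_by_halving(n):
    """Count how many times n must be halved before it drops to 1 or below."""
    count = 0
    while n > 1:
        n /= 2
        count += 1
    return count

print(log2_by_halving(32))    # 5
print(log2_by_halving(1024))  # 10
print(log2_by_halving(1000))  # 10, i.e. roughly ten for inputs around a thousand
```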