So I don't blame you if you're feeling a little bit restless. You've now watched three videos about this alternative approach to the union-find data structure based on lazy unions, and, frankly, we don't have many tangible things to show for it. We already had a perfectly fine implementation of this data structure based on eager unions. That gave us constant-time finds. And yes, union could be linear in the worst case, but it was logarithmic on average over a sequence of unions. And now we have this other implementation where both of our operations require logarithmic time. It's true that it's a worst-case bound, and it's true that union by rank is pretty cool, but still, the bottom line up to this point is not that compelling. But I've still got another trick up my sleeve. Just watch what happens once we employ a second optimization known as path compression.

The optimization is very natural. I suspect many serious programmers would come up with it on their own if they were tasked with implementing union-find with union by rank. So to motivate this optimization, let's think about what would be our worst nightmare as someone maintaining such a data structure and hoping for good performance.
Well, remember that the running time of a find is proportional to the number of parent pointers you have to traverse; that is, it's proportional to the depth of the object on which it's invoked. So what does the worst-case find look like? It's going to be a leaf, and moreover the leaf that is furthest away from its corresponding root. Now, on the one hand, we're using union by rank, so we know this depth can't be too bad: it's going to be at most O(log n). However, there will in general be leaves that are Theta(log n) hops away from their root. And for all we know, we're going to get an endless sequence of find operations where every single one is invoked on a leaf that's a logarithmic number of hops away from its root. So, for example, in the pink tree that I've shown in the upper right of the slide, maybe someone just keeps invoking find from the object 1 over and over and over again. Then we're going to suffer a logarithmic number of steps on every single operation. But if you think about it, it's totally pointless to keep redoing the work of previous finds, to keep re-traversing the same parent pointers over and over again.

So, for example, in the pink tree that I've shown in the upper right, imagine that find is invoked from the object 1. We traverse the 3 parent pointers, we discover the ancestors 4 and 6 before terminating at 7, and we discover that 7 is the leader, or the root, corresponding to the object 1. Now remember, we do not care that 4 and 6 are ancestors of 1; that is uninteresting. We only visited them to discover what we actually cared about: that 7 is the leader, the root, corresponding to 1. Well, let's just cache that information now that we've computed it. This is, basically, reconstructing the eager union-find implementation: let's just rewire 1's parent pointer to point straight to 7, so that we don't need to recompute 4 and 6 in some later find operation. And more generally, after we've invoked find from object 1, we may as well rewire 4's parent pointer as well, to point directly to its leader 7.

So that, then, is path compression. When find is invoked from some node x and you traverse parent pointers from x up to its root, call it r, then for every object that you visit along this path from x to r, you rewire the parent pointer to point directly to r. So r doesn't have its parent pointer changed; it still points to itself.
The penultimate object on this path doesn't have its parent pointer changed either; it was already pointing to the root r. But anything below the root and its immediate descendant on this path from x will have its parent pointer updated, and it will be updated to point directly to r. Now, we couldn't get away with this if the tree had to be binary, but it doesn't have to be binary. We couldn't get away with this if we had to satisfy some other constraint, like a search tree property, but we don't. So nothing's stopping us from just caching this information about who everyone's leaders are, because that's really the only information we're responsible for exporting from this union-find data structure.

Pictorially, if you like thinking about the trees, what path compression does in effect is make the tree shallower and more bushy. So in our example in the upper right, it rips out 1 and 4 and makes them immediate descendants of the root 7. If you prefer to think in terms of the array representation, remember that in the array representation, the array entry for i records the parent of the object i. So, in our pink tree, 5, 6, and 7 all point to 7; they all have parent 7. Whereas 2 and 3 have parent 5, 4 has parent 6, and 1 has parent 4.
And after applying path compression following a find invoked at the object 1, 1 and 4 are redirected to have parent 7.

So what are the pros and cons of the path compression optimization? Well, the con side is quite minor. Really, we're just doing a little multitasking on find, and that introduces a small constant-factor overhead. You're already doing work proportional to the number of hops on the path from x to its root, and we're still just doing constant work for each node on that path: extra constant work per node, after the fact, to rewire its parent pointer to point to the root. The pro should be obvious. This is going to speed up all subsequent find operations; you're really making sure you don't redo redundant work, that you don't traverse the same parent pointers over and over and over again. So what's clear is that finds will speed up. What's really not clear is whether they'll speed up by a lot. Is this only going to improve the performance of the find operation by, say, a factor of 2? Or will we get something fundamentally better than the logarithmic performance we were stuck with without path compression? So, let me spend a slide on the subtle point of how ranks interact with the path compression optimization.
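The two-pass rewiring just described can be sketched in a few lines. This is not the lecture's official code, just a minimal illustration assuming a parent list where each root points to itself, tried out on the pink tree from the slide (index 0 is unused padding so that objects are numbered 1 through 7):

```python
# A minimal sketch of find with path compression (not the lecture's
# official code). Assumes parent[i] is the parent of object i, and a
# root r satisfies parent[r] == r.
def find(parent, x):
    # First pass: follow parent pointers up to the root r.
    r = x
    while parent[r] != r:
        r = parent[r]
    # Second pass: rewire every object on the path to point directly to r.
    while parent[x] != r:
        parent[x], x = r, parent[x]   # set parent[x] = r, then step up the old path
    return r

# The pink tree from the slide (index 0 unused; objects are 1..7):
# 1 -> 4 -> 6 -> 7, 2 -> 5, 3 -> 5, 5 -> 7, and 7 is the root.
parent = [0, 4, 5, 5, 6, 7, 7, 7]
assert find(parent, 1) == 7
assert parent[1] == 7 and parent[4] == 7   # 1 and 4 now point straight to 7
assert parent[6] == 7                      # the penultimate node was unchanged
```

Note that the second loop stops at the root's immediate child, matching the observation above that neither r nor the penultimate object on the path has its pointer changed.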
So the plan is, we're going to manipulate rank fields with path compression in exactly the same way as we did when we didn't have path compression. How did we manipulate them previously? Well, at the birth of the union-find data structure, everybody has rank 0. Ranks never change except possibly when you do a union. When you do a union, you look at the roots of the two objects whose groups you have to union, and you look at their ranks. If one of them has strictly bigger rank, that becomes the root of your new merged tree, and the root with the smaller rank is installed as a child underneath it. If you union two roots that have exactly the same rank, you pick arbitrarily which of them is the new root, and you bump up its rank by 1. That is how we manipulated ranks before, and that is exactly how we're going to manipulate them now. So in particular, when we apply path compression, when we rewire parent pointers following a find, we do not touch anybody's ranks; we leave them all exactly the same. So again, ranks only change in one specific case, in a very specific way: namely, during a union, when we merge two trees that have exactly the same rank. When we do that kind of union, that kind of merge, whichever of the old roots winds up being the new root, we bump up its rank by 1. That is the only modification we ever make to any ranks.

Now, it might bother you that we don't recompute the ranks when we apply path compression, and in a way I sort of hope it does bother you. Because by not recomputing ranks, by just manipulating them exactly as before, we're actually losing the semantics that ranks had in the previous video. The key invariant we had, that a node's rank exactly equals the worst-case search time to that object, is now lost. The easiest way to see what I'm talking about is probably through an example, so let me redraw the same tree we had on the previous slide, both before and after we apply path compression following a find invoked at the object 1. What I've done here is, in the top tree, before the path compression has been applied, I've labeled ranks just as they were in the previous video: for each node in the top tree, its rank is equal to the length of the longest path from a leaf to that node. Then I applied path compression from the object 1, resulting in the bushier, more shallow tree on the bottom, and as I said, I did not touch anybody's rank when I applied this path compression; I didn't do any recomputation. And now you see we've lost the semantics of the ranks.
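The union rule just described can be sketched as follows, under the same assumptions as before (a parent list where roots point to themselves); again a minimal illustration rather than the lecture's official code. The point to notice is that the rank array is only ever touched inside union, never by the path compression inside find:

```python
# A minimal sketch of union by rank (not the lecture's official code).
# parent[i] is the parent of object i; rank[i] matters only at roots.
def find(parent, x):
    r = x
    while parent[r] != r:
        r = parent[r]
    while parent[x] != r:
        parent[x], x = r, parent[x]   # path compression: ranks are NOT touched
    return r

def union(parent, rank, x, y):
    rx, ry = find(parent, x), find(parent, y)
    if rx == ry:
        return                        # already in the same group
    if rank[rx] < rank[ry]:
        rx, ry = ry, rx               # make rx the root with (weakly) larger rank
    parent[ry] = rx                   # smaller-rank root installed underneath
    if rank[rx] == rank[ry]:
        rank[rx] += 1                 # equal ranks: bump the new root's rank by 1

# Four objects, everybody rank 0 at birth.
parent, rank = list(range(4)), [0] * 4
union(parent, rank, 0, 1)   # equal ranks: 0 becomes the root, rank 1
union(parent, rank, 2, 3)   # equal ranks: 2 becomes the root, rank 1
union(parent, rank, 0, 2)   # equal ranks again: 0 becomes the root, rank 2
assert find(parent, 3) == 0 and rank[0] == 2
```

The tie-breaking here (keeping the first root when ranks are equal) is the arbitrary choice the lecture mentions; any choice works.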
Just to give a simple example, there are now leaves that don't have rank 0; they have rank strictly bigger than 0. What we can say is that for every node, the rank is an upper bound: not necessarily equal to, but at least as big as, the length of the longest path from a leaf to that node. But because of the shortcuts we've installed, we no longer have the equality that we had before.

There is good news, however: we can definitely salvage a lot of the technology that we developed in the last video for the union-by-rank analysis, and apply it fruitfully here to the case with path compression. In particular, any statement that we made last video that talks only about the ranks of objects, and not about their parent pointers per se, will be just as true with path compression as without. Why? Well, think about making two copies of a union-find data structure, with exactly the same set of objects, and running them in parallel on exactly the same sequence of union and find operations. By definition, we manipulate the ranks of objects in exactly the same way in the two copies; it doesn't matter whether there's path compression or not. So at every moment in time, every object will have exactly the same rank in one copy as it does in the other. Path compression or no, it doesn't matter: exactly the same rank.
What's going to be different in the two copies is the parent pointers. In some sense, they're going to be further along, they're going to point further up into the tree, in the copy with path compression. But no matter: any statement that's purely about ranks is going to be equally true with or without path compression. So in particular, remember the rank lemma that we proved in the last video, which says there are at most n/2^r objects of rank r. That was true without path compression; it's going to be true with path compression, and we're going to use it in the forthcoming analysis. Another fact which is still true with path compression, in fact in some sense is only more true with path compression, is that whenever you traverse a parent pointer, you get to a node with strictly larger rank. Remember, in the last video we said that as you go up toward the root, you see a strictly increasing sequence of ranks. Here, each pointer with path compression represents a sequence of pointers without path compression, so certainly you will still only be increasing in rank each time you traverse a pointer.

Let's now quantify the benefit of path compression by proving new performance guarantees on the operations in this union-find data structure.
In fact, the running time bounds are going to blow away the logarithmic bound that we had when we were only doing union by rank, without path compression. We're going to do the analysis in two steps, and it's going to be historically accurate, in the sense that I'm first going to show you a theorem by Hopcroft and Ullman, which gives an excellent, but not quite optimal, bound on the performance of this union-find data structure. Then we're going to build on the Hopcroft-Ullman ideas to prove Tarjan's even better bound on the performance of this data structure.

So here is the performance guarantee that Hopcroft and Ullman proved back in 1973, about 40 years ago. They said: consider a union-find data structure implemented as we've been discussing, with lazy unions, union by rank, and path compression. Consider an arbitrary sequence of m find and union operations, and suppose the data structure contains n objects. Then the total work that the data structure does to process this entire sequence of m operations is O(m log* n). What is log* of n, you ask? Well, it's the iterated logarithm operator. So, by analogy, remember when we first demystified the logarithm.
Way back at the beginning of part 1, we noticed that log base 2 of a number is just: you type that number into your calculator and count the number of times you divide by 2 until the result drops to 1 or below. So for log*, you type the number n into your calculator and count the number of times you need to hit the log button before the result drops to 1 or below.

It totally boggles the mind just how slowly this log* function grows. In fact, let me give you a quick quiz, just to make sure that you appreciate the glacial growth of the log* function.

So the correct answer is B: it is 5. And you should probably spend a couple of seconds just reflecting on how absurd this is, right? You take this ridiculous number, 2 to the 65,536. Remember, the number of atoms in the known universe is estimated to be way smaller than this; it's just 10 to the 80. So you take this ridiculous number, you apply the log* function, and you get back 5. Why do you get back 5? Well, take the logarithm once: what do you get back? 65,536, also known as 2 to the 16. So you take the log again, and you get back 16, also known as 2 to the 4. You take the log again, you get back 4, also known as 2 to the 2. You take the log again, you get back 2, and then one more time gets you down to 1.
So you can think of log* as the inverse of the tower function, where the tower function takes a positive integer t and raises 2 to the 2 to the 2, and so on, t times. And this in particular shows that log*, despite its glacial growth, is not bounded by a constant: this function is not O(1). For any constant c you can think of, I can, in principle, exhibit a large enough n so that log* of n is bigger than your c. Namely, I can just set n to be 2 to the 2 to the 2, and so on, c + 1 times; log* of that will be bigger than c.

So that's the Hopcroft-Ullman performance guarantee: that, on average over a sequence of m union and find operations, you spend only this ridiculously small log* n amount per operation. We're going to prove it in the next video.
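The "count how many times you hit the log button" definition above translates directly into code. A minimal sketch, checked against the quiz's value 2 to the 65,536 (which Python handles fine as a big integer):

```python
import math

# A minimal sketch of the iterated logarithm log*(n): count how many
# times you must apply log base 2 before the value drops to 1 or below.
def log_star(n):
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

# The quiz's example: log*(2^65536) = 5, via the chain
# 2^65536 -> 65536 -> 16 -> 4 -> 2 -> 1.
assert log_star(2**65536) == 5
assert log_star(2**16) == 4   # 65536 -> 16 -> 4 -> 2 -> 1
```

Since the intermediate values here are exact powers of two, each application of `math.log2` is exact, so the chain of values matches the hand computation in the lecture step for step.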