1 00:00:00,840 --> 00:00:03,600 So I don't blame you if you're feeling a little bit restless. 2 00:00:03,600 --> 00:00:06,460 You've now watched three videos about this alternative approach 3 00:00:06,460 --> 00:00:10,020 to the union-find data structure based on lazy unions, and 4 00:00:10,020 --> 00:00:12,380 frankly we don't have many tangible things to show for it. 5 00:00:12,380 --> 00:00:15,330 We already had a perfectly fine implementation of this data structure 6 00:00:15,330 --> 00:00:19,190 based on eager unions that gave us constant-time finds, and yeah, 7 00:00:19,190 --> 00:00:20,750 union could be linear in the worst case, 8 00:00:20,750 --> 00:00:23,750 but it was logarithmic on average over a sequence of unions. 9 00:00:23,750 --> 00:00:25,850 So now we have this other implementation. 10 00:00:25,850 --> 00:00:28,910 Both of our operations require logarithmic time. 11 00:00:28,910 --> 00:00:32,700 It's true it's a worst-case bound, and it's true union by rank is pretty cool. 12 00:00:32,700 --> 00:00:36,390 But still, the bottom line up to this point is not that compelling. 13 00:00:36,390 --> 00:00:39,600 But I've still got another trick up my sleeve. 14 00:00:39,600 --> 00:00:44,161 Now just watch what happens once we employ a second optimization known as 15 00:00:44,161 --> 00:00:45,439 path compression. 16 00:00:48,107 --> 00:00:49,990 So the optimization is very natural. 17 00:00:49,990 --> 00:00:54,388 I suspect many a serious programmer would come up with it on their 18 00:00:54,388 --> 00:00:59,191 own if they were tasked with implementing union-find with union by rank. 19 00:00:59,191 --> 00:01:02,542 So to motivate this optimization, let's think about what would be our worst 20 00:01:02,542 --> 00:01:05,681 nightmare as someone maintaining such a data structure and hoping for 21 00:01:05,681 --> 00:01:07,160 good performance.
22 00:01:07,160 --> 00:01:10,660 Well, remember the running time of a find is proportional to the number of parent 23 00:01:10,660 --> 00:01:11,740 pointers you have to traverse. 24 00:01:11,740 --> 00:01:15,750 That is, it's proportional to the depth of the object at which it's invoked. 25 00:01:15,750 --> 00:01:17,350 So, what does the worst-case find look like? 26 00:01:17,350 --> 00:01:18,720 Well, it's going to be at a leaf. 27 00:01:18,720 --> 00:01:19,790 And moreover, it's going to be a leaf 28 00:01:19,790 --> 00:01:22,830 which is furthest away from its corresponding root. 29 00:01:22,830 --> 00:01:25,190 Now, on the one hand we're using union by rank, so 30 00:01:25,190 --> 00:01:27,000 we know that this depth can't be too bad. 31 00:01:27,000 --> 00:01:30,210 It's going to be at most big O of log n. 32 00:01:30,210 --> 00:01:32,145 However, there will, 33 00:01:32,145 --> 00:01:36,585 in general, be leaves that are theta of log n hops away from their root. 34 00:01:36,585 --> 00:01:39,745 And, for all we know, we're just going to get this endless sequence of find 35 00:01:39,745 --> 00:01:42,475 operations, where every single one is invoked on 36 00:01:42,475 --> 00:01:45,262 a leaf that's a log n number of hops away from its root. 37 00:01:45,262 --> 00:01:49,123 So for example, in the pink tree that is shown in the upper right of the slide, 38 00:01:49,123 --> 00:01:51,578 maybe someone keeps invoking find 39 00:01:51,578 --> 00:01:55,155 from the object one, over and over and over again. 40 00:01:55,155 --> 00:01:59,106 And then we're going to be suffering a logarithmic number of steps with every single 41 00:01:59,106 --> 00:01:59,832 operation. 42 00:02:02,040 --> 00:02:02,820 But if you think about it, 43 00:02:02,820 --> 00:02:06,080 it's totally pointless to keep redoing the work of previous finds, 44 00:02:06,080 --> 00:02:09,220 to keep retraversing the same parent pointers over and over again.
45 00:02:09,220 --> 00:02:12,750 So for example, in the pink tree that I've shown in the upper right, 46 00:02:12,750 --> 00:02:15,010 imagine that find is invoked from object one. 47 00:02:15,010 --> 00:02:16,980 So then we traverse the three parent pointers. 48 00:02:16,980 --> 00:02:20,250 We discover the ancestors four and six before terminating at seven. 49 00:02:20,250 --> 00:02:21,950 And we discover that seven is the leader, or 50 00:02:21,950 --> 00:02:24,180 the root, corresponding to the object one. 51 00:02:24,180 --> 00:02:26,905 Now remember, we do not care that four and 52 00:02:26,905 --> 00:02:30,360 six are ancestors of one; that is uninteresting. 53 00:02:30,360 --> 00:02:33,650 We only visited them to discover what we actually cared about, 54 00:02:33,650 --> 00:02:36,960 that seven is the leader, the root, corresponding to one. 55 00:02:36,960 --> 00:02:40,590 Well, let's just cache that information now that we've computed it. 56 00:02:40,590 --> 00:02:44,890 So, this is basically reconstructing the eager union-find implementation. 57 00:02:44,890 --> 00:02:48,700 Let's just rewire one's parent pointer to point straight to seven. 58 00:02:48,700 --> 00:02:52,079 We don't need to rediscover four and six in some later find operation. 59 00:02:55,004 --> 00:02:59,541 And more generally, after we've invoked find from object one, we may as well 60 00:02:59,541 --> 00:03:04,302 rewire four's parent pointer as well, to point directly to its leader, seven. 61 00:03:07,147 --> 00:03:09,730 So that, then, is path compression. 62 00:03:09,730 --> 00:03:14,720 When find is invoked from some node x, and you traverse parent pointers from x 63 00:03:14,720 --> 00:03:19,710 up to its root, call it r, then for every object that you visit along 64 00:03:19,710 --> 00:03:24,930 this path from x to r, you rewire the parent pointer to point directly to r. 65 00:03:24,930 --> 00:03:28,340 So r doesn't have its parent pointer changed; it still points to itself.
66 00:03:28,340 --> 00:03:32,760 The penultimate object on this path doesn't have its parent pointer changed either. 67 00:03:32,760 --> 00:03:35,250 It already was pointing to the root r. But 68 00:03:35,250 --> 00:03:39,910 anything below the root and its immediate descendants on this path from x 69 00:03:39,910 --> 00:03:44,230 will have its parent pointer updated, and it'll be updated to point directly to r. 70 00:03:44,230 --> 00:03:47,430 Now we couldn't get away with this if the tree had to be binary. 71 00:03:47,430 --> 00:03:48,960 But it doesn't have to be binary. 72 00:03:48,960 --> 00:03:52,120 We couldn't get away with this if we had to satisfy some other constraint, like 73 00:03:52,120 --> 00:03:54,340 a search tree property, but we don't. 74 00:03:54,340 --> 00:03:57,230 So nothing's stopping us from just caching this information 75 00:03:57,230 --> 00:03:59,290 about who everyone's leader is, 76 00:03:59,290 --> 00:04:02,100 because that's really the only information we're responsible for 77 00:04:02,100 --> 00:04:04,239 exporting from this union-find data structure. 78 00:04:06,340 --> 00:04:09,280 So pictorially, if you like thinking about the trees: in effect, 79 00:04:09,280 --> 00:04:13,330 what path compression does is make the trees shallower and bushier. 80 00:04:13,330 --> 00:04:15,916 So in our example in the upper right, it rips out one and 81 00:04:15,916 --> 00:04:20,019 four and makes them immediate descendants of the root seven. 82 00:04:21,450 --> 00:04:24,860 If you prefer to think in terms of the array representation, remember that in the array 83 00:04:24,860 --> 00:04:31,430 representation, the ith entry of the array records the parent of the object i. 84 00:04:31,430 --> 00:04:35,460 So, in our pink tree, five, six, and seven all point to seven. 85 00:04:35,460 --> 00:04:36,590 They all have parent seven. 86 00:04:39,920 --> 00:04:41,920 Whereas two and three both have parent five.
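The two-pass rewiring described above fits in a few lines. Here is a sketch in Python (my own illustration, not code from the lecture; it assumes the parent-array representation just discussed, where entry i holds the parent of object i and a root points to itself):

```python
def find(parent, x):
    """Return the root of x's tree, compressing the path along the way."""
    # First pass: follow parent pointers up to the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second pass (path compression): rewire every object visited on the
    # path so its parent pointer points directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

# The pink tree from the slide: 1 -> 4 -> 6 -> 7 (the root),
# and 2, 3 -> 5 -> 7.  Index 0 is unused.
parent = [0, 4, 5, 5, 6, 7, 7, 7]
print(find(parent, 1))        # prints 7
print(parent[1], parent[4])   # prints 7 7: both now point straight to the root
```

After the call, one and four are immediate children of seven, exactly the "shallower, bushier" picture from the slide, while untouched branches (two and three under five) are left alone.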
87 00:04:45,740 --> 00:04:48,930 Four has the parent six, and one has the parent four. 88 00:04:48,930 --> 00:04:53,120 And after applying path compression following a find invoked at the object 89 00:04:53,120 --> 00:04:57,110 one, one and four are redirected to have parent seven. 90 00:04:59,630 --> 00:05:03,180 So what are the pros and cons of the path compression optimization? 91 00:05:03,180 --> 00:05:05,370 Well, the cons side is quite minor. 92 00:05:05,370 --> 00:05:08,380 Really, we're just doing a little multitasking on find, and 93 00:05:08,380 --> 00:05:11,430 that introduces a small constant-factor overhead. 94 00:05:11,430 --> 00:05:14,790 We're already doing work proportional to the number of hops 95 00:05:14,790 --> 00:05:18,040 on the path from x to its root, and we are still just doing constant work for 96 00:05:18,040 --> 00:05:19,850 each node on that path. 97 00:05:19,850 --> 00:05:21,220 We're just doing an extra constant amount of work for 98 00:05:21,220 --> 00:05:27,080 each node after the fact, to rewire the parent pointer to point to the root. 99 00:05:27,080 --> 00:05:31,540 The pro should be obvious: this is going to speed up all subsequent find operations. 100 00:05:31,540 --> 00:05:34,660 You are really making sure you don't do redundant work. 101 00:05:34,660 --> 00:05:38,250 You don't traverse the same parent pointers over and over and over again. 102 00:05:38,250 --> 00:05:41,300 So what's clear is that finds will speed up. 103 00:05:41,300 --> 00:05:43,940 What's really not clear is whether they'll speed up by a lot. 104 00:05:43,940 --> 00:05:47,140 So is this going to improve the performance of the find operation by, say, 105 00:05:47,140 --> 00:05:50,600 a factor of two, or will we get something fundamentally better 106 00:05:50,600 --> 00:05:54,070 than the logarithmic performance we were stuck with without path compression?
107 00:05:57,750 --> 00:06:02,345 So, let me spend a slide on the subtle point of how ranks interact with the path 108 00:06:02,345 --> 00:06:04,122 compression optimization. 109 00:06:08,234 --> 00:06:13,271 So, the plan is, we're going to manipulate rank fields with path 110 00:06:13,271 --> 00:06:19,920 compression in exactly the same way as we did when we didn't have path compression. 111 00:06:21,590 --> 00:06:23,660 So how did we manipulate them previously? 112 00:06:23,660 --> 00:06:27,780 Well, at the birth of the union-find data structure, everybody has rank zero. 113 00:06:27,780 --> 00:06:32,131 Ranks never change except possibly when you do a union. 114 00:06:32,131 --> 00:06:35,444 When you do a union, you look at the roots of the two objects whose groups you 115 00:06:35,444 --> 00:06:36,160 have to union. 116 00:06:36,160 --> 00:06:37,440 You look at their ranks. 117 00:06:37,440 --> 00:06:40,650 If one of them has strictly bigger rank, that becomes 118 00:06:40,650 --> 00:06:42,770 the root of your new merged tree. 119 00:06:42,770 --> 00:06:46,620 The root with the smaller rank is installed as a child underneath. 120 00:06:46,620 --> 00:06:50,340 If you union two roots that have exactly the same rank, 121 00:06:50,340 --> 00:06:55,760 you pick arbitrarily which of them is the new root, and you bump up its rank by one. 122 00:06:55,760 --> 00:06:57,530 That is how we manipulated ranks before, and 123 00:06:57,530 --> 00:06:59,810 that is exactly how we're going to manipulate them now. 124 00:07:01,910 --> 00:07:04,580 So in particular, when we apply path compression, 125 00:07:04,580 --> 00:07:10,560 when we rewire parent pointers following a find, we do not touch anybody's ranks. 126 00:07:10,560 --> 00:07:13,090 We leave them all exactly the same. 127 00:07:13,090 --> 00:07:18,920 So again, ranks only change in one specific case, in a very specific way.
128 00:07:18,920 --> 00:07:24,890 Namely, during a union when we merge two trees that have exactly the same rank. 129 00:07:24,890 --> 00:07:27,630 And when we do that kind of union, that kind of merge, 130 00:07:27,630 --> 00:07:32,060 whichever of the old roots winds up being the new root, we bump up its rank by one. 131 00:07:32,060 --> 00:07:35,580 That is the only modification we ever make to any ranks. 132 00:07:37,770 --> 00:07:40,760 Now, it might bother you that we don't recompute the ranks when we apply 133 00:07:40,760 --> 00:07:41,590 path compression. 134 00:07:41,590 --> 00:07:46,282 And in a way I sort of hope it does bother you, because by not recomputing ranks, by 135 00:07:46,282 --> 00:07:51,189 just manipulating them exactly as before, we're actually losing the semantics 136 00:07:51,189 --> 00:07:55,312 of what ranks meant in the previous video. So the key invariant we had, 137 00:07:55,312 --> 00:08:00,248 that ranks exactly represented worst-case search time to that object, is now lost. 138 00:08:02,571 --> 00:08:06,355 The easiest way to see what I'm talking about is probably through an example, so 139 00:08:06,355 --> 00:08:09,792 let me redraw the same tree we had on the previous slide, both before and 140 00:08:09,792 --> 00:08:11,900 after we apply path compression, 141 00:08:11,900 --> 00:08:15,103 following a find invoked at the object one. 142 00:08:16,499 --> 00:08:20,120 So what I've done here is, in the top tree, before the path compression's been 143 00:08:20,120 --> 00:08:23,460 applied, I've labeled ranks just as they were in the previous video. 144 00:08:23,460 --> 00:08:27,760 So for each node in the top tree, its rank is equal to the length of the longest path 145 00:08:27,760 --> 00:08:29,620 from a leaf to that node. 146 00:08:29,620 --> 00:08:32,680 Then I applied path compression from the object one, 147 00:08:32,680 --> 00:08:36,360 resulting in the bushier, shallower tree on the bottom.
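Since ranks change in only that one case, the union-by-rank bookkeeping can be sketched as follows (again my own illustration under the parent-array representation, with a separate rank array; the helper root_of is a hypothetical name, and it omits path compression deliberately, since compression plays no role in how ranks are manipulated):

```python
def root_of(parent, x):
    # Follow parent pointers up to the root.
    while parent[x] != x:
        x = parent[x]
    return x

def union(parent, rank, x, y):
    """Merge the groups containing x and y, using union by rank."""
    rx, ry = root_of(parent, x), root_of(parent, y)
    if rx == ry:
        return  # already in the same group; nothing to do
    if rank[rx] < rank[ry]:
        rx, ry = ry, rx          # make rx the root with the larger rank
    parent[ry] = rx              # install the smaller-rank root underneath
    if rank[rx] == rank[ry]:
        rank[rx] += 1            # the ONLY case in which any rank ever changes
```

Note that when the two roots have unequal ranks, no rank is touched at all; only a merge of two equal-rank roots bumps the new root's rank by one, exactly as in the lecture.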
148 00:08:36,360 --> 00:08:39,900 And as I said, I did not touch anybody's ranks when I applied 149 00:08:39,900 --> 00:08:40,810 this path compression. 150 00:08:40,810 --> 00:08:42,415 I didn't do any recomputations. 151 00:08:42,415 --> 00:08:45,260 And you'll see now we've lost the semantics for the ranks. 152 00:08:45,260 --> 00:08:48,560 Just to give a simple example, there are now leaves that don't have rank zero; 153 00:08:48,560 --> 00:08:51,030 they have ranks strictly bigger than zero. 154 00:08:51,030 --> 00:08:55,480 So what we can say is that for every node, the rank is an upper bound, 155 00:08:55,480 --> 00:08:58,017 not necessarily exactly the same as, but 156 00:08:58,017 --> 00:09:01,940 at least as big as, the length of the longest path from a leaf to that node. 157 00:09:01,940 --> 00:09:05,990 But because of the shortcuts that we've installed, we no longer have the equality 158 00:09:05,990 --> 00:09:06,780 that we had before. 159 00:09:09,190 --> 00:09:10,360 There is good news, however. 160 00:09:10,360 --> 00:09:12,910 We can definitely salvage a lot of the technology that we 161 00:09:12,910 --> 00:09:15,880 developed in the last video for the union by rank analysis, and 162 00:09:15,880 --> 00:09:18,960 apply it fruitfully here to the case with path compression. 163 00:09:18,960 --> 00:09:21,920 In particular, any statement that we made last video 164 00:09:21,920 --> 00:09:26,980 that talks only about the ranks of objects, and not about their parent pointers per se, 165 00:09:26,980 --> 00:09:30,840 will be as true with path compression as without. 166 00:09:30,840 --> 00:09:31,720 Why? 167 00:09:31,720 --> 00:09:35,320 Well, think about making two copies of a union-find data structure, 168 00:09:35,320 --> 00:09:39,450 with exactly the same set of objects, and then run them in parallel on exactly 169 00:09:39,450 --> 00:09:42,650 the same sequence of union and find operations.
170 00:09:42,650 --> 00:09:44,020 So, by definition, 171 00:09:44,020 --> 00:09:48,860 we manipulate the ranks of objects in exactly the same way in the two copies. 172 00:09:48,860 --> 00:09:51,090 It doesn't matter if there's path compression or not. 173 00:09:51,090 --> 00:09:55,130 So at all moments in time, every object will have exactly the same rank in 174 00:09:55,130 --> 00:09:57,050 one copy as it does in the other copy. 175 00:09:57,050 --> 00:09:59,200 Path compression or no, it doesn't matter. 176 00:09:59,200 --> 00:10:00,980 Exactly the same ranks. 177 00:10:00,980 --> 00:10:04,240 What's going to be different in the two copies are the parent pointers. 178 00:10:04,240 --> 00:10:06,050 They're, in some sense, going to be further along. 179 00:10:06,050 --> 00:10:10,480 They're going to point further up into the tree in the copy with path compression. 180 00:10:10,480 --> 00:10:11,780 But no matter. 181 00:10:11,780 --> 00:10:15,860 Any statement that's purely about ranks is going to be equally true with or 182 00:10:15,860 --> 00:10:17,270 without path compression. 183 00:10:19,050 --> 00:10:22,460 So in particular, remember the rank lemma that we proved in the last video, 184 00:10:22,460 --> 00:10:26,970 which says there are at most n over 2 to the r objects of rank r. 185 00:10:26,970 --> 00:10:28,490 That was true without path compression. 186 00:10:28,490 --> 00:10:30,820 It's going to be true with path compression. 187 00:10:30,820 --> 00:10:33,330 And we're going to use it in the forthcoming analysis. 188 00:10:35,790 --> 00:10:38,260 Another fact which is still true with path compression, 189 00:10:38,260 --> 00:10:42,220 in fact in some sense it's only more true with path compression, is that whenever 190 00:10:42,220 --> 00:10:46,320 you traverse a parent pointer, you will get to a node with strictly larger rank.
191 00:10:46,320 --> 00:10:48,950 Remember, in the last video we said, as you go up toward the root, 192 00:10:48,950 --> 00:10:52,570 you're going to see a strictly increasing sequence of ranks. 193 00:10:52,570 --> 00:10:56,240 Here, each pointer with path compression represents a sequence 194 00:10:56,240 --> 00:10:58,570 of pointers without path compression. 195 00:10:58,570 --> 00:11:01,777 So certainly you will still only increase in rank every time you 196 00:11:01,777 --> 00:11:02,782 traverse a pointer. 197 00:11:06,230 --> 00:11:10,238 Let's now quantify the benefit of path compression by proving new performance 198 00:11:10,238 --> 00:11:13,540 guarantees on the operations in the union-find data structure. 199 00:11:13,540 --> 00:11:17,301 And frankly, the running time bounds are going to blow away the logarithmic bound 200 00:11:17,301 --> 00:11:20,750 that we had when we were only doing union by rank without path compression. 201 00:11:22,150 --> 00:11:24,100 We're going to do the analysis in two steps, and 202 00:11:24,100 --> 00:11:26,940 it's going to be historically accurate in the sense that I'm first going to show you 203 00:11:26,940 --> 00:11:30,410 a theorem by Hopcroft and Ullman, which gives an excellent but 204 00:11:30,410 --> 00:11:34,550 not quite optimal bound on the performance of this union-find data structure. 205 00:11:34,550 --> 00:11:37,230 And then we're going to build on the Hopcroft-Ullman ideas 206 00:11:37,230 --> 00:11:40,430 to prove an even better bound on the performance of this data structure. 207 00:11:42,450 --> 00:11:45,633 So here is the performance guarantee that Hopcroft and 208 00:11:45,633 --> 00:11:48,894 Ullman proved back in 1973, about 40 years ago. 209 00:11:48,894 --> 00:11:49,421 They said: 210 00:11:49,421 --> 00:11:53,390 consider a union-find data structure implemented as we've been discussing, with 211 00:11:53,390 --> 00:11:57,570 lazy unions, union by rank, and path compression.
212 00:11:57,570 --> 00:12:02,560 Consider an arbitrary sequence of m find and union operations, and 213 00:12:02,560 --> 00:12:05,540 suppose the data structure contains n objects. 214 00:12:05,540 --> 00:12:09,710 Then the total work that the data structure does to process this entire 215 00:12:09,710 --> 00:12:15,518 sequence of m operations is big O of m times log star of n. 216 00:12:15,518 --> 00:12:18,770 What is log star of n, you ask? 217 00:12:18,770 --> 00:12:21,290 Well, it's the iterated logarithm operator. 218 00:12:21,290 --> 00:12:25,139 So, by analogy: remember when we first demystified the logarithm, way back at 219 00:12:25,139 --> 00:12:28,693 the beginning of part one, we noticed the log base two of a number is, well, 220 00:12:28,693 --> 00:12:30,966 you just type that number into your calculator, and 221 00:12:30,966 --> 00:12:34,910 you count the number of times you divide by two until the result drops below one. 222 00:12:34,910 --> 00:12:38,401 So, for log star, you type the number n into your calculator, and 223 00:12:38,401 --> 00:12:43,140 you count the number of times you need to hit log before the result drops below one. 224 00:12:44,190 --> 00:12:48,660 So it totally boggles the mind just how slowly this log star function grows. 225 00:12:48,660 --> 00:12:52,836 In fact, let me give you a quick quiz just to make sure that you appreciate 226 00:12:52,836 --> 00:12:55,451 the glacial growth of the log star function. 227 00:13:03,814 --> 00:13:06,280 So the correct answer is B. 228 00:13:06,280 --> 00:13:07,584 It is five. 229 00:13:07,584 --> 00:13:11,667 And you should probably spend a couple seconds just reflecting on how absurd 230 00:13:11,667 --> 00:13:12,190 this is. 231 00:13:12,190 --> 00:13:14,522 Right? So you take this ridiculous number, 2 232 00:13:14,522 --> 00:13:15,700 to the 65,536. 233 00:13:15,700 --> 00:13:19,334 Remember, the estimated number of atoms in the known universe is way smaller 234 00:13:19,334 --> 00:13:20,010 than this.
235 00:13:20,010 --> 00:13:20,880 The atom count is only about 10 to the 80. 236 00:13:20,880 --> 00:13:23,850 So, you take this ridiculous number, and you apply the log star function. 237 00:13:23,850 --> 00:13:25,291 You get back 5. 238 00:13:25,291 --> 00:13:26,351 Why do you get back 5? 239 00:13:26,351 --> 00:13:28,150 Well, take the logarithm once. 240 00:13:28,150 --> 00:13:29,635 What do you get back? 241 00:13:29,635 --> 00:13:33,770 65,536, also known as 2 to the 16. So you take log again. 242 00:13:33,770 --> 00:13:35,888 You get back 16, also known as 2 to the 4. 243 00:13:35,888 --> 00:13:36,750 You take log again. 244 00:13:36,750 --> 00:13:39,420 You get back 4, also known as 2 to the 2. 245 00:13:39,420 --> 00:13:40,210 You take log again. 246 00:13:40,210 --> 00:13:42,542 You get back 2, and then one more time gets you down to 1. 247 00:13:43,959 --> 00:13:46,907 So, you can think of log star as the inverse of the tower function, 248 00:13:46,907 --> 00:13:50,222 where the tower function takes, think of it as a positive integer t, and 249 00:13:50,222 --> 00:13:53,144 raises two to the two to the two to the two, t times. 250 00:13:54,581 --> 00:13:58,677 And this, in particular, shows that log star, despite its glacial growth, 251 00:13:58,677 --> 00:14:00,240 is not bounded by a constant. 252 00:14:00,240 --> 00:14:01,700 This function is not 253 00:14:01,700 --> 00:14:02,830 big O of one. 254 00:14:02,830 --> 00:14:08,270 For any constant c you can think of, I can in principle exhibit a large enough n so 255 00:14:08,270 --> 00:14:13,530 that log star of my n is bigger than your c. 256 00:14:13,530 --> 00:14:17,920 Namely, I can just set n to be two to the two to the two, and so on, c times. 257 00:14:17,920 --> 00:14:19,650 Log star of that will be at least c.
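The "keep hitting the log button" description translates directly into code. Here is a sketch (my own illustration; it uses Python's math.log2 in the role of the calculator's log button, and relies on the fact that math.log2 accepts arbitrarily large integers):

```python
import math

def log_star(n):
    """Iterated logarithm: how many times must log base 2 be applied
    before the result drops to one or below?"""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

# The quiz value: 2^65536 -> 65536 -> 16 -> 4 -> 2 -> 1, five applications.
print(log_star(2 ** 65536))  # prints 5
```

Five hits of the log button on a number dwarfing the count of atoms in the universe: that is the glacial growth being described.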
258 00:14:20,880 --> 00:14:24,220 So that's the Hopcroft-Ullman performance guarantee: that 259 00:14:24,220 --> 00:14:27,980 on average over a sequence of m union-find operations, you spend 260 00:14:27,980 --> 00:14:32,560 only this ridiculously small log star of n amount of work per operation. 261 00:14:32,560 --> 00:14:34,240 We're going to prove it in the next video.