I need to begin by setting up some notation and definitions. The first one is that of rank blocks. So if we're given a value of n, the number of objects in the data structure, we define the rank blocks as follows. 0 gets its own block, as does 1. The next block is {2, 3, 4}. The next block starts at 5, and its last entry is 2 raised to the final rank of the previous block — 2 raised to the 4th, also known as 16. The next block starts at 17 and goes up to 2 raised to the last rank of the previous block, so 2 raised to the 16, also known as 65,536. And so on, until we have enough rank blocks to capture n. Actually, if you think about it, because ranks can only be log n, we only have to go up to log n, but that doesn't affect anything by more than a constant, so let's just go ahead all the way up to n. So, how many rank blocks are there? Well, focus just on the largest member of each of the rank blocks. The largest member of a given rank block is 2 raised to the largest member of the previous rank block. So, for the t-th rank block, roughly, the largest entry is 2 to the 2 to the 2, raised t times. How many times do you need to do that before you exceed n? By definition, you need log*(n).
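The rank-block construction just described can be sketched as follows — a minimal illustration (not code from the course), where the first three blocks are written out as in the lecture and each subsequent block runs from k+1 up to 2^k:

```python
import math

def rank_blocks(n):
    """Rank blocks as (lo, hi) pairs: {0}, {1}, {2..4}, {5..16}, {17..65536}, ...
    Each new block ends at 2 raised to the last rank of the previous block;
    we stop once a block's top reaches n."""
    blocks = [(0, 0), (1, 1), (2, 4)]   # first three blocks, as in the lecture
    while blocks[-1][1] < n:
        k = blocks[-1][1]               # last rank of the previous block
        blocks.append((k + 1, 2 ** k))  # next block: {k+1, ..., 2^k}
    return blocks

def log_star(n):
    """Iterated logarithm: how many times log2 must be applied to n
    before the result drops to 1 or below."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

For example, `rank_blocks(100)` yields `[(0, 0), (1, 1), (2, 4), (5, 16), (17, 65536)]` — five blocks, matching the lecture's count of roughly log*(n) plus a constant.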
So that's how many rank blocks there are — that's where log*(n) enters the analysis. Now, I realize this must seem like a very inscrutable definition, these weird rank blocks, so let me tell you how you should think about them. They're meant to encode an intuition that we had on the previous slide, where we said, you know, we should be happy if the rank of an object's parent is a lot bigger than the rank of that object itself. Why? Well, when we traverse that object's parent pointer, we make a ton of progress through the rank space, and you can only make a ton of progress through the rank space a limited number of times. So that means a small number of parent traversals, and that means fast finds. So these rank blocks are just a very clever way of defining sufficient progress — of deciding when we're going to be happy with the gap between the rank of an object and the rank of its parent. Specifically, we're happy whenever these two ranks lie in different rank blocks; we're not happy if they lie in the same rank block. So, for example, if you're at an object that has rank 8 and its parent has rank 18, then we're happy, because 18 is in the rank block after the one that contains 8. On the other hand, if you go from rank 8 to rank 15, then we're not happy, okay?
We're going to call that not a big gap between the rank of the object and the rank of the object's parent. So let's build on that idea to make another definition. Consider a given snapshot in time — that is, we've done some sequence of finds and some sequence of unions. We're going to call some objects at this point in time good and some of them bad. Here are the criteria by which you are good. First of all, if you're the root of your tree, you're going to be good. If you're a direct descendant of the root — that is, if your parent is the root of your tree — then you're also good. If not, though — if you're deeper in the tree — then you're going to be good only if your parent's rank is in a strictly larger rank block than your own rank. So, as in the previous example: if your rank is 8 and your parent's is 18, then you're good. If your rank is 8, your parent's is 15, and your parent's not a root, then you're going to be bad. The role of this definition is to split the work that we do into two parts. First of all, we do some work visiting good nodes. Second of all, we do some work visiting bad nodes. The work we do visiting good nodes will be very easy to bound, no problem. We'll have to do a separate global analysis to bound the total work that we do visiting bad nodes.
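The three criteria for being good can be sketched as a small predicate. This is only an illustration: the `parent` and `rank` maps and the hardcoded `blocks` list are hypothetical names standing in for the state of the data structure, not identifiers from the course.

```python
def block_index(r, blocks):
    """Index of the rank block (lo, hi) containing rank r."""
    for i, (lo, hi) in enumerate(blocks):
        if lo <= r <= hi:
            return i
    raise ValueError("rank out of range")

def is_good(x, parent, rank, blocks):
    """An object is good if it is (1) a root, (2) a direct descendant
    of the root, or (3) its parent's rank lies in a strictly later
    rank block than its own rank."""
    if parent[x] == x:                      # criterion 1: root
        return True
    if parent[parent[x]] == parent[x]:      # criterion 2: child of the root
        return True
    return block_index(rank[parent[x]], blocks) > block_index(rank[x], blocks)
```

On the lecture's example — an object of rank 8 deep in the tree — a parent of rank 18 makes it good (blocks {5..16} vs. {17..65536}), while a parent of rank 15 leaves it bad (both ranks in {5..16}).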
This dichotomy is exactly the same as something we faced when analyzing Kruskal's algorithm, when we implemented it using the union-find data structure with eager unions. There, you'll recall, there were some parts of the work in Kruskal's algorithm that were easy to bound iteration by iteration — for example, every cycle check cost only constant time — but then there was this more complicated type of work, namely all of the leader-pointer updates, that we had to bound globally via a separate argument. The exact same thing is going to happen here. Good nodes we'll be able to bound operation by operation, whereas for the bad nodes we'll have a global analysis controlling the total work done over all of the operations. So, more precisely, I've set up the definitions so that the work done visiting good nodes is bounded by O(log* n) in every single operation. Indeed, how many good nodes could you possibly visit during a find operation, say from some object x? So you start at x, and you traverse these parent pointers going all the way up to the root. Well, there's the root, and there's the direct descendant of the root — that's 2, so let's set those aside. What about the other good nodes that you encounter on your path? Well, by definition, when you visit a good node,
the rank of its parent is in a bigger rank block than the rank of that node itself. That is, every time you traverse a parent pointer from a good node, you progress to a subsequent rank block. Well, there are only log* n rank blocks, so you can only progress to a subsequent rank block log* n times. So the total number of good nodes you're going to see is O(log* n). So now, let's go ahead and express the total amount of work that we do over all of the find and union operations using two quantities: the work on the good nodes, which we now know is just log* n in each operation, plus the visits to the bad nodes, which at the moment we have no idea how big it is. So now, let's proceed to the global bound on the total amount of work that we perform on bad nodes — and this is really the crux of the whole theorem. So, let's just review the definitions real quick. What does it mean that you're a bad node? First of all, you're not the root. Second of all, you're not a direct descendant of the root — that is, you have a grandparent. And third, it is not the case that your parent's rank is in a later rank block; it is in exactly the same rank block as your own rank. That's what it means to be bad. So how much work do we spend on bad nodes? Well, let's analyze it one rank block at a time.
So, fix an arbitrary rank block; let's say, for some integer k, its smallest rank is k+1 and its biggest rank is 2^k. Now, I told you what our two main building blocks were. The first one is the rank lemma — I'm going to ask you to remember that in a second. But first I want to put to use our other building block, which is that path compression increases the rank gap between objects and their parents. That's what we're going to use right now. Specifically, consider a find operation and a bad object x that it visits. By virtue of being bad, x is not the root and is not a direct descendant of the root. So the root is a higher-up ancestor than its parent. Therefore, x's parent will be changed in the subsequent path compression: it will be rewired to point directly to the root, a strict ancestor of its previous parent. Therefore, the rank of its new parent will be strictly bigger than the rank of its previous parent. And that's going to keep happening every time that x is visited while it's bad: it keeps getting new parents, and those new parents keep having ranks strictly bigger than the previous one. Well, how many times can this happen before the rank of x's parent has grown so big that it lies in a subsequent rank block?
The biggest value in x's rank block is 2^k, and remember, x is a non-root, so its rank is frozen forever — it's always stuck in this rank block. Once its parent's rank gets updated, let's say, at least 2^k times, then that rank has to be so big that it lies in the next rank block. At that point, x is no longer bad: its parent pointer makes so much progress that it goes to another rank block, so now we're going to call it good. And, of course, once x becomes good in this fashion, it will be good forevermore. It is not a root, it will never be a root again, its rank is frozen forever, and its parent's rank can only go up. So once you're good — once your parent's rank is sufficiently large — it will be sufficiently large for the rest of time. Alright, so we're almost there. Let's just make sure we don't forget anything that we have done. So, we're bounding the total work in two ways. First of all, we visit log* n good nodes in each operation, so over all m operations that's O(m log* n) for the good nodes. Plus, there's some number of visits over the m operations to the bad nodes. We're going to bound that work globally, but we're going to proceed rank block by rank block. So we fix one rank block, with ranks k+1 up to 2^k.
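The counting argument just made can be sketched in a few lines. This is an illustration of the worst case, assuming (pessimistically) that each bad visit raises x's parent's rank by exactly 1; the function name is hypothetical, not from the course.

```python
def max_bad_visits(r, k):
    """Upper bound on visits to a bad object x whose frozen rank r lies
    in the block {k+1, ..., 2**k}. Each bad visit triggers a path
    compression that strictly increases x's parent's rank; once that
    rank exceeds 2**k it is in a later block, and x is good forever."""
    parent_rank = r + 1           # a parent's rank always exceeds x's rank
    visits = 0
    while parent_rank <= 2 ** k:  # parent's rank still in x's own block
        parent_rank += 1          # compression: strictly larger parent rank
        visits += 1
    return visits
```

For any rank r in the block, the count comes out to at most 2^k - r, which is bounded by 2^k — the per-object bound the analysis uses.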
What we showed on the last slide is that for every object x whose final frozen rank lies somewhere in this rank block, the number of times it gets visited while it's bad — the number of times it can be visited before it becomes good forevermore — is bounded above by 2^k. So, having now used one of our key building blocks, that path compression increases the ranks of parents, let's use the other building block, the rank lemma. The rank lemma, if you recall, says that at any given moment in time, and for any possible rank r, the number of objects that currently have rank r cannot be any bigger than n/2^r. So let's use the rank lemma to upper-bound how many objects could possibly have their final frozen rank lying somewhere in this rank block — that is, somewhere between k+1 and 2^k. So, let's just sum over all of the ranks in the rank block, starting at k+1 and going up to 2^k. By the rank lemma, for a given value of i, there are at most n/2^i objects that eventually wind up with rank i, and by the usual geometric sum, this whole thing can be upper-bounded by n/2^k. So, this is now starting to look like a magical coincidence. Of course, we made a number of definitions in the analysis; specifically, we structured the rank blocks so that this magic would happen.
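The geometric sum in this step can be checked directly — a small sketch (hypothetical helper name) of the bound sum over i from k+1 to 2^k of n/2^i, which never exceeds n/2^k:

```python
def block_population_bound(n, k):
    """Rank-lemma bound on how many objects can end up with final rank
    in the block {k+1, ..., 2**k}: at most n / 2**i objects per rank i,
    summed over the whole block."""
    return sum(n / 2 ** i for i in range(k + 1, 2 ** k + 1))
```

Since the infinite tail sum of n/2^i from i = k+1 onward equals exactly n/2^k, the finite sum is strictly below it — so a rank block with at most n/2^k inhabitants, each visited at most 2^k times while bad, contributes at most n bad visits in total.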
Specifically, the number of inhabitants of a rank block, n/2^k, times the maximum number of visits to an inhabitant of the rank block while it's bad, 2^k, is actually independent of the rank block. We multiply these two things together — the number of visits per object, 2^k, and the number of objects, n/2^k — and what do we get? We get n. Now, this was only counting the visits to bad objects in a given rank block, but there aren't many rank blocks — remember, there are only log* n of them. So that means the total amount of work spent visiting the bad nodes, summed over the rank blocks, is O(n log* n). Combining the bounds for the good nodes and the bad nodes, we get O((m + n) log* n). At the beginning, I mentioned that the interesting case is when m is big Omega of n; in that case, this bound is just O(m log* n). Essentially, if you have a very sparse set of union operations, you can just apply this analysis to each of the directed trees separately. So, that's it — that's the full story, the complete analysis of the brilliant Hopcroft-Ullman bound: log* n on-average operation time under path compression. Brilliant as it is, you can do even better. That's the subject of the next couple of videos.
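For reference, the data structure this whole analysis bounds — union by rank with path compression — can be sketched as follows. This is a minimal generic implementation, not code from the course:

```python
class UnionFind:
    """Union by rank plus path compression. By the analysis above, any
    sequence of m find/union operations on n objects takes
    O((m + n) log* n) total time."""

    def __init__(self, n):
        self.parent = list(range(n))  # each object starts as its own root
        self.rank = [0] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:   # walk up to the root
            root = self.parent[root]
        while self.parent[x] != root:      # path compression: rewire every
            self.parent[x], x = root, self.parent[x]  # visited node to the root
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:  # hang the shallower tree
            rx, ry = ry, rx                # under the deeper one
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1             # ranks only ever change at roots
```

Note the two properties the proof leaned on: a non-root's rank is frozen forever (ranks change only at roots in `union`), and compression in `find` rewires nodes to the root, a strict ancestor of their old parent, so parent ranks only go up.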