So, welcome to this optional sequence of videos on state-of-the-art implementations of the union-find data structure. Now, as advertised, this is optional material, so let me make a few comments before we get started. The first comment is that I'm going to assume you're in a particularly motivated mood. No one's forcing you to take these algorithms courses, so I always assume that you're a very motivated individual, but that's going to hold even more true than usual in these advanced videos. The way that plays out is that, while I'm going to hold myself to the same exacting standards of clarity that I usually do, I'm going to ask a little bit more of you, the viewer. I'm going to dot a few fewer i's and cross a few fewer t's than I usually do, so you may find yourself periodically needing to pause and think through some of the details that I'm glossing over. The second comment is that I'm going to give sort of short shrift to applications of the union-find data structure. That's not because there aren't any; there are plenty of applications of this data structure. But in these videos we're really just going to immerse ourselves in the beauty and depth of the ideas behind the design of these data structures, and especially the analysis of their performance. The final comment is to keep in mind that this is some seriously next-level material. It is totally normal that the first time you see this stuff, you find it confusing and difficult to understand. So confusion should not discourage you. It does not represent any intellectual failing on your part. Rather, keep in mind, it represents an opportunity to get even smarter. So, with those comments in mind, let's go ahead and proceed to a different approach to the union-find data structure, via lazy unions.

So let's have one slide with a quick review of what we already know about the union-find data structure. Recall we discussed this in the context of a fast implementation of Kruskal's minimum spanning tree algorithm. The raison d'etre of a union-find data structure is to maintain a partition of a universe of objects. In the context of Kruskal's algorithm, the objects were the vertices, and the groups we wanted to maintain were the connected components with respect to the edges we had committed to so far. The data structure should support two operations, and no prizes for guessing their names. The first is the find operation: given an object from the universe, return the name of that object's group. So for example, if x was in the middle of this square that I put on the right, representing the universe capital X, we would return C3, the name of the group that contains x. In the context of Kruskal's algorithm, we used the find operation to check for cycles. How did we know whether adding a new edge would create a cycle with respect to the edges we'd already chosen? Well, it would create a cycle if and only if the endpoints of that edge were already in the same connected component, that is, if and only if the two find operations returned the same answer. In the union operation, you're given not one object but two, call them x and y, and the responsibility of the operation is to merge the groups that contain x and y. So for example, in the picture, y was in C4 and x was, as before, in C3; then your job is to fuse the groups C3 and C4 into a single group. In the context of Kruskal's algorithm, we needed this operation when we added a new edge, which fused two of the existing connected components into one.
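In code, the interface we just reviewed looks something like the following minimal sketch in Python. The class and method names here are illustrative, not part of any standard library:

```python
# A minimal sketch of the union-find interface described above.
class UnionFind:
    def find(self, x):
        """Return the name (leader) of the group currently containing x."""
        raise NotImplementedError

    def union(self, x, y):
        """Fuse the group containing x with the group containing y."""
        raise NotImplementedError

# The cycle check in Kruskal's algorithm, in this vocabulary: adding
# edge (u, v) would create a cycle if and only if find(u) == find(v).
def creates_cycle(uf, u, v):
    return uf.find(u) == uf.find(v)
```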
So that was exactly the union operation. Now, we already went through one implementation of union-find, in order to get a blazingly fast implementation of Kruskal's algorithm, so let's just review how that works. With each group, we associated a linked structure, so each object had one pointer associated with it. The invariant was that in a given group there was some representative leader object, and everybody in that group pointed to the leader of that group. So for example, on the right I'm showing you a group with three objects, x, y, and z, where x is the leader of the group. All three of the objects point directly to x, and that is just the output of the find operation; the leader serves as the group's name. Now, the part that is cool and obvious about this approach is that find operations take constant time: all you do is return the leader of the given object. The tricky part was analyzing the cost of union operations. The problem is that, to maintain the invariant that every object of a group points to its leader, when you fuse two groups you have to update a bunch of the objects' leader pointers; there can be only one leader for the one new group. The simple but totally crucial optimization that we discussed was: when two groups merge, you update the leader pointers of all objects in the smaller group to point to the leader of the bigger group. That is, the new fused group inherits the leader from the bigger of its two constituent parts. With that optimization, it's still the case that a single union might take linear, that is Θ(n), time, but a sequence of n unions takes only O(n log n) time. And that's because each object endures at most a logarithmic number of leader updates, since every time its leader pointer gets updated, the population of the group that it inhabits at least doubles.

So in this sequence of videos, we're going to discuss a different approach to implementing the union-find data structure. Let's just go all in and try to get away with updating only a single pointer in each union. All right, how are we going to do that? Well, I think if we look at a picture, it'll be clear what the approach is going to be. So let's look at a simple example. There are just six objects in the world, and currently they're in two groups: one, two, and three are in one group with leader one, and four, five, and six are in the second group with leader four. Now, with our previous implementation of union-find, the two groups have equal size, so we pick one arbitrarily to supply the new leader; let's say four is going to be the new leader. We would then update objects one, two, and three so that they point directly to their new leader, four. The new idea is very simple: as a shorthand for rewiring all of one, two, and three to point to four, let's just rewire object one's pointer to point to four, with the understanding that two and three, as descendants of one, now also have the new leader four. As a result, we again get a directed tree afterward, but not one with depth only one. We don't get as shallow and bushy a tree; we get a deeper one, with two levels below the root.
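Before we develop this lazy idea further, it may help to see the eager implementation we just reviewed written out, for contrast with what's coming. Here's a minimal sketch, assuming the objects are numbered 0 through n-1; the explicit per-group member lists are just one way to organize the bookkeeping:

```python
# A sketch of the eager implementation recapped above, with the
# update-the-smaller-group optimization. Objects are assumed to be 0..n-1.
class EagerUnionFind:
    def __init__(self, n):
        self.leader = list(range(n))               # leader[i] = i's group leader
        self.members = {i: [i] for i in range(n)}  # leader -> objects in its group

    def find(self, x):
        return self.leader[x]  # constant time: leaders are stored directly

    def union(self, x, y):
        a, b = self.find(x), self.find(y)
        if a == b:
            return  # already in the same group
        # Relabel the smaller group; each object is relabeled at most
        # O(log n) times, since its group at least doubles in size each time.
        if len(self.members[a]) < len(self.members[b]):
            a, b = b, a
        for obj in self.members.pop(b):
            self.leader[obj] = a
            self.members[a].append(obj)
```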
So let me give you another way of thinking about this, in terms of an array representation. And here I also want to make the point that, while conceptually it's going to be very useful to think of this union-find data structure in terms of directed trees, if you're actually implementing the data structure, that is not how you'd do it. You wouldn't bother with actual pointers between the different objects. You'd just have a simple array, indexed by the nodes, and in the entry corresponding to a node i, you would store the name of i's parent. For instance, if we reframe this exact same example in the array representation, what have we got before the union? Well, objects one, two, and three all have parent one, and objects four, five, and six all have parent four. So the array would have a one in the first three entries and a four in the last three entries. In the old solution, we update the parent pointers of objects one, two, and three so they all point directly to four, which gives us an array entirely of fours. Whereas in the new solution, the only object whose parent pointer gets updated is object one: it gets updated from itself to four. In particular, objects two and three still point to object one, not directly to object four.

So that's how union works in a simple example. How does it work in general? In general, you're given two objects, each belonging to some group. You can think of these two groups conceptually as directed trees: if you follow the parent pointers, they all lead to the root vertex, and we identify those root vertices with the leaders of the two groups. Now, when you have to merge the two trees together, how do you do it? Well, you take one of the two roots, which currently points to itself, and you change its parent pointer to point to the other group's leader. That is, you install one of the two roots as a new child of the other tree's root.

So I hope the pros and cons of this alternative approach to the union-find data structure are intuitively clear. The win, the pro, comes in the union operation, which is now very simple. Now, you might be tempted to say that union takes constant time, because all you do is update one pointer, but it's actually not so simple. Remember, you're given two objects, x and y, and in general you're not linking x and y themselves together; they might be buried somewhere deep inside a tree. You're linking roots together. So what is true is that the union operation reduces to two invocations of find: you have to find x's root, r1, and y's root, r2, and then you do the constant-time linking of either r1 to r2 or vice versa. The issue with this lazy union approach is that it's not at all obvious, and in fact it's not going to be true, that the find operation takes constant time. Remember, previously, when we actually did the hard work of updating all those leader pointers on every union, it guaranteed that whenever there was a find, boom, we just looked up the leader pointer in the relevant field and we were done. Here, parent pointers do not point directly to roots; rather, you have to traverse, in general, a sequence of parent pointers from a given object x all the way up to the root of the corresponding tree. So because of these trade-offs, these pros and cons, it's really not obvious at all whether this lazy union approach to the union-find data structure is a good idea, whether it's going to have any payoffs. Answering that is going to take some quite subtle analysis, which is the main point of the lectures to come. It's also going to take a couple of optimizations, and the next video will start with the first one.
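Here is what the lazy approach looks like in the same style, again a minimal sketch assuming objects numbered 0 through n-1, using the array-of-parents representation just described. Notice that the choice of which root becomes the child of which is left arbitrary for the moment:

```python
# A sketch of the lazy-union approach: parent[i] stores i's parent, and a
# root is an object that points to itself. Objects are assumed to be 0..n-1.
class LazyUnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # No longer constant time: traverse parent pointers up to the root.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        # Two invocations of find, plus a single pointer update.
        r1, r2 = self.find(x), self.find(y)
        if r1 != r2:
            # Which root gets installed under the other is arbitrary here.
            self.parent[r1] = r2
```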
You might be wondering: when you do a union and you have to install one root under the other, how do you make that choice? The right answer is something called union by rank. That's coming up next.