Alright. So the plan for this video is to
prove the correctness of the divide and
conquer closest to pair algorithm that we
discussed in the previous video. So just
to refresh your memory, how does the outer
algorithm work? Well, we're given endpoints
in the plane. We begin by sorting them,
first by x-coordinate and then by
y-coordinate. That takes n log in time.
Then we enter the main recursive divide
and conquer part of the algorithm. So what
do we do? We divide the point set into the
left half and the right half, Q and
R, then we conquer. We recursively compute
the closest pair in the left half of the
point set Q. We recursively compute the
closest pair in the right half of the
point set R. There is a lucky case where
the closest pair on the entire point set
lies either all on the left or all on the
right. In that case, the closest pair is
handed to us on a silver platter,
by one of the two recursive calls. But there
remains the unlucky case where the closest
pair is actually split with one point on
the left and one point on the right. So to
get our N log N running time bound,
analogous to Merge Short in our inversion
counting, we need to have a linear time
implementation of a subroutine which
computes the best, the closest pair of
points, which is split, one on the left
and one on the right. Well, actually, we
don't need to do quite that. We need to
do something only a little bit weaker. We need
a linear time algorithm, which whenever
the closest pair in the whole point set is
in fact split, then computes that split
pair in linear time. So let me now
remind you of how that subroutine works.
So it has two basic steps. So first,
there's a filtering step. So it looks at,
first of all, a vertical strip, roughly
down the middle of the point set. And it
looks at, only at points which fall into
that vertical strip. That was a subset of
the points that we called S sub Y, 'cause
we keep track of them sorted by Y
coordinate. And then we do essentially a
linear scan through SY. So we go through
the points one at a time, and then, for
each point, we look at only the almost
adjacent points. So for each index I, we
look only at J's that are between one and
seven positions further to the right, than
I. So among all such points, we
compare them, we look at their distances.
We remember the best such pair of points.
And then that's what we return from the
count split pair subroutine. So we've
already argued, in the previous video,
that the overall running time of the
algorithm is N log N. And what remains to
prove correctness. And we also argued, in
the previous video, that correctness boils
down to the following correctness claim.
In the sense that, if we can prove this
claim, then the entire algorithm is
correct. So this is what remains. Our
residual work is to provide a proof of
the correctness claim. What does it say?
It says consider any split pair that is
one point p from the left side Q, capital
Q, and another point little q drawn from
the right side of the point set capital R.
And fur, further suppose that it's an
interesting split pair meaning that the
distance between them's at most delta.
Here delta is recall the parameter pass to
the count split pair subroutine, which is
the smallest distance between a pair of
points all on the left or all on the
right. And this is the only case we're
interested in. There's two claims. First
of all, for p and q, both members of the
split pair survive the filtering step.
They make it into the sorted list S sub Y, and
second of all, they will be considered by
that double for loop, in the sense that
the positions of p and q in this array, S
sub Y, differ by at most seven. So that's
the story so far. Let's move on to the
proof. So let's start with part A which is
the easy part relatively of the claim. So
remember what we start with, our
assumptions. We have a point p, let's
write it out in terms of the X
coordinates, X1 and Y1, which is from the
left half of the point set. And we have a
point q, which we'll call X2Y2, which
comes from the right half of the point
set. And furthermore, we're assuming that
these points are close to each other. And
we're gonna use that hypothesis over and
over again. So the Euclidean distance
between p and q is no more than this
parameter delta. So, first, something very
simple, which is that if you have two
points that are close in Euclidean
distance, then both of their coordinates
have to be close to each other, right? If
you have two points, and they differ by a
lot in one coordinate, then the Euclidean
distance is gonna be pretty big as well.
So, specifically. By our hypothesis, that p
and q have Euclidean distance less than
delta, it must be that the difference
between their coordinates in absolute
value is no more than delta, and as well,
the difference between their y-coordinates
is at most delta. Okay, and this is easy
to see if you'd just return to the
definition of Euclidean distance that we
reviewed at the beginning of the
discussion of closest points. Okay? So if
your distance is at most delta, then in
each coordinate, you differ by at most
delta as well. Now, what does A say?
[sound]. So proof of A. So what does part A
of the claim assert? It asserts that p
and q are both members of SY, are both
members of that vertical strip. So another
way of saying that is that the X
coordinates of p and q, that is, the
numbers X1 and X2 both are within delta of
Xbar. Remember, Xbar was in some sense
the median X coordinate. So the X
coordinate of the N over two'th leftmost
point. So we're gonna do a proof by
picture, so consider, forget about the Y
coordinates, that's of irrelevant right
now, and just focus on the X coordinates
of all of these points. So on the one hand
we have X bar. This is the X coordinate of
the N over two'th point to the left. And then
there are the X coordinates which define
the left and the right borders of that
vertical strip. Namely Xbar-delta and
Xbar+delta. And then somewhere in here are
X1 and Y1, the X coordinates of the points
we care about, p and q. So a simple
observation, so because p comes from the
left half of the point set, and Xbar is
the rightmost X coordinate of the left
half of the point set, the X coordinate is
at most Xbar. Right? So all points of Q
have X coordinate, at most, Xbar, in
particular, p does. Similarly, since Xbar
is the rightmost edge of the left half of
the point set, everything in the right half
of the point set has X coordinate, at
least Xbar. So in particular, little q
does as well. So what does this mean. So
this means x1, wherever it is, has to be
at the left of x bar. X2 wherever it is
has to be to the right of x bar. What
we're trying to prove is that they're wedged
in between x bar minus delta and x bar
plus delta. And the reason why that's
true is because their x coordinates
also differ by at most delta. Okay, so what
you should imagine is. You can imagine x1
and x2 are sort of people tied by a
rope at the waist. And this rope has
length delta. So wherever x1 and x2
move, they're at most delta apart.
Furthermore x1, we just observed, can't
move any farther to the right than Xbar.
So even if X1 moves as far to the right as
it can, all the way to Xbar, X2, since
it's at most delta away, tied by the
waist, can't extend beyond X bar+ delta. By
the same reasoning, X2 can't move any
further to the left than Xbar, X1 being
tied to the waist to X2, can never drift
further to the left than Xbar minus delta.
So that's the proof that X1 and X2 both
lie within this region, and that defines
the vertical strip. So that's part A. If you
have any split pair whose distance between
them is less than delta, they both have to
wind up, in this vertical strip. And
therefore wind up in the filtered set, the
proof set, S sub Y. So that's part A of
the claim. Let's now move to Part B.
Recall what part B asserts. It says that
the points p and q, this split pair that
are distance only delta apart. Not only do
they wind up in this sort of filtered set
SY, but in fact, they are almost adjacent
in SY, in the sense that the indices in
the array differ by, at most, seven
positions. And this is a part of the claim
that is a little bit shocking. Really what
this says is that we're getting away with
more or less a variant of our one
dimensional algorithm. Remember when we
wanted to find the closest pair of points
on the line, all we had to do was sort
them by their single coordinate and then
look at consecutive pairs and return the
best of those consecutive pairs. Here what
we're saying is really, once we do a
suitable filtering focus on points in this
vertical strip, then we just go through
the points according to their Y
coordinate. And okay, we don't just look
at adjacent pairs. We look at pairs within
seven positions, but still we basically do
a linear sweep through the points in SY.
According to their Y coordinate and that's
sufficient to identify the closest split
pair. So why on earth will this be true.
So our workhorse in this argument will be
a picture which I am going to draw on
next. So I'm going to draw eight boxes,
which have a height and width delta over
two. So here, delta is the same parameter
that gets passed to the closest split
pair subroutine. And it's also the same
delta which we're assuming p and q are
closer to each other than, right? So
that's, remember, that's one of our
hypotheses in this claim. The distance
between p and q is strictly less than
delta. So we're gonna draw eight delta
over two boxes. And they're gonna be
centered at x bar. So, this same center of
the vertical strip that defines S Y. And
the bottom is going to be the smaller of the
Y-coordinates of the points p and q.
So it might be p, it might be q. It
doesn't really matter. But the bottom
is going to be the smaller of the two. So
the picture then looks as follows. So the
center of these collections of eight
boxes, X bar, the bottom is the minimum of
Y1, Y2. We're gonna have two rows and four
columns. And needless to say, we're
drawing this picture just for the sake of
this correctness proof, right? This
picture is just a thought experiment in
our head. We're just trying to understand
why the algorithm works. The algorithm, of
course, does not draw these boxes. The
subroutine, the, closest split pair
subroutine is just that pseudo code we saw
in the previous video. This is just to
reason about the behavior of that
subroutine. Now looking ahead, I'll make
this precise in two lemmas that are
about to come up, what's going to be true
is the following. So, either p or q is on
this bottom line, right? So we define the
bottom to be the lower Y coordinate of the
two. So maybe, for example, q is the one
that has the smaller Y coordinate, in
which case, is gonna be, somewhere, say,
down here. P, you remember, is from the
left half of the point sets. So p is maybe
gonna be here or something. And we're
gonna argue that both p and q have to be
in these boxes. Moreover, we're gonna
argue that these boxes are sparsely
populated. Every one contains either zero
or one point of the array S sub Y. So,
what we're gonna see is that there's at
most eight points in this picture, two of
which are p and q, and therefore, if you
look at these points sorted by Y
coordinate, it has to be that they're
within seven of each other, the difference
of indices is no more than seven.
So, we're gonna make those two statements
precise one at a time by the following two
lemmas. Let's start with lemma one. Lemma
one is the easy one. And it states that
all of the points of S sub Y, which show up
in between the Y coordinates of the points
we care about p and q have to appear in
this picture, they have to lie in one of
these eight boxes. So we're going to argue
this in two steps. First, we're going to
argue that all such points have to have Y
coordinates within the relevant range of
this picture between the minimum of Y1 and
Y2 and delta more than that, and secondly
that they have to have X coordinates in
the range of this picture, namely between X
bar minus delta and X bar plus delta. So
let's start with Y coordinates. So again,
remember this key hypothesis we have,
okay. We're dealing with a split pair p-q
that are close to each other. The distance
between X and Y is strictly less than
delta. So the very first thing we did at
the beginning of this proof is we said
well, if their Euclidean distance is less
than delta then they have to differ by at
most delta in both of their coordinates,
in particular in their Y coordinate.
Now remember whichever is lower of p and
q, whichever one has a smaller y
coordinate is precisely at the bottom of
this diagram. For example, if q is the one
with the smaller y coordinate, it might be
on the black line right here. So that
means in particular x has y coordinate no
more than the top part of this diagram. No
more than delta bigger than q. And of
course all points with Y coordinates in
between them are equally well wedged into
this picture. So that's why all points of
SY with a Y coordinate between those of p
and q have to be in the range of this
picture, between the minimum of the two Y
coordinates and delta more than that. Now
what about horizontally? What about the X
coordinates? Well this just follows from
the definition of S sub Y. So remember,
S sub Y are the points that fall into this
vertical strip. How did we define the
vertical strip? Well it had center Xbar,
and then we fattened it by delta on both
sides. So just by definition, if you're an
SY, you've gotta have X coordinates in the
range of this picture. X delta plus minus,
sorry, xbar plus minus delta. So that
completes the proof of the lemma. So this is
not. This is just a lemma. So I'll use a
lower case qed. Remember this is
just a step toward proving the overall
correctness claim. But this is a good
step. And again, the way you think about
this is it says we draw this boxes. We
know that either p or q is at the bottom.
The other one is going to be on the other
side of the black line x bar but will be
in some other box so perhaps maybe p is
here and the lemma is saying that all the
relevant points of SY have to be somewhere
in this picture. Now remember in our
double for loop we only search seven
positions away, so the concern is that
this is a sorta super highly populated
collection of eight boxes. That's the
concern, but that's not going to be the
case and that's exactly what lemma two is
going to say. Not only do the points
between p and q in Y coordinates show up
in this diagram, but there have to be very
few. In particular, every box has to be
sparse, with population either zero or one.
So, let's move on to lemma two. So formally
the claim is [sound], we have at most one
point of the point set in each of these
eight boxes. And this is finally where we
use, in a real way, the definition of
delta. This is where we finally get the
payoff from our realization long ago, that
when defining the closest split pair
subroutine, we only really need to be
correct in the unlucky case. In the case
we're not handed the right answer by one
of our recursive calls. We're finally
gonna use that fact in a fundamental way.
So we're gonna proceed by contradiction.
So we're going to think about what happens
if there are two points in a single box
and from that we'll be able to derive a
contradiction. So, call the points that
wind up in the same box A and B. So, to
the contrary, suppose A and B lie in the
same box. So, maybe this is A here, and
this is B here, at antipodal corners of
this particular box. So from this
supposition, we have two consequences.
First of all. I claim that A and B lie on
the same side of the point set. They're
either both in the left side, Q or they're
both in the right side, R. So why is this
true? Well it's because every box lies either
entirely on the left half of the point set
or on the right half of the point
set. Recall how we define x bar. X bar is
the x coordinate of the right most point
amongst the left half of the point set
capital Q. So therefore points with x
coordinate at most x bar have to lie
inside the left half Q. Points with x
coordinates at least x bar have to
lie inside the right half of the point
set capital R. So that would be like in
this example. A and b both lie in a box
which is to the right of x bar. So they
both have to come in the right half
of the point set capital R. This is one
place we are using that there are no ties
in X coordinate, so if there's a point
with X, X coordinate or X bar, we can
count it as part of the left half. So
every box, by virtue of being either to
the left of xbar or to the right of xbar,
can only contain points from a common half
of the point set. So that's the first
consequence of assuming that you have two
points in the same box. The second
consequence is, because the boxes are
small, the points gotta be close. So, if A
and B co-habitate a box, how far could
they be from each other? Well, the
farthest they could be is like I've drawn
in the picture, with the points A and B.
Where they're at opposite corners of a
common box. And then you bust out
Pythagorean's Theorem, and what do you
get? You get that the distance between
them is delta over two, the side of the
box times root two. And what's relevant
for us is this is strictly less than
delta. Okay? But, now, here is where we
use, finally, the definition of delta.
Consequences one and two in tandem,
contradict how we define delta. Remember
what delta is. It's as close as two pair
of, a pair of points can get if they both
lie on the left side of the point set, or
if they both lie on the right side of the
point set. That is how we defined it. As
small as a pair of points on a common half
can get to each other. But what have we
just done? We've exhibited a pair A and B
that lie on the same half of the point
set, and are strictly closer than delta.
So that contradicts the definition of
delta. So that completes the proof of lemma
two. Let me just make sure we're all clear
on why having proved lemma one and lemma two
we're done with the proof part B of the
claim and therefore the entire claim
because we already proved
part one, now a long time ago.
So let's interpret the 2 lemmas in
the context of our picture that we had all
throughout. In terms of the eight boxes of
side length delta over two by
delta over two. So again, whichever is the
lower of p and q, and again let's just for
the sake of concreteness say it's q, is at
the bottom of the picture. The other point
is on the other half of the line Xbar, and
is in one of the other boxes. So, for
example, maybe p is right here. So lemma
one says that every relevant point, every
point that survives the filtering and
makes it into SY, by virtue of being in
the vertical strip, has to be in one of
those boxes, okay? If it has Y coordinate
in between p and q. Lemma two says that
you can only have one point in each of these boxes
from the point set, so that's gonna be at
most eight total. [sound] So combining
them. Lemmas one and two imply, that there
are almost eight points in this picture
and that includes p and q because they
also occupy two of eight boxes. So in the
worst case, if this is as densely
populated as could possibly be, given
lemmas one and two, every other box might
have a point and perhaps every one of
those points has a Y coordinate between p
and q. But this is as bad as it gets.
Any point of the strip with Y
coordinate between p and q occupies a box.
So, at most, there are these six wedged in
between them. What does this mean? This
means if from q you look seven
positions ahead in the array, you are
guaranteed to find this point p. So a
split pair with distance less than delta
is guaranteed to be identified by our
double for loop. Looking seven positions
ahead in the sorted array SY is sufficient
to identify, to look at every conceivably
interesting split pair. So that completes
the assertion B of the correctness
claim and we're done. That
establishes that this supremely clever
divide and conquer algorithm is indeed
a correct O(nlog(n)) algorithm that computes the
closest pair of a set of n points in the
plane.