Most of the supervised learning algorithms we've seen, things like linear regression, logistic regression, and so on, have an optimization objective, or some cost function, that the algorithm is trying to minimize. It turns out that K-means also has an optimization objective, or a cost function, that it is trying to minimize. And in this video, I'd like to tell you what that optimization objective is. The reason I want to do so is that this will be useful to us for two purposes. First, knowing the optimization objective of K-means will help us to debug the learning algorithm and just make sure that K-means is running correctly. Second, and perhaps even more importantly, in a later video we'll talk about how we can use this to help K-means find better clusters and avoid local optima, but we'll do that in a later video that follows this one.

Just as a quick reminder, while K-means is running we're going to be keeping track of two sets of variables. First are the c(i)'s, where c(i) keeps track of the index, or the number, of the cluster to which an example x(i) is currently assigned. The other set of variables is Mu subscript k, which is the location of cluster centroid k. And, again, for K-means we use capital K to denote the total number of clusters, and here lowercase k is going to be an index into the cluster centroids, so lowercase k is going to be a number between 1 and capital K.

Now, here's one more bit of notation: we're going to use Mu subscript c(i) to denote the cluster centroid of the cluster to which example x(i) has been assigned. To explain that notation a little bit more, let's say that x(i) has been assigned to cluster number five. What that means is that c(i), the cluster index of x(i), is equal to 5. Right? Because having c(i) equal 5 is what it means for the example x(i) to be assigned to cluster number 5. And so Mu subscript c(i) is going to be equal to Mu subscript 5, because c(i) is equal to 5. This Mu subscript c(i) is the cluster centroid of cluster number 5, which is the cluster to which my example x(i) has been assigned.

With this notation, we're now ready to write out the optimization objective of the K-means clustering algorithm. And here it is. The cost function that K-means is minimizing is the function J of all of these parameters, c(1) through c(m) and Mu 1 through Mu K, that K-means is varying as the algorithm runs. And the optimization objective, shown on the right, is one over m times the sum from i equals 1 through m of this term here that I've just drawn the red box around: the squared distance between each example x(i) and the location of the cluster centroid to which x(i) has been assigned. So let me just draw this in and explain it in pictures. If here are the axes x1 and x2, and a point here is my example x(i), and x(i) has been assigned to some cluster centroid, which I'll denote with a cross, so if that is the location of, say, Mu 5, and x(i) has been assigned to cluster centroid 5 as in my example up there, then this term is the square of the distance between the point x(i) and the cluster centroid to which x(i) has been assigned.
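To make this concrete, here's a minimal sketch in Python with NumPy of how you might compute this distortion cost. It isn't code from the lecture; the function name and the zero-based cluster indices are my own conventions for illustration.

```python
import numpy as np

def distortion(X, c, Mu):
    """Cost function J(c(1),...,c(m), Mu 1,...,Mu K):
    the average squared distance between each example x(i)
    and the centroid Mu[c[i]] of the cluster it is assigned to.

    X  : (m, n) array of training examples
    c  : (m,)   array of cluster indices, c[i] in {0, ..., K-1}
    Mu : (K, n) array of cluster centroid locations
    """
    diffs = X - Mu[c]                        # x(i) - Mu_c(i) for every i
    return np.mean(np.sum(diffs ** 2, axis=1))
```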
And what K-means can be shown to be doing is trying to find parameters c and Mu so as to minimize this cost function J. This cost function is sometimes also called the distortion cost function, or the distortion of the K-means algorithm.

And, just to provide a little bit more detail, here's the K-means algorithm, exactly as we had it on the earlier slide. The first step of this algorithm is the cluster assignment step, where we assign each point to the closest cluster centroid, and it's possible to show mathematically that what the cluster assignment step is doing is exactly minimizing J with respect to the variables c(1), c(2), and so on up to c(m), while holding the cluster centroids Mu 1 up to Mu K fixed. So what the cluster assignment step does is, it doesn't change the cluster centroids, but it's exactly picking the values of c(1), c(2), up to c(m) that minimize the cost function, or the distortion function, J. It's possible to prove that mathematically, but I won't do so here. It has a pretty intuitive meaning of just, you know, let's assign each point to the cluster centroid that is closest to it, because that's what minimizes the squared distance between the point and the corresponding cluster centroid.

And then the other part, the second step of K-means, this second step over here, was the move centroid step. Once again, I won't prove it, but it can be shown mathematically that what the move centroid step does is choose the values of Mu that minimize J. So it minimizes the cost function J with respect to, where "wrt" is my abbreviation for "with respect to," the locations of the cluster centroids, Mu 1 through Mu K.

So what K-means really is doing is taking its two sets of variables and partitioning them into two halves right here: first the set of c variables, and then the set of Mu variables. What it does is it first minimizes J with respect to the variables c, then minimizes J with respect to the variables Mu, and then it keeps on iterating. And that's all that K-means does.

So we now understand the K-means algorithm as trying to optimize this cost function J, which is also called the distortion function. We can use that to debug K-means, to help us check that K-means is converging and running properly. And in the next video, we'll also see how we can use this to help K-means find better clusters and help K-means avoid local optima.
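To illustrate these two alternating steps, here's a minimal K-means sketch in Python, again assuming NumPy; the initialization scheme, function signature, and zero-based indexing are my own assumptions, not the lecture's code. Printing J after each iteration is exactly the kind of debugging check described above: the distortion should never increase from one iteration to the next.

```python
import numpy as np

def kmeans(X, K, n_iters=10, seed=0):
    """Minimal K-means sketch showing the two alternating steps.
    X: (m, n) array of examples; K: number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at K randomly chosen training examples.
    Mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Cluster assignment step: minimize J with respect to c,
        # holding Mu fixed, by assigning each x(i) to its closest centroid.
        dists = np.sum((X[:, None, :] - Mu[None, :, :]) ** 2, axis=2)
        c = np.argmin(dists, axis=1)
        # Move centroid step: minimize J with respect to Mu, holding c
        # fixed, by moving each centroid to the mean of its assigned points.
        for k in range(K):
            if np.any(c == k):
                Mu[k] = X[c == k].mean(axis=0)
        # Debugging check: the distortion J should never increase.
        J = np.mean(np.sum((X - Mu[c]) ** 2, axis=1))
        print(f"J = {J:.4f}")
    return c, Mu
```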