In this video I'd like
to tell you about the principal
component analysis algorithm.
By the end of this
video you'll know how to implement PCA for yourself
and use it to reduce the dimension of your data.
Before applying PCA, there is
a data pre-processing step which you should always do.
Given the training set of
examples, it is important to
always perform mean normalization,
and then depending on your data,
maybe perform feature scaling as well.
This is very similar to the
mean normalization and feature scaling
process that we have for supervised learning.
In fact it's exactly the
same procedure except that we're
doing it now to our unlabeled
data, x(1) through x(m).
So for mean normalization we
first compute the mean of
each feature and then
we replace each feature, x,
with x minus its mean,
and so this makes each feature
now have exactly zero mean.
If the different features have very different scales,
so for example, if x1
is the size of a house, and
x2 is the number of bedrooms, to
use our earlier example, we
then also scale each feature
to have a comparable range of values.
And so, similar to what
we had with supervised learning,
we would take x(i) subscript
j, that's the j-th feature of example i,
and so we would
subtract off the mean of feature j,
that's what we have on top, and then divide by sj.
Here, sj is some measure of the range of values of feature j.
So, it could be the max minus
min value, or more commonly,
it is the standard deviation of
feature j. Having done
this sort of data pre-processing, here's what the PCA algorithm does.
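The pre-processing step described above can be sketched as follows. This is a minimal illustration in Python with NumPy rather than Octave (an assumption made for these examples); the toy data and the names `X`, `mu`, `s`, and `X_norm` are hypothetical, and sj is taken to be the standard deviation.

```python
import numpy as np

# Hypothetical toy data: m = 4 examples (rows), n = 2 features (columns):
# house size and number of bedrooms.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)        # mean of each feature
s = X.std(axis=0)          # s_j: here, the standard deviation of feature j
X_norm = (X - mu) / s      # mean normalization, then feature scaling

print(X_norm.mean(axis=0)) # each entry is approximately 0
print(X_norm.std(axis=0))  # each entry is approximately 1
```

If the features already have comparable ranges, the division by `s` could be dropped and only the mean subtracted.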
We saw from the previous video
that what PCA does is, it
tries to find a lower
dimensional sub-space onto which to
project the data, so as
to minimize the squared projection
errors, that is, the sum of the
squared projection errors, the
squares of the lengths of
those blue lines. And so
what we wanted to do specifically
is find a vector, u1, which
specifies that direction, or
in the 2D case we want
to find two vectors, u1 and
u2, to define this surface
onto which to project the data.
So, just as a
quick reminder of what reducing
the dimension of the data means,
for this example on the
left we were given
the examples x(i), which are in R2.
And what we'd
like to do is find
a set of numbers z(i) in
R to represent our data.
So that's what reduction from 2D to 1D means.
So specifically, by projecting
the data onto this red line there,
we need only one number to
specify the position of the points on the line.
So I'm going to call that number
z or z1.
z here is a real number, so that's like a one dimensional vector.
So z1 just refers to
the first component of this,
you know, one by one matrix, or this one dimensional vector.
And so we need only
one number to specify the position of a point.
So if this example
here was my example
x(1), then maybe that gets mapped here.
And if this example was x(2),
maybe that example gets mapped here.
And so this point
here will be z(1)
and this point here will be
z(2), and similarly we
would have those other points
for these, maybe x(3),
x(4), x(5) get mapped to z(3), z(4), z(5).
So what PCA has
to do is
come up with a way to compute two things.
One is to compute these vectors,
u1, or in this case u1 and u2.
And the other is
how do we compute these numbers,
z. So in the
example on the left we're reducing the data from 2D to 1D.
In the example on the right,
we would be reducing data from
3 dimensional, as in x(i) in
R3, to z(i), which is now two dimensional.
So these z vectors would now be two dimensional.
So it would be z1,
z2 like so, and so
we need to give a way to compute
these new representations, the z1
and z2 of the data as well.
So how do you compute all of these quantities?
It turns out that there is a mathematical
derivation, also the mathematical
proof, for what is
the right value of u1, u2, z1,
z2, and so on.
That mathematical proof is very
complicated and beyond the
scope of the course.
But once you've done it, it
turns out that the procedure
to actually find the value
of u1 that you want
is not that hard, even though
the mathematical proof that
this value is the correct
value is somewhat more
involved and more than I want to get into.
But let me just describe the
specific procedure that you
have to implement in order
to compute all of these
things, the vectors u1, u2,
the vector z. Here's the procedure.
Let's say we want to reduce
the data from n dimensions
to k dimensions. What we're
going to do is first
compute something called the
covariance matrix, and the covariance
matrix is commonly denoted by
this Greek letter, which is
the capital Greek letter Sigma.
It's a bit unfortunate that the
Greek letter Sigma looks exactly
like the summation symbol.
So this
Greek letter Sigma is used
to denote a matrix, and this here is a summation symbol.
So hopefully in these slides
there won't be ambiguity about which
is Sigma the
matrix and which is a
summation symbol, and hopefully
it will be clear from context when
I'm using each one.
How do you compute this matrix?
Sigma is 1 over m times the sum, over the m examples,
of x(i) times x(i) transpose.
Let's say we
store it in an Octave
variable called Sigma.
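As a rough sketch, the covariance matrix can be built as the average of the outer products x(i) times x(i) transpose. The example below uses Python with NumPy rather than Octave (an assumption for illustration), and the random data and variable names are hypothetical.

```python
import numpy as np

# Hypothetical mean-normalized data: m = 5 examples (rows), n = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X = X - X.mean(axis=0)

m, n = X.shape
Sigma = np.zeros((n, n))
for i in range(m):
    x_i = X[i].reshape(n, 1)   # x(i) as an n-by-1 column vector
    Sigma += x_i @ x_i.T       # (n-by-1) times (1-by-n) gives n-by-n
Sigma /= m                     # average over the m examples

print(Sigma.shape)             # (3, 3)
```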
What we need to do next is
compute something called the
eigenvectors of the matrix Sigma.
And in Octave, the way you
do that is you use this
command, [U, S, V] =
svd(Sigma).
SVD, by the way, stands for singular value decomposition.
The singular value decomposition
is much more advanced linear
algebra than you actually need
to know. It turns out
that when Sigma is a
covariance matrix there are
a few ways to compute these
eigenvectors, and if you
are an expert in linear algebra
and you've heard of
eigenvectors before, you may know
that there is another Octave function
called eig, which can
also be used to compute the same thing.
It turns out that the
svd function and the
eig function will give
you the same vectors, although svd
is a little more numerically stable.
So I tend to use svd, although
I have a few friends that use
the eig function to do
this as well. When you
apply either one to a covariance matrix
Sigma, it gives you the same thing.
This is because the covariance matrix
always satisfies a mathematical
property called symmetric positive semidefinite.
You really don't need to know
what that means, but the svd
and eig functions are different functions, and
when they are applied to a
covariance matrix, which can
be proved to always satisfy this
mathematical property, they'll always give you the same thing.
Okay, that was probably much more linear algebra than you needed to know.
In case none of that made sense, don't worry about it.
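For the curious, the claim that the two routines agree on a covariance matrix can be checked numerically. This sketch uses NumPy's `np.linalg.svd` and `np.linalg.eigh` as stand-ins for Octave's svd and eig (an assumption, since the lecture uses Octave); the data is made up. Two caveats: `eigh` returns eigenvalues in ascending order, so they are reordered here, and each vector is only determined up to its sign.

```python
import numpy as np

# Hypothetical data with clearly different spread along each axis.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)
Sigma = (X.T @ X) / X.shape[0]

U, S, Vt = np.linalg.svd(Sigma)           # NumPy returns V transposed
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order

# Reorder the eigen-decomposition to descending order, like the SVD.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

same_values = np.allclose(S, eigvals)
# Matching columns may differ by a factor of -1, so compare absolute values
# of the inner products between the two sets of vectors.
same_vectors = np.allclose(np.abs(U.T @ eigvecs), np.eye(3), atol=1e-6)
print(same_values, same_vectors)
```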
All you need to know is that
this is the command you
should implement in Octave.
And if you're implementing this in a
different language than Octave or MATLAB,
what you should do is find
a numerical linear algebra library
that can compute the SVD,
or singular value decomposition, and
there are many such libraries for
probably all of the major programming languages.
You can use that to
compute the matrices U,
S, and V of the covariance matrix Sigma.
So just to fill
in some more details, this covariance
matrix Sigma will be
an n by n matrix.
And one way to see that
is if you look at the definition:
x(i) is an n by 1
vector, and this
here, x(i) transpose, is
1 by n, so the
product of these two things
is going to be an n
by n matrix.
n by 1 times 1 by n, so
that's an n by n matrix, and when
we add up all of these you still
have an n by n matrix.
And what the SVD outputs is three
matrices, U, S, and
V. The thing you really need out of the SVD is the U matrix.
The U matrix will also be an n by n matrix.
And if we look at the
columns of the U
matrix, it turns
out that the columns
of the U matrix will be
exactly those vectors, u1,
u2 and so on.
So U is a matrix whose columns are these vectors.
And if we want to reduce
the data from n dimensions
down to k dimensions, then what
we need to do is take the first k vectors.
That gives us u1 up
to uk, which gives
us the k directions onto which
we want to project the data.
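To make the shapes concrete, here is a small sketch in Python with NumPy (an assumption; the lecture itself uses Octave). The data, the choice k = 2, and the name `U_first_k` are hypothetical. Note that NumPy returns the diagonal of S as a vector and returns V transposed.

```python
import numpy as np

# Hypothetical mean-normalized data: m = 20 examples, n = 4 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
X = X - X.mean(axis=0)
Sigma = (X.T @ X) / X.shape[0]   # n-by-n covariance matrix

U, S, Vt = np.linalg.svd(Sigma)
print(U.shape)                   # (4, 4): one column per direction u1..un
print(S.shape)                   # (4,): the diagonal of S, largest value first

k = 2
U_first_k = U[:, :k]             # the first k columns, u1..uk
print(U_first_k.shape)           # (4, 2)
```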
So, just to wrap up the
description of the rest of
the procedure, from the SVD
numerical linear algebra routine we
get these matrices U, S,
and V. We'll call the columns
of U u1 through un, and we're going
to use the first k columns
of this matrix to get u1 through uk.
Now the other thing we need
to do is take my original
data set, x, which is
in Rn, and find a
lower dimensional representation z, which
is in Rk, for this data.
So the way we're
going to do that is
take the first k columns of the U matrix
and construct this matrix:
stack up u1, u2 and
so on up to uk in columns.
It's really basically taking, you know,
this part of the matrix, the
first k columns of this matrix.
And so this is
going to be an n
by k matrix.
I'm going to give this matrix a name.
I'm going to call this matrix
U, subscript "reduce," sort
of a reduced version of the U matrix maybe.
I'm going to use it to reduce the dimension of my data.
And the way I'm going to compute z is I'm going
to let z be equal to this
U reduce matrix transpose times
x. Or alternatively, you know,
to write down what this transpose means:
when I take the transpose of
this U reduce matrix, what I'm
going to end up with is these vectors now in rows.
I have u1 transpose down to uk transpose.
Then take that times x,
and that's how I get
my vector z. Just to
make sure that these dimensions make sense,
this matrix here is going
to be k by n
and x here is going
to be n by 1,
and so the product
here will be k by 1.
And so z is a k
dimensional vector, which is exactly
what we wanted.
And of course these x's here can
be examples in our
training set, can be examples
in our cross validation set, can be
examples in our test set, and
for example if, you know,
I wanted to take training example i,
I can write this as
x(i), and that's what will
give me z(i) over there.
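The projection step, z equals U reduce transpose times x, can be sketched like this in Python with NumPy (an assumption; the lecture uses Octave). The data and the names `U_reduce`, `z`, and `Z` are illustrative. The last line applies the same mapping to every example at once, which relies on the examples being stored as rows of `X`.

```python
import numpy as np

# Hypothetical mean-normalized data: m = 10 examples, n = 3 features.
rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3))
X = X - X.mean(axis=0)
Sigma = (X.T @ X) / X.shape[0]
U, S, Vt = np.linalg.svd(Sigma)

k = 2
U_reduce = U[:, :k]              # n-by-k

x = X[0].reshape(-1, 1)          # one example as an n-by-1 column vector
z = U_reduce.T @ x               # (k-by-n) times (n-by-1) gives k-by-1
print(z.shape)                   # (2, 1)

Z = X @ U_reduce                 # all examples at once: row i is z(i) transposed
print(Z.shape)                   # (10, 2)
```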
So, to summarize, here's the
PCA algorithm on one slide.
After mean normalization, to ensure
that every feature is zero mean,
and optional feature scaling, which you
really should do if
your features take on very different ranges of values,
after this pre-processing we compute
the covariance matrix Sigma like
so. By the
way, if your data is
given as a matrix
like this, if you have your
data given in rows like this,
if you have a matrix X
which is your training set
written in rows, with x(1)
transpose down to x(m) transpose,
this covariance matrix Sigma actually has
a nice vectorized implementation.
In Octave,
you can run Sigma = (1/m) * X' * X,
that is, 1 over m, times X,
which is this matrix up here,
transpose, times X, and
this simple expression is
the vectorized implementation of how
to compute the matrix Sigma.
I'm not going to prove that today.
This is the correct vectorization; if you
want, you can either numerically test
this yourself by trying it out in
Octave and making sure that
both this and this implementation
give the same answers, or you can try to prove it yourself mathematically.
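One way to run that numerical test, sketched in Python with NumPy instead of Octave (an assumption), is to compute the matrix both ways on some made-up data and compare. The names `Sigma_loop` and `Sigma_vec` are hypothetical.

```python
import numpy as np

# Hypothetical mean-normalized data with examples stored as rows of X.
rng = np.random.default_rng(4)
X = rng.normal(size=(8, 3))
X = X - X.mean(axis=0)
m = X.shape[0]

# Definition: average of the outer products x(i) x(i)^T.
Sigma_loop = sum(np.outer(x_i, x_i) for x_i in X) / m

# Vectorized form: (1/m) * X^T * X.
Sigma_vec = (X.T @ X) / m

print(np.allclose(Sigma_loop, Sigma_vec))  # True
```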
Either way, this is the
correct vectorized implementation. Next
we can apply the svd
routine to get U, S,
and V. And then we
grab the first k
columns of the U
matrix to get U reduce, and
finally this defines how
we go from a feature
vector x to this
reduced dimension representation z. And
similar to k-means,
if you apply PCA, the way
you'd apply this is with vectors x in Rn.
So, this is not done with the x0 equals 1 convention.
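Putting the steps of the summary together, a compact sketch might look like the following. It is written in Python with NumPy rather than Octave (an assumption), the function name `pca` and its return values are hypothetical, and the optional feature scaling step is left out for brevity. No x0 = 1 column is added to the data.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (m-by-n) onto the first k principal directions.

    Returns the reduced data Z (m-by-k), U_reduce (n-by-k), and the
    feature means mu that were subtracted.
    """
    mu = X.mean(axis=0)
    X_norm = X - mu                      # mean normalization
    m = X_norm.shape[0]
    Sigma = (X_norm.T @ X_norm) / m      # covariance matrix, n-by-n
    U, S, Vt = np.linalg.svd(Sigma)      # columns of U are u1..un
    U_reduce = U[:, :k]                  # first k columns
    Z = X_norm @ U_reduce                # row i is z(i) transposed
    return Z, U_reduce, mu

# Hypothetical 2D data lying close to the line x2 = 2 * x1.
rng = np.random.default_rng(5)
t = rng.normal(size=100)
X = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=100)])

Z, U_reduce, mu = pca(X, k=1)
print(Z.shape)          # (100, 1)
print(U_reduce[:, 0])   # roughly parallel to (1, 2), up to sign
```

To map a new example, such as one from a cross validation or test set, the same `mu` and `U_reduce` would be reused: subtract `mu`, then multiply by `U_reduce`.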
So, that was the PCA algorithm.
One thing I didn't do was
give a mathematical proof that
the u1 and u2 and so
on, and the z and so
on, that you get out of this
procedure are really the
choices that would minimize
the squared projection error.
Right, remember we said what
PCA tries to do is
find a surface or line
onto which to project the data
so as to minimize the squared projection error.
So I didn't prove
that, and the mathematical proof
of that is beyond the scope of this course.
But fortunately the PCA algorithm can
be implemented in not too many lines of Octave code.
And if you implement this algorithm,
it will work well,
and you get a very effective dimensionality reduction algorithm
that does do the right thing
of minimizing this squared projection error.
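Although the proof is out of scope, the claim can at least be spot-checked numerically. The sketch below, in Python with NumPy (an assumption; the lecture uses Octave), compares the average squared projection error for the direction u1 against a sweep of other directions in 2D. The data and the helper `projection_error` are hypothetical, and a check on one data set is an illustration, not a proof.

```python
import numpy as np

# Hypothetical correlated 2D data, mean-normalized.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
X = X - X.mean(axis=0)

def projection_error(X, u):
    """Average squared distance from each row of X to its projection onto u."""
    u = u / np.linalg.norm(u)
    X_approx = np.outer(X @ u, u)        # projection of every example onto u
    return np.mean(np.sum((X - X_approx) ** 2, axis=1))

Sigma = (X.T @ X) / X.shape[0]
U, S, Vt = np.linalg.svd(Sigma)
pca_error = projection_error(X, U[:, 0])

# Try 180 other directions, one per degree over half a turn.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
other_errors = [projection_error(X, np.array([np.cos(a), np.sin(a)]))
                for a in angles]

print(pca_error <= min(other_errors) + 1e-12)  # True
```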