For the problem of dimensionality reduction, by far the most popular and most commonly used algorithm is something called principal components analysis, or PCA. In this video, I would like to start talking about the problem formulation for PCA; in other words, let's try to formulate precisely what we would like PCA to do.

Let's say we have a dataset like this, so this is a dataset of examples x in R2, and let's say I want to reduce the dimension of the data from two dimensions to one dimension. In other words, I would like to find a line onto which to project the data. So what seems like a good line onto which to project the data? A line like this might be a pretty good choice. And the reason you might think this is a good choice is that if you look at where the projected versions of the points go, I'm going to take this point and project it down here, this point gets projected here, and so on for the other points, what we find is that the distance between each point and its projected version is pretty small. That is, these blue line segments are pretty short. So what PCA does, formally, is it tries to find a lower-dimensional surface, really a line in this case, onto which to project the data, so that the sum of squares of these little blue line segments is minimized. The length of those blue line segments is sometimes also called the projection error, and so what PCA does is it tries to find the surface onto which to project the data so as to minimize that.

As an aside, before applying PCA it's standard practice to first perform mean normalization and feature scaling, so that the features x1 and x2 have zero mean and comparable ranges of values. I've already done this for this example, and I'll come back and talk more about feature scaling and mean normalization in the context of PCA later.

Coming back to this example, in contrast to the red line that I just drew, here's a different line onto which I could project my data: this magenta line. And as you can see, this magenta line is a much worse direction onto which to project my data, right? If I were to project my data onto the magenta line, I'd get a set of projected points like that, and the projection errors, that is, these blue line segments, would be huge. These points have to move a huge distance in order to get projected onto the magenta line. And so that's why PCA, principal components analysis, would choose something like the red line rather than the magenta line down here.

Let's write out the PCA problem a little more formally. The goal of PCA, if we want to reduce data from two dimensions to one dimension, is to find a vector, that is a vector u1, which is going to be in Rn, so that would be in R2 in this case; we're going to find a direction onto which to project the data so as to minimize the projection error. So in this example I'm hoping that PCA will find this vector, which I'm going to call u1, so that when I project the data onto the line defined by extending out this vector, I end up with pretty small projection errors, and the projected data looks like this. And by the way, I should mention that whether PCA gives me u1 or negative u1, it doesn't matter.
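To make the projection-error idea concrete, here is a minimal sketch in Python/NumPy, which I'm adding as an illustration rather than as code from the lecture; the toy dataset and the two candidate directions are my own assumptions. It mean-normalizes and feature-scales the data, then computes the sum of squared distances between each point and its projection onto a candidate line through the origin.

```python
import numpy as np

# Minimal sketch (illustrative only): mean normalization, feature scaling,
# and the squared projection error for a candidate direction.

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])                   # toy 2D dataset, one example per row

# Mean normalization and feature scaling: zero mean, comparable ranges.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

def squared_projection_error(X, u):
    """Sum of squared distances from each point to its projection onto
    the line through the origin in direction u."""
    u = u / np.linalg.norm(u)                # make u a unit vector
    projections = (X @ u)[:, None] * u       # projected points on the line
    return np.sum((X - projections) ** 2)    # squared lengths of the blue segments

# A direction along the data (like the red line) versus one across it
# (like the magenta line): the first gives a much smaller error.
print(squared_projection_error(X_norm, np.array([1.0, 1.0])))
print(squared_projection_error(X_norm, np.array([1.0, -1.0])))
```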
So if it gives me the vector u1 pointing in this direction, that's fine, and if it gives me the opposite vector, facing in the opposite direction, that would be -u1, drawn in blue instead; whether it gives me positive u1 or negative u1, it doesn't matter, because each of these vectors defines the same red line onto which I'm projecting my data.

So this is the case of reducing data from two dimensions to one dimension. In the more general case, we have n-dimensional data and we want to reduce it to k dimensions. In that case, we want to find not just a single vector onto which to project the data, but k directions onto which to project the data, so as to minimize the projection error. Here's an example. If I have a 3D point cloud like this, then maybe what I want to do is find a pair of vectors, let's draw these in red, extending from the origin: here's u1, and here's my second vector u2. Together these two vectors define a plane, a 2D surface, onto which I'm going to project my data.

For those of you that are familiar with linear algebra, the formal definition is that we're going to find a set of vectors u1, u2, and so on up to uk, and we're going to project the data onto the linear subspace spanned by this set of k vectors. If you are not familiar with linear algebra, just think of it as finding k directions, instead of just one direction, onto which to project the data; that is, finding a k-dimensional surface, really a 2D plane in this case, shown in this figure, where we can define the position of the points in the plane using k directions. That's why for PCA we want to find k vectors onto which to project the data.

And so, more formally, in PCA what we want to do is find the way to project the data so as to minimize the projection distance, which is the distance between the points and their projections. In this 3D example, given a point, we would take the point and project it onto this 2D surface, and the projection error would be the distance between the point and where it gets projected down onto the 2D surface. So what PCA does is it tries to find a line, or a plane, or whatever, onto which to project the data, so as to minimize that squared projection error, that 90-degree, or orthogonal, projection error.
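Here is the same idea for the general case, again a sketch of my own rather than anything from the lecture: given k direction vectors arranged as orthonormal columns, project the n-dimensional points onto the subspace they span and measure the squared orthogonal projection error. The particular directions below (the x1-x2 plane) are just an illustrative assumption.

```python
import numpy as np

# Illustrative sketch: project n-dimensional points onto the subspace spanned
# by k orthonormal direction vectors, and measure the orthogonal projection error.

def project_onto_subspace(X, U):
    """X: (m, n) data matrix; U: (n, k) matrix whose columns are orthonormal
    vectors u1..uk. Returns each point's projection in the original space."""
    return X @ U @ U.T          # U @ U.T projects onto the span of U's columns

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # toy 3D point cloud (roughly zero mean already)

# Two orthonormal directions defining a 2D plane through the origin;
# here simply the x1-x2 plane, purely for illustration.
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

X_proj = project_onto_subspace(X, U)           # points pushed down onto the plane
Z = X @ U                                      # each point's position within the plane (k numbers)
projection_error = np.sum((X - X_proj) ** 2)   # sum of squared orthogonal distances
print(projection_error)
```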
Finally, one question that I sometimes get asked is how PCA relates to linear regression, because when explaining PCA I sometimes end up drawing diagrams like these, and that looks a little bit like linear regression. It turns out PCA is not linear regression, and despite some cosmetic similarity, these are actually totally different algorithms. If we were doing linear regression, as on the left, we would be trying to predict the value of some variable y given some input features x. So in linear regression, what we're doing is fitting a straight line so as to minimize the squared error between each point and the straight line, and what we'd be minimizing would be the squared magnitude of these blue lines. Notice that I'm drawing these blue lines vertically: they are the vertical distances between each point and the value predicted by the hypothesis. In contrast, what PCA does is try to minimize the magnitude of these blue lines, which are drawn at an angle; these are really the shortest, orthogonal distances between the point x and the red line. And this gives very different effects, depending on the dataset.

More generally, when you're doing linear regression there is this distinguished variable y that we're trying to predict; all that linear regression is about is taking all the values of x and trying to use them to predict y. Whereas in PCA, there is no distinguished or special variable y that we're trying to predict; instead we have a list of features x1, x2, and so on up to xn, and all of these features are treated equally, so no one of them is special. As one last example, if I have three-dimensional data and I want to reduce it from 3D to 2D, so maybe I want to find two directions, u1 and u2, onto which to project my data, then what I have is three features, x1, x2, x3, and all of these are treated alike, all of these are treated symmetrically, and there is no special variable y that I'm trying to predict. And so PCA is not linear regression, and even though at some cosmetic level they might look related, these are actually very different algorithms.

So, hopefully you now understand what PCA is doing: it's trying to find a lower-dimensional surface onto which to project the data so as to minimize the squared projection error, to minimize the squared distance between each point and the location where it gets projected. In the next video, we'll start to talk about how to actually find this lower-dimensional surface onto which to project the data.
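Before moving on, here is one more small sketch to make the contrast with linear regression concrete, again my own illustration rather than lecture code: on the same 2D dataset it computes the vertical squared residuals that linear regression minimizes and the orthogonal squared projection error that PCA minimizes. Since how to actually find the PCA direction is the topic of the next video, the direction here is found by a simple brute-force search over angles, just to stay with this video's formulation.

```python
import numpy as np

# Illustrative contrast: vertical residuals (linear regression) versus
# orthogonal projection distances (PCA) on the same 2D dataset.

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=50)
y = 0.8 * x + rng.normal(scale=0.3, size=50)   # y roughly linear in x, plus noise
X = np.column_stack([x, y])
X = X - X.mean(axis=0)                         # mean-normalize

# Linear regression: fit y from x, measure vertical (squared) distances.
slope, intercept = np.polyfit(X[:, 0], X[:, 1], deg=1)
vertical_error = np.sum((X[:, 1] - (slope * X[:, 0] + intercept)) ** 2)

# PCA direction: found here by brute force over candidate angles, i.e. directly
# minimizing the squared orthogonal projection error as formulated in this video.
angles = np.linspace(0.0, np.pi, 1000)
candidates = np.column_stack([np.cos(angles), np.sin(angles)])   # unit vectors
errors = [np.sum((X - (X @ u)[:, None] * u) ** 2) for u in candidates]
u1 = candidates[np.argmin(errors)]
orthogonal_error = min(errors)

print(vertical_error, orthogonal_error)
```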