For the problem of dimensionality reduction, by far the most popular and most commonly used algorithm is something called principal components analysis, or PCA. In this video, I would like to start talking about the problem formulation for PCA; in other words, let's try to formulate precisely what we would like PCA to do.

Let's say we have a dataset like this, so this is a dataset of examples x in R2, and let's say I want to reduce the dimension of the data from two dimensions to one dimension. In other words, I would like to find a line onto which to project the data. So what seems like a good line onto which to project the data? A line like this might be a pretty good choice. And the reason you might think this is a good choice is that if you look at where the projected versions of the points go, I'm going to take this point and project it down here, this point gets projected here, and so on for the other points, what we find is that the distance between each point and its projected version is pretty small. That is, these blue line segments are pretty short. So what PCA does, formally, is it tries to find a lower-dimensional surface, really a line in this case, onto which to project the data, so that the sum of squares of these little blue line segments is minimized. The length of those blue line segments is sometimes also called the projection error, and so what PCA does is it tries to find the surface onto which to project the data so as to minimize that.

As an aside, before applying PCA it's standard practice to first perform mean normalization and feature scaling, so that the features x1 and x2 have zero mean and comparable ranges of values. I've already done this for this example, and I'll come back and talk more about feature scaling and mean normalization in the context of PCA later.

Coming back to this example, in contrast to the red line that I just drew, here's a different line onto which I could project my data: this magenta line. And as you can see, this magenta line is a much worse direction onto which to project my data, right? If I were to project my data onto the magenta line, I'd get a set of projected points like that, and the projection errors, that is, these blue line segments, would be huge. These points have to move a huge distance in order to get projected onto the magenta line. And so that's why PCA, principal components analysis, would choose something like the red line rather than the magenta line down here.

Let's write out the PCA problem a little more formally. The goal of PCA, if we want to reduce data from two dimensions to one dimension, is to find a vector, that is a vector u1, which is going to be in Rn, so that would be in R2 in this case; we're going to find a direction onto which to project the data so as to minimize the projection error. So in this example I'm hoping that PCA will find this vector, which I'm going to call u1, so that when I project the data onto the line defined by extending out this vector, I end up with pretty small projection errors, and the projected data looks like this. And by the way, I should mention that whether PCA gives me u1 or negative u1, it doesn't matter.
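To make the projection-error idea concrete, here is a minimal sketch in Python/NumPy, which I'm adding as an illustration rather than as code from the lecture; the toy dataset and the two candidate directions are my own assumptions. It mean-normalizes and feature-scales the data, then computes the sum of squared distances between each point and its projection onto a candidate line through the origin.

```python
import numpy as np

# Minimal sketch (illustrative only): mean normalization, feature scaling,
# and the squared projection error for a candidate direction.

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])                   # toy 2D dataset, one example per row

# Mean normalization and feature scaling: zero mean, comparable ranges.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

def squared_projection_error(X, u):
    """Sum of squared distances from each point to its projection onto
    the line through the origin in direction u."""
    u = u / np.linalg.norm(u)                # make u a unit vector
    projections = (X @ u)[:, None] * u       # projected points on the line
    return np.sum((X - projections) ** 2)    # squared lengths of the blue segments

# A direction along the data (like the red line) versus one across it
# (like the magenta line): the first gives a much smaller error.
print(squared_projection_error(X_norm, np.array([1.0, 1.0])))
print(squared_projection_error(X_norm, np.array([1.0, -1.0])))
```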
So if it gives me the vector u1 pointing in this direction, that's fine, and if it gives me the opposite vector, facing in the opposite direction, that would be -u1, drawn in blue instead; whether it gives me positive u1 or negative u1, it doesn't matter, because each of these vectors defines the same red line onto which I'm projecting my data.

So this is the case of reducing data from two dimensions to one dimension. In the more general case, we have n-dimensional data and we want to reduce it to k dimensions. In that case, we want to find not just a single vector onto which to project the data, but k directions onto which to project the data, so as to minimize the projection error. Here's an example. If I have a 3D point cloud like this, then maybe what I want to do is find a pair of vectors, let's draw these in red, extending from the origin: here's u1, and here's my second vector u2. Together these two vectors define a plane, a 2D surface, onto which I'm going to project my data.

For those of you that are familiar with linear algebra, the formal definition is that we're going to find a set of vectors u1, u2, and so on up to uk, and we're going to project the data onto the linear subspace spanned by this set of k vectors. If you are not familiar with linear algebra, just think of it as finding k directions, instead of just one direction, onto which to project the data; that is, finding a k-dimensional surface, really a 2D plane in this case, shown in this figure, where we can define the position of the points in the plane using k directions. That's why for PCA we want to find k vectors onto which to project the data.

And so, more formally, in PCA what we want to do is find the way to project the data so as to minimize the projection distance, which is the distance between the points and their projections. In this 3D example, given a point, we would take the point and project it onto this 2D surface, and the projection error would be the distance between the point and where it gets projected down onto the 2D surface. So what PCA does is it tries to find a line, or a plane, or whatever, onto which to project the data, so as to minimize that squared projection error, that 90-degree, or orthogonal, projection error.
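Here is the same idea for the general case, again a sketch of my own rather than anything from the lecture: given k direction vectors arranged as orthonormal columns, project the n-dimensional points onto the subspace they span and measure the squared orthogonal projection error. The particular directions below (the x1-x2 plane) are just an illustrative assumption.

```python
import numpy as np

# Illustrative sketch: project n-dimensional points onto the subspace spanned
# by k orthonormal direction vectors, and measure the orthogonal projection error.

def project_onto_subspace(X, U):
    """X: (m, n) data matrix; U: (n, k) matrix whose columns are orthonormal
    vectors u1..uk. Returns each point's projection in the original space."""
    return X @ U @ U.T          # U @ U.T projects onto the span of U's columns

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # toy 3D point cloud (roughly zero mean already)

# Two orthonormal directions defining a 2D plane through the origin;
# here simply the x1-x2 plane, purely for illustration.
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

X_proj = project_onto_subspace(X, U)           # points pushed down onto the plane
Z = X @ U                                      # each point's position within the plane (k numbers)
projection_error = np.sum((X - X_proj) ** 2)   # sum of squared orthogonal distances
print(projection_error)
```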
Finally, one question that I sometimes get asked is how PCA relates to linear regression, because when explaining PCA I sometimes end up drawing diagrams like these, and that looks a little bit like linear regression. It turns out PCA is not linear regression, and despite some cosmetic similarity, these are actually totally different algorithms. If we were doing linear regression, as on the left, we would be trying to predict the value of some variable y given some input features x. So in linear regression, what we're doing is fitting a straight line so as to minimize the squared error between each point and the straight line, and what we'd be minimizing would be the squared magnitude of these blue lines. Notice that I'm drawing these blue lines vertically: they are the vertical distances between each point and the value predicted by the hypothesis. In contrast, what PCA does is try to minimize the magnitude of these blue lines, which are drawn at an angle; these are really the shortest, orthogonal distances between the point x and the red line. And this gives very different effects, depending on the dataset.

More generally, when you're doing linear regression there is this distinguished variable y that we're trying to predict; all that linear regression is about is taking all the values of x and trying to use them to predict y. Whereas in PCA, there is no distinguished or special variable y that we're trying to predict; instead we have a list of features x1, x2, and so on up to xn, and all of these features are treated equally, so no one of them is special. As one last example, if I have three-dimensional data and I want to reduce it from 3D to 2D, so maybe I want to find two directions, u1 and u2, onto which to project my data, then what I have is three features, x1, x2, x3, and all of these are treated alike, all of these are treated symmetrically, and there is no special variable y that I'm trying to predict. And so PCA is not linear regression, and even though at some cosmetic level they might look related, these are actually very different algorithms.

So, hopefully you now understand what PCA is doing: it's trying to find a lower-dimensional surface onto which to project the data so as to minimize the squared projection error, to minimize the squared distance between each point and the location where it gets projected. In the next video, we'll start to talk about how to actually find this lower-dimensional surface onto which to project the data.
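Before moving on, here is one more small sketch to make the contrast with linear regression concrete, again my own illustration rather than lecture code: on the same 2D dataset it computes the vertical squared residuals that linear regression minimizes and the orthogonal squared projection error that PCA minimizes. Since how to actually find the PCA direction is the topic of the next video, the direction here is found by a simple brute-force search over angles, just to stay with this video's formulation.

```python
import numpy as np

# Illustrative contrast: vertical residuals (linear regression) versus
# orthogonal projection distances (PCA) on the same 2D dataset.

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=50)
y = 0.8 * x + rng.normal(scale=0.3, size=50)   # y roughly linear in x, plus noise
X = np.column_stack([x, y])
X = X - X.mean(axis=0)                         # mean-normalize

# Linear regression: fit y from x, measure vertical (squared) distances.
slope, intercept = np.polyfit(X[:, 0], X[:, 1], deg=1)
vertical_error = np.sum((X[:, 1] - (slope * X[:, 0] + intercept)) ** 2)

# PCA direction: found here by brute force over candidate angles, i.e. directly
# minimizing the squared orthogonal projection error as formulated in this video.
angles = np.linspace(0.0, np.pi, 1000)
candidates = np.column_stack([np.cos(angles), np.sin(angles)])   # unit vectors
errors = [np.sum((X - (X @ u)[:, None] * u) ** 2) for u in candidates]
u1 = candidates[np.argmin(errors)]
orthogonal_error = min(errors)

print(vertical_error, orthogonal_error)
```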