In the PCA algorithm we take n-dimensional features and reduce them to some k-dimensional feature representation. This number k is a parameter of the PCA algorithm; it is also called the number of principal components, or the number of principal components that we've retained. And in this video I'd like to give you some guidelines about how people tend to think about choosing this parameter k for PCA. In order to choose k, that is, to choose the number of principal components, here are a couple of useful concepts. What PCA tries to do is minimize the average squared projection error. So it tries to minimize the average squared distance between each original example x(i) and its projection x_approx(i), which was defined last video, onto that lower dimensional surface. That's the average squared projection error. Also, let me define the total variation in the data to be the average squared length of my training examples x(i). This one says: on average, how far are my training examples from the all-zeros vector, how far on average are my training examples from the origin? When we're trying to choose k, a pretty common rule of thumb is to choose the smallest value of k so that the ratio between these two quantities is less than 0.01. So in other words, we want the average squared projection error, that is, the average distance between x and its projection, divided by the total variation of the data, that is, how much the data varies, to be less than, let's say, 0.01, or to be less than 1%, which is another way of thinking about it. And the way most people think about choosing k is, rather than choosing k directly, to talk about what this ratio is, whether it is 0.01 or some other number. If it is 0.01, another way to say this, to use the language of PCA, is that 99% of the variance is retained. Don't worry about what this phrase really means technically; the phrase "99% of the variance is retained" just means that this quantity on the left is less than 0.01. And so, if you are using PCA and you want to tell someone how many principal components you've retained, it would be more common to say, "I chose k so that 99% of the variance was retained." And that's a useful thing to know: it means that the average squared projection error divided by the total variation was at most 1%. That's an insightful thing to think about, whereas if you tell someone, "I kept 100 principal components" or "k was equal to 100 for thousand-dimensional data," it's a little hard for people to interpret that. So this number 0.01 is what people often use. Another common value is 0.05, which would be 5%, and then you would say that 95% of the variance is retained; other numbers might be 90% of the variance retained, or maybe as low as 85%. 90% would correspond to a ratio of 0.10, that is, 10%. And so a range of values from 90, 95, 99, maybe as low as 85% of the variance retained would be a fairly typical range of values, with 95 to 99 probably the most common range that people use.
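As a concrete illustration, here is a minimal sketch in Python/NumPy of how that ratio might be computed directly from a data matrix X (m examples by n features) and its reconstructions X_approx. The function name and variables are just placeholders for this sketch, not anything defined in the lecture, and X is assumed to be mean-normalized.

```python
import numpy as np

def variance_retained(X, X_approx):
    """Fraction of variance retained by the projection that produced X_approx.

    X        : (m, n) matrix of original examples (assumed mean-normalized)
    X_approx : (m, n) matrix of reconstructions from the k-dimensional z's
    """
    # Average squared projection error: mean of ||x(i) - x_approx(i)||^2
    proj_error = np.mean(np.sum((X - X_approx) ** 2, axis=1))
    # Total variation in the data: mean of ||x(i)||^2
    total_variation = np.mean(np.sum(X ** 2, axis=1))
    return 1.0 - proj_error / total_variation

# Rule of thumb from the lecture: accept k if variance_retained(X, X_approx) >= 0.99
```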
For many data sets you'd be surprised: in order to retain 99% of the variance, you can often reduce the dimension of the data significantly and still retain most of the variance. For most real-life data sets many features are highly correlated, and so it turns out to be possible to compress the data a lot and still retain 99% of the variance, or 95% of the variance. So how do you implement this? Well, here's one algorithm that you might use. If you want to choose the value of k, you might start off with k equals 1. Then you run PCA: you compute z(1), z(2), up to z(m), compute all of the x_approx(1) up to x_approx(m), and then check whether 99% of the variance is retained. If it is, then we're good and we use k equals 1. But if it isn't, then we'll next try k equals 2, run through this entire procedure again, and check whether this expression is satisfied, whether it is less than 0.01. And if not, then we try k equals 3, then k equals 4, and so on, until maybe we get up to k equals 17 and we find that 99% of the variance is retained, and then we use k equals 17. That is one way to choose the smallest value of k so that 99% of the variance is retained. But as you can imagine, this procedure is horribly inefficient: we're trying k equals 1, k equals 2, and redoing all of these calculations each time. Fortunately, when you implement PCA, the SVD step actually gives us a quantity that makes it much easier to compute these things. Specifically, when you're calling svd on the covariance matrix Sigma to get the matrices U, S, and V, the matrix S it gives back is a square matrix, an n-by-n matrix in fact, that is diagonal. Its diagonal entries S11, S22, S33, down to Snn are the only non-zero elements of this matrix, and everything off the diagonal is zero. Those big O's that I'm drawing are my way of saying that everything off the diagonal of this matrix, all of those entries, are going to be zeros. And so, what is possible to show, and I won't prove this here, is that for a given value of k, this quantity over here can be computed much more simply: it is 1 minus (the sum from i = 1 to k of Sii) divided by (the sum from i = 1 to n of Sii). To say that in words, or to take another view of how to explain it: if k equals 3, let's say, then to compute the numerator we sum from i = 1 to 3 of Sii, so we just compute the sum of the first three diagonal elements. That's the numerator. And then for the denominator, that's the sum of all of the diagonal entries. One minus the ratio of those gives me this quantity over here, the one I've circled in blue. And so, what we can do is just test whether this is less than or equal to 0.01. Or, equivalently, we can test whether (the sum from i = 1 to k of Sii) divided by (the sum from i = 1 to n of Sii) is greater than or equal to 0.99, if you want to be sure that 99% of the variance is retained. And so what you can do is just slowly increase k, set k equals 1, set k equals 2, set k equals 3, and so on, and test this quantity to see what is the smallest value of k that ensures that 99% of the variance is retained.
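Here is a minimal sketch of that more efficient procedure in Python/NumPy, under the assumption that X is an (m, n) matrix of mean-normalized training examples. The function name and the threshold parameter are my own choices for this sketch; only the NumPy calls are real library functions.

```python
import numpy as np

def choose_k(X, variance_to_retain=0.99):
    """Pick the smallest k so that the given fraction of variance is retained.

    X is assumed to be an (m, n) matrix of mean-normalized training examples.
    Returns k and U_reduce, the first k columns of U.
    """
    m, n = X.shape
    Sigma = (X.T @ X) / m                 # n-by-n covariance matrix
    U, s, Vt = np.linalg.svd(Sigma)       # s holds the diagonal entries S11..Snn
    retained = np.cumsum(s) / np.sum(s)   # variance retained for k = 1, 2, ..., n
    k = int(np.searchsorted(retained, variance_to_retain)) + 1
    k = min(k, n)                         # guard against floating-point round-off
    return k, U[:, :k]

# Usage sketch: k, U_reduce = choose_k(X); Z = X @ U_reduce gives the k-dim representation.
```

The point of the design is that SVD is called only once; the cumulative sum over the diagonal entries of S gives the retained-variance ratio for every k at the same time, so the smallest acceptable k can be read off directly.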
And if you do this, then you need to call the SVD function only once, because that gives you the S matrix, and once you have the S matrix you can keep redoing this calculation by increasing the value of k in the numerator; you don't need to keep calling SVD over and over again to test out different values of k. So this procedure is much more efficient, and it allows you to select the value of k without needing to run PCA from scratch over and over. You just run SVD once, that gives you all of the diagonal numbers S11, S22, down to Snn, and then you just vary k in this expression to find the smallest value of k so that 99% of the variance is retained. So to summarize, the way that I often choose k when I am using PCA for compression is: I call SVD once on the covariance matrix, and then I use this formula and pick the smallest value of k for which the expression is satisfied. And by the way, even if you were to pick some value of k manually, say you have thousand-dimensional data and you just want to choose k equals 100, then if you want to explain to others what you did, a good way to describe the performance of your implementation of PCA is to take this quantity and compute it, because that tells you what percentage of the variance was retained. If you report that number, then people who are familiar with PCA can use it to get a good understanding of how well your hundred-dimensional representation is approximating your original data set, say with 99% of the variance retained. That ratio is really a measure of your average squared reconstruction error, and a ratio of 0.01 gives people a good intuitive sense of whether your implementation of PCA is finding a good approximation of your original data set. So hopefully that gives you an efficient procedure for choosing the number k, for choosing what dimension to reduce your data to. And if you apply PCA to very high-dimensional data sets, say thousand-dimensional data, then very often, just because data sets tend to have highly correlated features, you'll find that PCA is able to retain 99% of the variance, or 95%, some high fraction of the variance, even while compressing the data by a very large factor.
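For the case where k is picked by hand and you just want to report the retained variance, the same diagonal entries of S give the answer directly. This is a small illustrative sketch along the same lines as above, with made-up function names.

```python
import numpy as np

def variance_retained_for_k(X, k):
    """Fraction of variance retained if we keep the first k principal components.

    X is assumed to be an (m, n) matrix of mean-normalized training examples.
    """
    m = X.shape[0]
    Sigma = (X.T @ X) / m
    s = np.linalg.svd(Sigma, compute_uv=False)  # diagonal entries S11..Snn only
    return float(np.sum(s[:k]) / np.sum(s))

# e.g. report variance_retained_for_k(X, 100) if you chose k = 100 manually
```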