In the PCA algorithm, we take n-dimensional features and reduce them to some k-dimensional feature representation. This number k is a parameter of the PCA algorithm. It's also called the number of principal components, or the number of principal components that we've retained. In this video I'd like to give you some guidelines, and tell you about how people tend to think about choosing this parameter k for PCA.

In order to choose k, that is, to choose the number of principal components, here are a couple of useful concepts. What PCA tries to do is minimize the average squared projection error. So it tries to minimize the difference between the original data x and the projected version, x-approx-i, which was defined in the last video; that is, it tries to minimize the squared distance between x and its projection onto that lower-dimensional surface. That's the average squared projection error.

Also, let me define the total variation in the data to be the average squared length of these examples x-i. So the total variation in the data is the average, over my training set, of the squared length of each of my training examples. This asks: on average, how far are my training examples from the all-zeros vector? How far, on average, are my training examples from the origin?
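To make those two quantities concrete, here they are written out, using the x-approx notation from the last video:

```latex
% Average squared projection error (what PCA minimizes):
\frac{1}{m}\sum_{i=1}^{m} \left\| x^{(i)} - x^{(i)}_{\mathrm{approx}} \right\|^{2}

% Total variation in the data:
\frac{1}{m}\sum_{i=1}^{m} \left\| x^{(i)} \right\|^{2}
```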
When we're trying to choose k, a pretty common rule of thumb is to choose the smallest value so that the ratio between these two quantities is small. So in other words, a pretty common way to think about how we choose k is that we want the average squared projection error, that is, the average distance between x and its projections, divided by the total variation of the data, that is, how much the data varies, to be less than, let's say, 0.01. Or to be less than 1%, which is another way of thinking about it.

And the way most people think about choosing k is, rather than choosing k directly, to talk about what this number is, whether it's 0.01 or some other number. And if it is 0.01, another way to say this, using the language of PCA, is that 99% of the variance is retained. Don't worry too much about what this phrase really means technically; "99% of variance is retained" just means that this quantity on the left is less than 0.01. And so, if you're using PCA and you want to tell someone how many principal components you've retained, it would be more common to say, "Well, I chose k so that 99% of the variance was retained." That's a useful thing to know: it means that the average squared projection error divided by the total variation was at most 1%. That's an insightful thing to think about, whereas if you tell someone, "Well, I had 100 principal components," or "k was equal to 100 in thousand-dimensional data," it's a little hard for people to interpret that.

So this number 0.01 is what people often use. Another common value is 0.05, which would be 5%; if you use that, then you'd say 95% of the variance is retained. And, you know, other numbers: maybe 90% of the variance is retained, maybe as low as 85%. So 90% would correspond to 0.10, or 10%.
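So the rule of thumb, in one line (this is just the criterion above restated, nothing new assumed):

```latex
\frac{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - x^{(i)}_{\mathrm{approx}}\right\|^{2}}
     {\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)}\right\|^{2}}
\;\le\; 0.01
\quad\Longleftrightarrow\quad
\text{``99\% of the variance is retained''}
```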
And so a range of values from, you know, 90, 95, 99, maybe as low as 85% of the variance retained, would be a fairly typical range of values. Maybe 95 to 99 is really the most common range of values that people use. For many data sets you'd be surprised: in order to retain 99% of the variance, you can often reduce the dimension of the data significantly and still retain most of the variance. That's because for most real-life data sets, many features are just highly correlated, and so it turns out to be possible to compress the data a lot and still retain, you know, 99% of the variance, or 95% of the variance.

So how do you implement this? Well, here's one algorithm that you might use. If you want to choose the value of k, you might start off with k equals 1, and then run through PCA: compute z1, z2, up to zm, compute all of those x1-approx and so on up to xm-approx, and then check whether 99% of the variance is retained. If it is, then we're good and we use k equals 1. But if it isn't, then we'll next try k equals 2, and then we'll again run through this entire procedure and check whether this expression is satisfied, whether this is less than 0.01. And if not, then we do this again: try k equals 3, then try k equals 4, and so on, until maybe we get up to k equals 17 and we find that 99% of the variance is retained, and then we use k equals 17. That is one way to choose the smallest value of k so that 99% of the variance is retained. But as you can imagine, this procedure seems horribly inefficient: we're trying k equals 1, k equals 2, and doing all these calculations each time, as in the sketch below.
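Here's a minimal sketch of that naive loop, assuming X is an m-by-n NumPy array that has already been mean-normalized (the helper name choose_k_naive is just for illustration):

```python
import numpy as np

def choose_k_naive(X, threshold=0.01):
    """Try k = 1, 2, 3, ...: redo the projection for each k, and stop at
    the first k whose projection-error ratio is <= threshold."""
    m, n = X.shape
    Sigma = (X.T @ X) / m                      # covariance matrix
    U, _, _ = np.linalg.svd(Sigma)
    total_variation = np.mean(np.sum(X ** 2, axis=1))
    for k in range(1, n + 1):
        U_reduce = U[:, :k]                    # first k principal components
        Z = X @ U_reduce                       # z(1), ..., z(m)
        X_approx = Z @ U_reduce.T              # x_approx(1), ..., x_approx(m)
        avg_proj_error = np.mean(np.sum((X - X_approx) ** 2, axis=1))
        if avg_proj_error / total_variation <= threshold:
            return k                           # e.g. 99% of variance retained
    return n
```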
Fortunately, when you implement PCA, in this step it actually gives us a quantity that makes it much easier to compute these things. Specifically, when you call SVD to get these matrices U, S, and V, when you call svd on the covariance matrix Sigma, it also gives us back this matrix S. And what S is, is a square matrix, an n-by-n matrix in fact, that is diagonal. So its diagonal entries S11, S22, S33, down to Snn are going to be the only non-zero elements of this matrix, and everything off the diagonal is going to be zero. Okay? So those big O's that I'm drawing, what I mean by that is that everything off the diagonal of this matrix, all of those entries, are going to be zeros.

And so, what is possible to show, and I won't prove this here, is that for a given value of k, this quantity over here can be computed much more simply. That quantity can be computed as one minus the sum from i equals 1 through k of Sii, divided by the sum from i equals 1 through n of Sii. So just to say that in words, or just to take another view of how to explain that: if k equals 3, say, what we're going to do to compute the numerator is sum from i equals 1 through 3 of Sii, so just compute the sum of the first three diagonal elements. So that's the numerator. And then for the denominator, that's the sum of all of the diagonal entries. And one minus the ratio of those gives me this quantity over here, the one I've circled in blue.
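Written as a formula, the claim is:

```latex
\frac{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - x^{(i)}_{\mathrm{approx}}\right\|^{2}}
     {\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)}\right\|^{2}}
\;=\; 1 \;-\; \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}}
```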
And so, what we can do is just test whether this is less than or equal to 0.01. Or equivalently, we can test whether the sum from i equals 1 through k of Sii, divided by the sum from i equals 1 through n of Sii, is greater than or equal to 0.99, if you want to be sure that 99% of the variance is retained.

And so what you can do is just slowly increase k: set k equals 1, set k equals 2, set k equals 3, and so on, and just test this quantity to see what is the smallest value of k that ensures that 99% of the variance is retained. And if you do this, then you need to call the SVD function only once. Because that gives you the S matrix, and once you have the S matrix, you can then just keep on doing this calculation by increasing the value of k in the numerator, and so you don't need to keep calling SVD over and over again to test out the different values of k. So this procedure is much more efficient, and it allows you to select the value of k without needing to run PCA from scratch over and over. You just run SVD once; this gives you all of these diagonal numbers, all of these numbers S11, S22, down to Snn, and then you can just, you know, vary k in this expression to find the smallest value of k so that 99% of the variance is retained.

So to summarize, the way that I often choose k when I'm using PCA for compression is I would call SVD once on the covariance matrix, and then use this formula and pick the smallest value of k for which this expression is satisfied.
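Here's a minimal sketch of that more efficient procedure, under the same assumptions as before (X is mean-normalized; the function name is again just illustrative):

```python
import numpy as np

def choose_k(X, variance_to_retain=0.99):
    """Call SVD once on the covariance matrix, then scan the diagonal
    entries S11..Snn for the smallest k that retains enough variance."""
    m, n = X.shape
    Sigma = (X.T @ X) / m              # covariance matrix
    _, S, _ = np.linalg.svd(Sigma)     # S holds the diagonal entries Sii
    total = S.sum()
    for k in range(1, n + 1):
        retained = S[:k].sum() / total
        if retained >= variance_to_retain:
            return k, retained         # smallest k with e.g. 99% retained
    return n, 1.0
```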
And by the way, even if you were to pick some different value of k, even if you were to pick the value of k manually (you know, maybe you have thousand-dimensional data and you just want to choose k equals 100), then if you want to explain to others what you just did, a good way to explain the performance of your implementation of PCA to them is to take this quantity and compute what it is, and that will tell you what percentage of the variance was retained. And if you report that number, then people who are familiar with PCA can use it to get a good understanding of how well your hundred-dimensional representation is approximating your original data set, because, say, 99% of the variance is retained. That's really a measure of your squared reconstruction error: that ratio being 0.01 just gives people a good intuitive sense of whether your implementation of PCA is finding a good approximation of your original data set.

So hopefully, that gives you an efficient procedure for choosing the number k, for choosing what dimension to reduce your data to. And if you apply PCA to very high-dimensional data sets, you know, like thousand-dimensional data, very often, just because data sets tend to have highly correlated features, which is just a property of most of the data sets you see, you'll find that PCA is able to retain 99% of the variance, or say 95%, some high fraction of the variance, even while compressing the data by a very large factor.
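For instance, reporting the retained variance for a manually chosen k might look like this (reusing the S vector from the sketch above; the numbers are just for illustration):

```python
# Hypothetical: k chosen manually, e.g. k = 100 for thousand-dimensional data.
k = 100
variance_retained = S[:k].sum() / S.sum()
print(f"{variance_retained:.1%} of the variance is retained with k = {k}")
```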