In an earlier video, I said that PCA can sometimes be used to speed up the running time of a learning algorithm. In this video, I'd like to explain how to actually do that, and also give some advice about how to apply PCA.

Here's how you can use PCA to speed up a learning algorithm, and this supervised-learning speed-up is actually the most common use that I personally make of PCA. Let's say you have a supervised learning problem, so a problem with inputs x and labels y, and let's say that your examples x^(i) are very high-dimensional, say 10,000-dimensional feature vectors. One example would be a computer vision problem where you have 100x100 images: 100x100 is 10,000 pixels, so if each x^(i) is a feature vector containing the 10,000 pixel intensity values, then you have 10,000-dimensional feature vectors. With very high-dimensional feature vectors like this, running a learning algorithm can be slow. If you feed 10,000-dimensional feature vectors into logistic regression, a neural network, a support vector machine, or what have you, just because that's a lot of data, that's 10,000 numbers, it can make your learning algorithm run more slowly. Fortunately, with PCA we'll be able to reduce the dimension of the data and so make our algorithms run more efficiently. Here's how you do that. We first take our labeled training set and extract just the inputs: we extract the x's and temporarily set aside the y's.
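Here is a minimal NumPy sketch of that setup, assuming the lecture's 100x100 image example; the array names and the random stand-in data are mine, not from the video:

```python
import numpy as np

# Hypothetical labeled training set: m grayscale 100x100 images with labels.
rng = np.random.default_rng(0)
m = 50
images = rng.random((m, 100, 100))      # stand-in for real pixel intensities
y = rng.integers(0, 2, size=m)          # stand-in labels

# Each 100x100 image becomes a 10,000-dimensional feature vector x^(i).
X = images.reshape(m, -1)               # shape (m, 10000)

# For the PCA step we work with X alone and temporarily set y aside.
print(X.shape)                          # (50, 10000)
```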
This now gives us an unlabeled training set x^(1) through x^(m); these are, say, the ten-thousand-dimensional examples we have. So we just extract the input vectors x^(1) through x^(m). Then we apply PCA, and this gives me a reduced-dimension representation of the data: instead of 10,000-dimensional feature vectors, I now have maybe 1,000-dimensional feature vectors, so that's a 10x savings. This gives me, if you will, a new training set. Whereas previously I might have had an example (x^(1), y^(1)), my first training input is now represented by z^(1), and so we'll have a new sort of training example, z^(1) paired with y^(1), and similarly (z^(2), y^(2)), and so on, up to (z^(m), y^(m)), because my training examples are now represented with this much lower-dimensional representation z^(1), z^(2), up to z^(m). Finally, I can take this reduced-dimension training set and feed it to a learning algorithm, maybe a neural network, maybe logistic regression, and learn a hypothesis h that takes these low-dimensional representations z as input and tries to make predictions. If I were using logistic regression, for example, I would train a hypothesis that outputs h_theta(z) = 1 / (1 + e^(-theta' z)); it takes one of these z vectors as input and tries to make a prediction. And finally, if you have a new example, maybe a new test example x, what you do is take your test example x, map it through the same mapping that was found by PCA to get the corresponding z, and that z then gets fed to the hypothesis, which makes a prediction for your input x.
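As a concrete sketch of that whole recipe, here is a short Python example using scikit-learn; the lecture itself doesn't use this library, the data is random stand-in data, and the sizes are scaled down from the lecture's 10,000-to-1,000 example so the toy runs quickly:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Stand-in training data: m examples, n-dimensional inputs, binary labels.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((600, 1000))    # m = 600, n = 1000
y_train = rng.integers(0, 2, size=600)
x_test = rng.standard_normal(1000)            # a new example x

# 1. Run PCA on the (unlabeled) training inputs to get z^(1), ..., z^(m).
pca = PCA(n_components=100)
Z_train = pca.fit_transform(X_train)          # shape (600, 100)

# 2. Train the classifier on the reduced-dimension pairs (z^(i), y^(i)).
clf = LogisticRegression(max_iter=1000)
clf.fit(Z_train, y_train)

# 3. Map the test example through the SAME PCA mapping, then predict.
z_test = pca.transform(x_test.reshape(1, -1))
print(clf.predict(z_test))
```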
One final note: what PCA does is define a mapping from x to z, and this mapping from x to z should be defined by running PCA only on the training set. In particular, this mapping that PCA is learning computes a set of parameters: the feature scaling and mean normalization parameters, and also the matrix U_reduce. All of these things, U_reduce and so on, are parameters that are learned by PCA, and we should be fitting our parameters only to our training set and not to our cross-validation or test sets, so these things, U_reduce and so on, should be obtained by running PCA only on your training set. Then, having found U_reduce, and having found the parameters for feature scaling, that is, the means for mean normalization and the scales you divide the features by to get them onto comparable ranges, having found all those parameters on the training set, you can then apply the same mapping to other examples, maybe in your cross-validation set or in your test set. Just to summarize: when you're running PCA, run it only on the training-set portion of the data, not the cross-validation-set or test-set portion of your data. That defines the mapping from x to z, and you can then apply that same mapping to your cross-validation set and your test set.
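To make the "fit on the training set, reuse everywhere else" point concrete, here is a small NumPy sketch of the mapping itself, roughly following the mean-normalization and svd recipe from the earlier PCA videos; the toy sizes and array names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 50))   # toy training inputs
X_cv = rng.standard_normal((40, 50))       # toy cross-validation inputs
k = 10                                     # target dimension

# Fit the mapping on the training set ONLY.
mu = X_train.mean(axis=0)                  # mean normalization parameters
sigma = X_train.std(axis=0)                # feature scaling parameters
Xn = (X_train - mu) / sigma

Cov = (Xn.T @ Xn) / Xn.shape[0]            # covariance matrix
U, S, _ = np.linalg.svd(Cov)               # columns of U are principal directions
U_reduce = U[:, :k]

Z_train = Xn @ U_reduce                    # z^(i) = U_reduce' * x^(i)

# Apply the SAME mu, sigma, and U_reduce to cross-validation / test examples;
# do not re-run PCA on them.
Z_cv = ((X_cv - mu) / sigma) @ U_reduce
```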
By the way, in this example I talked about reducing the data from ten thousand dimensions to one thousand dimensions, and this is actually not that unrealistic. For many problems we can reduce the dimension of the data by 5x, maybe by 10x, and still retain most of the variance, barely affecting the performance, in terms of classification accuracy, let's say, barely affecting the classification accuracy of the learning algorithm. And by working with lower-dimensional data, our learning algorithm can often run much, much faster.

To summarize, we've so far talked about the following applications of PCA. First is the compression application, where we might reduce the memory or the disk space needed to store data, and we just talked about how to use PCA to speed up a learning algorithm. In these applications, in order to choose k, we'll often do so by figuring out what percentage of the variance is retained, and for this learning algorithm speed-up application we'll often retain 99% of the variance. That would be a very typical choice for how to choose k for these compression applications. Whereas for visualization applications, since we usually only know how to plot two-dimensional or three-dimensional data, we'll usually choose k = 2 or k = 3, because we can plot only 2D and 3D data sets. So that summarizes the main applications of PCA, as well as how to choose the value of k for these different applications.
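As a reminder of the choose-k procedure from the earlier video, here is a small NumPy sketch that picks the smallest k retaining at least 99% of the variance, using the singular values of the covariance matrix; the data and names here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 80))      # toy training inputs, already scaled

Cov = (X.T @ X) / X.shape[0]            # covariance matrix
_, S, _ = np.linalg.svd(Cov)            # S holds the variance along each direction

# Smallest k such that sum(S[:k]) / sum(S) >= 0.99 (99% of variance retained).
retained = np.cumsum(S) / np.sum(S)
k = int(np.searchsorted(retained, 0.99)) + 1
print(k, retained[k - 1])
```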
I should mention that there is one frequent misuse of PCA, and you sometimes hear about others doing this, hopefully not too often. I just want to mention it so that you know not to do it. The bad use of PCA is to try to use it to prevent over-fitting. This is not a great way to use PCA, but here's the reasoning behind the method: if we have x^(i) with n features, and we compress the data and use z^(i) instead, that reduces the number of features to k, which could be much lower-dimensional. And so, with a much smaller number of features, if k is 1,000 and n is 10,000, then with only 1,000-dimensional data maybe we're less likely to over-fit than if we were using the 10,000-dimensional data. So some people think of PCA as a way to prevent over-fitting. But just to emphasize, this is a bad application of PCA, and I do not recommend doing this. It's not that the method works badly; if you use it to reduce the dimension of the data to try to prevent over-fitting, it might actually work OK. But it just is not a good way to address over-fitting, and instead, if you're worried about over-fitting, there is a much better way to address it: use regularization instead of using PCA to reduce the dimension of the data. The reason is that, if you think about how PCA works, it does not use the labels y. You are just looking at your inputs x^(i) and using them to find a lower-dimensional approximation to your data. So what PCA does is throw away some information: it throws away or reduces the dimension of your data without knowing what the values of y are. Using PCA this way is probably okay if, say, 99 percent of the variance is retained, if you're keeping most of the variance, but it might also throw away some valuable information.
And it turns out that even if you're retaining 99% of the variance, or 95% of the variance, or whatever, just using regularization will often give you at least as good a method for preventing over-fitting, and regularization will often just work better. That's because when you are applying linear regression or logistic regression or some other method with regularization, the minimization problem actually knows what the values of y are, and so it is less likely to throw away valuable information, whereas PCA doesn't make use of the labels and is more likely to throw away valuable information. So, to summarize: it is a good use of PCA if your main motivation is to speed up your learning algorithm, but using PCA to prevent over-fitting is not a good use of PCA, and using regularization instead is really what many people would recommend doing.
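As a reminder of what "use regularization instead" means concretely, here is the regularized logistic regression cost from earlier in the course; note that minimizing it uses the labels y^(i) directly, which is exactly the information PCA ignores:

$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 $$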
Finally, one last misuse of PCA. I should say that PCA is a very useful algorithm, and I often use it for compression or visualization purposes. But what I sometimes see is people using PCA where it shouldn't be used. Here's a pretty common thing that I see: if someone is designing a machine-learning system, they may write down a plan like this. Let's design a learning system: get a training set, then run PCA, then train logistic regression, and then test on the test data. So often, at the very start of a project, someone will just write out a project plan that says let's do these four steps, with PCA inside.

Before writing down a project plan that incorporates PCA like this, one very good question to ask is: what if we were to just do the whole thing without using PCA? Often people do not consider this step before coming up with a complicated project plan and implementing PCA and so on. So specifically, what I often advise people is this: before you implement PCA, take whatever it is you want to do and first consider doing it with your original raw data x^(i), and only if that doesn't do what you want, then implement PCA and use z^(i). So before using PCA, instead of reducing the dimension of the data, I would consider ditching the PCA step and just training my learning algorithm on my original data, using my original raw inputs x^(i). Instead of putting PCA into the algorithm, just try doing whatever it is you're doing with the x^(i) first. Only if you have a reason to believe that doesn't work, so only if your learning algorithm ends up running too slowly, or only if the memory requirement or the disk-space requirement is too large and you want to compress your representation, only if you have evidence or strong reason to believe that using the x^(i) won't work, then implement PCA and consider using the compressed representation. Because what I do see is that sometimes people start off with a project plan that incorporates PCA inside, and sometimes whatever they're doing would have worked just fine even without using PCA.
So just consider that as an alternative as well, before you go and spend a lot of time getting PCA in, figuring out what k is, and so on. So, that's it for PCA. Despite these last sets of comments, PCA is an incredibly useful algorithm when you use it for the appropriate applications, and I've actually used PCA pretty often; for me, I use it mostly to speed up the running time of my learning algorithms. But I think just as common an application of PCA is to use it to compress data, to reduce the memory or disk-space requirements, or to use it to visualize data. PCA is one of the most commonly used and one of the most powerful unsupervised learning algorithms, and with what you've learned in these videos, I hope you'll be able to implement PCA and use it for all of these purposes as well.