In this video I'd like to tell you about the principal components analysis (PCA) algorithm. By the end of this video you'll know how to implement PCA for yourself, and use it to reduce the dimension of your data.

Before applying PCA, there is a data pre-processing step which you should always do. Given the training set of examples, it is important to always perform mean normalization, and then, depending on your data, maybe perform feature scaling as well. This is very similar to the mean normalization and feature scaling process that we have for supervised learning. In fact it's exactly the same procedure, except that we're now doing it to our unlabeled data, x(1) through x(m). So for mean normalization we first compute the mean of each feature, and then we replace each feature x with x minus its mean, and so this makes each feature now have exactly zero mean.

If the different features have very different scales — for example, if x1 is the size of a house and x2 is the number of bedrooms, to use our earlier example — we then also scale each feature to have a comparable range of values. And so, similar to what we had with supervised learning, we would take x_j^(i), that's feature j of example i, subtract off the mean, which is what we have on top, and then divide by s_j. Here, s_j is some measure of the range of values of feature j. It could be the max minus min value, or, more commonly, it is the standard deviation of feature j. Having done this sort of data pre-processing, here's what the PCA algorithm does.
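Before moving on, here is a minimal Octave sketch of that pre-processing step, under the assumption that the training examples are stored as the rows of an m-by-n matrix X (the variable names here are illustrative, not from the lecture):

    mu = mean(X);                          % 1-by-n row vector of feature means
    X_norm = bsxfun(@minus, X, mu);        % mean normalization: each feature now has zero mean
    s = std(X_norm);                       % 1-by-n per-feature standard deviations
    X_norm = bsxfun(@rdivide, X_norm, s);  % optional feature scaling: divide each feature by s_j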
We saw from the previous video that what PCA does is try to find a lower-dimensional subspace onto which to project the data, so as to minimize the squared projection errors — the sum of the squared projection errors, that is, the squares of the lengths of those blue lines. So what we wanted to do, specifically, is find a vector u(1) which specifies that direction, or, in the 2D case, find two vectors u(1) and u(2) to define this surface onto which to project the data.

Just as a quick reminder of what reducing the dimension of the data means: for this example on the left we were given the examples x(i), which are in R2, and what we would like to do is find a set of numbers z(i) in R1 to represent our data. That's what reduction from 2D to 1D means. Specifically, by projecting the data onto this red line, we need only one number to specify the position of a point on the line. I'm going to call that number z, or z1. z here is a real number, so it's like a one-dimensional vector; z1 just refers to the first component of this one-by-one matrix, or this one-dimensional vector. And so we need only one number to specify the position of a point. So if this example here was my example x(1), then maybe it gets mapped here, and if this example was x(2), maybe that example gets mapped there. And so this point here will be z(1) and this point here will be z(2), and similarly we would have those other points for these — maybe x(3), x(4), x(5) get mapped to z(3), z(4), z(5).

So what PCA has to do is come up with a way to compute two things. One is to compute these vectors — u(1), or in this case u(1) and u(2).
And the other is how to compute these numbers z. So in the example on the left we're reducing the data from 2D to 1D. In the example on the right, we would be reducing the data from three dimensions, as in R3, to z(i), which is now two-dimensional. So these z vectors would now be two-dimensional, with components z1 and z2, and so we need a way to compute these new representations, the z1 and z2 of the data, as well.

So how do you compute all of these quantities? It turns out that the mathematical derivation — the mathematical proof — of what are the right values of u(1), u(2), z1, z2, and so on is very complicated and beyond the scope of this course. But once you've done it, it turns out that the procedure to actually find the value of u(1) that you want is not that hard, even though the mathematical proof that this value is the correct one is somewhat more involved and more than I want to get into. So let me just describe the specific procedure that you have to implement in order to compute all of these things: the vectors u(1), u(2), and the vectors z. Here's the procedure.

Let's say we want to reduce the data from n dimensions down to k dimensions. What we're going to do is first compute something called the covariance matrix. The covariance matrix is commonly denoted by this Greek letter, the capital Greek letter Sigma. It's a bit unfortunate that the Greek letter Sigma looks exactly like the summation symbol. So this is the Greek letter Sigma used to denote a matrix, and this here is a summation symbol.
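For reference, the definition being referred to here is the standard sample covariance of the mean-normalized data, which can be written as

    \Sigma = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} \right) \left( x^{(i)} \right)^{T}

where each x(i) is an n-by-1 feature vector, so Sigma ends up being an n-by-n matrix.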
So hopefully in these slides there won't be ambiguity about which is the Sigma matrix and which is a summation symbol, and hopefully it will be clear from context when I'm using each one. How do you compute this matrix? Let's say we want to store it in an Octave variable called Sigma. What we need to do is compute something called the eigenvectors of the matrix Sigma. In Octave, the way you do that is with this command: [U, S, V] = svd(Sigma). SVD, by the way, stands for singular value decomposition.

Singular value decomposition is much more advanced linear algebra than you actually need to know. But it turns out that when Sigma is a covariance matrix, there are a few ways to compute these eigenvectors. If you are an expert in linear algebra, and if you've heard of eigenvectors before, you may know that there is another Octave function called eig which can also be used to compute the same thing. It turns out that the svd function and the eig function will give you the same vectors, although svd is a little more numerically stable. So I tend to use svd, although I have a few friends that use the eig function to do this as well; when you apply either one to a covariance matrix Sigma it gives you the same thing. This is because the covariance matrix always satisfies a mathematical property called symmetric positive semi-definiteness. You really don't need to know what that means, but svd and eig are different functions; when they are applied to a covariance matrix, which can be proved to always satisfy this mathematical property, they'll always give you the same thing.
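As a concrete sketch of that command in Octave (assuming the covariance matrix has already been stored in the variable Sigma):

    [U, S, V] = svd(Sigma);      % columns of U are the eigenvectors of Sigma
    % Because Sigma is symmetric, eig returns the same directions, e.g.:
    % [V_eig, D] = eig(Sigma);   % columns may come back in a different order, possibly with flipped signs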
Okay, that was probably much more linear algebra than you needed to know. In case none of that made sense, don't worry about it. All you need to know is that this svd command is what you should implement in Octave. And if you're implementing this in a different language than Octave or MATLAB, what you should do is find a numerical linear algebra library that can compute the SVD, or singular value decomposition; there are many such libraries for probably all of the major programming languages. You can use one of those to compute the matrices U, S, and V of the covariance matrix Sigma.

So, just to fill in some more details: this covariance matrix Sigma will be an n-by-n matrix. One way to see that is to look at the definition: this x(i) is an n-by-1 vector and its transpose is 1-by-n, so the product of these two things — n-by-1 times 1-by-n — is an n-by-n matrix, and when we add up all of these terms we still have an n-by-n matrix.

The SVD outputs three matrices, U, S, and V. The thing you really need out of the SVD is the U matrix. The U matrix will also be an n-by-n matrix, and if we look at the columns of the U matrix, it turns out that they are exactly those vectors u(1), u(2), and so on. So U will be a matrix whose columns are these vectors. And if we want to reduce the data from n dimensions down to k dimensions, then what we need to do is take the first k vectors. That gives us u(1) up to u(k), which gives us the k directions onto which we want to project the data. For the rest of the procedure, from this SVD numerical linear algebra routine we get this matrix U.
We'll call these columns u(1) through u(n). So, just to wrap up the description of the rest of the procedure: from the SVD numerical linear algebra routine we get these matrices U, S, and V, and we're going to use the first k columns of the U matrix to get u(1) through u(k).

Now the other thing we need to do is take my original data set x, which is in Rn, and find a lower-dimensional representation z, which is in Rk, for this data. The way we're going to do that is to take the first k columns of the U matrix and construct this matrix: stack up u(1), u(2), and so on up to u(k) in columns. It's really just taking this part of the matrix, the first k columns, and so this is going to be an n-by-k matrix. I'm going to give this matrix a name: I'm going to call it U subscript "reduce" — sort of a reduced version of the U matrix, maybe — because I'm going to use it to reduce the dimension of my data.

And the way I'm going to compute z is to let z be equal to this Ureduce matrix transpose times x. Or alternatively, to write out what this transpose means: when I take the transpose of this Ureduce matrix, what I end up with is these vectors now in rows — u(1) transpose down to u(k) transpose — and then I take that times x, and that's how I get my vector z. Just to make sure that these dimensions make sense: this matrix here is going to be k-by-n, and x here is going to be n-by-1, so the product here will be k-by-1. And so z is k-dimensional — it's a k-dimensional vector — which is exactly what we wanted.
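In Octave, a minimal sketch of this step might look as follows (assuming U has already come out of the svd call above, and x is a single n-by-1 mean-normalized example; the variable names just mirror the notation in the lecture):

    Ureduce = U(:, 1:k);   % n-by-k matrix: the first k columns of U, i.e. u(1) through u(k)
    z = Ureduce' * x;      % k-by-1 vector: the lower-dimensional representation of x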
And of course these x's here can be examples in our training set, examples in our cross-validation set, or examples in our test set. For example, if I wanted to take training example i, I could write this as x(i), and that's what gives me z(i) over there.

So, to summarize, here's the PCA algorithm on one slide. After mean normalization, to ensure that every feature has zero mean, and optional feature scaling — you really should do feature scaling if your features take on very different ranges of values — after this pre-processing we compute the covariance matrix Sigma like so. By the way, if your data is given as a matrix like this — if you have a matrix X which is your training set written in rows, with x(1) transpose down to x(m) transpose — then this covariance matrix Sigma actually has a nice vectorized implementation. In Octave, you can just run Sigma equals 1 over m, times X (which is this matrix up here) transpose, times X. This simple expression is the vectorized implementation of how to compute the matrix Sigma.

I'm not going to prove today that this is the correct vectorization. If you want, you can either numerically test it yourself, by trying it out in Octave and making sure that this and the unvectorized implementation give the same answers, or you can try to prove it yourself mathematically. Either way, this is the correct vectorized implementation. Next, we can apply the SVD routine to get U, S, and V.
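As a hedged sketch of that numerical check in Octave (assuming X is the m-by-n matrix of mean-normalized examples, one per row, and that m and n come from its size):

    [m, n] = size(X);

    % Unvectorized (loop) version, for comparison:
    Sigma_loop = zeros(n, n);
    for i = 1:m
        Sigma_loop = Sigma_loop + X(i, :)' * X(i, :);
    end
    Sigma_loop = Sigma_loop / m;

    % Vectorized version from the slide:
    Sigma = (1/m) * (X' * X);

    % The two should agree up to floating-point round-off:
    disp(max(max(abs(Sigma - Sigma_loop))));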
And then we grab the first k columns of the U matrix to build Ureduce, and finally this defines how we go from a feature vector x to this reduced-dimension representation z. And, similar to K-means, if you apply PCA, the way you'd apply it is with vectors x in Rn; this is not done with the x0 = 1 convention.

So, that was the PCA algorithm. One thing I didn't do was give a mathematical proof that the u(1) and u(2) and so on, and the z and so on, that you get out of this procedure really are the choices that minimize the squared projection error — that this procedure actually gives the projection of the data onto the k-dimensional subspace, onto the k-dimensional surface, that minimizes the squared projection error. Remember, we said that what PCA tries to do is find a surface or line onto which to project the data so as to minimize the squared projection error. I didn't prove that, and the mathematical proof of it is beyond the scope of this course. But fortunately the PCA algorithm can be implemented in not too many lines of Octave code, and if you implement it, it will work well — you get a very effective dimensionality reduction algorithm that does do the right thing of minimizing this squared projection error.
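Putting the whole procedure together, here is a minimal end-to-end sketch in Octave (the variable names X, k, Ureduce, Z are assumptions that mirror the lecture's notation, not code provided with the course):

    % X is an m-by-n matrix with one training example per row; k is the target dimension.
    m = size(X, 1);
    mu = mean(X);
    X_norm = bsxfun(@minus, X, mu);       % mean normalization (add feature scaling if needed)

    Sigma = (1/m) * (X_norm' * X_norm);   % n-by-n covariance matrix
    [U, S, V] = svd(Sigma);               % columns of U are u(1), ..., u(n)

    Ureduce = U(:, 1:k);                  % keep the first k directions
    Z = X_norm * Ureduce;                 % m-by-k; row i equals (Ureduce' * x(i))'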