For linear regression, we had previously worked out two learning algorithms: one based on gradient descent and one based on the normal equation. In this video we will take those two algorithms and generalize them to the case of regularized linear regression.

Here's the optimization objective that we came up with last time for regularized linear regression. The first part is our usual objective for linear regression, and we now have this additional regularization term, where lambda is our regularization parameter. We would like to find parameters theta that minimize this cost function, this regularized cost function J of theta.

Previously, we were using gradient descent for the original cost function, without the regularization term, and we had the following algorithm for linear regression without regularization: we repeatedly update the parameters theta j as follows, for j equals 0, 1, 2, up through n. Let me take this and just write the case for theta zero separately. So I'm just going to write the update for theta zero separately from the update for the parameters theta 1, theta 2, theta 3, and so on up to theta n. I haven't changed anything yet, right? This is just writing the update for theta zero separately from the updates for theta 1 through theta n. The reason I want to do this is that you may remember that for regularized linear regression, we penalize the parameters theta 1, theta 2, and so on up to theta n, but we don't penalize theta zero. So when we modify this algorithm for regularized linear regression, we're going to end up treating theta zero slightly differently.
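For reference, the objective being described can be written out as follows, where h_theta is the hypothesis, m is the number of training examples, and n is the number of features:

$$
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]
$$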
Concretely, if we want to take this algorithm and modify it to use the regularized objective, all we need to do is take this term at the bottom and modify it as follows: we add the term lambda over m, times theta j. If you implement this, then you have gradient descent for minimizing the regularized cost function J of theta.

Concretely, I'm not going to do the calculus to prove it, but if you look at the term written in square brackets, and you know calculus, it's possible to prove that that term is the partial derivative of J of theta with respect to theta j, using the new definition of J of theta that includes the regularization term. And similarly, the term up on top, the one I'm drawing a box around, is still the partial derivative of J of theta with respect to theta zero.

If you look at the update for theta j, it's possible to show something pretty interesting. Concretely, theta j gets updated as theta j, minus alpha times something, and then there is this other term that depends on theta j. So if you group together all the terms that depend on theta j, you can show that this update can be written equivalently as follows. All I did was take theta j — here it's theta j times 1 — and this term is lambda over m; there's also an alpha, so you end up with alpha times lambda over m multiplying theta j. And this term here, one minus alpha times lambda over m, is a pretty interesting term; it has a pretty interesting effect.
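Written out, the updates being described are the following (the theta zero update is unchanged, since theta zero is not regularized):

$$
\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}
$$

$$
\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\,\theta_j\right], \qquad j = 1, 2, \ldots, n
$$

and grouping the terms that depend on theta j gives the equivalent form

$$
\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$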
Concretely, this term, one minus alpha times lambda over m, is going to be a number that's usually a little bit less than 1, because alpha times lambda over m is positive, and if your learning rate is small and m is large, it's usually pretty small. So this term is going to be a number that's usually a little bit less than one; think of it as a number like 0.99, say.

So the effect of our update to theta j is that theta j gets replaced by theta j times 0.99. Multiplying by 0.99 has the effect of shrinking theta j a little bit towards zero, so it makes theta j a bit smaller; more formally, the square of theta j gets smaller. And then the second term is actually exactly the same as the original gradient descent update that we had before we added all this regularization stuff.

So hopefully this update makes sense: when we're using regularized linear regression, what we're doing is, on every iteration, multiplying theta j by a number that is a little bit less than one, so we're shrinking the parameter a little bit, and then performing a similar update as before. Of course, that's just the intuition behind what this particular update is doing; mathematically, it is exactly gradient descent on the cost function J of theta that we defined on the previous slide, the one that includes the regularization term.

Gradient descent was just one of our two algorithms for fitting a linear regression model.
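To make this concrete, here is a minimal Octave sketch of one iteration of the regularized update, written in vectorized form. It assumes X is the m-by-(n+1) design matrix with a leading column of ones, y is the m-by-1 vector of labels, theta is (n+1)-by-1, and alpha and lambda have already been chosen; the variable names are just illustrative.

```octave
m = length(y);
h = X * theta;                          % hypothesis h_theta(x) for all m examples
grad = (1 / m) * (X' * (h - y));        % gradient of the unregularized cost
reg = (lambda / m) * theta;             % regularization contribution ...
reg(1) = 0;                             % ... except for theta_0, which is not penalized
theta = theta - alpha * (grad + reg);   % one gradient descent step
```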
The second algorithm was the one based on the normal equation, where what we did was create the design matrix X, in which each row corresponds to a separate training example, and a vector y, an m-dimensional vector containing the labels from the training set. So whereas X is an m by (n plus 1) dimensional matrix, y is an m-dimensional vector. In order to minimize the cost function J, we found that one way to do it is to set theta equal to X transpose X, inverse, times X transpose y. (I'm leaving some room here to fill in more in a moment.) This value of theta minimizes the cost function J of theta when we are not using regularization.

Now that we are using regularization, you could derive what the minimum is — and just to give you a sense of how to derive it, you take the partial derivative with respect to each parameter, set it to zero, and then do a bunch of math — and you can show that a formula like this minimizes the cost function. Concretely, if you are using regularization, then this formula changes as follows: inside the parentheses, you end up with a matrix with a zero, then one, one, one, and so on down to the bottom. So this thing over here is a matrix whose upper-leftmost entry is zero, with ones on the rest of the diagonal and zeros everywhere else (I'm drawing this a little bit sloppily). As a concrete example, if n equals 2, then this matrix is going to be a three by three matrix.
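Written out, the regularized normal equation being described is

$$
\theta = \left(X^{T}X + \lambda
\begin{bmatrix}
0 & & & \\
 & 1 & & \\
 & & \ddots & \\
 & & & 1
\end{bmatrix}\right)^{-1} X^{T}y
$$

where the matrix multiplying lambda has a zero in its upper-left corner, ones on the rest of the diagonal, and zeros everywhere else.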
More generally, this matrix is an (n plus 1) by (n plus 1) dimensional matrix. So if n equals two, the matrix becomes something that looks like this: a zero, then ones on the rest of the diagonal, and zeros everywhere else. Once again, I'm not going to go through the derivation, which is frankly somewhat long and involved, but it is possible to prove that if you are using the new definition of J of theta, with the regularized objective, then this new formula for theta is the one that gives you the global minimum of J of theta.

So finally, I want to just quickly describe the issue of non-invertibility. This is relatively advanced material, so you should consider it optional; feel free to skip it, or if you listen to it and it doesn't quite make sense, don't worry about it either. Earlier, when I talked about the normal equation method, we also had an optional video on the non-invertibility issue, so this is another optional part, sort of an add-on to that earlier optional video.

Now consider a setting where m, the number of examples, is less than or equal to n, the number of features. If you have fewer examples than features, then the matrix X transpose X will be non-invertible, or singular; another term for this is that the matrix will be degenerate. If you implement this in Octave anyway and use the pinv function to take the pseudo-inverse, it will kind of do the right thing, although it's not clear that it will give you a very good hypothesis, even though numerically the Octave pinv function will give you a result that kind of makes sense.
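To make that concrete, here is a toy Octave sketch (the sizes and random numbers are made up purely for illustration): with m equal to 3 examples and n equal to 5 features, X transpose X is 6 by 6 but has rank at most 3, so it is singular, yet pinv still returns a usable result.

```octave
X = [ones(3, 1), rand(3, 5)];   % m = 3 examples, n = 5 features, plus the intercept column
y = rand(3, 1);
rank(X' * X)                    % at most 3, so the 6-by-6 matrix X'*X is singular
theta = pinv(X' * X) * X' * y;  % the pseudo-inverse still yields a result
```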
But if you were taking just the regular inverse — which in Octave is denoted by the function inv — and trying to take the regular inverse of X transpose X, then in this setting you will find that X transpose X is singular, that is, non-invertible. And if you're doing this in a different programming language, using some linear algebra library to try to take the inverse of this matrix, it just might not work, because the matrix is non-invertible, or singular.

Fortunately, regularization also takes care of this for us. Concretely, so long as the regularization parameter lambda is strictly greater than zero, it is actually possible to prove that the matrix X transpose X plus lambda times this funny matrix here will not be singular; that is, it will be invertible. So using regularization also takes care of any non-invertibility issues of the X transpose X matrix.

So, you now know how to implement regularized linear regression. Using this, you'll be able to avoid overfitting even if you have lots of features and a relatively small training set, and this should let you get linear regression to work much better for many problems. In the next video, we'll take this regularization idea and apply it to logistic regression, so that you'll be able to get logistic regression to avoid overfitting and perform much better as well.
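As a final optional sketch, continuing the toy example above: you can check in Octave that adding the regularization term restores invertibility whenever lambda is strictly positive.

```octave
lambda = 1;                                  % any strictly positive value
L = eye(6);                                  % (n+1)-by-(n+1) identity ...
L(1, 1) = 0;                                 % ... with the upper-left entry zeroed out
rank(X' * X + lambda * L)                    % 6, so the regularized matrix is invertible
theta = (X' * X + lambda * L) \ (X' * y);    % regularized normal equation
```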