For logistic regression, we previously talked about two types of optimization algorithms. We talked about how to use gradient descent to optimize the cost function J of theta, and we also talked about advanced optimization methods: ones that require that you provide a way to compute your cost function J of theta and a way to compute the derivatives. In this video, we'll show how you can adapt both of those techniques, both gradient descent and the more advanced optimization techniques, in order to have them work for regularized logistic regression.

So, here's the idea. We saw earlier that logistic regression can also be prone to overfitting if you fit it with very high order polynomial features like this, where g is the sigmoid function. In particular, you can end up with a hypothesis whose decision boundary is an overly complex and extremely contorted function that really isn't such a great hypothesis for this training set. More generally, if you have logistic regression with a lot of features, not necessarily polynomial ones but just a lot of features, you can end up with overfitting.

This was our cost function for logistic regression. If we want to modify it to use regularization, all we need to do is add to it the following term: lambda over 2m times the sum from j equals 1 up to n, rather than the sum from j equals 0, of theta j squared. This has the effect of penalizing the parameters theta 1, theta 2, and so on up to theta n from being too large. And if you do this, then even though you're fitting a very high order polynomial with a lot of parameters, so long as you apply regularization and keep the parameters small, you're more likely to get a decision boundary that maybe looks more like this, one that looks more reasonable for separating the positive and the negative examples. So, when using regularization, even when you have a lot of features, the regularization can help take care of the overfitting problem.
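Just as an illustration, here is a minimal Octave sketch of this regularized cost. It assumes X is an m-by-(n+1) design matrix whose first column is all ones, y is an m-by-1 vector of 0/1 labels, theta is an (n+1)-by-1 parameter vector, and lambda is the regularization parameter; these variable names are only for illustration, not prescribed by the slides.

    sigmoid = @(z) 1 ./ (1 + exp(-z));               % g(z), the sigmoid function
    m = length(y);                                   % number of training examples
    h = sigmoid(X * theta);                          % hypothesis h_theta(x) for every example
    J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...   % original logistic regression cost
        + (lambda/(2*m)) * sum(theta(2:end).^2);              % penalty on theta_1..theta_n only, not theta_0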
How do we actually implement this? Well, for the original gradient descent algorithm, this was the update we had: we would repeatedly perform the following update to theta j. This slide looks a lot like the previous one for linear regression, but what I'm going to do is write the update for theta 0 separately. So, the first line is the update for theta 0, and the second line is now my update for theta 1 up to theta n, because I'm going to treat theta 0 separately.

In order to modify this algorithm to use the regularized cost function, all I need to do, pretty similar to what we did for linear regression, is to just modify the second update rule as follows. Once again, this cosmetically looks identical to what we had for linear regression. But of course it is not the same algorithm, because now the hypothesis is defined using the sigmoid function. So this is not the same algorithm as regularized linear regression, because the hypothesis is different, even though the update I wrote down looks cosmetically the same as what we had earlier when we were working out gradient descent for regularized linear regression.

And just to wrap up this discussion, this term here in the square brackets is, of course, the new partial derivative with respect to theta j of the new cost function J of theta, where J of theta here is the cost function we defined on the previous slide, the one that does use regularization. So, that's gradient descent for regularized logistic regression.
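Under the same assumed variable names as in the earlier sketch, plus a learning rate alpha, one iteration of this regularized gradient descent could look roughly like this in Octave (again, a sketch rather than anything prescribed by the course):

    m = length(y);                                              % number of training examples
    h = 1 ./ (1 + exp(-X * theta));                             % current predictions h_theta(x)
    grad0    = (1/m) * (X(:,1)' * (h - y));                     % theta_0 term: no regularization
    gradRest = (1/m) * (X(:,2:end)' * (h - y)) ...
               + (lambda/m) * theta(2:end);                     % theta_1..theta_n: extra (lambda/m)*theta_j
    theta = theta - alpha * [grad0; gradRest];                  % simultaneous update of all parameters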
Let's talk about how to get regularized logistic regression to work using the more advanced optimization methods. Just to remind you, for those methods what we needed to do was define a function, called costFunction, that takes as input the parameter vector theta. Once again, in the equations we've been writing here we used zero-indexed vectors, so we had theta 0 up to theta n. But because Octave indexes vectors starting from 1, theta 0 is written in Octave as theta(1), theta 1 is written in Octave as theta(2), and so on, down to theta n, which is written as theta(n+1). What we needed to do was provide this costFunction and then pass it in to what we saw earlier: we would use fminunc with @costFunction, and so on. fminunc stands for function minimization unconstrained, and it will take the cost function and minimize it for us.

The two main things that the cost function needed to return were, first, jVal. For that, we need to write code to compute the cost function J of theta. Now, when we're using regularized logistic regression, the cost function J of theta changes; in particular, the cost function now needs to include this additional regularization term at the end as well. So, when you compute J of theta, be sure to include that term at the end.
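For example, a call along these lines is one way to wire this up in Octave. Here costFunction is a hypothetical helper (sketched a little further below), the extra arguments X, y, and lambda are illustrative, n is assumed to be the number of features, and the option values are just example settings:

    initialTheta = zeros(n + 1, 1);                       % theta_0..theta_n, stored 1-indexed in Octave
    options = optimset('GradObj', 'on', 'MaxIter', 400);  % tell fminunc we supply the gradient ourselves
    [optTheta, functionVal, exitFlag] = ...
        fminunc(@(t)(costFunction(t, X, y, lambda)), initialTheta, options);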
And then the other thing that this costFunction needs to return is the gradient. So gradient(1) needs to be set to the partial derivative of J of theta with respect to theta 0, gradient(2) needs to be set to the partial derivative with respect to theta 1, and so on. Once again, the index is off by one, because of the one-based indexing that Octave uses.

Looking at these terms: this first term over here, which we actually worked out on a previous slide, is equal to this. It doesn't change, because the derivative with respect to theta 0 doesn't change compared to the version without regularization. The other terms do change. In particular, the derivative with respect to theta 1, which we also worked out on the previous slide, is equal to the original term plus lambda over m times theta 1. Just so we make sure we parse this correctly, we can add parentheses here, so that the summation doesn't extend over the last term. And similarly, this other term here looks like this, with the additional term we had on the previous slide that corresponds to the gradient from the regularization objective.

So if you implement this costFunction and pass it into fminunc, or into one of those other advanced optimization techniques, that will minimize the new regularized cost function J of theta, and the parameters you get out will be the ones that correspond to logistic regression with regularization. So, now you know how to implement regularized logistic regression.
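Putting the pieces together, a hypothetical costFunction along these lines would return both jVal and the gradient just described. This is a sketch under the same illustrative argument names as before, not the course's official implementation:

    function [jVal, gradient] = costFunction(theta, X, y, lambda)
      m = length(y);                                        % number of training examples
      h = 1 ./ (1 + exp(-X * theta));                       % sigmoid hypothesis h_theta(x)
      jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
             + (lambda/(2*m)) * sum(theta(2:end).^2);       % regularize theta(2:end) only, not theta(1)
      gradient = (1/m) * (X' * (h - y));                    % unregularized partial derivatives
      gradient(2:end) = gradient(2:end) ...
                        + (lambda/m) * theta(2:end);        % add (lambda/m)*theta_j for j >= 1
    end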
When I walk around Silicon Valley, and I live here in Silicon Valley, there are a lot of engineers that are, frankly, making a ton of money for their companies using machine learning algorithms. And I know we've only been studying this stuff for a little while. But if you understand linear regression, logistic regression, the advanced optimization algorithms, and regularization, then by now, frankly, you probably know quite a lot more machine learning than many of the Silicon Valley engineers out there having very successful careers, making tons of money for their companies or building products using machine learning algorithms. So, congratulations. You've actually come a long way, and you actually know enough to apply this stuff and get it to work for many problems. So congratulations for that.

But of course, there's still a lot more that we want to teach you. In the next set of videos after this one, we'll start to talk about a very powerful class of non-linear classifiers. Whereas with linear regression and logistic regression you can form polynomial terms, it turns out that there are much more powerful non-linear classifiers than polynomial regression. In the next set of videos after this one, I'll start telling you about them, so that you have even more powerful learning algorithms than you have now to apply to different problems.