We're now down to our final practical issue for stochastic gradient, which again is going to be an optional section. And the question is, how do we introduce regularization, and what impact does it have on the updates? This is going to be pretty significant if you want to implement it yourself in a general way. Again, it's an optional section because it's pretty detailed.

So let's remind ourselves of the regularized likelihood. We have some total quality metric defined by the fit to our data, which is the log likelihood, and some measure of the complexity of the parameters, which in our case is the squared norm of the parameters, ||w||^2. And we want to compute the gradient of this regularized likelihood to make progress and avoid overfitting.

So the total derivative is the derivative of the first term, the quality, which is the sum over the data points of the contribution of each data point, the thing that got really expensive. And the contribution of the second term, once we introduce the parameter lambda to trade off the two things, which we've talked about a lot, is minus 2 lambda w_j. We derived this update in an earlier module on regularized logistic regression.

Now, this is how we do straight-up gradient updates with regularization. The question is, what do we do when we only do stochastic gradient with regularization?

So if you remember stochastic gradient, we just said we take the contribution from a single data point, and if we added up those contributions we'd get exactly the gradient. This is why it worked, because the sum of the stochastic gradients equals the gradient. And so, to mimic that, we need to think about how to set up the regularization term such that the sum of the stochastic gradients is also equal to the gradient.

And one way to think about this is to take the regularization and divide it by N. So in practice you want to do this: you want to say that the total derivative for stochastic gradient is the contribution of one data point minus 2/N lambda w_j. And so if you were to add this up over all N data points, you'd get back the full gradient.

So with regularization, the algorithm stays the same, but the contribution of a data point is its gradient minus 2/N lambda w_j. That's the contribution from the regularization term.
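To make that concrete, here is a minimal sketch of what that update could look like in code. This is my own illustration, not the course's notebook code: the function names, the +1/-1 label convention, and the fixed step size are all assumptions.

```python
import numpy as np

def predict_probability(features, weights):
    """P(y = +1 | x, w) under logistic regression, for one row or a matrix of rows."""
    scores = features @ weights
    return 1.0 / (1.0 + np.exp(-scores))

def stochastic_gradient_ascent(features, labels, step_size, l2_penalty, num_passes):
    """One-point-at-a-time ascent on the L2-regularized log likelihood.

    The regularization contribution is divided by N so that the per-point
    stochastic gradients still sum to the full gradient over one pass.
    """
    N, D = features.shape
    weights = np.zeros(D)
    indicator = (labels == +1).astype(float)     # 1[y_i = +1]
    for _ in range(num_passes):
        for i in np.random.permutation(N):       # shuffle the data each pass
            x_i = features[i]
            error = indicator[i] - predict_probability(x_i, weights)
            likelihood_contribution = error * x_i                     # d(log lik_i)/dw
            regularization_contribution = -(2.0 / N) * l2_penalty * weights
            weights += step_size * (likelihood_contribution + regularization_contribution)
    return weights
```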
If you're using mini-batches, you'd adapt that term as well, as in the sketch below. So you'd take the sum of the contributions from each data point in the batch, and then the regularization contribution would be minus 2B/N lambda w_j, where B is the batch size. So this is how we take care of regularization. Again, a very small change, but it's going to behave way better.
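And here is a rough sketch of that mini-batch variant, again with made-up names and reusing the predict_probability helper from the sketch above: the likelihood contributions are summed over the B points in the batch, and the regularization contribution is scaled by 2B/N.

```python
import numpy as np

def minibatch_gradient_ascent(features, labels, step_size, l2_penalty,
                              batch_size, num_passes):
    """Mini-batch ascent: sum the B per-point contributions and scale the
    regularization term by B/N, so one full pass still matches the full gradient."""
    N, D = features.shape
    weights = np.zeros(D)
    indicator = (labels == +1).astype(float)
    for _ in range(num_passes):
        order = np.random.permutation(N)
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            B = len(batch)
            # uses predict_probability from the previous sketch
            errors = indicator[batch] - predict_probability(features[batch], weights)
            likelihood_contribution = features[batch].T @ errors      # sum over the batch
            regularization_contribution = -(2.0 * B / N) * l2_penalty * weights
            weights += step_size * (likelihood_contribution + regularization_contribution)
    return weights
```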