We've seen now how regularization can play a role in logistic regression to find much better fits of the data and better assessments of probability. Let's finally talk about how we can learn the coefficients from data using gradient ascent. It's going to be a very, very tiny change to what we did for learning the coefficients in logistic regression. So with a tiny change of code, we now address, and alleviate, all those overfitting problems that we had before.

So again, same setting as before: training data, features, same model. Now the L2-regularized log-likelihood is our quality metric, and we're going to talk about the ML algorithm that optimizes it to get w hat. We're going to use the same kind of gradient ascent algorithm that we used before: we start from some point and take little steps of size eta until we get to our solution w hat. It's the same kind of approach: we take the old coefficients, add eta times the gradient, and get the new coefficients w(t+1). So the only thing we have to ask ourselves is: what is the gradient equal to, now that we've added this extra regularization term? In other words, we need the gradient of the regularized log-likelihood. Let's see what that looks like.

We've seen that our total quality is the log-likelihood of the data, which is a measure of fit, minus lambda times our regularization penalty, which is the L2 norm squared. So what is the derivative of this thing? This is what we need in order to walk in that hill-climbing direction. The derivative of a sum is the sum of the derivatives, so the total derivative is the derivative of the first term, the derivative of the log-likelihood, which, thankfully, we saw in the previous module, minus lambda times the derivative of the quadratic term. We already covered the derivative of the quadratic term in the regression course, but we'll do a quick review here. As you can see, this is just a small change to the code from before: we just have to account for the extra term, minus lambda times the derivative of the quadratic term.
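Written out in symbols, using ℓ(w) for the data log-likelihood and w = (w0, ..., wD) for the coefficients (this notation is my rendering of what the slides describe, not a quote of them), the quality metric, its gradient, and the gradient ascent update are:

\[
\text{Total quality}(\mathbf{w}) \;=\; \ell(\mathbf{w}) \;-\; \lambda\,\|\mathbf{w}\|_2^2,
\qquad
\frac{\partial}{\partial w_j}\Big(\ell(\mathbf{w}) - \lambda\|\mathbf{w}\|_2^2\Big)
\;=\;
\frac{\partial \ell(\mathbf{w})}{\partial w_j} \;-\; \lambda\,\frac{\partial}{\partial w_j}\|\mathbf{w}\|_2^2,
\]

\[
\mathbf{w}^{(t+1)} \;\leftarrow\; \mathbf{w}^{(t)} + \eta\,\nabla\Big(\ell(\mathbf{w}^{(t)}) - \lambda\|\mathbf{w}^{(t)}\|_2^2\Big).
\]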
So, as a review, the derivative of the log-likelihood is the sum over my data points of the difference between the indicator of whether it's a positive example and the probability of it being positive, weighted by the value of the feature. We talked about this last module and interpreted this piece in quite a bit of detail, so I'm not going to go over it again; we're going to focus on the second part, which is the derivative of the L2 penalty. In other words, what's the partial derivative with respect to some parameter wj of w0 squared plus w1 squared plus w2 squared, plus dot dot dot, plus wj squared, plus dot dot dot, plus wD squared? If you look at all of these terms, w0 squared, w1 squared, and so on, they don't play any role in the derivative. The only term that plays a role is wj squared. And what's the derivative of wj squared? It's just 2wj. So that's all that's going to change in our code: this 2wj term. In fact, our total derivative is the same derivative that we've implemented in the past, minus 2 lambda times wj, that is, 2 times the regularization parameter lambda times the value of that coefficient; the full expression is written out below.

So let's interpret what this extra term does for us. What does the minus 2 lambda wj do to the derivative? If wj is positive, then minus 2 lambda wj is a negative term, a negative contribution to the derivative, which means that it decreases wj, because you're adding some negative quantity to it. It was positive, and we're going to decrease it. So since it was positive and you're decreasing it, wj becomes closer to 0. If the coefficient is positive, you add a negative number and it becomes less positive, closer to 0. In fact, if lambda is bigger, that term becomes more negative and wj goes to 0 faster. And if wj is very positive, the decrement is also larger, so again it goes toward 0 even faster.
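To make the shrinkage concrete, here is the full regularized derivative just described, followed by a small worked example; the specific numbers (wj = 2, lambda = 1, step size eta = 0.1) are made up purely for illustration.

\[
\frac{\partial \ell(\mathbf{w})}{\partial w_j} - 2\lambda w_j
\;=\;
\sum_{i=1}^{N} h_j(\mathbf{x}_i)\Big(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\Big) \;-\; 2\lambda w_j,
\]

\[
-2\lambda w_j = -2(1)(2) = -4,
\qquad
w_j^{(t+1)} = 2 + 0.1 \times (-4) = 1.6
\quad\text{(ignoring the log-likelihood part of the derivative)}.
\]

The coefficient moves from 2 toward 0. Because this contribution is linear in both lambda and wj, a larger lambda or a larger coefficient produces a proportionally stronger pull toward zero, which is exactly the behavior described above.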
Now, if wj is negative, then minus 2 lambda wj is going to be greater than 0, because lambda is also greater than 0. And what impact does that have? You're adding something positive, so you're increasing wj, which means that wj again becomes closer to 0. It was negative, you add a positive number to it, and it moves a little closer to 0. So this is extremely intuitive: the regularization takes positive coefficients and decreases them a little bit, and takes negative coefficients and increases them a little bit. It tries to push the coefficients toward 0; that's the effect it has on the gradient, exactly what you'd expect.

Finally, this is exactly the code that we described in the last module to learn the coefficients of a logistic regression model. You start with coefficients equal to 0, or some other randomly initialized or smartly initialized parameters. Then, on each iteration, you go coefficient by coefficient and compute the partial derivative, which is this long term here: a sum over data points of the feature value times the difference between the indicator of whether it's a positive data point and the predicted probability of it being positive. Call that partial[j]. And you have the same update: wj(t+1) is wj(t) plus the step size eta times the partial derivative, just as before, which is the derivative of the likelihood function with respect to wj. And all you need to change in your code, there's only one little thing to change: this little term here, minus 2 lambda wj, which is our only change. In other words, take all the code you had before, add minus 2 lambda wj to the computation of the derivative, and now you have a solver for L2-regularized logistic regression. And this is going to help you a tremendous amount in practice.
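To tie the whole update together, here is a minimal sketch in Python/NumPy of the L2-regularized gradient ascent loop described above. The names (predict_probability, l2_regularized_logistic_regression, H, y) are illustrative choices for this sketch, not the course's actual starter code; it assumes a feature matrix H with H[i, j] = hj(xi), including a constant column for the intercept, and labels y in {-1, +1}.

import numpy as np


def predict_probability(H, w):
    """P(y = +1 | x_i, w) for every row of H, via the sigmoid of the score."""
    return 1.0 / (1.0 + np.exp(-(H @ w)))


def l2_regularized_logistic_regression(H, y, step_size=1e-5, l2_penalty=1.0, max_iter=500):
    """Gradient ascent on the L2-regularized log-likelihood.

    H : (N, D) array, H[i, j] = h_j(x_i); include a constant column for w_0.
    y : (N,) array of labels in {-1, +1}.
    """
    w = np.zeros(H.shape[1])              # start from all-zero coefficients
    indicator = (y == +1).astype(float)   # 1[y_i = +1]
    for _ in range(max_iter):
        errors = indicator - predict_probability(H, w)  # 1[y_i = +1] - P(y = +1 | x_i, w)
        partial = H.T @ errors                          # derivative of the log-likelihood
        partial -= 2.0 * l2_penalty * w                 # the only new piece: -2 * lambda * w_j
        w = w + step_size * partial                     # w^(t+1) = w^(t) + eta * gradient
    return w

Setting l2_penalty to 0 recovers the unregularized update from the last module; the regularized version differs only in the single line that subtracts 2 * l2_penalty * w from the derivative.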