Now that we've seen how regularization can play a role in the classification setting, let's observe in our data set what happens to the decision boundaries as we introduce our regularization penalty. We're going to work with those degree-20 features: a logistic regression model with polynomial features of degree 20, which led to what I gave the technical term the "crazy decision boundary." And the parameters had very large magnitude; in fact, they ranged from -3,170 to 3,803. They were very big. Now we're going to take the same setting, same number of features, same features, same data, same everything, but just vary the parameter lambda and see what happens, and here we're showing the results of doing exactly that. When lambda is equal to zero, we get very large coefficients; when lambda is large, like ten, we get reasonably sized, much smaller coefficients.

Okay, so for lambda equals zero, we had that crazy decision boundary, and for large lambdas, we have a nicer, smoother boundary. In fact, I trust this boundary with lambda equal to ten much more than the one with lambda equal to zero. And the decision boundary for lambda equals ten looks a lot like that really beautiful one I got with the parabola, which fit my data really well. But here there are tons more features, and nevertheless, adding a little bit of regularization helps us get that really nice separating boundary that I can trust.

We can also look at the coefficient path: what happens to each coefficient as we increase the penalty lambda. In the beginning, when we have an unregularized problem, the coefficients tend to be large. But as we increase lambda, they become smaller and smaller, going towards zero.
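To make that experiment concrete, here is a minimal sketch in Python with scikit-learn, under assumptions that differ from the lecture: a synthetic two-feature data set stands in for the course's data, and scikit-learn expresses the L2 penalty through C = 1/lambda, so "lambda = 0" is approximated by a tiny lambda.

# Minimal sketch of the lecture's experiment (assumptions: synthetic data,
# scikit-learn's C = 1/lambda parameterization; not the course's own code).
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

for lam in [1e-6, 0.01, 1.0, 10.0]:   # 1e-6 stands in for "no regularization"
    model = make_pipeline(
        PolynomialFeatures(degree=20, include_bias=False),
        LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=10000),
    )
    model.fit(X, y)
    coefs = model.named_steps["logisticregression"].coef_.ravel()
    print(f"lambda={lam:g}: coefficients range from "
          f"{coefs.min():.1f} to {coefs.max():.1f}")

The exact numbers will differ from the lecture's plots, but the pattern should match: huge coefficients when lambda is near zero, and modest ones when lambda is around ten.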
I've used the product review data set here, picked a few words, and fit a logistic regression model using just those words, with different levels of regularization. The words that have positive coefficients tend to be associated with positive aspects of reviews, while the ones with negative coefficients tend to be associated with negative aspects of reviews. What is the "word," in quotes, that has the most positive weight? Well, if you look at the key here, you'll see that the word with the most positive weight is actually an emoticon, the smiley face, while the word with the most negative weight is another emoticon, the sad face. In the beginning, all these words have pretty large coefficients, except the ones near zero, which are words like "this" and "review" that are not associated with either positive or negative things. Although, if the word "review" shows up, it is slightly correlated with a negative review, in general these coefficients are much smaller than the others. And as I increase the regularization penalty lambda, you see the coefficients become smaller and smaller, and if I were to keep drawing this, they would eventually go to zero. Now, if I were to use cross validation to pick the best lambda, I would get a result around here, and I'm going to call that lambda star. And that's what you do with cross validation: find the point where the model fits the data pretty well but doesn't over-fit it too much.

As a last point, I'm going to show you something that is pretty exciting, really beautiful, about regularization in logistic regression. Regularization doesn't only address the crazy, wiggly decision boundaries; it also addresses those over-confidence problems that we saw with over-fitting in logistic regression. So I'm taking the same coefficients, the same models that I've learned: as lambda increases, the range of the coefficients decreases, they get smaller. But at the bottom here I'm showing the actual decision boundaries that we learned and the notion of uncertainty in the predictions. So if lambda is equal to zero, we have these highly over-confident predictions. If lambda is equal to one, not only do I get a more natural, parabola-like decision boundary, even though I'm using polynomial degree-20 features, I also get a very natural uncertainty region. The region where I don't know whether a point is positive or negative is really those points near the boundary, between the cluster of positive points and the cluster of negative points. And you get this kind of beautiful, smooth transition. So by introducing regularization, we've now addressed those two fundamental problems where over-fitting comes in in logistic regression.
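As a follow-up, here is a minimal sketch of the lambda-selection step and the over-confidence check, again with scikit-learn on a synthetic two-feature data set rather than the review data; the lambda grid and the probe points are assumptions for illustration, and as before C = 1/lambda.

# Minimal sketch: pick lambda by cross validation, then compare how
# extreme the predicted probabilities are with and without regularization.
# Synthetic data and lambda grid are assumptions, not the course's setup.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Cross-validate over a grid of lambdas (scikit-learn takes C = 1/lambda).
lambdas = np.logspace(-3, 3, 13)
cv_model = make_pipeline(
    PolynomialFeatures(degree=20, include_bias=False),
    LogisticRegressionCV(Cs=1.0 / lambdas, cv=5, penalty="l2", max_iter=10000),
)
cv_model.fit(X, y)
best_C = cv_model.named_steps["logisticregressioncv"].C_[0]
print(f"lambda* chosen by cross validation: {1.0 / best_C:g}")

# Compare predicted probabilities for an (almost) unregularized fit and
# the cross-validated one; the regularized model should be less extreme.
def fit_with_C(C):
    return make_pipeline(
        PolynomialFeatures(degree=20, include_bias=False),
        LogisticRegression(penalty="l2", C=C, max_iter=10000),
    ).fit(X, y)

probe = X[:5]  # a few training points, purely for illustration
print("lambda ~ 0:", fit_with_C(1e6).predict_proba(probe)[:, 1].round(3))
print("lambda*   :", fit_with_C(best_C).predict_proba(probe)[:, 1].round(3))

The particular lambda* will not match the lecture's; the point is only that the cross-validated fit pulls the predicted probabilities away from 0 and 1 near the boundary, which is the same smooth uncertainty region described above.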