[MUSIC]

Next, let's see what happens if we use degree 6 features to fit a logistic regression classifier on the same data set. So now our features go all the way up to x1 to the 6th and x2 to the 6th. That's a lot more features, and a lot more coefficients to be learned from the data. Now, if I take this data set and fit a logistic regression classifier, I get the following decision boundary.

It fits the training data extremely well. If you look very carefully, it actually gets zero training error, which should be a warning sign for you, by the way, as we mentioned. And if you look at the decision boundary, it's extremely complicated, extremely complex. Some people might say, here's a technical term for it: a crazy [LAUGH] decision boundary.

So even though it has zero training error, it has some weird artifacts to it. For example, I'm highlighting here a region in space where, even though everything around it is predicted positive, right in the middle of that circle the classifier thinks the score should be less than zero, so for that region we're saying that y hat should be -1. And that doesn't make any sense to me, because right around it every point is +1. Why should I expect the points in the middle there to be -1? The data doesn't support it at all.

And in fact, if you look at the magnitude of the coefficients, they're starting to get large. The natural parabola had coefficients of around 1 or 0.5; now we're getting coefficients on the order of 42 or more, which are 10 to 40 times bigger than the ones we had before. And that is an early warning sign of overfitting, as we discussed in the regression course.

Now, let's take that one step further and fit a logistic regression model that uses polynomial features of degree 20. So this goes all the way up to x1 to the power of 20 and x2 to the power of 20, so really, really high-order polynomials. If you look at the boundary that we learn, I mean, come on. I'd say this one is truly crazy. It's really pretty complicated; it gets all the data right, but it's highly unsmooth.
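Before going on to the learned weights, here is a minimal sketch of the kind of experiment described so far: expand x1 and x2 into high-degree polynomial features, fit an essentially unregularized logistic regression, and watch the coefficient magnitudes grow. The course's own tools are not shown in this clip, so the sketch assumes scikit-learn and a toy two-class data set (make_circles) as a stand-in for the lecture's data.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy two-class 2D data set (an assumption, standing in for the lecture's data).
X, y = make_circles(n_samples=100, noise=0.2, factor=0.5, random_state=0)

for degree in (2, 6, 20):
    # Expand (x1, x2) into all polynomial terms up to the given degree.
    Phi = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X)

    # A very large C means almost no regularization, mimicking a plain fit.
    clf = LogisticRegression(C=1e9, max_iter=100_000).fit(Phi, y)

    print(f"degree {degree:2d}: "
          f"train accuracy = {clf.score(Phi, y):.2f}, "
          f"max |w| = {np.abs(clf.coef_).max():.1f}")
```

As the degree grows, the training accuracy typically approaches 1 while the largest coefficient magnitudes grow sharply, which is exactly the warning sign described above.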
And if you look at the learned weights, the coefficients are on the order of 3,000, 4,000, minus 2,000; they're much, much bigger than those of the simple parabola we learned. It gets all the training data right, but it's clearly overfitting, and it's clearly outputting an estimated polynomial with very large coefficients. And so, we're going to watch those coefficients very carefully and keep them in mind as we try to avoid overfitting.

So the notion of overfitting in classification is very similar to that in regression, except that the error is now measured in terms of classification error. There might be some set of parameters that we learn here, w hat, which seems to do very well on the training data, maybe even giving these crazy boundaries, while there is some other set of parameters, some other coefficients w*, that would have done much better in terms of true error. And the question is: how do we push our learning process to be closer to w* than to w hat? And we'll do that by pushing the parameters to not be as massive, not as huge, pushing them towards zero, as we did with regularization.

[MUSIC]
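As a follow-on to that closing remark about pushing the parameters towards zero, here is a similarly hedged sketch of the idea: refit the degree-20 model with an L2 penalty of increasing strength. The data set and library are again assumptions rather than anything shown in the lecture; in scikit-learn, a smaller C means a stronger penalty.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Same assumed toy data and degree-20 feature expansion as before.
X, y = make_circles(n_samples=100, noise=0.2, factor=0.5, random_state=0)
Phi = PolynomialFeatures(degree=20, include_bias=False).fit_transform(X)

# Smaller C = stronger L2 penalty = coefficients pushed harder towards zero.
for C in (1e9, 1.0, 0.01):
    clf = LogisticRegression(penalty="l2", C=C, max_iter=100_000).fit(Phi, y)
    print(f"C = {C:g}: "
          f"train accuracy = {clf.score(Phi, y):.2f}, "
          f"max |w| = {np.abs(clf.coef_).max():.1f}")
```

Stronger regularization typically shrinks the largest coefficients dramatically, trading a little training accuracy for a smoother, more plausible decision boundary, which is what the regularization approach mentioned above aims for.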