So next, let's discuss what overfitting looks like for a classifier.

We've talked about a wide range of classifiers. For example, here I'm fitting a linear classifier to the data, in our usual example of predicting whether a review is positive or negative for a restaurant. We see that the points below the line have score greater than 0 and the points above the line have score less than 0. So below the line I predict positive, above the line I predict negative, and you get this very, very simple line in this simple example.

For this module I've created a simple data set, shown here on the lower left, with positive and negative examples, and I want to fit a bunch of different reference classifiers to it to really observe how overfitting happens in practice.

First, I'm going to fit a simple classifier with linear features: just a constant w0, the coefficient of x1, which is going to be w1, and the coefficient of x2, which is going to be w2. If I learn a logistic regression classifier on this data, I get the following results: the constant becomes 0.23, the coefficient of x1 becomes 1.12, and the coefficient of x2 becomes -1.07.

On the right I'm showing the resulting decision boundary from this classifier. This line corresponds to the points where 0.23 + 1.12 x1 - 1.07 x2 = 0. It marks the transition from the points down here, where Score(x) is greater than 0, to the points over here, where Score(x) is less than 0. So the points above the line are predicted to be negative, and the points below the line are predicted to be positive.

You see some interesting things in this simple data set with just a simple classifier. It does a pretty decent job of separating the positives from the negatives, but there are a few points that are misclassified in the training data: this plus over here and this minus over here. The question is, can I do better? Can I fit a model with maybe slightly fancier features that does better on this data set?
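A minimal sketch of this first setup, assuming a small synthetic two-dimensional data set and scikit-learn's LogisticRegression (neither the data nor the library choice comes from the lecture, and the learned coefficients will not match the 0.23, 1.12, -1.07 above):

```python
# Minimal sketch: fit a logistic regression classifier with linear features
# (constant, x1, x2) on a synthetic 2-D data set. The data below is made up
# for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two assumed clusters standing in for positive (+1) and negative (-1) examples.
X_pos = rng.normal(loc=[+1.0, -1.0], scale=0.8, size=(30, 2))
X_neg = rng.normal(loc=[-1.0, +1.0], scale=0.8, size=(30, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 30 + [-1] * 30)

# fit_intercept=True gives the constant term w0; coef_ holds w1 and w2.
model = LogisticRegression(fit_intercept=True)
model.fit(X, y)

print("constant w0:", model.intercept_[0])
print("coefficients w1, w2:", model.coef_[0])

# The decision boundary is the line where Score(x) = w0 + w1*x1 + w2*x2 = 0:
# points with Score(x) > 0 are predicted positive, Score(x) < 0 negative.
```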
To try to fit the data better, I'm now going to use what are called quadratic features. So I'm going to consider not just x1 and x2, but also x1 squared and x2 squared. Note that these are not general quadratic features: I'm not considering the cross term x1 times x2, because those cross terms get pretty big later on, so I'm just going to use these simple quadratic features.

If I learn a classifier on the same data, I get a really cool decision boundary. The decision boundary, when I project it into this two-dimensional space, becomes this kind of curved parabola. The constant in this case becomes 1.68, the coefficient of x1 is 1.39, and the coefficient of x2 is -0.59. Both of these numbers are different from the ones on the previous slide, because all of the parameters get updated when we add the quadratic terms, and here the quadratic coefficients become -0.17 and -0.96.

Using these quadratic terms I get this beautiful quadratic decision boundary over here, which is the curve where 1.68 + 1.39 x1 - 0.59 x2 - 0.17 x1^2 - 0.96 x2^2 = 0. The points on the left side of the parabola are the ones where Score(x) is less than 0, and the points on the right side are those where Score(x) is greater than 0.

And you get this beautiful curve where, yes, you still make a couple of mistakes, but those mistakes seem okay to me. It fits the data pretty well, and you should never expect to get everything right on a real data set; in fact, as we'll see later in this module, getting everything right should be a big warning sign for you. But I get a pretty good fit, it looks beautiful, and note, by the way, that the coefficients I learn over here are pretty reasonable. They have natural magnitudes, around 1, 0.5, and so on.

Now let's see what happens when we use an even higher-degree polynomial.
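As a companion to the sketch above (again, not the lecture's own code), here is one way these simple quadratic features could be built by hand, squaring x1 and x2 but leaving out the x1*x2 cross term as the lecture does; the data set is the same illustrative synthetic one:

```python
# Minimal sketch: add the squared features x1^2 and x2^2 (no x1*x2 cross term,
# matching the lecture) and refit logistic regression on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[+1.0, -1.0], scale=0.8, size=(30, 2))
X_neg = rng.normal(loc=[-1.0, +1.0], scale=0.8, size=(30, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 30 + [-1] * 30)

# Simple quadratic features: columns [x1, x2, x1^2, x2^2].
X_quad = np.hstack([X, X ** 2])

model_quad = LogisticRegression(fit_intercept=True)
model_quad.fit(X_quad, y)

print("constant w0:", model_quad.intercept_[0])
print("coefficients (x1, x2, x1^2, x2^2):", model_quad.coef_[0])

# The decision boundary is now the curve where
# w0 + w1*x1 + w2*x2 + w3*x1^2 + w4*x2^2 = 0,
# which, projected into the (x1, x2) plane, is the curved boundary
# described in the lecture.
```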