[MUSIC] In classification, the concept of overfitting can be even stranger than it is in regression, because here we're not just predicting a particular value, like the price of a house, or even just whether a review is positive or negative; we're often asking probabilistic questions, like: what is the probability that this review is positive? So let's see what overfitting means with respect to estimating probabilities. As you remember from the previous modules, we talked about the relationship between the score on the data, that's w transpose h, which ranges from minus infinity to plus infinity, and the actual estimate for probabilities: the probability that y equals +1, let's say, given the input x and w, is the sigmoid applied to the score w transpose h. So that's the model we're working with, and if you remember, as we overfit we see the w's becoming bigger and bigger. These coefficients keep growing, which means that w transpose h becomes huge, and that pushes us, when the score is massively positive, to say that the probability is exactly one, pretty confident it's a one, or, when it's massively negative, confident it's a zero.
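The score-to-probability mapping described above can be sketched in a couple of lines; this is a minimal illustration of the sigmoid, not code from the course:

```python
import math

def sigmoid(score):
    """Squash a score in (-inf, +inf) into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Very negative scores give probabilities near 0, very positive
# scores give probabilities near 1, and a score of 0 gives 0.5.
for score in (-100, -2, 0, 2, 100):
    print(score, round(sigmoid(score), 4))
```

This makes the overfitting failure mode concrete: once the coefficients (and hence the scores) blow up, the sigmoid saturates and every prediction looks almost certain.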
So overfitting in classification, especially for logistic regression, can have devastating effects. It can yield really massive coefficients, which push the w transpose h score to be very positive or very negative, which pushes the sigmoid to be exactly 1 or exactly 0. And so not only are we overfitting, but we're really confident about our predictions. We think that this is definitely a positive review when we shouldn't be so assertive about it. So let's observe how that shows up in the data. Let's go back to the simple example that we had before, where we're fitting a classifier using just two features: the number of "awesome"s and the number of "awful"s. Let's say the coefficient of "awesome" is +1 and the coefficient of "awful" is -1, and we have an input: two "awesome"s, one "awful". If you look at the difference between the number of "awesome"s and the number of "awful"s, you get one here, because there's one more "awesome" than there are "awful"s. And so the actual score that we get is one, which means the estimated probability that the review is positive is 0.73.
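The 0.73 in this example can be checked directly by plugging the score into the sigmoid; a short sketch using the coefficients and input from the lecture:

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# Coefficients from the example: +1 for "awesome", -1 for "awful".
w_awesome, w_awful = 1.0, -1.0
n_awesome, n_awful = 2, 1  # the input review

score = w_awesome * n_awesome + w_awful * n_awful  # 2 - 1 = 1
p_positive = sigmoid(score)
print(round(p_positive, 2))  # about 0.73
```

A score of 1 lands on the gently sloped part of the sigmoid, so the model is only mildly confident, which matches intuition for a review with one more "awesome" than "awful".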
And I can live with that, you know: I have a review, two things about the restaurant were awesome, so it has a bit more than a half probability of being a positive review, but not a lot more than half. Now suppose I take the same input, nothing's changed, but I multiply the coefficients by two. I still have two "awesome"s and one "awful" as input, but the coefficients are plus two and minus two. Now the curve becomes steeper. And so if I look at the same point, where the difference between "awesome"s and "awful"s is one, and I look at my predicted probability, I've increased it tremendously. Now the probability of a positive review is about 0.88. I'm even more confident that the same exact review is positive. That doesn't seem as good, an 88% chance that it's positive. But let's push the coefficients up even more; let's say that the coefficient of "awesome" is plus six and the coefficient of "awful" is minus six. Now if I look at the same point, the same input, the same difference between "awesome"s and "awful"s, I get this pretty scary result: it says that the probability of the review being positive is 0.997. I can't trust that. Is the probability really 0.997?
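The progression 0.73, 0.88, 0.997 comes entirely from rescaling the coefficients; the same review never changes. A short sketch reproducing the three cases from the lecture:

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

n_awesome, n_awful = 2, 1  # the same review every time

# Scaling the coefficients (+1/-1, +2/-2, +6/-6) only rescales the
# score, but the sigmoid turns that into ever more extreme probabilities.
probs = {}
for c in (1.0, 2.0, 6.0):
    score = c * n_awesome + (-c) * n_awful  # = c * (2 - 1)
    probs[c] = sigmoid(score)
    print(c, round(probs[c], 3))  # 0.731, 0.881, 0.998
```

Nothing about the data justifies the extra confidence; it is purely an artifact of coefficient magnitude, which is exactly why regularization penalizes large weights.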
Is a review with two "awesome"s and one "awful" really positive with that much certainty? That doesn't make sense. So as you can see, we have the same decision boundary, still crossing at 0; the coefficients are just getting a bit bigger every time. But my estimated probability curve becomes steeper and steeper, so the predictions become more and more extreme. This is another type of overfitting that we observe in logistic regression: not only do the decision boundaries become weird and wiggly, but the estimated probabilities become close to zero and close to one. So let's go back to our data set and see how we observe the same effect right there. [MUSIC]
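The point that the decision boundary never moves can be checked directly: multiplying the coefficients by any positive constant leaves the sign of the score, and hence the predicted class, unchanged; only the reported confidence changes. A small sketch, with hypothetical example reviews:

```python
def predict_class(weights, features):
    """Predict +1 if the score w^T h is positive, -1 otherwise."""
    score = sum(w * h for w, h in zip(weights, features))
    return +1 if score > 0 else -1

# Hypothetical reviews as (n_awesome, n_awful) counts.
reviews = [(2, 1), (0, 3), (1, 4), (5, 2)]

labels_by_scale = {}
for c in (1.0, 2.0, 6.0):
    weights = [c, -c]  # scaled versions of the +1/-1 coefficients
    labels_by_scale[c] = [predict_class(weights, r) for r in reviews]
    print(c, labels_by_scale[c])
# Every scale produces the same class labels.
```

So overfitting here hurts the probability estimates long before it changes any hard classification, which is what makes it easy to miss if you only look at accuracy.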