We've made the next section optional. It's not mathematically too complex, but we don't think it's necessary for following the whole thread of today's module. However, for those interested in a little bit more detail on why overfitting happens in logistic regression and why it's so bad, we've created a few more examples for you to go through. This part really explores and explains why those parameters become really massive in logistic regression. So let's dive into it.

To understand a little bit better why overfitting happens in logistic regression, and why the parameters get really big, we need to introduce the notion of linear separability. Linearly separable data is more or less what you'd expect: there exists a line where everything on the left of the line, in this case, has Score(x) < 0, and everything on the right of the line has Score(x) > 0. So the line separates the positive examples from the negative examples.

More generally, if somebody stops you in the street and asks what it means for data to be linearly separable, we say the data is linearly separable if the following is true: for all positive examples, Score(x) = ŵᵀh(x) > 0, so the score is strictly greater than 0, and for all negative examples, Score(x) = ŵᵀh(x) < 0. What that means is that the training error is exactly 0, and this is a really important point. Again, if you ever see training error hit exactly 0, you should start getting worried, and when the data is linearly separable, that's exactly what's happening. So you might think it's a great thing that your data is linearly separable, but you should be careful, because you might be getting into an overfitting situation, especially if you have a very complex model and perfect training error.

Now, I've drawn this in two-dimensional space, but if you have D-dimensional features, let's say a thousand-dimensional features, linear separability corresponds to a hyperplane in that thousand-dimensional space that separates the positive examples from the negative examples.
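As a concrete illustration of that definition, here is a minimal sketch I'm adding (not from the lecture): the toy feature matrix H, labels y, and coefficients w are made up for illustration. Checking linear separability for a fixed set of coefficients is just a sign test on the scores.

    import numpy as np

    def separates(w, H, y):
        """True if Score(x_i) = w^T h(x_i) is > 0 for every positive example
        and < 0 for every negative one, i.e. the training error is exactly 0."""
        scores = H @ w
        return bool(np.all(np.sign(scores) == y))

    # Toy data: each row of H is (#awesome, #awful) for one review, y is +1/-1
    H = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
    y = np.array([+1, +1, -1, -1])
    w = np.array([1.0, -1.5])     # the coefficients used in the example below
    print(separates(w, H, y))     # True: these coefficients separate the toy data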
That's my space, twice, for those who didn't get it. So that's the way to think about it in high-dimensional spaces. The other little aside I'll mention, without going into the details, is that if you keep adding features, for example polynomial terms of higher and higher degree, then eventually, except for some corner cases, you're actually going to make your data linearly separable. So you might observe that with enough features your training error goes to 0, which means you fall into exactly this linearly separable case, which is a very problematic one.

To understand why linearly separable data becomes a problem for overfitting, let's look at this example over here. In this case, the line 1.0 #awesome - 1.5 #awful = 0 separates the positive examples from the negative examples. Now, what does that mean? It means that the set of points where 1.0 times the number of awesomes minus 1.5 times the number of awfuls equals zero is the boundary between the positive and the negative examples.

Now, what happens if I multiply both sides of the equation by 10? I get 10 #awesome - 15 #awful = 0. On the left side I multiplied by 10, and if I multiply 0 by 10, I still get 0. So it turns out that those bigger coefficients also separate the data in exactly the same way. And guess what? If I multiply both sides by 1 billion, I still separate the data in the same way: 1.0 billion #awesome - 1.5 billion #awful = 0 still separates the data. So whether the coefficients are small or big, we still have a separating hyperplane.
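Here is a quick sketch of that point (again my own illustration on the same toy data as above, not something from the lecture): rescaling the coefficients by any positive constant never changes the sign of the score, so every example stays on the same side of the boundary.

    import numpy as np

    H = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 2.0], [1.0, 3.0]])  # (#awesome, #awful)
    w = np.array([1.0, -1.5])

    for c in [1.0, 10.0, 1e9]:          # multiply both sides of the equation by c
        scores = H @ (c * w)            # Score(x) = c * w^T h(x)
        print(c, np.sign(scores))       # the signs, and hence the predictions, never change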
So why are we going to get pushed towards these bigger coefficients? Well, to understand that, we have to go back to the probabilities we estimate for the data. Let's pick a particular data point, one that is near the boundary, with two awesomes and one awful. I really love that review; two awesomes and one awful is one I keep coming back to. Now let's see what happens to our estimated probability in this case.

Let's start with the first set of coefficients we learned: w1 = 1.0 and w2 = -1.5. In this case my estimated probability is 1/(1 + e^-(2 × 1.0 - 1.5 × 1)) = 1/(1 + e^-0.5), which turns out to be equal to 0.62. Now, that makes sense to me: this is a point close to the boundary, so there's a 62% chance that it's a positive review, but there's still a 38% chance that it's a negative review. So I feel really good about that prediction.

However, since we're doing maximum likelihood estimation, we're pushing the probabilities towards the extremes: we're trying to learn parameters that make those probabilities as big as possible. So here's what happens when we use the second set of parameters, 10 and -15. In this case everything gets multiplied by 10, so the probability becomes 1/(1 + e^-5) instead of 1/(1 + e^-0.5), which is equal to 0.99. Wow. Now, even though the point is close to the boundary, we're 99% confident that it's a positive review. That doesn't seem quite right.

Well, let's see what happens when we use 1 billion and minus 1.5 billion. The probability becomes 1/(1 + e^-(0.5 billion)). My calculator won't compute that exactly, and probably yours can't either, but I can tell you it's basically 1. So when the coefficients become really big, the model says that a point right next to the boundary has probability 1 of being a positive review, and I don't trust that.
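Those three numbers are easy to reproduce. Here's a small sketch (my own check of the arithmetic, with the sigmoid written out explicitly) for the review with two awesomes and one awful:

    import numpy as np

    def sigmoid(score):
        # P(y = +1 | x, w) = 1 / (1 + e^(-Score(x)))
        return 1.0 / (1.0 + np.exp(-score))

    h = np.array([2.0, 1.0])                 # 2 #awesome, 1 #awful
    for w in [np.array([1.0, -1.5]),         # Score = 0.5         -> about 0.62
              np.array([10.0, -15.0]),       # Score = 5           -> about 0.99
              np.array([1e9, -1.5e9])]:      # Score = 0.5 billion -> indistinguishable from 1
        print(w, sigmoid(h @ w))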
However, maximum likelihood estimation prefers models that are more certain, and so it's going to push the coefficients towards infinity for linearly separable data, simply because it can. The coefficients get pushed larger and larger and larger until, basically, they go to infinity. So that's a really bad overfitting problem that happens in logistic regression.

So, just as a summary of this optional section, we've seen that overfitting in logistic regression can be, as I call it, twice as bad. We have the same kind of bad situation that we had when we looked at decision boundaries, and earlier in regression, where we learn a really complicated function, and you get really complex decision boundaries that overfit the data and don't generalize well. But you also have a second effect: if the data is linearly separable, and with lots of features the data becomes linearly separable or close to it, then the coefficients can get really big and eventually go to infinity. So you get these massive coefficients and massive confidence in your answers. And so you will see these two kinds of effects of overfitting with logistic regression.
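To close out this optional section, here is one more sketch of that second effect (my own demo, not part of the lecture): plain gradient ascent on the logistic regression log likelihood, run on a tiny linearly separable dataset, never settles on finite coefficients; their magnitude keeps growing the longer you train.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Tiny linearly separable dataset: rows of H are (#awesome, #awful), labels are +1/-1
    H = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])

    w = np.zeros(2)
    step = 0.5
    for it in range(1, 100001):
        # Gradient of the log likelihood: sum_i h(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w))
        gradient = H.T @ ((y == +1).astype(float) - sigmoid(H @ w))
        w += step * gradient
        if it in (1, 10, 100, 1000, 10000, 100000):
            print(it, w, np.linalg.norm(w))
    # The norm of w keeps increasing with more iterations: on linearly separable data
    # the likelihood has no finite maximizer, so nothing stops the coefficients from growing.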