[MUSIC] We've made the next section optional. It's not mathematically too complex, but we don't think it's necessary to understand the whole thread of today's module. However, for those interested in a little more detail on why over-fitting happens in logistic regression and why it's so bad, we've created a few more examples for you to go through. So this part really explores and explains why those parameters become really massive in logistic regression. So let's dive into it.

To understand a little bit better why over-fitting happens in logistic regression, and why the parameters get really big, we need to introduce the notion of linear separability. Linearly separable data is more or less what you'd expect: there exists a line here where everything on the left of the line, in this case, has Score(x) < 0 and everything on the right of the line has Score(x) > 0. So it separates the positive examples from the negative examples. More generally, if somebody stops you in the street and asks what it means for data to be linearly separable, the answer is that the data is linearly separable if the following is true: for all positive examples, Score(x), which is w-hat transpose h(x), is strictly greater than 0, and for all negative examples, Score(x) = w-hat transpose h(x) is less than 0. What that means is that the training error is exactly 0, and this is a really important point. Again, if you ever see training error getting exactly to 0, you should start getting worried, and when the data is linearly separable that's exactly what's happening. So you might think it's a great thing that your data is linearly separable, but you should be careful, because you might be getting into an over-fitting situation, especially if you have a very complex model and perfect training error.

Now, I've drawn this in two-dimensional space, but if you have D-dimensional features, let's say a thousand-dimensional features, linear separability corresponds to a hyperplane in that thousand-dimensional space that separates the positive examples from the negative examples. That's the way to think about it in high-dimensional spaces. The other little aside that I'll mention, without going into the details, is that if you keep adding features, like polynomial features of higher and higher degree, 50, 100 and so on, then eventually, except for some corner cases, you're actually going to make your data linearly separable. So you might observe that with enough features your training error goes to 0, which means you fall into exactly this linearly separable case, which is a very problematic one.

To understand why linearly separable data becomes a problem for over-fitting, let's look at this example over here. In this case, the line 1.0 #awesome - 1.5 #awful separates the positive examples from the negative examples. Now, what does that mean? It means that the line 1.0 times the number of awesomes minus 1.5 times the number of awfuls equals zero is the boundary between the positive and negative examples. Now, what happens if I multiply both sides of the equation by 10? On the left side I get 10 times the number of awesomes minus 15 times the number of awfuls, and if I multiply 0 by 10, I still get 0. So it turns out that those bigger coefficients also separate the data in exactly the same way.
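If it helps to see this concretely, here's a minimal sketch in Python. This is my own illustration, not code from the course, and the review counts are made up for the example: it just checks that rescaling the coefficient vector by any positive constant never changes the sign of Score(x) = w-hat transpose h(x), so every example stays on the same side of the boundary.

```python
# Minimal sketch (illustrative only): scaling the coefficients of a separating
# hyperplane by a positive constant leaves every classification unchanged.
import numpy as np

# Hypothetical reviews, each represented as [#awesome, #awful] counts.
h = np.array([[3.0, 0.0],   # clearly positive
              [2.0, 1.0],   # near the boundary
              [0.0, 2.0],   # clearly negative
              [1.0, 3.0]])  # clearly negative

w = np.array([1.0, -1.5])            # original coefficients
for scale in [1.0, 10.0, 100.0]:     # same hyperplane, just rescaled
    scores = h @ (scale * w)         # Score(x) for every review
    print(scale, np.sign(scores))    # identical signs at every scale
```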
And guess what? If I multiply both sides by 1 billion, I still separate the data in the same way: 1 billion times the number of awesomes minus 1.5 billion times the number of awfuls still separates the data. So whether the coefficients are small or big, we still have a separating hyperplane.

So why are we going to get pushed toward these bigger coefficients? To understand that, we have to go back to the probabilities we predict for the data. Let's pick a particular data point, one that is near the boundary, which has two awesomes and one awful. I really loved that review; the one with two awesomes and one awful is one I keep coming back to. Now let's see what happens to our estimated probability for this point when we use the first set of coefficients we learned: w1 is 1.0 and w2 is -1.5. In this case, my estimated probability is 1 over 1 plus e to the minus (2 times 1.0, which is 2, plus -1.5 times 1, which is -1.5). So this is equal to 1 over 1 plus e to the -0.5, which turns out to be 0.62. Now, that makes sense to me: this is a point close to the boundary, so there's a 62% chance that it's a positive review, but there's still a 38% chance that it's a negative review. So I feel really good about that prediction.

However, since we're doing maximum likelihood estimation, we're pushing probabilities towards the extremes; we're trying to learn parameters that make those probabilities bigger. So let's see what happens when we use the second set of parameters, 10 and -15. In this case everything is multiplied by 10, so the probability becomes 1 over 1 plus e to the -5 instead of e to the -0.5, which is equal to 0.99. Wow. Now, even though the point is close to the boundary, we're 99% confident that it's a positive review. That doesn't seem quite right. Let's see what happens when we look at 1 billion and -1.5 billion: that probability becomes 1 over 1 plus e to the minus 0.5 billion. My calculator can't compute that exactly, and probably yours can't either, but I can tell you it's basically 1. So when the coefficients become really big, the model says that this point, sitting right next to the boundary, has probability 1 of being a positive review, and I don't trust that. However, maximum likelihood estimation prefers models that are more certain, and so it's going to push the coefficients toward infinity for linearly separable data, because it can. It keeps pushing them larger and larger and larger until, basically, they go to infinity. So that's a really bad over-fitting problem that happens in logistic regression.

So, just as a summary of this optional section, we've seen that logistic regression over-fitting is what I'd call twice as bad. We have the same kind of bad situation that we had when looking at decision boundaries, and that we typically have in regression, where we learn a really complicated function and end up with really complex decision boundaries that over-fit the data and don't generalize well. But we also have a second effect: if the data is linearly separable, or if you have lots of features so the data becomes linearly separable or close to it, then the coefficients can get really big and eventually go to infinity. And so you get these massive coefficients and massive confidence about your answers. So you will see these two kinds of effects of over-fitting with logistic regression. [MUSIC]
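As a closing aside, here's a small sketch that reproduces the worked example above: the predicted probability 1 / (1 + e^(-Score(x))) for the review with two awesomes and one awful, under the three coefficient scales. Again, this is my own illustration with the assumed [#awesome, #awful] feature representation, not code from the course.

```python
# Minimal sketch (illustrative only): how the predicted probability for a point
# near the boundary is pushed toward 1 as the coefficients are scaled up.
import numpy as np

def sigmoid(score):
    """Logistic function: 1 / (1 + e^(-score))."""
    return 1.0 / (1.0 + np.exp(-score))

h = np.array([2.0, 1.0])  # h(x) = [#awesome, #awful] for this review

for w in (np.array([1.0, -1.5]),       # Score = 0.5  -> P about 0.62
          np.array([10.0, -15.0]),     # Score = 5    -> P about 0.99
          np.array([1e9, -1.5e9])):    # Score = 5e8  -> P is basically 1
    s = h @ w
    print(f"w = {w}, Score = {s:g}, P(+1) = {sigmoid(s):.4f}")
```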