[MUSIC] We've made the next section optional. It's not mathematically too complex, but we don't think it's necessary to understand the whole thread of today's module. However, for those interested in a little more detail on why over-fitting happens in logistic regression and why it's so bad, we've created a few more examples for you to go through. So this part really explores and explains why those parameters become really massive in logistic regression. So let's dive into it.

To understand a little bit better why over-fitting happens in logistic regression, and why the parameters get really big, we need to introduce the notion of linear separability. Linearly separable data is more or less what you'd expect: there exists a line here where everything on the left of the line, in this case, has Score(x) < 0 and everything on the right of the line has Score(x) > 0. So it separates the positive examples from the negative examples. More generally, if somebody stops you in the street and asks what it means for data to be linearly separable, the answer is that the data is linearly separable if the following is true: for all positive examples, Score(x), which is w-hat transpose h(x), is strictly greater than 0, and for all negative examples, Score(x) = w-hat transpose h(x) is less than 0. What that means is that the training error is exactly 0, and this is a really important point. Again, if you ever see training error getting exactly to 0, you should start getting worried, and when the data is linearly separable that's exactly what's happening. So you might think it's a great thing that your data is linearly separable, but you should be careful, because you might be getting into an over-fitting situation, especially if you have a very complex model and perfect training error.

Now, I've drawn this in two-dimensional space, but if you have D-dimensional features, let's say a thousand-dimensional features, linear separability corresponds to a hyperplane in that thousand-dimensional space that separates the positive examples from the negative examples. That's the way to think about it in high-dimensional spaces. The other little aside that I'll mention, without going into the details, is that if you keep adding features, like polynomial features of higher and higher degree, 50, 100 and so on, then eventually, except for some corner cases, you're actually going to make your data linearly separable. So you might observe that with enough features your training error goes to 0, which means you fall into exactly this linearly separable case, which is a very problematic one.

To understand why linearly separable data becomes a problem for over-fitting, let's look at this example over here. In this case, the line 1.0 #awesome - 1.5 #awful separates the positive examples from the negative examples. Now, what does that mean? It means that the line 1.0 times the number of awesomes minus 1.5 times the number of awfuls equals zero is the boundary between the positive and negative examples. Now, what happens if I multiply both sides of the equation by 10? On the left side I get 10 times the number of awesomes minus 15 times the number of awfuls, and if I multiply 0 by 10, I still get 0. So it turns out that those bigger coefficients also separate the data in exactly the same way.
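If it helps to see this concretely, here's a minimal sketch in Python. This is my own illustration, not code from the course, and the review counts are made up for the example: it just checks that rescaling the coefficient vector by any positive constant never changes the sign of Score(x) = w-hat transpose h(x), so every example stays on the same side of the boundary.

```python
# Minimal sketch (illustrative only): scaling the coefficients of a separating
# hyperplane by a positive constant leaves every classification unchanged.
import numpy as np

# Hypothetical reviews, each represented as [#awesome, #awful] counts.
h = np.array([[3.0, 0.0],   # clearly positive
              [2.0, 1.0],   # near the boundary
              [0.0, 2.0],   # clearly negative
              [1.0, 3.0]])  # clearly negative

w = np.array([1.0, -1.5])            # original coefficients
for scale in [1.0, 10.0, 100.0]:     # same hyperplane, just rescaled
    scores = h @ (scale * w)         # Score(x) for every review
    print(scale, np.sign(scores))    # identical signs at every scale
```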
And guess what? If I multiply both sides by 1 billion, I still separate the data in the same way: 1 billion times the number of awesomes minus 1.5 billion times the number of awfuls still separates the data. So whether the coefficients are small or big, we still have a separating hyperplane.

So why are we going to get pushed toward these bigger coefficients? To understand that, we have to go back to the probabilities we predict for the data. Let's pick a particular data point, one that is near the boundary, which has two awesomes and one awful. I really loved that review; the one with two awesomes and one awful is one I keep coming back to. Now let's see what happens to our estimated probability for this point when we use the first set of coefficients we learned: w1 is 1.0 and w2 is -1.5. In this case, my estimated probability is 1 over 1 plus e to the minus (2 times 1.0, which is 2, plus -1.5 times 1, which is -1.5). So this is equal to 1 over 1 plus e to the -0.5, which turns out to be 0.62. Now, that makes sense to me: this is a point close to the boundary, so there's a 62% chance that it's a positive review, but there's still a 38% chance that it's a negative review. So I feel really good about that prediction.

However, since we're doing maximum likelihood estimation, we're pushing probabilities towards the extremes; we're trying to learn parameters that make those probabilities bigger. So let's see what happens when we use the second set of parameters, 10 and -15. In this case everything is multiplied by 10, so the probability becomes 1 over 1 plus e to the -5 instead of e to the -0.5, which is equal to 0.99. Wow. Now, even though the point is close to the boundary, we're 99% confident that it's a positive review. That doesn't seem quite right. Let's see what happens when we look at 1 billion and -1.5 billion: that probability becomes 1 over 1 plus e to the minus 0.5 billion. My calculator can't compute that exactly, and probably yours can't either, but I can tell you it's basically 1. So when the coefficients become really big, the model says that this point, sitting right next to the boundary, has probability 1 of being a positive review, and I don't trust that. However, maximum likelihood estimation prefers models that are more certain, and so it's going to push the coefficients toward infinity for linearly separable data, because it can. It keeps pushing them larger and larger and larger until, basically, they go to infinity. So that's a really bad over-fitting problem that happens in logistic regression.

So, just as a summary of this optional section, we've seen that logistic regression over-fitting is what I'd call twice as bad. We have the same kind of bad situation that we had when looking at decision boundaries, and that we typically have in regression, where we learn a really complicated function and end up with really complex decision boundaries that over-fit the data and don't generalize well. But we also have a second effect: if the data is linearly separable, or if you have lots of features so the data becomes linearly separable or close to it, then the coefficients can get really big and eventually go to infinity. And so you get these massive coefficients and massive confidence about your answers. So you will see these two kinds of effects of over-fitting with logistic regression. [MUSIC]
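As a closing aside, here's a small sketch that reproduces the worked example above: the predicted probability 1 / (1 + e^(-Score(x))) for the review with two awesomes and one awful, under the three coefficient scales. Again, this is my own illustration with the assumed [#awesome, #awful] feature representation, not code from the course.

```python
# Minimal sketch (illustrative only): how the predicted probability for a point
# near the boundary is pushed toward 1 as the coefficients are scaled up.
import numpy as np

def sigmoid(score):
    """Logistic function: 1 / (1 + e^(-score))."""
    return 1.0 / (1.0 + np.exp(-score))

h = np.array([2.0, 1.0])  # h(x) = [#awesome, #awful] for this review

for w in (np.array([1.0, -1.5]),       # Score = 0.5  -> P about 0.62
          np.array([10.0, -15.0]),     # Score = 5    -> P about 0.99
          np.array([1e9, -1.5e9])):    # Score = 5e8  -> P is basically 1
    s = h @ w
    print(f"w = {w}, Score = {s:g}, P(+1) = {sigmoid(s):.4f}")
```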