[MUSIC] Next, let's see what happens if we use degree 6 features to fit a logistic regression classifier on the same data set. So now our features go all the way up to x1 to the 6th and x2 to the 6th. That's a lot more features, and a lot more coefficients to be learned from data. Now, if I take this data set and fit a logistic regression classifier, I get the following decision boundary. It fits the training data extremely well. If you look very carefully, it actually gets zero training error, which should be a warning sign for you, by the way, as we mentioned. And if you look at the decision boundary, it's extremely complicated, extremely complex. Some people might use a technical term for it: a crazy [LAUGH] decision boundary. So even though it has zero training error, it has some weird artifacts. For example, I'm highlighting here a region in space where, even though everything around it is predicted positive, right in the middle of that circle the model thinks the score should be less than zero, so for that region we're saying that y hat should be -1. And that doesn't make any sense to me, because right around it every point is +1. Why should I expect the points in the middle there to be -1? The data doesn't support it at all. And in fact, if you look at the magnitude of the coefficients, they're starting to get large. The natural parabola had coefficients of around 1 or 0.5. Now we're getting coefficients on the order of 42 or more, which are 10 to 40 times bigger than the ones we had before. And that is an early warning sign of overfitting, as we discussed in the regression class.

Now, let's take that one step further and fit a logistic regression model that uses polynomial features of degree 20. So this goes all the way up to x1 to the power of 20 and x2 to the power of 20, so really, really high-order polynomials. If you look at the boundary that we learned, I mean, come on. I'd say this one is truly crazy. It's really pretty complicated: it gets all the data right, but it's highly unsmooth. And if you look at the learned weights, the coefficients are on the order of 3,000, 4,000, minus 2,000; they're much, much bigger than the ones in that simple parabola that we learned. It gets all the training data right, but it's clearly overfitting, and it's clearly outputting very large estimated coefficients. And so we're going to watch those coefficients very carefully as we try to avoid overfitting.

So the notion of overfitting in classification is very similar to that in regression, except that the error is now measured in terms of classification error. There might be some parameters that we learned here, w hat, which seem to do very well on the training data, maybe even with these crazy boundaries, while there are some other coefficients, w*, that would have done much better in terms of true error. And the question is, how do we push our learning process to be more like w* than like w hat? And we'll do that by pushing the parameters to be not as massive, not as huge, pushing them towards zero, as we did with regularization. [MUSIC]
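To make the coefficient blow-up concrete, here is a minimal sketch of the idea, assuming scikit-learn and a made-up two-feature toy data set (the course's actual data, plotting code, and exact coefficient values are not shown here):

```python
# Sketch only: expand two features up to degree 6 and fit an (effectively)
# unregularized logistic regression, then inspect coefficient magnitudes.
# The data set below is invented for illustration.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))                  # two raw features: x1, x2
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)    # toy labels

# All monomials of x1, x2 up to degree 6 (x1^6, x2^6, cross terms, ...)
poly = PolynomialFeatures(degree=6, include_bias=False)
X_poly = poly.fit_transform(X)

# A huge C makes the L2 penalty negligible, i.e. essentially no regularization.
# These toy labels are separable in the expanded feature space, so the fitted
# weights grow very large, which is the warning sign discussed above.
clf = LogisticRegression(C=1e10, max_iter=10_000)
clf.fit(X_poly, y)

print("training accuracy:", clf.score(X_poly, y))
print("largest |coefficient|:", np.abs(clf.coef_).max())
```

Bumping `degree` from 6 to 20 makes the expanded feature set much larger and, without regularization, drives the learned weights to even more extreme values, mirroring the degree-20 example in the lecture.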