[MUSIC]

Next, let's see what happens if we use degree 6 features to fit a logistic regression classifier on the same data set. So now our features go all the way up to x1 to the 6th and x2 to the 6th. That's a lot more features, and a lot more coefficients to be learned from the data. Now, if I take this data set and fit a logistic regression classifier, I get the following decision boundary.

It fits the training data extremely well. If you look very carefully, it actually gets zero training error, which should be a warning sign for you, by the way, as we mentioned. And if you look at the decision boundary, it's extremely complicated, extremely complex. Some people might say, here's a technical term for it: a crazy [LAUGH] decision boundary.

So even though it has zero training error, it has some weird artifacts to it. For example, I'm highlighting here a region in space where, even though everything around it is predicted positive, right in the middle of that circle the classifier thinks the score should be less than zero, so for that region we're saying that y hat should be -1. And that doesn't make any sense to me, because right around it every point is +1. Why should I expect the points in the middle there to be -1? The data doesn't support it at all.

And in fact, if you look at the magnitude of the coefficients, they're starting to get large. The natural parabola had coefficients of around 1 or 0.5; now we're getting coefficients on the order of 42 or more, which are 10 to 40 times bigger than the ones we had before. And that is an early warning sign of overfitting, as we discussed in the regression course.

Now, let's take that one step further and fit a logistic regression model that uses polynomial features of degree 20. So this goes all the way up to x1 to the power of 20 and x2 to the power of 20, so really, really high-order polynomials. If you look at the boundary that we learn, I mean, come on. I'd say this one is truly crazy. It's really pretty complicated; it gets all the data right, but it's highly unsmooth.
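Before going on to the learned weights, here is a minimal sketch of the kind of experiment described so far: expand x1 and x2 into high-degree polynomial features, fit an essentially unregularized logistic regression, and watch the coefficient magnitudes grow. The course's own tools are not shown in this clip, so the sketch assumes scikit-learn and a toy two-class data set (make_circles) as a stand-in for the lecture's data.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy two-class 2D data set (an assumption, standing in for the lecture's data).
X, y = make_circles(n_samples=100, noise=0.2, factor=0.5, random_state=0)

for degree in (2, 6, 20):
    # Expand (x1, x2) into all polynomial terms up to the given degree.
    Phi = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X)

    # A very large C means almost no regularization, mimicking a plain fit.
    clf = LogisticRegression(C=1e9, max_iter=100_000).fit(Phi, y)

    print(f"degree {degree:2d}: "
          f"train accuracy = {clf.score(Phi, y):.2f}, "
          f"max |w| = {np.abs(clf.coef_).max():.1f}")
```

As the degree grows, the training accuracy typically approaches 1 while the largest coefficient magnitudes grow sharply, which is exactly the warning sign described above.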
And if you look at the learned weights, the coefficients are on the order of 3,000, 4,000, minus 2,000; they're much, much bigger than those of the simple parabola we learned. It gets all the training data right, but it's clearly overfitting, and it's clearly outputting an estimated polynomial with very large coefficients. And so, we're going to watch those coefficients very carefully and keep them in mind as we try to avoid overfitting.

So the notion of overfitting in classification is very similar to that in regression, except that the error is now measured in terms of classification error. There might be some set of parameters that we learn here, w hat, which seems to do very well on the training data, maybe even giving these crazy boundaries, while there is some other set of parameters, some other coefficients w*, that would have done much better in terms of true error. And the question is: how do we push our learning process to be closer to w* than to w hat? And we'll do that by pushing the parameters to not be as massive, not as huge, pushing them towards zero, as we did with regularization.

[MUSIC]
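As a follow-on to that closing remark about pushing the parameters towards zero, here is a similarly hedged sketch of the idea: refit the degree-20 model with an L2 penalty of increasing strength. The data set and library are again assumptions rather than anything shown in the lecture; in scikit-learn, a smaller C means a stronger penalty.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Same assumed toy data and degree-20 feature expansion as before.
X, y = make_circles(n_samples=100, noise=0.2, factor=0.5, random_state=0)
Phi = PolynomialFeatures(degree=20, include_bias=False).fit_transform(X)

# Smaller C = stronger L2 penalty = coefficients pushed harder towards zero.
for C in (1e9, 1.0, 0.01):
    clf = LogisticRegression(penalty="l2", C=C, max_iter=100_000).fit(Phi, y)
    print(f"C = {C:g}: "
          f"train accuracy = {clf.score(Phi, y):.2f}, "
          f"max |w| = {np.abs(clf.coef_).max():.1f}")
```

Stronger regularization typically shrinks the largest coefficients dramatically, trading a little training accuracy for a smoother, more plausible decision boundary, which is what the regularization approach mentioned above aims for.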