We've made the next section optional. It's not mathematically too complex, but we don't think it's necessary for following the whole thread of today's module. However, for those interested in a little bit more detail on why overfitting happens in logistic regression and why it's so bad, we've created a few more examples for you to go through. This part really explores and explains why those parameters become really massive in logistic regression. So let's dive into it.

To understand a little bit better why overfitting happens in logistic regression, and why the parameters get really big, we need to introduce the notion of linear separability. Linearly separable data is more or less what you'd expect: there exists a line where everything on the left of the line, in this case, has Score(x) < 0, and everything on the right of the line has Score(x) > 0. So the line separates the positive examples from the negative examples.

More generally, if somebody stops you in the street and asks what it means for data to be linearly separable, we say the data is linearly separable if the following is true: for all positive examples, Score(x) = ŵᵀh(x) > 0, so the score is strictly greater than 0, and for all negative examples, Score(x) = ŵᵀh(x) < 0. What that means is that the training error is exactly 0, and this is a really important point. Again, if you ever see training error hit exactly 0, you should start getting worried, and when the data is linearly separable, that's exactly what's happening. So you might think it's a great thing that your data is linearly separable, but you should be careful, because you might be getting into an overfitting situation, especially if you have a very complex model and perfect training error.

Now, I've drawn this in two-dimensional space, but if you have D-dimensional features, let's say a thousand-dimensional features, linear separability corresponds to a hyperplane in that thousand-dimensional space that separates the positive examples from the negative examples.
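As a concrete illustration of that definition, here is a minimal sketch I'm adding (not from the lecture): the toy feature matrix H, labels y, and coefficients w are made up for illustration. Checking linear separability for a fixed set of coefficients is just a sign test on the scores.

    import numpy as np

    def separates(w, H, y):
        """True if Score(x_i) = w^T h(x_i) is > 0 for every positive example
        and < 0 for every negative one, i.e. the training error is exactly 0."""
        scores = H @ w
        return bool(np.all(np.sign(scores) == y))

    # Toy data: each row of H is (#awesome, #awful) for one review, y is +1/-1
    H = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
    y = np.array([+1, +1, -1, -1])
    w = np.array([1.0, -1.5])     # the coefficients used in the example below
    print(separates(w, H, y))     # True: these coefficients separate the toy data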
That's my space, twice, for those who didn't get it. So that's the way to think about it in high-dimensional spaces. The other little aside I'll mention, without going into the details, is that if you keep adding features, for example polynomial terms of higher and higher degree, then eventually, except for some corner cases, you're actually going to make your data linearly separable. So you might observe that with enough features your training error goes to 0, which means you fall into exactly this linearly separable case, which is a very problematic one.

To understand why linearly separable data becomes a problem for overfitting, let's look at this example over here. In this case, the line 1.0 #awesome - 1.5 #awful = 0 separates the positive examples from the negative examples. Now, what does that mean? It means that the set of points where 1.0 times the number of awesomes minus 1.5 times the number of awfuls equals zero is the boundary between the positive and the negative examples.

Now, what happens if I multiply both sides of the equation by 10? I get 10 #awesome - 15 #awful = 0. On the left side I multiplied by 10, and if I multiply 0 by 10, I still get 0. So it turns out that those bigger coefficients also separate the data in exactly the same way. And guess what? If I multiply both sides by 1 billion, I still separate the data in the same way: 1.0 billion #awesome - 1.5 billion #awful = 0 still separates the data. So whether the coefficients are small or big, we still have a separating hyperplane.
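Here is a quick sketch of that point (again my own illustration on the same toy data as above, not something from the lecture): rescaling the coefficients by any positive constant never changes the sign of the score, so every example stays on the same side of the boundary.

    import numpy as np

    H = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 2.0], [1.0, 3.0]])  # (#awesome, #awful)
    w = np.array([1.0, -1.5])

    for c in [1.0, 10.0, 1e9]:          # multiply both sides of the equation by c
        scores = H @ (c * w)            # Score(x) = c * w^T h(x)
        print(c, np.sign(scores))       # the signs, and hence the predictions, never change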
So why are we going to get pushed towards these bigger coefficients? Well, to understand that, we have to go back to the probabilities we estimate for the data. Let's pick a particular data point, one that is near the boundary, with two awesomes and one awful. I really love that review; two awesomes and one awful is one I keep coming back to. Now let's see what happens to our estimated probability in this case.

Let's start with the first set of coefficients we learned: w1 = 1.0 and w2 = -1.5. In this case my estimated probability is 1/(1 + e^-(2 × 1.0 - 1.5 × 1)) = 1/(1 + e^-0.5), which turns out to be equal to 0.62. Now, that makes sense to me: this is a point close to the boundary, so there's a 62% chance that it's a positive review, but there's still a 38% chance that it's a negative review. So I feel really good about that prediction.

However, since we're doing maximum likelihood estimation, we're pushing the probabilities towards the extremes: we're trying to learn parameters that make those probabilities as big as possible. So here's what happens when we use the second set of parameters, 10 and -15. In this case everything gets multiplied by 10, so the probability becomes 1/(1 + e^-5) instead of 1/(1 + e^-0.5), which is equal to 0.99. Wow. Now, even though the point is close to the boundary, we're 99% confident that it's a positive review. That doesn't seem quite right.

Well, let's see what happens when we use 1 billion and minus 1.5 billion. The probability becomes 1/(1 + e^-(0.5 billion)). My calculator won't compute that exactly, and probably yours can't either, but I can tell you it's basically 1. So when the coefficients become really big, the model says that a point right next to the boundary has probability 1 of being a positive review, and I don't trust that.
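Those three numbers are easy to reproduce. Here's a small sketch (my own check of the arithmetic, with the sigmoid written out explicitly) for the review with two awesomes and one awful:

    import numpy as np

    def sigmoid(score):
        # P(y = +1 | x, w) = 1 / (1 + e^(-Score(x)))
        return 1.0 / (1.0 + np.exp(-score))

    h = np.array([2.0, 1.0])                 # 2 #awesome, 1 #awful
    for w in [np.array([1.0, -1.5]),         # Score = 0.5         -> about 0.62
              np.array([10.0, -15.0]),       # Score = 5           -> about 0.99
              np.array([1e9, -1.5e9])]:      # Score = 0.5 billion -> indistinguishable from 1
        print(w, sigmoid(h @ w))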
However, maximum likelihood estimation prefers models that are more certain, and so it's going to push the coefficients towards infinity for linearly separable data, simply because it can. The coefficients get pushed larger and larger and larger until, basically, they go to infinity. So that's a really bad overfitting problem that happens in logistic regression.

So, just as a summary of this optional section, we've seen that overfitting in logistic regression can be, as I call it, twice as bad. We have the same kind of bad situation that we had when we looked at decision boundaries, and earlier in regression, where we learn a really complicated function, and you get really complex decision boundaries that overfit the data and don't generalize well. But you also have a second effect: if the data is linearly separable, and with lots of features the data becomes linearly separable or close to it, then the coefficients can get really big and eventually go to infinity. So you get these massive coefficients and massive confidence in your answers. And so you will see these two kinds of effects of overfitting with logistic regression.
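To close out this optional section, here is one more sketch of that second effect (my own demo, not part of the lecture): plain gradient ascent on the logistic regression log likelihood, run on a tiny linearly separable dataset, never settles on finite coefficients; their magnitude keeps growing the longer you train.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Tiny linearly separable dataset: rows of H are (#awesome, #awful), labels are +1/-1
    H = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])

    w = np.zeros(2)
    step = 0.5
    for it in range(1, 100001):
        # Gradient of the log likelihood: sum_i h(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w))
        gradient = H.T @ ((y == +1).astype(float) - sigmoid(H @ w))
        w += step * gradient
        if it in (1, 10, 100, 1000, 10000, 100000):
            print(it, w, np.linalg.norm(w))
    # The norm of w keeps increasing with more iterations: on linearly separable data
    # the likelihood has no finite maximizer, so nothing stops the coefficients from growing.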