[MUSIC] In classification, the concept of overfitting can be even stranger than it is in regression, because here we're not just predicting a particular value, like the price of a house, or even just whether a review is positive or negative; we're often asking probabilistic questions, like: what is the probability that this review is positive? So let's see what overfitting means with respect to estimating probabilities. As you remember from the previous modules, we talked about the relationship between the score on the data, that's w transpose h, which ranges from minus infinity to plus infinity, and the actual estimate for probabilities: the probability that y equals +1, let's say, given the input x and w, is the sigmoid applied to the score w transpose h. So that's the model we're working with, and if you remember, as we overfit we see the w's becoming bigger and bigger. These coefficients keep growing, which means that w transpose h becomes huge, and that pushes us, when the score is massively positive, to say that the probability is exactly one, pretty confident it's a one, or, when it's massively negative, confident it's a zero.
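The score-to-probability mapping described above can be sketched in a couple of lines; this is a minimal illustration of the sigmoid, not code from the course:

```python
import math

def sigmoid(score):
    """Squash a score in (-inf, +inf) into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Very negative scores give probabilities near 0, very positive
# scores give probabilities near 1, and a score of 0 gives 0.5.
for score in (-100, -2, 0, 2, 100):
    print(score, round(sigmoid(score), 4))
```

This makes the overfitting failure mode concrete: once the coefficients (and hence the scores) blow up, the sigmoid saturates and every prediction looks almost certain.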
So overfitting in classification, especially for logistic regression, can have devastating effects. It can yield really massive coefficients, which push the w transpose h score to be very positive or very negative, which pushes the sigmoid to be exactly 1 or exactly 0. And so not only are we overfitting, but we're really confident about our predictions. We think that this is definitely a positive review when we shouldn't be so assertive about it. So let's observe how that shows up in the data. Let's go back to the simple example that we had before, where we're fitting a classifier using just two features: the number of "awesome"s and the number of "awful"s. Let's say the coefficient of "awesome" is +1 and the coefficient of "awful" is -1, and we have an input: two "awesome"s, one "awful". If you look at the difference between the number of "awesome"s and the number of "awful"s, you get one here, because there's one more "awesome" than there are "awful"s. And so the actual score that we get is one, which means the estimated probability that the review is positive is 0.73.
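The 0.73 in this example can be checked directly by plugging the score into the sigmoid; a short sketch using the coefficients and input from the lecture:

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# Coefficients from the example: +1 for "awesome", -1 for "awful".
w_awesome, w_awful = 1.0, -1.0
n_awesome, n_awful = 2, 1  # the input review

score = w_awesome * n_awesome + w_awful * n_awful  # 2 - 1 = 1
p_positive = sigmoid(score)
print(round(p_positive, 2))  # about 0.73
```

A score of 1 lands on the gently sloped part of the sigmoid, so the model is only mildly confident, which matches intuition for a review with one more "awesome" than "awful".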
And I can live with that, you know: I have a review, two things about the restaurant were awesome, so it has a bit more than a half probability of being a positive review, but not a lot more than half. Now suppose I take the same input, nothing's changed, but I multiply the coefficients by two. I still have two "awesome"s and one "awful" as input, but the coefficients are plus two and minus two. Now the curve becomes steeper. And so if I look at the same point, where the difference between "awesome"s and "awful"s is one, and I look at my predicted probability, I've increased it tremendously. Now the probability of a positive review is about 0.88. I'm even more confident that the same exact review is positive. That doesn't seem as good, an 88% chance that it's positive. But let's push the coefficients up even more; let's say that the coefficient of "awesome" is plus six and the coefficient of "awful" is minus six. Now if I look at the same point, the same input, the same difference between "awesome"s and "awful"s, I get this pretty scary result: it says that the probability of the review being positive is 0.997. I can't trust that. Is the probability really 0.997?
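The progression 0.73, 0.88, 0.997 comes entirely from rescaling the coefficients; the same review never changes. A short sketch reproducing the three cases from the lecture:

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

n_awesome, n_awful = 2, 1  # the same review every time

# Scaling the coefficients (+1/-1, +2/-2, +6/-6) only rescales the
# score, but the sigmoid turns that into ever more extreme probabilities.
probs = {}
for c in (1.0, 2.0, 6.0):
    score = c * n_awesome + (-c) * n_awful  # = c * (2 - 1)
    probs[c] = sigmoid(score)
    print(c, round(probs[c], 3))  # 0.731, 0.881, 0.998
```

Nothing about the data justifies the extra confidence; it is purely an artifact of coefficient magnitude, which is exactly why regularization penalizes large weights.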
Is a review with two "awesome"s and one "awful" really positive with that much certainty? That doesn't make sense. So as you can see, we have the same decision boundary, still crossing at 0; the coefficients are just getting a bit bigger every time. But my estimated probability curve becomes steeper and steeper, so the predictions become more and more extreme. This is another type of overfitting that we observe in logistic regression: not only do the decision boundaries become weird and wiggly, but the estimated probabilities become close to zero and close to one. So let's go back to our data set and see how we observe the same effect right there. [MUSIC]
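The point that the decision boundary never moves can be checked directly: multiplying the coefficients by any positive constant leaves the sign of the score, and hence the predicted class, unchanged; only the reported confidence changes. A small sketch, with hypothetical example reviews:

```python
def predict_class(weights, features):
    """Predict +1 if the score w^T h is positive, -1 otherwise."""
    score = sum(w * h for w, h in zip(weights, features))
    return +1 if score > 0 else -1

# Hypothetical reviews as (n_awesome, n_awful) counts.
reviews = [(2, 1), (0, 3), (1, 4), (5, 2)]

labels_by_scale = {}
for c in (1.0, 2.0, 6.0):
    weights = [c, -c]  # scaled versions of the +1/-1 coefficients
    labels_by_scale[c] = [predict_class(weights, r) for r in reviews]
    print(c, labels_by_scale[c])
# Every scale produces the same class labels.
```

So overfitting here hurts the probability estimates long before it changes any hard classification, which is what makes it easy to miss if you only look at accuracy.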