1
00:00:00,000 --> 00:00:03,600
I've noticed that almost all the really good machine learning

2
00:00:03,600 --> 00:00:07,890
practitioners tend to be very sophisticated in their understanding of bias and variance.

3
00:00:07,890 --> 00:00:12,330
Bias and variance is one of those concepts that's easily learned but difficult to master.

4
00:00:12,330 --> 00:00:16,155
Even if you think you've seen the basic concepts of bias and variance,

5
00:00:16,155 --> 00:00:18,805
there's often more nuance to it than you'd expect.

6
00:00:18,805 --> 00:00:20,620
In the deep learning era,

7
00:00:20,620 --> 00:00:22,650
another trend is that there's been

8
00:00:22,650 --> 00:00:26,035
less discussion of what's called the bias-variance trade-off.

9
00:00:26,035 --> 00:00:28,657
You might have heard this thing called the bias-variance trade-off.

10
00:00:28,657 --> 00:00:31,330
But in the deep learning era, there's less of a trade-off,

11
00:00:31,330 --> 00:00:32,520
so we still talk about bias,

12
00:00:32,520 --> 00:00:33,865
we still talk about variance,

13
00:00:33,865 --> 00:00:37,295
but we just talk less about the bias-variance trade-off.

14
00:00:37,295 --> 00:00:39,795
Let's see what this means.

15
00:00:39,795 --> 00:00:42,750
Let's say you have a data set that looks like this.

16
00:00:42,750 --> 00:00:44,800
If you fit a straight line to the data,

17
00:00:44,800 --> 00:00:48,130
maybe you get a logistic regression fit to that.

18
00:00:48,130 --> 00:00:50,415
This is not a very good fit to the data.

19
00:00:50,415 --> 00:00:52,380
And so this is a case of high bias,

20
00:00:52,380 --> 00:00:56,400
or we say that this is underfitting the data.

21
00:00:56,400 --> 00:00:57,850
On the opposite end,

22
00:00:57,850 --> 00:01:00,645
if you fit an incredibly complex classifier,

23
00:01:00,645 --> 00:01:02,640
maybe a deep neural network,

24
00:01:02,640 --> 00:01:05,995
or a neural network with a lot of hidden units,

25
00:01:05,995 --> 00:01:10,255
maybe you can fit the data perfectly,

26
00:01:10,255 --> 00:01:12,220
but that doesn't look like a great fit either.

27
00:01:12,220 --> 00:01:17,535
So this is a classifier with high variance, and this is overfitting the data.

28
00:01:17,535 --> 00:01:19,650
And there might be some classifier in between,

29
00:01:19,650 --> 00:01:22,070
with a medium level of complexity,

30
00:01:22,070 --> 00:01:24,585
that maybe fits it correctly like that.

31
00:01:24,585 --> 00:01:27,300
That looks like a much more reasonable fit to the data,

32
00:01:27,300 --> 00:01:31,705
so we call that just right. It's somewhere in between.

33
00:01:31,705 --> 00:01:34,260
So in a 2D example like this,

34
00:01:34,260 --> 00:01:35,610
with just two features,

35
00:01:35,610 --> 00:01:39,700
x1 and x2, you can plot the data and visualize bias and variance.

36
00:01:39,700 --> 00:01:41,250
In high dimensional problems,

37
00:01:41,250 --> 00:01:44,735
you can't plot the data and visualize the decision boundary.

38
00:01:44,735 --> 00:01:46,830
Instead, there are a couple of different metrics

39
00:01:46,830 --> 00:01:49,750
that we'll look at, to try to understand bias and variance.

40
00:01:49,750 --> 00:01:51,960
So continuing our example of cat picture classification,

41
00:01:51,960 --> 00:01:57,600
where that's a positive example and that's a negative example,

42
00:01:57,600 --> 00:02:01,455
the two key numbers to look at to understand bias and variance will be

43
00:02:01,455 --> 00:02:06,415
the train set error and the dev set or the development set error.

44
00:02:06,415 --> 00:02:07,716
So for the sake of argument,

45
00:02:07,716 --> 00:02:10,290
let's say that recognizing cats in pictures

46
00:02:10,290 --> 00:02:13,860
is something that people can do nearly perfectly, right?

47
00:02:13,860 --> 00:02:22,155
So let's say, your training set error is 1% and your dev set error is,

48
00:02:22,155 --> 00:02:23,580
for the sake of argument,

49
00:02:23,580 --> 00:02:25,585
let's say is 11%.

50
00:02:25,585 --> 00:02:26,730
So in this example,

51
00:02:26,730 --> 00:02:29,495
you're doing very well on the training set,

52
00:02:29,495 --> 00:02:34,355
but you're doing relatively poorly on the development set.

53
00:02:34,355 --> 00:02:38,215
So this looks like you might have overfit the training set,

54
00:02:38,215 --> 00:02:40,620
that somehow you're not generalizing well,

55
00:02:40,620 --> 00:02:43,815
to this hold-out cross-validation set, the development set.

56
00:02:43,815 --> 00:02:46,440
And so if you have an example like this,

57
00:02:46,440 --> 00:02:50,785
we would say this has high variance.

58
00:02:50,785 --> 00:02:54,240
So by looking at the training set error and the development set error,

59
00:02:54,240 --> 00:02:59,730
you would be able to render a diagnosis of your algorithm having high variance.

60
00:02:59,730 --> 00:03:04,435
Now, let's say that you measure your training set error and your dev set error,

61
00:03:04,435 --> 00:03:06,095
and you get a different result.

62
00:03:06,095 --> 00:03:09,610
Let's say, that your training set error is 15%.

63
00:03:09,610 --> 00:03:12,375
I'm writing your training set error in the top row,

64
00:03:12,375 --> 00:03:15,880
and your dev set error is 16%.

65
00:03:15,880 --> 00:03:22,870
In this case, assuming that humans achieve roughly 0% error,

66
00:03:22,870 --> 00:03:27,451
that humans can look at these pictures and just tell if it's a cat or not,

67
00:03:27,451 --> 00:03:31,600
then it looks like the algorithm is not even doing very well on the training set.

68
00:03:31,600 --> 00:03:35,380
So if it's not even fitting the training data that well,

69
00:03:35,380 --> 00:03:38,220
then this is underfitting the data.

70
00:03:38,220 --> 00:03:40,895
And so this algorithm has high bias.

71
00:03:40,895 --> 00:03:45,390
But in contrast, it's actually generalizing at a reasonable level to the dev set,

72
00:03:45,390 --> 00:03:49,365
where performance on the dev set is only 1% worse than performance on the training set.

73
00:03:49,365 --> 00:03:52,355
So this algorithm has a problem of high bias,

74
00:03:52,355 --> 00:03:56,325
because it was not even fitting the training set.

75
00:03:56,325 --> 00:04:01,030
So this is similar to the leftmost plot we had on the previous slide.

76
00:04:01,030 --> 00:04:03,329
Now, here's another example.

77
00:04:03,329 --> 00:04:06,430
Let's say that you have 15% training set error,

78
00:04:06,430 --> 00:04:08,127
so that's pretty high bias,

79
00:04:08,127 --> 00:04:11,446
but when you evaluate on the dev set it does even worse,

80
00:04:11,446 --> 00:04:13,450
maybe it gets 30% error.

81
00:04:13,450 --> 00:04:18,175
In this case, I would diagnose this algorithm as having high bias,

82
00:04:18,175 --> 00:04:23,780
because it's not doing that well on the training set, and high variance.

83
00:04:23,780 --> 00:04:27,950
So this has really the worst of both worlds.

84
00:04:27,950 --> 00:04:29,325
And one last example,

85
00:04:29,325 --> 00:04:32,605
if you have 0.5% training set error,

86
00:04:32,605 --> 00:04:35,145
and 1% dev set error,

87
00:04:35,145 --> 00:04:37,130
then maybe your users are quite happy

88
00:04:37,130 --> 00:04:39,850
that you have a cat classifier with only 1% error,

89
00:04:39,850 --> 00:04:44,340
then this would have low bias and low variance.
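The four examples above can be sketched as a small rule of thumb in Python. This is only an illustrative sketch: the function name `diagnose` and the 5% thresholds are assumptions made for the example, not values from the lecture, and it assumes the optimal error is close to 0% and that the training and dev sets come from the same distribution.

```python
def diagnose(train_error, dev_error, bias_threshold=0.05, variance_threshold=0.05):
    """Rough bias/variance diagnosis from two error rates (given as fractions).

    Assumptions (hypothetical, for illustration only):
    - the optimal error is close to 0%;
    - the training and dev sets come from the same distribution;
    - the 5% thresholds are arbitrary cut-offs, not taken from the lecture.
    """
    # High training error suggests the model is underfitting (bias problem).
    high_bias = train_error > bias_threshold
    # A large jump from training error to dev error suggests overfitting (variance problem).
    high_variance = (dev_error - train_error) > variance_threshold

    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"


# The four (training error, dev error) pairs discussed in the lecture.
examples = [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]
for train_error, dev_error in examples:
    print(f"train={train_error:.1%} dev={dev_error:.1%}: {diagnose(train_error, dev_error)}")
```

In practice you would pick the thresholds based on what error level is acceptable for your application, rather than using fixed numbers like these.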

90
00:04:44,340 --> 00:04:47,610
One subtlety, that I'll just briefly mention that

91
00:04:47,610 --> 00:04:50,955
we'll leave to a later video to discuss in detail,

92
00:04:50,955 --> 00:04:54,188
is that this analysis is predicated on the assumption,

93
00:04:54,188 --> 00:04:59,115
that human level performance gets nearly 0% error or,

94
00:04:59,115 --> 00:05:01,749
more generally, that the optimal error,

95
00:05:01,749 --> 00:05:04,225
sometimes called the Bayes error,

96
00:05:04,225 --> 00:05:10,355
so the Bayes optimal error, is nearly 0%.

97
00:05:10,355 --> 00:05:13,565
I don't want to go into detail on this in this particular video,

98
00:05:13,565 --> 00:05:18,070
but it turns out that if the optimal error or the Bayes error were much higher, say,

99
00:05:18,070 --> 00:05:22,360
it were 15%, then if you look at this classifier,

100
00:05:22,360 --> 00:05:25,460
15% is actually perfectly reasonable for the training set, and you

101
00:05:25,460 --> 00:05:30,120
wouldn't see it as high bias; it also has pretty low variance.
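One way to picture this subtlety is to measure the training error against the optimal error instead of against 0%. The sketch below is an assumption about how that comparison could be written; the lecture leaves the details to a later video, and the function name `error_gaps` and its `optimal_error` parameter are hypothetical.

```python
def error_gaps(train_error, dev_error, optimal_error=0.0):
    """Return (bias_gap, variance_gap) as fractions.

    Hypothetical helper for illustration:
    - bias_gap compares training error with the assumed optimal error;
    - variance_gap compares dev error with training error.
    """
    bias_gap = train_error - optimal_error
    variance_gap = dev_error - train_error
    return bias_gap, variance_gap


# Same classifier (15% training error, 16% dev error), two different baselines.
for optimal_error in (0.0, 0.15):
    bias_gap, variance_gap = error_gaps(0.15, 0.16, optimal_error)
    print(f"optimal={optimal_error:.0%}: bias gap={bias_gap:.1%}, variance gap={variance_gap:.1%}")
```

With an optimal error of 0% the bias gap is large, but with an optimal error of 15% the same classifier shows almost no bias gap, which matches the point made in the lecture.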

102
00:05:30,120 --> 00:05:33,440
So the case of how to analyze bias and variance,

103
00:05:33,440 --> 00:05:37,460
when no classifier can do very well, for example,

104
00:05:37,460 --> 00:05:40,833
if you have really blurry images,

105
00:05:40,833 --> 00:05:46,081
so that even a human or just no system could possibly do very well,

106
00:05:46,081 --> 00:05:49,237
then maybe the Bayes error is much higher,

107
00:05:49,237 --> 00:05:52,295
and then there are some details of how this analysis will change.

108
00:05:52,295 --> 00:05:54,725
But leaving aside this subtlety for now,

109
00:05:54,725 --> 00:05:57,430
the takeaway is that by looking at

110
00:05:57,430 --> 00:06:02,676
your training set error you can get a sense of how well you are fitting,

111
00:06:02,676 --> 00:06:04,331
at least the training data,

112
00:06:04,331 --> 00:06:06,770
and so that tells you if you have a bias problem.

113
00:06:06,770 --> 00:06:10,190
And then looking at how much higher your error goes,

114
00:06:10,190 --> 00:06:12,965
when you go from the training set to the dev set,

115
00:06:12,965 --> 00:06:17,055
that should give you a sense of how bad is the variance problem,

116
00:06:17,055 --> 00:06:20,857
so how good a job you're doing generalizing from the training set to the dev set,

117
00:06:20,857 --> 00:06:22,645
that gives you a sense of your variance.

118
00:06:22,645 --> 00:06:26,210
All this is under the assumption that the Bayes error is quite

119
00:06:26,210 --> 00:06:30,235
small and that your training and your dev sets are drawn from the same distribution.

120
00:06:30,235 --> 00:06:32,210
If those assumptions are violated,

121
00:06:32,210 --> 00:06:34,323
there's a more sophisticated analysis you could do,

122
00:06:34,323 --> 00:06:36,510
which we'll talk about in a later video.

123
00:06:36,510 --> 00:06:38,780
Now, on the previous slide,

124
00:06:38,780 --> 00:06:40,849
you saw what high bias,

125
00:06:40,849 --> 00:06:42,185
high variance looks like,

126
00:06:42,185 --> 00:06:44,920
and I guess you have a sense of what a good classifier can look like.

127
00:06:44,920 --> 00:06:48,110
What does high bias and high variance look like?

128
00:06:48,110 --> 00:06:50,535
This is kind of the worst of both worlds.

129
00:06:50,535 --> 00:06:53,415
So you remember, we said that a classifier like this,

130
00:06:53,415 --> 00:06:55,755
a linear classifier, has high bias,

131
00:06:55,755 --> 00:06:58,185
because it underfits the data.

132
00:06:58,185 --> 00:07:04,030
So this would be a classifier that is mostly linear and therefore underfits the data,

133
00:07:04,030 --> 00:07:05,570
I'm drawing this in purple.

134
00:07:05,570 --> 00:07:09,546
But if somehow your classifier does some weird things,

135
00:07:09,546 --> 00:07:14,460
then it is actually overfitting parts of the data as well.

136
00:07:14,460 --> 00:07:16,500
So the classifier that I drew in purple,

137
00:07:16,500 --> 00:07:19,455
has both high bias and high variance.

138
00:07:19,455 --> 00:07:21,300
It has high bias because,

139
00:07:21,300 --> 00:07:23,325
by being a mostly linear classifier,

140
00:07:23,325 --> 00:07:24,875
it's just not fitting,

141
00:07:24,875 --> 00:07:28,466
you know, this quadratic shape that well,

142
00:07:28,466 --> 00:07:31,200
but by having too much flexibility in the middle,

143
00:07:31,200 --> 00:07:32,995
it somehow gets this example

144
00:07:32,995 --> 00:07:36,720
and this example, and overfits those two examples as well.

145
00:07:36,720 --> 00:07:40,515
So this classifier kind of has high bias because it was mostly linear,

146
00:07:40,515 --> 00:07:43,620
but you need maybe a curved function or a quadratic function.

147
00:07:43,620 --> 00:07:45,115
And it has high variance,

148
00:07:45,115 --> 00:07:49,595
because it had too much flexibility to fit those two mislabeled,

149
00:07:49,595 --> 00:07:52,475
or outlier, examples in the middle as well.

150
00:07:52,475 --> 00:07:54,300
In case this seems contrived, well,

151
00:07:54,300 --> 00:07:57,585
this example is a little bit contrived in two dimensions,

152
00:07:57,585 --> 00:07:59,883
but with very high-dimensional inputs,

153
00:07:59,883 --> 00:08:01,795
you actually do get things with

154
00:08:01,795 --> 00:08:04,800
high bias in some regions and high variance in some regions,

155
00:08:04,800 --> 00:08:07,410
and so it is possible to get classifiers like this

156
00:08:07,410 --> 00:08:11,415
in high dimensional inputs that seem less contrived.

157
00:08:11,415 --> 00:08:15,690
So to summarize, you've seen how by looking at your algorithm's error on

158
00:08:15,690 --> 00:08:20,550
the training set and your algorithm's error on the dev set, you can try to diagnose

159
00:08:20,550 --> 00:08:23,413
whether it has problems of high bias or high variance,

160
00:08:23,413 --> 00:08:25,420
or maybe both, or maybe neither.

161
00:08:25,420 --> 00:08:28,995
And depending on whether your algorithm suffers from bias or variance,

162
00:08:28,995 --> 00:08:31,765
it turns out that there are different things you could try.

163
00:08:31,765 --> 00:08:33,840
So in the next video, I want to present to you

164
00:08:33,840 --> 00:08:37,390
what I call a basic recipe for machine learning,

165
00:08:37,390 --> 00:08:40,905
that lets you more systematically try to improve your algorithm,

166
00:08:40,905 --> 00:08:44,370
depending on whether it has high bias or high variance issues.

167
00:08:44,370 --> 00:08:46,110
So let's go on to the next video.