If you run a learning algorithm and it doesn't do as well as you were hoping, almost all the time it will be because you have either a high bias problem or a high variance problem; in other words, either an underfitting problem or an overfitting problem. In this case it's very important to figure out which of these two problems you actually have: bias, variance, or a bit of both. Knowing which of these is happening gives a very strong indicator of the useful and promising ways to try to improve your algorithm. In this video, I would like to delve more deeply into this bias and variance issue, understand it better, and figure out how to look at a learning algorithm and evaluate whether or not we might have a bias problem or a variance problem, since this will be critical to figuring out how to improve the performance of the learning algorithms that you implement. You've already seen this figure a few times: if you fit too simple a hypothesis, like a straight line, it underfits the data; if you fit too complex a hypothesis, it might fit the training set perfectly but overfit the data; and a hypothesis of some intermediate level of complexity, maybe a degree-two polynomial, not too low and not too high a degree, is just right and gives you the best generalization error out of these options. Now that we're armed with the notion of training, validation, and test sets, we can understand the concepts of bias and variance a little better. Concretely, let our training error and cross validation error be defined as in the previous videos: say, the average squared error as measured on the training set or as measured on the cross validation set.
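To make those two definitions concrete, here is a minimal sketch in Python with NumPy; theta, X_train, y_train, X_cv, and y_cv are placeholder names for the model parameters and the two data splits, assumed to be defined elsewhere:

    import numpy as np

    def squared_error(theta, X, y):
        # Average squared error: J(theta) = (1 / (2m)) * sum((X @ theta - y)^2)
        m = len(y)
        residuals = X @ theta - y
        return (residuals @ residuals) / (2 * m)

    # J_train is this quantity measured on the training set, J_cv is the same
    # quantity measured on the cross validation set:
    # j_train = squared_error(theta, X_train, y_train)
    # j_cv    = squared_error(theta, X_cv, y_cv)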
Now let's plot the following figure. On the horizontal axis I am going to plot the degree of polynomial, so as I go to the right I'm going to be fitting higher and higher order polynomials. On the left of this figure, where maybe d equals 1, we're going to be fitting very simple functions, whereas toward the right, where maybe d equals 4 or even larger, I'm going to be fitting very complex, high-order polynomials. So larger values on the horizontal axis, that is, a much higher degree polynomial, correspond to fitting much more complex functions to your training set. Let's look at the training error and the cross validation error and plot them on this figure. We'll start with the training error. As we increase the degree of the polynomial, we're going to fit our training set better and better. So if d equals 1, that corresponds to a high training error, whereas if we have a very high degree polynomial, our training error is going to be really low, maybe even zero, because it will fit the training set really well. And so as we increase the degree of the polynomial, we typically find that the training error decreases. I'm going to write J subscript train of theta there, because our training error tends to decrease with the degree of the polynomial that we fit to the data. Next, let's look at the cross validation error; for that matter, if we look at the test set error, we'll get a pretty similar result to plotting the cross validation error. We know that if d equals 1, we're fitting a very simple function, so we may be underfitting the training set, and we're going to get a very high cross validation error.
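As a rough illustration of that trend, the sketch below (again Python/NumPy, with x_train and y_train as placeholder one-dimensional arrays) fits a degree-d polynomial to the training set and reports its training error; run over increasing d, the value it returns typically shrinks:

    import numpy as np

    def j_train_for_degree(d, x_train, y_train):
        # Fit a degree-d polynomial to the training data only...
        coeffs = np.polyfit(x_train, y_train, deg=d)
        # ...then measure the average squared error on that same training data.
        return np.mean((np.polyval(coeffs, x_train) - y_train) ** 2) / 2

    # Typically this sequence shrinks as d grows, since a higher degree
    # polynomial can fit the training points more and more closely:
    # for d in range(1, 11):
    #     print(d, j_train_for_degree(d, x_train, y_train))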
If we fit an intermediate degree polynomial, like d equals 2 in our example from the previous slide, we're going to have a much lower cross validation error, because we're finding a much better fit to the data. Conversely, if d were too high, so if d took on, say, a value of four, then we're again overfitting and so we end up with a high value for the cross validation error. So if you were to vary d smoothly and plot a curve, you might end up with a curve like that, where that's J_cv of theta; and again, if you plot J_test of theta you get something very similar. This sort of plot also helps us better understand the notions of bias and variance. Concretely, suppose you have applied a learning algorithm and it is not performing as well as you were hoping, so your cross validation set error or your test set error is high. How can we figure out if the learning algorithm is suffering from high bias or from high variance? The setting of the cross validation error being high corresponds to either this regime or this regime. The regime on the left corresponds to a high bias problem, that is, if you are fitting an overly low order polynomial, such as d equals 1, when we really needed a higher order polynomial to fit the data. Whereas in contrast, this regime on the right corresponds to a high variance problem, that is, if d, the degree of the polynomial, was too large for the data set that we have. And this figure gives us a clue for how to distinguish between these two cases.
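Putting the two curves together, here is one hedged way to reproduce this kind of plot; the synthetic data, the train/validation split, and the degree range below are arbitrary choices made purely for illustration, and the degree with the lowest J_cv tends to land in the "just right" middle region:

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic data purely for illustration: a noisy sine curve split into
    # training and cross validation halves.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 40)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
    x_train, y_train = x[::2], y[::2]
    x_cv, y_cv = x[1::2], y[1::2]

    degrees = list(range(1, 9))
    j_train_list, j_cv_list = [], []
    for d in degrees:
        coeffs = np.polyfit(x_train, y_train, deg=d)   # fit on the training set only
        j_train_list.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2) / 2)
        j_cv_list.append(np.mean((np.polyval(coeffs, x_cv) - y_cv) ** 2) / 2)

    best_d = degrees[int(np.argmin(j_cv_list))]        # degree with the lowest J_cv
    print("degree with the lowest cross validation error:", best_d)

    plt.plot(degrees, j_train_list, label="J_train (tends to decrease with d)")
    plt.plot(degrees, j_cv_list, label="J_cv (tends to be U-shaped)")
    plt.xlabel("degree of polynomial d")
    plt.ylabel("average squared error")
    plt.legend()
    plt.show()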
Concretely, for the high bias case, that is, the case of underfitting, what we find is that both the cross validation error and the training error are going to be high. So if your algorithm is suffering from a bias problem, the training set error will be high, and you may find that the cross validation error is also high; it might be close to, maybe just slightly higher than, the training error. If you see this combination, that's a sign that your algorithm may be suffering from high bias. In contrast, if your algorithm is suffering from high variance, then looking over here, we notice that J train, that is, the training error, is going to be low; that is, you're fitting the training set very well. Whereas your error on the cross validation set (assuming it's, say, the squared error we're trying to minimize) will be much bigger than your training set error. This double greater than sign here means "much greater than"; it's the math symbol for much greater than, written with two greater than signs. And so if you see this combination of values, that is a clue that your learning algorithm may be suffering from high variance and might be overfitting. The key that distinguishes these two cases is this: if you have a high bias problem, your training set error will also be high, because your hypothesis is just not fitting the training set well; and if you have a high variance problem, your training set error will usually be low, that is, much lower than the cross validation error.
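As a rule-of-thumb sketch, that diagnosis might look like the following; the threshold and the ratio are arbitrary illustrative values, since what counts as "high" or "much greater" depends entirely on the problem and the error metric:

    def diagnose(j_train, j_cv, high_error=1.0, gap_ratio=2.0):
        # high_error and gap_ratio are illustrative placeholders, not fixed rules.
        if j_train >= high_error:
            return "likely high bias (underfitting): the training error itself is high"
        if j_cv > gap_ratio * j_train:
            return "likely high variance (overfitting): J_cv is much greater than J_train"
        return "no strong bias/variance signal from these two numbers alone"

    # Example calls:
    # diagnose(j_train=2.0, j_cv=2.2)   -> suggests high bias
    # diagnose(j_train=0.05, j_cv=0.9)  -> suggests high variance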
So, hopefully that gives you a somewhat better understanding of the two problems of bias and variance. I still have a lot more to say about bias and variance in the next few videos, and I'll show you even more details on how to diagnose them there. But what we will see is that by figuring out whether a learning algorithm may be suffering from high bias, high variance, or a combination of both, we get much better guidance for what might be promising things to try in order to improve the performance of the learning algorithm.