Estimating the bias and variance of your learning algorithm really helps you prioritize what to work on next. But the way you analyze bias and variance changes when your training set comes from a different distribution than your dev and test sets. Let's see how.

Let's keep using our cat classification example, and let's say humans get near-perfect performance on this. So Bayes error, or Bayes optimal error, we know is nearly 0% on this problem. To carry out error analysis, you usually look at the training error and also at the error on the dev set. So let's say, in this example, that your training error is 1% and your dev error is 10%. If your dev data came from the same distribution as your training set, you would say that you have a large variance problem: your algorithm is just not generalizing well from the training set, which it's doing well on, to the dev set, which it's suddenly doing much worse on. But in the setting where your training data and your dev data come from different distributions, you can no longer safely draw this conclusion. In particular, maybe the algorithm is actually doing just fine on the dev set; it's just that the training set was really easy, because it was high-resolution, very clear images, and the dev set is just much harder. So maybe there isn't a variance problem, and this just reflects that the dev set contains images that are much more difficult to classify accurately.

The problem with this analysis is that when you went from the training error to the dev error, two things changed at the same time. One, the algorithm saw the data in the training set but not in the dev set. Two, the distribution of data in the dev set is different. And because you changed two things at once, it's difficult to know, of this 9% increase in error, how much is because the algorithm didn't see the data in the dev set (that's the variance part of the problem) and how much is because the dev set data is just different.

In order to tease out these two effects (and if you didn't totally follow what these two effects are, don't worry, we'll go over them again in a second), it will be useful to define a new piece of data which we'll call the training-dev set.
So this is a new subset of data that we carve out, which should have the same distribution as the training set, but which you don't explicitly train your network on. Here's what I mean. Previously we had set up some training sets, some dev sets, and some test sets. The dev and test sets have the same distribution, but the training set has a different distribution. What we're going to do is randomly shuffle the training set and then carve out just a piece of it to be the training-dev set. So just as the dev and test set have the same distribution, the training set and the training-dev set also have the same distribution. But the difference is that now you train your neural network just on the training set proper; you won't run backpropagation on the training-dev portion of the data.

To carry out error analysis, what you should now do is look at the error of your classifier on the training set, on the training-dev set, and on the dev set. So let's say, in this example, that your training error is 1%, the error on the training-dev set is 9%, and the error on the dev set is 10%, same as before. What you can conclude from this is that when you went from training data to training-dev data, the error went up a lot. And the only difference between the training data and the training-dev data is that your neural network got to see the first part; it was trained explicitly on it, but it wasn't trained on the training-dev data. So this tells you that you have a variance problem, because the training-dev error was measured on data that comes from the same distribution as your training set. You know that even though your neural network does well on the training set, it's just not generalizing well to data from that same distribution that it hadn't seen before. So in this example we really have a variance problem.
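As a concrete illustration, here is a minimal sketch of how you might carve out a training-dev set in Python/NumPy. The function name, array names, and 5% fraction are my own choices for illustration, not something fixed by the lecture:

```python
import numpy as np

def make_training_dev_split(X_train, y_train, train_dev_fraction=0.05, seed=0):
    """Carve a training-dev set out of the training data.

    The training-dev set has the same distribution as the training set,
    but the network is never trained on it; it is only used to measure
    how well the model generalizes within the training distribution.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_train))            # randomly shuffle the training set
    n_dev = int(len(X_train) * train_dev_fraction) # size of the carved-out slice

    train_dev_idx, train_idx = idx[:n_dev], idx[n_dev:]
    X_train_dev, y_train_dev = X_train[train_dev_idx], y_train[train_dev_idx]
    X_train_proper, y_train_proper = X_train[train_idx], y_train[train_idx]

    # Train only on the "training set proper"; evaluate (but never backprop)
    # on the training-dev slice.
    return (X_train_proper, y_train_proper), (X_train_dev, y_train_dev)
```

The fraction here is arbitrary; the only requirement is that this slice shares the training distribution and is never used for gradient updates.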
Let's look at a different example. Let's say the training error is 1% and the training-dev error is 1.5%, but when you go to the dev set your error is 10%. So now you actually have a pretty low variance problem, because when you went from training data that the network has seen to training-dev data that it has not seen, the error increased only a little bit, but then it really jumps when you go to the dev set. This is a data mismatch problem, because your learning algorithm was not trained explicitly on data from either training-dev or dev, but these two data sets come from different distributions. Whatever the algorithm is learning, it works great on training-dev but it doesn't work well on dev. So somehow your algorithm has learned to do well on a different distribution than the one you really care about; we call that a data mismatch problem.

Let's look at a few more examples. I'll write these on the next row, since I'm running out of space on top: training error, training-dev error, and dev error. Let's say that training error is 10%, training-dev error is 11%, and dev error is 12%. Remember that our human-level proxy for Bayes error is roughly 0%. So if you have this type of performance, then you really have a bias problem, an avoidable bias problem, because you're doing much worse than human level. This is really a high bias setting.

And one last example: if your training error is 10%, your training-dev error is 11%, and your dev error is 20%, then it looks like this actually has two issues. One, the avoidable bias is quite high, because you're not even doing that well on the training set: humans get nearly 0% error, but you're getting 10% error on your training set. The variance here seems quite small, but the data mismatch is quite large. So for this example I would say you have a large bias, or avoidable bias, problem as well as a data mismatch problem.

So let's take what we've done on this slide and write out the general principles. The key quantities I would look at are human-level error, your training set error, and your training-dev set error. That last one is on the same distribution as the training set, but you didn't train explicitly on it.
The fourth quantity is your dev set error. Depending on the differences between these errors, you can get a sense of how big the avoidable bias, the variance, and the data mismatch problems are. So let's say that human-level error is 4%, your training error is 7%, your training-dev error is 10%, and your dev error is 12%. The gap between human level and training error gives you a sense of the avoidable bias, because you'd like your algorithm to do at least as well as, or approach, human-level performance on the training set. The gap between training and training-dev error gives you a sense of the variance: how well do you generalize from the training set to the training-dev set? And the gap between training-dev and dev error gives you a sense of how much of a data mismatch problem you have.

Technically you could also add one more thing, which is the test set performance; we'll write that as test error. You shouldn't be doing development on your test set, because you don't want to overfit your test set. But if you also look at it, then the gap between dev error and test error tells you the degree of overfitting to the dev set. If there's a huge gap between your dev set performance and your test set performance, it means you may have overtuned to the dev set, and so maybe you need to find a bigger dev set. Remember that your dev set and your test set come from the same distribution, so the only way for there to be a huge gap here, for the algorithm to do much better on the dev set than on the test set, is if you somehow managed to overfit the dev set. If that's the case, what you might consider doing is going back and just getting more dev set data.

Now, I've written these numbers so that, as you go down the list, they keep going up, but they don't always go up. Here's one example: maybe human-level performance is 4%, training error is 7%, and training-dev error is 10%, but when you go to the dev set you find that you actually, surprisingly, do much better; maybe dev error is 6%, and test error is 6% as well. I have seen effects like this, working for example on a speech recognition task where the training data turned out to be much harder than the dev and test sets. The training and training-dev errors were evaluated on your training set distribution, while the dev and test errors were evaluated on your dev/test set distribution. So sometimes, if your dev/test set distribution is much easier for whatever application you're working on, these numbers can actually go down.
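To make the arithmetic concrete, here is a small, hypothetical helper (the function and key names are my own, not from the lecture) that turns these four or five error numbers into the corresponding gaps:

```python
def diagnose_errors(human_level, train_err, train_dev_err, dev_err, test_err=None):
    """Decompose error numbers (in percent) into the gaps discussed above."""
    report = {
        "avoidable_bias": train_err - human_level,   # human level -> training error
        "variance": train_dev_err - train_err,       # training -> training-dev error
        "data_mismatch": dev_err - train_dev_err,    # training-dev -> dev error
    }
    if test_err is not None:
        report["overfitting_to_dev"] = test_err - dev_err  # dev -> test error
    return report

# Example from the slide: human 4%, train 7%, train-dev 10%, dev 12%
print(diagnose_errors(4, 7, 10, 12))
# {'avoidable_bias': 3, 'variance': 3, 'data_mismatch': 2}
```

In the speech example just mentioned, the data-mismatch entry would come out negative (6% minus 10%), which is itself a useful signal that the dev/test distribution is easier than the training distribution.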
If you see funny things like this, there's an even more general formulation of this analysis that might be helpful. Let me quickly explain it on the next slide, and let me motivate it using the speech-activated rearview mirror example. It turns out that the numbers we've been writing down can be placed into a table where, on the horizontal axis, I'm going to place different data sets. For example, you might have data from your general speech recognition task: a bunch of data you collected from a lot of speech recognition problems you've worked on, from smart speakers, data you purchased, and so on. And then you also have the rearview mirror-specific speech data, recorded inside the car. So on the x-axis of the table I'm going to vary the data set. On the other axis, I'm going to label different ways, or algorithms, for examining the data. First, there's human-level performance: how accurate are humans on each of these data sets? Then there is the error on examples that your neural network has trained on. And finally, there's the error on examples that your neural network has not trained on.

It turns out that what we were calling human level on the previous slide is the number that goes in the first box: how well do humans do on this category of data, say data from all sorts of speech recognition tasks, the utterances that you put into your training set. In the example on the previous slide, this was 4%. The next number down was the training error, which in the example on the previous slide was 7%: if your learning algorithm has seen an example and performed gradient descent on it, and that example came from your training set distribution, or some general speech recognition distribution, how well does your algorithm do on it?
Then there is the training-dev set error. It's usually a bit higher: for data from that same general speech recognition distribution that your algorithm did not train explicitly on, how well does it do? That's what we call the training-dev error. And if you move over to the right, the next box is the dev set error, or maybe also the test set error, which was 6% in the example just now. Dev and test error are technically two numbers, but either one could go into this box. This is data from your rearview mirror application, actually recorded in the car, that your neural network did not perform backpropagation on: what is the error there?

So what we were doing in the analysis on the previous slide was looking at the differences between these pairs of numbers. The gap between human level and training error is a measure of avoidable bias, the gap between training error and training-dev error is a measure of variance, and the gap between training-dev error and dev/test error is a measure of data mismatch.

It turns out that it can also be useful to fill in the remaining two entries in this table. One is human-level error on the rearview mirror data: you ask some humans to label rearview mirror speech data and measure how good they are at this task; maybe this turns out to be 6%. The other is the error on rearview mirror examples that the network has trained on: you take some rearview mirror speech data, put it in the training set so the neural network learns on it, and then measure the error on that subset of the data. If that also comes out around 6%, then you're actually already performing at the level of humans on this rearview mirror speech data, so maybe you're doing quite well on that distribution.

When you do this more general analysis, it doesn't always give you one clear path forward, but sometimes it gives you additional insights. For example, comparing human-level performance on the two data sets tells us that, for humans, the rearview mirror speech data is actually harder than general speech recognition, because humans get 6% error rather than 4% error. And looking at the other differences as well may help you understand the bias, variance, and data mismatch problems to different degrees.
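As a sketch, here's one way you might lay out this table in code so the relevant differences are easy to read off. The labels and numbers below are the hypothetical ones from this example, and the layout is just one reasonable choice:

```python
# Rows = how the data is examined, columns = which distribution the data comes from.
table = {
    "human level":           {"general speech": 0.04, "rearview mirror": 0.06},
    "error, trained on":     {"general speech": 0.07, "rearview mirror": 0.06},
    "error, not trained on": {"general speech": 0.10, "rearview mirror": 0.06},
}

columns = ["general speech", "rearview mirror"]
print(f"{'':<24}" + "".join(f"{c:>18}" for c in columns))
for row_name, row in table.items():
    print(f"{row_name:<24}" + "".join(f"{row[c]:>18.0%}" for c in columns))

# Reading off the gaps in the "general speech" column:
#   avoidable bias = 7% - 4%,  variance = 10% - 7%
# and across the bottom row: data mismatch = 6% - 10% (negative in this example,
# since the model does better on rearview data it never trained on than on
# held-out general speech data).
```

Keeping all six numbers in one place makes it easier to spot patterns like humans finding the rearview mirror data harder (6% versus 4%) even though the model already matches them on it.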
This more general formulation is something I've used a few times, though not that often; for a lot of problems, you find that examining the subset of entries we discussed, looking at the avoidable bias, variance, and data mismatch differences, is enough to point you in a pretty promising direction. But sometimes filling out the whole table can give you additional insights.

Finally, we've previously talked a lot about ideas for addressing bias, and we've talked about techniques for addressing variance, but how do you address data mismatch? What we've seen is that training on data that comes from a different distribution than your dev and test sets can get you a lot more data and therefore help your learning algorithm's performance. But instead of just having bias and variance as two potential problems, you now have this third potential problem, data mismatch. So what if you perform error analysis and conclude that data mismatch is a huge source of error; how do you go about addressing that? I'll be honest and say that unfortunately there aren't great, or at least not very systematic, ways to address data mismatch, but there are a few things you can try that could help. Let's take a look at them in the next video.