Estimating the bias and variance of your learning algorithm really helps you prioritize what to work on next. But the way you analyze bias and variance changes when your training set comes from a different distribution than your dev and test sets. Let's see how.

Let's keep using our cat classification example, and let's say humans get near-perfect performance on this. So Bayes error, or Bayes optimal error, is nearly 0% on this problem. To carry out error analysis, you usually look at the training error and also the error on the dev set. So let's say, in this example, that your training error is 1% and your dev error is 10%. If your dev data came from the same distribution as your training set, you would say that you have a large variance problem: your algorithm is just not generalizing well from the training set, which it's doing well on, to the dev set, which it's suddenly doing much worse on. But in the setting where your training data and your dev data come from different distributions, you can no longer safely draw this conclusion. In particular, maybe the algorithm is doing just fine on the dev set, and it's just that the training set was easy because it was high-resolution, very clear images, while the dev set is much harder. So maybe there isn't a variance problem at all, and this just reflects that the dev set contains images that are much more difficult to classify accurately.

The problem with this analysis is that when you went from the training error to the dev error, two things changed at the same time. One, the algorithm saw the data in the training set but not the data in the dev set. Two, the distribution of the data in the dev set is different. Because you changed two things at once, it's difficult to know how much of this 9% increase in error is because the algorithm didn't see the data in the dev set, which is the variance part of the problem, and how much is because the dev set data is just different.

In order to tease out these two effects (and if you didn't totally follow what these two different effects are, don't worry, we will go over it again in a second), it will be useful to define a new piece of data which we'll call the training-dev set. This is a new subset of data which we carve out, which has the same distribution as the training set, but which you don't explicitly train your network on. Here's what I mean. Previously we had set up a training set, a dev set, and a test set, where the dev and test sets have the same distribution but the training set has a different distribution. What we're going to do is randomly shuffle the training set and then carve out a piece of it to be the training-dev set. So just as the dev and test sets have the same distribution, the training set and the training-dev set also have the same distribution. The difference is that now you train your neural network just on the training set proper; you won't run backpropagation on the training-dev portion of this data.
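As a concrete illustration, here is a minimal sketch of how you might carve out a training-dev set, assuming your training examples live in NumPy arrays X_train and y_train (hypothetical names, not something defined in the course):

```python
import numpy as np

def carve_out_training_dev(X_train, y_train, frac=0.1, seed=0):
    """Randomly shuffle the training set and hold out a slice as the
    training-dev set. The held-out slice has the same distribution as
    the training data, but the network is never trained on it."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_train))
    n_held_out = int(frac * len(X_train))
    train_dev_idx, train_idx = idx[:n_held_out], idx[n_held_out:]
    return (X_train[train_idx], y_train[train_idx],          # train on this
            X_train[train_dev_idx], y_train[train_dev_idx])  # only evaluate on this
```

The important design choice is that the held-out slice comes from the training distribution, so any gap between training error and training-dev error reflects generalization to unseen data rather than a change of distribution.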
To carry out error analysis, what you should now do is look at the error of your classifier on the training set, on the training-dev set, as well as on the dev set. So let's say in this example that your training error is 1%, the error on the training-dev set is 9%, and the error on the dev set is 10%, same as before. What you can conclude from this is that when you went from the training data to the training-dev data, the error really went up a lot. The only difference between the training data and the training-dev data is that your neural network got to see the first part: it was trained explicitly on the training data, but it was not trained on the training-dev data. This tells you that you have a variance problem, because the training-dev error was measured on data that comes from the same distribution as your training set. Even though your neural network does well on the training set, it's just not generalizing well to data from that same distribution that it hadn't seen before. So in this example we really do have a variance problem.

Let's look at a different example. Say the training error is 1% and the training-dev error is 1.5%, but when you go to the dev set the error is 10%. Now you actually have a pretty low variance problem, because when you went from training data that the network has seen to training-dev data that it has not seen, the error increased only a little bit, but it really jumps when you go to the dev set. So this is a data mismatch problem, because your learning algorithm was not trained explicitly on either the training-dev data or the dev data, but these two data sets come from different distributions. Whatever the algorithm has learned works great on the training-dev set but doesn't work well on the dev set. Somehow your algorithm has learned to do well on a different distribution than the one you really care about, so we call that a data mismatch problem.

Let's look at a few more examples; I'll write these on the next row since I'm running out of space on top: training error, training-dev error, and dev error. Say training error is 10%, training-dev error is 11%, and dev error is 12%. Remember that the human-level proxy for Bayes error is roughly 0%. If you have this type of performance, then you really have a bias problem, an avoidable bias problem, because you're doing much worse than human level. So this is really a high-bias setting.

And one last example: if your training error is 10%, your training-dev error is 11%, and your dev error is 20%, then it looks like this actually has two issues. One, the avoidable bias is quite high, because you're not even doing that well on the training set: humans get nearly 0% error, but you're getting 10% error on your training set. The variance here seems quite small, but the data mismatch is quite large. So for this example I would say you have a large bias, or avoidable bias, problem as well as a data mismatch problem.

So let's take what we've done on this slide and write out the general principles. The key quantities I would look at are human-level error, your training set error, your training-dev set error (same distribution as the training set, but you didn't train explicitly on it), and your dev set error. Depending on the differences between these errors, you can get a sense of how big the avoidable bias, variance, and data mismatch problems are. So let's say that human-level error is 4%, your training error is 7%, your training-dev error is 10%, and your dev error is 12%. The gap between human-level error and training error gives you a sense of the avoidable bias, because you'd like your algorithm to do at least as well as, or approach, human-level performance on the training set. The gap between training error and training-dev error gives you a sense of the variance: how well do you generalize from the training set to the training-dev set? And the gap between training-dev error and dev error gives you a sense of how much of a data mismatch problem you have.
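To make the bookkeeping concrete, here is a minimal sketch, in plain Python with hypothetical names, of how you might compute these three gaps from the four error numbers (this is just arithmetic on the quantities defined above, not part of the course materials):

```python
def diagnose(human_level, train_err, train_dev_err, dev_err):
    """Split the gap between training error and dev error into the three
    quantities discussed above. All arguments are error rates (0.07 = 7%)."""
    return {
        "avoidable bias": round(train_err - human_level, 4),   # training vs. human level
        "variance":       round(train_dev_err - train_err, 4), # same distribution, unseen data
        "data mismatch":  round(dev_err - train_dev_err, 4),   # different distribution
    }

# The running example: human level 4%, training 7%, training-dev 10%, dev 12%.
print(diagnose(0.04, 0.07, 0.10, 0.12))

# Two of the earlier cat-classifier examples, with human level taken as ~0%:
print(diagnose(0.00, 0.01, 0.015, 0.10))  # mostly a data mismatch problem
print(diagnose(0.00, 0.10, 0.11, 0.20))   # large avoidable bias plus data mismatch
```

Reading off which of the three numbers dominates is exactly the judgment we made by eye in the examples above.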
Technically, you could also add one more quantity, which is the test set performance; call it the test error. You shouldn't be doing development on your test set, because you don't want to overfit your test set. But if you also look at it, then the gap between dev error and test error tells you the degree of overfitting to the dev set. If there's a huge gap between your dev set performance and your test set performance, it means you may have over-tuned to the dev set, and maybe you need to find a bigger dev set. Remember that your dev set and your test set come from the same distribution, so the only way for there to be a huge gap here, for the algorithm to do much better on the dev set than the test set, is if you somehow managed to overfit the dev set. If that's the case, what you might consider doing is going back and getting more dev set data.

Now, the numbers I've written keep going up as you go down the list, but they don't always have to. Here's one example where they don't: maybe human-level performance is 4%, training error is 7%, training-dev error is 10%, but when you go to the dev set you find that you actually, surprisingly, do much better, maybe 6%, and 6% on the test set as well. We have seen effects like this, for example on a speech recognition task where the training data turned out to be much harder than the dev set and test set. The first two numbers were evaluated on your training set distribution, and the last two were evaluated on your dev/test set distribution. So if your dev/test set distribution is much easier for whatever application you're working on, these numbers can actually go down. If you see funny things like this, there's an even more general formulation of this analysis that might be helpful. Let me quickly explain that on the next slide.

Let me motivate this using the speech-activated rear-view mirror example. It turns out that the numbers we've been writing down can be placed into a table where, on the horizontal axis, I'm going to place different data sets. For example, you might have data from your general speech recognition task: a bunch of data you've collected from a lot of speech recognition problems you've worked on, from smart speakers, data you have purchased, and so on. And then you also have the rear-view mirror specific speech data, recorded inside the car. So on the x-axis of the table, I vary the data set. On the other axis, I'm going to label different ways of examining the data. First, there's human-level performance: how accurate are humans on each of these data sets? Then there's the error on examples that your neural network has trained on. And finally, there's the error on examples that your neural network has not trained on. It turns out that what we were calling human-level error on the previous slide is the number that goes in the first box: how well do humans do on this category of data, say data from all sorts of speech recognition tasks, the thousands of utterances that you put into your training set. In the example on the previous slide, this was 4%. The next entry down in the left column is the training error,
which in the example on the previous slide was 7%. That is, if your learning algorithm has seen an example and performed gradient descent on it, and the example came from your training set distribution, the general speech recognition distribution, how well does your algorithm do on it? Below that is the training-dev set error, which is usually a bit higher: for data from this same general speech recognition distribution, if your algorithm did not train explicitly on those examples, how well does it do? That's what we call the training-dev error. Then, if you move over to the right column, the next box is the dev set error, or maybe also the test set error, which was 6% in the example just now. Dev error and test error are technically two different numbers, but either one could go into this box. This is the case where you have data from the rear-view mirror application, actually recorded in the car, but your neural network did not perform backpropagation on those examples; what is the error?

What we were doing in the analysis on the previous slide was looking at the differences between these pairs of numbers: the gap between human-level error and training error is a measure of avoidable bias, the gap between training error and training-dev error is a measure of variance, and the gap between training-dev error and dev/test error is a measure of data mismatch.

It turns out that it can be useful to also fill in the remaining two entries of this table. Suppose human-level error on the rear-view mirror speech data also turns out to be 6%; the way you get this number is you ask some humans to label the rear-view mirror speech data and measure how well they do on it. And suppose the error on rear-view mirror examples your network has trained on also turns out to be 6%; the way you get that is you take some rear-view mirror speech data, put it in the training set so the neural network learns on it as well, and then measure the error on that subset of the data. If this is what you get, then you're actually already performing at the level of humans on the rear-view mirror speech data, so maybe you're doing quite well on that distribution.

Doing this fuller analysis doesn't always give you one clear path forward, but sometimes it gives you additional insight. For example, comparing the two human-level numbers in this case tells us that the rear-view mirror speech data is actually harder for humans than general speech recognition, because humans get 6% error rather than 4% error. And looking at the other differences can also help you understand the bias, variance, and data mismatch problems to different degrees. This more general formulation is something I've used a few times, though not that often; for a lot of problems, you'll find that examining just the subset of entries we started with, looking at those three differences, is enough to point you in a pretty promising direction. But sometimes filling out the whole table can give you additional insights.
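Here is a small sketch of what that full table might look like in code, using the example numbers from this slide (the rear-view mirror entries are the hypothetical 6% values discussed above, and the row and column labels are just illustrative names):

```python
# Rows: how the data is examined. Columns: which distribution it comes from.
table = {
    ("human level",           "general speech"):   0.04,
    ("error, trained on",     "general speech"):   0.07,  # training error
    ("error, not trained on", "general speech"):   0.10,  # training-dev error
    ("human level",           "rear-view mirror"): 0.06,
    ("error, trained on",     "rear-view mirror"): 0.06,
    ("error, not trained on", "rear-view mirror"): 0.06,  # dev/test error
}

rows = ["human level", "error, trained on", "error, not trained on"]
cols = ["general speech", "rear-view mirror"]

# Print the table with one column per data distribution.
print(f"{'':24s}" + "".join(f"{c:>18s}" for c in cols))
for r in rows:
    print(f"{r:24s}" + "".join(f"{table[(r, c)]:>18.0%}" for c in cols))

# The extra entries add one new piece of information: humans find the
# rear-view mirror data harder than general speech data (6% vs. 4%), so part
# of the dev error reflects a harder distribution, not a failing of the model.
print("human-level gap between distributions:",
      f"{table[('human level', 'rear-view mirror')] - table[('human level', 'general speech')]:.0%}")
```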
Finally, we've previously talked a lot about ideas for addressing bias and about techniques for addressing variance, but how do you address data mismatch? What we've seen is that by using training data that can come from a different distribution than your dev and test set, you can get a lot more data and therefore help the performance of your learning algorithm. But instead of just having bias and variance as two potential problems, you now have this third potential problem, data mismatch. So what if you perform error analysis and conclude that data mismatch is a huge source of error; how do you go about addressing that? I'll be honest and say there unfortunately aren't great, or at least not completely systematic, ways to address data mismatch, but there are some things you could try that could help. Let's take a look at them in the next video.