Estimating the bias and variance of your learning algorithm really helps you prioritize what to work on next. But the way you analyze bias and variance changes when your training set comes from a different distribution than your dev and test sets. Let's see how.

Let's keep using our cat classification example, and let's say humans get near-perfect performance on this. So Bayes error, or Bayes optimal error, we know is nearly 0% on this problem. To carry out error analysis, you usually look at the training error and also at the error on the dev set. So let's say, in this example, that your training error is 1% and your dev error is 10%. If your dev data came from the same distribution as your training set, you would say that you have a large variance problem: your algorithm is just not generalizing well from the training set, which it's doing well on, to the dev set, which it's suddenly doing much worse on. But in the setting where your training data and your dev data come from different distributions, you can no longer safely draw this conclusion. In particular, maybe the algorithm is actually doing just fine on the dev set; it's just that the training set was really easy, because it was high-resolution, very clear images, and the dev set is just much harder. So maybe there isn't a variance problem, and this just reflects that the dev set contains images that are much more difficult to classify accurately.

The problem with this analysis is that when you went from the training error to the dev error, two things changed at the same time. One, the algorithm saw the data in the training set but not in the dev set. Two, the distribution of data in the dev set is different. And because you changed two things at once, it's difficult to know, of this 9% increase in error, how much is because the algorithm didn't see the data in the dev set (that's the variance part of the problem) and how much is because the dev set data is just different.

In order to tease out these two effects (and if you didn't totally follow what these two effects are, don't worry, we'll go over them again in a second), it will be useful to define a new piece of data which we'll call the training-dev set.
So this is a new subset of data that we carve out, which should have the same distribution as the training set, but which you don't explicitly train your network on. Here's what I mean. Previously we had set up some training sets, some dev sets, and some test sets. The dev and test sets have the same distribution, but the training set has a different distribution. What we're going to do is randomly shuffle the training set and then carve out just a piece of it to be the training-dev set. So just as the dev and test set have the same distribution, the training set and the training-dev set also have the same distribution. But the difference is that now you train your neural network just on the training set proper; you won't run backpropagation on the training-dev portion of the data.

To carry out error analysis, what you should now do is look at the error of your classifier on the training set, on the training-dev set, and on the dev set. So let's say, in this example, that your training error is 1%, the error on the training-dev set is 9%, and the error on the dev set is 10%, same as before. What you can conclude from this is that when you went from training data to training-dev data, the error went up a lot. And the only difference between the training data and the training-dev data is that your neural network got to see the first part; it was trained explicitly on it, but it wasn't trained on the training-dev data. So this tells you that you have a variance problem, because the training-dev error was measured on data that comes from the same distribution as your training set. You know that even though your neural network does well on the training set, it's just not generalizing well to data from that same distribution that it hadn't seen before. So in this example we really have a variance problem.
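As a concrete illustration, here is a minimal sketch of how you might carve out a training-dev set in Python/NumPy. The function name, array names, and 5% fraction are my own choices for illustration, not something fixed by the lecture:

```python
import numpy as np

def make_training_dev_split(X_train, y_train, train_dev_fraction=0.05, seed=0):
    """Carve a training-dev set out of the training data.

    The training-dev set has the same distribution as the training set,
    but the network is never trained on it; it is only used to measure
    how well the model generalizes within the training distribution.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_train))            # randomly shuffle the training set
    n_dev = int(len(X_train) * train_dev_fraction) # size of the carved-out slice

    train_dev_idx, train_idx = idx[:n_dev], idx[n_dev:]
    X_train_dev, y_train_dev = X_train[train_dev_idx], y_train[train_dev_idx]
    X_train_proper, y_train_proper = X_train[train_idx], y_train[train_idx]

    # Train only on the "training set proper"; evaluate (but never backprop)
    # on the training-dev slice.
    return (X_train_proper, y_train_proper), (X_train_dev, y_train_dev)
```

The fraction here is arbitrary; the only requirement is that this slice shares the training distribution and is never used for gradient updates.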
Let's look at a different example. Let's say the training error is 1% and the training-dev error is 1.5%, but when you go to the dev set your error is 10%. So now you actually have a pretty low variance problem, because when you went from training data that the network has seen to training-dev data that it has not seen, the error increased only a little bit, but then it really jumps when you go to the dev set. This is a data mismatch problem, because your learning algorithm was not trained explicitly on data from either training-dev or dev, but these two data sets come from different distributions. Whatever the algorithm is learning, it works great on training-dev but it doesn't work well on dev. So somehow your algorithm has learned to do well on a different distribution than the one you really care about; we call that a data mismatch problem.

Let's look at a few more examples. I'll write these on the next row, since I'm running out of space on top: training error, training-dev error, and dev error. Let's say that training error is 10%, training-dev error is 11%, and dev error is 12%. Remember that our human-level proxy for Bayes error is roughly 0%. So if you have this type of performance, then you really have a bias problem, an avoidable bias problem, because you're doing much worse than human level. This is really a high bias setting.

And one last example: if your training error is 10%, your training-dev error is 11%, and your dev error is 20%, then it looks like this actually has two issues. One, the avoidable bias is quite high, because you're not even doing that well on the training set: humans get nearly 0% error, but you're getting 10% error on your training set. The variance here seems quite small, but the data mismatch is quite large. So for this example I would say you have a large bias, or avoidable bias, problem as well as a data mismatch problem.

So let's take what we've done on this slide and write out the general principles. The key quantities I would look at are human-level error, your training set error, and your training-dev set error. That last one is on the same distribution as the training set, but you didn't train explicitly on it.
The fourth quantity is your dev set error. Depending on the differences between these errors, you can get a sense of how big the avoidable bias, the variance, and the data mismatch problems are. So let's say that human-level error is 4%, your training error is 7%, your training-dev error is 10%, and your dev error is 12%. The gap between human level and training error gives you a sense of the avoidable bias, because you'd like your algorithm to do at least as well as, or approach, human-level performance on the training set. The gap between training and training-dev error gives you a sense of the variance: how well do you generalize from the training set to the training-dev set? And the gap between training-dev and dev error gives you a sense of how much of a data mismatch problem you have.

Technically you could also add one more thing, which is the test set performance; we'll write that as test error. You shouldn't be doing development on your test set, because you don't want to overfit your test set. But if you also look at it, then the gap between dev error and test error tells you the degree of overfitting to the dev set. If there's a huge gap between your dev set performance and your test set performance, it means you may have overtuned to the dev set, and so maybe you need to find a bigger dev set. Remember that your dev set and your test set come from the same distribution, so the only way for there to be a huge gap here, for the algorithm to do much better on the dev set than on the test set, is if you somehow managed to overfit the dev set. If that's the case, what you might consider doing is going back and just getting more dev set data.

Now, I've written these numbers so that, as you go down the list, they keep going up, but they don't always go up. Here's one example: maybe human-level performance is 4%, training error is 7%, and training-dev error is 10%, but when you go to the dev set you find that you actually, surprisingly, do much better; maybe dev error is 6%, and test error is 6% as well. I have seen effects like this, working for example on a speech recognition task where the training data turned out to be much harder than the dev and test sets. The training and training-dev errors were evaluated on your training set distribution, while the dev and test errors were evaluated on your dev/test set distribution. So sometimes, if your dev/test set distribution is much easier for whatever application you're working on, these numbers can actually go down.
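To make the arithmetic concrete, here is a small, hypothetical helper (the function and key names are my own, not from the lecture) that turns these four or five error numbers into the corresponding gaps:

```python
def diagnose_errors(human_level, train_err, train_dev_err, dev_err, test_err=None):
    """Decompose error numbers (in percent) into the gaps discussed above."""
    report = {
        "avoidable_bias": train_err - human_level,   # human level -> training error
        "variance": train_dev_err - train_err,       # training -> training-dev error
        "data_mismatch": dev_err - train_dev_err,    # training-dev -> dev error
    }
    if test_err is not None:
        report["overfitting_to_dev"] = test_err - dev_err  # dev -> test error
    return report

# Example from the slide: human 4%, train 7%, train-dev 10%, dev 12%
print(diagnose_errors(4, 7, 10, 12))
# {'avoidable_bias': 3, 'variance': 3, 'data_mismatch': 2}
```

In the speech example just mentioned, the data-mismatch entry would come out negative (6% minus 10%), which is itself a useful signal that the dev/test distribution is easier than the training distribution.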
If you see funny things like this, there's an even more general formulation of this analysis that might be helpful. Let me quickly explain it on the next slide, and let me motivate it using the speech-activated rearview mirror example. It turns out that the numbers we've been writing down can be placed into a table where, on the horizontal axis, I'm going to place different data sets. For example, you might have data from your general speech recognition task: a bunch of data you collected from a lot of speech recognition problems you've worked on, from smart speakers, data you purchased, and so on. And then you also have the rearview mirror-specific speech data, recorded inside the car. So on the x-axis of the table I'm going to vary the data set. On the other axis, I'm going to label different ways, or algorithms, for examining the data. First, there's human-level performance: how accurate are humans on each of these data sets? Then there is the error on examples that your neural network has trained on. And finally, there's the error on examples that your neural network has not trained on.

It turns out that what we were calling human level on the previous slide is the number that goes in the first box: how well do humans do on this category of data, say data from all sorts of speech recognition tasks, the utterances that you put into your training set. In the example on the previous slide, this was 4%. The next number down was the training error, which in the example on the previous slide was 7%: if your learning algorithm has seen an example and performed gradient descent on it, and that example came from your training set distribution, or some general speech recognition distribution, how well does your algorithm do on it?
Then there is the training-dev set error. It's usually a bit higher: for data from that same general speech recognition distribution that your algorithm did not train explicitly on, how well does it do? That's what we call the training-dev error. And if you move over to the right, the next box is the dev set error, or maybe also the test set error, which was 6% in the example just now. Dev and test error are technically two numbers, but either one could go into this box. This is data from your rearview mirror application, actually recorded in the car, that your neural network did not perform backpropagation on: what is the error there?

So what we were doing in the analysis on the previous slide was looking at the differences between these pairs of numbers. The gap between human level and training error is a measure of avoidable bias, the gap between training error and training-dev error is a measure of variance, and the gap between training-dev error and dev/test error is a measure of data mismatch.

It turns out that it can also be useful to fill in the remaining two entries in this table. One is human-level error on the rearview mirror data: you ask some humans to label rearview mirror speech data and measure how good they are at this task; maybe this turns out to be 6%. The other is the error on rearview mirror examples that the network has trained on: you take some rearview mirror speech data, put it in the training set so the neural network learns on it, and then measure the error on that subset of the data. If that also comes out around 6%, then you're actually already performing at the level of humans on this rearview mirror speech data, so maybe you're doing quite well on that distribution.

When you do this more general analysis, it doesn't always give you one clear path forward, but sometimes it gives you additional insights. For example, comparing human-level performance on the two data sets tells us that, for humans, the rearview mirror speech data is actually harder than general speech recognition, because humans get 6% error rather than 4% error. And looking at the other differences as well may help you understand the bias, variance, and data mismatch problems to different degrees.
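As a sketch, here's one way you might lay out this table in code so the relevant differences are easy to read off. The labels and numbers below are the hypothetical ones from this example, and the layout is just one reasonable choice:

```python
# Rows = how the data is examined, columns = which distribution the data comes from.
table = {
    "human level":           {"general speech": 0.04, "rearview mirror": 0.06},
    "error, trained on":     {"general speech": 0.07, "rearview mirror": 0.06},
    "error, not trained on": {"general speech": 0.10, "rearview mirror": 0.06},
}

columns = ["general speech", "rearview mirror"]
print(f"{'':<24}" + "".join(f"{c:>18}" for c in columns))
for row_name, row in table.items():
    print(f"{row_name:<24}" + "".join(f"{row[c]:>18.0%}" for c in columns))

# Reading off the gaps in the "general speech" column:
#   avoidable bias = 7% - 4%,  variance = 10% - 7%
# and across the bottom row: data mismatch = 6% - 10% (negative in this example,
# since the model does better on rearview data it never trained on than on
# held-out general speech data).
```

Keeping all six numbers in one place makes it easier to spot patterns like humans finding the rearview mirror data harder (6% versus 4%) even though the model already matches them on it.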
This more general formulation is something I've used a few times, though not that often; for a lot of problems, you find that examining the subset of entries we discussed, looking at the avoidable bias, variance, and data mismatch differences, is enough to point you in a pretty promising direction. But sometimes filling out the whole table can give you additional insights.

Finally, we've previously talked a lot about ideas for addressing bias, and we've talked about techniques for addressing variance, but how do you address data mismatch? What we've seen is that training on data that comes from a different distribution than your dev and test sets can get you a lot more data and therefore help your learning algorithm's performance. But instead of just having bias and variance as two potential problems, you now have this third potential problem, data mismatch. So what if you perform error analysis and conclude that data mismatch is a huge source of error; how do you go about addressing that? I'll be honest and say that unfortunately there aren't great, or at least not very systematic, ways to address data mismatch, but there are a few things you can try that could help. Let's take a look at them in the next video.