1 00:00:00,000 --> 00:00:03,600 I've noticed that almost all the really good machine learning 2 00:00:03,600 --> 00:00:07,890 practitioners tend to have a very sophisticated understanding of bias and variance. 3 00:00:07,890 --> 00:00:12,330 Bias and variance is one of those concepts that's easily learned but difficult to master. 4 00:00:12,330 --> 00:00:16,155 Even if you think you've seen the basic concepts of bias and variance, 5 00:00:16,155 --> 00:00:18,805 there's often more nuance to it than you'd expect. 6 00:00:18,805 --> 00:00:20,620 In the deep learning era, 7 00:00:20,620 --> 00:00:22,650 another trend is that there's been 8 00:00:22,650 --> 00:00:26,035 less discussion of what's called the bias-variance trade-off. 9 00:00:26,035 --> 00:00:28,657 You might have heard this thing called the bias-variance trade-off. 10 00:00:28,657 --> 00:00:31,330 But in the deep learning era there's less of a trade-off, 11 00:00:31,330 --> 00:00:32,520 so we still talk about bias, 12 00:00:32,520 --> 00:00:33,865 we still talk about variance, 13 00:00:33,865 --> 00:00:37,295 but we just talk less about the bias-variance trade-off. 14 00:00:37,295 --> 00:00:39,795 Let's see what this means. 15 00:00:39,795 --> 00:00:42,750 Let's say you have a data set that looks like this. 16 00:00:42,750 --> 00:00:44,800 If you fit a straight line to the data, 17 00:00:44,800 --> 00:00:48,130 maybe you get a logistic regression fit to that. 18 00:00:48,130 --> 00:00:50,415 This is not a very good fit to the data. 19 00:00:50,415 --> 00:00:52,380 And so this is a case of high bias, 20 00:00:52,380 --> 00:00:56,400 or we say that this is underfitting the data.
21 00:00:56,400 --> 00:00:57,850 On the opposite end, 22 00:00:57,850 --> 00:01:00,645 if you fit an incredibly complex classifier, 23 00:01:00,645 --> 00:01:02,640 maybe a deep neural network, 24 00:01:02,640 --> 00:01:05,995 or a neural network with a lot of hidden units, 25 00:01:05,995 --> 00:01:10,255 maybe you can fit the data perfectly, 26 00:01:10,255 --> 00:01:12,220 but that doesn't look like a great fit either. 27 00:01:12,220 --> 00:01:17,535 So this is a classifier with high variance, and it is overfitting the data. 28 00:01:17,535 --> 00:01:19,650 And there might be some classifier in between, 29 00:01:19,650 --> 00:01:22,070 with a medium level of complexity, 30 00:01:22,070 --> 00:01:24,585 that maybe fits it correctly like that. 31 00:01:24,585 --> 00:01:27,300 That looks like a much more reasonable fit to the data, 32 00:01:27,300 --> 00:01:31,705 so we call that just right. It's somewhere in between. 33 00:01:31,705 --> 00:01:34,260 So in a 2D example like this, 34 00:01:34,260 --> 00:01:35,610 with just two features, 35 00:01:35,610 --> 00:01:39,700 X1 and X2, you can plot the data and visualize bias and variance. 36 00:01:39,700 --> 00:01:41,250 In high dimensional problems, 37 00:01:41,250 --> 00:01:44,735 you can't plot the data and visualize the decision boundary. 38 00:01:44,735 --> 00:01:46,830 Instead, there are a couple of different metrics 39 00:01:46,830 --> 00:01:49,750 that we'll look at to try to understand bias and variance. 40 00:01:49,750 --> 00:01:51,960 So continuing our example of cat picture classification, 41 00:01:51,960 --> 00:01:57,600 where that's a positive example and that's a negative example, 42 00:01:57,600 --> 00:02:01,455 the two key numbers to look at to understand bias and variance will be 43 00:02:01,455 --> 00:02:06,415 the train set error and the dev set, or development set, error.
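A minimal sketch of how the train set error and the dev set error might be computed, assuming binary labels (1 for cat, 0 for not cat). The helper name `error_rate` and the toy predictions are hypothetical and only illustrate the idea; they do not come from the lecture.

```python
def error_rate(predictions, labels):
    """Return the fraction of examples whose prediction differs from its label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)


# Toy data (made up for illustration): 1 = cat, 0 = not cat.
train_error = error_rate([1, 0, 1, 1], [1, 0, 1, 1])  # every prediction correct
dev_error = error_rate([1, 0, 0, 1], [1, 1, 0, 1])    # one mistake out of four

print(f"train error: {train_error:.0%}, dev error: {dev_error:.0%}")
```

The same function is applied to both sets, so the two numbers are directly comparable.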
44 00:02:06,415 --> 00:02:07,716 So for the sake of argument, 45 00:02:07,716 --> 00:02:10,290 let's say that recognizing cats in pictures 46 00:02:10,290 --> 00:02:13,860 is something that people can do nearly perfectly, right? 47 00:02:13,860 --> 00:02:22,155 So let's say your training set error is 1% and your dev set error is, 48 00:02:22,155 --> 00:02:23,580 for the sake of argument, 49 00:02:23,580 --> 00:02:25,585 let's say it's 11%. 50 00:02:25,585 --> 00:02:26,730 So in this example, 51 00:02:26,730 --> 00:02:29,495 you're doing very well on the training set, 52 00:02:29,495 --> 00:02:34,355 but you're doing relatively poorly on the development set. 53 00:02:34,355 --> 00:02:38,215 So this looks like you might have overfit the training set, 54 00:02:38,215 --> 00:02:40,620 that somehow you're not generalizing well 55 00:02:40,620 --> 00:02:43,815 to this hold-out cross-validation set, the development set. 56 00:02:43,815 --> 00:02:46,440 And so if you have an example like this, 57 00:02:46,440 --> 00:02:50,785 we would say this has high variance. 58 00:02:50,785 --> 00:02:54,240 So by looking at the training set error and the development set error, 59 00:02:54,240 --> 00:02:59,730 you would be able to render a diagnosis of your algorithm having high variance. 60 00:02:59,730 --> 00:03:04,435 Now, let's say that you measure your training set error and your dev set error, 61 00:03:04,435 --> 00:03:06,095 and you get a different result. 62 00:03:06,095 --> 00:03:09,610 Let's say that your training set error is 15%. 63 00:03:09,610 --> 00:03:12,375 I'm writing your training set error in the top row, 64 00:03:12,375 --> 00:03:15,880 and your dev set error is 16%.
65 00:03:15,880 --> 00:03:22,870 In this case, assuming that humans achieve roughly 0% error, 66 00:03:22,870 --> 00:03:27,451 that humans can look at these pictures and just tell if it's a cat or not, 67 00:03:27,451 --> 00:03:31,600 then it looks like the algorithm is not even doing very well on the training set. 68 00:03:31,600 --> 00:03:35,380 So if it's not even fitting the training data that well, 69 00:03:35,380 --> 00:03:38,220 then this is underfitting the data. 70 00:03:38,220 --> 00:03:40,895 And so this algorithm has high bias. 71 00:03:40,895 --> 00:03:45,390 But in contrast, this is actually generalizing at a reasonable level to the dev set, 72 00:03:45,390 --> 00:03:49,365 since performance on the dev set is only 1% worse than performance on the training set. 73 00:03:49,365 --> 00:03:52,355 So this algorithm has a problem of high bias, 74 00:03:52,355 --> 00:03:56,325 because it was not even fitting the training set well. 75 00:03:56,325 --> 00:04:01,030 This is similar to the leftmost plot we had on the previous slide. 76 00:04:01,030 --> 00:04:03,329 Now, here's another example. 77 00:04:03,329 --> 00:04:06,430 Let's say that you have 15% training set error, 78 00:04:06,430 --> 00:04:08,127 so that's pretty high bias, 79 00:04:08,127 --> 00:04:11,446 but when you evaluate on the dev set it does even worse, 80 00:04:11,446 --> 00:04:13,450 maybe it gets 30%. 81 00:04:13,450 --> 00:04:18,175 In this case, I would diagnose this algorithm as having high bias, 82 00:04:18,175 --> 00:04:23,780 because it's not doing that well on the training set, and high variance. 83 00:04:23,780 --> 00:04:27,950 So this really has the worst of both worlds.
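The three diagnoses so far can be sketched in a few lines of Python. This is only an illustration: the function name `diagnose` and the 5% cutoff for what counts as "high" are assumptions chosen for the example, not values given in the lecture, and the sketch assumes that humans achieve roughly 0% error on the task.

```python
def diagnose(train_error, dev_error, threshold=0.05):
    """Label a classifier from its train and dev error rates.

    Assumes the optimal error is close to 0%. The threshold is a
    hypothetical cutoff picked for this illustration only.
    """
    high_bias = train_error > threshold                   # poor fit to the training set
    high_variance = (dev_error - train_error) > threshold  # large train-to-dev gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"


# The (train error, dev error) pairs discussed in the lecture so far.
for train, dev in [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30)]:
    print(f"train {train:.0%}, dev {dev:.0%}: {diagnose(train, dev)}")
```

With these assumed cutoffs, the first pair is labeled high variance, the second high bias, and the third both.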
84 00:04:27,950 --> 00:04:29,325 And one last example, 85 00:04:29,325 --> 00:04:32,605 if you have 0.5% training set error, 86 00:04:32,605 --> 00:04:35,145 and 1% dev set error, 87 00:04:35,145 --> 00:04:37,130 then maybe your users are quite happy 88 00:04:37,130 --> 00:04:39,850 that you have a cat classifier with only 1% error, 89 00:04:39,850 --> 00:04:44,340 and then this would have low bias and low variance. 90 00:04:44,340 --> 00:04:47,610 One subtlety that I'll just briefly mention, and that 91 00:04:47,610 --> 00:04:50,955 we'll leave to a later video to discuss in detail, 92 00:04:50,955 --> 00:04:54,188 is that this analysis is predicated on the assumption 93 00:04:54,188 --> 00:04:59,115 that human level performance gets nearly 0% error or, 94 00:04:59,115 --> 00:05:01,749 more generally, that the optimal error, 95 00:05:01,749 --> 00:05:04,225 sometimes called Bayes error, 96 00:05:04,225 --> 00:05:10,355 so the Bayesian optimal error, is nearly 0%. 97 00:05:10,355 --> 00:05:13,565 I don't want to go into detail on this in this particular video, 98 00:05:13,565 --> 00:05:18,070 but it turns out that if the optimal error or the Bayes error were much higher, say, 99 00:05:18,070 --> 00:05:22,360 it were 15%, then if you look at this classifier, 100 00:05:22,360 --> 00:05:25,460 15% is actually perfectly reasonable for the training set, and you 101 00:05:25,460 --> 00:05:30,120 wouldn't see it as high bias, and it also has pretty low variance. 102 00:05:30,120 --> 00:05:33,440 So the case of how to analyze bias and variance 103 00:05:33,440 --> 00:05:37,460 when no classifier can do very well, for example, 104 00:05:37,460 --> 00:05:40,833 if you have really blurry images, 105 00:05:40,833 --> 00:05:46,081 so that even a human or just no system could possibly do very well, 106 00:05:46,081 --> 00:05:49,237 then maybe Bayes error is much higher, 107 00:05:49,237 --> 00:05:52,295 and then there are some details of how this analysis will change.
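One way to picture how the optimal (Bayes) error changes the reading of the same numbers is sketched below. This is a hedged illustration only: the lecture defers the full analysis to a later video, and the function name `diagnose_with_baseline`, the `optimal_error` parameter, and the 5% cutoff are all assumptions made for this example.

```python
def diagnose_with_baseline(train_error, dev_error, optimal_error=0.0, threshold=0.05):
    """Judge bias relative to an assumed optimal error instead of relative to 0%.

    optimal_error and threshold are hypothetical inputs for illustration;
    in practice the optimal error is usually only estimated.
    """
    bias_gap = train_error - optimal_error   # how far training error is from the best achievable
    variance_gap = dev_error - train_error   # how much worse the dev set is than the training set
    bias = "high bias" if bias_gap > threshold else "low bias"
    variance = "high variance" if variance_gap > threshold else "low variance"
    return f"{bias}, {variance}"


# Same classifier (15% train error, 16% dev error), two different assumptions.
print(diagnose_with_baseline(0.15, 0.16, optimal_error=0.0))   # optimal error near 0%
print(diagnose_with_baseline(0.15, 0.16, optimal_error=0.15))  # optimal error of 15%
```

Under the first assumption the classifier reads as high bias; under the second, the same 15% training error is about as good as any system could do, so it reads as low bias and low variance.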
108 00:05:52,295 --> 00:05:54,725 But leaving aside this subtlety for now, 109 00:05:54,725 --> 00:05:57,430 the takeaway is that by looking at 110 00:05:57,430 --> 00:06:02,676 your training set error you can get a sense of how well you are fitting, 111 00:06:02,676 --> 00:06:04,331 at least the training data, 112 00:06:04,331 --> 00:06:06,770 and so that tells you if you have a bias problem. 113 00:06:06,770 --> 00:06:10,190 And then looking at how much higher your error goes 114 00:06:10,190 --> 00:06:12,965 when you go from the training set to the dev set, 115 00:06:12,965 --> 00:06:17,055 that should give you a sense of how bad the variance problem is, 116 00:06:17,055 --> 00:06:20,857 so whether you're doing a good job generalizing from the training set to the dev set, 117 00:06:20,857 --> 00:06:22,645 that gives you a sense of your variance. 118 00:06:22,645 --> 00:06:26,210 All this is under the assumption that the Bayes error is quite 119 00:06:26,210 --> 00:06:30,235 small and that your training and your dev sets are drawn from the same distribution. 120 00:06:30,235 --> 00:06:32,210 If those assumptions are violated, 121 00:06:32,210 --> 00:06:34,323 there's a more sophisticated analysis you could do, 122 00:06:34,323 --> 00:06:36,510 which we'll talk about in a later video. 123 00:06:36,510 --> 00:06:38,780 Now, on the previous slide, 124 00:06:38,780 --> 00:06:40,849 you saw what high bias and 125 00:06:40,849 --> 00:06:42,185 high variance look like, 126 00:06:42,185 --> 00:06:44,920 and I guess you have a sense of what a good classifier can look like. 127 00:06:44,920 --> 00:06:48,110 What do high bias and high variance together look like? 128 00:06:48,110 --> 00:06:50,535 This is kind of the worst of both worlds. 129 00:06:50,535 --> 00:06:53,415 So you remember, we said that a classifier like this 130 00:06:53,415 --> 00:06:55,755 has high bias, 131 00:06:55,755 --> 00:06:58,185 because it underfits the data.
132 00:06:58,185 --> 00:07:04,030 So this would be a classifier that is mostly linear and therefore underfits the data, 133 00:07:04,030 --> 00:07:05,570 we're drawing this in purple. 134 00:07:05,570 --> 00:07:09,546 But if somehow your classifier does some weird things, 135 00:07:09,546 --> 00:07:14,460 then it is actually overfitting parts of the data as well. 136 00:07:14,460 --> 00:07:16,500 So the classifier that I drew in purple 137 00:07:16,500 --> 00:07:19,455 has both high bias and high variance. 138 00:07:19,455 --> 00:07:21,300 It has high bias because, 139 00:07:21,300 --> 00:07:23,325 by being a mostly linear classifier, 140 00:07:23,325 --> 00:07:24,875 it is just not fitting, 141 00:07:24,875 --> 00:07:28,466 you know, this quadratic shape that well, 142 00:07:28,466 --> 00:07:31,200 but by having too much flexibility in the middle, 143 00:07:31,200 --> 00:07:32,995 it somehow gets this example 144 00:07:32,995 --> 00:07:36,720 and this example, so it overfits those two examples as well. 145 00:07:36,720 --> 00:07:40,515 So this classifier kind of has high bias because it was mostly linear, 146 00:07:40,515 --> 00:07:43,620 but you need maybe a curved function or a quadratic function. 147 00:07:43,620 --> 00:07:45,115 And it has high variance, 148 00:07:45,115 --> 00:07:49,595 because it had too much flexibility to fit those two mislabeled, 149 00:07:49,595 --> 00:07:52,475 or those outlier, examples in the middle as well. 150 00:07:52,475 --> 00:07:54,300 In case this seems contrived, well, 151 00:07:54,300 --> 00:07:57,585 this example is a little bit contrived in two dimensions, 152 00:07:57,585 --> 00:07:59,883 but it is less so with very high dimensional inputs.
153 00:07:59,883 --> 00:08:01,795 You actually do get things with 154 00:08:01,795 --> 00:08:04,800 high bias in some regions and high variance in some regions, 155 00:08:04,800 --> 00:08:07,410 and so it is possible to get classifiers like this 156 00:08:07,410 --> 00:08:11,415 in high dimensional inputs that seem less contrived. 157 00:08:11,415 --> 00:08:15,690 So to summarize, you've seen how, by looking at your algorithm's error on 158 00:08:15,690 --> 00:08:20,550 the training set and your algorithm's error on the dev set, you can try to diagnose 159 00:08:20,550 --> 00:08:23,413 whether it has problems of high bias or high variance, 160 00:08:23,413 --> 00:08:25,420 or maybe both, or maybe neither. 161 00:08:25,420 --> 00:08:28,995 And depending on whether your algorithm suffers from bias or variance, 162 00:08:28,995 --> 00:08:31,765 it turns out that there are different things you could try. 163 00:08:31,765 --> 00:08:33,840 So in the next video, I want to present to you 164 00:08:33,840 --> 00:08:37,390 what I call a basic recipe for machine learning 165 00:08:37,390 --> 00:08:40,905 that lets you more systematically try to improve your algorithm, 166 00:08:40,905 --> 00:08:44,370 depending on whether it has high bias or high variance issues. 167 00:08:44,370 --> 00:08:46,110 So let's go on to the next video.