Okay. Well, now let's turn to this third component, which is variance. What variance is gonna say is: how different can my specific fits to a given data set be from one another, as I'm looking at different possible data sets?

And in this case, when we're looking at just this constant model, we showed with that earlier picture, where I drew points that were mainly above the true relationship and points mainly below, that the actual resulting fits didn't vary very much. And when you look at the space of all possible observations, you see that the fits are fairly similar, fairly stable. So when you look at the variation in these fits, which I'm drawing with these grey bars here, we see that they don't vary very much. So, for this low-complexity model, we see that there's low variance.

To summarize, what this variance is saying is: how much can the fits vary? If they could vary dramatically from one data set to another, then you would have very erratic predictions. Your prediction would be sensitive to which data set you got, so that would be a source of error in your predictions.

To see this, we can start looking at high-complexity models. In particular, let's look at this data set again, and now let's fit some high-order polynomial to it. So, that's the fit shown here. Now let's take this same data set, but choose two points, which I'm gonna highlight as these pink circles, and just move them a little bit. So, out of this whole data set, I've moved just two observations, and not too dramatically, but I get a dramatically different fit.

So then, when I think about looking over all possible data sets I might get, I might get some crazy set of curves. There is an average curve, and in this case the average curve is actually pretty well behaved, because this wild, wiggly curve is, at any point, equally likely to have been wild above or wild below. So, on average over all data sets, it's actually a fairly smooth, reasonable curve. But if I look at the variation between these fits, it's really large. So, what we're saying is that high-complexity models have high variance.
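[Editor's note: to make this concrete, here is a minimal simulation sketch, not from the lecture. It assumes a synthetic true function and Gaussian noise, refits a constant model and a degree-9 polynomial to many simulated data sets, and measures how much the fitted curves vary across data sets.]

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "true" relationship, standing in for the lecture's
    # square-feet-vs-house-value curve.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)
x_grid = np.linspace(0, 1, 100)
n_datasets, noise_sd = 200, 0.3

fits = {0: [], 9: []}  # degree 0 = constant model, degree 9 = high-order polynomial
for _ in range(n_datasets):
    y = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
    for deg in fits:
        coefs = np.polyfit(x_train, y, deg)
        fits[deg].append(np.polyval(coefs, x_grid))

for deg, curves in fits.items():
    curves = np.array(curves)
    # Variance of the fitted curve across data sets, averaged over the x grid:
    # low for the constant model, much higher for the flexible polynomial.
    print(f"degree {deg}: mean variance of fit = {curves.var(axis=0).mean():.4f}")
```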
On the other hand, if I look at the bias of this model, so here again I'm showing this average fit, which was this fairly well-behaved curve, it matched pretty well to the true relationship between square feet and house value, because my model is really flexible. So on average, it was able to fit that true relationship pretty precisely. So, these high-complexity models have low bias.

So, we can now talk about this bias-variance tradeoff. In particular, we're gonna plot bias and variance as a function of model complexity. What we saw in the past slides is that as our model complexity increases, our bias decreases, because we can better and better approximate the true relationship between x and y. So, this curve here is our bias curve. On the other hand, variance increases: our very simple model had very low variance, and the high-complexity models had high variance. So, this is a picture of our variance.

And so, what we see is that there's this natural tradeoff between bias and variance. One way to summarize this is something that's called mean squared error. If you watch the optional videos that go into all these concepts in more depth, you'll hear a lot more about mean squared error, including a formal definition and the derivation of this. But mean squared error is simply the sum of bias squared plus variance:

mean squared error = bias^2 + variance

Okay, I guess I'll write out variance to be very clear. So, this is my little cartoon of bias squared plus variance; this is my mean squared error curve.

And machine learning is all about this tradeoff between bias and variance. We're gonna see this again and again in this course, and we're gonna see it throughout the specialization. And the goal is finding this sweet spot. This is the sweet spot where we get our minimum error, the minimum contribution of bias and variance to our prediction errors. So, not sweet, sweet. It is sweet, sweet, but what I'm trying to write is "sweet spot". And this is what we'd love to get at; that's the model complexity that we'd want.
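[Editor's note: the decomposition can be checked numerically. Here is a minimal sketch, not from the lecture, again assuming a synthetic true function so that bias and variance are computable. It estimates bias squared and variance of the prediction at one input, across many simulated data sets, and confirms that they sum to the mean squared error of the fit (the irreducible noise term is excluded, as in the lecture's formula).]

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    # Stand-in "true" relationship; in a real problem this is unknown.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)
x0 = 0.25                      # input where we evaluate the decomposition
n_datasets, noise_sd = 2000, 0.3

for deg in (0, 3, 9):
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        y = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
        preds[i] = np.polyval(np.polyfit(x_train, y, deg), x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2   # squared gap to the truth, on average
    variance = preds.var()                       # spread of fits across data sets
    mse = np.mean((preds - true_f(x0)) ** 2)
    print(f"degree {deg}: bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"bias^2+variance={bias_sq + variance:.4f}  mse={mse:.4f}")
```

As complexity grows from degree 0 to degree 9, bias squared falls while variance rises, and their sum traces out the U-shaped mean squared error curve from the slide.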
But just like with generalization error, and I'm gonna write this down, can we compute this? Think about that while I'm writing. We cannot compute bias, variance, and thus mean squared error. And why? Well, the reason is that, just like with generalization error, they were defined in terms of the true function. Bias was defined very explicitly in terms of the relationship relative to the true function. And when we think about defining variance, we have to average over all possible data sets of size n that we could have gotten from the world, and the same was true for bias too. And we just don't know what those data sets are. So, we can't compute these things exactly. But throughout the rest of this course, we're gonna look at ways to optimize this tradeoff between bias and variance in a practical way.
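[Editor's note: the course develops the practical tools later; as a forward-looking sketch only, one standard proxy is to hold out part of the single data set we do have and use error on that held-out portion to pick the model complexity, since the true function and all other possible data sets are unavailable.]

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(2 * np.pi * x)  # used here only to generate data; unknown in practice

# One observed data set: in practice this is all we ever have.
x = rng.uniform(0, 1, 60)
y = true_f(x) + rng.normal(0, 0.3, x.size)

# Hold out part of the data as a validation set to stand in for unseen data.
idx = rng.permutation(x.size)
train, valid = idx[:40], idx[40:]

for deg in range(10):
    coefs = np.polyfit(x[train], y[train], deg)
    err = np.mean((y[valid] - np.polyval(coefs, x[valid])) ** 2)
    print(f"degree {deg}: validation MSE = {err:.4f}")
# The degree with the lowest validation error approximates the "sweet spot",
# without ever computing bias or variance directly.
```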