So now let's talk about a way to automatically address this issue by modifying the cost term that we're minimizing when we're assessing how good our fit is. In particular, we're looking at this orange box, this quality metric. Before, our quality metric just depended on the difference between our predicted house sales price and our actual house sales price; in particular, we were looking at residual sum of squares as our measure of fit. But now we're gonna modify this quality metric to also take into account a measure of the complexity of the model, in order to bias us toward simpler models. So when we're thinking about defining this modified cost function, what we're gonna want to do is balance between how well the function fits the data and a measure of how complex, or how potentially overfit, the model is. And what did we see was an indicator of that? The magnitude of our estimated coefficients. So what we're going to balance is the fit of the model to the data against the magnitude of the coefficients of the model. Okay, so we can write down a total cost that has these two terms.
This is our new measure of the quality of the fit, and when I say measure of fit here, what I mean is that a small number indicates a good fit to the data. And on the other hand, for the measure of the magnitude of the coefficients, if that number is small, that means the coefficients are small, and we're unlikely to be in this setting of a very overfit model. Okay, so clearly we want to balance between these two measures, because if I just optimized the magnitude of the coefficients, I'd set all the coefficients to zero, and that sure would not be overfit, but it also would not fit the data well. So that would be a very high-bias solution. On the other hand, if I just focused on optimizing the measure of fit, that's what we did before, and that's the thing that was subject to becoming overfit in the face of complex models. So somehow we want to trade off between these two terms, and that's what we're going to discuss now. Okay, what's our measure of fit? At this point you guys should be pretty sick of hearing me say this: it's our residual sum of squares, which I've written here, and hopefully this formula is quite familiar to you at this point.
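The extreme case described above can be made concrete with a small sketch (not from the lecture; the targets here are hypothetical): an all-zero coefficient vector has zero magnitude penalty but a large residual sum of squares.

```python
# Sketch of the extreme case: setting every coefficient to zero
# minimizes the magnitude term but fits the training data poorly.
y = [2.0, 4.0, 6.0]          # hypothetical training targets
preds_zero = [0.0] * len(y)  # predictions from an all-zero model

rss_zero = sum((yi - pi) ** 2 for yi, pi in zip(y, preds_zero))
penalty_zero = 0.0           # magnitude of an all-zero w is 0

print(rss_zero)  # 56.0 -> large fit error despite zero penalty
```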
But sometimes we also write it as follows, where, remember, this is our predicted value using w in our model to make these predictions. And just remember that a small residual sum of squares is indicative of the model fitting the training data well. So just as we said on the previous slide, when we're thinking about measure of fit, a small number is gonna indicate a good fit. Okay, so now what we need is a measure of the magnitude of the coefficients. So what summary number might be indicative of the size of the regression coefficients? Well, maybe you think about just summing all the coefficients together. Is this gonna be a good measure of the overall magnitude of the coefficients? Probably not in a lot of cases, because you might end up with a situation where, let's say, w0 is 1,527,301 and w1 is -1,605,253, and these are the only two coefficients in our model. If I look at w0 + w1, this is gonna be some small number, despite the fact that each of the coefficients themselves was quite large. Okay, so you might say, I know how to fix this: I'll just look at the absolute value.
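The cancellation above is easy to verify directly, using the two coefficient values from the lecture example:

```python
# Why summing raw coefficients is a poor measure of magnitude:
# two very large coefficients of opposite sign nearly cancel.
w0 = 1_527_301
w1 = -1_605_253

plain_sum = w0 + w1
print(plain_sum)  # -77952, tiny relative to either coefficient

# The sum is far smaller than the size of each coefficient.
assert abs(plain_sum) < abs(w0) and abs(plain_sum) < abs(w1)
```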
So maybe what I'll do is look at |w0| + |w1| + ... all the way up to |wD|, and I'll just write this compactly as the sum from j = 0 to capital D, the number of features we have, of the absolute value of wj. And this is defined to be equal to what's called the one norm of the vector of coefficients w. So we write it as ||w||_1, and this is called the L1 norm. And this is actually a very reasonable choice, and we're gonna discuss it more in the next module. But for now, the thing that we're gonna consider is the sum of the squares of the coefficients. So w0 squared + w1 squared, all the way up to wD squared. This is the sum from j = 0 to capital D of wj squared, and this is defined to be equal to a norm we've actually seen many times in this class so far: the two norm squared. So this is called our L2 norm, or really the L2 norm squared, and this is gonna be the focus of this module. Okay. So again, just to summarize, what we have is that our total cost is a sum of the measure of fit plus a measure of the magnitude of the coefficients, and we said our measure of fit is our residual sum of squares.
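The two candidate magnitude measures can be sketched in a few lines of Python (the coefficient vector here is a made-up example, not from the lecture):

```python
# The two magnitude measures defined above.
def l1_norm(w):
    # ||w||_1 = sum of absolute values of the coefficients
    return sum(abs(wj) for wj in w)

def l2_norm_squared(w):
    # ||w||_2^2 = sum of squared coefficients
    return sum(wj ** 2 for wj in w)

w = [1.0, -2.0, 3.0]       # hypothetical coefficient vector
print(l1_norm(w))          # 6.0
print(l2_norm_squared(w))  # 14.0
```

Note that unlike the plain sum, both measures are immune to sign cancellation: large positive and large negative coefficients both contribute positively.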
And our measure of the magnitude of the coefficients for this module is going to be this two norm of the w vector, squared.
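Putting the two terms together, the total cost on this slide can be sketched as follows. This is a minimal illustration, assuming a simple linear model with an intercept; the weighting between the two terms is discussed later in the course.

```python
# Total cost = RSS (measure of fit) + ||w||_2^2 (measure of magnitude).

def predict(w, x):
    # y_hat = w0 + w1*x1 + ... + wD*xD, with x a list of features
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def rss(w, X, y):
    # residual sum of squares over the training data
    return sum((yi - predict(w, xi)) ** 2 for xi, yi in zip(X, y))

def total_cost(w, X, y):
    l2_penalty = sum(wj ** 2 for wj in w)
    return rss(w, X, y) + l2_penalty

# Hypothetical data where y = 2x exactly.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
w = [0.0, 2.0]  # perfect fit: RSS = 0, penalty = 0^2 + 2^2 = 4
print(total_cost(w, X, y))  # 4.0
```

Even a perfectly fitting model pays a cost for the magnitude of its coefficients, which is exactly the bias toward simpler models described above.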