Okay, so let's consider the resulting objective, where I'm going to search over all possible w vectors to find the one that minimizes the residual sum of squares plus the square of the two norm of w. That's going to be my w hat, my estimated model parameters. But really what I'd like is to control how much I'm weighing the complexity of the model, as measured by the magnitude of my coefficients, relative to the fit of the model. I'd like to balance between these two terms, so I'm going to introduce another parameter, called a tuning parameter. In the model it's written as lambda, and it balances between this fit term and the magnitude term.

So let's see what happens if I choose lambda to be 0. Well, if I choose lambda to be 0, the magnitude term that we've introduced completely disappears, and my objective reduces to just minimizing the residual sum of squares, which is exactly the same as my objective before. So this reduces to minimizing the residual sum of squares of w, as before. That's our old solution, which leads to some w hat that I'm going to call w hat superscript LS, for least squares, because what we were doing before is commonly referred to as the least squares solution. So I'm going to specifically denote the parameters associated with that old procedure as the least squares parameters.

On the other hand, what if I crank that tuning parameter all the way up to infinity? Then I have a really, really massively large weight on this magnitude term, massively large being infinitely large, as large as you can possibly imagine. So what happens to any solution where w hat is not equal to 0? For solutions where w hat does not equal 0, what is the total cost? Well, I get something non-zero times infinity, plus my residual sum of squares, whatever that happens to be, and the sum of that is infinity. So my total cost is infinite.
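To make the lambda = 0 limit concrete, here is a minimal numpy sketch. It is not from the lecture; the synthetic data, the feature matrix H, and the function names are made up for illustration. It writes down the ridge cost RSS(w) + lambda * ||w||_2^2, solves it in closed form as (H^T H + lambda I)^{-1} H^T y, and checks that setting lambda = 0 returns exactly the ordinary least squares fit.

    import numpy as np

    # Hypothetical synthetic data, for illustration only (not from the lecture).
    rng = np.random.default_rng(0)
    H = rng.normal(size=(100, 5))                      # feature matrix
    y = H @ np.array([3.0, -2.0, 0.5, 1.0, -1.5]) + 0.1 * rng.normal(size=100)

    def ridge_objective(w, H, y, lam):
        """RSS(w) + lambda * ||w||_2^2: the total cost being minimized."""
        resid = y - H @ w
        return resid @ resid + lam * (w @ w)

    def ridge_solution(H, y, lam):
        """Closed-form minimizer: (H^T H + lambda * I)^{-1} H^T y."""
        d = H.shape[1]
        return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)

    w_ridge0 = ridge_solution(H, y, lam=0.0)           # lambda = 0
    w_ls, *_ = np.linalg.lstsq(H, y, rcond=None)       # ordinary least squares
    print(np.allclose(w_ridge0, w_ls))                 # True: same solution

With lambda = 0 the penalty contributes nothing, so the closed form collapses to the usual least squares normal equations, which is exactly the point being made here.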
On the other hand, what if w hat is exactly equal to 0? If w hat equals 0, then the total cost is just the residual sum of squares of that zero vector. And that's some number, but it's not infinity. So the minimizing solution here is always going to be w hat equals 0, because that's the thing that minimizes the total cost over all possible w's.

Okay, so just to recap: if we put that tuning parameter all the way down to 0, we return to our previous least squares solution, and if we crank that parameter all the way up to infinity, then in that limit we get all of our coefficients being exactly 0.

But we're going to be operating in a regime where lambda is somewhere in between 0 and infinity. And in this case, we know that the magnitude of our estimated coefficients is going to be less than or equal to the magnitude of our least squares coefficients; in particular, their two norm will be less than or equal to the two norm of the least squares solution. But we also know it's going to be greater than or equal to 0. So we're going to be somewhere in between these two extremes.

And a key question is, what lambda do we actually want? How much do we want to bias away from our least squares solution, which was subject to potential overfitting, down to this really simple, most trivial model you can consider, which is nothing, no model? Well, not quite no model, but no coefficients in the model. What's the model if all the coefficients are 0? Just noise: we just have y equals epsilon, that noise term.

Okay, so we're going to think about somehow trading off between these two extremes. I also wanted to mention that this procedure is referred to as ridge regression, and it's also known as L2 regularization, because, for reasons we'll describe a bit more later in this module, we're regularizing the solution to our old objective using this L2 norm term.
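As a small numerical illustration of that in-between regime, here is a hedged sketch with made-up data, not from the lecture: sweeping lambda from 0 up to a very large value and printing the two norm of the closed-form ridge estimate shows it starting at the least squares magnitude and shrinking toward 0 as lambda grows.

    import numpy as np

    # Hypothetical data again, to trace how ||w_hat(lambda)||_2 moves from the
    # least squares magnitude (lambda near 0) down toward 0 (lambda very large).
    rng = np.random.default_rng(0)
    H = rng.normal(size=(100, 5))
    y = H @ np.array([3.0, -2.0, 0.5, 1.0, -1.5]) + 0.1 * rng.normal(size=100)

    for lam in [0.0, 0.1, 1.0, 10.0, 100.0, 1e4, 1e8]:
        # Closed-form ridge estimate for this value of lambda.
        w_hat = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)
        print(f"lambda = {lam:>8g}   ||w_hat||_2 = {np.linalg.norm(w_hat):.4f}")

The printed norms interpolate between the two extremes discussed above, which is exactly the range of solutions the choice of lambda lets us trade off between.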