[MUSIC]

Okay, well, in place of our ridge regression objective, what if we took our measure of the magnitude of our coefficients to be what's called the L1 norm, where we sum over the absolute value of each one of our coefficients? We actually described this as a reasonable measure of the magnitude of the coefficients when we were discussing ridge regression last module. Well, the result is something that leads to sparse solutions, for reasons we're gonna go through in the remainder of this module. This objective is referred to as lasso regression, or L1 regularized regression.

Just like in ridge regression, lasso is governed by a tuning parameter, lambda, that controls how much we're favoring sparsity of our solutions relative to the fit on our training data. And just to be clear, when we're doing our feature selection task here, we're searching over a continuous space, this space of lambda values, with lambda governing the sparsity of the solution. That's in contrast to, for example, the all subsets or greedy approaches, where we talked about searching over a discrete set of possible solutions. So it's really a fundamentally different approach to doing feature selection.

Okay, but let's talk about what happens to our solution as we vary lambda. And again, just to emphasize, this lambda is a tuning parameter that in this case is balancing fit and sparsity.

So if lambda is equal to zero, what's gonna happen? Well, the penalty term completely disappears, and our objective is simply to minimize the residual sum of squares. That was our old least squares objective. So we're going to get that w hat lasso, the solution to our lasso problem, is exactly equal to w hat least squares. That is, it's equal to our unregularized solution.

In contrast, if we set lambda equal to infinity, this is where we're completely favoring this magnitude penalty and completely ignoring the residual sum of squares fit. In this case, what's the thing that minimizes the L1 norm? That is, what value of our regression coefficients has the smallest sum of absolute values?
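Before answering that, here is a minimal sketch of the objective itself. This is my own illustration using scikit-learn, not the course's code; its `alpha` plays the role of lambda (up to the library's internal 1/(2N) scaling of the fit term), and the data is synthetic.

```python
# Minimal sketch of the lasso objective: RSS plus lambda times the L1 norm.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def lasso_cost(X, y, w, lam):
    rss = np.sum((y - X @ w) ** 2)     # fit term: residual sum of squares
    l1_norm = np.sum(np.abs(w))        # magnitude term: sum of |coefficients|
    return rss + lam * l1_norm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=200)

# lambda = 0: the penalty term disappears and lasso reduces to least squares.
w_ls = LinearRegression(fit_intercept=False).fit(X, y).coef_
w_lasso = Lasso(alpha=1e-6, fit_intercept=False).fit(X, y).coef_  # alpha ~ 0
print(np.allclose(w_ls, w_lasso, atol=1e-3))       # True
print(lasso_cost(X, y, w_ls, lam=0.0))             # just the RSS
print(lasso_cost(X, y, w_ls, lam=10.0))            # RSS + 10 * ||w||_1
```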
Well, again, just like in ridge, when lambda is equal to infinity we get w hat lasso equal to the zero vector.

And if lambda is in between, we get that the 1-norm of our lasso solution is less than or equal to the 1-norm of our least squares solution, and greater than or equal to zero. I said the zero vector, sorry, I mean the number zero, since here it's just a number once we've taken the norm.

Okay. So as of yet, it's not clear why this L1 norm is leading to sparsity, and we're going to get to that, but let's first just explore this visually. One way we can see this is from the coefficient path. But first, let's just remember the coefficient path for ridge regression, where we saw that even for a large value of lambda, everything was in our model, just with small coefficients. So every w hat j is nonzero, but all the w hat j are small, for large values of our tuning parameter lambda.

In contrast, when we look at the coefficient path for lasso, we see a very different pattern. What we see is that at certain critical values of this tuning parameter lambda, certain ones of our features jump out of our model. So, for example, here square feet of the lot size disappears from the model; here number of bedrooms drops out almost simultaneously with number of floors and number of bathrooms, followed by the year the house was built. And let me just be clear that for, let's say, a value of lambda like this, we have a sparse set of features included in our model: the ones I've circled are the only features in our model, and all the other ones have dropped completely, exactly to zero.

And one thing that we see is that when lambda is very large, like the large value I showed on the previous plot, the only thing left in our model is square feet living. Note that square feet living still has a significantly large weight on it. So I'll say: large weight on square feet living when everything else is out of the model, meaning not included in the model.
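This contrast between the two coefficient paths can be reproduced in a few lines. A rough sketch on synthetic data (not the course's housing example), again using scikit-learn with `alpha` standing in for lambda:

```python
# Sketch of the ridge vs. lasso coefficient paths on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))                                   # 6 candidate features
y = X @ np.array([5.0, 2.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(size=300)

for lam in [0.01, 0.1, 1.0, 3.0]:
    # The factor len(y) on the ridge penalty is only a crude attempt to put the
    # two penalties on a comparable scale; it is an assumption for illustration.
    ridge_w = Ridge(alpha=lam * len(y)).fit(X, y).coef_
    lasso_w = Lasso(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:5}: ridge nonzeros={np.count_nonzero(ridge_w)}, "
          f"lasso nonzeros={np.count_nonzero(np.abs(lasso_w) > 1e-10)}")

# Ridge keeps all 6 features at every lambda, just with increasingly small
# weights, while lasso drops features one after another as lambda grows,
# until only the strongest feature is left in the model.
```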
So square feet living is still very valuable to our predictions, and it would take quite a large lambda value to say that even square feet living was not relevant. Eventually, square feet living would be shrunk exactly to zero, but only for a much larger value of lambda. But if I go back to my ridge regression solution, I see that I had a much smaller weight on square feet living, because I was distributing weight across many other features in the model, so the individual impact of square feet living wasn't as clear.

[MUSIC]
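As a hypothetical illustration of that last point (a synthetic stand-in, not the course's housing data): with several nearly identical "size" features, ridge tends to spread the weight across all of them, while lasso tends to concentrate it on one. The variable names and penalty values below are my own choices for the sketch.

```python
# Synthetic stand-in: three noisy copies of one underlying "size" signal.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
size = rng.normal(size=500)                        # common underlying signal
X = np.column_stack([size,
                     size + 0.1 * rng.normal(size=500),
                     size + 0.1 * rng.normal(size=500)])
y = 4.0 * size + 0.5 * rng.normal(size=500)

print("ridge:", np.round(Ridge(alpha=100).fit(X, y).coef_, 2))
# -> three comparable weights, each well below 4: the effect is spread out,
#    so no single column shows its full individual impact.
print("lasso:", np.round(Lasso(alpha=0.5).fit(X, y).coef_, 2))
# -> most of the weight lands on a single column, close to the full effect,
#    while the redundant copies are driven to (roughly) zero.
```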