[MUSIC] So instead of this closed-form solution, which can be computationally prohibitive, let's think about using our gradient descent algorithm to solve our ridge regression.

And let's just talk about the element-wise update that we have. So it's the update to the jth feature weight. And remember, here I'm showing the gradient of our total cost, just as a reminder. What we do is we take our previous weight for the jth feature and we subtract eta, our step size, times whatever this gradient dictates, to form our new weight.

And what does this gradient tell us? Well, this first term here on this line is exactly what we had before from our residual sum of squares term. So this is the same as before; it comes from our residual sum of squares term. But now, in addition to this, we have another term appearing. We have this term, and this is our new term that comes from the jth component of 2 lambda w, where w is our weight vector.

Okay, so one thing we see is we have a w_j^t appearing in two places. So we can combine these and simplify the form of this expression to get a little bit more interpretability. What we see is that our update first takes w_j^t and multiplies it by 1 minus 2 times our step size times our tuning parameter, so 1 minus 2 eta lambda. And here we know that eta is greater than 0, and we know that lambda is greater than or equal to 0, so 2 times eta times lambda is going to be greater than or equal to 0. So when we have 1 minus a nonnegative number, we know that this thing here is going to be less than or equal to 1. But the other thing we want is for the result to be positive, so to enforce that we're going to say that 2 eta lambda has to be less than 1. Okay, so it's a positive quantity that is at most 1.

Okay, so what we're doing is we're taking our previous coefficient and we're multiplying it by a positive number less than 1 (whenever lambda is greater than 0), so what that's doing is just shrinking the value by some fraction.
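To make this combined update concrete, here is a minimal Python sketch of the per-coefficient ridge step. It assumes the usual setup in this course, a feature matrix H and an output vector y, with the RSS partial derivative -2 h_j^T (y - Hw); the function and variable names are my own for illustration, not from the slides.

```python
import numpy as np

# A minimal sketch of the per-coefficient ridge update described above.
# Variable names (feature_matrix, output, weights, eta, lam) are illustrative,
# not taken from the lecture slides.

def partial_rss(feature_matrix, output, weights, j):
    """Partial derivative of RSS with respect to w_j: -2 * h_j^T (y - H w)."""
    errors = output - feature_matrix @ weights
    return -2 * feature_matrix[:, j] @ errors

def ridge_update(weights, feature_matrix, output, j, eta, lam):
    """w_j^(t+1) = (1 - 2*eta*lam) * w_j^t - eta * dRSS/dw_j."""
    return (1 - 2 * eta * lam) * weights[j] - eta * partial_rss(
        feature_matrix, output, weights, j)
```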
And then we have this update here, which is the same update that we had just from our residual sum of squares as before, so I'll just call this our update term from residual sum of squares.

And so let's draw a picture of what's happening here. What I'm gonna do is draw a picture with iterations going this way and the magnitude of the coefficients along the y axis. But this is just a little cartoon. So we're gonna say that at some iteration t, we have a coefficient w_j^t. And what we're doing in ridge regression is we're introducing this intermediate step where we first shrink the value of the coefficient according to 1 minus 2 eta lambda. So this is (1 minus 2 eta lambda) times w_j^t. And just to be very clear, this is an intermediate step introduced in ridge regression. So this is some iteration t, this is some in-between iteration, and then we get to iteration t + 1.

What we do is we take whatever this update term is, it could be positive, it could be negative, and we modify this shrunken coefficient. So let's just assume for this illustration that it's positive, some positive value. So I'm going to take the last value of my shrunken coefficient, I'm gonna draw it here, and then I'm gonna add an amount that makes it, let's say, a little bit less than what my coefficient was before. So I'm gonna add this, my little update term from RSS, and this will give me w_j at iteration t + 1.

And let's contrast with what our old standard least squares algorithm did. So previously, in red, just RSS, which was our least squares solution, we had w_j^t, so we've started at the same point. But instead of shrinking the coefficient with this intermediate step, we just took this little update term from residual sum of squares and modified this coefficient.
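To attach some numbers to this cartoon, here is a tiny sketch with purely hypothetical values (none of them come from the lecture): a step size of 0.1, a tuning parameter of 0.5, a coefficient of 2.0 at iteration t, and an RSS-driven change of +0.15.

```python
# Purely illustrative numbers for the cartoon above; not from the lecture.
eta, lam = 0.1, 0.5        # step size and tuning parameter: 1 - 2*eta*lam = 0.9
w_j = 2.0                  # coefficient at iteration t
rss_step = 0.15            # -eta * (partial of RSS), assumed positive here

w_j_ridge = (1 - 2 * eta * lam) * w_j + rss_step   # shrink to 1.8, then update -> 1.95
w_j_ols = w_j + rss_step                           # plain least squares, no shrinkage -> 2.15
```

So the ridge step lands a bit below where the coefficient started, while the plain least squares step pushes it further up, which is exactly the gap the picture is meant to show.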
So I'm gonna try and map over to iteration t + 1 and then add the same amount, the same update term. So again, this is the same update term that I'm adding to w_j^t, but this is gonna give me my w_j^(t+1) according to the least squares algorithm.

Okay, so just to be clear about what's going on here, in case this picture was not helpful to you: in the ridge regression case we're taking our previous value of our coefficient, our jth regression coefficient, and the first thing we're doing, before we listen to what our fit term says we should do to it, is we first shrink it. At every iteration we first take the last value and we make it smaller. And that's where the model complexity term is coming in, where we're saying we want smaller coefficients to avoid overfitting. But then after that we're going to say, well, okay, let's listen to how well my model's fitting the data. That's where this update term is coming in, and we're going to modify our shrunken coefficient. And in contrast, our old solution never did that shrinking first. It just said, okay, let's just listen to the fit, and listen to the fit, and listen to the fit at every iteration, and that's where we could get into issues of overfitting.

Okay, so here's a summary of our standard least squares gradient descent algorithm. This is what we had two modules ago, where we just initialize our weights vector, and while not converged, for every feature dimension we go through and compute this partial derivative of the residual sum of squares. That's that update term I was talking about on the last slide. And then I'm going to use that to modify my previous coefficient according to the step size eta, and I'm gonna keep iterating until convergence. Okay, so this is what you coded up before.

Now let's look at how we have to modify this code to get a ridge regression variant. And notice that the only change is in this line for updating w_j where, instead of taking w_j^t and doing this update with what I'm calling "partial" here, we're taking (1 minus 2 eta lambda) times w_j and doing the update. So it's a trivial, trivial, trivial modification to your code. And that's really cool.
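As a rough sketch of how that one-line change sits inside the full loop, here is a possible Python version; the variable names and the convergence test on the gradient magnitude are my own choices, not the lecture's code.

```python
import numpy as np

def ridge_gradient_descent(feature_matrix, output, initial_weights,
                           eta, lam, tolerance=1e-6, max_iters=10000):
    """Gradient descent for ridge regression; a sketch, not the course's exact code."""
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iters):
        # errors based on the weights at the start of this iteration
        errors = output - feature_matrix @ weights
        gradient_norm_sq = 0.0
        for j in range(len(weights)):
            # partial derivative of RSS with respect to w_j
            partial = -2 * feature_matrix[:, j] @ errors
            # accumulate the magnitude of the total-cost gradient for the stopping test
            gradient_norm_sq += (partial + 2 * lam * weights[j]) ** 2
            # the one-line change from plain least squares:
            # shrink w_j by (1 - 2*eta*lam) before applying the RSS update
            weights[j] = (1 - 2 * eta * lam) * weights[j] - eta * partial
        if np.sqrt(gradient_norm_sq) < tolerance:
            break
    return weights
```

Setting lam to 0 recovers the plain least squares update, which is one way to see how small the modification really is.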
It's very, very, very simple to implement ridge regression given a specific lambda value. [MUSIC]