1 00:00:00,148 --> 00:00:04,133 [MUSIC] 2 00:00:04,133 --> 00:00:04,740 Okay. 3 00:00:04,740 --> 00:00:07,120 So now let's think a little bit about the solution. 4 00:00:07,120 --> 00:00:09,910 So this is our w hat ridge. 5 00:00:09,910 --> 00:00:13,490 And what happens if I set lambda equal to 0? 6 00:00:13,490 --> 00:00:19,010 Well, I get w hat ridge is equal to 7 00:00:19,010 --> 00:00:24,970 H transpose H inverse, H transpose y. 8 00:00:26,370 --> 00:00:28,300 And that might look very familiar to you. 9 00:00:28,300 --> 00:00:34,695 That was exactly equal to our w hat least squares, our old solution, 10 00:00:34,695 --> 00:00:39,890 before we introduced this notion of ridge regression. 11 00:00:41,260 --> 00:00:45,430 And what if I set lambda all the way to infinity? 12 00:00:45,430 --> 00:00:50,458 Well then I get w hat ridge equals zero, 13 00:00:53,946 --> 00:00:59,688 because it's like dividing by infinity 14 00:01:04,664 --> 00:01:08,640 when we have this infinity appearing in this inverse. 15 00:01:08,640 --> 00:01:12,260 Remember the inverse was like our matrix analog of division, so 16 00:01:12,260 --> 00:01:16,240 that's intuition for why w hat ridge is exactly equal to zero. 17 00:01:17,330 --> 00:01:18,250 Okay. 18 00:01:18,250 --> 00:01:21,330 So this is a little sanity check: 19 00:01:21,330 --> 00:01:23,990 when lambda is equal to zero, this 20 00:01:23,990 --> 00:01:28,700 closed-form solution we have is exactly equal to our least squares solution. 21 00:01:28,700 --> 00:01:33,520 That's what we had discussed at the very beginning of this module. 22 00:01:33,520 --> 00:01:37,090 And likewise, when we crank lambda all the way up to infinity, 23 00:01:37,090 --> 00:01:39,530 our solution is equal to zero. 24 00:01:41,250 --> 00:01:44,079 But now what we have is we have a closed form for 25 00:01:44,079 --> 00:01:49,190 what the solution is for some lambda in between zero and infinity.
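The two limiting cases described here can be checked numerically. The following is a minimal sketch, not code from the lecture: the helper name `ridge_closed_form`, the synthetic data, and the value `1e12` standing in for "infinity" are all assumptions made for illustration.

```python
import numpy as np

def ridge_closed_form(H, y, lam):
    """Compute (H^T H + lam * I)^{-1} H^T y using a linear solve.

    Hypothetical helper for illustration; solving the linear system
    avoids forming the inverse explicitly.
    """
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

# Synthetic data (an assumption): N = 50 observations, D = 3 features.
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 3))
y = H @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

# Reference least squares solution from NumPy's solver.
w_ls = np.linalg.lstsq(H, y, rcond=None)[0]

# lambda = 0 should reproduce least squares.
w_zero = ridge_closed_form(H, y, 0.0)

# A very large lambda stands in for "lambda going to infinity".
w_huge = ridge_closed_form(H, y, 1e12)

print("least squares:", w_ls)
print("ridge, lambda = 0:", w_zero)
print("ridge, lambda = 1e12:", w_huge)
```

With a finite lambda the coefficients are only shrunk toward zero; they reach exactly zero only in the limit, which is why a large finite value is used above.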
26 00:01:49,190 --> 00:01:52,910 Let's also recall the discussion we had about our previous solution, 27 00:01:52,910 --> 00:01:57,390 w hat least squares, where we said this H 28 00:01:57,390 --> 00:02:02,670 transpose H looks exactly like the following, where 29 00:02:02,670 --> 00:02:08,350 this is our little cartoon of our H matrix, where the number for H transpose... 30 00:02:08,350 --> 00:02:11,260 Let me actually write it here, it'll be clearer. 31 00:02:11,260 --> 00:02:17,129 The number of rows is equal to the number of observations. 32 00:02:22,518 --> 00:02:27,354 That's N, and the number of columns 33 00:02:27,354 --> 00:02:32,356 is equal to the number of features, 34 00:02:32,356 --> 00:02:35,370 which we denote as D. 35 00:02:37,590 --> 00:02:42,890 And what we said, not last module but the module before that, 36 00:02:42,890 --> 00:02:47,421 we said H transpose H, the multiplication of these two matrices, is invertible 37 00:02:48,710 --> 00:02:54,320 in general if the number of observations is greater than the number of features, 38 00:02:54,320 --> 00:02:57,445 but really it's the number of linearly independent observations being greater 39 00:02:57,445 --> 00:02:59,230 than the number of features. And 40 00:02:59,230 --> 00:03:03,980 we said the complexity of this inverse is cubic in the number of features D. 41 00:03:05,180 --> 00:03:07,530 Well now let's think about similar properties, but 42 00:03:07,530 --> 00:03:09,890 of our ridge regression solution, 43 00:03:09,890 --> 00:03:14,690 where this H transpose H is exactly like we had it before. 44 00:03:14,690 --> 00:03:20,790 But now before we take our inverse, we're adding lambda times the identity matrix.
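For reference, the quantities described verbally above can be written out as follows. The notation (hats, superscripts, the symbol I for the D-by-D identity) is an assumed rendering of what is drawn on the slide, not a copy of it.

```latex
% H has one row per observation and one column per feature.
H \in \mathbb{R}^{N \times D}, \qquad H^\top H \in \mathbb{R}^{D \times D}

% Least squares solution (the lambda = 0 case).
\hat{w}^{\text{LS}} = (H^\top H)^{-1} H^\top y

% Ridge regression solution: lambda times the identity is added before inverting.
\hat{w}^{\text{ridge}} = (H^\top H + \lambda I)^{-1} H^\top y

% Either way, inverting a D x D matrix costs on the order of D^3 operations.
\text{cost of the inverse} = O(D^3)
```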
45 00:03:20,790 --> 00:03:24,460 And when you take a scalar and multiply by the identity matrix, 46 00:03:24,460 --> 00:03:29,840 you just get that value along the diagonal, so we get a whole bunch of 47 00:03:29,840 --> 00:03:35,360 lambdas along the diagonal and zero everywhere else in this matrix. 48 00:03:37,660 --> 00:03:42,920 So what ends up happening now is the result, H transpose H plus lambda 49 00:03:42,920 --> 00:03:48,940 times identity, is always invertible when lambda is greater than zero, 50 00:03:48,940 --> 00:03:53,170 even if the number of observations, or number of 51 00:03:53,170 --> 00:03:57,160 linearly independent observations, is less than the number of features. 52 00:03:58,850 --> 00:04:00,622 So this is really important 53 00:04:04,376 --> 00:04:06,382 when you have lots of features. 54 00:04:11,263 --> 00:04:16,049 So for large D, which means lots of features, and remember, 55 00:04:16,049 --> 00:04:20,190 that's how we motivated using ridge regression. 56 00:04:20,190 --> 00:04:22,660 We're in these really complicated models where you have lots and 57 00:04:22,660 --> 00:04:26,890 lots of features, a lot of flexibility and the potential to overfit. 58 00:04:26,890 --> 00:04:32,490 Now we see something very explicit about how it helps us. 59 00:04:32,490 --> 00:04:36,880 And just to return to the discussion on the naming 60 00:04:37,950 --> 00:04:41,480 of ridge regression being called a regularization technique: 61 00:04:41,480 --> 00:04:42,740 if you remember, 62 00:04:42,740 --> 00:04:47,260 I said that we're regularizing our standard least squares solution. 63 00:04:48,500 --> 00:04:54,100 Well we can see that here, because lambda 64 00:04:54,100 --> 00:04:59,540 times the identity is making H transpose 65 00:04:59,540 --> 00:05:04,692 H plus lambda identity more regular.
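The invertibility claim can be illustrated with a small numerical experiment. This is a sketch under assumed values (N = 5 observations, D = 8 features, lambda = 0.1, random Gaussian data), none of which come from the lecture. With fewer observations than features, H transpose H cannot have full rank, while adding lambda along the diagonal shifts every eigenvalue up by lambda.

```python
import numpy as np

# Assumed sizes: fewer observations (N) than features (D).
rng = np.random.default_rng(1)
N, D = 5, 8
H = rng.normal(size=(N, D))

gram = H.T @ H                      # D x D matrix with rank at most N
lam = 0.1                           # any lambda > 0 works here
regularized = gram + lam * np.eye(D)

# Rank before and after adding lambda * I.
rank_gram = np.linalg.matrix_rank(gram)
rank_reg = np.linalg.matrix_rank(regularized)

# Both matrices are symmetric, so eigvalsh applies.
# Adding lam * I shifts every eigenvalue of gram up by exactly lam.
min_eig_gram = np.linalg.eigvalsh(gram).min()
min_eig_reg = np.linalg.eigvalsh(regularized).min()

print("rank of H^T H:", rank_gram, "out of", D)
print("rank of H^T H + lambda I:", rank_reg, "out of", D)
print("smallest eigenvalues:", min_eig_gram, min_eig_reg)
```

Because H transpose H has no negative eigenvalues, the smallest eigenvalue of the regularized matrix is at least lambda, so it stays strictly positive and the inverse exists.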
66 00:05:04,692 --> 00:05:09,347 That's what's allowing us to do this inverse even in this 67 00:05:09,347 --> 00:05:13,147 other situation, this harder situation, and 68 00:05:13,147 --> 00:05:18,184 because this result is more regular, we call it regularized. 69 00:05:23,213 --> 00:05:24,540 Okay. 70 00:05:24,540 --> 00:05:29,025 But the complexity of the inverse is still cubic in the number of features we 71 00:05:29,025 --> 00:05:33,648 have, and often, when we're thinking about ridge regression, like I said, we're 72 00:05:33,648 --> 00:05:38,409 thinking about cases where you have lots and lots of features, so doing this closed 73 00:05:38,409 --> 00:05:42,850 form solution that we've shown here can be computationally prohibitive. 74 00:05:42,850 --> 00:05:48,109 [MUSIC]