So here we are, back at our polynomial regression demo. Remember, previously we were just doing least squares estimation. Let's quickly scroll through this. We had data generated from a sine function, and when we fit a degree-2 polynomial, things looked pretty reasonable. Degree-4 started looking a bit wigglier, with larger estimated coefficients, and degree-16 looked really wiggly and had these massive, massive coefficients.

Now let's get to our ridge regression, where we're just going to take our polynomial regression function and modify it. Using GraphLab Create, the ridge regression modification is really simple because, as we mentioned before, there's an l2_penalty input to .linear_regression. Before, when we were doing just least squares, we set that L2 penalty equal to zero. This is the lambda value we've been talking about that trades off between fit and model complexity. Here, though, we're actually going to specify a value for this penalty, and that's the only modification we have to make to implement ridge regression using GraphLab Create. But again, in the assignments for this course you're going to explore implementing these methods yourself.

Okay, so let's define this polynomial ridge regression function (see the code sketch below). Then we're going to go through and explore fitting that really high-order polynomial, the degree-16 polynomial that had a very wiggly fit and crazy coefficients, but now solving the ridge regression objective for different values of lambda.

To start with, let's consider a really, really small lambda value, so a very small penalty on the two-norm of the coefficients. What we'd expect is that the estimated fit would look very similar to the standard least squares case. And if we look at the plot, this figure looks very, very similar (if I scroll up quickly) to the fit we had doing just standard least squares. So that checks out with what we know should happen, and, likewise, the coefficients are still these really, really massive numbers. Okay, but what if we increase the strength of our penalty? Let's consider a very large L2 penalty.
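As a concrete reference, here is a minimal sketch of the kind of polynomial ridge regression helper being described. The one detail confirmed above is the l2_penalty argument to GraphLab Create's linear regression; the column names 'X' and 'Y' and the polynomial_features helper are assumptions standing in for however the demo notebook actually builds its polynomial features.

```python
import graphlab

def polynomial_features(data, deg):
    # Build an SFrame with columns X1, ..., Xdeg (powers of the input) plus the
    # target column Y. The 'X'/'Y' column names are assumptions about the demo data.
    poly = graphlab.SFrame()
    poly['X1'] = data['X']
    for power in range(2, deg + 1):
        poly['X' + str(power)] = data['X'].apply(lambda x: x ** power)
    poly['Y'] = data['Y']
    return poly

def polynomial_ridge_regression(data, deg, l2_penalty):
    # Fit a degree-`deg` polynomial with an L2 penalty on the coefficients.
    # The only change from the plain least squares version is l2_penalty.
    model = graphlab.linear_regression.create(polynomial_features(data, deg),
                                              target='Y',
                                              l2_penalty=l2_penalty,
                                              validation_set=None,
                                              verbose=False)
    return model

# A tiny penalty behaves almost like least squares; a large one shrinks coefficients hard.
model_small_penalty = polynomial_ridge_regression(data, deg=16, l2_penalty=1e-25)
model_large_penalty = polynomial_ridge_regression(data, deg=16, l2_penalty=100)
```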
Here we're considering a value of 100, whereas in the case above we were considering a value of 1e-25, so really, really tiny. Well, in this case we end up with much smaller coefficients; actually, they look really, really small. So let's look at what the fit looks like. We see a really, really smooth curve, and very flat, probably way too simple a description of what's really going on in the data. It doesn't capture the trend of the data, where the values increase and then decrease; instead we get a roughly constant fit followed by a decrease. So this seems to be under-fit. As we expect, when lambda is really small we get something similar to our least squares solution, and when lambda becomes really large all the coefficients start approaching zero.

Okay, so now what we're going to do is look at the fit as a function of a series of different lambda values, going from 1e-25 all the way up to 100, but looking at some intermediate values as well, to see what the fit and coefficients look like as we increase lambda (a sketch of this sweep appears at the end of this passage). We're starting with these crazy, crazy large coefficient values. By the time we're at 1e-10 for lambda, the coefficient values have decreased by a couple of orders of magnitude, so they're on the order of 10^4 now. Then we keep increasing lambda: at 1e-6 we get coefficients on the order of hundreds, so in terms of reasonability I'd say they start looking a little more realistic. And as we keep going, you see the values of the coefficients keep decreasing, and when we get to a lambda of 100 we get these really small coefficients.

But now let's look at what the fits are for these different lambda values. Here's the plot we've been showing before for the really small lambda. Increasing lambda a bit, the fit is smoother, but still pretty wiggly and crazy, especially at these boundary points. Increase lambda more and things start looking better. When we get to 1e-3, this looks pretty good. Especially out here, it's hard to tell whether the function should be going up or down.
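For reference, the sweep just described might look roughly like this, reusing the hypothetical polynomial_ridge_regression and polynomial_features helpers sketched earlier; the coefficient printing and plotting details are assumptions for illustration, not a transcription of the notebook.

```python
import matplotlib.pyplot as plt

# Candidate penalties from essentially least squares (1e-25) up to heavy shrinkage (100).
l2_penalties = [1e-25, 1e-10, 1e-6, 1e-3, 1e2]

for l2 in l2_penalties:
    model = polynomial_ridge_regression(data, deg=16, l2_penalty=l2)
    print('l2_penalty = %g' % l2)
    model.coefficients.print_rows(num_rows=17)  # intercept plus 16 powers
    # Overlay each fit on the raw data to watch the wiggliness shrink as lambda grows
    # (assumes the demo data is sorted by X so the line plot is sensible).
    plt.plot(list(data['X']),
             list(model.predict(polynomial_features(data, 16))),
             label='lambda = %g' % l2)

plt.scatter(list(data['X']), list(data['Y']), color='black')
plt.legend()
plt.show()
```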
I want to emphasize that at the boundaries, where you have few observations, it's very hard to control the fit, so we trust the fit much more in the intermediate regions of our x range where we actually have observations. Okay, but then we get to this really large lambda, and we see that we're clearly over-smoothing the data.

So a natural question is: out of all these possible lambda values we might consider, and all the associated fits, which is the one we should use for forming our predictions? It would be really nice if there were some automatic procedure for selecting this lambda value, instead of me having to go through, specify a large set of lambdas, look at the coefficients, look at the fits, and somehow make a judgment call about which one to use. Well, the good news is that there is a way to automatically choose lambda, and this is something we're going to discuss later in this module. One method we're going to talk about is called leave-one-out cross validation. Minimizing this leave-one-out cross-validation error, which we'll define later, approximates minimizing the average mean squared error of our predictions.

So what we're going to do here is define this leave-one-out cross-validation function and then apply it to our data. You're not going to understand everything that's going on in this function yet, but you will by the end of this module, and you'll be able to implement this method yourself. What it's doing is looking at the prediction error for different lambda values and then choosing the one that minimizes that error. Of course, we're not measuring that error on the training set or the test set; we're using a validation set, but in a very specific way. (A rough sketch of this function appears below.)

Okay, so now that we've applied this leave-one-out function to our data and a specified set of penalty values, we can plot the leave-one-out cross-validation error as a function of the lambda values we considered. In this case, we actually see a curve that's pretty flat in a bunch of regions, which means our fits are not very sensitive to the choice of lambda in those regions.
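Here is a rough sketch of what a leave-one-out cross-validation helper along these lines could look like, again building on the hypothetical helpers above; the SFrame slicing and append calls are assumptions about how the single-point hold-out split is done, not the course's actual implementation.

```python
def leave_one_out(data, deg, l2_penalties):
    # For each candidate penalty, repeatedly hold out one observation, fit on the
    # remaining n - 1 points, and measure the squared error on the held-out point.
    n = len(data)
    avg_errors = []
    for l2 in l2_penalties:
        total_squared_error = 0.0
        for i in range(n):
            train = data[0:i].append(data[i+1:n])   # everything except row i
            held_out = data[i:i+1]                  # just row i
            model = polynomial_ridge_regression(train, deg, l2)
            prediction = model.predict(polynomial_features(held_out, deg))[0]
            total_squared_error += (prediction - held_out['Y'][0]) ** 2
        avg_errors.append(total_squared_error / n)  # average LOO error for this lambda
    return avg_errors
```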
But there is some minimum, and we can figure out what that minimum is. So here we're just selecting the lambda with the lowest cross-validation error, and then fitting our polynomial ridge regression model using that specific lambda value. We print our coefficients, and what you see is that we have very reasonable numbers, things on the order of 1, 0.2, 0.5. Let's look at the associated fit: things look really nice in this case. There's a really nice trend throughout most of the range of x. The only place things look a little bit crazy is out here at the boundary. But again, in this boundary region we don't actually have any data to pin down the function, so even though we're shrinking the coefficients of this degree-16 polynomial, we don't have much information about what the function should do out there. (The final selection step is sketched below.)

What we've seen is that this leave-one-out cross-validation technique really nicely selects a lambda value that provides a good fit and automatically balances bias and variance for us.
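The final selection step might look something like this, where the grid of penalties is a hypothetical choice for illustration and the helpers are the sketches from above.

```python
import numpy as np

# Hypothetical grid of penalties spanning the range discussed in the demo.
l2_penalties = np.logspace(-25, 2, num=28)

# Average leave-one-out error for each candidate penalty.
loo_errors = leave_one_out(data, deg=16, l2_penalties=l2_penalties)

# Keep the penalty with the smallest cross-validation error and refit on all the data.
best_l2 = l2_penalties[np.argmin(loo_errors)]
best_model = polynomial_ridge_regression(data, deg=16, l2_penalty=best_l2)
print(best_model.coefficients)
```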