So in modules one and two we described how to fit different models, and in module two in particular we described how to fit very complex models. But up through our third module we had no idea how to assess whether that fitted model was going to perform well on our prediction tasks. So in module three, our emphasis was on assessing the performance of our fitted model and thinking about how we can select between different models to get good predictive performance.

The first notion we introduced in order to measure how well our fit was performing was the notion of loss. This is a kind of negative measure of performance, where we want to lose as little as possible from making poor predictions, under the assumption that our predictions are not perfect. And we discussed two loss metrics that are very commonly used: absolute error and squared error.

Then, with this loss function in hand, we talked about defining three different measures of error. The first was our training error, which we said was not a good assessment of the predictive performance of our model. Then we defined something called our generalization, or true, error, which is what we really want: we want to say how well we are predicting every possible observation that we might see out there. And we said, okay, we can't actually compute that, so we defined something called our test error, which looks at the subset of our data that was not included in the training set, takes the model that was fit on the training data set, and makes predictions on those held-out points. And we said that test error is a noisy approximation to our generalization error.

For these three measures of error, we talked about how they vary as a function of model complexity. Training error, we know, goes down with increasing model complexity, but that does not indicate that we get better and better predictions as we increase model complexity. In contrast, if we look at generalization error, the true error, it tends to increase after a certain point. We say that that point is where the models start to become overfit.
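To make that contrast concrete, here is a minimal sketch, with an assumed synthetic data set and a simple polynomial model (none of this is the course's own code), of how training error and test error under squared-error loss behave as model complexity grows:

```python
import numpy as np

# Hypothetical illustration (assumed data and model, not from the course):
# fit polynomial models of increasing complexity and compare training error
# vs. test error under squared-error loss.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)  # noisy "true" relationship

# Hold out a subset of the data as a test set; fit only on the training set.
x_train, x_test = x[:70], x[70:]
y_train, y_test = y[:70], y[70:]

def squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

for degree in (1, 3, 6, 12):                        # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on training data only
    train_err = squared_error(y_train, np.polyval(coeffs, x_train))
    test_err = squared_error(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree:2d}: training error {train_err:.3f}, test error {test_err:.3f}")
```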
Overfit models perform very well on the training data set, but they don't generalize well to new data that we have not yet seen. And again, although we discussed this in the context of regression, this notion of training, test, and generalization error, and how they vary with model complexity, is a much more general concept that we'll see again in the specialization.

We then characterized three different sources that contribute to our prediction error. The first is the noise that is inherent in the data: this is our irreducible error, we have no control over it, and it has nothing to do with our model or our estimation procedure. Then we talked about the ideas of bias and variance. We described bias as how well our model can fit the true relationship, averaging over all possible training data sets that we might see. Variance, in contrast, describes how much a fitted function can vary from training data set to training data set, each of size N observations.

So of course noise in the data can contribute to our errors in prediction, but if our model can't adequately describe the true relationship, that is also a source of error, as is this variability from training set to training set.

Naturally we want low bias and low variance to have good predictive performance, but we saw that there is a bias-variance tradeoff: as you increase model complexity, your bias goes down, but your variance goes up. So there is a sweet spot that trades off between bias and variance and results in the lowest mean squared error, and that is what we are seeking to find. And as we've said multiple times, machine learning is all about exploring this bias-variance tradeoff.

We then concluded this module by asking: how are we going to both select our model and assess its performance? For this we said, well, we need to form something called a validation set. So we fit our models on the training data set, we select between different models, or select a tuning parameter describing these different models, on our validation set, and then we test the performance on our test set, which we never touch during fitting or selection.
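As a rough illustration of that workflow, here is a sketch under the same kinds of assumptions (synthetic data and polynomial models chosen only for the example, not taken from the course): fit on the training set, select the complexity on the validation set, and touch the test set only once at the end.

```python
import numpy as np

# Hypothetical sketch of a train/validation/test split: select model
# complexity on the validation set, assess the chosen model on the test set.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

# Three-way split: fit on train, select on validation, report error on test.
x_train, x_valid, x_test = x[:120], x[120:160], x[160:]
y_train, y_valid, y_test = y[:120], y[120:160], y[160:]

def squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

best_degree, best_valid_err = None, np.inf
for degree in range(1, 13):                        # candidate model complexities
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on training data only
    valid_err = squared_error(y_valid, np.polyval(coeffs, x_valid))
    if valid_err < best_valid_err:
        best_degree, best_valid_err = degree, valid_err

# Only now touch the test set, once, to assess the selected model.
coeffs = np.polyfit(x_train, y_train, best_degree)
print("selected degree:", best_degree,
      "test error:", squared_error(y_test, np.polyval(coeffs, x_test)))
```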
We also talked about how, if you don't have enough data to form this validation set, you can think about doing cross-validation instead, something we'll return to in later modules.

Then, in our fourth module, we talked about ridge regression. Remember that as our models become more and more complex, we can overfit, and what we saw is that a symptom of overfitting is that the magnitudes of our estimated coefficients explode. So what ridge regression does is trade off between a measure of how well our function fits the training data and a measure of the magnitude of the coefficients. Implicitly, by balancing these two terms, we're making a bias-variance tradeoff.

In particular, we saw that our ridge regression objective seeks to minimize the residual sum of squares plus lambda times the squared L2 norm of our coefficients, and we talked about what the coefficient path of the ridge solution looks like as we vary this tuning parameter lambda, the penalty strength on this L2 norm term. We saw that as you increase this penalty parameter, the magnitudes of our coefficients become smaller and smaller.

Then, for our ridge objective, just as we did for our standard least squares objective, we computed the gradient and set it equal to zero to get a closed-form solution, and this looks very similar to the solution we had before, except with one additional term. What we discussed in this module is the fact that adding this lambda times the identity matrix allows us to have a solution even when the number of features is larger than the number of observations, and it allows for a much more, quote unquote, regularized solution. That's why it's called a regularized regression technique. But the complexity of the solution is exactly the same as for least squares: cubic in the number of features. We also talked about a gradient descent implementation of ridge.

As we saw, a key question in what solution we get out of ridge is how we determine this lambda penalty strength. For this, instead of talking about cutting out a validation set to select the tuning parameter, we talked about cases where you might not have enough data to do that, and instead described the cross-validation procedure.
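To make the ridge discussion concrete, here is a small sketch of the closed-form solution w = (H^T H + lambda I)^(-1) H^T y, assuming a hypothetical feature matrix H and response y (an illustration, not the course's own code); it also shows the coefficient magnitudes shrinking as the penalty strength grows:

```python
import numpy as np

# Hypothetical sketch of the ridge closed-form solution (assumed feature
# matrix H and response y; not the course's own code):
#   w_ridge = (H^T H + lambda * I)^(-1) H^T y
rng = np.random.default_rng(2)
n, d = 30, 40                                   # more features than observations
H = rng.normal(size=(n, d))
y = H @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)

def ridge_fit(H, y, lam):
    # Adding lambda * I makes H^T H + lambda * I invertible even when d > n.
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

for lam in (0.01, 0.1, 1.0, 10.0):
    w = ridge_fit(H, y, lam)
    print(f"lambda={lam:5.2f}  ||w_ridge||_2 = {np.linalg.norm(w):.3f}")
```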
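And here is a similarly hypothetical sketch of k-fold cross-validation for choosing the penalty strength lambda, again with assumed data and helper names rather than the course's implementation:

```python
import numpy as np

# Hypothetical k-fold cross-validation sketch for selecting the ridge penalty
# strength lambda (assumed data and helper names; not the course's own code).
rng = np.random.default_rng(3)
n, d = 60, 8
H = rng.normal(size=(n, d))
y = H @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)

def ridge_fit(H, y, lam):
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def cv_error(H, y, lam, k=5):
    folds = np.array_split(np.arange(H.shape[0]), k)
    errs = []
    for valid_idx in folds:
        train_idx = np.setdiff1d(np.arange(H.shape[0]), valid_idx)
        w = ridge_fit(H[train_idx], y[train_idx], lam)   # fit on k-1 folds
        errs.append(np.mean((y[valid_idx] - H[valid_idx] @ w) ** 2))
    return np.mean(errs)                                 # average validation error

lambdas = np.logspace(-3, 2, 12)
best_lam = min(lambdas, key=lambda lam: cv_error(H, y, lam))
print("lambda selected by cross-validation:", best_lam)
```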