[MUSIC]

So, the first measure of error of our predictions that we can look at is something called training error. We discussed this at a high level in the first course of the specialization, but now let's go through it in a little bit more detail.

To define training error, we first have to define training data. Typically you have some dataset, which I've shown as these blue circles here, and we're going to choose our training dataset as just some subset of these points. The greyed circles are the ones that are not included in the training set; the blue circles are the ones that we're keeping in the training set.

Then we take our training data and, as we've discussed in previous modules of this course, we use it to fit our model, that is, to estimate our model parameters. Just as an example, with this dataset here, maybe we choose to fit some quadratic function to the data, and like we've talked about, in order to fit this quadratic function we're going to minimize the residual sum of squares on these training data points.

So, now we have our estimated model parameters, w hat, and we want to assess the training error of that estimated model. The way we do that is to first define some loss function. Maybe we look at squared error, or absolute error, any one of the many possibilities for our loss function. Then training error is defined simply as the average loss over the training points. Mathematically, this is simply 1 over N, where N is the total number of observations in my training set, times the sum of the loss over each one of those training observations.

And just to be very clear: remember that the estimated parameters were estimated on the training set. They were chosen by minimizing the residual sum of squares on the very same training points that we're now using to define this training error.

We can go through this pictorially in the following example, where we're specifically looking at using squared error as our loss function. In this case, our training error is simply 1 over N times the sum of the squared differences between our actual house sales prices and our predicted house sales prices, where that sum is taken over all houses in our training dataset.
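To make that definition concrete, here is a minimal sketch in Python. The data, the use of NumPy's polyfit as the fitting routine, and all the numbers are hypothetical stand-ins rather than anything from the lecture; the point is just the formula: training error = (1/N) times the sum of the loss over the N training observations.

```python
import numpy as np

# Hypothetical training data (not from the lecture):
# square footage vs. sale price in dollars.
x_train = np.array([1000.0, 1500.0, 1800.0, 2200.0, 2600.0, 3000.0])
y_train = np.array([250e3, 310e3, 360e3, 420e3, 480e3, 510e3])

# Fit a quadratic by minimizing residual sum of squares
# (np.polyfit performs exactly this least-squares fit).
w_hat = np.polyfit(x_train, y_train, deg=2)

# Predict on the *same* training points the model was fit to.
y_pred = np.polyval(w_hat, x_train)

# Training error = (1/N) * sum of squared-error losses.
training_error = np.mean((y_train - y_pred) ** 2)
print(f"training error (average squared loss): {training_error:.3e}")
```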
And what we see is that in this case, where we choose squared error as our loss function, the form of training error is exactly 1 over N times our residual sum of squares. I want to note here that there's some difference in the convention people use, namely whether the 1 over N is part of the definition of training error or not. So just be aware of that when you're computing training error and reporting these numbers. Here we're defining it as the average loss.

More formally, we can write our training error as follows, and then we can define something that's commonly referred to simply as RMSE, whose full name is root mean square error. RMSE is simply the square root of our average loss on the training houses, so the square root of our training error.

The reason one might consider looking at root mean square error is that its units, in this case, are just dollars, whereas the units of our training error were dollars squared. Remember, we're taking the squares of all these differences in dollars, so the result is in dollars squared. That's a little bit less intuitive as an error metric than an error in terms of dollars themselves.

Now that we've defined training error, we can look at how training error behaves as model complexity increases. To start with, let's look at the simplest possible model you might fit, which is just a constant model. This is the simplest model we're going to consider, and you see that there is pretty significant training error. Let's just say it has some value here; this is the training error of the constant model.

Then let's say I fit a linear model. Well, a line. These are all linear models we're looking at, since it's linear regression, but here I mean just fitting a line to the data. And you see that my training error has gone down to some other value, which I'm showing with this pink circle here. Then I fit a quadratic function, and again training error goes down. And what I see is that as I increase my model complexity, to maybe some higher-order polynomial, I have very low training error, just this one pink bar here.
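Here is a small sketch of that unit argument, again with made-up prices rather than the lecture's data, showing that RMSE is just the square root of the average squared loss:

```python
import numpy as np

def training_error(y_true, y_pred):
    # Average squared-error loss: (1/N) * RSS. Units are dollars squared.
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # Square root of the training error. Units are plain dollars,
    # which is easier to interpret as a prediction error.
    return np.sqrt(training_error(y_true, y_pred))

# Tiny illustration with hypothetical prices (in dollars):
y_true = np.array([250e3, 310e3, 360e3])
y_pred = np.array([260e3, 300e3, 355e3])
print(training_error(y_true, y_pred))  # ~7.5e7  (dollars squared)
print(rmse(y_true, y_pred))            # ~8660   (dollars)
```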
So, training error decreases quite significantly with model complexity. Now that we've gone through these examples, we can look at what the plot of training error versus model complexity tends to look like: training error decreases as you increase your model complexity.

And why is that? Well, it's pretty intuitive, because the model was fit on the training points, and then I'm asking how well it fits those same points. As I increase the model complexity, I'm better and better able to fit my training data points. So when I go to assess training error with these high-complexity models, I get very low training error.

So, a natural question is whether training error is a good measure of predictive performance. What we're showing here is one of our high-complexity, high-order polynomial models that had very low training error, so it really fit those training data points well. But how is it going to perform on some new house? In particular, maybe we're looking at a house in this gray region, within this range of square feet. The question is, is there something particularly wrong with having Xt square feet? Because what our fitted function is saying is that I believe, or I'm predicting, that houses with roughly Xt square feet are less valuable than houses with fewer square feet, because there's this dip down in the function. Do we really believe that this is a true dip in value, that these houses are just less desirable than houses with fewer or more square feet? Probably not.

So, what's going wrong here? The issue is that training error is overly optimistic as an assessment of predictive performance. And that's because these parameters, w hat, were fit on the training data. They were fit to minimize this training error. Sorry, to minimize the residual sum of squares, which, as we saw, is closely related to training error. And then we're using that same training error to assess predictive performance, but that's going to be very, very optimistic, as this picture shows.

So, in general, having small training error does not imply having good predictive performance, unless your training dataset is really representative of everything you might see out there in the world.
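If you'd like to see this behavior numerically, here is a sketch that fits polynomials of increasing degree to a small synthetic dataset and prints the training error for each, reproducing the decreasing curve described above. The data-generating process and the choice of degrees are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: price vs. size in thousands of square feet
# (scaling x to roughly [1, 3] keeps the high-degree fit well conditioned).
x = rng.uniform(1.0, 3.0, size=12)
y = 100e3 * x + 50e3 + rng.normal(0.0, 20e3, size=12)

for degree in [0, 1, 2, 8]:
    w_hat = np.polyfit(x, y, deg=degree)      # least-squares fit
    y_pred = np.polyval(w_hat, x)             # predict on the training x
    train_err = np.mean((y - y_pred) ** 2)    # average squared loss
    print(f"degree {degree}: training error = {train_err:.3e}")

# The printed training errors fall as the degree grows: the degree-8
# polynomial nearly interpolates the 12 points, but its wiggles (like the
# dip in the lecture's plot) would generalize poorly to new houses.
```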
[MUSIC]