You've seen how regularization can help prevent overfitting, but how does it affect the bias and variance of a learning algorithm? In this video, I'd like to go deeper into the issue of bias and variance, and talk about how it interacts with, and is affected by, the regularization of your learning algorithm.

Suppose we're fitting a linear regression model with a very high order polynomial, like the one shown here, but to prevent overfitting we're going to use regularization, as shown here. So we have this regularization term to try to keep the values of the parameters small, and as usual the regularization sums from j equals 1 to m rather than from j equals 0 to m.

Let's consider three cases. The first is the case of a very large value of the regularization parameter lambda, such as lambda equal to 10,000, some huge value. In this case, all of these parameters theta 1, theta 2, theta 3 and so on will be heavily penalized, so we end up with most of these parameter values being close to 0, and the hypothesis h of x will be roughly equal, or approximately equal, to theta 0. We end up with a hypothesis that more or less looks like a flat, constant straight line. This hypothesis has high bias and badly underfits this data set; the horizontal straight line is just not a very good model for this data.
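To make the role of lambda concrete, here is a minimal sketch of this regularized cost, written in Python with NumPy; the function name and arguments are my own and just stand in for the objective described above, with theta 0 left out of the penalty.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost J(theta).

    X is an (m x n+1) design matrix whose first column is all ones,
    y is the vector of m targets, and lam is lambda.  theta[0] is
    deliberately excluded from the penalty (the sum runs from j = 1).
    """
    m = len(y)
    errors = X @ theta - y                       # h(x) - y for every example
    squared_error = (errors @ errors) / (2 * m)  # (1/2m) * sum of squared errors
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return squared_error + penalty
```

With a huge lam, minimizing this drives theta[1:] toward zero, which is exactly the flat, high-bias hypothesis just described; with lam equal to zero the penalty disappears entirely.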
At the other extreme is the case of a very small value of lambda, such as lambda equal to 0. In that case, given that we're fitting a high order polynomial basically without regularization, or with very minimal regularization, we end up with our usual high variance, overfitting setting: if lambda is equal to zero, we're just fitting without regularization, and the hypothesis overfits. It's only if we have some intermediate value of lambda, neither too large nor too small, that we end up with parameters theta that give us a reasonable fit to this data.

So how can we automatically choose a good value for the regularization parameter lambda? Just to reiterate, here is our model and here is our learning algorithm's objective. For the setting where we're using regularization, let me define J train of theta to be something different: the optimization objective, but without the regularization term. Previously, in an earlier video when we were not using regularization, I defined J train of theta to be the same as J of theta, the cost function. But when we're using regularization, with this extra lambda term, we're going to define J train, my training set error, to be just my sum of squared errors on the training set, or my average squared error on the training set, without taking into account the regularization term. And similarly, I'm also going to define the cross-validation set error and the test set error, as before, to be the average sum of squared errors on the cross-validation and test sets. So just to summarize, my definitions of J train, J cv and J test are just the average squared error, or one half of the average squared error, on my training, cross-validation and test sets, without the extra regularization term.
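As a sketch of those definitions, using the same hypothetical NumPy setup as above, the error used for training, cross-validation and test evaluation is just one half the average squared error, with no lambda term.

```python
def average_squared_error(theta, X, y):
    """J_train, J_cv or J_test: one half the average squared error,
    with the regularization term deliberately omitted."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)
```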
So here is how we can automatically choose the regularization parameter lambda. What I usually do is have some range of values of lambda that I want to try. So I might consider not using regularization at all, and here are a few values I might try: lambda equal to 0.01, 0.02, 0.04, and so on. I usually step these up in multiples of two until some larger value; stepping in multiples of two, I'd actually end up with 10.24 rather than 10 exactly, but this is close enough, and the last couple of decimal places won't affect your result that much. So this gives me maybe twelve different models that I'm trying to select amongst, corresponding to 12 different values of the regularization parameter lambda. And of course, you can also go to values less than 0.01 or larger than 10, but I've just truncated the range here for convenience.

Given each of these 12 models, what we can do is the following: take the first model, with lambda equals 0, and minimize my cost function J of theta. This gives me some parameter vector theta, and, similar to the earlier video, let me denote it theta superscript 1. Then I can take my second model, with lambda set to 0.01, and minimize my cost function, now using lambda equals 0.01 of course, to get some different parameter vector theta, which I'll denote theta 2. Similarly I end up with theta 3 for my third model, and so on, until for my final model, with lambda set to 10, or 10.24, I end up with theta 12.
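One plausible way to code that loop is sketched below; `fit_regularized` is a hypothetical helper standing in for whatever routine you use to minimize the regularized cost above (here a call to scipy.optimize.minimize), and `X_train`, `y_train` are assumed training data.

```python
from scipy.optimize import minimize

# The 12 candidate values: 0, then 0.01 doubled repeatedly up to 10.24.
lambdas = [0.0] + [0.01 * 2 ** i for i in range(11)]

def fit_regularized(X, y, lam):
    """Minimize the regularized cost J(theta) for one value of lambda."""
    n = X.shape[1]
    result = minimize(regularized_cost, np.zeros(n), args=(X, y, lam))
    return result.x

# One parameter vector theta^(i) per candidate value of lambda.
thetas = [fit_regularized(X_train, y_train, lam) for lam in lambdas]
```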
Next I can take all of these hypotheses, all of these parameter vectors, and use my cross-validation set to evaluate them. So I can look at my first model, my second model, and so on, fit with these different values of the regularization parameter, and evaluate them on my cross-validation set; basically, measure the average squared error of each of these parameter vectors theta on my cross-validation set. I would then pick whichever of these 12 models gives me the lowest error on the cross-validation set. Let's say, for the sake of this example, that I end up picking theta 5, my fifth model, because it has the lowest cross-validation error.

Having done that, finally, what I would do if I want to report a test set error is to take the parameter vector theta 5 that I've selected and look at how well it does on my test set. Once again, it is as if we had fit this parameter theta to my cross-validation set, which is why I'm setting aside a separate test set that I'm going to use to get a better estimate of how well my parameter vector theta will generalize to previously unseen examples.

So that's model selection applied to selecting the regularization parameter lambda. The last thing I'd like to do in this video is get a better understanding of how the cross-validation error and the training error vary as we vary the regularization parameter lambda.
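Continuing the sketch from above, selecting the model on the cross-validation set and only then touching the test set might look like this (`X_cv`, `y_cv`, `X_test`, `y_test` are assumed hold-out splits).

```python
# Unregularized error of each candidate theta on the cross-validation set.
cv_errors = [average_squared_error(theta, X_cv, y_cv) for theta in thetas]

best = int(np.argmin(cv_errors))   # index of the model with the lowest J_cv
best_lambda = lambdas[best]
best_theta = thetas[best]

# The test set is used only once, to estimate how well the selected
# model generalizes to previously unseen examples.
test_error = average_squared_error(best_theta, X_test, y_test)
```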
Just as a reminder, this was our original cost function J of theta, but for this purpose we're going to define the training error without the regularization term, and the cross-validation error without the regularization term. What I'd like to do is plot this J train and plot this J cv, meaning: how well does my hypothesis do on the training set, and how well does it do on the cross-validation set, as I vary my regularization parameter lambda.

As we saw earlier, if lambda is small, then we're not using much regularization and we run a larger risk of overfitting; whereas if lambda is large, that is, if we are on the right part of this horizontal axis, then with a large value of lambda we run a high risk of having a bias problem.

So if you plot J train and J cv, what you find is that for small values of lambda you can fit the training set relatively well, because you're not regularizing. For small values of lambda, the regularization term basically goes away and you're just minimizing pretty much your squared error. So when lambda is small, you end up with a small value for J train; whereas if lambda is large, then you have a high bias problem and you might not fit your training set so well, so you end up with a value up there. So J train of theta will tend to increase as lambda increases, because a large value of lambda corresponds to high bias, where you might not even fit your training set well, whereas a small value of lambda corresponds to being able to fit very high degree polynomials to your data freely. As for the cross-validation error, we end up with a figure like this.
Over here on the right, if we have a large value of lambda, we may end up underfitting, so this is the high-bias regime, where the cross-validation error will be high. Let me just label that: that's J cv of theta, because with high bias we won't be fitting well, so we won't be doing well on the cross-validation set. Whereas here on the left is the high-variance regime, where if we have too small a value of lambda, then we may be overfitting the data, and by overfitting the data the cross-validation error will also be high.

So this is what the cross-validation error and the training error may look like as we vary the regularization parameter lambda. And once again, it will often be some intermediate value of lambda that is just right, or that works best, in terms of having a small cross-validation error or a small test set error. The curves I've drawn here are somewhat cartoonish and somewhat idealized; on a real data set, the curves you get may end up looking a little bit messier and a little bit noisier than this. But for some data sets you will really see these broad sorts of trends, and by looking at a plot of the cross-validation error, you can, either manually or automatically, try to select the point that minimizes the cross-validation error and pick the value of lambda corresponding to that low cross-validation error.
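If you want to reproduce that picture for your own data, a quick matplotlib sketch, reusing the hypothetical `lambdas`, `thetas` and error function from the earlier snippets, could look like this.

```python
import matplotlib.pyplot as plt

train_errors = [average_squared_error(theta, X_train, y_train) for theta in thetas]
cv_errors = [average_squared_error(theta, X_cv, y_cv) for theta in thetas]

plt.plot(lambdas, train_errors, label="J_train")  # tends to rise as lambda grows
plt.plot(lambdas, cv_errors, label="J_cv")        # high at both extremes
plt.xlabel("lambda")
plt.ylabel("error")
plt.legend()
plt.show()
```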
When I'm trying to pick the regularization parameter lambda for a learning algorithm, I often find that plotting a figure like the one shown here helps me understand better what's going on, and helps me verify that I am indeed picking a good value for the regularization parameter lambda. So hopefully that gives you more insight into regularization and its effects on the bias and variance of a learning algorithm. By now you've seen bias and variance from a lot of different perspectives, and what I'd like to do in the next video is take a lot of the insights we've gone through and build on them to put together a diagnostic called learning curves, which is a tool I often use to try to diagnose whether a learning algorithm may be suffering from a bias problem, a variance problem, or a little bit of both.