Suppose you'd like to decide what degree of polynomial to fit to a data set, that is, what features to include in your learning algorithm. Or suppose you'd like to choose the regularization parameter lambda for the learning algorithm. How do you do that? These are called model selection problems. And in our discussion of how to do this, we'll talk about not just how to split your data into a training set and a test set, but how to split your data into what we'll discover is called the training, validation, and test sets. We'll see in this video just what these things are and how to use them to do model selection.

We've already seen many times the problem of overfitting, in which, just because a learning algorithm fits a training set well, that doesn't mean it's a good hypothesis. More generally, this is why the training set error is not a good predictor of how well the hypothesis will do on new examples. Concretely, if you fit some set of parameters (theta 0, theta 1, theta 2, and so on) to your training set, then the fact that your hypothesis does well on the training set doesn't mean much in terms of predicting how well it will generalize to new examples not seen in the training set. And the more general principle is that once your parameters have been fit to some set of data (maybe the training set, maybe something else), the error of your hypothesis measured on that same data set, such as the training error, is unlikely to be a good estimate of your actual generalization error, that is, of how well the hypothesis will generalize to new examples.

Now let's consider the model selection problem. Let's say you're trying to choose what degree polynomial to fit to your data.
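To state that point a bit more precisely: in the notation of this course, the training error is the average squared error over the m training examples, while the generalization error is the expected error on a brand-new example. The expectation form below is my own shorthand rather than notation from the lecture:

```latex
% Error on the same data theta was fit to:
J_{\mathrm{train}}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2

% Expected error on a new example (x, y) drawn from the underlying
% distribution (shorthand, not lecture notation):
\varepsilon(\theta) = \mathbb{E}_{(x,y)}\Bigl[\bigl(h_\theta(x) - y\bigr)^2\Bigr]
```

If theta was chosen precisely to minimize J train, then J train of theta will typically understate epsilon of theta, which is the principle the lecture is stating.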
So, should you choose a linear function, a quadratic function, a cubic function, all the way up to a 10th-order polynomial? It's as if there's one extra parameter in this algorithm, which I'm going to denote d: what degree of polynomial do you want to pick? So, in addition to the theta parameters, it's as if there's one more parameter, d, that you're trying to determine using your data set. The first option is d equals 1, which is the linear function; we can choose d equals 2, d equals 3, all the way up to d equals 10. So we would like to fit this extra parameter, which I'm denoting by d. Concretely, let's say that you want to choose a model, that is, choose a degree of polynomial, choose one of these ten models, fit that model, and also get some estimate of how well your fitted hypothesis will generalize to new examples.

Here's one thing you could do: you could take your first model and minimize the training error, and this would give you some parameter vector theta. You can then take your second model, the quadratic function, fit that to your training set, and this will give you some other parameter vector theta. In order to distinguish between these different parameter vectors, I'm going to use a superscript 1, superscript 2, and so on, where theta superscript 1 just means the parameters I get by fitting the linear model to my training data, and theta superscript 2 just means the parameters I get by fitting the quadratic function to my training data, and so on. By fitting a cubic model I get parameters theta superscript 3, and so on, up to, say, theta superscript 10. And one thing we could then do is take these parameters and look at the test set error.
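As a minimal sketch of that fitting step, assuming one-dimensional inputs, here is one way it might look in Python. The data and all variable names are hypothetical (they're mine, not from the lecture), and np.polyfit simply stands in for minimizing the training cost for each candidate degree d:

```python
import numpy as np

# Hypothetical 1-D training data; in practice these come from your problem.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 60)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(60)

# Fit one hypothesis per candidate degree d = 1..10.
# thetas[d] plays the role of theta superscript (d): the coefficient
# vector obtained by minimizing the training error for that model.
thetas = {d: np.polyfit(x_train, y_train, deg=d) for d in range(1, 11)}
```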
So I can compute, on my test set, J test of theta superscript 1, J test of theta superscript 2, J test of theta superscript 3, and so on. That is, I'm going to take each of my hypotheses, with the corresponding parameters, and just measure its performance on the test set. Then, in order to select one of these models, I could see which model has the lowest test set error, and let's just say, for this example, that I ended up choosing the fifth-order polynomial. This seems reasonable so far. But now, let's say I want to take my fitted hypothesis, this fifth-order model, and ask how well it generalizes. One thing I could do is look at how well my fifth-order polynomial hypothesis did on my test set. But the problem is that this will not be a fair estimate of how well my hypothesis generalizes. And the reason is that what we've done is fit this extra parameter d, that is, the degree of polynomial, and we fit that parameter d using the test set. Namely, we chose the value of d that gave us the best possible performance on the test set, and so the performance of my parameter vector theta superscript 5 on the test set is likely to be an overly optimistic estimate of generalization error. Right? Because I have fit this parameter d to my test set, it is no longer fair to evaluate my hypothesis on this test set. I've chosen the degree d of polynomial using the test set.
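Written compactly (the argmin notation here is my paraphrase of what was just described), the flawed procedure is:

```latex
% Flawed selection: the degree d is itself fit to the test set
d^{*} = \arg\min_{d \in \{1,\dots,10\}} J_{\mathrm{test}}\bigl(\theta^{(d)}\bigr)
```

So J test of theta superscript d-star is the minimum of ten test-set evaluations, which is why it is an optimistically biased estimate of generalization error.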
The upshot is that our hypothesis is likely to do better on this test set than it would on new examples it hasn't seen before, and the latter is what we actually care about. So, just to reiterate: on the previous slide we saw that if we fit some set of parameters, say theta 0, theta 1, and so on, to some training set, then the performance of the fitted model on the training set is not predictive of how well the hypothesis will generalize to new examples. That's because the parameters were fit to the training set, so they are likely to do well on the training set even if they don't do well on other examples. And in the procedure I've just described on this slide, we've done the same thing: specifically, we fit the parameter d to the test set. By having fit that parameter to the test set, the performance of the hypothesis on the test set may not be a fair estimate of how well the hypothesis is likely to do on examples we haven't seen before.

To address this problem in a model selection setting, here is what we usually do instead if we want to evaluate a hypothesis. Given a data set, instead of just splitting it into a training and test set, we're going to split it into three pieces. The first piece is going to be called the training set, as usual. The second piece of the data I'm going to call the cross-validation set, and I'm going to abbreviate cross-validation as CV.
Sometimes it's also called the validation set, instead of the cross-validation set. And the last part I'm going to call my usual test set. A pretty typical ratio for splitting these would be to send 60% of your data to your training set, maybe 20% to your cross-validation set, and 20% to your test set. These numbers can vary a little bit, but this sort of ratio is pretty typical. So our training set will now be only maybe 60% of the data, and our cross-validation set, or our validation set, will have some number of examples, which I'm going to denote m subscript cv, the number of cross-validation examples. Following our earlier notational convention, I'm going to use (x superscript (i) subscript cv, y superscript (i) subscript cv) to denote the i-th cross-validation example. And finally, we also have a test set over here, with m subscript test being the number of test examples.

So, now that we have defined the training, validation (or cross-validation), and test sets, we can also define the training error, cross-validation error, and test error. Here's my training error, which I'm writing as J subscript train of theta. This is pretty much the same thing as the J of theta that we've been writing so far; it's just the error you measure on your training set. Then J subscript cv is my cross-validation error, which is pretty much what you'd expect: it's just like the training error, except measured on the cross-validation data set. And here's my test set error, same as before.
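A minimal sketch of the split and the three errors, again on hypothetical data with names of my own choosing; the 60/20/20 fractions are the ones from the lecture:

```python
import numpy as np

def squared_error(theta, x, y):
    # J(theta) = 1/(2m) * sum over the m examples of (h_theta(x) - y)^2,
    # where h_theta is the polynomial with coefficient vector theta.
    m = len(y)
    return np.sum((np.polyval(theta, x) - y) ** 2) / (2 * m)

# Hypothetical full data set; shuffle before splitting so that each
# piece is representative of the whole.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)
order = rng.permutation(len(x))
x, y = x[order], y[order]

# 60% / 20% / 20% split into training / cross-validation / test sets.
i1, i2 = int(0.6 * len(x)), int(0.8 * len(x))
x_train, y_train = x[:i1], y[:i1]
x_cv, y_cv = x[i1:i2], y[i1:i2]      # m_cv = 20 cross-validation examples
x_test, y_test = x[i2:], y[i2:]      # m_test = 20 test examples
```

With this in place, J train, J cv, and J test are just squared_error evaluated on the corresponding slice of the data.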
So, when faced with a model selection problem like this, instead of using the test set to select the model, we're going to use the validation set, or cross-validation set, to select the model. Concretely, we're going to first take our first hypothesis, this first model, and minimize the cost function; this gives me some parameter vector theta for the linear model, and as before I'm going to use the superscript 1 to denote that these are the parameters for the linear model. We do the same thing for the quadratic model and get some parameter vector theta superscript 2, some parameter vector theta superscript 3, and so on, down to, say, the tenth-order polynomial. But instead of testing these hypotheses on the test set, I'm going to test them on the cross-validation set: I'm going to measure J subscript cv to see how well each of these hypotheses does on my cross-validation set, and then I'm going to pick the hypothesis with the lowest cross-validation error. For this example, let's say for the sake of argument that it was my fourth-order polynomial that had the lowest cross-validation error. In that case, I'm going to pick the fourth-order polynomial model. Finally, what this means is that the parameter d (remember, d was the degree of polynomial: d equals 2, d equals 3, up to d equals 10) has been fit using the cross-validation set: we set d equals 4. And so this degree of polynomial, this parameter, is no longer fit to the test set.
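Putting the procedure together, here is a sketch that continues from the split and cost function above, under the same assumptions about the hypothetical data; np.polyfit again stands in for minimizing the training cost:

```python
# Fit each candidate degree on the training set only.
thetas = {d: np.polyfit(x_train, y_train, deg=d) for d in range(1, 11)}

# Select the degree with the lowest cross-validation error J_cv.
j_cv = {d: squared_error(theta, x_cv, y_cv) for d, theta in thetas.items()}
best_d = min(j_cv, key=j_cv.get)

# Only now touch the test set: because d was chosen without it, J_test
# gives a fair estimate of the selected model's generalization error.
j_test = squared_error(thetas[best_d], x_test, y_test)
print(f"selected d = {best_d}, estimated generalization error = {j_test:.4f}")
```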
And so we've now set aside the test set, and we can use the test set to measure, or to estimate, the generalization error of the model that was selected by this algorithm. So, that was model selection: how you can take your data, split it into a training, validation, and test set, use your cross-validation data to select a model, and evaluate it on the test set.

One final note: I should say that in machine learning practice today, there are many people who will do the earlier thing I talked about and said isn't such a good idea: selecting the degree of polynomial using the test set, and then reporting the error on that same test set as though it were a good estimate of generalization error. Unfortunately, that sort of practice is something many people do. If you have a massive test set, it's maybe not a terrible thing to do, but most practitioners of machine learning tend to advise against it, and it's considered better practice to have separate training, validation, and test sets. I'll just warn you that sometimes people do use the same data for the purpose of the validation set and for the purpose of the test set, so they only have a training set and a test set, and that isn't good practice. You will see some people do it, but if possible I'd recommend against doing it yourself.