Suppose you'd like to decide what degree of polynomial to fit to a data set, that is, what features to include in your learning algorithm. Or suppose you'd like to choose the regularization parameter lambda for the learning algorithm. How do you do that? These are called model selection problems. And in our discussion of how to do this, we'll talk about not just how to split your data into a training set and a test set, but how to split your data into what we'll discover is called the training, validation, and test sets. We'll see in this video just what these things are and how to use them to do model selection.

We've already seen many times the problem of overfitting, in which, just because a learning algorithm fits a training set well, that doesn't mean it's a good hypothesis. More generally, this is why the training set error is not a good predictor of how well the hypothesis will do on new examples. Concretely, if you fit some set of parameters (theta 0, theta 1, theta 2, and so on) to your training set, then the fact that your hypothesis does well on the training set doesn't mean much in terms of predicting how well it will generalize to new examples not seen in the training set. And the more general principle is that once your parameters have been fit to some set of data (maybe the training set, maybe something else), the error of your hypothesis measured on that same data set, such as the training error, is unlikely to be a good estimate of your actual generalization error, that is, of how well the hypothesis will generalize to new examples.

Now let's consider the model selection problem. Let's say you're trying to choose what degree polynomial to fit to your data.
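To state that point a bit more precisely: in the notation of this course, the training error is the average squared error over the m training examples, while the generalization error is the expected error on a brand-new example. The expectation form below is my own shorthand rather than notation from the lecture:

```latex
% Error on the same data theta was fit to:
J_{\mathrm{train}}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2

% Expected error on a new example (x, y) drawn from the underlying
% distribution (shorthand, not lecture notation):
\varepsilon(\theta) = \mathbb{E}_{(x,y)}\Bigl[\bigl(h_\theta(x) - y\bigr)^2\Bigr]
```

If theta was chosen precisely to minimize J train, then J train of theta will typically understate epsilon of theta, which is the principle the lecture is stating.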
So, should you choose a linear function, a quadratic function, a cubic function, all the way up to a 10th-order polynomial? It's as if there's one extra parameter in this algorithm, which I'm going to denote d: what degree of polynomial do you want to pick? So, in addition to the theta parameters, it's as if there's one more parameter, d, that you're trying to determine using your data set. The first option is d equals 1, which is the linear function; we can choose d equals 2, d equals 3, all the way up to d equals 10. So we would like to fit this extra parameter, which I'm denoting by d. Concretely, let's say that you want to choose a model, that is, choose a degree of polynomial, choose one of these ten models, fit that model, and also get some estimate of how well your fitted hypothesis will generalize to new examples.

Here's one thing you could do: you could take your first model and minimize the training error, and this would give you some parameter vector theta. You can then take your second model, the quadratic function, fit that to your training set, and this will give you some other parameter vector theta. In order to distinguish between these different parameter vectors, I'm going to use a superscript 1, superscript 2, and so on, where theta superscript 1 just means the parameters I get by fitting the linear model to my training data, and theta superscript 2 just means the parameters I get by fitting the quadratic function to my training data, and so on. By fitting a cubic model I get parameters theta superscript 3, and so on, up to, say, theta superscript 10. And one thing we could then do is take these parameters and look at the test set error.
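As a minimal sketch of that fitting step, assuming one-dimensional inputs, here is one way it might look in Python. The data and all variable names are hypothetical (they're mine, not from the lecture), and np.polyfit simply stands in for minimizing the training cost for each candidate degree d:

```python
import numpy as np

# Hypothetical 1-D training data; in practice these come from your problem.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 60)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(60)

# Fit one hypothesis per candidate degree d = 1..10.
# thetas[d] plays the role of theta superscript (d): the coefficient
# vector obtained by minimizing the training error for that model.
thetas = {d: np.polyfit(x_train, y_train, deg=d) for d in range(1, 11)}
```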
So I can compute, on my test set, J test of theta superscript 1, J test of theta superscript 2, J test of theta superscript 3, and so on. That is, I'm going to take each of my hypotheses, with the corresponding parameters, and just measure its performance on the test set. Then, in order to select one of these models, I could see which model has the lowest test set error, and let's just say, for this example, that I ended up choosing the fifth-order polynomial. This seems reasonable so far. But now, let's say I want to take my fitted hypothesis, this fifth-order model, and ask how well it generalizes. One thing I could do is look at how well my fifth-order polynomial hypothesis did on my test set. But the problem is that this will not be a fair estimate of how well my hypothesis generalizes. And the reason is that what we've done is fit this extra parameter d, that is, the degree of polynomial, and we fit that parameter d using the test set. Namely, we chose the value of d that gave us the best possible performance on the test set, and so the performance of my parameter vector theta superscript 5 on the test set is likely to be an overly optimistic estimate of generalization error. Right? Because I have fit this parameter d to my test set, it is no longer fair to evaluate my hypothesis on this test set. I've chosen the degree d of polynomial using the test set.
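Written compactly (the argmin notation here is my paraphrase of what was just described), the flawed procedure is:

```latex
% Flawed selection: the degree d is itself fit to the test set
d^{*} = \arg\min_{d \in \{1,\dots,10\}} J_{\mathrm{test}}\bigl(\theta^{(d)}\bigr)
```

So J test of theta superscript d-star is the minimum of ten test-set evaluations, which is why it is an optimistically biased estimate of generalization error.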
The upshot is that our hypothesis is likely to do better on this test set than it would on new examples it hasn't seen before, and the latter is what we actually care about. So, just to reiterate: on the previous slide we saw that if we fit some set of parameters, say theta 0, theta 1, and so on, to some training set, then the performance of the fitted model on the training set is not predictive of how well the hypothesis will generalize to new examples. That's because the parameters were fit to the training set, so they are likely to do well on the training set even if they don't do well on other examples. And in the procedure I've just described on this slide, we've done the same thing: specifically, we fit the parameter d to the test set. By having fit that parameter to the test set, the performance of the hypothesis on the test set may not be a fair estimate of how well the hypothesis is likely to do on examples we haven't seen before.

To address this problem in a model selection setting, here is what we usually do instead if we want to evaluate a hypothesis. Given a data set, instead of just splitting it into a training and test set, we're going to split it into three pieces. The first piece is going to be called the training set, as usual. The second piece of the data I'm going to call the cross-validation set, and I'm going to abbreviate cross-validation as CV.
Sometimes it's also called the validation set, instead of the cross-validation set. And the last part I'm going to call my usual test set. A pretty typical ratio for splitting these would be to send 60% of your data to your training set, maybe 20% to your cross-validation set, and 20% to your test set. These numbers can vary a little bit, but this sort of ratio is pretty typical. So our training set will now be only maybe 60% of the data, and our cross-validation set, or our validation set, will have some number of examples, which I'm going to denote m subscript cv, the number of cross-validation examples. Following our earlier notational convention, I'm going to use (x superscript (i) subscript cv, y superscript (i) subscript cv) to denote the i-th cross-validation example. And finally, we also have a test set over here, with m subscript test being the number of test examples.

So, now that we have defined the training, validation (or cross-validation), and test sets, we can also define the training error, cross-validation error, and test error. Here's my training error, which I'm writing as J subscript train of theta. This is pretty much the same thing as the J of theta that we've been writing so far; it's just the error you measure on your training set. Then J subscript cv is my cross-validation error, which is pretty much what you'd expect: it's just like the training error, except measured on the cross-validation data set. And here's my test set error, same as before.
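A minimal sketch of the split and the three errors, again on hypothetical data with names of my own choosing; the 60/20/20 fractions are the ones from the lecture:

```python
import numpy as np

def squared_error(theta, x, y):
    # J(theta) = 1/(2m) * sum over the m examples of (h_theta(x) - y)^2,
    # where h_theta is the polynomial with coefficient vector theta.
    m = len(y)
    return np.sum((np.polyval(theta, x) - y) ** 2) / (2 * m)

# Hypothetical full data set; shuffle before splitting so that each
# piece is representative of the whole.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)
order = rng.permutation(len(x))
x, y = x[order], y[order]

# 60% / 20% / 20% split into training / cross-validation / test sets.
i1, i2 = int(0.6 * len(x)), int(0.8 * len(x))
x_train, y_train = x[:i1], y[:i1]
x_cv, y_cv = x[i1:i2], y[i1:i2]      # m_cv = 20 cross-validation examples
x_test, y_test = x[i2:], y[i2:]      # m_test = 20 test examples
```

With this in place, J train, J cv, and J test are just squared_error evaluated on the corresponding slice of the data.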
So, when faced with a model selection problem like this, instead of using the test set to select the model, we're going to use the validation set, or cross-validation set, to select the model. Concretely, we're going to first take our first hypothesis, this first model, and minimize the cost function; this gives me some parameter vector theta for the linear model, and as before I'm going to use the superscript 1 to denote that these are the parameters for the linear model. We do the same thing for the quadratic model and get some parameter vector theta superscript 2, some parameter vector theta superscript 3, and so on, down to, say, the tenth-order polynomial. But instead of testing these hypotheses on the test set, I'm going to test them on the cross-validation set: I'm going to measure J subscript cv to see how well each of these hypotheses does on my cross-validation set, and then I'm going to pick the hypothesis with the lowest cross-validation error. For this example, let's say for the sake of argument that it was my fourth-order polynomial that had the lowest cross-validation error. In that case, I'm going to pick the fourth-order polynomial model. Finally, what this means is that the parameter d (remember, d was the degree of polynomial: d equals 2, d equals 3, up to d equals 10) has been fit using the cross-validation set: we set d equals 4. And so this degree of polynomial, this parameter, is no longer fit to the test set.
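Putting the procedure together, here is a sketch that continues from the split and cost function above, under the same assumptions about the hypothetical data; np.polyfit again stands in for minimizing the training cost:

```python
# Fit each candidate degree on the training set only.
thetas = {d: np.polyfit(x_train, y_train, deg=d) for d in range(1, 11)}

# Select the degree with the lowest cross-validation error J_cv.
j_cv = {d: squared_error(theta, x_cv, y_cv) for d, theta in thetas.items()}
best_d = min(j_cv, key=j_cv.get)

# Only now touch the test set: because d was chosen without it, J_test
# gives a fair estimate of the selected model's generalization error.
j_test = squared_error(thetas[best_d], x_test, y_test)
print(f"selected d = {best_d}, estimated generalization error = {j_test:.4f}")
```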
And so we've now set aside the test set, and we can use the test set to measure, or to estimate, the generalization error of the model that was selected by this algorithm. So, that was model selection: how you can take your data, split it into a training, validation, and test set, use your cross-validation data to select a model, and evaluate it on the test set.

One final note: I should say that in machine learning practice today, there are many people who will do the earlier thing I talked about and said isn't such a good idea: selecting the degree of polynomial using the test set, and then reporting the error on that same test set as though it were a good estimate of generalization error. Unfortunately, that sort of practice is something many people do. If you have a massive test set, it's maybe not a terrible thing to do, but most practitioners of machine learning tend to advise against it, and it's considered better practice to have separate training, validation, and test sets. I'll just warn you that sometimes people do use the same data for the purpose of the validation set and for the purpose of the test set, so they only have a training set and a test set, and that isn't good practice. You will see some people do it, but if possible I'd recommend against doing it yourself.