Okay. Let's wrap up by talking about two really important tasks when you're doing regression. And through this discussion, it's gonna motivate another important concept of thinking about validation sets.

So, the two important tasks in regression are: first, we need to choose a specific model complexity. So for example, when we're talking about polynomial regression, what's the degree of that polynomial? And then, for our selected model, we assess its performance. And actually, these two steps aren't specific to just regression. We're gonna see this in all different aspects of machine learning, where we have to specify our model and then we need to assess the performance of that model. So, what we're gonna talk about in this portion of this module generalizes well beyond regression.

And for this first task, where we're talking about choosing the specific model, we're gonna talk about it in terms of some set of tuning parameters, lambda, which control the model complexity. Again, for example, lambda might specify the degree of the polynomial in polynomial regression.

So, let's first talk about how we can think about choosing lambda. And then, for a given model specified by lambda, a given model complexity, let's think about how we're gonna assess the performance of that model.

Well, one really naive approach is to do what we've described before, where you take your data set and split it into a training set and a test set. And then, what we're gonna do for our model selection portion, where we're choosing the model complexity lambda, is: for every possible choice of lambda, we're gonna estimate the model parameters associated with that lambda on the training set. And then we're gonna test the performance of that fitted model on the test set. And we're gonna tabulate that for every lambda that we're considering. And we're gonna choose our tuning parameters as the ones that minimize this test error, so the ones that perform best on the test data. And we're gonna call those parameters lambda star.

So, now I have my model. I have my specific degree of polynomial that I'm gonna use. And I wanna go and assess the performance of this specific model.
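As a rough sketch of this naive procedure, here is what it might look like in Python, assuming a synthetic 1-D dataset, squared-error loss, and polynomial degree as the tuning parameter lambda; all of the names, sizes, and candidate degrees here are illustrative, not from the lecture.

```python
# Naive procedure: choose the degree (lambda) that minimizes test error,
# then reuse that same test error as the performance estimate.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)                              # synthetic inputs
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.shape)  # synthetic noisy outputs

# Naive split: training set and test set only.
train_x, test_x = x[:160], x[160:]
train_y, test_y = y[:160], y[160:]

def mse(w, xs, ys):
    # Mean squared error of a polynomial with coefficients w.
    return np.mean((np.polyval(w, xs) - ys) ** 2)

test_error = {}
for degree in range(1, 11):                      # candidate lambdas
    w = np.polyfit(train_x, train_y, degree)     # fit parameters on the training set
    test_error[degree] = mse(w, test_x, test_y)  # tabulate test error per lambda

best_degree = min(test_error, key=test_error.get)   # lambda* chosen on the TEST set
print("chosen degree (lambda*):", best_degree)
print("reported error:", test_error[best_degree])   # reused test error (overly optimistic)
```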
And the way I'm gonna do this is I'm gonna take my test data again. And I'm gonna say, well, okay, I know that test error is an approximation of generalization error. So, I'm just gonna compute the test error for this lambda star fitted model, and I'm gonna use that as my approximation of the performance of this model.

Well, what's the issue with this? Is this gonna perform well? No, it's really overly optimistic. So, this issue is just like what we saw when we weren't dealing with this notion of choosing model complexity. We just assumed that we had a specific model, like a specific degree polynomial, but we wanted to assess the performance of the model. And the naive approach we took there was saying, well, we fit the model to the training data, and then we're gonna use training error to assess the performance of the model. And we said that was overly optimistic because we were double dipping. We had already used the data to fit our model, and so that error was not a good measure of how we're gonna perform on new data.

Well, it's exactly the same notion here, and let's walk through why. More specifically, when we were thinking about choosing our model complexity, we were using our test data to compare between different lambda values. And we chose the lambda value that minimized the error on that test data, the one that performed the best there. So, you could think of this as having fit lambda, this model complexity tuning parameter, on the test data. And now, we're thinking about using test error as a notion of approximating how well we'll do on new data. But the issue is, unless our test data represents everything we might see out there in the world, that's gonna be way too optimistic. Because lambda was chosen, the model was chosen, to do well on the test data, and so that won't generalize well to new observations.

So, what's our solution? Well, we can just create two test data sets. They won't both be called test sets; we're gonna call one of them a validation set. So, we're gonna take our entire data set, just to be clear, and now we're gonna split it into three data sets. One will be our training data set, one will be what we call our validation set, and the other will be our test set.
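A small sketch of how that three-way split might look in code, assuming the rows of the dataset are exchangeable so a random shuffle gives representative subsets; the 80-10-10 proportions used here are just one of the common choices discussed below.

```python
# One way to carve a dataset into training, validation, and test sets.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)                               # synthetic stand-in for the data
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.shape)

idx = rng.permutation(len(x))                             # shuffle before splitting
n_train, n_valid = int(0.8 * len(x)), int(0.1 * len(x))

train_idx = idx[:n_train]
valid_idx = idx[n_train:n_train + n_valid]
test_idx  = idx[n_train + n_valid:]

train_x, train_y = x[train_idx], y[train_idx]
valid_x, valid_y = x[valid_idx], y[valid_idx]
test_x,  test_y  = x[test_idx],  y[test_idx]
```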
And then what we're gonna do is, we're gonna fit our model parameters, always on our training data, for every given model complexity that we're considering. But then we're gonna select our model complexity as the model that performs best on the validation set, the one that has the lowest validation error. And then we're gonna assess the performance of that selected model on the test set. And we're gonna say that that test error is now an approximation of our generalization error. Because that test set was never used in either fitting our parameters, w hat, or selecting our model complexity lambda, that other tuning parameter. So, that data was completely held out, never touched, and it now forms a fair estimate of our generalization error.

So in summary, we're gonna fit our model parameters, for any given complexity, on our training set. Then, for every fitted model and for every model complexity, we're gonna assess the performance and tabulate this on our validation set. And we're gonna use that to select the optimal set of tuning parameters, lambda star. And then, for that resulting model, that w hat sub lambda star, we're gonna assess a notion of the generalization error using our test set.

And so a question is, how can we think about doing the split between our training set, validation set, and test set? And there's no hard and fast rule here; there's no one answer that's the right answer. But typical splits that you see out there are something like an 80-10-10 split. So, 80% of your data for training, 10% for validation, 10% for test. Or another common split is 50%, 25%, 25%. But again, this is assuming that you have enough data to do this type of split and still get reasonable estimates of your model parameters, and reasonable notions of how different model complexities compare, because you have a large enough validation set, and you still have a large enough test set in order to assess the generalization error of the resulting model. And if this isn't the case, we're gonna talk about other methods that allow us to do these same types of things, but without this type of hard division between training, validation, and test.
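Putting the whole procedure together, here is a minimal end-to-end sketch under the same illustrative assumptions as before (synthetic 1-D data, squared error, polynomial degree standing in for the tuning parameter lambda, and an 80-10-10 split); it is meant only to mirror the three steps just summarized.

```python
# Fit on training, select lambda on validation, assess once on test.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.shape)

# 80-10-10 split into training, validation, and test indices.
idx = rng.permutation(len(x))
n_tr, n_va = int(0.8 * len(x)), int(0.1 * len(x))
tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

def mse(w, i):
    # Mean squared error of polynomial coefficients w on the rows indexed by i.
    return np.mean((np.polyval(w, x[i]) - y[i]) ** 2)

# 1) Fit parameters (w hat) on the TRAINING set for every candidate degree (lambda).
fits = {d: np.polyfit(x[tr], y[tr], d) for d in range(1, 11)}

# 2) Select lambda* as the degree with the lowest VALIDATION error.
best_degree = min(fits, key=lambda d: mse(fits[d], va))

# 3) Report the TEST error of the selected model: an approximation of generalization
#    error, since the test set was never used for fitting parameters or choosing lambda.
print("lambda* (degree):", best_degree)
print("estimated generalization error:", mse(fits[best_degree], te))
```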