Okay, so we can't compute generalization error, but we want some better measure of our predictive performance than training error gives us. And this takes us to something called test error, and what test error is going to allow us to do is approximate generalization error.

The way we're gonna do this is by approximating the error using houses that aren't in our training set. So to do that, we have to hold out some houses. Instead of including all these colored houses in our training set, where the colored houses are our entire recorded data set, we're gonna shade out some of them, these shaded gray houses, and we're gonna make these into what's called a test set.

Okay, so here we have houses that are not included in our training set; the training set is the remaining colored houses here. And when we go to fit our models, we're just going to fit our models on the training data set. But then, when we go to assess the performance of that model, we can look at these test houses, and these are hopefully going to serve as a proxy for everything out there in the world. So hopefully our test data set is a good representation of other houses that we might see, or at least good enough to gauge how well a given model is performing.

Okay, so test error is gonna be our average loss computed over the houses in our test data set. Formally, we write it as follows: we have one over N_test, where N_test is the number of houses in our test data set, times the sum of the loss over those test set houses.

But I wanna emphasize, and this is really, really important, that the estimated parameters w hat were fit on the training data set. So even though this function looks very, very much like training error, the sum is over the test houses, but the function we're looking at was fit on training data. Okay, so these parameters in this fitted function never saw the test data.

So just to illustrate this, as in our previous example, we might think of fitting a quadratic function through this data, where we're gonna minimize the residual sum of squares on the training points, those blue circles, to get our estimated parameters w hat.
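To pin down the formula described in words above, here is a minimal sketch of the test-error expression using the lecture's quantities; the specific symbols (y_i for the true price, f with parameters w hat for the fitted prediction function) are notation I'm supplying, not fixed by the transcript:

```latex
% Test error: average loss over the held-out test houses,
% with \hat{w} estimated from the training set only.
\text{TestError}(\hat{w})
  = \frac{1}{N_{\text{test}}}
    \sum_{i \in \text{test set}} L\!\left(y_i,\, f_{\hat{w}}(x_i)\right)

% With squared-error loss, as in the quadratic-fit example:
\text{TestError}(\hat{w})
  = \frac{1}{N_{\text{test}}}
    \sum_{i \in \text{test set}} \left(y_i - f_{\hat{w}}(x_i)\right)^2
```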
Then, when we go to compute our test error, where in this case again we're gonna use squared error as an example, we're computing this error over the test points, all these gray circles here. So test error is one over N_test times the sum of the squared differences between our true house sales prices and our predicted prices, summing over all houses in our test data set. Okay, so this is where the difference arises: this function was fit with the blue circles, but the performance we're assessing is on these gray circles.

Okay, so let's summarize our measures of error as a function of model complexity. And what we saw was that our training error decreased with increasing model complexity. So here, this is our training error.

And in contrast, our generalization error went down for some period of time, but then we started getting to overly complex models that didn't generalize well, and the generalization error started increasing. So here we have generalization error, or true error.

And what is our test error? Well, our test error is a noisy approximation of generalization error. Because if our test data set included everything we might ever see in the world, in proportion to how likely it was to be seen, then that would be exactly our generalization error. But of course, our test data set is just some finite data set, and we're using it to approximate generalization error, so it's gonna be some noisy version of this curve here. So this is our test error.

Okay, so test error is the thing that we can actually compute, and generalization error is the thing that we really want.
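As a concrete companion to this workflow, here is a small runnable sketch of the hold-out procedure: fit a quadratic by least squares on training houses only, then evaluate average squared error on held-out houses. Everything in it (the synthetic data, the variable and function names) is illustrative and assumed, not taken from the course materials:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "house" data: price as a noisy quadratic function of size.
sqft = rng.uniform(500, 4000, size=100)
price = 50_000 + 120 * sqft - 0.01 * sqft**2 + rng.normal(0, 20_000, size=100)

# Hold out some houses: the first 80 form the training set, the rest the test set.
idx = rng.permutation(len(sqft))
train_idx, test_idx = idx[:80], idx[80:]

# Fit a quadratic by minimizing residual sum of squares on TRAINING data only.
w_hat = np.polyfit(sqft[train_idx], price[train_idx], deg=2)

def avg_squared_error(w, x, y):
    """Average squared-error loss between true prices y and predicted prices."""
    return np.mean((y - np.polyval(w, x)) ** 2)

# Training error: average loss over the houses the fit has already seen.
train_err = avg_squared_error(w_hat, sqft[train_idx], price[train_idx])

# Test error: same formula and same w_hat, but averaged over held-out houses
# the fitted parameters never saw -- a noisy proxy for generalization error.
test_err = avg_squared_error(w_hat, sqft[test_idx], price[test_idx])

print(f"training error: {train_err:,.0f}")
print(f"test error:     {test_err:,.0f}")
```

Run as-is, the test error typically comes out somewhat higher than the training error, which is exactly the gap between fitting the data you have and predicting data you haven't seen.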