1
00:00:00,000 --> 00:00:03,387
[MUSIC]

2
00:00:03,387 --> 00:00:08,377
Okay, so the issue we're facing here with
this crazy 13th order polynomial fit

3
00:00:08,377 --> 00:00:10,561
is something called overfitting.

4
00:00:10,561 --> 00:00:15,152
So in particular, what we've done
is we've taken a model and really,

5
00:00:15,152 --> 00:00:18,901
really, really honed in on
our actual observations, but

6
00:00:18,901 --> 00:00:23,134
it doesn't generalize well to
thinking about new predictions.

7
00:00:23,134 --> 00:00:29,334
And so the issues actually go beyond
just making really crazy predictions.

8
00:00:29,334 --> 00:00:33,251
And we're gonna discuss this in a lot
more detail in the regression course.

9
00:00:33,251 --> 00:00:37,372
But I wanna mention that this is a real
problem with any machine learning model or

10
00:00:37,372 --> 00:00:40,150
statistical model that you might consider.

11
00:00:40,150 --> 00:00:44,290
So in these cases,
we wanna fit a model to data, but

12
00:00:44,290 --> 00:00:48,780
we don't want that model to be so
specific to the one data set that

13
00:00:48,780 --> 00:00:52,560
we have that it doesn't generalize
well to new observations we might get.

14
00:00:53,630 --> 00:00:58,690
Okay, so let's go back to this
13th order polynomial fit.

15
00:00:58,690 --> 00:01:02,020
And a question is,
do we actually believe this?

16
00:01:02,020 --> 00:01:04,730
Do we believe that this might
be a reasonable fit to the data?

17
00:01:04,730 --> 00:01:08,350
And I think as I alluded to before,
probably not.

18
00:01:08,350 --> 00:01:12,280
So although it minimizes
the residual sum of squares,

19
00:01:12,280 --> 00:01:15,320
it ends up leading to
very bad predictions.

20
00:01:17,240 --> 00:01:22,167
Because I'm sitting here and thinking,
well, this quadratic fit that we had,

21
00:01:22,167 --> 00:01:27,023
even though it didn't minimize my residual
sum of squares as much as that 13th

22
00:01:27,023 --> 00:01:32,034
order polynomial, it still, my gut feeling
is, somehow this is a better model.

23
00:01:32,034 --> 00:01:35,717
Okay, so a question is,
what's going on here, and

24
00:01:35,717 --> 00:01:40,926
how do we think about choosing the right
model order or model complexity?

25
00:01:40,926 --> 00:01:44,003
Well, what we want,
is we want good predictions.

26
00:01:44,003 --> 00:01:45,610
Of course, that's what we're aiming for.

27
00:01:47,050 --> 00:01:49,020
But we can't actually observe the future.

28
00:01:49,020 --> 00:01:52,250
Right, so we can't actually observe
that prediction that we want to make and

29
00:01:52,250 --> 00:01:55,940
say did we do a good job or
not until we actually go ahead and do it.

30
00:01:55,940 --> 00:01:57,760
So when we're thinking
about choosing our model,

31
00:01:57,760 --> 00:02:00,690
somehow we have to work just
with the data that we have.

32
00:02:00,690 --> 00:02:04,820
So how can we think about trying to
choose a good model in this case?

33
00:02:04,820 --> 00:02:09,460
Well, what we can do is we can
think about simulating predictions.

34
00:02:09,460 --> 00:02:11,170
So we're gonna take our
data set that we have,

35
00:02:11,170 --> 00:02:13,760
and we're gonna remove some of the houses.

36
00:02:13,760 --> 00:02:16,080
So those are the grayed-out houses here.

37
00:02:16,080 --> 00:02:20,909
These are gonna be removed temporarily.

38
00:02:20,909 --> 00:02:24,090
And we're gonna fit our model
on the remaining houses.

39
00:02:24,090 --> 00:02:28,910
So all of these guys we're gonna
use to fit our model using

40
00:02:28,910 --> 00:02:33,738
exactly the kind of methods
that we talked about before.

41
00:02:33,738 --> 00:02:37,406
And then what we're gonna
do is we're gonna predict.

42
00:02:37,406 --> 00:02:41,820
So I'll go through and erase these
x's now and put question marks.

43
00:02:43,270 --> 00:02:48,672
And say from the model that I just
learned on the circled houses,

44
00:02:48,672 --> 00:02:52,984
what values do I predict for
these question marks?

45
00:02:52,984 --> 00:02:55,847
And then I can compare to
the actual observed values,

46
00:02:55,847 --> 00:02:58,020
because these houses are in my data set.

47
00:02:59,610 --> 00:03:04,400
Okay, so I can use this as a proxy for
doing the types of

48
00:03:04,400 --> 00:03:07,830
real predictions that I wanna do on
data that I haven't yet collected.

49
00:03:09,020 --> 00:03:12,640
Of course, this type of method is only
gonna work well if I have enough

50
00:03:12,640 --> 00:03:20,270
observations to think about fitting
on versus testing my predictions on.

51
00:03:20,270 --> 00:03:23,620
Okay, so let's introduce
a little bit of terminology.

52
00:03:23,620 --> 00:03:28,340
Well, the houses that we use to fit our
model, we call that the training set.

53
00:03:28,340 --> 00:03:30,370
And the houses that we're
using as a proxy for

54
00:03:30,370 --> 00:03:33,840
our predictions, those that we're
holding out, we call the test set.
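
As a rough illustration of the split just described, here is a minimal sketch that holds out a random subset of houses as the test set and keeps the rest as the training set. The function name `train_test_split`, the 20% test fraction, and the house data are all assumptions made up for this example, not something taken from the lecture.

```python
import numpy as np

def train_test_split(x, y, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the observations as a test set."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))              # shuffled row indices
    n_test = int(round(test_fraction * len(x)))  # number of held-out houses
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]

# Hypothetical data: square feet and sale price for ten houses.
sqft = np.array([850, 1100, 1300, 1500, 1750, 2000, 2300, 2600, 3000, 3500])
price = np.array([190, 240, 265, 310, 340, 400, 430, 500, 560, 640]) * 1000.0

x_train, y_train, x_test, y_test = train_test_split(sqft, price)
print(len(x_train), "training houses,", len(x_test), "test houses")
```

Shuffling before splitting is one common choice; it avoids holding out, say, only the largest houses if the data happens to be sorted.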

55
00:03:35,260 --> 00:03:39,790
Okay, so let's dig a little bit more
into how we're gonna do this analysis.

56
00:03:40,920 --> 00:03:45,180
And the first thing that we can do is look
at something called the training error.

57
00:03:45,180 --> 00:03:48,550
So we're gonna examine every
house in our training data set.

58
00:03:48,550 --> 00:03:51,690
So let's look at this red color here.

59
00:03:51,690 --> 00:03:57,850
So all of our training
houses are represented

60
00:03:57,850 --> 00:04:02,720
with these blue circles here,
and these are the only houses

61
00:04:02,720 --> 00:04:06,640
we're gonna look at when we're thinking
about defining our training error.

62
00:04:06,640 --> 00:04:07,880
So in particular,

63
00:04:07,880 --> 00:04:12,080
we're gonna look at what are the errors
that we make on these houses?

64
00:04:12,080 --> 00:04:16,741
So this is just the residual sum of
squares on the houses in our training

65
00:04:16,741 --> 00:04:20,139
data set, and
that's called the training error.

66
00:04:20,139 --> 00:04:24,399
So in particular, the training error
looks exactly like what we had for

67
00:04:24,399 --> 00:04:27,168
our residual sum of squares calculation,
but

68
00:04:27,168 --> 00:04:31,714
we're only including the houses that
are present in our training data set.
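
Written out, the training error described here could look like the following. The notation is an assumption for illustration: $f_w(x_i)$ stands for the model's prediction for house $i$ under parameters $w$, and $y_i$ for its observed sale price.

```latex
\text{Training error}(w) = \sum_{i \,\in\, \text{training set}} \bigl( y_i - f_w(x_i) \bigr)^2
```

This is the same expression as the residual sum of squares, with the sum restricted to the training houses.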

69
00:04:31,714 --> 00:04:36,071
Okay, so then for any given model,
such as a linear fit through the data,

70
00:04:36,071 --> 00:04:40,849
quadratic fit, or so on, what we can do is
we can think about estimating our model

71
00:04:40,849 --> 00:04:44,300
parameters as those that
minimize the training error.

72
00:04:44,300 --> 00:04:47,740
So that's equivalent to what we talked
about before of minimizing the residual

73
00:04:47,740 --> 00:04:48,880
sum of squares.

74
00:04:48,880 --> 00:04:52,440
But again, here we're only looking at
the houses in our training data set.

75
00:04:53,770 --> 00:04:58,830
Okay, so then that's how
we get our estimated w hat,

76
00:04:58,830 --> 00:05:01,800
our estimated model parameters.
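
A minimal sketch of this estimation step, assuming polynomial models fit by least squares with NumPy's `np.polyfit`. The helper names `fit_polynomial` and `rss` and the training data are hypothetical, chosen only for this example.

```python
import numpy as np

def fit_polynomial(x_train, y_train, degree):
    """Least-squares polynomial fit: the coefficients that minimize training RSS."""
    return np.polyfit(x_train, y_train, degree)

def rss(w, x, y):
    """Residual sum of squares of the polynomial with coefficients w on (x, y)."""
    residuals = y - np.polyval(w, x)
    return float(np.sum(residuals ** 2))

# Hypothetical training houses (square feet in thousands, price in $1000s).
x_train = np.array([0.85, 1.1, 1.5, 1.75, 2.0, 2.6, 3.0, 3.5])
y_train = np.array([190.0, 240.0, 310.0, 340.0, 400.0, 500.0, 560.0, 640.0])

w_hat_linear = fit_polynomial(x_train, y_train, degree=1)
w_hat_quadratic = fit_polynomial(x_train, y_train, degree=2)

print("training error, linear fit:   ", rss(w_hat_linear, x_train, y_train))
print("training error, quadratic fit:", rss(w_hat_quadratic, x_train, y_train))
```

Because the quadratic model contains the linear one as a special case, its training error can never be larger, which is why training error alone tends to favor the more complex model.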

77
00:05:01,800 --> 00:05:05,000
But then what we wanna do is we wanna take
these estimated model parameters, and

78
00:05:05,000 --> 00:05:06,800
we wanna say how good are we doing?

79
00:05:06,800 --> 00:05:08,660
And remember what we said,
what we're gonna do,

80
00:05:08,660 --> 00:05:12,120
is we're gonna look at our
held out observations, okay?

81
00:05:12,120 --> 00:05:20,680
So here, these gray circles
are the houses that are in our test set.

82
00:05:20,680 --> 00:05:23,380
So these are houses that were
not used to fit this model.

83
00:05:24,840 --> 00:05:29,412
And we're gonna say how well are we
predicting these actual house sales?

84
00:05:29,412 --> 00:05:31,860
Okay, and so what were our predictions?

85
00:05:31,860 --> 00:05:34,530
Well, remember when we thought
about making a prediction,

86
00:05:34,530 --> 00:05:39,930
we just used the value of the fit,
so just the point on the line.

87
00:05:40,930 --> 00:05:44,670
So to assess how well we're predicting
these held-out observations,

88
00:05:44,670 --> 00:05:48,610
our test data,
we're gonna look at something that, again,

89
00:05:48,610 --> 00:05:51,310
looks exactly like
residual sum of squares.

90
00:05:51,310 --> 00:05:53,440
But it's called our test error,

91
00:05:53,440 --> 00:05:58,140
where we take these estimated
model parameters w hat, and we compute

92
00:05:58,140 --> 00:06:03,220
the residual sum of squares over all
houses that are in our test data set.

93
00:06:04,520 --> 00:06:06,930
Okay, so that's our test error.
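
To make the distinction concrete, here is a small sketch that fits on the training houses only and then evaluates the same residual sum of squares on the held-out houses. The data, the quadratic model, and the helper name `rss` are assumptions for illustration.

```python
import numpy as np

def rss(w, x, y):
    """Residual sum of squares of the polynomial with coefficients w on (x, y)."""
    return float(np.sum((y - np.polyval(w, x)) ** 2))

# Hypothetical houses (square feet in thousands, price in $1000s).
x_train = np.array([0.85, 1.1, 1.5, 1.75, 2.0, 2.6, 3.0, 3.5])
y_train = np.array([190.0, 240.0, 310.0, 340.0, 400.0, 500.0, 560.0, 640.0])
x_test = np.array([1.3, 2.3])        # held out: never used for fitting
y_test = np.array([265.0, 430.0])

# Estimate w hat from the training set only (here, a quadratic fit).
w_hat = np.polyfit(x_train, y_train, 2)

training_error = rss(w_hat, x_train, y_train)  # sum over training houses
test_error = rss(w_hat, x_test, y_test)        # sum over test houses

print("training error:", training_error)
print("test error:    ", test_error)
```

The parameters are fixed before the test houses are ever looked at, so the test error acts as a stand-in for prediction error on houses not yet observed.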

94
00:06:06,930 --> 00:06:11,612
But what we can think about is we can
think about how do test error and

95
00:06:11,612 --> 00:06:15,666
training error vary as a function
of the model complexity?

96
00:06:15,666 --> 00:06:19,249
[MUSIC]