So, instead of using training error to assess our predictive performance, what we'd really like to analyze is something called generalization error, or true error.

In particular, we really want an estimate of the loss averaged over all houses that we might ever see in our neighborhood. But our dataset only has a few examples of houses that were sold; there are lots of other houses in our neighborhood that aren't in our dataset, or other houses that you might imagine having been sold.

Okay, so to compute this estimate over all houses we might see, we'd like to weight each house pair, the pair of house attributes and house sale price, by how likely that pair is to occur.

To do this, we can think about defining a distribution, in this case over square feet of houses in our neighborhood. What this little cartoon is trying to show is a distribution over the real line of square feet. You can think of it as a really dense histogram, counting how many houses we might see with a given square footage, for every possible square footage value. What this picture is showing is a distribution that says we're very unlikely to see houses with a very low number of square feet, very small houses, and we're also very unlikely to see really, really massive houses. So there's some bell curve to this, some sweet spot of typical houses in our neighborhood, and then the likelihood drops off from there.

Likewise, we can define a distribution that says, for a given square footage of a house, what is the distribution over the sale price of that house? So let's say the house has 2,640 square feet. Maybe I expect the range of house prices to be somewhere between $680,000 and maybe $950,000; that might be a typical range. But of course, you might see much lower-valued or higher-valued houses, depending on the quality of that house, and that's what this distribution here is representing.

Okay, so formally, when we go to define our generalization error, we're saying that we're taking the average value of our loss, weighted by how likely those (square feet, price) pairs are to occur.
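Written out in symbols (just restating the idea above, with L the loss, f_ŵ the fitted model, and p the true distributions over square feet and over price given square feet), the generalization error is an expectation of the loss; in principle it could be approximated by averaging the loss over samples drawn from those distributions:

    \text{generalization error}
        = E_{x,y}\!\left[ L\!\left(y, f_{\hat{w}}(x)\right) \right]
        = \int L\!\left(y, f_{\hat{w}}(x)\right)\, p(x)\, p(y \mid x)\, dy\, dx
        \approx \frac{1}{M} \sum_{m=1}^{M} L\!\left(y_m, f_{\hat{w}}(x_m)\right),
      \qquad (x_m, y_m) \sim p(x)\, p(y \mid x).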
So specifically, we estimate our model parameters on our training dataset, and that's what gives us ŵ (w hat). That defines the model we're using for prediction. Then we have our loss function, assessing the cost of predicting f_ŵ at a given square footage x when the true value was y. And then we average over all possible (x, y) pairs, weighted by how likely they are according to those distributions over square feet and over value given square feet.

Okay, so let's go back to these plots of error versus model complexity, but in this case let's quantify our generalization error as a function of this complexity. To do this, what I'm showing with this shaded blue region, with a gradation going from white to darker blue, is the distribution of houses I'm likely to see. This white region (well, it's not quite white anymore, but hopefully you can still see it) contains the houses I'm very, very likely to see, and as I move further away from it I get to less and less likely house sale prices given a specific square footage.

So when I think about generalization error, I take my fitted function, and remember this green line was fit on the training data, these blue circles. Then I ask: how well does it predict houses in this shaded blue region, weighted by how likely they are, by how close they are to that white region? If you imagine this in 3D, there are distributions popping up off of this shaded grey and shaded blue area. Maybe I can try to draw it. Maybe the distribution at a given square footage... okay, that doesn't look good at all, let me try it again. It looks something like this: the distribution of prices for houses with x_t square feet. So when I think about how well my prediction is doing at x_t, this x here, I'm looking at the difference between my prediction and all points along this line, weighted by how likely they are in the general population of houses I might see. And then I do that across this entire region of possible square feet.
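To make that weighted average concrete, here is a minimal Python sketch. The distributions over square feet and over price given square feet are made up for illustration (the lecture never specifies them), squared error stands in for the loss L, and a simple linear fit stands in for the fitted function; the point is only that averaging over samples from the distribution automatically weights each (square feet, price) pair by how likely it is.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed distributions, for illustration only: p(sqft) is bell-shaped
    # around a typical house, and p(price | sqft) scatters around an assumed
    # true relationship.
    def sample_houses(n):
        sqft = rng.normal(2500, 600, size=n)
        price = 250 * sqft + 150000 + rng.normal(0, 105000, size=n)
        return sqft, price

    # A fitted model f_w_hat: a simple linear fit learned from a small
    # training sample (standing in for the green line fit to the blue circles).
    sqft_train, price_train = sample_houses(20)
    w_hat = np.polyfit(sqft_train, price_train, 1)

    # Monte Carlo version of the generalization error: average the squared
    # loss over many (sqft, price) pairs drawn from the assumed distributions,
    # so likely pairs contribute more than unlikely ones.
    sqft_all, price_all = sample_houses(500000)
    gen_error = np.mean((np.polyval(w_hat, sqft_all) - price_all) ** 2)
    print(gen_error)

With a sampler like this, the constant, linear, quadratic, and higher-order fits discussed next can all be scored in exactly the same way.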
Okay, so what I see here is that this constant model really doesn't approximate things well, except maybe in this region here, so overall it has a reasonably high generalization error. Then I can go to my more complex model, just fitting a line through the data, and I see I have better performance, but it's still not doing great in these regions. So my generalization error dropped a bit. When I get to this higher-complexity quadratic fit, things are starting to look a bit better, maybe not great out in these regions here, so again the generalization error drops.

Then I get to this much higher-order polynomial. When we were looking at training error, its training error was lower, right? But now, when we think about generalization error, we actually see that the generalization error is going to go up relative to the simpler model, because if we look at this region here, it's doing really horribly. So we might get a generalization error that's actually larger than the quadratic's. Then we can fit an even higher-order polynomial, and we get this really, really crazy fit that's doing horribly basically everywhere, except maybe in these very, very small regions where it's doing okay. So in this case we get dramatically bad generalization error.

Okay, so this is starting to match a lot more of our intuition about what might be a good fit to this data. So let's think about drawing the curve over all possible model complexities, now that we've fit these few specific points. Our generalization error in general will have some shape where it goes down, and then we get to a point where the error starts increasing. (Sorry, that should have been a smoother curve.) The error starts increasing because we're getting to these overly complex models that fit the training data really well but don't generalize to other houses that we might see.

But importantly, in contrast to training error, we can't actually compute generalization error, because everything was relative to this true distribution, the true way in which the world works: how likely houses are to appear, over all possible square feet and all possible house values. And of course, we don't know what that is. So this is our ideal picture, our cartoon of what would happen, but we can't actually go along and compute these different points.
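As a rough numerical counterpart to this cartoon, here is a sketch that sweeps over polynomial degree under the same made-up distributions as before. Training error keeps shrinking as complexity grows, while the estimate of generalization error (computed on a large fresh sample, which we can only do here because the distribution is assumed known, not in the real-world setting just described) typically falls at first and then blows up for the high-degree fits.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_houses(n):
        # Assumed distributions, for illustration only (not from the lecture).
        sqft = rng.normal(2500, 600, size=n)
        price = 250 * sqft + 150000 + rng.normal(0, 105000, size=n)
        # Rescale square feet so high-degree polyfit stays numerically stable.
        return (sqft - 2500) / 600, price

    def mse(w, x, y):
        # Squared-error loss averaged over the given houses.
        return np.mean((np.polyval(w, x) - y) ** 2)

    x_train, y_train = sample_houses(15)     # small training set (the blue circles)
    x_all, y_all = sample_houses(200000)     # stand-in for "all houses we might see"

    for degree in [0, 1, 2, 5, 10]:
        w_hat = np.polyfit(x_train, y_train, degree)
        print(f"degree {degree:2d}: "
              f"train {mse(w_hat, x_train, y_train):.2e}  "
              f"generalization (est.) {mse(w_hat, x_all, y_all):.2e}")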