Okay. Well, now let's turn to this third component, which is variance. What variance is gonna say is: how different can my specific fits to a given data set be from one another, as I'm looking at different possible data sets?

And in this case, when we're looking at just this constant model, we showed with that earlier picture, where I drew points that were mainly above the true relationship and points mainly below, that the actual resulting fits didn't vary very much. And when you look at the space of all possible observations, you see that the fits are fairly similar, fairly stable. So when you look at the variation in these fits, which I'm drawing with these grey bars here, we see that they don't vary very much. So, for this low-complexity model, we see that there's low variance.

To summarize, what this variance is saying is: how much can the fits vary? If they could vary dramatically from one data set to another, then you would have very erratic predictions. Your prediction would be sensitive to which data set you got, so that would be a source of error in your predictions.

To see this, we can start looking at high-complexity models. In particular, let's look at this data set again, and now let's fit some high-order polynomial to it. So, that's the fit shown here. Now let's take this same data set, but choose two points, which I'm gonna highlight as these pink circles, and just move them a little bit. So, out of this whole data set, I've moved just two observations, and not too dramatically, but I get a dramatically different fit.

So then, when I think about looking over all possible data sets I might get, I might get some crazy set of curves. There is an average curve, and in this case the average curve is actually pretty well behaved, because this wild, wiggly curve is, at any point, equally likely to have been wild above or wild below. So, on average over all data sets, it's actually a fairly smooth, reasonable curve. But if I look at the variation between these fits, it's really large. So, what we're saying is that high-complexity models have high variance.
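[Editor's note: to make this concrete, here is a minimal simulation sketch, not from the lecture. It assumes a synthetic true function and Gaussian noise, refits a constant model and a degree-9 polynomial to many simulated data sets, and measures how much the fitted curves vary across data sets.]

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "true" relationship, standing in for the lecture's
    # square-feet-vs-house-value curve.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)
x_grid = np.linspace(0, 1, 100)
n_datasets, noise_sd = 200, 0.3

fits = {0: [], 9: []}  # degree 0 = constant model, degree 9 = high-order polynomial
for _ in range(n_datasets):
    y = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
    for deg in fits:
        coefs = np.polyfit(x_train, y, deg)
        fits[deg].append(np.polyval(coefs, x_grid))

for deg, curves in fits.items():
    curves = np.array(curves)
    # Variance of the fitted curve across data sets, averaged over the x grid:
    # low for the constant model, much higher for the flexible polynomial.
    print(f"degree {deg}: mean variance of fit = {curves.var(axis=0).mean():.4f}")
```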
On the other hand, if I look at the bias of this model, so here again I'm showing this average fit, which was this fairly well-behaved curve, it matched pretty well to the true relationship between square feet and house value, because my model is really flexible. So on average, it was able to fit that true relationship pretty precisely. So, these high-complexity models have low bias.

So, we can now talk about this bias-variance tradeoff. In particular, we're gonna plot bias and variance as a function of model complexity. What we saw in the past slides is that as our model complexity increases, our bias decreases, because we can better and better approximate the true relationship between x and y. So, this curve here is our bias curve. On the other hand, variance increases: our very simple model had very low variance, and the high-complexity models had high variance. So, this is a picture of our variance.

And so, what we see is that there's this natural tradeoff between bias and variance. One way to summarize this is something that's called mean squared error. If you watch the optional videos that go into all these concepts in more depth, you'll hear a lot more about mean squared error, including a formal definition and the derivation of this. But mean squared error is simply the sum of bias squared plus variance:

mean squared error = bias^2 + variance

Okay, I guess I'll write out variance to be very clear. So, this is my little cartoon of bias squared plus variance; this is my mean squared error curve.

And machine learning is all about this tradeoff between bias and variance. We're gonna see this again and again in this course, and we're gonna see it throughout the specialization. And the goal is finding this sweet spot. This is the sweet spot where we get our minimum error, the minimum contribution of bias and variance to our prediction errors. So, not sweet, sweet. It is sweet, sweet, but what I'm trying to write is "sweet spot". And this is what we'd love to get at; that's the model complexity that we'd want.
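[Editor's note: the decomposition can be checked numerically. Here is a minimal sketch, not from the lecture, again assuming a synthetic true function so that bias and variance are computable. It estimates bias squared and variance of the prediction at one input, across many simulated data sets, and confirms that they sum to the mean squared error of the fit (the irreducible noise term is excluded, as in the lecture's formula).]

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    # Stand-in "true" relationship; in a real problem this is unknown.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)
x0 = 0.25                      # input where we evaluate the decomposition
n_datasets, noise_sd = 2000, 0.3

for deg in (0, 3, 9):
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        y = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
        preds[i] = np.polyval(np.polyfit(x_train, y, deg), x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2   # squared gap to the truth, on average
    variance = preds.var()                       # spread of fits across data sets
    mse = np.mean((preds - true_f(x0)) ** 2)
    print(f"degree {deg}: bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"bias^2+variance={bias_sq + variance:.4f}  mse={mse:.4f}")
```

As complexity grows from degree 0 to degree 9, bias squared falls while variance rises, and their sum traces out the U-shaped mean squared error curve from the slide.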
But just like with generalization error, and I'm gonna write this down, can we compute this? Think about that while I'm writing. We cannot compute bias, variance, and thus mean squared error. And why? Well, the reason is that, just like with generalization error, they were defined in terms of the true function. Bias was defined very explicitly in terms of the relationship relative to the true function. And when we think about defining variance, we have to average over all possible data sets of size n that we could have gotten from the world, and the same was true for bias too. And we just don't know what those data sets are. So, we can't compute these things exactly. But throughout the rest of this course, we're gonna look at ways to optimize this tradeoff between bias and variance in a practical way.
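[Editor's note: the course develops the practical tools later; as a forward-looking sketch only, one standard proxy is to hold out part of the single data set we do have and use error on that held-out portion to pick the model complexity, since the true function and all other possible data sets are unavailable.]

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(2 * np.pi * x)  # used here only to generate data; unknown in practice

# One observed data set: in practice this is all we ever have.
x = rng.uniform(0, 1, 60)
y = true_f(x) + rng.normal(0, 0.3, x.size)

# Hold out part of the data as a validation set to stand in for unseen data.
idx = rng.permutation(x.size)
train, valid = idx[:40], idx[40:]

for deg in range(10):
    coefs = np.polyfit(x[train], y[train], deg)
    err = np.mean((y[valid] - np.polyval(coefs, x[valid])) ** 2)
    print(f"degree {deg}: validation MSE = {err:.4f}")
# The degree with the lowest validation error approximates the "sweet spot",
# without ever computing bias or variance directly.
```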