So in modules one and two we described how to fit different models, and in module two in particular we described how to fit very complex models. But up through our third module we had no idea how to assess whether that fitted model was going to perform well on our prediction tasks. So in module three, our emphasis was on assessing the performance of our fitted model and thinking about how we can select between different models to get good predictive performance.

The first notion we introduced in order to measure how well our fit was performing was the notion of loss. This is a kind of negative measure of performance, where we want to lose as little as possible from making poor predictions, under the assumption that our predictions are not perfect. And we discussed two loss metrics that are very commonly used: absolute error and squared error.

Then, with this loss function in hand, we talked about defining three different measures of error. The first was our training error, which we said was not a good assessment of the predictive performance of our model. Then we defined something called our generalization, or true, error, which is what we really want: we want to say how well we are predicting every possible observation that we might see out there. And we said, okay, we can't actually compute that, so we defined something called our test error, which looks at the subset of our data that was not included in the training set, takes the model that was fit on the training data set, and makes predictions on those held-out points. And we said that test error is a noisy approximation to our generalization error.

For these three measures of error, we talked about how they vary as a function of model complexity. Training error, we know, goes down with increasing model complexity, but that does not indicate that we get better and better predictions as we increase model complexity. In contrast, if we look at generalization error, the true error, it tends to increase after a certain point. We say that that point is where the models start to become overfit.
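To make that contrast concrete, here is a minimal sketch, with an assumed synthetic data set and a simple polynomial model (none of this is the course's own code), of how training error and test error under squared-error loss behave as model complexity grows:

```python
import numpy as np

# Hypothetical illustration (assumed data and model, not from the course):
# fit polynomial models of increasing complexity and compare training error
# vs. test error under squared-error loss.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)  # noisy "true" relationship

# Hold out a subset of the data as a test set; fit only on the training set.
x_train, x_test = x[:70], x[70:]
y_train, y_test = y[:70], y[70:]

def squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

for degree in (1, 3, 6, 12):                        # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on training data only
    train_err = squared_error(y_train, np.polyval(coeffs, x_train))
    test_err = squared_error(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree:2d}: training error {train_err:.3f}, test error {test_err:.3f}")
```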
Overfit models perform very well on the training data set, but they don't generalize well to new data that we have not yet seen. And again, although we discussed this in the context of regression, this notion of training, test, and generalization error, and how they vary with model complexity, is a much more general concept that we'll see again in the specialization.

We then characterized three different sources that contribute to our prediction error. The first is the noise that is inherent in the data: this is our irreducible error, we have no control over it, and it has nothing to do with our model or our estimation procedure. Then we talked about the ideas of bias and variance. We described bias as how well our model can fit the true relationship, averaging over all possible training data sets that we might see. Variance, in contrast, describes how much a fitted function can vary from training data set to training data set, each of size N observations.

So of course noise in the data can contribute to our errors in prediction, but if our model can't adequately describe the true relationship, that is also a source of error, as is this variability from training set to training set.

Naturally we want low bias and low variance to have good predictive performance, but we saw that there is a bias-variance tradeoff: as you increase model complexity, your bias goes down, but your variance goes up. So there is a sweet spot that trades off between bias and variance and results in the lowest mean squared error, and that is what we are seeking to find. And as we've said multiple times, machine learning is all about exploring this bias-variance tradeoff.

We then concluded this module by asking: how are we going to both select our model and assess its performance? For this we said, well, we need to form something called a validation set. So we fit our models on the training data set, we select between different models, or select a tuning parameter describing these different models, on our validation set, and then we test the performance on our test set, which we never touch during fitting or selection.
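As a rough illustration of that workflow, here is a sketch under the same kinds of assumptions (synthetic data and polynomial models chosen only for the example, not taken from the course): fit on the training set, select the complexity on the validation set, and touch the test set only once at the end.

```python
import numpy as np

# Hypothetical sketch of a train/validation/test split: select model
# complexity on the validation set, assess the chosen model on the test set.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

# Three-way split: fit on train, select on validation, report error on test.
x_train, x_valid, x_test = x[:120], x[120:160], x[160:]
y_train, y_valid, y_test = y[:120], y[120:160], y[160:]

def squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

best_degree, best_valid_err = None, np.inf
for degree in range(1, 13):                        # candidate model complexities
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on training data only
    valid_err = squared_error(y_valid, np.polyval(coeffs, x_valid))
    if valid_err < best_valid_err:
        best_degree, best_valid_err = degree, valid_err

# Only now touch the test set, once, to assess the selected model.
coeffs = np.polyfit(x_train, y_train, best_degree)
print("selected degree:", best_degree,
      "test error:", squared_error(y_test, np.polyval(coeffs, x_test)))
```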
We also talked about how, if you don't have enough data to form this validation set, you can think about doing cross-validation instead, something we'll return to in later modules.

Then, in our fourth module, we talked about ridge regression. Remember that as our models become more and more complex, we can overfit, and what we saw is that a symptom of overfitting is that the magnitudes of our estimated coefficients explode. So what ridge regression does is trade off between a measure of how well our function fits the training data and a measure of the magnitude of the coefficients. Implicitly, by balancing these two terms, we're making a bias-variance tradeoff.

In particular, we saw that our ridge regression objective seeks to minimize the residual sum of squares plus lambda times the squared L2 norm of our coefficients, and we talked about what the coefficient path of the ridge solution looks like as we vary this tuning parameter lambda, the penalty strength on this L2 norm term. We saw that as you increase this penalty parameter, the magnitudes of our coefficients become smaller and smaller.

Then, for our ridge objective, just as we did for our standard least squares objective, we computed the gradient and set it equal to zero to get a closed-form solution, and this looks very similar to the solution we had before, except with one additional term. What we discussed in this module is the fact that adding this lambda times the identity matrix allows us to have a solution even when the number of features is larger than the number of observations, and it allows for a much more, quote unquote, regularized solution. That's why it's called a regularized regression technique. But the complexity of the solution is exactly the same as for least squares: cubic in the number of features. We also talked about a gradient descent implementation of ridge.

As we saw, a key question in what solution we get out of ridge is how we determine this lambda penalty strength. For this, instead of talking about cutting out a validation set to select the tuning parameter, we talked about cases where you might not have enough data to do that, and instead described the cross-validation procedure.
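To make the ridge discussion concrete, here is a small sketch of the closed-form solution w = (H^T H + lambda I)^(-1) H^T y, assuming a hypothetical feature matrix H and response y (an illustration, not the course's own code); it also shows the coefficient magnitudes shrinking as the penalty strength grows:

```python
import numpy as np

# Hypothetical sketch of the ridge closed-form solution (assumed feature
# matrix H and response y; not the course's own code):
#   w_ridge = (H^T H + lambda * I)^(-1) H^T y
rng = np.random.default_rng(2)
n, d = 30, 40                                   # more features than observations
H = rng.normal(size=(n, d))
y = H @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)

def ridge_fit(H, y, lam):
    # Adding lambda * I makes H^T H + lambda * I invertible even when d > n.
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

for lam in (0.01, 0.1, 1.0, 10.0):
    w = ridge_fit(H, y, lam)
    print(f"lambda={lam:5.2f}  ||w_ridge||_2 = {np.linalg.norm(w):.3f}")
```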
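And here is a similarly hypothetical sketch of k-fold cross-validation for choosing the penalty strength lambda, again with assumed data and helper names rather than the course's implementation:

```python
import numpy as np

# Hypothetical k-fold cross-validation sketch for selecting the ridge penalty
# strength lambda (assumed data and helper names; not the course's own code).
rng = np.random.default_rng(3)
n, d = 60, 8
H = rng.normal(size=(n, d))
y = H @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)

def ridge_fit(H, y, lam):
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def cv_error(H, y, lam, k=5):
    folds = np.array_split(np.arange(H.shape[0]), k)
    errs = []
    for valid_idx in folds:
        train_idx = np.setdiff1d(np.arange(H.shape[0]), valid_idx)
        w = ridge_fit(H[train_idx], y[train_idx], lam)   # fit on k-1 folds
        errs.append(np.mean((y[valid_idx] - H[valid_idx] @ w) ** 2))
    return np.mean(errs)                                 # average validation error

lambdas = np.logspace(-3, 2, 12)
best_lam = min(lambdas, key=lambda lam: cv_error(H, y, lam))
print("lambda selected by cross-validation:", best_lam)
```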