[MUSIC]

So, the first measure of error of our predictions that we can look at is something called training error. We discussed this at a high level in the first course of the specialization, but now let's go through it in a little bit more detail.

To define training error, we first have to define training data. Typically you have some dataset, which I've shown as these blue circles here, and we're going to choose our training dataset as just some subset of these points. The greyed circles are the ones that are not included in the training set; the blue circles are the ones that we're keeping in the training set.

Then we take our training data and, as we've discussed in previous modules of this course, we use it to fit our model, that is, to estimate our model parameters. Just as an example, with this dataset here, maybe we choose to fit some quadratic function to the data, and like we've talked about, in order to fit this quadratic function we're going to minimize the residual sum of squares on these training data points.

So, now we have our estimated model parameters, w hat, and we want to assess the training error of that estimated model. The way we do that is to first define some loss function. Maybe we look at squared error, or absolute error, any one of the many possibilities for our loss function. Then training error is defined simply as the average loss over the training points. Mathematically, this is simply 1 over N, where N is the total number of observations in my training set, times the sum of the loss over each one of those training observations.

And just to be very clear: remember that the estimated parameters were estimated on the training set. They were chosen by minimizing the residual sum of squares on the very same training points that we're now using to define this training error.

We can go through this pictorially in the following example, where we're specifically looking at using squared error as our loss function. In this case, our training error is simply 1 over N times the sum of the squared differences between our actual house sales prices and our predicted house sales prices, where that sum is taken over all houses in our training dataset.
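To make that definition concrete, here is a minimal sketch in Python. The data, the use of NumPy's polyfit as the fitting routine, and all the numbers are hypothetical stand-ins rather than anything from the lecture; the point is just the formula: training error = (1/N) times the sum of the loss over the N training observations.

```python
import numpy as np

# Hypothetical training data (not from the lecture):
# square footage vs. sale price in dollars.
x_train = np.array([1000.0, 1500.0, 1800.0, 2200.0, 2600.0, 3000.0])
y_train = np.array([250e3, 310e3, 360e3, 420e3, 480e3, 510e3])

# Fit a quadratic by minimizing residual sum of squares
# (np.polyfit performs exactly this least-squares fit).
w_hat = np.polyfit(x_train, y_train, deg=2)

# Predict on the *same* training points the model was fit to.
y_pred = np.polyval(w_hat, x_train)

# Training error = (1/N) * sum of squared-error losses.
training_error = np.mean((y_train - y_pred) ** 2)
print(f"training error (average squared loss): {training_error:.3e}")
```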
And what we see is that in this case, where we choose squared error as our loss function, the form of training error is exactly 1 over N times our residual sum of squares. I want to note here that there's some difference in the convention people use, namely whether the 1 over N is part of the definition of training error or not. So just be aware of that when you're computing training error and reporting these numbers. Here we're defining it as the average loss.

More formally, we can write our training error as follows, and then we can define something that's commonly referred to simply as RMSE, whose full name is root mean square error. RMSE is simply the square root of our average loss on the training houses, so the square root of our training error.

The reason one might consider looking at root mean square error is that its units, in this case, are just dollars, whereas the units of our training error were dollars squared. Remember, we're taking the squares of all these differences in dollars, so the result is in dollars squared. That's a little bit less intuitive as an error metric than an error in terms of dollars themselves.

Now that we've defined training error, we can look at how training error behaves as model complexity increases. To start with, let's look at the simplest possible model you might fit, which is just a constant model. This is the simplest model we're going to consider, and you see that there is pretty significant training error. Let's just say it has some value here; this is the training error of the constant model.

Then let's say I fit a linear model. Well, a line. These are all linear models we're looking at, since it's linear regression, but here I mean just fitting a line to the data. And you see that my training error has gone down to some other value, which I'm showing with this pink circle here. Then I fit a quadratic function, and again training error goes down. And what I see is that as I increase my model complexity, to maybe some higher-order polynomial, I have very low training error, just this one pink bar here.
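Here is a small sketch of that unit argument, again with made-up prices rather than the lecture's data, showing that RMSE is just the square root of the average squared loss:

```python
import numpy as np

def training_error(y_true, y_pred):
    # Average squared-error loss: (1/N) * RSS. Units are dollars squared.
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # Square root of the training error. Units are plain dollars,
    # which is easier to interpret as a prediction error.
    return np.sqrt(training_error(y_true, y_pred))

# Tiny illustration with hypothetical prices (in dollars):
y_true = np.array([250e3, 310e3, 360e3])
y_pred = np.array([260e3, 300e3, 355e3])
print(training_error(y_true, y_pred))  # ~7.5e7  (dollars squared)
print(rmse(y_true, y_pred))            # ~8660   (dollars)
```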
So, training error decreases quite significantly with model complexity. Now that we've gone through these examples, we can look at what the plot of training error versus model complexity tends to look like: training error decreases as you increase your model complexity.

And why is that? Well, it's pretty intuitive, because the model was fit on the training points, and then I'm asking how well it fits those same points. As I increase the model complexity, I'm better and better able to fit my training data points. So when I go to assess training error with these high-complexity models, I get very low training error.

So, a natural question is whether training error is a good measure of predictive performance. What we're showing here is one of our high-complexity, high-order polynomial models that had very low training error, so it really fit those training data points well. But how is it going to perform on some new house? In particular, maybe we're looking at a house in this gray region, within this range of square feet. The question is, is there something particularly wrong with having Xt square feet? Because what our fitted function is saying is that I believe, or I'm predicting, that houses with roughly Xt square feet are less valuable than houses with fewer square feet, because there's this dip down in the function. Do we really believe that this is a true dip in value, that these houses are just less desirable than houses with fewer or more square feet? Probably not.

So, what's going wrong here? The issue is that training error is overly optimistic as an assessment of predictive performance. And that's because these parameters, w hat, were fit on the training data. They were fit to minimize this training error. Sorry, to minimize the residual sum of squares, which, as we saw, is closely related to training error. And then we're using that same training error to assess predictive performance, but that's going to be very, very optimistic, as this picture shows.

So, in general, having small training error does not imply having good predictive performance, unless your training dataset is really representative of everything you might see out there in the world.
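If you'd like to see this behavior numerically, here is a sketch that fits polynomials of increasing degree to a small synthetic dataset and prints the training error for each, reproducing the decreasing curve described above. The data-generating process and the choice of degrees are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: price vs. size in thousands of square feet
# (scaling x to roughly [1, 3] keeps the high-degree fit well conditioned).
x = rng.uniform(1.0, 3.0, size=12)
y = 100e3 * x + 50e3 + rng.normal(0.0, 20e3, size=12)

for degree in [0, 1, 2, 8]:
    w_hat = np.polyfit(x, y, deg=degree)      # least-squares fit
    y_pred = np.polyval(w_hat, x)             # predict on the training x
    train_err = np.mean((y - y_pred) ** 2)    # average squared loss
    print(f"degree {degree}: training error = {train_err:.3e}")

# The printed training errors fall as the degree grows: the degree-8
# polynomial nearly interpolates the 12 points, but its wiggles (like the
# dip in the lecture's plot) would generalize poorly to new houses.
```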
[MUSIC]