So now let's talk about a way to automatically address this issue by modifying the cost term that we're minimizing when we're assessing how good our fit is. In particular, we're looking at this orange box, this quality metric. Before, our quality metric just depended on the difference between our predicted house sales price and our actual house sales price; in particular, we were looking at residual sum of squares as our measure of fit. But now we're gonna modify this quality metric to also take into account a measure of the complexity of the model, in order to bias us toward simpler models. So when we're thinking about defining this modified cost function, what we're gonna want to do is balance between how well the function fits the data and a measure of how complex, or how potentially overfit, the model is. And what did we see was an indicator of that? The magnitude of our estimated coefficients. So what we're going to balance is the fit of the model to the data against the magnitude of the coefficients of the model. Okay, so we can write down a total cost that has these two terms.
This is our new measure of the quality of the fit, and when I say measure of fit here, what I mean is that a small number indicates a good fit to the data. And on the other hand, for the measure of the magnitude of the coefficients, if that number is small, that means the coefficients are small, and we're unlikely to be in this setting of a very overfit model. Okay, so clearly we want to balance between these two measures, because if I just optimized the magnitude of the coefficients, I'd set all the coefficients to zero, and that sure would not be overfit, but it also would not fit the data well. So that would be a very high-bias solution. On the other hand, if I just focused on optimizing the measure of fit, that's what we did before, and that's the thing that was subject to becoming overfit in the face of complex models. So somehow we want to trade off between these two terms, and that's what we're going to discuss now. Okay, what's our measure of fit? At this point you guys should be pretty sick of hearing me say this: it's our residual sum of squares, which I've written here, and hopefully this formula is quite familiar to you at this point.
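The extreme case described above can be made concrete with a small sketch (not from the lecture; the targets here are hypothetical): an all-zero coefficient vector has zero magnitude penalty but a large residual sum of squares.

```python
# Sketch of the extreme case: setting every coefficient to zero
# minimizes the magnitude term but fits the training data poorly.
y = [2.0, 4.0, 6.0]          # hypothetical training targets
preds_zero = [0.0] * len(y)  # predictions from an all-zero model

rss_zero = sum((yi - pi) ** 2 for yi, pi in zip(y, preds_zero))
penalty_zero = 0.0           # magnitude of an all-zero w is 0

print(rss_zero)  # 56.0 -> large fit error despite zero penalty
```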
But sometimes we also write it as follows, where, remember, this is our predicted value using w in our model to make these predictions. And just remember that a small residual sum of squares is indicative of the model fitting the training data well. So just as we said on the previous slide, when we're thinking about measure of fit, a small number is gonna indicate a good fit. Okay, so now what we need is a measure of the magnitude of the coefficients. So what summary number might be indicative of the size of the regression coefficients? Well, maybe you think about just summing all the coefficients together. Is this gonna be a good measure of the overall magnitude of the coefficients? Probably not in a lot of cases, because you might end up with a situation where, let's say, w0 is 1,527,301 and w1 is -1,605,253, and these are the only two coefficients in our model. If I look at w0 + w1, this is gonna be some small number, despite the fact that each of the coefficients themselves was quite large. Okay, so you might say, I know how to fix this: I'll just look at the absolute value.
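The cancellation above is easy to verify directly, using the two coefficient values from the lecture example:

```python
# Why summing raw coefficients is a poor measure of magnitude:
# two very large coefficients of opposite sign nearly cancel.
w0 = 1_527_301
w1 = -1_605_253

plain_sum = w0 + w1
print(plain_sum)  # -77952, tiny relative to either coefficient

# The sum is far smaller than the size of each coefficient.
assert abs(plain_sum) < abs(w0) and abs(plain_sum) < abs(w1)
```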
So maybe what I'll do is look at |w0| + |w1| + ... all the way up to |wD|, and I'll just write this compactly as the sum from j = 0 to capital D, the number of features we have, of the absolute value of wj. And this is defined to be equal to what's called the one norm of the vector of coefficients w. So we write it as ||w||_1, and this is called the L1 norm. And this is actually a very reasonable choice, and we're gonna discuss it more in the next module. But for now, the thing that we're gonna consider is the sum of the squares of the coefficients. So w0 squared + w1 squared, all the way up to wD squared. This is the sum from j = 0 to capital D of wj squared, and this is defined to be equal to a norm we've actually seen many times in this class so far: the two norm squared. So this is called our L2 norm, or really the L2 norm squared, and this is gonna be the focus of this module. Okay. So again, just to summarize, what we have is that our total cost is a sum of the measure of fit plus a measure of the magnitude of the coefficients, and we said our measure of fit is our residual sum of squares.
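The two candidate magnitude measures can be sketched in a few lines of Python (the coefficient vector here is a made-up example, not from the lecture):

```python
# The two magnitude measures defined above.
def l1_norm(w):
    # ||w||_1 = sum of absolute values of the coefficients
    return sum(abs(wj) for wj in w)

def l2_norm_squared(w):
    # ||w||_2^2 = sum of squared coefficients
    return sum(wj ** 2 for wj in w)

w = [1.0, -2.0, 3.0]       # hypothetical coefficient vector
print(l1_norm(w))          # 6.0
print(l2_norm_squared(w))  # 14.0
```

Note that unlike the plain sum, both measures are immune to sign cancellation: large positive and large negative coefficients both contribute positively.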
And our measure of the magnitude of the coefficients for this module is going to be this two norm of the w vector, squared.
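Putting the two terms together, the total cost on this slide can be sketched as follows. This is a minimal illustration, assuming a simple linear model with an intercept; the weighting between the two terms is discussed later in the course.

```python
# Total cost = RSS (measure of fit) + ||w||_2^2 (measure of magnitude).

def predict(w, x):
    # y_hat = w0 + w1*x1 + ... + wD*xD, with x a list of features
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def rss(w, X, y):
    # residual sum of squares over the training data
    return sum((yi - predict(w, xi)) ** 2 for xi, yi in zip(X, y))

def total_cost(w, X, y):
    l2_penalty = sum(wj ** 2 for wj in w)
    return rss(w, X, y) + l2_penalty

# Hypothetical data where y = 2x exactly.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
w = [0.0, 2.0]  # perfect fit: RSS = 0, penalty = 0^2 + 2^2 = 4
print(total_cost(w, X, y))  # 4.0
```

Even a perfectly fitting model pays a cost for the magnitude of its coefficients, which is exactly the bias toward simpler models described above.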