[MUSIC] Well, we discussed ridge regression and cross-validation, but we kinda brushed under the rug what can be a fairly important issue with our ridge regression objective, which is how to deal with the intercept term that's commonly included in most models.

So in particular, let's recall our multiple regression model, which is shown here. So far we've just treated it generically: there's some h0 of x, that's our first feature, with coefficient w0. But as we mentioned two modules ago, typically that first feature is taken to be what's called the constant feature, so that w0 just represents the intercept of the model. So if you're thinking of some hyperplane, where is it sitting along that y-axis? And then all the other features are some arbitrary set of other terms that you might be interested in.

Okay. Well, if we have this constant feature in our model, then the model that I wrote on the previous slide simplifies to the following. In this case, when we think of our matrix notation for having N different observations, when we're forming our H matrix, the first column of that matrix is the one multiplying the w0 term, the w0 coefficient. So in this special case, that entire first column is filled entirely with ones, so that we get w0 as the contribution of the first feature for every observation. Okay, so this is the specific form that our H matrix is gonna take in this case where we have an intercept term in the model.

Now let's return to our standard ridge regression objective, where we said we have RSS(w) + lambda ||w||_2 squared, and where that w vector included w0 for the intercept term, in models where that's what it represents. So a question is, does this really make sense to do? Because what this is doing is encouraging that intercept term to be small; that's what the ridge regression penalty does. And do we want a small intercept? It's useful to think about ridge regression when you're adding lots and lots of features, but regardless of how many features you add to your model, does that really matter to how we think about the magnitude of the intercept? Not really.
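To see this concretely, here's a minimal numpy sketch (my own illustration, not code from the lecture) of the H matrix with the constant feature, and the standard ridge closed-form solution in which the plain identity matrix means w0 gets shrunk along with everything else. The data and variable names are made up for the example:

```python
import numpy as np

# Toy data: N observations, 3 arbitrary features, true intercept 2.5 (all made up).
np.random.seed(0)
N = 100
X = np.random.randn(N, 3)
y = 2.5 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(N)

# H matrix with the constant feature: first column all ones, so w[0] is the intercept.
H = np.column_stack([np.ones(N), X])

# Standard ridge closed form: w_hat = (H^T H + lambda * I)^(-1) H^T y.
# The identity matrix has a 1 in its (0, 0) entry, so this penalty shrinks w0 too.
lam = 1.0
w_hat = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)
print(w_hat)  # w_hat[0] is pulled toward 0 relative to the true intercept
```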
So it probably doesn't make a lot of sense intuitively to think about shrinking the intercept just because we have this very flexible model with lots of other features. So let's think about how to address this.

Okay, the first option we have is to not penalize the intercept term. The way we can do that is to separate out that w0 coefficient from all the other w's, w1, w2, all the way up to wD, when we're thinking about the penalty term. So we have the residual sum of squares of w0 and what I'll call w_rest, all those other w's. And when we add our ridge regression penalty, the 2-norm is only taken of that w_rest vector, all those w's not including our intercept.

So a question is, how do we implement this in practice? How is this gonna modify the closed-form solution or the gradient descent algorithm that we showed previously, when we weren't handling this special case?

The very simple modification we can make is to define something I'm calling Imod, a modified identity matrix. It has a 0 in the first entry, the (1,1) entry, and all the other elements are exactly the same as in the identity matrix before. So to be explicit, our H-transpose-H term is gonna look just as it did before, but now the lambda Imod matrix has a 0 in the entry corresponding to the w0 index, lambdas as before everywhere else on the diagonal, and of course still 0s off the diagonal.

Okay, now let's look at our gradient descent algorithm. Here it's gonna be very simple: we just add in a special case that if we're updating our intercept term, so if we're looking at that zeroth feature, we're just gonna use our old least squares update, with no shrinkage to w0. But otherwise, for all other features, we're gonna do the ridge update.

Okay, so we see algorithmically it's very straightforward to make this modification where we don't want to penalize that intercept term.
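Here's a minimal sketch of both versions of that fix, again my own illustration rather than code from the lecture: the closed form with the modified identity matrix, and gradient descent with the special case for the intercept. The step size and iteration count are arbitrary placeholders that would need tuning for a given dataset:

```python
import numpy as np

def ridge_closed_form_no_intercept_penalty(H, y, lam):
    """Closed form with Imod: a 0 in the (0, 0) entry, so w0 is not penalized."""
    I_mod = np.eye(H.shape[1])
    I_mod[0, 0] = 0.0                    # no shrinkage on the intercept
    return np.linalg.solve(H.T @ H + lam * I_mod, H.T @ y)

def ridge_gradient_descent(H, y, lam, step_size=1e-3, n_iters=10000):
    """Gradient descent where feature 0 gets the plain least squares update."""
    w = np.zeros(H.shape[1])
    for _ in range(n_iters):
        residuals = y - H @ w
        grad = -2.0 * (H.T @ residuals)  # least squares part, for every coefficient
        grad[1:] += 2.0 * lam * w[1:]    # ridge term for all features except the intercept
        w -= step_size * grad            # so w[0] sees only the least squares update
    return w
```

On the same data, the two should agree up to the convergence tolerance of the gradient descent, with w[0] no longer shrunk toward 0.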
But there's another option we have, which is to transform the data. So in particular, if we center the data about 0 as a pre-processing step, then it doesn't matter so much that we're shrinking the intercept towards 0 and not correcting for that, because when we have data centered about 0, in general we tend to believe that the intercept will be pretty small.

So here what I'm saying is, step one, first we transform all our y observations to have mean 0. And then as a second step, we just run exactly the ridge regression we described at the beginning of this module, where we don't account for the fact that there's this intercept term at all. So, that's another perfectly reasonable solution to this problem.
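And a sketch of this second option, following the lecture's two steps: center the y observations, then run the unmodified ridge regression from the start of the module. Adding the mean back at prediction time is my own addition, implied but not spelled out above:

```python
import numpy as np

def ridge_with_centered_y(H, y, lam):
    """Step 1: center the y observations. Step 2: plain ridge, penalizing all of w."""
    y_mean = y.mean()
    y_centered = y - y_mean              # y now has mean 0, so the intercept should be small
    D = H.shape[1]
    w = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y_centered)
    return w, y_mean                     # predict on new data with: H_new @ w + y_mean
```

[MUSIC]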