[MUSIC] So instead of this closed-form solution, which can be computationally prohibitive, let's think about using our gradient descent algorithm to solve our ridge regression.

And let's just talk about the element-wise update that we have. So it's the update to the jth feature weight. And remember, here I'm showing the gradient of our total cost, just as a reminder. What we do is we take our previous weight for the jth feature and we subtract eta, our step size, times whatever this gradient dictates, to form our new weight.

And what does this gradient tell us? Well, this first term here on this line is exactly what we had before from our residual sum of squares term. So this is the same as before; it comes from our residual sum of squares term. But now, in addition to this, we have another term appearing. We have this term, and this is our new term that comes from the jth component of 2 lambda w, where w is our weight vector.

Okay, so one thing we see is we have a w_j^t appearing in two places. So we can combine these and simplify the form of this expression to get a little bit more interpretability. What we see is that our update first takes w_j^t and multiplies it by 1 minus 2 times our step size times our tuning parameter, so 1 minus 2 eta lambda. And here we know that eta is greater than 0, and we know that lambda is greater than or equal to 0, so 2 times eta times lambda is going to be greater than or equal to 0. So when we have 1 minus a nonnegative number, we know that this thing here is going to be less than or equal to 1. But the other thing we want is for the result to be positive, so to enforce that we're going to say that 2 eta lambda has to be less than 1. Okay, so it's a positive quantity that is at most 1.

Okay, so what we're doing is we're taking our previous coefficient and we're multiplying it by a positive number less than 1 (whenever lambda is greater than 0), so what that's doing is just shrinking the value by some fraction.
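To make this combined update concrete, here is a minimal Python sketch of the per-coefficient ridge step. It assumes the usual setup in this course, a feature matrix H and an output vector y, with the RSS partial derivative -2 h_j^T (y - Hw); the function and variable names are my own for illustration, not from the slides.

```python
import numpy as np

# A minimal sketch of the per-coefficient ridge update described above.
# Variable names (feature_matrix, output, weights, eta, lam) are illustrative,
# not taken from the lecture slides.

def partial_rss(feature_matrix, output, weights, j):
    """Partial derivative of RSS with respect to w_j: -2 * h_j^T (y - H w)."""
    errors = output - feature_matrix @ weights
    return -2 * feature_matrix[:, j] @ errors

def ridge_update(weights, feature_matrix, output, j, eta, lam):
    """w_j^(t+1) = (1 - 2*eta*lam) * w_j^t - eta * dRSS/dw_j."""
    return (1 - 2 * eta * lam) * weights[j] - eta * partial_rss(
        feature_matrix, output, weights, j)
```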
And then we have this update here, which is the same update that we had just from our residual sum of squares as before, so I'll just call this our update term from residual sum of squares.

And so let's draw a picture of what's happening here. What I'm gonna do is draw a picture with iterations going this way and the magnitude of the coefficients along the y axis. But this is just a little cartoon. So we're gonna say that at some iteration t, we have a coefficient w_j^t. And what we're doing in ridge regression is we're introducing this intermediate step where we first shrink the value of the coefficient according to 1 minus 2 eta lambda. So this is (1 minus 2 eta lambda) times w_j^t. And just to be very clear, this is an intermediate step introduced in ridge regression. So this is some iteration t, this is some in-between iteration, and then we get to iteration t + 1.

What we do is we take whatever this update term is, it could be positive, it could be negative, and we modify this shrunken coefficient. So let's just assume for this illustration that it's positive, some positive value. So I'm going to take the last value of my shrunken coefficient, I'm gonna draw it here, and then I'm gonna add an amount that makes it, let's say, a little bit less than what my coefficient was before. So I'm gonna add this, my little update term from RSS, and this will give me w_j at iteration t + 1.

And let's contrast with what our old standard least squares algorithm did. So previously, in red, just RSS, which was our least squares solution, we had w_j^t, so we've started at the same point. But instead of shrinking the coefficient with this intermediate step, we just took this little update term from residual sum of squares and modified this coefficient.
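To attach some numbers to this cartoon, here is a tiny sketch with purely hypothetical values (none of them come from the lecture): a step size of 0.1, a tuning parameter of 0.5, a coefficient of 2.0 at iteration t, and an RSS-driven change of +0.15.

```python
# Purely illustrative numbers for the cartoon above; not from the lecture.
eta, lam = 0.1, 0.5        # step size and tuning parameter: 1 - 2*eta*lam = 0.9
w_j = 2.0                  # coefficient at iteration t
rss_step = 0.15            # -eta * (partial of RSS), assumed positive here

w_j_ridge = (1 - 2 * eta * lam) * w_j + rss_step   # shrink to 1.8, then update -> 1.95
w_j_ols = w_j + rss_step                           # plain least squares, no shrinkage -> 2.15
```

So the ridge step lands a bit below where the coefficient started, while the plain least squares step pushes it further up, which is exactly the gap the picture is meant to show.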
So I'm gonna try and map over to iteration t + 1 and then add the same amount, the same update term. So again, this is the same update term that I'm adding to w_j^t, but this is gonna give me my w_j^(t+1) according to the least squares algorithm.

Okay, so just to be clear about what's going on here, in case this picture was not helpful to you: in the ridge regression case we're taking our previous value of our coefficient, our jth regression coefficient, and the first thing we're doing, before we listen to what our fit term says we should do to it, is we first shrink it. At every iteration we first take the last value and we make it smaller. And that's where the model complexity term is coming in, where we're saying we want smaller coefficients to avoid overfitting. But then after that we're going to say, well, okay, let's listen to how well my model's fitting the data. That's where this update term is coming in, and we're going to modify our shrunken coefficient. And in contrast, our old solution never did that shrinking first. It just said, okay, let's just listen to the fit, and listen to the fit, and listen to the fit at every iteration, and that's where we could get into issues of overfitting.

Okay, so here's a summary of our standard least squares gradient descent algorithm. This is what we had two modules ago, where we just initialize our weights vector, and while not converged, for every feature dimension we go through and compute this partial derivative of the residual sum of squares. That's that update term I was talking about on the last slide. And then I'm going to use that to modify my previous coefficient according to the step size eta, and I'm gonna keep iterating until convergence. Okay, so this is what you coded up before.

Now let's look at how we have to modify this code to get a ridge regression variant. And notice that the only change is in this line for updating w_j where, instead of taking w_j^t and doing this update with what I'm calling "partial" here, we're taking (1 minus 2 eta lambda) times w_j and doing the update. So it's a trivial, trivial, trivial modification to your code. And that's really cool.
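As a rough sketch of how that one-line change sits inside the full loop, here is a possible Python version; the variable names and the convergence test on the gradient magnitude are my own choices, not the lecture's code.

```python
import numpy as np

def ridge_gradient_descent(feature_matrix, output, initial_weights,
                           eta, lam, tolerance=1e-6, max_iters=10000):
    """Gradient descent for ridge regression; a sketch, not the course's exact code."""
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iters):
        # errors based on the weights at the start of this iteration
        errors = output - feature_matrix @ weights
        gradient_norm_sq = 0.0
        for j in range(len(weights)):
            # partial derivative of RSS with respect to w_j
            partial = -2 * feature_matrix[:, j] @ errors
            # accumulate the magnitude of the total-cost gradient for the stopping test
            gradient_norm_sq += (partial + 2 * lam * weights[j]) ** 2
            # the one-line change from plain least squares:
            # shrink w_j by (1 - 2*eta*lam) before applying the RSS update
            weights[j] = (1 - 2 * eta * lam) * weights[j] - eta * partial
        if np.sqrt(gradient_norm_sq) < tolerance:
            break
    return weights
```

Setting lam to 0 recovers the plain least squares update, which is one way to see how small the modification really is.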
It's very, very, very simple to implement ridge regression given a specific lambda value. [MUSIC]