We've seen now how regularization can play a role in logistic regression to find much better fits of the data and better assessments of probability. Let's finally talk about how we can learn the coefficients from data using gradient ascent. It's going to be a very, very tiny change to what we did for learning the coefficients in logistic regression. So with a tiny change of code, we now address, and alleviate, all those overfitting problems that we had before.

So again, same setting as before: training data, features, same model. Now the L2-regularized log-likelihood is our quality metric, and we're going to talk about the ML algorithm that optimizes it to get w hat. We're going to use the same kind of gradient ascent algorithm that we used before: we start from some point and take little steps of size eta until we get to our solution w hat. It's the same kind of approach: we take the old coefficients, add eta times the gradient, and get the new coefficients w(t+1). So the only thing we have to ask ourselves is: what is the gradient equal to, now that we've added this extra regularization term? In other words, we need the gradient of the regularized log-likelihood. Let's see what that looks like.

We've seen that our total quality is the log-likelihood of the data, which is a measure of fit, minus lambda times our regularization penalty, which is the L2 norm squared. So what is the derivative of this thing? This is what we need in order to walk in that hill-climbing direction. The derivative of a sum is the sum of the derivatives, so the total derivative is the derivative of the first term, the derivative of the log-likelihood, which, thankfully, we saw in the previous module, minus lambda times the derivative of the quadratic term. We already covered the derivative of the quadratic term in the regression course, but we'll do a quick review here. As you can see, this is just a small change to the code from before: we just have to account for the extra term, minus lambda times the derivative of the quadratic term.
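Written out in symbols, using ℓ(w) for the data log-likelihood and w = (w0, ..., wD) for the coefficients (this notation is my rendering of what the slides describe, not a quote of them), the quality metric, its gradient, and the gradient ascent update are:

\[
\text{Total quality}(\mathbf{w}) \;=\; \ell(\mathbf{w}) \;-\; \lambda\,\|\mathbf{w}\|_2^2,
\qquad
\frac{\partial}{\partial w_j}\Big(\ell(\mathbf{w}) - \lambda\|\mathbf{w}\|_2^2\Big)
\;=\;
\frac{\partial \ell(\mathbf{w})}{\partial w_j} \;-\; \lambda\,\frac{\partial}{\partial w_j}\|\mathbf{w}\|_2^2,
\]

\[
\mathbf{w}^{(t+1)} \;\leftarrow\; \mathbf{w}^{(t)} + \eta\,\nabla\Big(\ell(\mathbf{w}^{(t)}) - \lambda\|\mathbf{w}^{(t)}\|_2^2\Big).
\]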
So, as a review, the derivative of the log-likelihood is the sum over my data points of the difference between the indicator of whether it's a positive example and the probability of it being positive, weighted by the value of the feature. We talked about this last module and interpreted this piece in quite a bit of detail, so I'm not going to go over it again; we're going to focus on the second part, which is the derivative of the L2 penalty. In other words, what's the partial derivative with respect to some parameter wj of w0 squared plus w1 squared plus w2 squared, plus dot dot dot, plus wj squared, plus dot dot dot, plus wD squared? If you look at all of these terms, w0 squared, w1 squared, and so on, they don't play any role in the derivative. The only term that plays a role is wj squared. And what's the derivative of wj squared? It's just 2wj. So that's all that's going to change in our code: this 2wj term. In fact, our total derivative is the same derivative that we've implemented in the past, minus 2 lambda times wj, that is, 2 times the regularization parameter lambda times the value of that coefficient; the full expression is written out below.

So let's interpret what this extra term does for us. What does the minus 2 lambda wj do to the derivative? If wj is positive, then minus 2 lambda wj is a negative term, a negative contribution to the derivative, which means that it decreases wj, because you're adding some negative quantity to it. It was positive, and we're going to decrease it. So since it was positive and you're decreasing it, wj becomes closer to 0. If the coefficient is positive, you add a negative number and it becomes less positive, closer to 0. In fact, if lambda is bigger, that term becomes more negative and wj goes to 0 faster. And if wj is very positive, the decrement is also larger, so again it goes toward 0 even faster.
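To make the shrinkage concrete, here is the full regularized derivative just described, followed by a small worked example; the specific numbers (wj = 2, lambda = 1, step size eta = 0.1) are made up purely for illustration.

\[
\frac{\partial \ell(\mathbf{w})}{\partial w_j} - 2\lambda w_j
\;=\;
\sum_{i=1}^{N} h_j(\mathbf{x}_i)\Big(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\Big) \;-\; 2\lambda w_j,
\]

\[
-2\lambda w_j = -2(1)(2) = -4,
\qquad
w_j^{(t+1)} = 2 + 0.1 \times (-4) = 1.6
\quad\text{(ignoring the log-likelihood part of the derivative)}.
\]

The coefficient moves from 2 toward 0. Because this contribution is linear in both lambda and wj, a larger lambda or a larger coefficient produces a proportionally stronger pull toward zero, which is exactly the behavior described above.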
Now, if wj is negative, then minus 2 lambda wj is going to be greater than 0, because lambda is also greater than 0. And what impact does that have? You're adding something positive, so you're increasing wj, which means that wj again becomes closer to 0. It was negative, you add a positive number to it, and it moves a little closer to 0. So this is extremely intuitive: the regularization takes positive coefficients and decreases them a little bit, and takes negative coefficients and increases them a little bit. It tries to push the coefficients toward 0; that's the effect it has on the gradient, exactly what you'd expect.

Finally, this is exactly the code that we described in the last module to learn the coefficients of a logistic regression model. You start with coefficients equal to 0, or some other randomly initialized or smartly initialized parameters. Then, on each iteration, you go coefficient by coefficient and compute the partial derivative, which is this long term here: a sum over data points of the feature value times the difference between the indicator of whether it's a positive data point and the predicted probability of it being positive. Call that partial[j]. And you have the same update: wj(t+1) is wj(t) plus the step size eta times the partial derivative, just as before, which is the derivative of the likelihood function with respect to wj. And all you need to change in your code, there's only one little thing to change: this little term here, minus 2 lambda wj, which is our only change. In other words, take all the code you had before, add minus 2 lambda wj to the computation of the derivative, and now you have a solver for L2-regularized logistic regression. And this is going to help you a tremendous amount in practice.
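To tie the whole update together, here is a minimal sketch in Python/NumPy of the L2-regularized gradient ascent loop described above. The names (predict_probability, l2_regularized_logistic_regression, H, y) are illustrative choices for this sketch, not the course's actual starter code; it assumes a feature matrix H with H[i, j] = hj(xi), including a constant column for the intercept, and labels y in {-1, +1}.

import numpy as np


def predict_probability(H, w):
    """P(y = +1 | x_i, w) for every row of H, via the sigmoid of the score."""
    return 1.0 / (1.0 + np.exp(-(H @ w)))


def l2_regularized_logistic_regression(H, y, step_size=1e-5, l2_penalty=1.0, max_iter=500):
    """Gradient ascent on the L2-regularized log-likelihood.

    H : (N, D) array, H[i, j] = h_j(x_i); include a constant column for w_0.
    y : (N,) array of labels in {-1, +1}.
    """
    w = np.zeros(H.shape[1])              # start from all-zero coefficients
    indicator = (y == +1).astype(float)   # 1[y_i = +1]
    for _ in range(max_iter):
        errors = indicator - predict_probability(H, w)  # 1[y_i = +1] - P(y = +1 | x_i, w)
        partial = H.T @ errors                          # derivative of the log-likelihood
        partial -= 2.0 * l2_penalty * w                 # the only new piece: -2 * lambda * w_j
        w = w + step_size * partial                     # w^(t+1) = w^(t) + eta * gradient
    return w

Setting l2_penalty to 0 recovers the unregularized update from the last module; the regularized version differs only in the single line that subtracts 2 * l2_penalty * w from the derivative.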