We're now down to our final practical issue for stochastic gradient, which again is going to be an optional section. And the question is, how do we introduce regularization, and what impact does it have on the updates? This is going to be pretty significant if you want to implement it yourself in a general way. Again, it's an optional section because it's pretty detailed.

So let's remind ourselves of the regularized likelihood. We have some total quality metric defined by the fit to our data, which is the log likelihood, and some measure of the complexity of the parameters, which in our case is the squared norm of the parameters, ||w||^2. And we want to compute the gradient of this regularized likelihood to make progress and avoid overfitting.

So the total derivative is the derivative of the first term, the quality, which is the sum over the data points of the contribution of each data point, the thing that got really expensive. And the contribution of the second term, once we introduce the parameter lambda to trade off the two things, which we've talked about a lot, is minus 2 lambda w_j. We derived this update in an earlier module on regularized logistic regression.

Now, this is how we do straight-up gradient updates with regularization. The question is, what do we do when we only do stochastic gradient with regularization?

So if you remember stochastic gradient, we just said we take the contribution from a single data point, and if we added up those contributions we'd get exactly the gradient. This is why it worked, because the sum of the stochastic gradients equals the gradient. And so, to mimic that, we need to think about how to set up the regularization term such that the sum of the stochastic gradients is also equal to the gradient.

And one way to think about this is to take the regularization and divide it by N. So in practice you want to do this: you want to say that the total derivative for stochastic gradient is the contribution of one data point minus 2/N lambda w_j. And so if you were to add this up over all N data points, you'd get back the full gradient.

So with regularization, the algorithm stays the same, but the contribution of a data point is its gradient minus 2/N lambda w_j. That's the contribution from the regularization term.
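To make that concrete, here is a minimal sketch of what that update could look like in code. This is my own illustration, not the course's notebook code: the function names, the +1/-1 label convention, and the fixed step size are all assumptions.

```python
import numpy as np

def predict_probability(features, weights):
    """P(y = +1 | x, w) under logistic regression, for one row or a matrix of rows."""
    scores = features @ weights
    return 1.0 / (1.0 + np.exp(-scores))

def stochastic_gradient_ascent(features, labels, step_size, l2_penalty, num_passes):
    """One-point-at-a-time ascent on the L2-regularized log likelihood.

    The regularization contribution is divided by N so that the per-point
    stochastic gradients still sum to the full gradient over one pass.
    """
    N, D = features.shape
    weights = np.zeros(D)
    indicator = (labels == +1).astype(float)     # 1[y_i = +1]
    for _ in range(num_passes):
        for i in np.random.permutation(N):       # shuffle the data each pass
            x_i = features[i]
            error = indicator[i] - predict_probability(x_i, weights)
            likelihood_contribution = error * x_i                     # d(log lik_i)/dw
            regularization_contribution = -(2.0 / N) * l2_penalty * weights
            weights += step_size * (likelihood_contribution + regularization_contribution)
    return weights
```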
If you're using mini-batches, you'd adapt that term as well, as in the sketch below. So you'd take the sum of the contributions from each data point in the batch, and then the regularization contribution would be minus 2B/N lambda w_j, where B is the batch size. So this is how we take care of regularization. Again, a very small change, but it's going to behave way better.
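And here is a rough sketch of that mini-batch variant, again with made-up names and reusing the predict_probability helper from the sketch above: the likelihood contributions are summed over the B points in the batch, and the regularization contribution is scaled by 2B/N.

```python
import numpy as np

def minibatch_gradient_ascent(features, labels, step_size, l2_penalty,
                              batch_size, num_passes):
    """Mini-batch ascent: sum the B per-point contributions and scale the
    regularization term by B/N, so one full pass still matches the full gradient."""
    N, D = features.shape
    weights = np.zeros(D)
    indicator = (labels == +1).astype(float)
    for _ in range(num_passes):
        order = np.random.permutation(N)
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            B = len(batch)
            # uses predict_probability from the previous sketch
            errors = indicator[batch] - predict_probability(features[batch], weights)
            likelihood_contribution = features[batch].T @ errors      # sum over the batch
            regularization_contribution = -(2.0 * B / N) * l2_penalty * weights
            weights += step_size * (likelihood_contribution + regularization_contribution)
    return weights
```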