If you suspect your neural network is overfitting your data, that is, you have a high variance problem, one of the first things you should try is probably regularization. The other way to address high variance is to get more training data, which is also quite reliable, but you can't always get more training data, or it could be expensive to get. Adding regularization will often help to prevent overfitting, or to reduce the errors in your network. So let's see how regularization works.

Let's develop these ideas using logistic regression. Recall that for logistic regression, you try to minimize the cost function J, defined as the sum over your m training examples of the losses on the individual predictions, where w and b are the parameters of logistic regression: w is an n_x-dimensional parameter vector, and b is a real number. To add regularization to logistic regression, you add to the cost this term: lambda/(2m) times the norm of w squared, where lambda is called the regularization parameter (I'll say more about that in a second). Here the norm of w squared, ||w||_2^2, is just the sum from j = 1 to n_x of w_j squared, which can also be written w transpose w; it's the squared Euclidean norm of the parameter vector w. This is called L2 regularization, because you're using the Euclidean norm, also called the L2 norm, of the parameter vector w.

Now, why do you regularize just the parameter w? Why don't we add a term for b as well? In practice you could, but I usually just omit it. If you look at your parameters, w is usually a pretty high dimensional parameter vector, especially with a high variance problem, so almost all the parameters are in w rather than b, which is just a single number. If you add a regularization term for b, in practice it won't make much of a difference, because b is just one parameter out of a very large number of parameters. So I usually just don't bother to include it, but you can if you want.
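As a concrete reference, here is a minimal NumPy sketch of the L2-regularized logistic regression cost just described. The function name, variable shapes, and use of NumPy are my own assumptions for illustration, not code from the course.

```python
import numpy as np

def l2_regularized_cost(w, b, X, Y, lambd):
    """Cross-entropy cost for logistic regression plus an L2 penalty on w.

    Assumed shapes: X is (n_x, m), Y is (1, m), w is (n_x, 1); b is a scalar.
    lambd stands in for the regularization parameter lambda.
    """
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))        # sigmoid predictions, shape (1, m)
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))  # (lambda / 2m) * ||w||_2^2
    return cross_entropy + l2_penalty
```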
L2 regularization is the most common type of regularization. You might also have heard people talk about L1 regularization. That's when, instead of the L2 norm term, you add a term lambda/m times the L1 norm of w, that is, the sum of the absolute values of the components of w, written with a little subscript 1: ||w||_1. Whether you put m or 2m in the denominator is just a scaling constant. If you use L1 regularization, then w will end up being sparse, meaning the w vector will have a lot of zeros in it. Some people say this can help with compressing the model: because a set of the parameters is zero, you need less memory to store the model. Although I find that, in practice, making your model sparse with L1 regularization helps only a little bit, so I don't think it's used that much, at least not for the purpose of compressing your model. When people train neural networks, L2 regularization is just used much, much more often.

One last detail: lambda here is called the regularization parameter. Usually you set it using your development set, or using cross validation, where you try a variety of values and see what does best in terms of trading off between doing well on your training set and keeping the L2 norm of your parameters small, which helps prevent overfitting. So lambda is another hyperparameter that you might have to tune. And by the way, for the programming exercises, lambda is a reserved keyword in the Python programming language, so in the programming exercises we'll use lambd, without the a, so as not to clash with the reserved keyword. We use lambd to represent the lambda regularization parameter.

So this is how you implement L2 regularization for logistic regression.
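Before moving on to the neural network case, here is a small sketch contrasting the two penalty terms just described. The helper names are hypothetical, and the parameter is called lambd rather than lambda for the reason noted above.

```python
import numpy as np

# "lambda" is a reserved keyword in Python, so the regularization
# parameter is passed around as "lambd", as in the programming exercises.

def l1_penalty(w, lambd, m):
    # (lambda / m) * ||w||_1 -- tends to drive many entries of w to exactly zero
    return (lambd / m) * np.sum(np.abs(w))

def l2_penalty(w, lambd, m):
    # (lambda / 2m) * ||w||_2^2 -- the far more common choice in practice
    return (lambd / (2 * m)) * np.sum(np.square(w))
```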
How about a neural network? In a neural network, you have a cost function that's a function of all of your parameters, w[1], b[1] through w[L], b[L], where capital L is the number of layers in your neural network. The cost function is the sum of the losses over your m training examples, and to add regularization you add lambda/(2m) times the sum, over all of your weight matrices w[l], of their squared norms. The squared norm of a matrix is defined as the sum over i and over j of each element of that matrix, squared. If you want the indices of this summation, it is the sum from i = 1 through n[l] and the sum from j = 1 through n[l-1], because w[l] is an n[l] by n[l-1] dimensional matrix, where n[l] and n[l-1] are the numbers of units in layer l and layer l-1.

It turns out this matrix norm is called the Frobenius norm of the matrix, denoted with an F in the subscript. For arcane linear algebra technical reasons, it is not called the L2 norm of a matrix; I know it sounds like it would be more natural to just call it the L2 norm of the matrix, but for really arcane reasons that you don't need to know, by convention it's called the Frobenius norm. It just means the sum of squares of the elements of a matrix.

So how do you implement gradient descent with this? Previously, we would compute dw[l] using backprop, where backprop gives us the partial derivative of J with respect to w[l] for any given layer l, and then you update w[l] as w[l] minus the learning rate times dw[l]. That was before we added this extra regularization term to the objective. Now that we've added the regularization term, what you do is take dw[l] and add to it lambda/m times w[l], and then you compute the update the same as before. It turns out that with this new definition, dw[l] is still a correct expression for the derivative of your cost function with respect to your parameters, now that you've added the extra regularization term at the end. And it's for this reason that L2 regularization is sometimes also called weight decay.
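Here is a minimal sketch of the Frobenius-norm penalty and the modified gradient step for one layer. The function names and argument shapes are my own assumptions for illustration.

```python
import numpy as np

def frobenius_penalty(weight_matrices, lambd, m):
    # (lambda / 2m) * sum over layers of ||W[l]||_F^2 (the sum of squared entries)
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)

def update_layer_weights(W, dW_backprop, lambd, m, alpha):
    """One gradient-descent step for one layer's weight matrix W.

    dW_backprop is the gradient of the unregularized cost from backprop;
    adding (lambd / m) * W turns it into the gradient of the regularized cost.
    """
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW
```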
To see why it's called weight decay, take this definition of dw[l] and plug it into the update: w[l] becomes w[l] minus the learning rate alpha times the thing you got from backprop, minus alpha times lambda/m times w[l]. This is equal to (1 - alpha lambda/m) times w[l], minus alpha times the thing you got from backprop. So this term shows that whatever the matrix w[l] is, you're going to make it a little bit smaller: it's as if you're taking the matrix w[l] and multiplying it by 1 - alpha lambda/m, a number which is a little bit less than 1. So this is why L2 regularization is also called weight decay: it's just like ordinary gradient descent, where you update w by subtracting alpha times the original gradient you got from backprop, but now you're also multiplying w by this factor that's slightly less than 1. I'm not really going to use the name weight decay, but the intuition for the name is that the first term multiplies the weight matrix by a number slightly less than 1.

So that's how you implement L2 regularization in a neural network. Now, one question people have asked me is, hey, Andrew, why does regularization prevent overfitting? Let's look at the next video and gain some intuition for how regularization prevents overfitting.
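As a closing aside, if you want to convince yourself of the weight-decay algebra above numerically, here is a tiny check; the shapes and values are arbitrary and not taken from the course.

```python
import numpy as np

# The regularized update can be read two equivalent ways:
#   W - alpha * (dW + (lambd / m) * W)  ==  (1 - alpha * lambd / m) * W - alpha * dW
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW = rng.standard_normal((3, 4))
alpha, lambd, m = 0.1, 0.7, 50

regularized_step = W - alpha * (dW + (lambd / m) * W)
decay_view = (1 - alpha * lambd / m) * W - alpha * dW
print(np.allclose(regularized_step, decay_view))   # True: the weights "decay" by that factor
```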