Why does regularization help with overfitting? Why does it help with reducing variance problems? Let's go through a couple of examples to gain some intuition about how it works.

Recall the high bias and high variance pictures from our earlier video, which looked something like this. Now let's fit a large and deep neural network. I know I haven't drawn this one too large or too deep, but think of it as some neural network that is currently overfitting. You have some cost function J of W, b equal to the sum of the losses, and what we did for regularization was add this extra term that penalizes the weight matrices for being too large. That was the Frobenius norm. So why is it that shrinking the L2 norm, or the Frobenius norm, of the parameters might cause less overfitting?

One piece of intuition is that if you crank the regularization parameter lambda up to be really, really big, you'll be really incentivized to set the weight matrices W reasonably close to zero. So one piece of intuition is that maybe it sets the weights so close to zero for a lot of hidden units that it basically zeroes out a lot of the impact of those hidden units. If that's the case, then this much simplified neural network becomes a much smaller neural network. In fact, it is almost like a logistic regression unit, but stacked multiple layers deep. And so that would take you from this overfitting case much closer to the high bias case on the left. But hopefully there's an intermediate value of lambda that results in something closer to the just-right case in the middle.

So the intuition is that by cranking up lambda to be really big, you set W close to zero. In practice that isn't exactly what happens, but you can think of it as zeroing out, or at least reducing, the impact of a lot of the hidden units, so you end up with what might feel like a simpler network, one that gets closer and closer to logistic regression. The intuition that a bunch of hidden units get completely zeroed out isn't quite right, though. What actually happens is that the network still uses all of its hidden units, but each of them just has a much smaller effect.
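To make that regularized cost concrete, here is a minimal NumPy sketch of J(W, b) = (1/m) * sum of the losses + (lambda / 2m) * sum over layers of the squared Frobenius norm of W. The binary cross-entropy loss and the parameter dictionary layout (keys "W1", "b1", ..., "WL", "bL") are illustrative assumptions, not code from the course.

```python
import numpy as np

def l2_regularized_cost(AL, Y, parameters, lambd):
    """Cross-entropy cost plus the L2 (Frobenius norm) penalty on all weight matrices."""
    m = Y.shape[1]

    # Unregularized part: average binary cross-entropy loss over the m examples.
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m

    # Regularization part: (lambda / 2m) * sum_l ||W^[l]||_F^2.
    L = len(parameters) // 2  # number of layers, assuming keys W1..WL and b1..bL
    frobenius_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                        for l in range(1, L + 1))
    l2_penalty = (lambd / (2 * m)) * frobenius_sum

    return cross_entropy + l2_penalty
```

A larger lambd makes the second term dominate, which is exactly what pushes gradient descent toward smaller weight matrices.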
But you do end up with a simpler network, as if you had a smaller network, and that smaller network is therefore less prone to overfitting. A lot of this intuition will feel more concrete when you implement regularization in the programming exercise, where you'll actually see some of these variance reduction results yourself.

Here's another attempt at additional intuition for why regularization helps prevent overfitting. For this, I'm going to assume that we're using the tanh activation function, which looks like this: g of z equals tanh of z. If that's the case, notice that so long as z is quite small, so z takes on only a smallish range of values, maybe around here, then you're just using the linear regime of the tanh function. It's only if z is allowed to wander up to larger values, or down to more negative values, that the activation function starts to become less linear.

So the intuition you might take away from this is that if lambda, the regularization parameter, is large, then your parameters will be relatively small, because they are penalized for being large in the cost function. And if the weights W are small, then because z equals W times the previous layer's activations, plus b, if W tends to be very small, then z will also be relatively small. In particular, if z ends up taking relatively small values, just in this narrow range, then g of z will be roughly linear. So it's as if every layer is roughly linear, as if it were just linear regression. And we saw in course one that if every layer is linear, then your whole network is just a linear network. So even a very deep network with a linear activation function is, in the end, only able to compute a linear function. It's not able to fit those very complicated, highly non-linear decision boundaries that let it really overfit to data sets, like the overfitting, high-variance case we saw on the previous slide.
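Here is a small numerical sketch of that "linear regime" point: for small inputs tanh(z) is almost exactly z, and heavily regularized (small) weights keep z in that range. The specific values and shapes below are illustrative assumptions.

```python
import numpy as np

# Near z = 0, tanh(z) is almost exactly z, so the unit behaves linearly.
small_z = np.linspace(-0.1, 0.1, 5)
large_z = np.linspace(-3.0, 3.0, 5)

print(np.max(np.abs(np.tanh(small_z) - small_z)))  # ~3e-4: effectively linear
print(np.max(np.abs(np.tanh(large_z) - large_z)))  # ~2.0: strongly non-linear (saturated)

# With small weights, as if lambda were large, z = W a stays in that linear regime.
rng = np.random.default_rng(0)
a_prev = rng.standard_normal((4, 1))
W_small = 0.01 * rng.standard_normal((3, 4))
z = W_small @ a_prev                     # ignoring b here, as in the summary that follows
print(np.max(np.abs(np.tanh(z) - z)))    # tiny: this layer is nearly a linear map
```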
So just to summarize: if the regularization parameter lambda becomes very large, the parameters W become very small, so z will be relatively small, ignoring the effects of b for now; really, I should say z takes on a small range of values. And so the activation function, if it's tanh, say, will be relatively linear. So your whole neural network will be computing something not too far from a big linear function, which is therefore a pretty simple function, rather than a very complex, highly non-linear function, and so it is also much less able to overfit. Again, when you implement regularization for yourself in the programming exercise, you'll be able to see some of these effects.

Before wrapping up our discussion of regularization, I just want to give you one implementational tip. When implementing regularization, we took our definition of the cost function J and modified it by adding this extra term that penalizes the weights for being too large. So if you implement gradient descent, one of the steps you use to debug it is to plot the cost function J as a function of the number of iterations of gradient descent, and you want to see that J decreases monotonically after every iteration. If you're implementing regularization, then please remember that J now has this new definition. If you plot the old definition of J, just the first term, then you might not see it decrease monotonically. So to debug gradient descent, make sure you're plotting the new definition of J that includes the second term as well; otherwise you might not see J decrease monotonically on every single iteration. There's a short sketch of this check below.

So that's it for L2 regularization, which is actually the regularization technique I use the most in training deep learning models. In deep learning there is another sometimes-used regularization technique called dropout regularization. Let's take a look at that in the next video.
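As a sketch of that debugging tip: record the full regularized cost J at each iteration and plot it against the iteration number. The helpers forward_propagation, backward_propagation_with_regularization, and update_parameters below are hypothetical stand-ins for whatever training code you already have; l2_regularized_cost is the cost sketch from earlier in this section.

```python
import matplotlib.pyplot as plt

costs = []
for i in range(num_iterations):
    AL, caches = forward_propagation(X, parameters)                         # hypothetical helper
    cost = l2_regularized_cost(AL, Y, parameters, lambd)                    # full J, including the L2 term
    grads = backward_propagation_with_regularization(AL, Y, caches, lambd)  # hypothetical helper
    parameters = update_parameters(parameters, grads, learning_rate)        # hypothetical helper
    costs.append(cost)

# The regularized cost J should decrease monotonically with each iteration;
# plotting only the unregularized first term may not show a monotonic decrease.
plt.plot(costs)
plt.xlabel("iteration of gradient descent")
plt.ylabel("regularized cost J")
plt.show()
```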