In addition to L2 regularization, another very powerful regularization technique is called "dropout." Let's see how that works.

Let's say you train a neural network like the one on the left and there's overfitting. Here's what you do with dropout. Let me make a copy of the neural network. With dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating a node in the network. Let's say that, for each of these layers, we're going to toss a coin for each node and have a 0.5 chance of keeping each node and a 0.5 chance of removing each node. So, after the coin tosses, maybe we'll decide to eliminate those nodes, and then what you do is actually remove all the outgoing links from those nodes as well. So you end up with a much smaller, really much diminished network, and then you do backpropagation training on this one example with this much diminished network. Then on different examples, you would toss a set of coins again, keep a different set of nodes, and drop out, or eliminate, a different set of nodes. So for each training example, you would train it using one of these diminished networks. Maybe it seems like a slightly crazy technique, since you're just knocking out nodes at random, but it actually works. And because you're training a much smaller network on each example, maybe that gives you a sense for why you end up regularizing the network: these much smaller networks are being trained.

Let's look at how you implement dropout. There are a few ways of implementing dropout. I'm going to show you the most common one, which is a technique called inverted dropout. For the sake of completeness, let's say we want to illustrate this with layer l = 3, so in the code I'm going to write there will be a bunch of 3s here. I'm just illustrating how to represent dropout in a single layer. What we're going to do is set a vector d, where d3 is going to be the dropout vector for layer 3. We set d3 using np.random.rand, and it's going to be the same shape as a3.
Then we check whether this is less than some number, which I'm going to call keep_prob. keep_prob is a number; it was 0.5 in the earlier example, and maybe now I'll use 0.8 in this example. It is the probability that a given hidden unit will be kept. So if keep_prob = 0.8, this means there's a 0.2 chance of eliminating any hidden unit. What this does is generate a random matrix, and this works as well if you have vectorized: d3 will be a matrix where, for each example and for each hidden unit, there's an 0.8 chance that the corresponding entry of d3 is one and a 20% chance it is zero. That is, each of these random numbers has an 0.8 chance of being less than 0.8, and hence of being one, or true, and a 20%, or 0.2, chance of being false, or zero.

Then what you're going to do is take your activations from the third layer, which I'll just call a3 in this example. So a3 holds the activations you computed, and you set a3 equal to the old a3 times d3; this is an element-wise multiplication. You can also write this as a3 *= d3. What this does is that, for every element of d3 that's equal to zero (and there was a 20% chance of each of the elements being zero), this multiply operation ends up zeroing out the corresponding element of a3. If you do this in Python, technically d3 will be a boolean array whose values are true and false, rather than one and zero, but the multiply operation works and interprets the true and false values as one and zero. If you try this yourself in Python, you'll see.

Then finally, we're going to take a3 and scale it up by dividing by 0.8, or really by our keep_prob parameter. Let me explain what this final step is doing. Let's say, for the sake of argument, that you have 50 units, or 50 neurons, in the third hidden layer, so maybe a3 is 50 by 1 dimensional, or if you vectorize, maybe it's 50 by m dimensional. If you have an 80% chance of keeping each unit and a 20% chance of eliminating it, this means that on average you end up with 10 units shut off, or 10 units zeroed out.
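To make these steps concrete, here is a minimal sketch of the inverted-dropout forward step for layer 3, assuming a3 is a NumPy array of layer-3 activations with shape (50, m) and keep_prob = 0.8; the placeholder activations and the choice m = 10 are just for illustration:

```python
import numpy as np

keep_prob = 0.8                  # probability that a given hidden unit is kept
a3 = np.random.randn(50, 10)     # placeholder layer-3 activations, shape (50, m)

# d3: boolean dropout mask, same shape as a3; each entry is True with
# probability keep_prob (0.8) and False with probability 0.2
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

# Zero out the dropped units; True/False are interpreted as 1/0
a3 = a3 * d3

# Inverted dropout: scale up by 1/keep_prob so the expected value of a3
# stays the same
a3 = a3 / keep_prob
```

In a full implementation, the same mask d3 would also be reused during backprop on the same iteration, as noted below; the reason for the final division is explained next.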
So now, if you look at the value of z^4, z^4 is going to be equal to w^4 * a^3 + b^4, and on expectation a3 will be reduced by 20%, by which I mean that 20% of the elements of a3 will be zeroed out. So, in order not to reduce the expected value of z^4, what you do is take a3 and divide it by 0.8, because this corrects for, or just bumps it back up by, roughly the 20% that you need, so that the expected value of a3 is not changed. This line here is what's called the inverted dropout technique. Its effect is that, no matter what you set keep_prob to, whether it's 0.8 or 0.9 or even 1 (if it's set to 1 then there's no dropout, because it's keeping everything), or 0.5 or whatever, this inverted dropout technique, by dividing by keep_prob, ensures that the expected value of a3 remains the same. And it turns out that at test time, when you're trying to evaluate a neural network, which we'll talk about on the next slide, this inverted dropout technique, this line here in the green box, makes test time easier because you have less of a scaling problem. By far the most common implementation of dropout today, as far as I know, is inverted dropout, and I recommend you just implement this. There were some early iterations of dropout that were missing this divide by keep_prob line, and so at test time the averaging becomes more complicated. But again, people tend not to use those other versions.

So, what you do is use the d vector, and you'll notice that for different training examples you zero out different hidden units. And in fact, if you make multiple passes through the same training set, then on different passes through the training set you should randomly zero out different hidden units. So it's not that for one example you should keep zeroing out the same hidden units; rather, on iteration one of gradient descent, you might zero out some hidden units.
And on the second iteration of gradient descent, where you go through the training set a second time, maybe you'll zero out a different pattern of hidden units. The vector d, or d3 for the third layer, is used to decide what to zero out, both in forward prop as well as in backprop. We are just showing forward prop here.

Now, having trained the algorithm, at test time here's what you would do. At test time, you're given some x on which you want to make a prediction, and using our standard notation, I'm going to use a^0, the activations of the zeroth layer, to denote the test example x. What we're going to do is not use dropout at test time. In particular, you compute z^1 = w^1 * a^0 + b^1, a^1 = g^1(z^1), z^2 = w^2 * a^1 + b^2, a^2 = g^2(z^2), and so on, until you get to the last layer and make a prediction y-hat. Notice that at test time you're not using dropout explicitly and you're not tossing coins at random; you're not flipping coins to decide which hidden units to eliminate. That's because when you're making predictions at test time, you don't really want your output to be random; implementing dropout at test time would just add noise to your predictions. In theory, one thing you could do is run the prediction process many times with different hidden units randomly dropped out and average across them, but that's computationally inefficient and gives you roughly the same result, very similar results, as this procedure. And just to mention, with the inverted dropout technique, remember the step on the previous slide where we divided by keep_prob: the effect of that was to ensure that, even though you don't do any scaling at test time, the expected value of these activations doesn't change, so you don't need to add in an extra funny scaling parameter at test time. That's different from what you had at training time.

So, that's dropout. When you implement this in this week's programming exercise, you'll gain more firsthand experience with it as well. But why does it really work?
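Before getting to that intuition, here is a minimal sketch of the test-time forward pass just described: no masks, no coin flips, and no extra scaling, because the division by keep_prob at training time already kept the expected activations unchanged. The ReLU and sigmoid activations, the parameter-dictionary layout, and the function names are illustrative assumptions, not something fixed by the lecture:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_test(x, params):
    """Standard forward propagation at test time: no dropout masks and
    no division by keep_prob."""
    a = x                          # a^0: the test example(s), shape (n_x, m)
    L = len(params) // 2           # number of layers, assuming keys W1..WL, b1..bL
    for l in range(1, L):
        z = params["W" + str(l)] @ a + params["b" + str(l)]
        a = relu(z)                # hidden-layer activation (illustrative choice)
    zL = params["W" + str(L)] @ a + params["b" + str(L)]
    return sigmoid(zL)             # y-hat for a binary classification output
```

For example, y_hat = forward_test(x, params) gives the prediction for a test example x, with no randomness in the output.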
What I want to do in the next video is give you some better intuition about what dropout is really doing. Let's go on to the next video.